notesTouchingCorrelationsWorkshopSeda

Welcome to Etherpad!

This pad text is synchronized as you type, so that everyone viewing this page sees the same text. This allows you to collaborate seamlessly on documents!

Get involved with Etherpad at http://etherpad.org

Courtenay leads discussion on ML

machine learning is really just observing patterns,
in lots of data

what is it used for:
    sppech recognition
    face recognition
    language translation
    predicting consumer behavior
    predicting financial markets
    analyzing social networks
    making business decisions

workflow of machine learning:

repeat unti you get out what you want: correlations, profit, phd etc.

there are typically two types of problems:

classification : most times people talk about machine learning, and they mean classification
regression: predict the value of something in a range of values

classification asks questions like:

is this a good or bad email (spam)
- is this a dog, or cat, or a rabbit

regression:

given sales data for last 12 months, what will next months sales
is it always about:
- malignant tumors: it is something you cannot see, a wall that keeps you from seeing things

janet: is it like you want to predict what netflic thinks of you?

betsy:
    if i could at least come at the end
    wanted to say thank you so much
    i am having fun geeking out with all these women
    i have been telling people
    i thought it would be counter to the culture to do the fb thing
    it was safe to ask any question and yet it was not dumb
    you can take step back
    it was a wonderful moment to be able to ask those questions
    it was great to realize that i knew more than i thought i did
    as a sociologist, and i know statistics
    but i rejected it
    but realizing that i still have that knowledge and can use to analyze my new project
    so i got great interview questions yesterday
    when i have access to tech designers, these questions will help me to get to what i want faster
    i will sound more like i know what they are doing

    i looked at the datasets
    do i want to put that on my computer

ri: how did you select weka
i looked at mathlab, i am not there yet
what is a good tool?

courtenay: i did all of my phd in mathlab, i hate it, it is proprietary
weka is a shitty visualization tool
it is the only thing that i know of that will allow you to do machine learning
run real algorithms out of box
and load your dataset
more programming is scary for people who don't know how to program
weka was started in 1993
i think it was a reasonable choice for what i was trying to do here

if you don't want classifier, but just want to visualize, there are surely nicer guis

ri: that would be great

besty: is weka like a wordpress for data

seda: there is a data science course by mako hill, the material is online, i will add it to the email.

joanne:
    suggestion for a topic
    a workshop on the blockchain
    with some of the implications

seda: adversarial machine learning algorithms - emails that are sent in order to figure out how machine learning works

asking a question about "data cleaning"? do you use regression to fill in holes in data set
incomplete data
when you are missing labels
istead of getting a human to do it, you try to bootstrap your algorithm
you make your first guess at your algorithm
you may be compounding your errors
that is something that happens when you can incomplete data

regression example:

    you have a bunch of cities and some information about income (i didn;t say anything about whether this is mean or median)
    you have a new city, and you want to know housing prices there
    you try to find out based on what you know about other cities


janet: do you choose the model or does the machine choose the model?
courtenay: you choose the model, you still have to choose
there is a lot of domain knowledge
you can look at this example and say, yes it looks linear
or i am going to try a more complicated model
how do i know which of these models are better?
you need to have a additional data

lilly: how similar is this to what economist and sociologists have been doing?

instead of discovering the world, it has become about can we make money off of it.

courtenay: i will talk about this tomorrow

statistics vs machine learning
these are classical statistical techniques
how is this different? why is it more shiny?
some of it is real cultural differences
and some of it is the same
at this level it looks like a lot of statistics
but as the field evolves and things get complicated, it is a little different

kavita: what is a model, is it like an algorithm?

courtenay:

it is a semantic thing
the model is the object that you end up with at the end with parameters
and you have an algorithm that chaanges the model
here the model is linear

sylvia:

crude drawings of basic shaptes
linear: the two values on the x and y axis grow together
logarithmic: you have a data set which eventually levels off, like age
exponential: i lost my notes, you can use a logarithmic representation of the expoential curves
- exponential: sylvia is showing how you use algorithms to depict datasets with exponential growth

it is the same information but easier to read

is there a library of these kinds of models and you go to them and choose them?

seda: is this what you mean with a model, courtenay?
courtenay:

you look at the data and you look to see what function you can fit.

carlin: i think about it as an equation

what a graph is doing is solving the equation: if x is this, y is that...

courtenay:

there are a lot of alrogithms and they are basically mathematical functions
the way you get the model may be complicated, running the algorithm may take 10 hours

elizabeth:

Now I have to ask what is an algorithm , I thought I knew.

courtenay:

usually you can be doing optimization
typically you are iterating on how much error your prediction is making

karissa: it is a bunch of steps that the computer executes

lilly: is it like a way of solving rubic cubes

carlin: it is a set of instructions

it could be self-referential inside

elizabeth: sounds like a recipe and the result

courtenay:

algoritms is this broad set of things
those are used to figure out your model
if you have these data points, you can get this line
it is not always so simple
this is an important task

couldn't you look up the price of housing?
what abour predicting how much you woul be willing to pay for a product?

classification example:

example:
we have some fish
2 species A and B
hard to tell individual fish apart
one of them is in danger, you want to tell when it is that fish
you don't want it overfished, but you can't tell by looing at it

you go out and observe these fish, the length and the number of stripes that you see
you get dna samples to test for species
you figure out which species the fish belonged to

features: attributes you observe about each example
class labels: ground truth, you know that is the true answer, gold standard
training examples

    lilly:
        you are not sure, you don't know how you want to classify them
        i thought you were going to say, the classifier would help you discern the clusters
        in that case you don't have a ground truth, you want to discern the classifications

    courtenay: they look the same but they really are two different things
    you want to identify which fish is which
    you went to a lab
    and you have their dna

    seda: but isn't that a probabilistic model, too?!

    courtenay: this is a toy example with a ground truth

martha:
    what if it is a behavioral outcome and it depends on how you treat it
    the outcome depends on how i treated you
    you are not a fish

courtenay:
    that would be about data contamination?

martha:
    ri is a a
    courtenay is a b

    i give courtenay a great credit card
    but the result is the outcome


coutenary:
    in the real world the gold standard is more complicated

martha:

for alternative pegagogy
- we start with animals where the social complication is not visible
you credit is evaluated based on products you have consumed
but you can consume these products if you have a good credit score

carlin:

interesting problems here
what constitutes ground truth, when is it reliable enough
there is then a simplicity and complexity thing
people often default to animals, balls, sports, because there is a need to go to a simple pehnomena
which turns out not to be a simple phenomena
it is an important thing anyways

martha:

the firt thing you teach kids is animals
kids can be duck and cat
but not every kid can be like good credit outcome
i wonder why this example starts here?

courtenay:
    that is how it started in my machine learning course

berns: that is a great question

lilly: women can be fish, too. that was my example

the objects are not politicized yet or are depoliticized in the moment [[ie. the industrial rubber ball as the perfect simple object to built liveliness/character/spirit from]]

first example in machine learning book is how to choose most perfect embryo
there's a lot of desire that is going on in there

you want to predict what this fish is
you see that it is short, it has reasonably few stripes
that seems close to species a
but you see this other fish that is more consuming
still within specias a average range
but has a very different number of stripes
if we want to solve this problem, to guess what fish this is
we need a model
so i pull out math again
in nature a lot of things get distributed with a bell curve, the guassian distributio
attributes can also fall into this pattern
this says that most of the fish will fall in this middle part
you also see outliers
you see some longer ones, some shorter ones
no fish with length laess than 0
in this case, we have decided based on lots of years of studying animals
a good model for a ntural thing you see in nature
that this distribution is gaussian
you can fit a probability distribution of what you have seen
and you figure out what the mean is and the standard deviation
the probability model that you fit to species A
and if you do species B, you have a different model
it has a higher average
if they had different standard deviations, the bell curve would be broader or narrower
then, we know two things about the fish
we can model these things jointly
and we would have a 2 dimensional model
the length probably is on one axis
the stripe is on the other
most species a are in the middle of this probablility distribution
it falls in the middle of the cone there
some fall on the edges here

lilly:
you have data point

is this something you measured or is this the model
is this the plotting of the data
there are some assumptions about the actual data you have

courtenay:
    the guessing part is that you think it is going to approximately fit this shape
    you hope your sample size is big enough
    so that your model is valid

karissa:
    it can be a problem if people assume it is a curve like that and they find out later that everything they did is wrong

courtenay:
    not a lot of things follow this curve

janet: you are also selecting a model and see if it works

ri: how do you get to your model, what is the process?

courtenay: you look at the data

does it have a long tail?
often you have too many features
you visualize things differently
depending on how much data you have
looking at the numbers is a good idea
you good do a histogram
you take bins: 0-5, 6-10, 11-15...
and then you look to see in which box you put your data
you can just count and see
if it looks lik most are in the middle you can have that shape

karissa:

people love big data, if you have a lot of cases, you can pretty much tell instantly
a lot of them show up in the middle

courtenay: it can be approximated as one

you have four models
you have a model of the distribution in one species
you can have four bell curves
and then you can look at each species two features jointly
and hopefully they are well separated

now if we see a new fish, a data point, and it goes somewhere in this plane
on the bottom
now you see which of these models it falls closer to
this fish is closer to this b model
better
and you can make a more educated model that it is fish b

we observe those two feature attributes
we didn't have to send it off to the lab, we can guess now

janet: how do you say it, it is species b, or this is probably species b?

you may look at the priors

sylvia:

confidence level, if it is in the red area, you are rather confident

janet:

gender map
- someone with this height, long hair short hair, are you computer going to decide who is a male or female
- or if you can get tenure

courtenay:

a good scientists, yuo don't say it is species b
but if you are google, you may tell advertisers that it is a man or woman

carlin:

it is not a big deal to them if they get it wrong
it doesn't matter to them if they advertised to some of the wrong people

janet:
    it does not matter to them that gender may be fluid

courtenay:
    one take away: ml is using features that you can directly observe as a proxy to predict something you can't directly observe

there is no guarantee that you;ll be right
there may be a lot of overlap
a fish might be an outlier for its species
abnormally large
points in between your two models and you don't have more data, you can't really say

takeaways:
    precition only as good as your models
    that your data does follow a particular distribution
    need to observe a lot of fisn of each species to build accurate models of them
    machine learning is what happens when you feed your models 1000s of fish


courtenay:

what is your confidence that your dataset is right
people at mechanical turk, the labeilng, there could be all sorts of data cleanliness problems
almost for anything
you want to deal with the noise, the outliers
the bad people submiting the form twice

seda: claudia perlich was saying there is no wrong data

there is wrong interpretation fo data

courtenay: you can have adversarial data generation, for example, you can have wrong data

ri: but that would be hard to separate

sometimes you watch tv on your girlfriend's account

are there any advantages or is it worth thinking about the value of "unclean" data?

jojo: you can change it

are you typing a transcript you wonderful person?

martha:
    your prediction is as good as your models
    you only need your prediction to be as good as your need?
    most times people want to do a critique, they ask if it is accurate
    but maybe that is not the issue

courtenay:
    yes, maybe you only need some percentage of success

sylvia:
    if you get the wrong add, no big deal
    if you misdiagnose cancer, you need a more accurate model

then you need to watch out

carlin:

the scale is different in those two examples
any time it is a medical example
trying to take this wealth of statistics and to apply to a single body or case
you should do this cause you are likely to have this risk
to go back to that need is different, it also depends on the scale

courtenay:
    google is trying to do predictions for each user
    or netflix
    but you may not care

ri: amazon thinks that i am a recently divocred 50 year old who does yoga, not much damage?

courtenay:
    high paying jobs shown only to men

helen, at the sympsium

seda -
rachel law - vortex
uniqueness is a probabilistic feature in that moment...it is a combination of features

Wearable tech guy who says you can trade biometric profiles with people (to be someone else in that regard) is named Chris Dancy twitter: @ServiceSphere

lilly: chelsea clinton: internet access is key to gender equality

where do we think development data comes from
people hired by universities and world bank
who gather data through interviews
a shift in data collection
800000 data points
tech industry can solve any problem with data
techno, big data, and feminism
investing in the middle class is the best way to bring about democracy
history having the problems of big data over generalizing
the headline suggesting a correlation
we could unpack how these correlations have many levels of spuriousness and assumptions

janet: that they are correlated is not causal

lilly: but the headline

sylvia:

i stopped reading wired cause it is so obviousyl written for men

janet: has it gotten worse?

when you first read it did you think it was not that way?

ri: it became more like gq of gadgets,

sylvia: ads for cars, watches and alcohol,

janet: the spurious correlations - http://www.tylervigen.com/spurious-correlations

the number of movies with nicholas cage with murders in the pool

courtenay:

models get more acurate -> preditions get more accurate
this is true for our regression example, too:
- the more cities we observe the better our prediction

there are lots of different classifier models, this is just one type
this is a gaussian naive bayes classifier

you assume features have gaussian distribution
assumes each feature is unrelated to the other (not correlated with each other)

takeaways:
non complete list of things people use to make classification tasks

decision trees
nearest neighbor: really naive classifiers, with the fish, we thought it is pretty close to a, you compute its similarity to every example you have seen, because it was closest to a, and you throw out the statistical model out, sometimes it works really well
bayes: bayesian kind of classification methods
- these are kind of classical probabilistic methods with a lot of complications on top of them
- bayesian, you are looking at priors
  - the base rate, specifically you are incorporating
    - if you observe that 25 percent of the fish are a and the rest b
    - you incorporate that into your final result
logistic
support vector machines
neural networks

janet:
    we always here about bayesian stuff, is it that it includes probabilistic stuff
    variables with probabilistic stuff

courtenay:

real models usually use more than 2 features, it's hard to visualize, how they work and how they fail
we can maybe look at 2d or 3d.
it is really hard to understand at an intutive level why things are working out
you try to figure out how well your predictions are doing
this is your training set here
here is the test set
you need that to be labeled, too
so, when you have a model, you try to predict the things there without knowing the labels
then you look to see if you predicted well
which means you need more data
you can have a model and throw it out into the real world
but you want to sort of believe that it is going to do what you think it is going to do

last example:
    you can in the real world do stuff if your data is not fully labeled
    it is harder
    it is more uncertain
    you may have tons of data and no labels
    can we really not learn anything from it

ri: you mean like confirmed labels

courtenay: like the lab test

carlin:

for something to appear as data
won't some decisions have to be made
we have been using length
you know something has been measure, you need to know that it is inches
what do you need to know at minimum

sylvia:
labels

carlin: maybe i am talking about units

courtenay: yes, i am talking about ground truth labels

as long as you have examples
the fish does not need a name
stripes and species, i don't need their label
you can throw them into a plot and look at them
that is data
everything is data

courtenay:

you need to have a reasonable belief or faith that the measurements of the coffee grounds are related to something i am predicting in the real world
you might be wrong
maybe you are measuring something that has no correlation
history of science: what peole thought caused diseases, that seemed reasonable at the time, but it wasn't that
advertisement, there is no guarantee that if you are a male in a specific city
that the ad will work
it gets very subjective very fast in the real world

can we learn something if we don't have ground truty, say about the species of fish that you have
maybe we took measuremenets of all the fish
but we didn't even know they were from 2 differen speices populations
not just a mater of manually labeling the data, we don't even know what the labels should be
so in this case you get into the broad heading of unsupervised learning
if you know the species, you know they are male or female
you now have a bunch of numers and data
and you are interested in the kind patterns of data

janet: i love the terminology

like workers that are unsupervised

sylvia:

like when you have a child
there is actually a correct answer
whatever is learning, you are giving that answer

carlin: it is a matter of whether there is prior classification that supports that

courtenay:
    labels -> supevised learning
    without -> unsupervised
    standard techniques that you use

    here are the lengths and stripes
    we have clusters, each point is a single fish
    we know that they are two different species and they look like that
    this lovely toy example, in this particular two dimensional space that is perfectly visualizable
    you can't do it with your customers
    you don't know the structure of that data and you don't have a way to guess

    the most basic thing you can do is cluster analysis
    the toy example i will show you, a common algorithm called k-means clustering
    you start by guessing that there are clusters in your data
    you usually also hwo many clusters
    then you guess the centers
    and guess cluster membership
    it turns out that this will mathematically get you some nice clusters

    the algorithm
    you can picked two points, they are wrong, they are both in the same cluster
    you pick them at random
    then you do most obvious thing you can do, you measure distance to all the other points
    you draw aline
    you draw an ortogonal and perpenicular line
    assume that this is a reasonable way to measure things
    you do that, and you recompute the centers of the clusters
    if all these red things are a cluster, where would the center be
    you moved your cluster centers now
    you reiterate
    so, now you moved the points here
    the blue points have overtaken
    and once you do that, and reiterate until the cetner does not change anymore
    at the end you get here
    and you find your two species again

    caveats:
        you still have to guess the number of clusters
        two kinds of fish in this pond
        you guess at several different numbers of clusters and you do an evaluation
        you look at how tight the clusters are
        more clusters make your model more complex
        there is a bunch of hand waving stuff
        this is a thing you can do
        this is like density estimation
        figure out if there are denser places in your feture place
        this is an example of an algorithm



    kavita: the cluster analysis will not tell you how many clusters there are?

    courtenay: there are clever ways to guess
    we can look at the data
    and say there is two

    courtenay: you can use histograms
    visually you can look at it
    and know

    janet: i have been using cluster analysis in social newtork analysis
    bibliometric analysis
    you can get the computer to detect what it thinks is a clustering
    if it is two clusters, do i have the relative distances through things

sylvia:
    you can get a cluster and look to see if there are clusters in that

courtenay:
    yes there are hierarchical cluster things you can do
    in social networks there are a whole different set of things you might do

these two fish know each other
ou have this whole extra set of data
that does with your data
that goes with the attributes of each user

carlin:
    trickiest things in teaching
    kids will do an analysis
    and come with gender binary
    to figure out what is getting read, where gender gets assigned by twitter
    who is particiapting
    what kind of assignments are happening
    what makes them truthful

    what feature you use
    if it makes more sense and to look at

lilly:
    developmental biologist
    anne fauster sterling
    osteoporosis and how it is correlated with women
    and how come
    she critques that and builds a process model
    and how osteoprosis looks correlated with women
    if you get sports then less likely to get it
    race: the kind of work you do

berns:    both stregnthener is marketed to the petite white women

carlin: osteoprosis gets discussed without that kind of sepecificity

lilly: it takes a lot of labor to construct this other model
how can we use this data other kinds of process sotries of gender and race without reifying them
carlin:

in the hospital
you talk to people differently
not based on gender
but more specified risk

berns: hypertension is race based

that is discussed
cigarette smoking and hypertension
with regards to
i am trying to remember how it was taught
it is: i don't even think about it, it is just how it is
boneeba medicine??
there is a typical image for certain medications
they will be advertised to certain people
sometimes because their insurance is more likely to pay for that
it is not about systemtic issues
why is this women having these issues
an african american women is going to be more likely to be on this medication
it is presented as this is the problem of her race
and not that society was shit to her
inherently, this is what she will be, instead of what she went through

lilly: correlation becomes the local cause and they just deal with it that way

berns: then there are people who don't want to take the medicine

they are seen as non-adherent
that is supposed to be more compassionate
lathough some will call them non-compliant

janet: that is an interesting label,

THIS WAS SOMETIME BEFORE----
janet:
you are parenting your computer

sylvia: i think robots are adorable

it is cute to watch

------------

courtenay:
    they may be clusters of density
    a little more of grey area
    finding interesting clusters we may want to do something with
    this kind of analysis without labels may allow you to make reasonable guesses



FOR TOMORROW:

Baysian statistics explanations:
    http://www.kevinboone.net/bayes.html

Sylvia would like to explain regression (30 minutes?)

Neural networks are going to take over the world??

Seda mentions that there is AI that trains video game figures to act in specific ways
Seda says “what other politics are possible if there were other ways of querying data?”

seda: what kind of queries can we make with machine learning to get at where discrimination starts, the problem is that when you categorize you can then name and call out discrimination but once you create the new category then that has its own discramantory potential

ann e fausto - sterling (sp??) http://www.annefaustosterling.com/

domain knowledge
- you would use domain knowledge to get the parameters for a data set (

discussion during hands-on weka session:

    carlin: i like what you say about domain knowledge

    kavita: our purpose is that we have a new piece of glass, is this helping out what kind of glass it is

    courtenay: now it is not helping, because i took out the glass
    if everyone understands what these histograms generally show


    lilly: can you read this file for us

   c ourtenay: breast cancer data
   in this dataset
   there are 9 features here
   age, menopause, tumor size
   the thing we are trying to predict is whether can is likely to recur or not
   we are looking 286 examples
   and 201 did not have recurrence
   and 85 you did
   and that is the class you are trying to predict
   we would want to predict it by looking at some combination of the 9 features or some subset thereof

janet:   we have a woman 54, pre-menopausal, right brest

courtenay: i can do the walking for you, too

lilly: there is no x axis

courtenay: this is what we were talking about before
numerical vs. nominal
these are much more nominal

seda: you need to look at the arff file to find out what the values stand for

courtenay:
    the way you read this is that 68 cases got radiation therapy and about half of them had a recurrence and half didn't
    and the others didn't get radiation and did not have a recurrence
    the information you can glean here is how different the percentage of the classes are
    in this case it doesn't make sense
    recurrences were a far less frequent event


bernadette: it is a small amoount that reoccurs

courtenay: it is not unlikely
you have a veested interest in predicting who is goig to recur

kavita: does this mean that you are more likely to have recurrence if you get radiations

berns: the first part is people who did it
and it is 50 50
and the ones who didn't there was a better chance

courtenay: but you need to know whether getting radiation are those who were seen as more serious cases

sylvia:
    there are ways to present it to show that there is a clear relationship
    but there are ways, which doesn't show what the relationship is
    there are no clear relationships
    if we used some sort of algorithm, we could predict it
    but through visualization, especially because they have different population sizes
    it doesn't feel like a good example
    or it is a weakness of the proram
    it is a little lame of them

courtenay:
    i agree with you

sylvia:
    the whole point of visualization is to see things

berns: the safe thing we are agreeing on
you have to be careful with correlation and causation
i have been to a numbder of pharmaceutical presentations
they will take 2 people living 3-4 months longer
and they will make claims

janet: i see why you go into this
people who got radiation
it evened out
it looks like not getting radiation meant you did not have recurrence
that is why you collect a whole bunch of data
because you want to show why the finding is part of other factors

sylvia: this is also a way to manipulate data to get what you want

seda: they claim more data is always better

it is better to have
the noise
if you have a lot of features
instead of fitting a straight line
you would be fitting this thing that goes through every pint
you don't want to predict this thing in between
there is a way you can measure
i decide on a certain feature
i am looking at cglass and i want to know if it is transparent
you see that it is evenly distributed across all glass then you know it is not a relevant feature
so there is certain features
that it is an indicator of this label

the overfitting problem

DAY 2:

http://www.thenewyorkworld.com/
https://nycopendata.socrata.com/

Where to find data -- what the important attributes are

What there is data for
What there isn't data for

Lilly couldn't find any data on contractors

Unpacking "Mechanical Turk"

CUP Lab -- data siphons

Data politics in NYC

Martha: Certain datasets won't be more -- how much learning can your machine do?
Courtenay: Exploratory actions on data or finding
Picketty Dataset is Open!

Martha: difference between prediction and learning?
Courtenay: Maybe? Someone may or may not believe you have proven something with your predictions. There are no unknowns that you can point to.
"Machine learning" on a pedestal as separate from data mining or statistics is dangerous.
How many nation states in Picketty?
Martha: European ones?
Seda: 40 based on his definition?
Courtenay: You can still make predictions for a new country.
Cross validation -- hold out on one data point and then see how well you predict the missing country to test your model.
Seda: There's always a prediction, isn't the question how reliable the prediction is? What is prediction?
Courtenay: You don't know a value so you attempt to
Seda: Act of using a function to come up with a value you don't know.
Courtenay: You may be artificially obscuring the value to test. That's still prediction.
Kavita: Can you do predictions on datasets from the past in which it's not possible to go back and collect?
C: You can still do what you want -- it's a philosophical scientific thing. Going forward you're not going to be able to make predictions. But it can tell you if you have a good model of phenomenona.
Lilly: It reminds me of talking to mathematicians and scientists -- you don't have a theory unless you make predictions about the future. Ethnographers work differently : if you don't know how the data was created, you don't have a theory . New ways to explore models. Potential parameters are infinite.

C: the end game doesn't have to be classification. the field of machine learning is driven by prediction. but the techniques are statistical techniques. There are other ways of seeing if things are correlated.

Bernadette: last night I thought about farming data. labor has been low on the farm until the summer youth -- now it's spic and span. Number of workers with hours put in to crop outputs.

Jojo: i thought when you said farming data, you were talking about the labor of preparing data for use later on, the workers come in and clean it up and it is ready for harvesting.

Joanne: Wikileaks data is CSV.

Seda: text analysis will be interesting.

SLIDES/Courtenay presentation:

Courtenay: touch on what correlations are
using spurious correlations site: everyone knows correlation doesn't imply causation, but doesn't necessarily mean correlation.
Martha: but is it predictive?
C: no reason to believe that they would.
Seda: Google food: all sorts of debates. World Bank discussions. Google food trends worked because of years and years of data collected by scientists . How good is prediction without another kind of ground truth.
C: you can go out and look and a couple are going to look really great and all the rest won't work. You just pick the ones that look.
L: Isn't the point that you don't need common sense?
C: maybe they're both correlated to other things. Maybe there are other variables.
You convince yourself that they are correlated
Maybe they are correlated to other things, but you have convinced yourself that this is the correlation.
The correlation will be spurious because they
Seda: Constant is right now doing a workshop: how do we create common sense with machine learning.
B: we make lists when we hit problems.
C: human brains are good at making spurious correlations.
B: Cognitive Therapists

courtnenay showing correlations of different types and strengths

if your classifier doesn't work, you might just not have enough information.

c: sometimes you just have data and maybe you won't be able to predict what you want to pick
Seda: is there any data on data that confuses classifiers?

if it is uncorrelated with the class, it shouldn't throw off your classifier, your classifier will ignore it.
learn to weight things as zero.

discussion yesterday:
    if the length and number of stripes of fish are correlated, a model that assumes they are independent might not work very well
    because you count the same information tiwce becuase it's repeated in two places and the model doesn't take this into account

no double counts!! bad!

solution: could switch to a model that doesn't assume independent features.

the other philosophical broad point was the fight between statisticians and machine learning
i was vaguely aware of the fight, i was aware that there was some tension, maybe

yesterday one of you asked: is this any different between statistics.

here is a joke i found:
a table of differences between the two, mostly terminological, but a large grant in ml will get 1.000000 whereas in statistics a large grant is 50000.
weight vs. parameters etc.

lots of overlap and lots of cultural differences
the practices have evolved into different standards

andrew gelman says, maybe we should remove models and assumptions because then we can solve problems that the machine learning people can solve.

C: There are people who believe more or less in one or the other dogma

one commentators on stackexchange says
ml experts do not spend enough time on fundaments, and many of them do not understand optimal decision making and proper accuracy scoring rules.
statsiticans spend too little time learning good programming practice and new computational languages.

m: can you explain the second statement about statisticians

c: humans aren't super into change. a discipline evolved in a specific way. before computers were around. in a culture in which people don't jump to the most immediate new software. fewer people in statistics departments know how to
S: ML come from CS, statisticians come from mathematics
L: chalkboards, slow proofs (math) vs prototypes! (CS) fast moving
S: mathematician: if you don't understand what your algorithm is doing, it's wrong. One big issue: giant data sets.
Efficiency is about quantifying results.

when social cientists look at this debate, they say it is right or wrong. it is hard to make it stick, but it is working.
the test by which something is successful in the world is not whether it is right or wrong, but whether it "works"
ml person says, it is working, and the statistician says it is wrong

jojo: it depends on what you mean by what matters?

lilly: machine learning and statistics are competing for legitimacy on what is the right way to work with this data
it could be that the debates about what is right and wrong, by participating in those debates, the ml people may be legitimizing their discipline

martha: for some of these guys what is at stake is not publishing a paper, but having a successful company
If they say all that matters is that they have a correlation
different social worlds -- what's at stake

C: techniques developed in academia, adopted elsewhere.

courtenay: a lot of the techniques get developed in the academic setting, but in many cases, outside of academia, if it works, it works
columbia was very mathy and proof oriented
it is this academic thing
in practice it is a very computer science and engineering mind set: i built it, it works

lilly: some friends would consult cia and stuff
for intelligence vs. ad prediction there may be different standards?

courtenay: i don't know how theoretically, what the standards are behind that wall [of intelligence]

martha: you're trained pre-data science?

courtenay: i finished at the end of 2012. i was in machine learning courses in 2007-2008. hot stuff which was not neural networks, and a lot of that has been taken over. and they were interested in proofs.
taught hot methods at the time (not neural networks) by people concerned with theory and proof.

kavita: real timeness of data, ml people have access to data?
that the data is just constantly coming in and being optimized

statistician dealing with more static data?

courtenay: it is less about real time than dealing with larger datasets
which data scientists have been dealing with for a long time
statisticians may not be as comfortable

lillly: twitter search be one of these computational processes?
twitter search has a real time problem, topics are cultural context that are not indexable terms
so they hire mechanical turks to find something very quickly
timing matters, limitations. TurkWorkers to bootstrap.

C: detecting density of topics

martha: i just thought of something, the credit scoring had 12 items, that is how many items someone working on a paper could add up.
she would be doing the computation live, and that is about computational efficiency
the debate today is with machine learners saying, these people are archaic
it was computational efficiency, because i did not think of it as computational efficiency
because it is transformed with infrastructures.

courtenay:
    there is not necessarily that 12 variables is a bad system
    it is more about the number of data points rather than the number of variables (features)

martha; given all the data that could be credit data, it looks archaic
C: there is such a thing as too many variables
courtenay: there are too complex models as well
that sounds like plan bullshit to me

berns: you cannot have enough considerations
in this case

courtenay: i agree that you may need more than 12 factors, but for some things it may be enough.

martha: we are having the debate because computational infrastructure, when we phrase the debate, the only reason we are considering more than 12 because these guys have amplified their capacities in the last 50 years

lilly:
    a lab, machine learning, we are storing so much more data, we need to gain more financial value from this data
    we need to get more value, becuase we have more data
    not that we want more data because we can get more value

courtenay: it is cheap enough that you can store everything

martha:
    the debate that you are pointing out between statisticians who are cheap
    ml we can maximize,
    the debate is created by the economics of the environment


    courtenay:
        statisticians use computers
        they may not be up to par wth the latest in computational infrastructure

the data is probably generated by a tech company who is interested in doing this thing on its data
as far as academic departments go, they could be trying to solve the same problems

lilly: are you saying there is a difference is that cs people need grants to get machine, entrepreneurial grant getting

martha: the million dollars have to be for something

courtenay: they are also being snarky about it being a fad. it is cutting edge and popular and statistics has a marketing problem

ML gets the money because

berns: there is a race issue there, too, as to who teaches you statistics and computer science. my statistics teachers were people of color

S: within computer science there are layers of people who are more proofy. clean definitions.
C: then there are the ones who hack.
S: upper echelons -- it's class. middle class belt: less lofty. more likely to do applied stuff. don't mind being engaged in $. Privacy is upper, surveillance is middle. ML gets new folks: physicists and biochemists. Need the techniques. They go into hedgefunds. Big data systems. How do physicists deal with complex social issues.
LIlly: Physics such a male dominated field.
S: except Iran.
Hard crowd to read for me -- tend to be polymaths in my experience.

C: Engineering mindset: but practically it worked!
don't know what he means by worked, but proof is in the pudding.

M: usually managerial

C: lots of complaints to be had about this attitude. Broad take away: two outlooks:
    classical stats hypothesis testing
    ML: getting predictions to work even in the face of lack of interpretability of models

Lilly: as an ethnographer I now feel aligned with Classical Stats

C:If ML is more "successful" it comes from the large-scale resources; wring the last bits of success out of those things rather than doing something more profound.

M: could there be a synthesis?

C: most things are black boxes, but there's a real interest in doing this; run models backward; google deep dream. No one likes black boxes. People like to know how they work, also because they want to improve them .

feature normalization:
    what if we have this data:
        much less variability in # stripes athan length
        much difference in scales
        it is a problem if you are trying to calculate distance between things

change each feature to have
mean = 0
standard deviation = 1

obvious thing people in intro cs classes don't do

DATA SHARING (EMAIL PHOTOS!)

courtenay: something that looks extremely small
you are looking at particles, how toxic is something
the numbers may look very small to you, but you want to stretch it out to be able to evaluate its significance

martha: do we know that the difference is equal
it will then become testable as to wether it is meaningful

courtenay: it depends on your classification model, some models will need normalized data
you are not changing the information in that variable,
you made it easier for algorithms to work with it
and maybe for you to view as a human
there could be a diagonal relationship that you can see better in a context, because of the resolution

martha: if the variable is useless, after the transformation, it is still useless

courtenay: there are other normalizations that you can do
that is statistical normalization of data

martha: is tehre a relationship between normalization and the in ability to reverse engineer

courtenay: no, you usually have the raw data, and you know the mean so you can go back
it depends on
the final model probably takes it, the final classification model
somewhere inside of it has the value of the mean that ist needs to subtract off
that value is a parameter in the model, you know what that is
so that you can make a transofrmation on the raw data coming in, that also means that you can back it out
you are not obfuscating antyhing

feature selection and dimensionality reduction

we might not even need all the features we have to do well on prediction
we might need something that we don't have
sometimes the important thing is to figure out which ones to throw away

- features with no correlation to the class
- features that are redundant with each other
if their corelation is one to one, you can throw it away

2 features are redundant if they are highly correlated with each other

you're relaly getting the same information from both and overcomplicating model
could manually do it

dimensionality reduction

a way to compress data so that you can extract a smaller subset of uncorrelated fetures
set of mathematical transformation to do this

ou end up with a whole new set of features
each is a function of the features you put in
you have x, y, and z, you end up with a, b, and c
a, b, and c are functions of combinations of x,y, and z
such that they are all orthogonal to each other
the output variables don't have correlations with each other

it is a form of mathematical projection
youare changing your axis
i can't give you an intuition

martha: you perform something on each data point and transform it into something else

courtenay: itis an automatic way of compressing the correlated relatioship into uncorrelated variables

martha: you take variables that are corelated,

courtenay: now the features are uncorrelated.

ri: you need to know what is correlated

courtenay: you do a mathematical transformation that does it

lilly: whether two things are correlated is a statistical relationship, right, so the stats does the job for you?

courtenay: it is factor analysis
there is a bunch of ways to do that
principled components analysis

martha: compressed?

courtenay: you are asusming there is a lossless representation
and you do something that are now uncorrelated so that it is easier to feed into a model

lilly: we thought these were correlated

courtenay: ffter the transformation, you take the top x values and throw away the less important, less informative values in the bottom
that is purposeful transformation that way

ri: i thgouht we were trying to what does correlate, how come we can now all of a sudden identify that is correlated

martha: is compression like making juice out of vegetables

these are dimensionality reduced points, because it is too slow to eat carrots?

courtenay: basically, you are going to take the top few dimensions to a classifier, becuase you know these things are uncorrelated, you don't have bad feature correlations fucking up your modesl

if you have two variables that are really correlated, all the ifnromation that was contained in those two is compressed to one feature
you are not going to have the statistical prblem of overweighing these features

lilly: combine marriage and margerine into one feature

courtenay: there is something to be said about the distance on the y axis, it does not mean anything
the slight difference between the shapes

lillly: a band of difference is acceptable?
courtenay: yes
lilly: but the band can matter

courtenay: the mathematical transformation will not take semantics into account

balanced datasets:

    sometimes you notice your classifiest is doing suspiciously well - 95 percent accuracy
    then you notice that your data looks like this: class a has 95 percent a, 5 percent b

maybe you can go collect more examples of class B and make your model better
could use re-sampling methods to feed your classifier more balanced (if slightly synthetic) data

it is good to be aware of the relevative balance of classes in your data and think about how it might be affecting predictions

cross-validation:
    standard machine learning practice
    you need twice as much data, training and test set
    instead of 2 fixed dataset
    split data into train/test multiple times for multiple experiements and take average results
    more samples -> results more likely to be statistically valid
    weka: 10 fold validation, that means that it built 10 different classifiers

martha: is that the kind of validation that you do if you have lots of data

courtenay: if you have tons of data, you can do a single split and that is ok
this is going to help you more if you don't have a lot of data
it is generally a good thing to do, it is sampling more

berns: what do ml people call statistical significance?

courtenay: this means that you did ten fold cross-validation
the part of your paper, where you prove that there is statistical validity in results is a little bit more lax

overfitting:
    sylvia was describing the picture yesterday
    model complexity >> training examples

often happens when # features >> number train examples
your model is too complex
you didn't have enough training exmaples to justify the complexity of your model
a lot of the models use the numbers of features, weigh each feature
your model is the weighted version of those 12 features
but maybe you only say 6 examples
you have few data points and you are trying to fix a model with way more data points and you are going to do this overfitting
sort of: lets your model be more creatively wrong
occam's razoe: simpler models are usually better
classic visual demostration
simple linear model vs. more complicated model: you get every point, which would you expect to be more correct for new examples?
the canonical way of seeing if you overfit
you look at your prediction performance on the training set
if you are doing really well on your training data and shitty on the test data, you know you have done something wrong!! :)

feedback and future discussion:

    berns: people like one on one, i like it when we are in a group. i am not even good with breaking out into groups.

    courtenay: this was a good size group for hands on stuff today, the group was a little bigger yesterday, which made it harder for hands on, but then better for discussion

    kavita: the pace was good
at no point was i dragging

ri: the structure was well thought

joanne: great
i didn't think yesterday was overwhelming with the larger class size
i thought you were going over a lot of vocabulary
structurally to add: i wasn't sure what i was going to learn
i see machine learning all the time
one could that could help if there were 5 questions that were answered

courtenay: in the context that you see ml all the time, did it cater to your expectations

joanne: i was worried that it would be too technical
i am glad that i came
there might have been a way to point out what we will learn

courtenay: there was an initial description that was more technical
it is good that where we went
what language would be friendly to the people that we want to attend

berns: my friend was worried that it would be way over her head

ri: what worked well, if you know about machine learning, you should come
that added to the people explaining, that was a good dynamic management
it is difficult when you are teaching a technical subject to put it on a level to keep evryone interested

kavita: i would love to have a discussion on social and cultural significance of machine learning taking over certain functions

ri: i would say the opposite, let's get our datasets

courtenay: this was great, i was terrified
it was kind of a tough, it was a long road from do you want to do an ml workshop
to what would that look like
and to ask that question again and again
i am so sorry we didn't get to the societal implications or get to the data
this has been really fun
i got all kinds of perspectives and questions that i hadn't thought about
and tuaght yself things that i didn't know or had forgotten

ri: if you want to choose different spaces, this was a wonderful space

transinclusive
joanne: i don't want to call things all women
what if someone transitions
invalidating

berns: i have had people not feel included

joanne: you just can't say, no cis guys

eyebeam, i could talk with them if you need space
it is a nice space
new inc might be open, too
they might be good
eyebeam would be open

berns: because it is new york, we have our hands in such cool things
you wouldn't want to spam
if we could email a central person
i am putting out this event on thursday, i heard about this event and it might be of interest
a monthly news

joanne: if you had ela come and do some basic security: if she would be up for that, that would be amazing
especially since she has been doing threat modeling
she would be happy to test out
most of oher talks, she is often talking about things