Courtenay leads discussion on ML.

machine learning is really just observing patterns in lots of data.

what is it used for:
* speech recognition
* face recognition
* language translation
* predicting consumer behavior
* predicting financial markets
* analyzing social networks
* making business decisions

workflow of machine learning:
* repeat until you get out what you want: correlations, profit, phd, etc.
* there are typically two types of problems:
* classification: most times when people talk about machine learning, they mean classification
* regression: predict the value of something in a range of values
* classification asks questions like:
* is this a good or bad email (spam)?
* is this a dog, or a cat, or a rabbit?
* regression:
* given sales data for the last 12 months, what will next month's sales be?
* is it always about things like malignant tumors: something you cannot see, a wall that keeps you from seeing things?

janet: is it like you want to predict what netflix thinks of you?

betsy: if i could, at least coming in at the end, i wanted to say thank you so much. i am having fun geeking out with all these women. i have been telling people i thought it would be counter to the culture to do the fb thing. it was safe to ask any question, and yet it was not dumb; you can take a step back. it was a wonderful moment to be able to ask those questions. it was great to realize that i knew more than i thought i did. as a sociologist i know statistics, but i rejected it; realizing that i still have that knowledge and can use it to analyze my new project, i got great interview questions yesterday. when i have access to tech designers, these questions will help me get to what i want faster; i will sound more like i know what they are doing. i looked at the datasets: do i want to put that on my computer?

ri: how did you select weka? i looked at matlab, i am not there yet. what is a good tool?

courtenay: i did all of my phd in matlab; i hate it, it is proprietary. weka is a shitty visualization tool, but it is the only thing i know of that will let you do machine learning, run real algorithms out of the box, and load your dataset without more programming; programming is scary for people who don't know how to program. weka was started in 1993, i think. it was a reasonable choice for what i was trying to do here. if you don't want a classifier, but just want to visualize, there are surely nicer guis.

ri: that would be great

betsy: is weka like a wordpress for data?

seda: there is a data science course by mako hill; the material is online, i will add it to the email.

joanne: suggestion for a topic: a workshop on the blockchain, with some of the implications

seda: adversarial machine learning algorithms, e.g. emails that are sent in order to figure out how the machine learning works

asking a question about "data cleaning": do you use regression to fill in holes in a data set? with incomplete data, when you are missing labels, instead of getting a human to do it, you try to bootstrap your algorithm: you make your first guess at your algorithm, but you may be compounding your errors. that is something that happens when you have incomplete data.

regression example: you have a bunch of cities and some information about income (i didn't say anything about whether this is mean or median). you have a new city, and you want to know housing prices there. you try to find out based on what you know about the other cities.

janet: do you choose the model or does the machine choose the model?
courtenay: you choose the model; you still have to choose. there is a lot of domain knowledge. you can look at this example and say, yes, it looks linear, or, i am going to try a more complicated model. how do i know which of these models is better? you need to have additional data.

lilly: how similar is this to what economists and sociologists have been doing? instead of discovering the world, it has become about: can we make money off of it?

courtenay: i will talk about this tomorrow.
* statistics vs machine learning
* these are classical statistical techniques
* how is this different? why is it more shiny?
* some of it is real cultural differences, and some of it is the same
* at this level it looks like a lot of statistics, but as the field evolves and things get complicated, it is a little different

kavita: what is a model, is it like an algorithm?

courtenay: it is a semantic thing. the model is the object that you end up with at the end, with parameters, and you have an algorithm that changes the model. here the model is linear.

sylvia: [crude drawings of basic shapes]
* linear: the two values on the x and y axes grow together
* logarithmic: you have a data set which eventually levels off, like age
* exponential: (i lost my notes) you can use a logarithmic representation of exponential curves

sylvia is showing how you use logarithms to depict datasets with exponential growth: it is the same information but easier to read.

is there a library of these kinds of models, and you go to them and choose one?

seda: is this what you mean with a model, courtenay?

courtenay: you look at the data and you look to see what function you can fit.

carlin: i think about it as an equation. what a graph is doing is solving the equation: if x is this, y is that...

courtenay: there are a lot of algorithms and they are basically mathematical functions. the way you get the model may be complicated; running the algorithm may take 10 hours.

elizabeth: now i have to ask what an algorithm is. i thought i knew.

courtenay: usually you are doing optimization. typically you are iterating on how much error your prediction is making.

karissa: it is a bunch of steps that the computer executes

lilly: is it like a way of solving rubik's cubes?

carlin: it is a set of instructions. it could be self-referential inside.

elizabeth: sounds like a recipe and the result

courtenay: algorithms are this broad set of things. those are used to figure out your model: if you have these data points, you can get this line. it is not always so simple. this is an important task.

couldn't you look up the price of housing? what about predicting how much you would be willing to pay for a product?
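a minimal sketch of the regression example above (hypothetical income and price numbers, plain numpy; not code from the workshop):

```python
import numpy as np

# hypothetical data: median income (k$) and median house price (k$) for some cities
income = np.array([38.0, 42.0, 55.0, 61.0, 70.0, 83.0])
price = np.array([190.0, 205.0, 260.0, 300.0, 330.0, 410.0])

# fit the linear model price = w * income + b by least squares
w, b = np.polyfit(income, price, deg=1)

# predict the housing price for a new city we only know the income of
new_city_income = 65.0
predicted_price = w * new_city_income + b
print(f"predicted price: {predicted_price:.0f}k")
```

here "choosing the model" is the decision to fit a line at all; the algorithm (least squares) only finds the parameters w and b.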
classification example:
* we have some fish: 2 species, A and B
* hard to tell individual fish apart
* one of them is endangered; you want to be able to tell when it is that fish
* you don't want it overfished, but you can't tell by looking at it

you go out and observe these fish: the length and the number of stripes that you see. you get dna samples to test for species, and you figure out which species each fish belonged to.

features: attributes you observe about each example
class labels: ground truth; you know that is the true answer, the gold standard
training examples

lilly: what if you are not sure, you don't know how you want to classify them? i thought you were going to say the classifier would help you discern the clusters. in that case you don't have a ground truth; you want to discern the classifications.

courtenay: they look the same but they really are two different things. you want to identify which fish is which. you went to a lab and you have their dna.

seda: but isn't that a probabilistic model, too?!

courtenay: this is a toy example with a ground truth.

martha: what if it is a behavioral outcome and it depends on how you treat it? the outcome depends on how i treated you. you are not a fish.

courtenay: that would be about data contamination?

martha: ri is an A, courtenay is a B. i give courtenay a great credit card, but the result is the outcome.

courtenay: in the real world the gold standard is more complicated.

martha: for alternative pedagogy:
* we start with animals, where the social complication is not visible
* your credit is evaluated based on products you have consumed, but you can only consume those products if you have a good credit score

carlin: interesting problems here:
* what constitutes ground truth, and when is it reliable enough
* there is then a simplicity and complexity thing
* people often default to animals, balls, sports, because there is a need to go to a simple phenomenon, which turns out not to be a simple phenomenon
* it is an important thing anyway

martha:
* the first thing you teach kids is animals
* kids can be duck and cat, but not every kid can be like "good credit outcome"
* i wonder why this example starts here?

courtenay: that is how it started in my machine learning course

berns: that is a great question

lilly: women can be fish, too. that was my example. the objects are not politicized yet, or are depoliticized in the moment
[[i.e. the industrial rubber ball as the perfect simple object to build liveliness/character/spirit from]]

the first example in a machine learning book is how to choose the most perfect embryo. there's a lot of desire going on in there.

* you want to predict what this fish is
* you see that it is short and has reasonably few stripes; that seems close to species A
* but you see this other fish that is more confusing: still within the species A average range, but with a very different number of stripes
* if we want to solve this problem, to guess what fish this is, we need a model
* so i pull out math again
* in nature a lot of things are distributed with a bell curve, the gaussian distribution
* attributes can also fall into this pattern
* this says that most of the fish will fall in this middle part
* you also see outliers: some longer ones, some shorter ones, but no fish with length less than 0
* in this case we have decided, based on lots of years of studying animals, that a good model for a natural thing you see in nature is that this distribution is gaussian
* you can fit a probability distribution to what you have seen: you figure out what the mean is and the standard deviation
* that is the probability model that you fit to species A
* if you do species B, you have a different model; it has a higher average
* if they had different standard deviations, the bell curve would be broader or narrower
* then we know two things about the fish, and we can model these things jointly: we would have a 2-dimensional model
* the length probability is on one axis, the stripes on the other
* most of species A fall in the middle of this probability distribution; it falls in the middle of the cone there; some fall on the edges

lilly: you have data points. is this something you measured, or is this the model? is this the plotting of the data? there are some assumptions about the actual data you have.

courtenay: the guessing part is that you think it is going to approximately fit this shape. you hope your sample size is big enough so that your model is valid.

karissa: it can be a problem if people assume it is a curve like that, and they find out later that everything they did is wrong.

courtenay: not a lot of things follow this curve.

janet: you are also selecting a model and seeing if it works.

ri: how do you get to your model, what is the process?

courtenay: you look at the data.
* does it have a long tail?
* often you have too many features, so you visualize things differently
* depending on how much data you have, looking at the numbers is a good idea
* you could do a histogram: you take bins (0-5, 6-10, 11-15, ...), then you look to see in which box each data point falls; you can just count and see; if it looks like most are in the middle, you can have that shape

karissa: people love big data. if you have a lot of cases, you can pretty much tell instantly; a lot of them show up in the middle.

courtenay: it can be approximated as one.
* you have four models: a model of the distribution of each feature in each species
* you can have four bell curves
* and then you can look at each species' two features jointly, and hopefully they are well separated

now if we see a new fish, a data point, it goes somewhere in this plane on the bottom, and you see which of these models it falls closer to. this fish fits the B model better, so you can make a more educated guess that it is fish B. we observed those two feature attributes; we didn't have to send it off to the lab, we can guess now.

janet: how do you say it: "it is species B", or "this is probably species B"?

courtenay: you may look at the priors.

sylvia: confidence level: if it is in the red area, you are rather confident.

janet: gender map: someone with this height, long hair, short hair; is your computer going to decide who is male or female? or whether you can get tenure?

courtenay: a good scientist, you don't say "it is species B". but if you are google, you may tell advertisers that it is a man or a woman.

carlin: it is not a big deal to them if they get it wrong. it doesn't matter to them if they advertised to some of the wrong people.

janet: it does not matter to them that gender may be fluid.

courtenay: one takeaway: ml is using features that you can directly observe as a proxy to predict something you can't directly observe. there is no guarantee that you'll be right. there may be a lot of overlap: a fish might be an outlier for its species, or an abnormally large point in between your two models, and if you don't have more data, you can't really say.

takeaways:
* prediction is only as good as your models, and the assumption that your data does follow a particular distribution
* you need to observe a lot of fish of each species to build accurate models of them
* machine learning is what happens when you feed your models 1000s of fish
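a minimal sketch of the two-feature fish model above (made-up measurements; fit a mean and standard deviation per species per feature, then compare likelihoods for a new fish; not workshop code):

```python
import numpy as np
from scipy.stats import norm

# made-up training data: [length_cm, stripe_count], with lab-confirmed labels
species_a = np.array([[20.1, 4], [22.3, 5], [19.8, 4], [21.0, 6], [20.5, 5]])
species_b = np.array([[27.9, 9], [26.4, 8], [28.8, 10], [27.0, 9], [29.1, 8]])

def fit_gaussians(data):
    # one (mean, std) pair per feature: the fitted "model" for a species
    return data.mean(axis=0), data.std(axis=0)

mu_a, sd_a = fit_gaussians(species_a)
mu_b, sd_b = fit_gaussians(species_b)

def likelihood(fish, mu, sd):
    # naive assumption: features are independent, so multiply per-feature densities
    return np.prod(norm.pdf(fish, loc=mu, scale=sd))

new_fish = np.array([21.5, 5])
la = likelihood(new_fish, mu_a, sd_a)
lb = likelihood(new_fish, mu_b, sd_b)
print("probably species", "A" if la > lb else "B")
```

the "probably" in the print line matters: you are comparing how well two fitted bell curves explain the new fish, not reading off a certainty.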
courtenay: what is your confidence that your dataset is right? the people at mechanical turk, the labeling: there could be all sorts of data cleanliness problems, for almost anything. you want to deal with the noise, the outliers, the bad people submitting the form twice.

seda: claudia perlich was saying there is no wrong data, there is wrong interpretation of data.

courtenay: you can have adversarial data generation, for example; you can have wrong data.

ri: but that would be hard to separate. sometimes you watch tv on your girlfriend's account.

are there any advantages, or is it worth thinking about the value of "unclean" data?

jojo: you can change it. are you typing a transcript, you wonderful person?

martha: your prediction is as good as your models, but you only need your prediction to be as good as your need? most times when people want to do a critique, they ask if it is accurate, but maybe that is not the issue.

courtenay: yes, maybe you only need some percentage of success.

sylvia: if you get the wrong ad, no big deal. if you misdiagnose cancer, you need a more accurate model; then you need to watch out.

carlin: the scale is different in those two examples. any time it is a medical example, you are trying to take this wealth of statistics and apply it to a single body or case: you should do this because you are likely to have this risk. to go back to that, "need" is different; it also depends on the scale.

courtenay: google is trying to do predictions for each user, or netflix, but you may not care.

ri: amazon thinks that i am a recently divorced 50-year-old who does yoga. not much damage?

courtenay: high-paying jobs shown only to men.

helen, at the symposium:
* seda - rachel law - vortex
* uniqueness is a probabilistic feature in that moment... it is a combination of features

the wearable tech guy who says you can trade biometric profiles with people (to be someone else in that regard) is named Chris Dancy, twitter: @ServiceSphere

lilly: chelsea clinton: internet access is key to gender equality.
* where do we think development data comes from?
* people hired by universities and the world bank, who gather data through interviews
* a shift in data collection
* 800000 data points
* the tech industry can solve any problem with data
* techno, big data, and feminism
* investing in the middle class is the best way to bring about democracy
* history having the problems of big data over-generalizing
* the headline suggesting a correlation
* we could unpack how these correlations have many levels of spuriousness and assumptions

janet: that they are correlated is not causal

lilly: but the headline

sylvia: i stopped reading wired because it is so obviously written for men

janet: has it gotten worse? when you first read it, did you think it was not that way?
ri: it became more like a gq of gadgets. sylvia: ads for cars, watches and alcohol.

janet: the spurious correlations: http://www.tylervigen.com/spurious-correlations
* the number of movies with nicolas cage vs. murders in the pool

courtenay:
* models get more accurate -> predictions get more accurate
* this is true for our regression example, too: the more cities we observe, the better our prediction

there are lots of different classifier models; this is just one type. this is a gaussian naive bayes classifier:
* you assume features have a gaussian distribution
* it assumes each feature is unrelated to the others (not correlated with each other)

takeaways: a non-complete list of things people use for classification tasks:
* decision trees
* nearest neighbor: really naive classifiers. with the fish, we thought it was pretty close to A: you compute its similarity to every example you have seen, and because it was closest to an A, you say A, and you throw the statistical model out. sometimes it works really well.
* bayes: bayesian kinds of classification methods. these are kind of classical probabilistic methods, with a lot of complications on top of them. with bayesian methods you are looking at priors, specifically you are incorporating the base rate: if you observe that 25 percent of the fish are A and the rest B, you incorporate that into your final result.
* logistic regression
* support vector machines
* neural networks

janet: we always hear about bayesian stuff. is it that it includes probabilistic stuff, variables with probabilistic stuff?

courtenay:
* real models usually use more than 2 features; it's hard to visualize how they work and how they fail. we can maybe look at 2d or 3d, but it is really hard to understand at an intuitive level why things are working out.
* you try to figure out how well your predictions are doing: this is your training set here, and here is the test set. the test set needs to be labeled, too. when you have a model, you try to predict the things in the test set without looking at the labels, and then you look to see if you predicted well. which means you need more data.
* you could have a model and just throw it out into the real world, but you want to sort of believe that it is going to do what you think it is going to do.
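a minimal sketch of that train/test idea, using the nearest-neighbor classifier from the list above (invented fish data; hold labeled examples out as a test set and check the predictions against them):

```python
import numpy as np

# invented labeled fish: columns are length_cm, stripe_count; labels are the species
X = np.array([[20.1, 4], [22.3, 5], [19.8, 4], [27.9, 9], [26.4, 8],
              [28.8, 10], [21.0, 6], [27.0, 9], [20.5, 5], [29.1, 8]])
y = np.array(["A", "A", "A", "B", "B", "B", "A", "B", "A", "B"])

# split: the first 8 fish train the model, the last 2 are the held-out test set
X_train, y_train = X[:8], y[:8]
X_test, y_test = X[8:], y[8:]

def nearest_neighbor(fish):
    # 1-NN: compare to every training example, copy the closest one's label
    distances = np.linalg.norm(X_train - fish, axis=1)
    return y_train[np.argmin(distances)]

predictions = [nearest_neighbor(fish) for fish in X_test]
accuracy = np.mean(np.array(predictions) == y_test)
print(predictions, "accuracy:", accuracy)
```

the test labels are only used at the end, to score the guesses; that is the "believe it will do what you think" check.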
last example: in the real world you can do stuff even if your data is not fully labeled. it is harder, it is more uncertain. you may have tons of data and no labels; can we really not learn anything from it?

ri: you mean like confirmed labels?

courtenay: like the lab test.

carlin: for something to appear as data, won't some decisions have to be made? we have been using length; you know something has been measured, you need to know that it is inches. what do you need to know at minimum?

sylvia: labels

carlin: maybe i am talking about units

courtenay: yes, i am talking about ground truth labels. as long as you have examples, the fish does not need a name. with stripes and species, i don't need their label: you can throw them into a plot and look at them. that is data. everything is data.

courtenay: you need to have a reasonable belief or faith that the measurements of the coffee grounds are related to something i am predicting in the real world. you might be wrong; maybe you are measuring something that has no correlation.
* history of science: what people thought caused diseases seemed reasonable at the time, but it wasn't that
* advertisement: there is no guarantee that if you are a male in a specific city, the ad will work. it gets very subjective very fast in the real world.

can we learn something if we don't have ground truth, say about the species of the fish that we have? maybe we took measurements of all the fish, but we didn't even know they were from 2 different species populations. it is not just a matter of manually labeling the data: we don't even know what the labels should be. in this case you get into the broad heading of unsupervised learning. if you know the species, you know whether they are male or female; here you just have a bunch of numbers and data, and you are interested in the patterns in the data.

janet: i love the terminology, like workers that are unsupervised.

sylvia: like when you have a child: there is actually a correct answer, and whatever is learning, you are giving that answer.

carlin: it is a matter of whether there is a prior classification that supports that.

courtenay: labels -> supervised learning; without -> unsupervised.

standard techniques that you use here: the lengths and stripes. we have clusters, and each point is a single fish. we know that they are two different species, and they look like that. this is a lovely toy example, in a particular two-dimensional space that is perfectly visualizable. you can't do that with your customers: you don't know the structure of that data and you don't have a way to guess. the most basic thing you can do is cluster analysis.

the toy example i will show you uses a common algorithm called k-means clustering:
* you start by guessing that there are clusters in your data; you usually also guess how many clusters
* then you guess the centers and guess cluster membership
* it turns out that this will mathematically get you some nice clusters
* the algorithm: you pick two points at random as centers. maybe they are wrong; maybe they are both in the same cluster.
* then you do the most obvious thing you can do: you measure the distance to all the other points. you draw a line between the centers and draw a perpendicular line, and assume that this is a reasonable way to measure things.
* you do that, and you recompute the centers of the clusters: if all these red things are a cluster, where would the center be? you have moved your cluster centers.
* now you reiterate. the points have moved here; the blue points have overtaken some. you reiterate until the centers do not change anymore.
* at the end you get here, and you find your two species again

caveats: you still have to guess the number of clusters (two kinds of fish in this pond). you can guess at several different numbers of clusters and do an evaluation: you look at how tight the clusters are. more clusters make your model more complex. there is a bunch of hand-waving stuff here, but this is a thing you can do. it is like density estimation: figuring out if there are denser places in your feature space. this is an example of an algorithm.
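a minimal k-means sketch matching the walkthrough above (random made-up points, k fixed at 2, plain numpy; guess centers, reassign, recompute, repeat until the centers stop moving):

```python
import numpy as np

rng = np.random.default_rng(0)
# made-up unlabeled fish: two blobs we pretend we know nothing about
points = np.vstack([rng.normal([20, 5], 1.0, (30, 2)),
                    rng.normal([28, 9], 1.0, (30, 2))])

k = 2
centers = points[rng.choice(len(points), k, replace=False)]  # random initial guess

while True:
    # assign each point to its nearest center
    distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    membership = distances.argmin(axis=1)
    # recompute each center as the mean of its members
    new_centers = np.array([points[membership == i].mean(axis=0) for i in range(k)])
    if np.allclose(new_centers, centers):  # stop when the centers no longer move
        break
    centers = new_centers

print("cluster centers:\n", centers)
```

no labels anywhere: the two "species" fall out of the geometry alone, which is the unsupervised point.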
kavita: the cluster analysis will not tell you how many clusters there are?

courtenay: there are clever ways to guess. we can look at the data and say there are two. you can use histograms; visually you can look at it and know.

janet: i have been using cluster analysis in social network analysis, bibliometric analysis. you can get the computer to detect what it thinks is a clustering. if it is two clusters, do i have the relative distances through things?

sylvia: you can take a cluster and look to see if there are clusters inside it.

courtenay: yes, there are hierarchical clustering things you can do. in social networks there is a whole different set of things you might do: these two fish know each other; you have this whole extra set of data that goes with the attributes of each user.

carlin: one of the trickiest things in teaching: kids will do an analysis and come in with a gender binary, to figure out what is getting read, where gender gets assigned by twitter, who is participating, what kinds of assignments are happening, what makes them truthful, what feature you use, whether it makes more sense, and what to look at.

lilly: the developmental biologist anne fausto-sterling, on osteoporosis and how it is correlated with women, and how come. she critiques that and builds a process model of how osteoporosis comes to look correlated with women: if you play sports you are less likely to get it; race; the kind of work you do.

berns: the bone strengthener is marketed to petite white women.

carlin: osteoporosis gets discussed without that kind of specificity.

lilly: it takes a lot of labor to construct this other model. how can we use this data for other kinds of process stories of gender and race, without reifying them?

carlin: in the hospital you talk to people differently, not based on gender but on more specified risk.

berns: hypertension is discussed as race-based. cigarette smoking and hypertension... i am trying to remember how it was taught. it is: i don't even think about it, it is just how it is.
* boneeba medicine??
* there is a typical image for certain medications; they will be advertised to certain people, sometimes because their insurance is more likely to pay for that
* it is not about systemic issues, not "why is this woman having these issues". an african american woman is going to be more likely to be on this medication, and it is presented as a problem of her race, not that society was shit to her; inherently, this is what she will be, instead of what she went through.

lilly: correlation becomes the local cause and they just deal with it that way.

berns: then there are people who don't want to take the medicine; they are seen as non-adherent. that is supposed to be more compassionate, although some will call them non-compliant.

janet: that is an interesting label.

[THIS WAS SOMETIME BEFORE]
janet: you are parenting your computer
sylvia: i think robots are adorable; it is cute to watch

courtenay: there may be clusters of density, a little more of a grey area. finding interesting clusters: we may want to do something with this kind of analysis; without labels it may still allow you to make reasonable guesses.

FOR TOMORROW:
* bayesian statistics explanations: http://www.kevinboone.net/bayes.html
* sylvia would like to explain regression (30 minutes?)
* neural networks are going to take over the world??
seda mentions that there is AI that trains video game figures to act in specific ways.

seda says "what other politics are possible if there were other ways of querying data?"

seda: what kinds of queries can we make with machine learning to get at where discrimination starts? the problem is that when you categorize, you can then name and call out discrimination, but once you create the new category, that category has its own discriminatory potential.

anne fausto-sterling: http://www.annefaustosterling.com/

domain knowledge: you would use domain knowledge to get the parameters for a data set.

(discussion during hands-on weka session:)

carlin: i like what you say about domain knowledge.

kavita: our purpose is that we have a new piece of glass; is this helping us figure out what kind of glass it is?

courtenay: now it is not helping, because i took out the glass. does everyone understand what these histograms generally show?

lilly: can you read this file for us?

courtenay: breast cancer data. in this dataset there are 9 features: age, menopause, tumor size, and so on. the thing we are trying to predict is whether the cancer is likely to recur or not. we are looking at 286 examples; 201 did not have a recurrence and 85 did, and that is the class you are trying to predict. we would want to predict it by looking at some combination of the 9 features, or some subset thereof.

janet: we have a woman, 54, pre-menopausal, right breast.

courtenay: i can do the walking for you, too.

lilly: there is no x axis.

courtenay: this is what we were talking about before, numerical vs. nominal. these are much more nominal.

seda: you need to look at the arff file to find out what the values stand for.

courtenay: the way you read this is that 68 cases got radiation therapy, and about half of them had a recurrence and half didn't; the others didn't get radiation and did not have a recurrence. the information you can glean here is how different the percentages of the classes are. in this case it doesn't make sense: recurrences were a far less frequent event.

bernadette: it is a small amount that recurs.

courtenay: it is not unlikely. you have a vested interest in predicting who is going to recur.

kavita: does this mean that you are more likely to have a recurrence if you get radiation?

berns: the first part is people who got it, and it is 50-50; for the ones who didn't, there was a better chance.

courtenay: but you need to know whether those getting radiation were the ones seen as more serious cases.

sylvia: there are ways to present it that show a clear relationship, and there are ways which don't show what the relationship is. there are no clear relationships here. if we used some sort of algorithm, we could predict it, but through visualization, especially because they have different population sizes, it doesn't feel like a good example. or it is a weakness of the program; it is a little lame of them.

courtenay: i agree with you.

sylvia: the whole point of visualization is to see things.

berns: the safe thing we are agreeing on: you have to be careful with correlation and causation. i have been to a number of pharmaceutical presentations; they will take 2 people living 3-4 months longer and they will make claims.

janet: i see why you go into this. for people who got radiation it evened out; it looks like not getting radiation meant you did not have a recurrence. that is why you collect a whole bunch of data: because you want to show why the finding is part of other factors.

sylvia: this is also a way to manipulate data to get what you want.
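a sketch of poking at the same dataset outside weka (assumes the UCI breast-cancer ARFF is saved locally as breast-cancer.arff, and that its class attribute is named "Class"; both are assumptions, scipy can parse ARFF either way):

```python
from collections import Counter
from scipy.io import arff

# load the ARFF file weka uses (assumed local path and attribute name)
data, meta = arff.loadarff("breast-cancer.arff")

# nominal values come back as bytes; count the class balance
classes = Counter(value.decode() for value in data["Class"])
print(classes)  # expect roughly 201 no-recurrence vs 85 recurrence events
```

counting the class balance first is exactly the point made above: the recurrence class is the rare one, which shapes how every histogram should be read.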
seda: they claim more data is always better; it is better to have the noise?
* if you have a lot of features, instead of fitting a straight line, you would be fitting this thing that goes through every point; you don't want to predict this thing in between
* there is a way you can measure: i decide on a certain feature. i am looking at glass and i want to know if it is transparent. if you see that it is evenly distributed across all glass, then you know it is not a relevant feature.
* so there are certain features that are an indicator of the label

the overfitting problem

DAY 2:
http://www.thenewyorkworld.com/
https://nycopendata.socrata.com/

where to find data: what the important attributes are, what there is data for, what there isn't data for. lilly couldn't find any data on contractors. unpacking "mechanical turk". CUP lab: data siphons. data politics in NYC.

martha: certain datasets won't be more -- how much learning can your machine do?

courtenay: exploratory actions on data, or finding.

the picketty dataset is open!

martha: what is the difference between prediction and learning?

courtenay: maybe? someone may or may not believe you have proven something with your predictions. there are no unknowns that you can point to. putting "machine learning" on a pedestal as separate from data mining or statistics is dangerous.

how many nation states in picketty? martha: the european ones? seda: 40, based on his definition?

courtenay: you can still make predictions for a new country. cross-validation: hold out one data point and then see how well you predict the missing country, to test your model.

seda: there's always a prediction; isn't the question how reliable the prediction is? what is prediction?

courtenay: you don't know a value, so you attempt to guess it.

seda: the act of using a function to come up with a value you don't know.

courtenay: you may be artificially obscuring the value in order to test. that's still prediction.

kavita: can you do predictions on datasets from the past, for which it's not possible to go back and collect?

courtenay: you can still do what you want; it's a philosophical, scientific thing. going forward you're not going to be able to make predictions, but it can tell you if you have a good model of the phenomena.

lilly: it reminds me of talking to mathematicians and scientists: you don't have a theory unless you make predictions about the future. ethnographers work differently: if you don't know how the data was created, you don't have a theory. new ways to explore models. the potential parameters are infinite.

courtenay: the end game doesn't have to be classification. the field of machine learning is driven by prediction, but the techniques are statistical techniques. there are other ways of seeing if things are correlated.

bernadette: last night i thought about farming data. labor has been low on the farm until the summer youth came; now it's spic and span. number of workers, with hours put in, against crop outputs.

jojo: when you said farming data, i thought you were talking about the labor of preparing data for use later on: the workers come in and clean it up and it is ready for harvesting.

joanne: the wikileaks data is CSV.

seda: text analysis will be interesting.

SLIDES / courtenay presentation:

courtenay: touching on what correlations are, using the spurious correlations site: everyone knows correlation doesn't imply causation, but it doesn't necessarily even mean correlation.

martha: but is it predictive?

courtenay: there is no reason to believe that they would be.

seda: google food: all sorts of debates. world bank discussions. google food trends worked because of years and years of data collected by scientists.
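a sketch of the hold-one-out idea courtenay describes above (leave-one-out cross-validation; toy made-up per-country numbers, predicting each held-out country's value from a line fit on the rest):

```python
import numpy as np

# toy stand-in for a cross-country dataset: one feature x, one value y per country
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1])

errors = []
for held_out in range(len(x)):
    train = np.arange(len(x)) != held_out      # every country except one
    w, b = np.polyfit(x[train], y[train], 1)   # fit the model without it
    prediction = w * x[held_out] + b           # predict the "missing" country
    errors.append(abs(prediction - y[held_out]))

print("mean leave-one-out error:", np.mean(errors))
```

the held-out value is artificially obscured, exactly as courtenay says: you know the answer, but the model doesn't, which is what makes it a test.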
how good is prediction without another kind of ground truth?

courtenay: you can go out and look, and a couple are going to look really great and all the rest won't work. you just pick the ones that look good.

lilly: isn't the point that you don't need common sense?

courtenay: maybe they're both correlated to other things. maybe there are other variables. you convince yourself that they are correlated; maybe they are correlated to other things, but you have convinced yourself that this is the correlation. the correlation will be spurious because they...

seda: constant is right now doing a workshop: how do we create common sense with machine learning?

berns: we make lists when we hit problems.

courtenay: human brains are good at making spurious correlations.

berns: cognitive therapists.

courtenay showing correlations of different types and strengths. if your classifier doesn't work, you might just not have enough information.

courtenay: sometimes you just have data, and maybe you won't be able to predict what you want to predict.

seda: is there any data on data that confuses classifiers?

courtenay: if it is uncorrelated with the class, it shouldn't throw off your classifier; your classifier will learn to weight it as zero.

from the discussion yesterday: if the length and the number of stripes of fish are correlated, a model that assumes they are independent might not work very well, because you count the same information twice: it's repeated in two places and the model doesn't take this into account. no double counting!! bad! solution: you could switch to a model that doesn't assume independent features.

the other broad philosophical point was the fight between statisticians and machine learning. i was vaguely aware of the fight, aware that there was some tension; maybe yesterday one of you asked: is this any different from statistics? here is a joke i found: a table of differences between the two, mostly terminological (weights vs. parameters, etc.), but a large grant in ml is 1,000,000 whereas in statistics a large grant is 50,000. lots of overlap and lots of cultural differences; the practices have evolved into different standards.

andrew gelman says: maybe we should remove models and assumptions, because then we can solve the problems that the machine learning people can solve.

courtenay: there are people who believe more or less in one or the other dogma. one commentator on stackexchange says ml experts do not spend enough time on fundamentals, and many of them do not understand optimal decision making and proper accuracy scoring rules; statisticians spend too little time learning good programming practice and new computational languages.

martha: can you explain the second statement, about statisticians?

courtenay: humans aren't super into change. the discipline evolved in a specific way, before computers were around, in a culture in which people don't jump to the most immediate new software. fewer people in statistics departments know how to program.

seda: ML comes from CS; statisticians come from mathematics.

lilly: chalkboards, slow proofs (math) vs. prototypes (CS), fast moving.

seda: a mathematician: if you don't understand what your algorithm is doing, it's wrong. one big issue: giant data sets. efficiency is about quantifying results.

when social scientists look at this debate, they ask whether it is right or wrong, but the test by which something is successful in the world is not whether it is right or wrong, but whether it "works". it is hard to make it stick, but it is working: the ml person says "it is working", and the statistician says "it is wrong".

jojo: it depends on what you mean by what matters?
lilly: machine learning and statistics are competing for legitimacy over what is the right way to work with this data. it could be that in the debates about what is right and wrong, by participating in those debates, the ml people are legitimizing their discipline.

martha: for some of these guys what is at stake is not publishing a paper, but having a successful company. if they say all that matters is that they have a correlation... different social worlds, different stakes.

courtenay: techniques developed in academia get adopted elsewhere. a lot of the techniques get developed in the academic setting, but in many cases, outside of academia, if it works, it works. columbia was very mathy and proof-oriented; that is the academic thing. in practice it is a very computer science and engineering mindset: i built it, it works.

lilly: some friends would consult for the cia and such. for intelligence vs. ad prediction, there may be different standards?

courtenay: i don't know, theoretically, what the standards are behind that wall [of intelligence].

martha: you were trained pre-data-science?

courtenay: i finished at the end of 2012. i was in machine learning courses in 2007-2008. i was taught the hot methods of the time (which were not neural networks, and a lot of that has since been taken over), by people concerned with theory and proofs.

kavita: the real-timeness of data: ml people have access to data that is constantly coming in and being optimized on, while statisticians deal with more static data?

courtenay: it is less about real time than about dealing with larger datasets, which data scientists have been doing for a long time; statisticians may not be as comfortable with that.

lilly: would twitter search be one of these computational processes? twitter search has a real-time problem: topics are cultural context that are not indexable terms, so they hire mechanical turkers to find something very quickly. timing matters, limitations. turk workers to bootstrap.

courtenay: detecting the density of topics.

martha: i just thought of something: the credit scoring had 12 items, and that is how many items someone working on paper could add up. she would be doing the computation live, and that is about computational efficiency. the debate today has machine learners saying these people are archaic, but it was computational efficiency; i did not think of it as computational efficiency because it has been transformed by infrastructures.

courtenay: it is not necessarily that 12 variables is a bad system. it is more about the number of data points than the number of variables (features).

martha: given all the data that could be credit data, it looks archaic.

courtenay: there is such a thing as too many variables, and there are too-complex models as well.

berns: that sounds like plain bullshit to me. you cannot have enough considerations in this case.

courtenay: i agree that you may need more than 12 factors, but for some things it may be enough.
martha: we are having this debate because of computational infrastructure. when we phrase the debate, the only reason we are considering more than 12 is because these guys have amplified their capacities in the last 50 years.

lilly: in a lab, machine learning: we are storing so much more data, we need to gain more financial value from this data. we need to get more value because we have more data, not: we want more data because we can get more value.

courtenay: it is cheap enough that you can store everything.

martha: the debate that you are pointing out, between statisticians who are cheap and ml people who can maximize: the debate is created by the economics of the environment.

courtenay: statisticians use computers; they may not be up to par with the latest computational infrastructure. the data is probably generated by a tech company who is interested in doing this thing on its own data. as far as academic departments go, they could be trying to solve the same problems.

lilly: are you saying the difference is that cs people need grants to get machines? entrepreneurial grant-getting.

martha: the million dollars have to be for something.

courtenay: they are also being snarky about it being a fad. it is cutting edge and popular, and statistics has a marketing problem. ML gets the money because...

berns: there is a race issue there, too, as to who teaches you statistics and computer science. my statistics teachers were people of color.

seda: within computer science there are layers of people who are more proofy. clean definitions.

courtenay: then there are the ones who hack.

seda: the upper echelons -- it's class. the middle-class belt: less lofty, more likely to do applied stuff, don't mind being engaged with money. privacy is upper, surveillance is middle. ML gets new folks: physicists and biochemists, who need the techniques. they go into hedge funds, big data systems. how do physicists deal with complex social issues?

lilly: physics is such a male-dominated field.

seda: except in Iran. a hard crowd for me to read; they tend to be polymaths in my experience.

courtenay: engineering mindset: but practically, it worked! i don't know what he means by "worked", but the proof is in the pudding.

martha: usually managerial.

courtenay: lots of complaints to be had about this attitude.

broad takeaway, two outlooks:
* classical stats: hypothesis testing
* ML: getting predictions to work, even in the face of a lack of interpretability of the models

lilly: as an ethnographer i now feel aligned with classical stats.

courtenay: if ML is more "successful", it comes from the large-scale resources: wringing the last bits of success out of those things rather than doing something more profound.

martha: could there be a synthesis?

courtenay: most things are black boxes, but there's a real interest in doing this; run models backward; google deep dream. no one likes black boxes. people like to know how things work, also because they want to improve them.

feature normalization: what if we have this data: much less variability in # stripes than in length, a big difference in scales. it is a problem if you are trying to calculate distance between things. change each feature to have mean = 0, standard deviation = 1. an obvious thing that people in intro cs classes don't do. (see the sketch below.)

DATA SHARING (EMAIL PHOTOS!)
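a minimal sketch of that normalization (z-scoring; made-up fish again, where length ranges over hundreds of units and stripes over a handful):

```python
import numpy as np

# made-up features on very different scales: length, stripe_count
X = np.array([[203.0, 4], [251.0, 9], [198.0, 5], [287.0, 8], [222.0, 6]])

# z-score each column: subtract its mean, divide by its standard deviation
mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_norm = (X - mu) / sigma

# every feature now has mean 0 and std 1, so no single feature dominates distances;
# keeping mu and sigma around lets you transform new raw data and also back it out
print(X_norm.mean(axis=0).round(6), X_norm.std(axis=0))
```

keeping mu and sigma is the point made just below: the transformation is reversible, nothing is obfuscated.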
courtenay: something can look extremely small: you are looking at particles, at how toxic something is. the numbers may look very small to you, but you want to stretch them out to be able to evaluate their significance.

martha: do we know that the difference is equal? it will then become testable as to whether it is meaningful.

courtenay: it depends on your classification model; some models will need normalized data. you are not changing the information in that variable, you have made it easier for algorithms to work with, and maybe for you to view as a human. there could be a diagonal relationship that you can see better in context, because of the resolution.

martha: if the variable is useless, after the transformation it is still useless.

courtenay: there are other normalizations that you can do; that is statistical normalization of data.

martha: is there a relationship between normalization and the inability to reverse-engineer?

courtenay: no. you usually have the raw data, and you know the mean, so you can go back. it depends on the final model: the final classification model somewhere inside of it has the value of the mean that it needs to subtract off. that value is a parameter in the model, and you know what it is, so that you can make the transformation on the raw data coming in. that also means that you can back it out. you are not obfuscating anything.

feature selection and dimensionality reduction: we might not even need all the features we have to do well on prediction; we might need something that we don't have. sometimes the important thing is to figure out which ones to throw away:
* features with no correlation to the class
* features that are redundant with each other: if their correlation is one to one, you can throw one away
* 2 features are redundant if they are highly correlated with each other: you're really getting the same information from both, and overcomplicating the model
* you could do this manually

dimensionality reduction: a way to compress data so that you can extract a smaller set of uncorrelated features. there is a set of mathematical transformations to do this. you end up with a whole new set of features, each a function of the features you put in: you have x, y, and z, and you end up with a, b, and c, where a, b, and c are functions of combinations of x, y, and z, such that they are all orthogonal to each other. the output variables don't have correlations with each other. it is a form of mathematical projection; you are changing your axes. i can't give you an intuition.

martha: you perform something on each data point and transform it into something else.

courtenay: it is an automatic way of compressing the correlated relationships into uncorrelated variables.

martha: you take variables that are correlated...

courtenay: ...and now the features are uncorrelated.

ri: you need to know what is correlated.

courtenay: you do a mathematical transformation that does it.

lilly: whether two things are correlated is a statistical relationship, right, so the stats does the job for you?

courtenay: it is factor analysis; there are a bunch of ways to do it. principal components analysis. (a small sketch follows below.)
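a small sketch of the principal components idea just named (not the workshop's own code; made-up correlated x, y, z columns, projected onto uncorrelated axes with numpy's SVD):

```python
import numpy as np

rng = np.random.default_rng(1)
# made-up data: z is nearly a copy of x, so the three columns are correlated
x = rng.normal(size=200)
y = rng.normal(size=200)
z = x + 0.05 * rng.normal(size=200)
X = np.column_stack([x, y, z])

# PCA via SVD: center the data, then project onto the principal axes
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Xc @ Vt.T              # the new features a, b, c: uncorrelated by construction
print(S**2 / (len(X) - 1))          # variance carried by each new feature

# keep only the top 2 components: the redundant x/z pair collapses into one axis
reduced = components[:, :2]
```

the printed variances drop off sharply after the second component, which is the cue for throwing the bottom ones away, as described next.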
martha: compressed?

courtenay: you are assuming there is a lossless representation, and you do something so that the features are now uncorrelated, so that they are easier to feed into a model.

lilly: we thought these were correlated.

courtenay: after the transformation, you take the top x values and throw away the less important, less informative values at the bottom. it is a purposeful transformation in that way.

ri: i thought we were trying to find what does correlate; how come we can now all of a sudden identify what is correlated?

martha: is compression like making juice out of vegetables? these are dimensionality-reduced points, because it is too slow to eat carrots?

courtenay: basically, you are going to feed the top few dimensions to a classifier. because you know these things are uncorrelated, you don't have bad feature correlations fucking up your models. if you have two variables that are really correlated, all the information that was contained in those two is compressed into one feature; you are not going to have the statistical problem of overweighting those features.

lilly: combine marriage and margarine into one feature.

courtenay: there is something to be said about the distance on the y axis; it does not mean anything. the slight difference between the shapes.

lilly: a band of difference is acceptable?

courtenay: yes.

lilly: but the band can matter.

courtenay: the mathematical transformation will not take semantics into account.

balanced datasets: sometimes you notice your classifier is doing suspiciously well (95 percent accuracy), and then you notice that your data looks like this: 95 percent class A, 5 percent class B. maybe you can go collect more examples of class B and make your model better. you could also use re-sampling methods to feed your classifier more balanced (if slightly synthetic) data. it is good to be aware of the relative balance of classes in your data and think about how it might be affecting your predictions.

cross-validation: standard machine learning practice. you need twice as much data, a training and a test set. instead of 2 fixed datasets, you split the data into train/test multiple times, for multiple experiments, and take the average results. more samples -> results more likely to be statistically valid. weka: 10-fold validation means that it built 10 different classifiers. (a sketch follows below.)

martha: is that the kind of validation you do if you have lots of data?

courtenay: if you have tons of data, you can do a single split and that is ok. this is going to help you more if you don't have a lot of data. it is generally a good thing to do; it is sampling more.

berns: what do ml people call statistical significance?

courtenay: you say that you did ten-fold cross-validation. the part of your paper where you prove that there is statistical validity in the results is a little bit more lax.
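a minimal sketch of k-fold cross-validation as just described (the toy fish-style data again, 5 folds instead of weka's 10 so each fold holds a few points; train on the rest, test on the held-out fold, average):

```python
import numpy as np

rng = np.random.default_rng(2)
# toy labeled data: two gaussian blobs, labels 0 and 1
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

k = 5
order = rng.permutation(len(X))   # shuffle, then cut into k folds
folds = np.array_split(order, k)

def nearest_neighbor_accuracy(train_idx, test_idx):
    correct = 0
    for i in test_idx:
        dists = np.linalg.norm(X[train_idx] - X[i], axis=1)
        correct += y[train_idx][np.argmin(dists)] == y[i]
    return correct / len(test_idx)

# each fold takes one turn as the test set; the other k-1 folds train the model
scores = [nearest_neighbor_accuracy(np.concatenate(folds[:i] + folds[i + 1:]), folds[i])
          for i in range(k)]
print("per-fold accuracy:", scores, "mean:", np.mean(scores))
```

averaging over k different splits is the "sampling more" courtenay mentions: one lucky split can flatter a model, five are harder to fool.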
overfitting: sylvia was describing the picture yesterday. model complexity >> training examples.
* often happens when # features >> # training examples
* your model is too complex: you didn't have enough training examples to justify the complexity of your model
* a lot of the models weigh each feature: your model is the weighted version of those 12 features, but maybe you only saw 6 examples
* you have few data points and you are trying to fit a model with way more parameters, and you are going to overfit
* sort of: overfitting lets your model be more creatively wrong
* occam's razor: simpler models are usually better
* classic visual demonstration: a simple linear model vs. a more complicated model that goes through every point. which would you expect to be more correct for new examples?
* the canonical way of seeing if you overfit: you look at your prediction performance on the training set. if you are doing really well on your training data and shitty on the test data, you know you have done something wrong!! :)

feedback and future discussion:

berns: people like one-on-one; i like it when we are in a group. i am not even good with breaking out into groups.

courtenay: this was a good size group for hands-on stuff today. the group was a little bigger yesterday, which made it harder for hands-on, but better for discussion.

kavita: the pace was good; at no point was i dragging.

ri: the structure was well thought out.

joanne: great. i didn't think yesterday was overwhelming with the larger class size; i thought you were going over a lot of vocabulary. structurally, to add: i wasn't sure what i was going to learn. i see machine learning all the time; one thing that could help is if there were 5 questions that would be answered.

courtenay: in the context that you see ml all the time, did it cater to your expectations?

joanne: i was worried that it would be too technical. i am glad that i came. there might have been a way to point out what we would learn.

courtenay: there was an initial description that was more technical. it is good to think about where we went and what language would be friendly to the people that we want to attract.

berns: my friend was worried that it would be way over her head.

ri: what worked well: "if you know about machine learning, you should come." that added to it, the people explaining; that was good dynamic management. it is difficult when you are teaching a technical subject to put it on a level that keeps everyone interested.

kavita: i would love to have a discussion on the social and cultural significance of machine learning taking over certain functions.

ri: i would say the opposite: let's get our datasets.

courtenay: this was great; i was terrified. it was kind of tough, a long road from "do you want to do an ml workshop" to what that would look like, asking that question again and again. i am so sorry we didn't get to the societal implications or get to the data. this has been really fun: i got all kinds of perspectives and questions that i hadn't thought about, and taught myself things that i didn't know or had forgotten.

ri: if you want to choose different spaces, this was a wonderful space, trans-inclusive.

joanne: i don't want to call things "all women": what if someone transitions? invalidating.

berns: i have had people not feel included.

joanne: you just can't say "no cis guys". eyebeam: i could talk with them if you need space; it is a nice space. new inc might be open, too; they might be good. eyebeam would be open.

berns: because it is new york, we have our hands in such cool things. you wouldn't want to spam, but if we could email a central person: "i am putting out this event on thursday", "i heard about this event and it might be of interest". a monthly newsletter.

joanne: if you could have ela come and do some basic security, if she would be up for that, that would be amazing, especially since she has been doing threat modeling. she would be happy to test out most of her talks; she is often talking about these things.