Courtenay leads discussion on ML
machine learning is really just observing patterns,
in lots of data
what is it used for:
speech recognition
face recognition
language translation
predicting consumer behavior
predicting financial markets
analyzing social networks
making business decisions
workflow of machine learning:
- repeat until you get out what you want: correlations, profit, phd etc.
-
there are typically two types of problems:
- classification: most times when people talk about machine learning, they mean classification
- regression: predict the value of something in a range of values
-
classification asks questions like:
- is this a good or bad email (spam)
- is this a dog, or cat, or a rabbit
-
regression:
- given sales data for the last 12 months, what will next month's sales be?
-
- is it always about:
- malignant tumors: it is something you cannot see, a wall that keeps you from seeing things
-
janet: is it like you want to predict what netflix thinks of you?
betsy:
if i could at least come at the end
wanted to say thank you so much
i am having fun geeking out with all these women
i have been telling people
i thought it would be counter to the culture to do the fb thing
it was safe to ask any question and yet it was not dumb
you can take step back
it was a wonderful moment to be able to ask those questions
it was great to realize that i knew more than i thought i did
as a sociologist, and i know statistics
but i rejected it
but realizing that i still have that knowledge and can use to analyze my new project
so i got great interview questions yesterday
when i have access to tech designers, these questions will help me to get to what i want faster
i will sound more like i know what they are doing
i looked at the datasets
do i want to put that on my computer
ri: how did you select weka
i looked at matlab, i am not there yet
what is a good tool?
courtenay: i did all of my phd in matlab, i hate it, it is proprietary
weka is a shitty visualization tool
it is the only thing that i know of that will allow you to do machine learning
run real algorithms out of box
and load your dataset
more programming is scary for people who don't know how to program
weka was started in 1993
i think it was a reasonable choice for what i was trying to do here
if you don't want classifier, but just want to visualize, there are surely nicer guis
ri: that would be great
betsy: is weka like a wordpress for data
seda: there is a data science course by mako hill, the material is online, i will add it to the email.
joanne:
suggestion for a topic
a workshop on the blockchain
with some of the implications
seda: adversarial machine learning algorithms - emails that are sent in order to figure out how machine learning works
asking a question about "data cleaning"? do you use regression to fill in holes in data set
incomplete data
when you are missing labels
instead of getting a human to do it, you try to bootstrap your algorithm
you make your first guess at your algorithm
you may be compounding your errors
that is something that happens when you have incomplete data
regression example:
you have a bunch of cities and some information about income (i didn't say anything about whether this is mean or median)
you have a new city, and you want to know housing prices there
you try to find out based on what you know about other cities
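a minimal sketch of the regression example above, assuming the relationship between income and housing price is roughly linear. the city numbers are made up for illustration, and `fit_line` and `predict_price` are hypothetical helper names, not something shown in the session:

```python
# Hypothetical numbers, invented purely for illustration:
# (income in $1000s, house price in $1000s) for cities we already know.
cities = [(40, 180), (55, 260), (62, 290), (75, 360), (90, 430)]


def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# The "model" is just these two numbers (the parameters).
slope, intercept = fit_line(cities)


def predict_price(income):
    """Use the fitted line to guess the price for a city we have not seen."""
    return slope * income + intercept


new_city_income = 68
print(f"predicted price for the new city: {predict_price(new_city_income):.0f}k")
```

here the model is the pair `slope`, `intercept`, and the algorithm is the arithmetic inside `fit_line` that produces them from the data.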
janet: do you choose the model or does the machine choose the model?
courtenay: you choose the model, you still have to choose
there is a lot of domain knowledge
you can look at this example and say, yes it looks linear
or i am going to try a more complicated model
how do i know which of these models are better?
you need to have additional data
lilly: how similar is this to what economist and sociologists have been doing?
- instead of discovering the world, it has become about can we make money off of it.
-
courtenay: i will talk about this tomorrow
- statistics vs machine learning
- these are classical statistical techniques
- how is this different? why is it more shiny?
- some of it is real cultural differences
- and some of it is the same
-
- at this level it looks like a lot of statistics
- but as the field evolves and things get complicated, it is a little different
-
kavita: what is a model, is it like an algorithm?
courtenay:
- it is a semantic thing
- the model is the object that you end up with at the end with parameters
- and you have an algorithm that changes the model
-
- here the model is linear
-
-
sylvia:
- crude drawings of basic shapes
- linear: the two values on the x and y axis grow together
- logarithmic: you have a data set which eventually levels off, like age
- exponential: i lost my notes, you can use a logarithmic representation of the exponential curves
- exponential: sylvia is showing how you use algorithms to depict datasets with exponential growth
it is the same information but easier to read
is there a library of these kinds of models and you go to them and choose them?
seda: is this what you mean with a model, courtenay?
courtenay:
- you look at the data and you look to see what function you can fit.
-
carlin: i think about it as an equation
- what a graph is doing is solving the equation: if x is this, y is that...
-
courtenay:
- there are a lot of algorithms and they are basically mathematical functions
- the way you get the model may be complicated, running the algorithm may take 10 hours
-
elizabeth:
- Now I have to ask what is an algorithm, I thought I knew.
-
courtenay:
- usually you can be doing optimization
- typically you are iterating on how much error your prediction is making
-
karissa: it is a bunch of steps that the computer executes
lilly: is it like a way of solving rubik's cubes
carlin: it is a set of instructions
- it could be self-referential inside
-
elizabeth: sounds like a recipe and the result
courtenay:
- algorithms are this broad set of things
- those are used to figure out your model
- if you have these data points, you can get this line
- it is not always so simple
- this is an important task
-
couldn't you look up the price of housing?
what about predicting how much you would be willing to pay for a product?
classification example:
- example:
- we have some fish
- 2 species A and B
- hard to tell individual fish apart
- one of them is in danger, you want to tell when it is that fish
- you don't want it overfished, but you can't tell by looking at it
-
you go out and observe these fish, the length and the number of stripes that you see
you get dna samples to test for species
you figure out which species the fish belonged to
features: attributes you observe about each example
class labels: ground truth, you know that is the true answer, gold standard
training examples
lilly:
you are not sure, you don't know how you want to classify them
i thought you were going to say, the classifier would help you discern the clusters
in that case you don't have a ground truth, you want to discern the classifications
courtenay: they look the same but they really are two different things
you want to identify which fish is which
you went to a lab
and you have their dna
seda: but isn't that a probabilistic model, too?!
courtenay: this is a toy example with a ground truth
martha:
what if it is a behavioral outcome and it depends on how you treat it
the outcome depends on how i treated you
you are not a fish
courtenay:
that would be about data contamination?
martha:
ri is a a
courtenay is a b
i give courtenay a great credit card
but the result is the outcome
courtenay:
in the real world the gold standard is more complicated
martha:
- for alternative pedagogy
- we start with animals where the social complication is not visible
-
- your credit is evaluated based on products you have consumed
- but you can consume these products if you have a good credit score
-
carlin:
- interesting problems here
- what constitutes ground truth, when is it reliable enough
- there is then a simplicity and complexity thing
- people often default to animals, balls, sports, because there is a need to go to a simple phenomenon
- which turns out not to be a simple phenomenon
- it is an important thing anyways
-
martha:
- the first thing you teach kids is animals
- kids can be duck and cat
- but not every kid can be like good credit outcome
- i wonder why this example starts here?
-
courtenay:
that is how it started in my machine learning course
berns: that is a great question
lilly: women can be fish, too. that was my example
the objects are not politicized yet or are depoliticized in the moment [[ie. the industrial rubber ball as the perfect simple object to build liveliness/character/spirit from]]
first example in machine learning book is how to choose most perfect embryo
there's a lot of desire that is going on in there
- you want to predict what this fish is
- you see that it is short, it has reasonably few stripes
- that seems close to species a
- but you see this other fish that is more confusing
- still within species a average range
- but has a very different number of stripes
- if we want to solve this problem, to guess what fish this is
- we need a model
- so i pull out math again
- in nature a lot of things get distributed with a bell curve, the gaussian distribution
- attributes can also fall into this pattern
- this says that most of the fish will fall in this middle part
- you also see outliers
- you see some longer ones, some shorter ones
- no fish with length less than 0
- in this case, we have decided based on lots of years of studying animals
- a good model for a natural thing you see in nature
- that this distribution is gaussian
- you can fit a probability distribution of what you have seen
- and you figure out what the mean is and the standard deviation
- the probability model that you fit to species A
- and if you do species B, you have a different model
- it has a higher average
- if they had different standard deviations, the bell curve would be broader or narrower
- then, we know two things about the fish
- we can model these things jointly
- and we would have a 2 dimensional model
- the length probably is on one axis
- the stripe is on the other
- most species a are in the middle of this probability distribution
- it falls in the middle of the cone there
- some fall on the edges here
-
-
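a small sketch of what "fitting a probability distribution" means for one feature, assuming lengths really are roughly gaussian. the measurements are made up for illustration and `fit_gaussian` is a hypothetical helper name:

```python
from statistics import NormalDist, mean, stdev

# Hypothetical length measurements (cm), invented for illustration.
lengths_a = [9.1, 10.4, 9.8, 11.0, 10.2, 9.5, 10.7, 10.0]
lengths_b = [13.8, 15.1, 14.2, 16.0, 14.9, 15.5, 14.4, 15.3]


def fit_gaussian(values):
    """Fitting here just means estimating the mean and the standard deviation."""
    return NormalDist(mu=mean(values), sigma=stdev(values))


model_a = fit_gaussian(lengths_a)
model_b = fit_gaussian(lengths_b)

print(f"species A: mean={model_a.mean:.2f}, sd={model_a.stdev:.2f}")
print(f"species B: mean={model_b.mean:.2f}, sd={model_b.stdev:.2f}")

# Height of each bell curve at a length of 10 cm:
# the A curve is much higher there than the B curve.
print(model_a.pdf(10.0), model_b.pdf(10.0))

# A gaussian still gives a tiny, non-zero probability to negative lengths,
# even though no fish has a length less than 0.
print(model_a.cdf(0.0))
```

doing the same for the number of stripes gives the second bell curve per species; modelling both features jointly gives the 2 dimensional picture described above.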
lilly:
you have data point
- is this something you measured or is this the model
- is this the plotting of the data
- there are some assumptions about the actual data you have
-
courtenay:
the guessing part is that you think it is going to approximately fit this shape
you hope your sample size is big enough
so that your model is valid
karissa:
it can be a problem if people assume it is a curve like that and they find out later that everything they did is wrong
courtenay:
not a lot of things follow this curve
janet: you are also selecting a model and see if it works
ri: how do you get to your model, what is the process?
courtenay: you look at the data
- does it have a long tail?
- often you have too many features
- you visualize things differently
-
- depending on how much data you have
- looking at the numbers is a good idea
- you could do a histogram
- you take bins: 0-5, 6-10, 11-15...
- and then you look to see in which box you put your data
- you can just count and see
- if it looks like most are in the middle you can have that shape
-
-
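the counting-into-bins idea above can be sketched like this, assuming non-negative whole-number measurements. the values are made up and `bin_label` is a hypothetical helper name:

```python
from collections import Counter

# Hypothetical whole-number measurements, invented for illustration.
values = [3, 7, 8, 9, 9, 10, 11, 12, 12, 13, 14, 17, 8, 11, 12, 6, 13, 19, 2, 10]


def bin_label(value, width=5):
    """Map a non-negative whole number to the bins 0-5, 6-10, 11-15, ..."""
    if value <= width:
        return f"0-{width}"
    low = ((value - 1) // width) * width + 1
    return f"{low}-{low + width - 1}"


# Just count how many values land in each bin.
counts = Counter(bin_label(v) for v in values)

# Print a crude text histogram, bins in increasing order.
for label in sorted(counts, key=lambda s: int(s.split("-")[0])):
    print(f"{label:>6} {'#' * counts[label]}")
```

with these made-up values most of the counts land in the two middle bins, which is the "most are in the middle" shape.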
karissa:
- people love big data, if you have a lot of cases, you can pretty much tell instantly
- a lot of them show up in the middle
-
courtenay: it can be approximated as one
- you have four models
- you have a model of the distribution in one species
- you can have four bell curves
- and then you can look at each species two features jointly
- and hopefully they are well separated
-
now if we see a new fish, a data point, and it goes somewhere in this plane
on the bottom
now you see which of these models it falls closer to
this fish is closer to this b model
better
and you can make a more educated guess that it is fish b
we observe those two feature attributes
we didn't have to send it off to the lab, we can guess now
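a sketch of "see which of these models it falls closer to", assuming the four bell curves have already been fitted and that both species are equally common. the means and standard deviations are made up, and `likelihood` and `classify` are hypothetical names:

```python
from statistics import NormalDist

# Hypothetical fitted models (mean, standard deviation) per species and feature.
models = {
    "A": {"length": NormalDist(10.0, 1.0), "stripes": NormalDist(4.0, 1.0)},
    "B": {"length": NormalDist(15.0, 1.5), "stripes": NormalDist(8.0, 1.5)},
}


def likelihood(species, length, stripes):
    """Density of the observation under one species' model.

    Multiplying the two densities treats length and stripes as independent.
    """
    m = models[species]
    return m["length"].pdf(length) * m["stripes"].pdf(stripes)


def classify(length, stripes):
    """Return a score per species; the scores sum to 1.

    Assumes the point is not so far from both models that both densities
    underflow to zero.
    """
    scores = {s: likelihood(s, length, stripes) for s in models}
    total = sum(scores.values())
    return {s: score / total for s, score in scores.items()}


print(classify(14.0, 7.0))  # falls much closer to the B model
print(classify(12.2, 5.8))  # in between the two models: no clear winner
```

the second fish is the "points in between your two models" case: the scores say which species is more plausible, not which one it is.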
janet: how do you say it, it is species b, or this is probably species b?
- you may look at the priors
-
sylvia:
- confidence level, if it is in the red area, you are rather confident
-
janet:
- gender map
- someone with this height, long hair short hair, is your computer going to decide who is a male or female
- or if you can get tenure
-
courtenay:
- as a good scientist, you don't say it is species b
- but if you are google, you may tell advertisers that it is a man or woman
-
carlin:
- it is not a big deal to them if they get it wrong
- it doesn't matter to them if they advertised to some of the wrong people
-
-
janet:
it does not matter to them that gender may be fluid
courtenay:
one take away: ml is using features that you can directly observe as a proxy to predict something you can't directly observe
there is no guarantee that you'll be right
there may be a lot of overlap
a fish might be an outlier for its species
abnormally large
points in between your two models and you don't have more data, you can't really say
takeaways:
prediction is only as good as your models
that your data does follow a particular distribution
need to observe a lot of fish of each species to build accurate models of them
machine learning is what happens when you feed your models 1000s of fish
courtenay:
- what is your confidence that your dataset is right
- people at mechanical turk, the labeling, there could be all sorts of data cleanliness problems
- almost for anything
- you want to deal with the noise, the outliers
- the bad people submitting the form twice
seda: claudia perlich was saying there is no wrong data
- there is wrong interpretation of data
-
courtenay: you can have adversarial data generation, for example, you can have wrong data
ri: but that would be hard to separate
- sometimes you watch tv on your girlfriend's account
-
are there any advantages or is it worth thinking about the value of "unclean" data?
jojo: you can change it
are you typing a transcript you wonderful person?
martha:
your prediction is as good as your models
you only need your prediction to be as good as your need?
most times people want to do a critique, they ask if it is accurate
but maybe that is not the issue
courtenay:
yes, maybe you only need some percentage of success
sylvia:
if you get the wrong ad, no big deal
if you misdiagnose cancer, you need a more accurate model
- then you need to watch out
-
carlin:
- the scale is different in those two examples
- any time it is a medical example
- trying to take this wealth of statistics and to apply to a single body or case
- you should do this cause you are likely to have this risk
- to go back to that need is different, it also depends on the scale
-
courtenay:
google is trying to do predictions for each user
or netflix
but you may not care
ri: amazon thinks that i am a recently divorced 50 year old who does yoga, not much damage?
courtenay:
high paying jobs shown only to men
helen, at the symposium
seda -
rachel law - vortex
uniqueness is a probabilistic feature in that moment...it is a combination of features
Wearable tech guy who says you can trade biometric profiles with people (to be someone else in that regard) is named Chris Dancy twitter: @ServiceSphere
lilly: chelsea clinton: internet access is key to gender equality
- where do we think development data comes from
- people hired by universities and world bank
- who gather data through interviews
- a shift in data collection
- 800000 data points
- tech industry can solve any problem with data
- techno, big data, and feminism
- investing in the middle class is the best way to bring about democracy
- history having the problems of big data over generalizing
- the headline suggesting a correlation
- we could unpack how these correlations have many levels of spuriousness and assumptions
-
janet: that they are correlated is not causal
lilly: but the headline
sylvia:
- i stopped reading wired cause it is so obviously written for men
-
janet: has it gotten worse?
- when you first read it did you think it was not that way?
-
ri: it became more like gq of gadgets,
sylvia: ads for cars, watches and alcohol,
janet: the spurious correlations - http://www.tylervigen.com/spurious-correlations
- the number of movies with nicolas cage and drownings in pools
-
courtenay:
- models get more accurate -> predictions get more accurate
- this is true for our regression example, too:
- the more cities we observe the better our prediction
-
there are lots of different classifier models, this is just one type
this is a gaussian naive bayes classifier
- you assume features have gaussian distribution
- assumes each feature is unrelated to the other (not correlated with each other)
-
takeaways:
incomplete list of things people use for classification tasks
- decision trees
- nearest neighbor: a really naive classifier. with the fish, we thought it is pretty close to a. you compute its similarity to every example you have seen and go with the closest one, and you throw the statistical model out. sometimes it works really well
- bayes: bayesian kind of classification methods
- these are kind of classical probabilistic methods with a lot of complications on top of them
- bayesian, you are looking at priors
- the base rate, specifically you are incorporating
- if you observe that 25 percent of the fish are a and the rest b
- you incorporate that into your final result
-
-
- logistic regression
- support vector machines
- neural networks
-
-
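a small worked sketch of what "incorporating the base rate" does, using the 25 percent figure from the notes. the likelihood numbers are made up and `posterior` is a hypothetical helper name:

```python
def posterior(likelihoods, priors):
    """Bayes' rule: the posterior is proportional to likelihood times prior."""
    unnormalised = {c: likelihoods[c] * priors[c] for c in likelihoods}
    total = sum(unnormalised.values())
    return {c: v / total for c, v in unnormalised.items()}


# Base rates from the notes: 25 percent of the fish are a, the rest are b.
priors = {"A": 0.25, "B": 0.75}

# Hypothetical likelihoods for one fish: its measurements fit A twice as well as B.
likelihoods = {"A": 0.02, "B": 0.01}

result = posterior(likelihoods, priors)
print(result)
```

with these numbers the result comes out at about 0.4 for a and 0.6 for b: the measurements alone point to a, but because b is three times as common, b is still the better guess.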
janet:
we always hear about bayesian stuff, is it that it includes probabilistic stuff
variables with probabilistic stuff
courtenay:
- real models usually use more than 2 features, it's hard to visualize, how they work and how they fail
- we can maybe look at 2d or 3d.
- it is really hard to understand at an intuitive level why things are working out
- you try to figure out how well your predictions are doing
- this is your training set here
- here is the test set
- you need that to be labeled, too
- so, when you have a model, you try to predict the things there without knowing the labels
- then you look to see if you predicted well
- which means you need more data
- you can have a model and throw it out into the real world
- but you want to sort of believe that it is going to do what you think it is going to do
-
-
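a sketch of the training set / test set idea, using a nearest neighbor classifier as the model. the fish are randomly generated stand-ins for labeled data, and `make_fish` and `nearest_neighbor` are hypothetical names:

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable


def make_fish(species, n):
    """Hypothetical labeled fish: (length, stripes, species)."""
    centre = {"A": (10.0, 4.0), "B": (15.0, 8.0)}[species]
    return [
        (random.gauss(centre[0], 1.0), random.gauss(centre[1], 1.0), species)
        for _ in range(n)
    ]


fish = make_fish("A", 50) + make_fish("B", 50)
random.shuffle(fish)

# Hold out 30 percent of the labeled examples as the test set.
split = int(len(fish) * 0.7)
train, test = fish[:split], fish[split:]


def nearest_neighbor(length, stripes):
    """Predict the species of the closest training example."""
    closest = min(train, key=lambda f: (f[0] - length) ** 2 + (f[1] - stripes) ** 2)
    return closest[2]


# Predict without looking at the test labels, then compare with the labels.
correct = sum(
    nearest_neighbor(length, stripes) == species for length, stripes, species in test
)
accuracy = correct / len(test)
print(f"accuracy on {len(test)} held-out fish: {accuracy:.2f}")
```

the test fish are never used to build the model, which is why they need their own labels and why this costs extra data.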
last example:
you can in the real world do stuff if your data is not fully labeled
it is harder
it is more uncertain
you may have tons of data and no labels
can we really not learn anything from it
ri: you mean like confirmed labels
courtenay: like the lab test
carlin:
- for something to appear as data
- won't some decisions have to be made
- we have been using length
- you know something has been measured, you need to know that it is inches
- what do you need to know at minimum
-
sylvia:
labels
carlin: maybe i am talking about units
courtenay: yes, i am talking about ground truth labels
- as long as you have examples
- the fish does not need a name
- stripes and species, i don't need their label
- you can throw them into a plot and look at them
- that is data
- everything is data
-
-
courtenay:
- you need to have a reasonable belief or faith that the measurements of the coffee grounds are related to something i am predicting in the real world
- you might be wrong
- maybe you are measuring something that has no correlation
-
- history of science: what people thought caused diseases, that seemed reasonable at the time, but it wasn't that
- advertisement, there is no guarantee that if you are a male in a specific city
- that the ad will work
- it gets very subjective very fast in the real world
-
-
can we learn something if we don't have ground truth, say about the species of fish that you have
maybe we took measurements of all the fish
but we didn't even know they were from 2 different species populations
not just a matter of manually labeling the data, we don't even know what the labels should be
so in this case you get into the broad heading of unsupervised learning
if you know the species, you know they are male or female
you now have a bunch of numbers and data
and you are interested in the kinds of patterns in the data
janet: i love the terminology
- like workers that are unsupervised
-
sylvia:
- like when you have a child
- there is actually a correct answer
- whatever is learning, you are giving that answer
-
carlin: it is a matter of whether there is prior classification that supports that
courtenay:
labels -> supervised learning
without -> unsupervised
standard techniques that you use
here are the lengths and stripes
we have clusters, each point is a single fish
we know that they are two different species and they look like that
this lovely toy example, in this particular two dimensional space that is perfectly visualizable
you can't do it with your customers
you don't know the structure of that data and you don't have a way to guess
the most basic thing you can do is cluster analysis
the toy example i will show you, a common algorithm called k-means clustering
you start by guessing that there are clusters in your data
you usually also guess how many clusters
then you guess the centers
and guess cluster membership
it turns out that this will mathematically get you some nice clusters
the algorithm
you can pick two points, they are wrong, they are both in the same cluster
you pick them at random
then you do most obvious thing you can do, you measure distance to all the other points
you draw a line
you draw an orthogonal (perpendicular) line
assume that this is a reasonable way to measure things
you do that, and you recompute the centers of the clusters
if all these red things are a cluster, where would the center be
you moved your cluster centers now
you reiterate
so, now you moved the points here
the blue points have overtaken
and you do that again, and reiterate until the centers do not change anymore
at the end you get here
and you find your two species again
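the steps above can be sketched in plain python. this is a minimal version of k-means under stated assumptions: two features per fish, straight-line distance, and randomly generated stand-in data. `kmeans` is a hypothetical function name, not weka's implementation:

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Plain k-means on 2-D points. Returns (centers, assignments)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random starting guesses, possibly bad ones
    assignments = None
    for _ in range(iterations):
        # Step 1: assign every point to its nearest center.
        new_assignments = [
            min(
                range(k),
                key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2,
            )
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point switched cluster, so the centers have stopped moving
        assignments = new_assignments
        # Step 2: move each center to the mean of its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centers, assignments


# Hypothetical unlabeled fish: (length, stripes) drawn around two made-up centers.
rng = random.Random(1)
fish = [(rng.gauss(10, 1), rng.gauss(4, 1)) for _ in range(40)] + [
    (rng.gauss(15, 1), rng.gauss(8, 1)) for _ in range(40)
]

centers, assignments = kmeans(fish, k=2)
print(sorted(centers))
```

the species labels are never given to `kmeans`; it only sees the measurements and the guess that there are two clusters.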
caveats:
you still have to guess the number of clusters
two kinds of fish in this pond
you guess at several different numbers of clusters and you do an evaluation
you look at how tight the clusters are
more clusters make your model more complex
there is a bunch of hand waving stuff
this is a thing you can do
this is like density estimation
figure out if there are denser places in your feature space
this is an example of an algorithm
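one way to sketch "guess at several different numbers of clusters and do an evaluation" is to run k-means for each guess and measure how tight the clusters are. the data is randomly generated, and `kmeans`, `dist2` and `tightness` are hypothetical names; total squared distance to the nearest center is one common tightness measure, assumed here:

```python
import random


def dist2(p, q):
    """Squared straight-line distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def kmeans(points, k, rng, iterations=100):
    """Compact k-means that returns only the final centers."""
    centers = rng.sample(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda c: dist2(p, centers[c]))].append(p)
        new_centers = [
            (sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g))
            if g
            else centers[i]  # keep the old center if a cluster ends up empty
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def tightness(points, centers):
    """Total squared distance from each point to its nearest center (lower is tighter)."""
    return sum(min(dist2(p, c) for c in centers) for p in points)


# Hypothetical unlabeled fish drawn around two made-up centers.
rng = random.Random(2)
fish = [(rng.gauss(10, 1), rng.gauss(4, 1)) for _ in range(40)] + [
    (rng.gauss(15, 1), rng.gauss(8, 1)) for _ in range(40)
]

scores = {}
for k in range(1, 6):
    # A few random restarts per k, keeping the tightest result.
    scores[k] = min(tightness(fish, kmeans(fish, k, rng)) for _ in range(5))
    print(k, round(scores[k], 1))
```

more clusters will usually keep making the clusters tighter, so tightness alone does not pick the number for you. the usual hand-waving is to look for the point where adding a cluster stops helping much, which for this made-up data is at two.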
kavita: the cluster analysis will not tell you how many clusters there are?
courtenay: there are clever ways to guess
we can look at the data
and say there is two
courtenay: you can use histograms
visually you can look at it
and know
janet: i have been using cluster analysis in social network analysis
bibliometric analysis
you can get the computer to detect what it thinks is a clustering
if it is two clusters, do i have the relative distances through things
sylvia:
you can get a cluster and look to see if there are clusters in that
courtenay:
yes there are hierarchical cluster things you can do
in social networks there are a whole different set of things you might do
- these two fish know each other
- you have this whole extra set of data
- that goes with your data
- that goes with the attributes of each user
-
carlin:
trickiest things in teaching
kids will do an analysis
and come with gender binary
to figure out what is getting read, where gender gets assigned by twitter
who is participating
what kind of assignments are happening
what makes them truthful
what feature you use
if it makes more sense and to look at
lilly:
developmental biologist
anne fausto-sterling
osteoporosis and how it is correlated with women
and how come
she critiques that and builds a process model
and how osteoporosis looks correlated with women
if you get sports then less likely to get it
race: the kind of work you do
berns: bone strengthener is marketed to the petite white women
carlin: osteoporosis gets discussed without that kind of specificity
lilly: it takes a lot of labor to construct this other model
how can we use this data to tell other kinds of process stories of gender and race without reifying them
carlin:
- in the hospital
- you talk to people differently
- not based on gender
- but more specified risk
-
berns: hypertension is race based
- that is discussed
- cigarette smoking and hypertension
- with regards to
- i am trying to remember how it was taught
- it is: i don't even think about it, it is just how it is
-
- boneeba medicine??
- there is a typical image for certain medications
- they will be advertised to certain people
- sometimes because their insurance is more likely to pay for that
- it is not about systemic issues
- why is this woman having these issues
- an african american woman is going to be more likely to be on this medication
- it is presented as this is the problem of her race
- and not that society was shit to her
- inherently, this is what she will be, instead of what she went through
-
lilly: correlation becomes the local cause and they just deal with it that way
berns: then there are people who don't want to take the medicine
- they are seen as non-adherent
- that is supposed to be more compassionate
- although some will call them non-compliant
-
janet: that is an interesting label,
THIS WAS SOMETIME BEFORE----
janet:
you are parenting your computer
sylvia: i think robots are adorable
------------
courtenay:
they may be clusters of density
a little more of grey area
finding interesting clusters we may want to do something with
this kind of analysis without labels may allow you to make reasonable guesses
FOR TOMORROW:
Bayesian statistics explanations:
http://www.kevinboone.net/bayes.html
Sylvia would like to explain regression (30 minutes?)
Neural networks are going to take over the world??
Seda mentions that there is AI that trains video game figures to act in specific ways
Seda says “what other politics are possible if there were other ways of querying data?”
seda: what kind of queries can we make with machine learning to get at where discrimination starts, the problem is that when you categorize you can then name and call out discrimination but once you create the new category then that has its own discriminatory potential
anne fausto-sterling (sp??) http://www.annefaustosterling.com/
domain knowledge
- you would use domain knowledge to get the parameters for a data set (
discussion during hands-on weka session:
carlin: i like what you say about domain knowledge
kavita: our purpose is that we have a new piece of glass, is this helping out what kind of glass it is
courtenay: now it is not helping, because i took out the glass
if everyone understands what these histograms generally show
lilly: can you read this file for us
courtenay: breast cancer data
in this dataset
there are 9 features here
age, menopause, tumor size
the thing we are trying to predict is whether cancer is likely to recur or not
we are looking at 286 examples
and 201 did not have recurrence
and 85 did
and that is the class you are trying to predict
we would want to predict it by looking at some combination of the 9 features or some subset thereof
janet: we have a woman 54, pre-menopausal, right breast
courtenay: i can do the walking for you, too
lilly: there is no x axis
courtenay: this is what we were talking about before
numerical vs. nominal
these are much more nominal
seda: you need to look at the arff file to find out what the values stand for
courtenay:
the way you read this is that 68 cases got radiation therapy and about half of them had a recurrence and half didn't
and the others didn't get radiation and did not have a recurrence
the information you can glean here is how different the percentage of the classes are
in this case it doesn't make sense
recurrences were a far less frequent event
bernadette: it is a small amount that reoccurs
courtenay: it is not unlikely
you have a vested interest in predicting who is going to recur
kavita: does this mean that you are more likely to have recurrence if you get radiations
berns: the first part is people who did it
and it is 50 50
and the ones who didn't there was a better chance
courtenay: but you need to know whether those getting radiation were the ones seen as more serious cases
sylvia:
there are ways to present it to show that there is a clear relationship
but there are ways, which doesn't show what the relationship is
there are no clear relationships
if we used some sort of algorithm, we could predict it
but through visualization, especially because they have different population sizes
it doesn't feel like a good example
or it is a weakness of the program
it is a little lame of them
courtenay:
i agree with you
sylvia:
the whole point of visualization is to see things
berns: the safe thing we are agreeing on
you have to be careful with correlation and causation
i have been to a number of pharmaceutical presentations
they will take 2 people living 3-4 months longer
and they will make claims
janet: i see why you go into this
people who got radiation
it evened out
it looks like not getting radiation meant you did not have recurrence
that is why you collect a whole bunch of data
because you want to show why the finding is part of other factors
sylvia: this is also a way to manipulate data to get what you want
seda: they claim more data is always better
- it is better to have
- the noise
- if you have a lot of features
- instead of fitting a straight line
- you would be fitting this thing that goes through every point
- you don't want to predict this thing in between
- there is a way you can measure
- i decide on a certain feature
- i am looking at glass and i want to know if it is transparent
- you see that it is evenly distributed across all glass then you know it is not a relevant feature
- so there is certain features
- that it is an indicator of this label
the overfitting problem
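a sketch of the overfitting problem as described above: a straight line versus a curve that goes through every point. the data is hand-picked (a line plus an alternating wobble standing in for noise) so the effect is easy to see; `straight_line`, `wiggly_curve` and `mean_squared_error` are hypothetical names:

```python
# Hand-picked toy data: the line y = 2x + 1 plus an alternating +1/-1 wobble.
train = [(float(x), 2.0 * x + 1.0 + (1.0 if x % 2 == 0 else -1.0)) for x in range(8)]
# Points in between the training points, taken from the underlying line itself.
between = [(x + 0.5, 2.0 * (x + 0.5) + 1.0) for x in range(7)]


def straight_line(points):
    """Least squares straight line; returns a prediction function."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in points) / sum(
        (x - mean_x) ** 2 for x, _ in points
    )
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def wiggly_curve(points):
    """Lagrange polynomial: a curve that passes exactly through every training point."""

    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(points):
            term = yi
            for j, (xj, _) in enumerate(points):
                if i != j:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total

    return predict


def mean_squared_error(model, points):
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)


line = straight_line(train)
wiggly = wiggly_curve(train)

for name, model in [("straight line", line), ("wiggly curve", wiggly)]:
    print(
        f"{name}: error on training points={mean_squared_error(model, train):.2f}, "
        f"error in between={mean_squared_error(model, between):.2f}"
    )
```

the wiggly curve has no error on the points it was fitted to and a much larger error in between them, while the straight line is a bit off everywhere but stays close to the underlying trend.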
DAY 2:
http://www.thenewyorkworld.com/
https://nycopendata.socrata.com/
Where to find data -- what the important attributes are
What there is data for
What there isn't data for
Lilly couldn't find any data on contractors
Unpacking "Mechanical Turk"
CUP Lab -- data siphons
Data politics in NYC
Martha: Certain datasets won't be more -- how much learning can your machine do?
Courtenay: Exploratory actions on data or finding
Piketty Dataset is Open!
Martha: difference between prediction and learning?
Courtenay: Maybe? Someone may or may not believe you have proven something with your predictions. There are no unknowns that you can point to.
"Machine learning" on a pedestal as separate from data mining or statistics is dangerous.
How many nation states in Piketty?
Martha: European ones?
Seda: 40 based on his definition?
Courtenay: You can still make predictions for a new country.
Cross validation -- hold out on one data point and then see how well you predict the missing country to test your model.
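A minimal sketch of that hold-one-out idea, assuming one (x, y) pair per country and a straight-line model. The country names and numbers are placeholders invented for illustration, not values from the Piketty dataset, and `fit_line` and `leave_one_out` are hypothetical names:

```python
# Placeholder data: one (x, y) pair per country, invented for illustration.
countries = {
    "country_1": (1.0, 2.0),
    "country_2": (2.0, 4.1),
    "country_3": (3.0, 5.9),
    "country_4": (4.0, 8.0),
    "country_5": (5.0, 10.1),
    "country_6": (6.0, 20.0),  # deliberately unlike the others
}


def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in points) / sum(
        (x - mean_x) ** 2 for x, _ in points
    )
    return slope, mean_y - slope * mean_x


def leave_one_out(data):
    """Hide each country in turn, fit on the rest, and record the prediction error."""
    errors = {}
    for held_out, (x, y) in data.items():
        rest = [xy for name, xy in data.items() if name != held_out]
        slope, intercept = fit_line(rest)
        errors[held_out] = abs(slope * x + intercept - y)
    return errors


errors = leave_one_out(countries)
for name, error in errors.items():
    print(f"{name}: off by {error:.2f}")
```

Each country's value is known all along; it is only obscured while the model is fitted, which is the "artificially obscuring the value to test" kind of prediction.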
Seda: There's always a prediction, isn't the question how reliable the prediction is? What is prediction?
Courtenay: You don't know a value so you attempt to
Seda: Act of using a function to come up with a value you don't know.
Courtenay: You may be artificially obscuring the value to test. That's still prediction.
Kavita: Can you do predictions on datasets from the past in which it's not possible to go back and collect?
C: You can still do what you want -- it's a philosophical scientific thing. Going forward you're not going to be able to make predictions. But it can tell you if you have a good model of phenomena.
Lilly: It reminds me of talking to mathematicians and scientists -- you don't have a theory unless you make predictions about the future. Ethnographers work differently: if you don't know how the data was created, you don't have a theory. New ways to explore models. Potential parameters are infinite.
C: the end game doesn't have to be classification. the field of machine learning is driven by prediction. but the techniques are statistical techniques. There are other ways of seeing if things are correlated.
Bernadette: last night I thought about farming data. labor has been low on the farm until the summer youth -- now it's spic and span. Number of workers with hours put in to crop outputs.
Jojo: i thought when you said farming data, you were talking about the labor of preparing data for use later on, the workers come in and clean it up and it is ready for harvesting.
Joanne: Wikileaks data is CSV.
Seda: text analysis will be interesting.
SLIDES/Courtenay presentation:
Courtenay: touch on what correlations are
using spurious correlations site: everyone knows correlation doesn't imply causation, but doesn't necessarily mean correlation.
Martha: but is it predictive?
C: no reason to believe that they would.
Seda: Google food: all sorts of debates. World Bank discussions. Google food trends worked because of years and years of data collected by scientists. How good is prediction without another kind of ground truth.
C: you can go out and look and a couple are going to look really great and all the rest won't work. You just pick the ones that look.
L: Isn't the point that you don't need common sense?
C: maybe they're both correlated to other things. Maybe there are other variables.
You convince yourself that they are correlated
Maybe they are correlated to other things, but you have convinced yourself that this is the correlation.
The correlation will be spurious because they
Seda: Constant is right now doing a workshop: how do we create common sense with machine learning.
B: we make lists when we hit problems.
C: human brains are good at making spurious correlations.
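A small sketch of why "picking the ones that look good" produces spurious correlations, assuming purely random series with no relationship at all. The series count, series length and seed are arbitrary choices for illustration.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable
n_years = 10
target = [rng.gauss(0, 1) for _ in range(n_years)]

# 1000 unrelated random series, each compared against the same target.
candidates = [[rng.gauss(0, 1) for _ in range(n_years)] for _ in range(1000)]
strengths = sorted(abs(pearson(target, series)) for series in candidates)

typical = strengths[len(strengths) // 2]  # median strength
best = strengths[-1]                      # the one you would be tempted to report
print(f"typical |r|: {typical:.2f}, best of 1000: {best:.2f}")
```

Nothing here is related to anything else, yet the best candidate looks strongly correlated simply because so many were tried.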
B: Cognitive Therapists
courtenay showing correlations of different types and strengths
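The slides are not reproduced in these notes. As a stand-in, a sketch that computes the Pearson correlation coefficient for a few invented series of different types and strengths:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

x = list(range(10))
noise = [3, -2, 4, -3, 2, -4, 3, -2, 1, -3]  # made-up wobble

perfect_positive = [2 * v + 1 for v in x]        # straight line going up
perfect_negative = [-v for v in x]               # straight line going down
noisy_positive = [v + e for v, e in zip(x, noise)]  # upward trend plus wobble
u_shape = [(v - 4.5) ** 2 for v in x]            # clear pattern, but not a linear one

r_pos = pearson(x, perfect_positive)
r_neg = pearson(x, perfect_negative)
r_noisy = pearson(x, noisy_positive)
r_u = pearson(x, u_shape)
print(r_pos, r_neg, r_noisy, r_u)
```

The U-shaped series is fully determined by `x`, but its linear correlation is zero: this coefficient only measures straight-line relationships.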
if your classifier doesn't work, you might just not have enough information.
c: sometimes you just have data and maybe you won't be able to predict what you want to pick
Seda: is there any data on data that confuses classifiers?
if it is uncorrelated with the class, it shouldn't throw off your classifier, your classifier will ignore it.
learn to weight things as zero.
discussion yesterday:
if the length and number of stripes of fish are correlated, a model that assumes they are independent might not work very well
because you count the same information twice because it's repeated in two places and the model doesn't take this into account
no double counts!! bad!
solution: could switch to a model that doesn't assume independent features.
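A sketch of the double-counting problem, assuming a naive model that treats each feature as an independent Gaussian and adds up the evidence. The class means and spreads are invented, and "stripes" is set to exactly half of "length" so that it carries no new information.

```python
import math

def gaussian_log_pdf(x, mean, sd):
    """Log density of a normal distribution."""
    return -0.5 * math.log(2 * math.pi * sd ** 2) - (x - mean) ** 2 / (2 * sd ** 2)

def naive_log_odds(fish, class_a, class_b):
    """Log-odds of class A vs class B, summing per-feature evidence
    as if the features were independent (equal priors assumed)."""
    return sum(
        gaussian_log_pdf(value, *class_a[name]) - gaussian_log_pdf(value, *class_b[name])
        for name, value in fish.items()
    )

def probability_a(log_odds):
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical (mean, standard deviation) per feature for two kinds of fish.
class_a = {"length": (10.0, 2.0), "stripes": (5.0, 1.0)}
class_b = {"length": (14.0, 2.0), "stripes": (7.0, 1.0)}

length_only = {"length": 11.0}
length_and_stripes = {"length": 11.0, "stripes": 5.5}  # stripes = length / 2

odds_one = naive_log_odds(length_only, class_a, class_b)
odds_two = naive_log_odds(length_and_stripes, class_a, class_b)
print(probability_a(odds_one), probability_a(odds_two))
```

The second fish carries no extra information, but the naive model counts the same evidence twice and becomes more confident than it should be.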
the other philosophical broad point was the fight between statisticians and machine learning
i was vaguely aware of the fight, i was aware that there was some tension, maybe
yesterday one of you asked: is this any different from statistics?
here is a joke i found:
a table of differences between the two, mostly terminological, but a large grant in ml is $1,000,000 whereas in statistics a large grant is $50,000.
weight vs. parameters etc.
lots of overlap and lots of cultural differences
the practices have evolved into different standards
andrew gelman says, maybe we should remove models and assumptions because then we can solve problems that the machine learning people can solve.
C: There are people who believe more or less in one or the other dogma
one commentator on stackexchange says
ml experts do not spend enough time on fundamentals, and many of them do not understand optimal decision making and proper accuracy scoring rules.
statisticians spend too little time learning good programming practice and new computational languages.
m: can you explain the second statement about statisticians
c: humans aren't super into change. a discipline evolved in a specific way. before computers were around. in a culture in which people don't jump to the most immediate new software. fewer people in statistics departments know how to
S: ML come from CS, statisticians come from mathematics
L: chalkboards, slow proofs (math) vs prototypes! (CS) fast moving
S: mathematician: if you don't understand what your algorithm is doing, it's wrong. One big issue: giant data sets.
Efficiency is about quantifying results.
when social scientists look at this debate, they say it is right or wrong. it is hard to make it stick, but it is working.
the test by which something is successful in the world is not whether it is right or wrong, but whether it "works"
ml person says, it is working, and the statistician says it is wrong
jojo: it depends on what you mean by what matters?
lilly: machine learning and statistics are competing for legitimacy on what is the right way to work with this data
it could be that the debates about what is right and wrong, by participating in those debates, the ml people may be legitimizing their discipline
martha: for some of these guys what is at stake is not publishing a paper, but having a successful company
If they say all that matters is that they have a correlation
different social worlds -- what's at stake
C: techniques developed in academia, adopted elsewhere.
courtenay: a lot of the techniques get developed in the academic setting, but in many cases, outside of academia, if it works, it works
columbia was very mathy and proof oriented
it is this academic thing
in practice it is a very computer science and engineering mind set: i built it, it works
lilly: some friends would consult cia and stuff
for intelligence vs. ad prediction there may be different standards?
courtenay: i don't know how theoretically, what the standards are behind that wall [of intelligence]
martha: you're trained pre-data science?
courtenay: i finished at the end of 2012. i was in machine learning courses in 2007-2008. hot stuff which was not neural networks, and a lot of that has been taken over. and they were interested in proofs.
taught hot methods at the time (not neural networks) by people concerned with theory and proof.
kavita: real timeness of data, ml people have access to data?
that the data is just constantly coming in and being optimized
statistician dealing with more static data?
courtenay: it is less about real time than dealing with larger datasets
which data scientists have been dealing with for a long time
statisticians may not be as comfortable
lilly: could twitter search be one of these computational processes?
twitter search has a real time problem, topics are cultural context that are not indexable terms
so they hire mechanical turks to find something very quickly
timing matters, limitations. TurkWorkers to bootstrap.
C: detecting density of topics
martha: i just thought of something, the credit scoring had 12 items, that is how many items someone working on a paper could add up.
she would be doing the computation live, and that is about computational efficiency
the debate today is with machine learners saying, these people are archaic
it was computational efficiency, because i did not think of it as computational efficiency
because it is transformed with infrastructures.
courtenay:
there is not necessarily that 12 variables is a bad system
it is more about the number of data points rather than the number of variables (features)
martha; given all the data that could be credit data, it looks archaic
C: there is such a thing as too many variables
courtenay: there are too complex models as well
that sounds like plain bullshit to me
berns: you cannot have enough considerations
in this case
courtenay: i agree that you may need more than 12 factors, but for some things it may be enough.
martha: we are having the debate because computational infrastructure, when we phrase the debate, the only reason we are considering more than 12 because these guys have amplified their capacities in the last 50 years
lilly:
a lab, machine learning, we are storing so much more data, we need to gain more financial value from this data
we need to get more value, because we have more data
not that we want more data because we can get more value
courtenay: it is cheap enough that you can store everything
martha:
the debate that you are pointing out between statisticians who are cheap
ml we can maximize,
the debate is created by the economics of the environment
courtenay:
statisticians use computers
they may not be up to par with the latest in computational infrastructure
the data is probably generated by a tech company who is interested in doing this thing on its data
as far as academic departments go, they could be trying to solve the same problems
lilly: are you saying the difference is that cs people need grants to get machines, entrepreneurial grant getting
martha: the million dollars have to be for something
courtenay: they are also being snarky about it being a fad. it is cutting edge and popular and statistics has a marketing problem
ML gets the money because
berns: there is a race issue there, too, as to who teaches you statistics and computer science. my statistics teachers were people of color
S: within computer science there are layers of people who are more proofy. clean definitions.
C: then there are the ones who hack.
S: upper echelons -- it's class. middle class belt: less lofty. more likely to do applied stuff. don't mind being engaged in $. Privacy is upper, surveillance is middle. ML gets new folks: physicists and biochemists. Need the techniques. They go into hedgefunds. Big data systems. How do physicists deal with complex social issues.
Lilly: Physics is such a male dominated field.
S: except Iran.
Hard crowd to read for me-- tend to be polymaths in my experience.
C: Engineering mindset: but practically it worked!
don't know what he means by worked, but proof is in the pudding.
M: usually managerial
C: lots of complaints to be had about this attitude. Broad take away: two outlooks:
classical stats hypothesis testing
ML: getting predictions to work even in the face of lack of interpretability of models
Lilly: as an ethnographer I now feel aligned with Classical Stats
C:If ML is more "successful" it comes from the large-scale resources; wring the last bits of success out of those things rather than doing something more profound.
M: could there be a synthesis?
C: most things are black boxes, but there's a real interest in doing this; run models backward; google deep dream. No one likes black boxes. People like to know how they work, also because they want to improve them.
feature normalization:
what if we have this data:
much less variability in # stripes than length
much difference in scales
it is a problem if you are trying to calculate distance between things
change each feature to have
mean = 0
standard deviation = 1
obvious thing people in intro cs classes don't do
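A minimal sketch of the mean = 0, standard deviation = 1 rescaling, assuming four invented fish. It also shows why the scale difference matters for distances: the nearest neighbour of the first fish changes once both features are on the same scale.

```python
import math
import statistics

def standardize(values):
    """Rescale a feature to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

def distance(a, b):
    """Euclidean distance between two feature tuples."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

# Made-up fish: length varies by hundreds, stripe count only by one.
lengths = [100, 200, 300, 400]
stripes = [2, 3, 2, 3]

raw = list(zip(lengths, stripes))
scaled = list(zip(standardize(lengths), standardize(stripes)))

# Raw distances are dominated by length; stripes barely register.
print(distance(raw[0], raw[1]), distance(raw[0], raw[2]))
# After rescaling, a one-stripe difference counts as much as a big length difference.
print(distance(scaled[0], scaled[1]), distance(scaled[0], scaled[2]))
```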
DATA SHARING (EMAIL PHOTOS!)
courtenay: something that looks extremely small
you are looking at particles, how toxic is something
the numbers may look very small to you, but you want to stretch it out to be able to evaluate its significance
martha: do we know that the difference is equal
it will then become testable as to whether it is meaningful
courtenay: it depends on your classification model, some models will need normalized data
you are not changing the information in that variable,
you made it easier for algorithms to work with it
and maybe for you to view as a human
there could be a diagonal relationship that you can see better in a context, because of the resolution
martha: if the variable is useless, after the transformation, it is still useless
courtenay: there are other normalizations that you can do
that is statistical normalization of data
martha: is there a relationship between normalization and the inability to reverse engineer
courtenay: no, you usually have the raw data, and you know the mean so you can go back
it depends on
the final model probably takes it, the final classification model
somewhere inside of it has the value of the mean that it needs to subtract off
that value is a parameter in the model, you know what that is
so that you can make a transformation on the raw data coming in, that also means that you can back it out
you are not obfuscating anything
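A sketch of the point about backing the transformation out, assuming the model stores the mean and standard deviation it learned as parameters. The function names and the toxicity values are hypothetical.

```python
import statistics

def fit_scaler(values):
    """Learn the two parameters the model needs to keep: mean and standard deviation."""
    return statistics.mean(values), statistics.pstdev(values)

def transform(value, mean, sd):
    """Raw value -> normalized value."""
    return (value - mean) / sd

def inverse_transform(z, mean, sd):
    """Normalized value -> raw value, using the same stored parameters."""
    return z * sd + mean

# Invented measurements that look very small on their raw scale.
toxicity = [0.0012, 0.0015, 0.0011, 0.0019]
mean, sd = fit_scaler(toxicity)

normalized = [transform(v, mean, sd) for v in toxicity]
recovered = [inverse_transform(z, mean, sd) for z in normalized]
print(normalized)
print(recovered)
```

Because the mean and standard deviation are known, the original values come back up to floating-point rounding.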
feature selection and dimensionality reduction
we might not even need all the features we have to do well on prediction
we might need something that we don't have
sometimes the important thing is to figure out which ones to throw away
- - features with no correlation to the class
- - features that are redundant with each other
- if their correlation is one to one, you can throw it away
-
2 features are redundant if they are highly correlated with each other
- you're really getting the same information from both and overcomplicating the model
- could manually do it
-
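A sketch of doing the two kinds of throwing away by hand, assuming Pearson correlation as the yardstick. The feature values, the two thresholds and the name `select_features` are invented for illustration.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def select_features(features, labels, min_class_corr=0.2, max_pair_corr=0.95):
    """Drop features unrelated to the class, then drop near-duplicates.
    Both thresholds are arbitrary assumptions, not standard values."""
    kept = []
    for name, values in features.items():
        if abs(pearson(values, labels)) < min_class_corr:
            continue  # no correlation with the class
        if any(abs(pearson(values, features[other])) > max_pair_corr for other in kept):
            continue  # redundant with a feature we already kept
        kept.append(name)
    return kept

labels = [0, 0, 0, 0, 1, 1, 1, 1]
length_cm = [10, 11, 9, 12, 20, 21, 19, 22]
features = {
    "length_cm": length_cm,
    "length_in": [v / 2.54 for v in length_cm],  # same information, different unit
    "stripes": [3, 5, 2, 6, 5, 7, 4, 8],
    "noise": [1, 2, 1, 2, 1, 2, 1, 2],           # unrelated to the class
}

selected = select_features(features, labels)
print(selected)
```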
dimensionality reduction
- a way to compress data so that you can extract a smaller subset of uncorrelated features
- set of mathematical transformation to do this
-
you end up with a whole new set of features
each is a function of the features you put in
you have x, y, and z, you end up with a, b, and c
a, b, and c are functions of combinations of x,y, and z
such that they are all orthogonal to each other
the output variables don't have correlations with each other
it is a form of mathematical projection
you are changing your axes
i can't give you an intuition
martha: you perform something on each data point and transform it into something else
courtenay: it is an automatic way of compressing the correlated relationship into uncorrelated variables
martha: you take variables that are correlated,
courtenay: now the features are uncorrelated.
ri: you need to know what is correlated
courtenay: you do a mathematical transformation that does it
lilly: whether two things are correlated is a statistical relationship, right, so the stats does the job for you?
courtenay: it is factor analysis
there is a bunch of ways to do that
principal component analysis
martha: compressed?
courtenay: you are assuming there is a lossless representation
and you do something that are now uncorrelated so that it is easier to feed into a model
lilly: we thought these were correlated
courtenay: after the transformation, you take the top x values and throw away the less important, less informative values in the bottom
that is purposeful transformation that way
ri: i thought we were trying to find out what does correlate, how come we can now all of a sudden identify what is correlated
martha: is compression like making juice out of vegetables
these are dimensionality reduced points, because it is too slow to eat carrots?
courtenay: basically, you are going to take the top few dimensions to a classifier, because you know these things are uncorrelated, you don't have bad feature correlations fucking up your models
if you have two variables that are really correlated, all the information that was contained in those two is compressed to one feature
you are not going to have the statistical problem of overweighing these features
lilly: combine marriage and margarine into one feature
courtenay: there is something to be said about the distance on the y axis, it does not mean anything
the slight difference between the shapes
lillly: a band of difference is acceptable?
courtenay: yes
lilly: but the band can matter
courtenay: the mathematical transformation will not take semantics into account
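A sketch of the change of axes discussed above, for the two-feature case only, where principal component analysis reduces to rotating the axes by one angle. The fish measurements are invented. With more than two features you would normally use a linear algebra library rather than this closed form.

```python
import math

def pca_2d(points):
    """Rotate centered 2-D data so the two new features are uncorrelated.
    The first new feature points along the direction of greatest variance."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]
    var_x = sum(x * x for x, _ in centered) / n
    var_y = sum(y * y for _, y in centered) / n
    cov_xy = sum(x * y for x, y in centered) / n
    angle = 0.5 * math.atan2(2 * cov_xy, var_x - var_y)  # angle of the first principal axis
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x + s * y, -s * x + c * y) for x, y in centered]

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def covariance(pairs):
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    return sum((a - mean_a) * (b - mean_b) for a, b in pairs) / n

# Made-up (length, stripes) pairs that are strongly correlated.
fish = [(2.0, 1.1), (3.0, 1.4), (4.0, 2.1), (5.0, 2.4), (6.0, 3.1), (7.0, 3.4)]
rotated = pca_2d(fish)

var_a = variance([a for a, _ in rotated])
var_b = variance([b for _, b in rotated])
explained = var_a / (var_a + var_b)

compressed = [a for a, _ in rotated]  # keep the top dimension, throw the other away
print(f"covariance after rotation: {covariance(rotated):.6f}")
print(f"share of variance kept by the first new feature: {explained:.3f}")
```

As noted in the discussion, the rotation only looks at the numbers; it knows nothing about what length or stripes mean.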
balanced datasets:
sometimes you notice your classifier is doing suspiciously well - 95 percent accuracy
then you notice that your data looks like this: 95 percent class a, 5 percent class b (a classifier that always says a already gets 95 percent)
maybe you can go collect more examples of class B and make your model better
could use re-sampling methods to feed your classifier more balanced (if slightly synthetic) data
it is good to be aware of the relative balance of classes in your data and think about how it might be affecting predictions
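A sketch of both points, assuming the 95/5 split from the example. Random oversampling (duplicating minority examples) is used here as one simple re-sampling method; it is an assumption, not necessarily the method meant in the session.

```python
import random
from collections import Counter

# 95 examples of class a, 5 of class b; the feature value is a placeholder.
examples = [(i, "a") for i in range(95)] + [(i, "b") for i in range(5)]
labels = [label for _, label in examples]

# Baseline: always predict the most common class.
majority = Counter(labels).most_common(1)[0][0]
baseline_accuracy = sum(1 for label in labels if label == majority) / len(labels)
print(f"always predicting {majority!r} scores {baseline_accuracy:.2f}")

def oversample(examples, rng):
    """Duplicate randomly chosen minority examples until all classes are the same size."""
    by_class = {}
    for example in examples:
        by_class.setdefault(example[1], []).append(example)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced

balanced = oversample(examples, random.Random(0))
balanced_counts = Counter(label for _, label in balanced)
print(balanced_counts)
```

A classifier should be compared against the majority baseline, not against zero, before its accuracy is called good.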
cross-validation:
standard machine learning practice
you need twice as much data, training and test set
instead of 2 fixed datasets
split data into train/test multiple times for multiple experiments and average the results
more samples -> results more likely to be statistically valid
weka: 10 fold validation, that means that it built 10 different classifiers
martha: is that the kind of validation that you do if you have lots of data
courtenay: if you have tons of data, you can do a single split and that is ok
this is going to help you more if you don't have a lot of data
it is generally a good thing to do, it is sampling more
berns: what do ml people call statistical significance?
courtenay: this means that you did ten fold cross-validation
the part of your paper, where you prove that there is statistical validity in results is a little bit more lax
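A sketch of 10-fold cross-validation as described above: ten splits, ten classifiers, one averaged result. The data is randomly generated and the classifier is a toy threshold rule; neither is what weka does internally.

```python
import random

def k_fold_indices(n, k, rng):
    """Shuffle the indices 0..n-1 and deal them into k folds."""
    indices = list(range(n))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]

def train_threshold(examples):
    """Toy classifier: the midpoint between the two class means."""
    class0 = [x for x, label in examples if label == 0]
    class1 = [x for x, label in examples if label == 1]
    return (sum(class0) / len(class0) + sum(class1) / len(class1)) / 2

def cross_validate(examples, k, rng):
    """Build k classifiers, each tested on the fold it did not see."""
    accuracies = []
    for fold in k_fold_indices(len(examples), k, rng):
        held_out = set(fold)
        train = [e for i, e in enumerate(examples) if i not in held_out]
        test = [examples[i] for i in fold]
        threshold = train_threshold(train)
        correct = sum(1 for x, label in test if (x > threshold) == (label == 1))
        accuracies.append(correct / len(test))
    return accuracies

rng = random.Random(0)
# Synthetic data: class 1 tends to have larger values than class 0.
examples = [(rng.gauss(0, 1), 0) for _ in range(50)] + [(rng.gauss(3, 1), 1) for _ in range(50)]

accuracies = cross_validate(examples, 10, rng)
mean_accuracy = sum(accuracies) / len(accuracies)
print(f"10 classifiers built, mean accuracy {mean_accuracy:.2f}")
```

Every example is used for testing exactly once and for training nine times, which is why this helps most when data is scarce.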
overfitting:
sylvia was describing the picture yesterday
model complexity >> training examples
- often happens when # features >> number train examples
- your model is too complex
- you didn't have enough training examples to justify the complexity of your model
- a lot of the models use the numbers of features, weigh each feature
- your model is the weighted version of those 12 features
- but maybe you only saw 6 examples
- you have few data points and you are trying to fit a model with way more parameters than data points, so you are going to overfit
-
- sort of: lets your model be more creatively wrong
-
- occam's razor: simpler models are usually better
-
- classic visual demonstration
- simple linear model vs. more complicated model: you get every point, which would you expect to be more correct for new examples?
-
- the canonical way of seeing if you overfit
- you look at your prediction performance on the training set
- if you are doing really well on your training data and shitty on the test data, you know you have done something wrong!! :)
-
-
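A sketch of the classic demonstration in numbers rather than pictures, assuming invented data that roughly follows a straight line. The "complicated" model is a polynomial forced through every training point (Lagrange interpolation), chosen here only because it hits every point by construction.

```python
def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def interpolate(points, x):
    """Evaluate the polynomial that passes exactly through every training point."""
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        weight = 1.0
        for j, (xj, _) in enumerate(points):
            if j != i:
                weight *= (x - xj) / (xi - xj)
        total += yi * weight
    return total

def mean_abs_error(predict, points):
    return sum(abs(predict(x) - y) for x, y in points) / len(points)

# Made-up data: roughly y = 2x with some wobble.
train = [(0.0, 0.5), (1.0, 1.4), (2.0, 4.7), (3.0, 5.5), (4.0, 8.6), (5.0, 9.3)]
test = [(0.5, 1.2), (2.5, 4.8), (4.5, 9.1)]

slope, intercept = fit_line(train)

def simple_model(x):
    return slope * x + intercept

def complex_model(x):
    return interpolate(train, x)

simple_train = mean_abs_error(simple_model, train)
simple_test = mean_abs_error(simple_model, test)
complex_train = mean_abs_error(complex_model, train)
complex_test = mean_abs_error(complex_model, test)
print(f"simple:  train {simple_train:.2f}, test {simple_test:.2f}")
print(f"complex: train {complex_train:.2f}, test {complex_test:.2f}")
```

The complicated model is perfect on the training points and clearly worse on the new ones, which is the signature of overfitting described above.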
feedback and future discussion:
berns: people like one on one, i like it when we are in a group. i am not even good with breaking out into groups.
courtenay: this was a good size group for hands on stuff today, the group was a little bigger yesterday, which made it harder for hands on, but then better for discussion
kavita: the pace was good
at no point was i dragging
ri: the structure was well thought
joanne: great
i didn't think yesterday was overwhelming with the larger class size
i thought you were going over a lot of vocabulary
structurally to add: i wasn't sure what i was going to learn
i see machine learning all the time
one thing that could help: if there were 5 questions that were answered
courtenay: in the context that you see ml all the time, did it cater to your expectations
joanne: i was worried that it would be too technical
i am glad that i came
there might have been a way to point out what we will learn
courtenay: there was an initial description that was more technical
it is good that where we went
what language would be friendly to the people that we want to attend
berns: my friend was worried that it would be way over her head
ri: what worked well, if you know about machine learning, you should come
that added to the people explaining, that was a good dynamic management
it is difficult when you are teaching a technical subject to put it on a level to keep everyone interested
kavita: i would love to have a discussion on social and cultural significance of machine learning taking over certain functions
ri: i would say the opposite, let's get our datasets
courtenay: this was great, i was terrified
it was kind of a tough, it was a long road from do you want to do an ml workshop
to what would that look like
and to ask that question again and again
i am so sorry we didn't get to the societal implications or get to the data
this has been really fun
i got all kinds of perspectives and questions that i hadn't thought about
and taught myself things that i didn't know or had forgotten
ri: if you want to choose different spaces, this was a wonderful space
transinclusive
joanne: i don't want to call things all women
what if someone transitions
invalidating
berns: i have had people not feel included
joanne: you just can't say, no cis guys
eyebeam, i could talk with them if you need space
it is a nice space
new inc might be open, too
they might be good
eyebeam would be open
berns: because it is new york, we have our hands in such cool things
you wouldn't want to spam
if we could email a central person
i am putting out this event on thursday, i heard about this event and it might be of interest
a monthly news
joanne: if you had ela come and do some basic security: if she would be up for that, that would be amazing
especially since she has been doing threat modeling
she would be happy to test out
most of her talks, she is often talking about things