Courtenay leads discussion on ML

machine learning is really just observing patterns, 
in lots of data

what is it used for:
    speech recognition
    face recognition
    language translation
    predicting consumer behavior
    predicting financial markets
    analyzing social networks
    making business decisions

workflow of machine learning:

there are typically two types of problems:
classification asks questions like:
regression:
janet: is it like you want to predict what netflix thinks of you?

betsy:
    if i could at least come at the end
    wanted to say thank you so much
    i am having fun geeking out with all these women
    i have been telling people
    i thought it would be counter to the culture to do the fb thing
    it was safe to ask any question and yet it was not dumb
    you can take step back
    it was a wonderful moment to be able to ask those questions
    it was great to realize that i knew more than i thought i did
    as a sociologist, and i know statistics
    but i rejected it
    but realizing that i still have that knowledge and can use to analyze my new project
    so i got great interview questions yesterday
    when i have access to tech designers, these questions will help me to get to what i want faster
    i will sound more like i know what they are doing

    i looked at the datasets
    do i want to put that on my computer

ri: how did you select weka
i looked at matlab, i am not there yet
what is a good tool?

courtenay: i did all of my phd in matlab, i hate it, it is proprietary
weka is a shitty visualization tool
it is the only thing that i know of that will allow you to do machine learning
run real algorithms out of box
and load your dataset
more programming is scary for people who don't know how to program
weka was started in 1993
i think it was a reasonable choice for what i was trying to do here

if you don't want classifier, but just want to visualize, there are surely nicer guis

ri: that would be great

betsy: is weka like a wordpress for data

seda: there is a data science course by mako hill, the material is online, i will add it to the email.

joanne:
    suggestion for a topic
    a workshop on the blockchain
    with some of the implications


seda:  adversarial machine learning algorithms - emails that are sent in order to figure out how machine learning works 

asking a question about "data cleaning"? do you use regression to fill in holes in data set
incomplete data
when you are missing labels
instead of getting a human to do it, you try to bootstrap your algorithm
you make your first guess at your algorithm
you may be compounding your errors
that is something that happens when you have incomplete data


regression example:

    you have a bunch of cities and some information about income (i didn't say anything about whether this is mean or median)
    you have a new city, and you want to know housing prices there
    you try to find out based on what you know about other cities
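
a minimal sketch of the city example above, assuming the simplest possible model: a straight line fitted by least squares. the income and price numbers are made up for illustration, and fit_line is a hypothetical helper, not something shown in the session.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# made-up data: income and housing price (both in thousands) for known cities
incomes = [30, 40, 50, 60, 70]
prices = [150, 200, 250, 300, 350]

slope, intercept = fit_line(incomes, prices)

# a new city where we know the income but not the housing price
new_city_income = 55
predicted_price = slope * new_city_income + intercept
print(predicted_price)
```

the prediction is only as trustworthy as the guess that price really follows a line in income.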


janet: do you choose the model or does the machine choose the model?
courtenay: you choose the model, you still have to choose
there is a lot of domain knowledge
you can look at this example and say, yes it looks linear
or i am going to try a more complicated model
how do i know which of these models are better?
you need to have additional data
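
one way to read "you need additional data": fit each candidate model on one set of examples, then compare their errors on examples the models never saw. a rough sketch with made-up numbers. the two candidates (a straight line, and a more flexible "copy the nearest known point" model) are assumptions picked for illustration.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def nearest_neighbour(train_x, train_y, x):
    """A more flexible model: copy the answer of the closest training point."""
    closest = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[closest]

def mean_squared_error(predictions, truths):
    return mean((p - t) ** 2 for p, t in zip(predictions, truths))

# made-up training data (roughly linear, with some noise)
train_x = [1, 2, 3, 4, 5, 6]
train_y = [1.2, 1.8, 3.3, 3.7, 5.2, 5.8]

# additional data that was held back and not used for fitting
held_out_x = [1.5, 3.5, 5.5]
held_out_y = [1.5, 3.5, 5.5]

slope, intercept = fit_line(train_x, train_y)
linear_error = mean_squared_error(
    [slope * x + intercept for x in held_out_x], held_out_y)
nn_error = mean_squared_error(
    [nearest_neighbour(train_x, train_y, x) for x in held_out_x], held_out_y)

print("linear:", linear_error, "nearest neighbour:", nn_error)
```

with these particular numbers the simpler line does better on the held-out points; with other data the comparison could go the other way.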

lilly: how similar is this to what economist and sociologists have been doing?
courtenay: i will talk about this tomorrow
kavita: what is a model, is it like an algorithm?

courtenay:
sylvia:
it is the same information but easier to read


is there a library of these kinds of models and you go to them and choose them?

seda: is this what you mean with a model, courtenay?
courtenay:
carlin: i think about it as an equation
courtenay:
elizabeth:
courtenay:
karissa: it is a bunch of steps that the computer executes

lilly: is it like a way of solving rubik's cubes

carlin: it is a set of instructions
elizabeth: sounds like a recipe and the result


courtenay:
couldn't you look up the price of housing?
what about predicting how much you would be willing to pay for a product?

classification example:
you go out and observe these fish, the length and the number of stripes that you see
you get dna samples to test for species
you figure out which species the fish belonged to


features: attributes you observe about each example
class labels: ground truth, you know that is the true answer, gold standard
training examples

    lilly:
        you are not sure, you don't know how you want to classify them
        i thought you were going to say, the classifier would help you discern the clusters
        in that case you don't have a ground truth, you want to discern the classifications

    courtenay: they look the same but they really are two different things
    you want to identify which fish is which
    you went to a lab
    and you have their dna

    seda: but isn't that a probabilistic model, too?!

    courtenay: this is a toy example with a ground truth

martha:
    what if it is a behavioral outcome and it depends on how you treat it
    the outcome depends on how i treated you
    you are not a fish

courtenay:
    that would be about data contamination?

martha:
    ri is a a
    courtenay is a b

    i give courtenay a great credit card 
    but the result is the outcome


courtenay:
    in the real world the gold standard is more complicated

martha:
carlin:
martha:
courtenay:
    that is how it started in my machine learning course

berns: that is a great question

lilly: women can be fish, too. that was my example

the objects are not politicized yet or are depoliticized in the moment [[ie. the industrial rubber ball as the perfect simple object to build liveliness/character/spirit from]]

first example in machine learning book is how to choose most perfect embryo
there's a lot of desire that is going on in there


lilly:
    you have data point
courtenay:
    the guessing part is that you think it is going to approximately fit this shape
    you hope your sample size is big enough
    so that your model is valid

karissa:
    it can be a problem if people assume it is a curve like that and they find out later that everything they did is wrong

courtenay:
    not a lot of things follow this curve

janet: you are also selecting a model and see if it works

ri: how do you get to your model, what is the process?

courtenay: you look at the data
karissa:
courtenay: it can be approximated as one
now if we see a new fish, a data point, and it goes somewhere in this plane
on the bottom
now you see which of these models it falls closer to
this fish is closer to this b model
better
and you can make a more educated guess that it is fish b

we observe those two feature attributes
we didn't have to send it off to the lab, we can guess now


janet: how do you say it, it is species b, or this is probably species b?
sylvia:
janet:
courtenay:
carlin:
janet:
    it does not matter to them that gender may be fluid

courtenay:
    one take away: ml is using features that you can directly observe as a proxy to predict something you can't directly observe

there is no guarantee that you'll be right
there may be a lot of overlap
a fish might be an outlier for its species
abnormally large
points in between your two models and you don't have more data, you can't really say


takeaways:
    prediction only as good as your models
    that your data does follow a particular distribution
    need to observe a lot of fish of each species to build accurate models of them
    machine learning is what happens when you feed your models 1000s of fish


courtenay:

seda: claudia perlich was saying there is no wrong data
courtenay: you can have adversarial data generation, for example, you can have wrong data

ri: but that would be hard to separate

are there any advantages or is it worth thinking about the value of "unclean" data?

jojo: you can change it


are you typing a transcript you wonderful person?

martha:
    your prediction is as good as your models
    you only need your prediction to be as good as your need?
    most times people want to do a critique, they ask if it is accurate
    but maybe that is not the issue

courtenay:
    yes, maybe you only need some percentage of success

sylvia:
    if you get the wrong ad, no big deal
    if you misdiagnose cancer, you need a more accurate model
carlin:
courtenay:
    google is trying to do predictions for each user
    or netflix
    but you may not care

ri: amazon thinks that i am a recently divorced 50 year old who does yoga, not much damage?

courtenay:
    high paying jobs shown only to men

helen, at the symposium

seda - 
rachel law - vortex
uniqueness is a probabilistic feature in that moment...it is a combination of features

Wearable tech guy who says you can trade biometric profiles with people (to be someone else in that regard) is named Chris Dancy  twitter: @ServiceSphere

lilly: chelsea clinton: internet access is key to gender equality
janet: that they are correlated is not causal
lilly: but the headline

sylvia:
janet: has it gotten worse?
ri: it became more like gq of gadgets, 

sylvia: ads for cars, watches and alcohol, 

janet: the spurious correlations - http://www.tylervigen.com/spurious-correlations

courtenay:
there are lots of different classifier models, this is just one type
this is a gaussian naive bayes classifier
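
a rough sketch of what a gaussian naive bayes classifier does with the fish example, assuming two features (length, number of stripes) and made-up measurements. for each species it models each feature as a bell curve, treats the features as independent, and reports a probability rather than a flat answer. the function names are hypothetical.

```python
import math
from statistics import mean, pstdev

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std sigma at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def train(examples):
    """examples: list of ((length, stripes), species). Returns a per-species model."""
    by_species = {}
    for features, species in examples:
        by_species.setdefault(species, []).append(features)
    model = {}
    for species, rows in by_species.items():
        columns = list(zip(*rows))  # one tuple of values per feature
        model[species] = {
            "prior": len(rows) / len(examples),
            "params": [(mean(col), pstdev(col)) for col in columns],
        }
    return model

def posterior(model, features):
    """Probability of each species given the observed features."""
    scores = {}
    for species, m in model.items():
        score = m["prior"]
        # "naive": multiply per-feature densities as if features were independent
        for x, (mu, sigma) in zip(features, m["params"]):
            score *= gaussian_pdf(x, mu, sigma)
        scores[species] = score
    total = sum(scores.values())
    return {species: score / total for species, score in scores.items()}

# made-up training fish: (length, stripes) plus the lab-confirmed species
training = [
    ((10.0, 3), "a"), ((11.0, 4), "a"), ((9.0, 2), "a"), ((10.5, 3), "a"),
    ((15.0, 7), "b"), ((16.0, 8), "b"), ((14.0, 6), "b"), ((15.5, 7), "b"),
]
model = train(training)

# a new fish that was never sent to the lab
probs = posterior(model, (14.5, 6))
guess = max(probs, key=probs.get)
print(guess, probs)
```

the output is "probably species b" with a number attached, not a certainty.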

takeaways:
    non complete list of things people use to make classification tasks


janet:
    we always hear about bayesian stuff, is it that it includes probabilistic stuff
    variables with probabilistic stuff

courtenay:
last example:
    you can in the real world do stuff if your data is not fully labeled
    it is harder
    it is more uncertain
    you may have tons of data and no labels
    can we really not learn anything from it

ri: you mean like confirmed labels
courtenay: like the lab test

carlin:
sylvia:
    labels 

carlin: maybe i am talking about units

courtenay: yes, i am talking about ground truth labels
courtenay:
can we learn something if we don't have ground truth, say about the species of fish that you have
maybe we took measurements of all the fish
but we didn't even know they were from 2 different species populations
not just a matter of manually labeling the data, we don't even know what the labels should be
so in this case you get into the broad heading of unsupervised learning
if you know the species, you know they are male or female
you now have a bunch of numbers and data
and you are interested in the kinds of patterns in the data

janet: i love the terminology
sylvia:
carlin: it is a matter of whether there is prior classification that supports that

courtenay:
    labels -> supervised learning
    without -> unsupervised
    standard techniques that you use

    here are the lengths and stripes
    we have clusters, each point is a single fish
    we know that they are two different species and they look like that
    this lovely toy example, in this particular two dimensional space that is perfectly visualizable
    you can't do it with your customers
    you don't know the structure of that data and you don't have a way to guess

    the most basic thing you can do is cluster analysis
    the toy example i will show you, a common algorithm called k-means clustering
    you start by guessing that there are clusters in your data
    you usually also guess how many clusters
    then you guess the centers
    and guess cluster membership
    it turns out that this will mathematically get you some nice clusters

    the algorithm 
    you picked two points, they are wrong, they are both in the same cluster
    you pick them at random
    then you do most obvious thing you can do, you measure distance to all the other points
    you draw a line
    you draw an orthogonal and perpendicular line
    assume that this is a reasonable way to measure things
    you do that, and you recompute the centers of the clusters
    if all these red things are a cluster, where would the center be
    you moved your cluster centers now
    you reiterate
    so, now you moved the points here
    the blue points have overtaken
    and once you do that, and reiterate until the center does not change anymore
    at the end you get here
    and you find your two species again
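
the steps above can be sketched in a few lines, assuming two features per fish (length, stripes), made-up measurements, and straight-line distance. the starting centers are deliberately bad: both sit inside the same species, as in the walkthrough.

```python
import math

def kmeans(points, centers, max_iterations=100):
    """Plain k-means on 2-D points, starting from the given centers."""
    for _ in range(max_iterations):
        # assignment step: each point joins its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # update step: move each center to the mean of its cluster
        new_centers = []
        for center, cluster in zip(centers, clusters):
            if cluster:
                new_centers.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                new_centers.append(center)  # an empty cluster keeps its old center
        # stop once the centers no longer move
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# made-up fish: (length, number of stripes); labels are NOT given to the algorithm
species_1 = [(10, 3), (11, 4), (9, 2), (10, 4)]
species_2 = [(15, 7), (16, 8), (14, 6), (15, 8)]
points = species_1 + species_2

# two starting guesses that both lie inside the first group
centers, clusters = kmeans(points, centers=[(9, 2), (10, 4)])
print(centers)
print(clusters)
```

on this toy data the centers drift apart over a few rounds and the two groups come back out. with messier data, a bad starting guess can leave you with worse clusters.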

    caveats:
        you still have to guess the number of clusters
        two kinds of fish in this pond
        you guess at several different numbers of clusters and you do an evaluation
        you look at how tight the clusters are
        more clusters make your model more complex
        there is a bunch of hand waving stuff
        this is a thing you can do
        this is like density estimation
        figure out if there are denser places in your feature space
        this is an example of an algorithm



    kavita: the cluster analysis will not tell you how many clusters there are?

    courtenay: there are clever ways to guess
    we can look at the data 
    and say there is two

    courtenay: you can use histograms
    visually you can look at it
    and know

    janet: i have been using cluster analysis in social network analysis
    bibliometric analysis
    you can get the computer to detect what it thinks is a clustering
    if it is two clusters, do i have the relative distances through things

sylvia:
    you can get a cluster and look to see if there are clusters in that

courtenay:
    yes there are hierarchical cluster things you can do
    in social networks there are a whole different set of things you might do
carlin:
    trickiest things in teaching
    kids will do an analysis
    and come with gender binary
    to figure out what is getting read, where gender gets assigned by twitter
    who is participating
    what kind of assignments are happening
    what makes them truthful

    what feature you use
    if it makes more sense and to look at 

lilly:
    developmental biologist
    anne fausto-sterling
    osteoporosis and how it is correlated with women
    and how come
    she critiques that and builds a process model
    and how osteoporosis looks correlated with women
    if you get sports then less likely to get it
    race: the kind of work you do

berns:    bone strengthener is marketed to the petite white women

carlin: osteoporosis gets discussed without that kind of specificity

lilly: it takes a lot of labor to construct this other model
how can we use this data to tell other kinds of process stories of gender and race without reifying them
carlin:
berns: hypertension is race based
lilly: correlation becomes the local cause and they just deal with it that way

berns: then there are people who don't want to take the medicine
janet: that is an interesting label, 


THIS WAS SOMETIME BEFORE----
janet:
    you are parenting your computer

sylvia: i think robots are adorable
------------


courtenay:
    they may be clusters of density
    a little more of grey area
    finding interesting clusters we may want to do something with
    this kind of analysis without labels may allow you to make reasonable guesses



FOR TOMORROW:

Bayesian statistics explanations:
http://www.kevinboone.net/bayes.html

Sylvia would like to explain regression (30 minutes?)

Neural networks are going to take over the world??

Seda mentions that there is AI that trains video game figures to act in specific ways 
Seda says “what other politics are possible if there were other ways of  querying data?” 

seda: what kind of queries can we make with machine learning to get at where discrimination starts, the problem is that when you categorize you can then name and call out discrimination but once you create the new category then that has its own discriminatory potential

anne fausto-sterling (sp??) http://www.annefaustosterling.com/


domain knowledge
- you would use domain knowledge to get the parameters for a data set (

discussion during hands-on weka session:

    carlin: i like what you say about domain knowledge

    kavita: our purpose is that we have a new piece of glass, is this helping out what kind of glass it is

    courtenay: now it is not helping, because i took out the glass
    if everyone understands what these histograms generally show


    lilly: can you read this file for us

   courtenay: breast cancer data
   in this dataset
   there are 9 features here
   age, menopause, tumor size
   the thing we are trying to predict is whether cancer is likely to recur or not
   we are looking at 286 examples
   and 201 did not have recurrence
   and 85 did
   and that is the class you are trying to predict
   we would want to predict it by looking at some combination of the 9 features or some subset thereof
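
a small piece of arithmetic on the class counts above. since one class is much more common, a "classifier" that ignores all 9 features and always answers "no recurrence" already gets most examples right. the baseline framing is an added assumption, the counts are the ones quoted above.

```python
no_recurrence = 201
recurrence = 85
total = no_recurrence + recurrence  # 286 examples

# accuracy of always guessing the more common class
baseline_accuracy = max(no_recurrence, recurrence) / total
print(f"{baseline_accuracy:.1%}")
```

that is roughly 70%, so any model built from the features needs to beat that number before it is telling you anything.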


janet:   we have a woman 54, pre-menopausal, right breast


courtenay: i can do the walking for you, too

lilly: there is no x axis

courtenay: this is what we were talking about before
numerical vs. nominal
these are much more nominal


seda: you need to look at the arff file to find out what the values stand for

courtenay:
    the way you read this is that 68 cases got radiation therapy and about half of them had a recurrence and half didn't
    and the others didn't get radiation and did not have a recurrence
    the information you can glean here is how different the percentage of the classes are
    in this case it doesn't make sense
    recurrences were a far less frequent event


bernadette: it is a small amount that reoccurs

courtenay: it is not unlikely
you have a vested interest in predicting who is going to recur


kavita: does this mean that you are more likely to have recurrence if you get radiations

berns: the first part is people who did it
and it is 50 50
and the ones who didn't there was a better chance

courtenay: but you need to know whether those getting radiation were the ones seen as more serious cases


sylvia:
    there are ways to present it to show that there is a clear relationship
    but there are ways, which doesn't show what the relationship is
    there are no clear relationships
    if we used some sort of algorithm, we could predict it
    but through visualization, especially because they have different population sizes
    it doesn't feel like a good example
    or it is a weakness of the program
    it is a little lame of them

courtenay: 
    i agree with you

sylvia:
    the whole point of visualization is to see things

berns: the safe thing we are agreeing on
you have to be careful with correlation and causation
i have been to a number of pharmaceutical presentations
they will take 2 people living 3-4 months longer
and they will make claims

janet: i see why you go into this
people who got radiation
it evened out
it looks like not getting radiation meant you did not have recurrence
that is why you collect a whole bunch of data
because you want to show why the finding is part of other factors

sylvia: this is also a way to manipulate data to get what you want

seda: they claim more data is always better
the overfitting problem



DAY 2:

http://www.thenewyorkworld.com/
https://nycopendata.socrata.com/

Where to find data -- what the important attributes are

What there is data for
What there isn't data for

Lilly couldn't find any data on contractors

Unpacking "Mechanical Turk"

CUP Lab -- data siphons

Data politics in NYC

Martha: Certain datasets won't be more -- how much learning can your machine do? 
Courtenay: Exploratory actions on data or finding 
Piketty Dataset is Open!

Martha: difference between prediction and learning?
Courtenay: Maybe? Someone may or may not believe you have proven something with your predictions. There are no unknowns that you can point to.
"Machine learning" on a pedestal as separate from data mining or statistics is dangerous.
How many nation states in Piketty?
Martha: European ones?
Seda: 40 based on his definition?
Courtenay: You can still make predictions for a new country. 
Cross validation -- hold out on one data point and then see how well you predict the missing country to test your model.
Seda: There's always a prediction, isn't the question how reliable the prediction is? What is prediction?
Courtenay: You don't know a value so you attempt to
Seda: Act of using a function to come up with a value you don't know.
Courtenay: You may be artificially obscuring the value to test. That's still prediction.
Kavita: Can you do predictions on datasets from the past in which it's not possible to go back and collect?
C: You can still do what you want -- it's a philosophical scientific thing. Going forward you're not going to be able to make predictions. But it can tell you if you have a good model of phenomena.
Lilly: It reminds me of talking to mathematicians and scientists -- you don't have a theory unless you make predictions about the future. Ethnographers work differently: if you don't know how the data was created, you don't have a theory. New ways to explore models. Potential parameters are infinite.
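
The hold-one-out test Courtenay describes above can be sketched as follows, assuming a straight-line model and made-up numbers (these are not values from the Piketty dataset). Each "country" is hidden in turn, the model is fitted on the rest, and the hidden value is predicted and compared with the real one.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def leave_one_out_error(xs, ys):
    """Mean squared error when each point is predicted from all the others."""
    squared_errors = []
    for i in range(len(xs)):
        # artificially obscure one data point
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        prediction = slope * xs[i] + intercept
        squared_errors.append((ys[i] - prediction) ** 2)
    return mean(squared_errors)

# made-up data: one x value and one y value per country
xs = [1, 2, 3, 4, 5]
ys_exact = [2, 4, 6, 8, 10]            # lies exactly on a line
ys_noisy = [2.1, 3.9, 6.2, 7.8, 10.1]  # same trend with some noise

exact_error = leave_one_out_error(xs, ys_exact)
noisy_error = leave_one_out_error(xs, ys_noisy)
print(exact_error, noisy_error)
```

A small error suggests the model describes the data you have; it says nothing about data you could never have collected.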

C: the end game doesn't have to be classification. the field of machine learning is driven by prediction. but the techniques are statistical techniques. There are other ways of seeing if things are correlated.

Bernadette: last night I thought about farming data. labor has been low on the farm until the summer youth -- now it's spic and span. Number of workers with hours put in to crop outputs.

Jojo: i thought when you said farming data, you were talking about the labor of preparing data for use later on, the workers come in and clean it up and it is ready for harvesting.

Joanne: Wikileaks data is CSV.

Seda: text analysis will be interesting. 

SLIDES/Courtenay presentation:

Courtenay: touch on what correlations are 
using spurious correlations site: everyone knows correlation doesn't imply causation, but doesn't necessarily mean correlation.
Martha: but is it predictive?
C: no reason to believe that they would.
Seda: Google food: all sorts of debates. World Bank discussions. Google food trends worked because of years and years of data collected by scientists. How good is prediction without another kind of ground truth.
C: you can go out and look and a couple are going to look really great and all the rest won't work. You just pick the ones that look.
L: Isn't the point that you don't need common sense?
C: maybe they're both correlated to other things. Maybe there are other variables. 
You convince yourself that they are correlated
Maybe they are correlated to other things, but you have convinced yourself that this is the correlation.
The correlation will be spurious because they
Seda: Constant is right now doing a workshop: how do we create common sense with machine learning.
B: we make lists when we hit problems.
C: human brains are good at making spurious correlations.
B: Cognitive Therapists 

courtenay showing correlations of different types and strengths

if your classifier doesn't work, you might just not have enough information.

c: sometimes you just have data and maybe you won't be able to predict what you want to pick
Seda: is there any data on data that confuses classifiers?

if it is uncorrelated with the class, it shouldn't throw off your classifier, your classifier will ignore it.
learn to weight things as zero.


discussion yesterday:
    if the length and number of stripes of fish are correlated, a model that assumes they are independent might not work very well
    because you count the same information twice because it's repeated in two places and the model doesn't take this into account

no double counts!! bad!


solution: could switch to a model that doesn't assume independent features.
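
a sketch of the double-counting problem in its most extreme form, assuming a second feature that is an exact copy of the first. the bell-curve parameters are made up. a model that multiplies the two as if they were independent becomes more confident without having learned anything new.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std sigma at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def posterior_b(likelihoods_a, likelihoods_b):
    """Probability of species b with equal priors, multiplying the
    per-feature likelihoods as if the features were independent."""
    score_a = math.prod(likelihoods_a)
    score_b = math.prod(likelihoods_b)
    return score_b / (score_a + score_b)

# hypothetical length models: species a around 10, species b around 14
length = 12.5
like_a = gaussian_pdf(length, 10, 2)
like_b = gaussian_pdf(length, 14, 2)

# the evidence used once
once = posterior_b([like_a], [like_b])
# the very same evidence counted twice, as a "second feature"
twice = posterior_b([like_a, like_a], [like_b, like_b])

print(once, twice)
```

real features are rarely exact copies, but the more strongly they are correlated, the closer you get to this situation.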

the other philosophical broad point was the fight between statisticians and machine learning
i was vaguely aware of the fight, i was aware that there was some tension, maybe

yesterday one of you asked: is this any different from statistics?

here is a joke i found:
a table of differences between the two, mostly terminological, but a large grant in ml will get 1,000,000 whereas in statistics a large grant is 50,000.
weight vs. parameters etc.

lots of overlap and lots of cultural differences
the practices have evolved into different standards


andrew gelman says, maybe we should remove models and assumptions because then we can solve problems that the machine learning people can solve.


C: There are people who believe more or less in one or the other dogma

one commentator on stackexchange says
ml experts do not spend enough time on fundamentals, and many of them do not understand optimal decision making and proper accuracy scoring rules.
statisticians spend too little time learning good programming practice and new computational languages.

m: can you explain the second statement about statisticians


c: humans aren't super into change. a discipline evolved in a specific way. before computers were around. in a culture in which people don't jump to the most immediate new software. fewer people in statistics departments know how to 
S: ML come from CS, statisticians come from mathematics
L: chalkboards, slow proofs (math) vs prototypes! (CS) fast moving
S: mathematician: if you don't understand what your algorithm is doing, it's wrong. One big issue: giant data sets.
Efficiency is about quantifying results.



when social scientists look at this debate, they say it is right or wrong. it is hard to make it stick, but it is working.
the test by which something is successful in the world is not whether it is right or wrong, but whether it "works"
ml person says, it is working, and the statistician says it is wrong

jojo: it depends on what you mean by what matters?

lilly: machine learning and statistics are competing for legitimacy on what is the right way to work with this data
it could be that the debates about what is right and wrong, by participating in those debates, the ml people may be legitimizing their discipline


martha: for some of these guys what is at stake is not publishing a paper, but having a successful company
If they say all that matters is that they have a correlation
different social worlds -- what's at stake 

C: techniques developed in academia, adopted elsewhere.

courtenay: a lot of the techniques get developed in the academic setting, but in many cases, outside of academia, if it works, it works
columbia was very mathy and proof oriented
it is this academic thing
in practice it is a very computer science and engineering mind set: i built it, it works

lilly: some friends would consult cia and stuff
for intelligence vs. ad prediction there may be different standards?

courtenay: i don't know how theoretically, what the standards are behind that wall [of intelligence]


martha: you're trained pre-data science? 

courtenay: i finished at the end of 2012. i was in machine learning courses in 2007-2008. hot stuff which was not neural networks, and a lot of that has been taken over. and they were interested in proofs.
taught hot methods at the time (not neural networks) by people concerned with theory and proof.

kavita: real timeness of data, ml people have access to data?
that the data is just constantly coming in and being optimized

statistician dealing with more static data?


courtenay: it is less about real time than dealing with larger datasets
which data scientists have been dealing with for a long time
statisticians may not be as comfortable


lilly: could twitter search be one of these computational processes?
twitter search has a real time problem, topics are cultural context that are not indexable terms
so they hire mechanical turks to find something very quickly
timing matters, limitations. TurkWorkers to bootstrap.

C: detecting density of topics

martha: i just thought of something, the credit scoring had 12 items, that is how many items someone working on a paper could add up.
she would be doing the computation live, and that is about computational efficiency
the debate today is with machine learners saying, these people are archaic
it was computational efficiency, because i did not think of it as computational efficiency
because it is transformed with infrastructures.

courtenay:
    it is not necessarily that 12 variables is a bad system
    it is more about the number of data points rather than the number of variables (features)

martha; given all the data that could be credit data, it looks archaic
C: there is such a thing as too many variables
courtenay: there are too complex models as well
that sounds like plain bullshit to me


berns: you cannot have enough considerations
in this case

courtenay: i agree that you may need more than 12 factors, but for some things it may be enough.

martha: we are having the debate because computational infrastructure, when we phrase the debate, the only reason we are considering more than 12 because these guys have amplified their capacities in the last 50 years

lilly:
    a lab, machine learning, we are storing so much more data, we need to gain more financial value from this data
    we need to get more value, because we have more data
    not that we want more data because we can get more value

courtenay: it is cheap enough that you can store everything

martha:
    the debate that you are pointing out between statisticians who are cheap
    ml we can maximize, 
    the debate is created by the economics of the environment


    courtenay:
        statisticians use computers
        they may not be up to par with the latest in computational infrastructure

the data is probably generated by a tech company who is interested in doing this thing on its data
as far as academic departments go, they could be trying to solve the same problems


lilly: are you saying the difference is that cs people need grants to get machines, entrepreneurial grant getting

martha: the million dollars have to be for something

courtenay: they are also being snarky about it being a fad. it is cutting edge and popular and statistics has a marketing problem

ML gets the money because 

berns: there is a race issue there, too, as to who teaches you statistics and computer science. my statistics teachers were people of color

S: within computer science there are layers of people who are more proofy. clean definitions. 
C: then there are the ones who hack.
S: upper echelons -- it's class. middle class belt: less lofty. more likely to do applied stuff. don't mind being engaged in $. Privacy is upper, surveillance is middle. ML gets new folks: physicists and biochemists. Need the techniques. They go into hedgefunds. Big data systems. How do physicists deal with complex social issues.
Lilly: Physics such a male dominated field.
S: except Iran.
Hard crowd to read for me-- tend to be polymaths in my experience.

C: Engineering mindset: but practically it worked!
don't know what he means by worked, but proof is in the pudding.

M: usually managerial

C: lots of complaints to be had about this attitude. Broad take away: two outlooks: 
    classical stats hypothesis testing 
    ML: getting predictions to work even in the face of lack of interpretability of models

Lilly: as an ethnographer I now feel aligned with Classical Stats

C:If ML is more "successful" it comes from the large-scale resources; wring the last bits of success out of those things rather than doing something more profound.

M: could there be a synthesis?

C: most things are black boxes, but there's a real interest in doing this; run models backward; google deep dream. No one likes black boxes. People like to know how they work, also because they want to improve them.


feature normalization:
    what if we have this data:
        much less variability in # stripes than length
        much difference in scales
        it is a problem if you are trying to calculate distance between things

change each feature to have
mean = 0
standard deviation = 1
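
a minimal sketch of that rescaling (often called standardizing or z-scoring), assuming made-up lengths in millimetres and stripe counts. standardize is a hypothetical helper name.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# made-up fish: lengths vary by tens of millimetres, stripes by one or two
lengths_mm = [100.0, 150.0, 200.0, 250.0]
stripes = [2.0, 3.0, 4.0, 5.0]

z_lengths = standardize(lengths_mm)
z_stripes = standardize(stripes)

print(z_lengths)
print(z_stripes)
```

before rescaling, any distance between two fish is dominated by length simply because its numbers are bigger. afterwards both features are on the same scale and count equally.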

obvious thing people in intro cs classes don't do

DATA SHARING (EMAIL PHOTOS!)


courtenay: something that looks extremely small
you are looking at particles, how toxic is something
the numbers may look very small to you, but you want to stretch it out to be able to evaluate its significance

martha: do we know that the difference is equal
it will then become testable as to whether it is meaningful

courtenay: it depends on your classification model, some models will need normalized data
you are not changing the information in that variable, 
you made it easier for algorithms to work with it
and maybe for you to view as a human
there could be a diagonal relationship that you can see better in a context, because of the resolution 

martha: if the variable is useless, after the transformation, it is still useless

courtenay: there are other normalizations that you can do
that is statistical normalization of data

martha: is there a relationship between normalization and the inability to reverse engineer

courtenay: no, you usually have the raw data, and you know the mean so you can go back
it depends on 
the final model probably takes it, the final classification model
somewhere inside of it has the value of the mean that it needs to subtract off
that value is a parameter in the model, you know what that is
so that you can make a transformation on the raw data coming in, that also means that you can back it out
you are not obfuscating anything
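
a small sketch of that point: if the mean and standard deviation are kept as parameters, the transformation can be backed out. the Standardizer class and the toxicity numbers are hypothetical, made up for illustration.

```python
import math
import statistics

class Standardizer:
    """Keeps the mean and standard deviation so the scaling can be undone."""

    def fit(self, values):
        self.mean = statistics.mean(values)
        self.std = statistics.pstdev(values)
        return self

    def transform(self, values):
        return [(v - self.mean) / self.std for v in values]

    def inverse_transform(self, scaled):
        return [z * self.std + self.mean for z in scaled]

# made-up measurements that look very small in raw form
toxicity = [0.0012, 0.0015, 0.0011, 0.0019, 0.0013]

scaler = Standardizer().fit(toxicity)
scaled = scaler.transform(toxicity)            # stretched out to mean 0, std 1
recovered = scaler.inverse_transform(scaled)   # back to the raw values

print(scaled)
print(recovered)
```

the recovered values match the raw ones up to floating point rounding, so nothing is hidden by the normalization.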


feature selection and dimensionality reduction

we might not even need all the features we have to do well on prediction
we might need something that we don't have
sometimes the important thing is to figure out which ones to throw away
2 features are redundant if they are highly correlated with each other
dimensionality reduction
you end up with a whole new set of features
each is a function of the features you put in
you have x, y, and z, you end up with a, b, and c
a, b, and c are functions of combinations of x,y, and z
such that they are all orthogonal to each other
the output variables don't have correlations with each other
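
a sketch of this idea for the simplest case, assuming only two input features (x, y) so the rotation can be written in closed form. general principal component analysis uses an eigen-decomposition of the covariance matrix instead. the data and the function names are made up for illustration.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(a, b):
    """Sample covariance; covariance(a, a) is the variance of a."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

def pca_2d(xs, ys):
    """Rotate centered (x, y) points onto two uncorrelated axes (a, b)."""
    mx, my = mean(xs), mean(ys)
    var_x, var_y = covariance(xs, xs), covariance(ys, ys)
    cov_xy = covariance(xs, ys)
    # angle of the direction of largest variance for a 2x2 covariance matrix
    theta = 0.5 * math.atan2(2 * cov_xy, var_x - var_y)
    c, s = math.cos(theta), math.sin(theta)
    # each new feature is a linear combination of the old ones
    a = [c * (x - mx) + s * (y - my) for x, y in zip(xs, ys)]
    b = [-s * (x - mx) + c * (y - my) for x, y in zip(xs, ys)]
    return a, b

# made-up, strongly correlated inputs
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = pca_2d(xs, ys)
print(covariance(xs, ys))  # clearly non-zero: x and y are correlated
print(covariance(a, b))    # approximately zero: a and b are uncorrelated
print(covariance(a, a), covariance(b, b))  # a carries most of the variance
```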

it is a form of mathematical projection
you are changing your axes
i can't give you an intuition


martha: you perform something on each data point and transform it into something else

courtenay: it is an automatic way of compressing the correlated relationship into uncorrelated variables

martha: you take variables that are correlated, 
courtenay: now the features are uncorrelated.

ri: you need to know what is correlated

courtenay: you do a mathematical transformation that does it

lilly: whether two things are correlated is a statistical relationship, right, so the stats does the job for you?

courtenay: it is factor analysis
there is a bunch of ways to do that
principal component analysis

martha: compressed?

courtenay: you are assuming there is a lossless representation
and you produce features that are now uncorrelated, so that it is easier to feed into a model

lilly: we thought these were correlated

courtenay: after the transformation, you take the top x values and throw away the less important, less informative values in the bottom
that is purposeful transformation that way
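
one common way to decide how many of the top components to keep is by the share of total variance they explain. this is a sketch under that assumption; the workshop did not say how x is picked, and the variances and the 85 percent threshold below are made up.

```python
def top_components(variances, keep_fraction):
    """Keep the largest-variance components until keep_fraction of the total is covered."""
    total = sum(variances)
    kept, running = [], 0.0
    for v in sorted(variances, reverse=True):
        kept.append(v)
        running += v
        if running / total >= keep_fraction:
            break
    return kept

# hypothetical variances of five uncorrelated components after the transformation
component_variances = [6.0, 3.0, 0.5, 0.25, 0.25]

kept = top_components(component_variances, keep_fraction=0.85)
print(kept)                               # the components passed on to the classifier
print(len(component_variances) - len(kept), "components thrown away")
```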

ri: i thought we were trying to find out what does correlate, how come we can now all of a sudden identify what is correlated

martha: is compression like making juice out of vegetables

these are dimensionality reduced points, because it is too slow to eat carrots?


courtenay: basically, you are going to take the top few dimensions to a classifier, because you know these things are uncorrelated, you don't have bad feature correlations fucking up your models

if you have two variables that are really correlated, all the information that was contained in those two is compressed to one feature
you are not going to have the statistical problem of overweighing these features

lilly: combine marriage and margarine into one feature

courtenay: there is something to be said about the distance on the y axis, it does not mean anything
the slight difference between the shapes

lilly: a band of difference is acceptable?
courtenay: yes
lilly:  but the band can matter   

courtenay: the mathematical transformation will not take semantics into account

balanced datasets:

    sometimes you notice your classifier is doing suspiciously well - 95 percent accuracy
    then you notice that your data looks like this: 95 percent class a, 5 percent class b

maybe you can go collect more examples of class B and make your model better
could use re-sampling methods to feed your classifier more balanced (if slightly synthetic) data

it is good to be aware of the relative balance of classes in your data and think about how it might be affecting predictions
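
a sketch of both points, with made-up data: a model that always answers "a" already scores 95 percent on a 95/5 split, and random oversampling (one simple re-sampling method, chosen here as an assumption) repeats class b examples until the classes are even. the function names are hypothetical.

```python
import random
from collections import Counter

# made-up dataset: (feature, label) pairs, 95 of class a and 5 of class b
examples = [(i, "a") for i in range(95)] + [(i, "b") for i in range(95, 100)]

def majority_baseline_accuracy(examples):
    """Accuracy of a 'classifier' that always predicts the most common label."""
    counts = Counter(label for _, label in examples)
    return counts.most_common(1)[0][1] / len(examples)

def oversample(examples, seed=0):
    """Randomly repeat examples of the smaller classes until all classes are equal in size."""
    rng = random.Random(seed)
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append((features, label))
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced

print(majority_baseline_accuracy(examples))  # high accuracy without learning anything

balanced = oversample(examples)
print(Counter(label for _, label in balanced))
```

the repeated class b rows are copies, not new information, which is the "slightly synthetic" part.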

cross-validation:
    standard machine learning practice
    you need twice as much data: a training set and a test set
    instead of 2 fixed datasets
    split data into train/test multiple times for multiple experiments and take average results
    more samples -> results more likely to be statistically valid
    weka: 10-fold cross-validation, which means that it built 10 different classifiers
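
a minimal sketch of k-fold cross-validation in plain python, not how weka implements it. the stand-in "model" just predicts the most common training label, and the data and function names are made up for illustration.

```python
import random
from collections import Counter

def k_fold_splits(n_examples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs; each example is tested exactly once."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]

def cross_validate(examples, train_and_score, k=10):
    """Build k models, one per split, and average their test scores."""
    scores = []
    for train_idx, test_idx in k_fold_splits(len(examples), k):
        train = [examples[i] for i in train_idx]
        test = [examples[i] for i in test_idx]
        scores.append(train_and_score(train, test))
    return sum(scores) / len(scores)

def majority_model_score(train, test):
    """Stand-in model: predict the most common training label, return test accuracy."""
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return sum(1 for _, label in test if label == majority) / len(test)

# made-up dataset: 60 examples of class a, 40 of class b
examples = [(i, "a") for i in range(60)] + [(i, "b") for i in range(60, 100)]

average_accuracy = cross_validate(examples, majority_model_score, k=10)
print(average_accuracy)
```

swapping majority_model_score for a function that trains a real classifier gives the usual 10-fold setup: ten models, ten test scores, one average.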

martha: is that the kind of validation that you do if you have lots of data

courtenay: if you have tons of data, you can do a single split and that is ok
this is going to help you more if you don't have a lot of data
it is generally a good thing to do, it is sampling more

berns: what do ml people call statistical significance?

courtenay: this means that you did ten fold cross-validation
the part of your paper, where you prove that there is statistical validity in results is a little bit more lax 

overfitting:
    sylvia was describing the picture yesterday
    model complexity >> training examples
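
a toy illustration of overfitting, under stated assumptions: the true relationship is y = 2x, the training labels carry made-up alternating noise of +1 / -1, and the "complex" model is a polynomial with as many free parameters as training points. this is not the picture sylvia described, just one way to see the effect.

```python
def lagrange_predict(xs, ys, x):
    """Evaluate the polynomial that passes exactly through every (xs, ys) point."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def fit_line(xs, ys):
    """Ordinary least-squares fit of a straight line; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def mean_squared_error(predictions, targets):
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def truth(x):
    return 2.0 * x

# 8 noisy training points; test points sit halfway between them, without noise
train_x = [float(i) for i in range(8)]
train_y = [truth(x) + (1.0 if i % 2 == 0 else -1.0) for i, x in enumerate(train_x)]
test_x = [x + 0.5 for x in train_x[:-1]]
test_y = [truth(x) for x in test_x]

slope, intercept = fit_line(train_x, train_y)
line_train = mean_squared_error([slope * x + intercept for x in train_x], train_y)
line_test = mean_squared_error([slope * x + intercept for x in test_x], test_y)

poly_train = mean_squared_error(
    [lagrange_predict(train_x, train_y, x) for x in train_x], train_y)
poly_test = mean_squared_error(
    [lagrange_predict(train_x, train_y, x) for x in test_x], test_y)

print("line:       train error", line_train, "test error", line_test)
print("polynomial: train error", poly_train, "test error", poly_test)
```

the polynomial memorizes the training points, noise included, so its training error is zero while its test error is far worse than the simple line's.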

feedback and future discussion:

    berns: people like one on one, i like it when we are in a group. i am not even good with breaking out into groups.

    courtenay: this was a good size group for hands on stuff today, the group was a little bigger yesterday, which made it harder for hands on, but then better for discussion

    kavita: the pace was good
at no point was i dragging

ri: the structure was well thought out

joanne: great
i didn't think yesterday was overwhelming with the larger class size
i thought you were going over a lot of vocabulary
structurally to add: i wasn't sure what i was going to learn
i see machine learning all the time
one thing that could help is if there were 5 questions that were answered

courtenay: in the context that you see ml all the time, did it cater to your expectations

joanne: i was worried that it would be too technical 
i am glad that i came
there might have been a way to point out what we will learn

courtenay: there was an initial description that was more technical
it is good that we went where we went
what language would be friendly to the people that we want to attend

berns: my friend was worried that it would be way over her head

ri: what worked well, if you know about machine learning, you should come
that added to the people explaining, that was a good dynamic management
it is difficult when you are teaching a technical subject to put it on a level to keep everyone interested

kavita: i would love to have a discussion on social and cultural significance of machine learning taking over certain functions

ri: i would say the opposite, let's get our datasets 

courtenay: this was great, i was terrified 
it was kind of tough, it was a long road from do you want to do an ml workshop
to what would that look like
and to ask that question again and again
i am so sorry we didn't get to the societal implications or get to the data
this has been really fun
i got all kinds of perspectives and questions that i hadn't thought about
and taught myself things that i didn't know or had forgotten

ri: if you want to choose different spaces, this was a wonderful space


transinclusive
joanne: i don't want to call things all women
what if someone transitions
invalidating

berns: i have had people not feel included

joanne: you just can't say, no cis guys

eyebeam, i could talk with them if you need space
it is a nice space
new inc might be open, too
they might be good
eyebeam would be open


berns: because it is new york, we have our hands in such cool things
you wouldn't want to spam
if we could email a central person
i am putting out this event on thursday, i heard about this event and it might be of interest
a monthly news

joanne: if you had ela come and do some basic security: if she would be up for that, that would be amazing
especially since she has been doing threat modeling
she would be happy to test out
in most of her talks, she is often talking about things