Courtenay leads discussion on ML.

machine learning is really just observing patterns in lots of data.

what is it used for:
* speech recognition
* face recognition
* language translation
* predicting consumer behavior
* predicting financial markets
* analyzing social networks
* making business decisions

workflow of machine learning:
* repeat until you get out what you want: correlations, profit, phd, etc.
* there are typically two types of problems:
* classification: most times when people talk about machine learning, they mean classification
* regression: predict the value of something in a range of values
* classification asks questions like:
* is this a good or bad email (spam)?
* is this a dog, or a cat, or a rabbit?
* regression:
* given sales data for the last 12 months, what will next month's sales be?
* is it always about things like malignant tumors: something you cannot see, a wall that keeps you from seeing things?

janet: is it like you want to predict what netflix thinks of you?

betsy: if i could, at least coming in at the end, i wanted to say thank you so much. i am having fun geeking out with all these women. i have been telling people i thought it would be counter to the culture to do the fb thing. it was safe to ask any question, and yet it was not dumb; you can take a step back. it was a wonderful moment to be able to ask those questions. it was great to realize that i knew more than i thought i did. as a sociologist i know statistics, but i rejected it; realizing that i still have that knowledge and can use it to analyze my new project, i got great interview questions yesterday. when i have access to tech designers, these questions will help me get to what i want faster; i will sound more like i know what they are doing. i looked at the datasets: do i want to put that on my computer?

ri: how did you select weka? i looked at matlab, i am not there yet. what is a good tool?

courtenay: i did all of my phd in matlab; i hate it, it is proprietary. weka is a shitty visualization tool, but it is the only thing i know of that will let you do machine learning, run real algorithms out of the box, and load your dataset without more programming; programming is scary for people who don't know how to program. weka was started in 1993, i think. it was a reasonable choice for what i was trying to do here. if you don't want a classifier, but just want to visualize, there are surely nicer guis.

ri: that would be great

betsy: is weka like a wordpress for data?

seda: there is a data science course by mako hill; the material is online, i will add it to the email.

joanne: suggestion for a topic: a workshop on the blockchain, with some of the implications

seda: adversarial machine learning algorithms, e.g. emails that are sent in order to figure out how the machine learning works

asking a question about "data cleaning": do you use regression to fill in holes in a data set? with incomplete data, when you are missing labels, instead of getting a human to do it, you try to bootstrap your algorithm: you make your first guess at your algorithm, but you may be compounding your errors. that is something that happens when you have incomplete data.

regression example: you have a bunch of cities and some information about income (i didn't say anything about whether this is mean or median). you have a new city, and you want to know housing prices there. you try to find out based on what you know about the other cities.

janet: do you choose the model or does the machine choose the model?
courtenay: you choose the model; you still have to choose. there is a lot of domain knowledge. you can look at this example and say, yes, it looks linear, or, i am going to try a more complicated model. how do i know which of these models is better? you need to have additional data.

lilly: how similar is this to what economists and sociologists have been doing? instead of discovering the world, it has become about: can we make money off of it?

courtenay: i will talk about this tomorrow.
* statistics vs machine learning
* these are classical statistical techniques
* how is this different? why is it more shiny?
* some of it is real cultural differences, and some of it is the same
* at this level it looks like a lot of statistics, but as the field evolves and things get complicated, it is a little different

kavita: what is a model, is it like an algorithm?

courtenay: it is a semantic thing. the model is the object that you end up with at the end, with parameters, and you have an algorithm that changes the model. here the model is linear.

sylvia: [crude drawings of basic shapes]
* linear: the two values on the x and y axes grow together
* logarithmic: you have a data set which eventually levels off, like age
* exponential: (i lost my notes) you can use a logarithmic representation of exponential curves

sylvia is showing how you use logarithms to depict datasets with exponential growth: it is the same information but easier to read.

is there a library of these kinds of models, and you go to them and choose one?

seda: is this what you mean with a model, courtenay?

courtenay: you look at the data and you look to see what function you can fit.

carlin: i think about it as an equation. what a graph is doing is solving the equation: if x is this, y is that...

courtenay: there are a lot of algorithms and they are basically mathematical functions. the way you get the model may be complicated; running the algorithm may take 10 hours.

elizabeth: now i have to ask what an algorithm is. i thought i knew.

courtenay: usually you are doing optimization. typically you are iterating on how much error your prediction is making.

karissa: it is a bunch of steps that the computer executes

lilly: is it like a way of solving rubik's cubes?

carlin: it is a set of instructions. it could be self-referential inside.

elizabeth: sounds like a recipe and the result

courtenay: algorithms are this broad set of things. those are used to figure out your model: if you have these data points, you can get this line. it is not always so simple. this is an important task.

couldn't you look up the price of housing? what about predicting how much you would be willing to pay for a product?
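a minimal sketch of the regression example above (hypothetical income and price numbers, plain numpy; not code from the workshop):

```python
import numpy as np

# hypothetical data: median income (k$) and median house price (k$) for some cities
income = np.array([38.0, 42.0, 55.0, 61.0, 70.0, 83.0])
price = np.array([190.0, 205.0, 260.0, 300.0, 330.0, 410.0])

# fit the linear model price = w * income + b by least squares
w, b = np.polyfit(income, price, deg=1)

# predict the housing price for a new city we only know the income of
new_city_income = 65.0
predicted_price = w * new_city_income + b
print(f"predicted price: {predicted_price:.0f}k")
```

here "choosing the model" is the decision to fit a line at all; the algorithm (least squares) only finds the parameters w and b.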
classification example:
* we have some fish: 2 species, A and B
* hard to tell individual fish apart
* one of them is endangered; you want to be able to tell when it is that fish
* you don't want it overfished, but you can't tell by looking at it

you go out and observe these fish: the length and the number of stripes that you see. you get dna samples to test for species, and you figure out which species each fish belonged to.

features: attributes you observe about each example
class labels: ground truth; you know that is the true answer, the gold standard
training examples

lilly: what if you are not sure, you don't know how you want to classify them? i thought you were going to say the classifier would help you discern the clusters. in that case you don't have a ground truth; you want to discern the classifications.

courtenay: they look the same but they really are two different things. you want to identify which fish is which. you went to a lab and you have their dna.

seda: but isn't that a probabilistic model, too?!

courtenay: this is a toy example with a ground truth.

martha: what if it is a behavioral outcome and it depends on how you treat it? the outcome depends on how i treated you. you are not a fish.

courtenay: that would be about data contamination?

martha: ri is an A, courtenay is a B. i give courtenay a great credit card, but the result is the outcome.

courtenay: in the real world the gold standard is more complicated.

martha: for alternative pedagogy:
* we start with animals, where the social complication is not visible
* your credit is evaluated based on products you have consumed, but you can only consume those products if you have a good credit score

carlin: interesting problems here:
* what constitutes ground truth, and when is it reliable enough
* there is then a simplicity and complexity thing
* people often default to animals, balls, sports, because there is a need to go to a simple phenomenon, which turns out not to be a simple phenomenon
* it is an important thing anyway

martha:
* the first thing you teach kids is animals
* kids can be duck and cat, but not every kid can be like "good credit outcome"
* i wonder why this example starts here?

courtenay: that is how it started in my machine learning course

berns: that is a great question

lilly: women can be fish, too. that was my example. the objects are not politicized yet, or are depoliticized in the moment
[[i.e. the industrial rubber ball as the perfect simple object to build liveliness/character/spirit from]]

the first example in a machine learning book is how to choose the most perfect embryo. there's a lot of desire going on in there.

* you want to predict what this fish is
* you see that it is short and has reasonably few stripes; that seems close to species A
* but you see this other fish that is more confusing: still within the species A average range, but with a very different number of stripes
* if we want to solve this problem, to guess what fish this is, we need a model
* so i pull out math again
* in nature a lot of things are distributed with a bell curve, the gaussian distribution
* attributes can also fall into this pattern
* this says that most of the fish will fall in this middle part
* you also see outliers: some longer ones, some shorter ones, but no fish with length less than 0
* in this case we have decided, based on lots of years of studying animals, that a good model for a natural thing you see in nature is that this distribution is gaussian
* you can fit a probability distribution to what you have seen: you figure out what the mean is and the standard deviation
* that is the probability model that you fit to species A
* if you do species B, you have a different model; it has a higher average
* if they had different standard deviations, the bell curve would be broader or narrower
* then we know two things about the fish, and we can model these things jointly: we would have a 2-dimensional model
* the length probability is on one axis, the stripes on the other
* most of species A fall in the middle of this probability distribution; it falls in the middle of the cone there; some fall on the edges

lilly: you have data points. is this something you measured, or is this the model? is this the plotting of the data? there are some assumptions about the actual data you have.

courtenay: the guessing part is that you think it is going to approximately fit this shape. you hope your sample size is big enough so that your model is valid.

karissa: it can be a problem if people assume it is a curve like that, and they find out later that everything they did is wrong.

courtenay: not a lot of things follow this curve.

janet: you are also selecting a model and seeing if it works.

ri: how do you get to your model, what is the process?

courtenay: you look at the data.
* does it have a long tail?
* often you have too many features, so you visualize things differently
* depending on how much data you have, looking at the numbers is a good idea
* you could do a histogram: you take bins (0-5, 6-10, 11-15, ...), then you look to see in which box each data point falls; you can just count and see; if it looks like most are in the middle, you can have that shape

karissa: people love big data. if you have a lot of cases, you can pretty much tell instantly; a lot of them show up in the middle.

courtenay: it can be approximated as one.
* you have four models: a model of the distribution of each feature in each species
* you can have four bell curves
* and then you can look at each species' two features jointly, and hopefully they are well separated

now if we see a new fish, a data point, it goes somewhere in this plane on the bottom, and you see which of these models it falls closer to. this fish fits the B model better, so you can make a more educated guess that it is fish B. we observed those two feature attributes; we didn't have to send it off to the lab, we can guess now.

janet: how do you say it: "it is species B", or "this is probably species B"?

courtenay: you may look at the priors.

sylvia: confidence level: if it is in the red area, you are rather confident.

janet: gender map: someone with this height, long hair, short hair; is your computer going to decide who is male or female? or whether you can get tenure?

courtenay: a good scientist, you don't say "it is species B". but if you are google, you may tell advertisers that it is a man or a woman.

carlin: it is not a big deal to them if they get it wrong. it doesn't matter to them if they advertised to some of the wrong people.

janet: it does not matter to them that gender may be fluid.

courtenay: one takeaway: ml is using features that you can directly observe as a proxy to predict something you can't directly observe. there is no guarantee that you'll be right. there may be a lot of overlap: a fish might be an outlier for its species, or an abnormally large point in between your two models, and if you don't have more data, you can't really say.

takeaways:
* prediction is only as good as your models, and the assumption that your data does follow a particular distribution
* you need to observe a lot of fish of each species to build accurate models of them
* machine learning is what happens when you feed your models 1000s of fish
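a minimal sketch of the two-feature fish model above (made-up measurements; fit a mean and standard deviation per species per feature, then compare likelihoods for a new fish; not workshop code):

```python
import numpy as np
from scipy.stats import norm

# made-up training data: [length_cm, stripe_count], with lab-confirmed labels
species_a = np.array([[20.1, 4], [22.3, 5], [19.8, 4], [21.0, 6], [20.5, 5]])
species_b = np.array([[27.9, 9], [26.4, 8], [28.8, 10], [27.0, 9], [29.1, 8]])

def fit_gaussians(data):
    # one (mean, std) pair per feature: the fitted "model" for a species
    return data.mean(axis=0), data.std(axis=0)

mu_a, sd_a = fit_gaussians(species_a)
mu_b, sd_b = fit_gaussians(species_b)

def likelihood(fish, mu, sd):
    # naive assumption: features are independent, so multiply per-feature densities
    return np.prod(norm.pdf(fish, loc=mu, scale=sd))

new_fish = np.array([21.5, 5])
la = likelihood(new_fish, mu_a, sd_a)
lb = likelihood(new_fish, mu_b, sd_b)
print("probably species", "A" if la > lb else "B")
```

the "probably" in the print line matters: you are comparing how well two fitted bell curves explain the new fish, not reading off a certainty.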
courtenay: what is your confidence that your dataset is right? the people at mechanical turk, the labeling: there could be all sorts of data cleanliness problems, for almost anything. you want to deal with the noise, the outliers, the bad people submitting the form twice.

seda: claudia perlich was saying there is no wrong data, there is wrong interpretation of data.

courtenay: you can have adversarial data generation, for example; you can have wrong data.

ri: but that would be hard to separate. sometimes you watch tv on your girlfriend's account.

are there any advantages, or is it worth thinking about the value of "unclean" data?

jojo: you can change it. are you typing a transcript, you wonderful person?

martha: your prediction is as good as your models, but you only need your prediction to be as good as your need? most times when people want to do a critique, they ask if it is accurate, but maybe that is not the issue.

courtenay: yes, maybe you only need some percentage of success.

sylvia: if you get the wrong ad, no big deal. if you misdiagnose cancer, you need a more accurate model; then you need to watch out.

carlin: the scale is different in those two examples. any time it is a medical example, you are trying to take this wealth of statistics and apply it to a single body or case: you should do this because you are likely to have this risk. to go back to that, "need" is different; it also depends on the scale.

courtenay: google is trying to do predictions for each user, or netflix, but you may not care.

ri: amazon thinks that i am a recently divorced 50-year-old who does yoga. not much damage?

courtenay: high-paying jobs shown only to men.

helen, at the symposium:
* seda - rachel law - vortex
* uniqueness is a probabilistic feature in that moment... it is a combination of features

the wearable tech guy who says you can trade biometric profiles with people (to be someone else in that regard) is named Chris Dancy, twitter: @ServiceSphere

lilly: chelsea clinton: internet access is key to gender equality.
* where do we think development data comes from?
* people hired by universities and the world bank, who gather data through interviews
* a shift in data collection
* 800000 data points
* the tech industry can solve any problem with data
* techno, big data, and feminism
* investing in the middle class is the best way to bring about democracy
* history having the problems of big data over-generalizing
* the headline suggesting a correlation
* we could unpack how these correlations have many levels of spuriousness and assumptions

janet: that they are correlated is not causal

lilly: but the headline

sylvia: i stopped reading wired because it is so obviously written for men

janet: has it gotten worse? when you first read it, did you think it was not that way?
ri: it became more like a gq of gadgets. sylvia: ads for cars, watches and alcohol.

janet: the spurious correlations: http://www.tylervigen.com/spurious-correlations
* the number of movies with nicolas cage vs. murders in the pool

courtenay:
* models get more accurate -> predictions get more accurate
* this is true for our regression example, too: the more cities we observe, the better our prediction

there are lots of different classifier models; this is just one type. this is a gaussian naive bayes classifier:
* you assume features have a gaussian distribution
* it assumes each feature is unrelated to the others (not correlated with each other)

takeaways: a non-complete list of things people use for classification tasks:
* decision trees
* nearest neighbor: really naive classifiers. with the fish, we thought it was pretty close to A: you compute its similarity to every example you have seen, and because it was closest to an A, you say A, and you throw the statistical model out. sometimes it works really well.
* bayes: bayesian kinds of classification methods. these are kind of classical probabilistic methods, with a lot of complications on top of them. with bayesian methods you are looking at priors, specifically you are incorporating the base rate: if you observe that 25 percent of the fish are A and the rest B, you incorporate that into your final result.
* logistic regression
* support vector machines
* neural networks

janet: we always hear about bayesian stuff. is it that it includes probabilistic stuff, variables with probabilistic stuff?

courtenay:
* real models usually use more than 2 features; it's hard to visualize how they work and how they fail. we can maybe look at 2d or 3d, but it is really hard to understand at an intuitive level why things are working out.
* you try to figure out how well your predictions are doing: this is your training set here, and here is the test set. the test set needs to be labeled, too. when you have a model, you try to predict the things in the test set without looking at the labels, and then you look to see if you predicted well. which means you need more data.
* you could have a model and just throw it out into the real world, but you want to sort of believe that it is going to do what you think it is going to do.
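a minimal sketch of that train/test idea, using the nearest-neighbor classifier from the list above (invented fish data; hold labeled examples out as a test set and check the predictions against them):

```python
import numpy as np

# invented labeled fish: columns are length_cm, stripe_count; labels are the species
X = np.array([[20.1, 4], [22.3, 5], [19.8, 4], [27.9, 9], [26.4, 8],
              [28.8, 10], [21.0, 6], [27.0, 9], [20.5, 5], [29.1, 8]])
y = np.array(["A", "A", "A", "B", "B", "B", "A", "B", "A", "B"])

# split: the first 8 fish train the model, the last 2 are the held-out test set
X_train, y_train = X[:8], y[:8]
X_test, y_test = X[8:], y[8:]

def nearest_neighbor(fish):
    # 1-NN: compare to every training example, copy the closest one's label
    distances = np.linalg.norm(X_train - fish, axis=1)
    return y_train[np.argmin(distances)]

predictions = [nearest_neighbor(fish) for fish in X_test]
accuracy = np.mean(np.array(predictions) == y_test)
print(predictions, "accuracy:", accuracy)
```

the test labels are only used at the end, to score the guesses; that is the "believe it will do what you think" check.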
last example: in the real world you can do stuff even if your data is not fully labeled. it is harder, it is more uncertain. you may have tons of data and no labels; can we really not learn anything from it?

ri: you mean like confirmed labels?

courtenay: like the lab test.

carlin: for something to appear as data, won't some decisions have to be made? we have been using length; you know something has been measured, you need to know that it is inches. what do you need to know at minimum?

sylvia: labels

carlin: maybe i am talking about units

courtenay: yes, i am talking about ground truth labels. as long as you have examples, the fish does not need a name. with stripes and species, i don't need their label: you can throw them into a plot and look at them. that is data. everything is data.

courtenay: you need to have a reasonable belief or faith that the measurements of the coffee grounds are related to something i am predicting in the real world. you might be wrong; maybe you are measuring something that has no correlation.
* history of science: what people thought caused diseases seemed reasonable at the time, but it wasn't that
* advertisement: there is no guarantee that if you are a male in a specific city, the ad will work. it gets very subjective very fast in the real world.

can we learn something if we don't have ground truth, say about the species of the fish that we have? maybe we took measurements of all the fish, but we didn't even know they were from 2 different species populations. it is not just a matter of manually labeling the data: we don't even know what the labels should be. in this case you get into the broad heading of unsupervised learning. if you know the species, you know whether they are male or female; here you just have a bunch of numbers and data, and you are interested in the patterns in the data.

janet: i love the terminology, like workers that are unsupervised.

sylvia: like when you have a child: there is actually a correct answer, and whatever is learning, you are giving that answer.

carlin: it is a matter of whether there is a prior classification that supports that.

courtenay: labels -> supervised learning; without -> unsupervised.

standard techniques that you use here: the lengths and stripes. we have clusters, and each point is a single fish. we know that they are two different species, and they look like that. this is a lovely toy example, in a particular two-dimensional space that is perfectly visualizable. you can't do that with your customers: you don't know the structure of that data and you don't have a way to guess. the most basic thing you can do is cluster analysis.

the toy example i will show you uses a common algorithm called k-means clustering:
* you start by guessing that there are clusters in your data; you usually also guess how many clusters
* then you guess the centers and guess cluster membership
* it turns out that this will mathematically get you some nice clusters
* the algorithm: you pick two points at random as centers. maybe they are wrong; maybe they are both in the same cluster.
* then you do the most obvious thing you can do: you measure the distance to all the other points. you draw a line between the centers and draw a perpendicular line, and assume that this is a reasonable way to measure things.
* you do that, and you recompute the centers of the clusters: if all these red things are a cluster, where would the center be? you have moved your cluster centers.
* now you reiterate. the points have moved here; the blue points have overtaken some. you reiterate until the centers do not change anymore.
* at the end you get here, and you find your two species again

caveats: you still have to guess the number of clusters (two kinds of fish in this pond). you can guess at several different numbers of clusters and do an evaluation: you look at how tight the clusters are. more clusters make your model more complex. there is a bunch of hand-waving stuff here, but this is a thing you can do. it is like density estimation: figuring out if there are denser places in your feature space. this is an example of an algorithm.
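a minimal k-means sketch matching the walkthrough above (random made-up points, k fixed at 2, plain numpy; guess centers, reassign, recompute, repeat until the centers stop moving):

```python
import numpy as np

rng = np.random.default_rng(0)
# made-up unlabeled fish: two blobs we pretend we know nothing about
points = np.vstack([rng.normal([20, 5], 1.0, (30, 2)),
                    rng.normal([28, 9], 1.0, (30, 2))])

k = 2
centers = points[rng.choice(len(points), k, replace=False)]  # random initial guess

while True:
    # assign each point to its nearest center
    distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    membership = distances.argmin(axis=1)
    # recompute each center as the mean of its members
    new_centers = np.array([points[membership == i].mean(axis=0) for i in range(k)])
    if np.allclose(new_centers, centers):  # stop when the centers no longer move
        break
    centers = new_centers

print("cluster centers:\n", centers)
```

no labels anywhere: the two "species" fall out of the geometry alone, which is the unsupervised point.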
kavita: the cluster analysis will not tell you how many clusters there are?

courtenay: there are clever ways to guess. we can look at the data and say there are two. you can use histograms; visually you can look at it and know.

janet: i have been using cluster analysis in social network analysis, bibliometric analysis. you can get the computer to detect what it thinks is a clustering. if it is two clusters, do i have the relative distances through things?

sylvia: you can take a cluster and look to see if there are clusters inside it.

courtenay: yes, there are hierarchical clustering things you can do. in social networks there is a whole different set of things you might do: these two fish know each other; you have this whole extra set of data that goes with the attributes of each user.

carlin: one of the trickiest things in teaching: kids will do an analysis and come in with a gender binary, to figure out what is getting read, where gender gets assigned by twitter, who is participating, what kinds of assignments are happening, what makes them truthful, what feature you use, whether it makes more sense, and what to look at.

lilly: the developmental biologist anne fausto-sterling, on osteoporosis and how it is correlated with women, and how come. she critiques that and builds a process model of how osteoporosis comes to look correlated with women: if you play sports you are less likely to get it; race; the kind of work you do.

berns: the bone strengthener is marketed to petite white women.

carlin: osteoporosis gets discussed without that kind of specificity.

lilly: it takes a lot of labor to construct this other model. how can we use this data for other kinds of process stories of gender and race, without reifying them?

carlin: in the hospital you talk to people differently, not based on gender but on more specified risk.

berns: hypertension is discussed as race-based. cigarette smoking and hypertension... i am trying to remember how it was taught. it is: i don't even think about it, it is just how it is.
* boneeba medicine??
* there is a typical image for certain medications; they will be advertised to certain people, sometimes because their insurance is more likely to pay for that
* it is not about systemic issues, not "why is this woman having these issues". an african american woman is going to be more likely to be on this medication, and it is presented as a problem of her race, not that society was shit to her; inherently, this is what she will be, instead of what she went through.

lilly: correlation becomes the local cause and they just deal with it that way.

berns: then there are people who don't want to take the medicine; they are seen as non-adherent. that is supposed to be more compassionate, although some will call them non-compliant.

janet: that is an interesting label.

[THIS WAS SOMETIME BEFORE]
janet: you are parenting your computer
sylvia: i think robots are adorable; it is cute to watch

courtenay: there may be clusters of density, a little more of a grey area. finding interesting clusters: we may want to do something with this kind of analysis; without labels it may still allow you to make reasonable guesses.

FOR TOMORROW:
* bayesian statistics explanations: http://www.kevinboone.net/bayes.html
* sylvia would like to explain regression (30 minutes?)
* neural networks are going to take over the world??
seda mentions that there is AI that trains video game figures to act in specific ways.

seda says "what other politics are possible if there were other ways of querying data?"

seda: what kinds of queries can we make with machine learning to get at where discrimination starts? the problem is that when you categorize, you can then name and call out discrimination, but once you create the new category, that category has its own discriminatory potential.

anne fausto-sterling: http://www.annefaustosterling.com/

domain knowledge: you would use domain knowledge to get the parameters for a data set.

(discussion during hands-on weka session:)

carlin: i like what you say about domain knowledge.

kavita: our purpose is that we have a new piece of glass; is this helping us figure out what kind of glass it is?

courtenay: now it is not helping, because i took out the glass. does everyone understand what these histograms generally show?

lilly: can you read this file for us?

courtenay: breast cancer data. in this dataset there are 9 features: age, menopause, tumor size, and so on. the thing we are trying to predict is whether the cancer is likely to recur or not. we are looking at 286 examples; 201 did not have a recurrence and 85 did, and that is the class you are trying to predict. we would want to predict it by looking at some combination of the 9 features, or some subset thereof.

janet: we have a woman, 54, pre-menopausal, right breast.

courtenay: i can do the walking for you, too.

lilly: there is no x axis.

courtenay: this is what we were talking about before, numerical vs. nominal. these are much more nominal.

seda: you need to look at the arff file to find out what the values stand for.

courtenay: the way you read this is that 68 cases got radiation therapy, and about half of them had a recurrence and half didn't; the others didn't get radiation and did not have a recurrence. the information you can glean here is how different the percentages of the classes are. in this case it doesn't make sense: recurrences were a far less frequent event.

bernadette: it is a small amount that recurs.

courtenay: it is not unlikely. you have a vested interest in predicting who is going to recur.

kavita: does this mean that you are more likely to have a recurrence if you get radiation?

berns: the first part is people who got it, and it is 50-50; for the ones who didn't, there was a better chance.

courtenay: but you need to know whether those getting radiation were the ones seen as more serious cases.

sylvia: there are ways to present it that show a clear relationship, and there are ways which don't show what the relationship is. there are no clear relationships here. if we used some sort of algorithm, we could predict it, but through visualization, especially because they have different population sizes, it doesn't feel like a good example. or it is a weakness of the program; it is a little lame of them.

courtenay: i agree with you.

sylvia: the whole point of visualization is to see things.

berns: the safe thing we are agreeing on: you have to be careful with correlation and causation. i have been to a number of pharmaceutical presentations; they will take 2 people living 3-4 months longer and they will make claims.

janet: i see why you go into this. for people who got radiation it evened out; it looks like not getting radiation meant you did not have a recurrence. that is why you collect a whole bunch of data: because you want to show why the finding is part of other factors.

sylvia: this is also a way to manipulate data to get what you want.
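a sketch of poking at the same dataset outside weka (assumes the UCI breast-cancer ARFF is saved locally as breast-cancer.arff, and that its class attribute is named "Class"; both are assumptions, scipy can parse ARFF either way):

```python
from collections import Counter
from scipy.io import arff

# load the ARFF file weka uses (assumed local path and attribute name)
data, meta = arff.loadarff("breast-cancer.arff")

# nominal values come back as bytes; count the class balance
classes = Counter(value.decode() for value in data["Class"])
print(classes)  # expect roughly 201 no-recurrence vs 85 recurrence events
```

counting the class balance first is exactly the point made above: the recurrence class is the rare one, which shapes how every histogram should be read.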
seda: they claim more data is always better; it is better to have the noise?
* if you have a lot of features, instead of fitting a straight line, you would be fitting this thing that goes through every point; you don't want to predict this thing in between
* there is a way you can measure: i decide on a certain feature. i am looking at glass and i want to know if it is transparent. if you see that it is evenly distributed across all glass, then you know it is not a relevant feature.
* so there are certain features that are an indicator of the label

the overfitting problem

DAY 2:
http://www.thenewyorkworld.com/
https://nycopendata.socrata.com/

where to find data: what the important attributes are, what there is data for, what there isn't data for. lilly couldn't find any data on contractors. unpacking "mechanical turk". CUP lab: data siphons. data politics in NYC.

martha: certain datasets won't be more -- how much learning can your machine do?

courtenay: exploratory actions on data, or finding.

the picketty dataset is open!

martha: what is the difference between prediction and learning?

courtenay: maybe? someone may or may not believe you have proven something with your predictions. there are no unknowns that you can point to. putting "machine learning" on a pedestal as separate from data mining or statistics is dangerous.

how many nation states in picketty? martha: the european ones? seda: 40, based on his definition?

courtenay: you can still make predictions for a new country. cross-validation: hold out one data point and then see how well you predict the missing country, to test your model.

seda: there's always a prediction; isn't the question how reliable the prediction is? what is prediction?

courtenay: you don't know a value, so you attempt to guess it.

seda: the act of using a function to come up with a value you don't know.

courtenay: you may be artificially obscuring the value in order to test. that's still prediction.

kavita: can you do predictions on datasets from the past, for which it's not possible to go back and collect?

courtenay: you can still do what you want; it's a philosophical, scientific thing. going forward you're not going to be able to make predictions, but it can tell you if you have a good model of the phenomena.

lilly: it reminds me of talking to mathematicians and scientists: you don't have a theory unless you make predictions about the future. ethnographers work differently: if you don't know how the data was created, you don't have a theory. new ways to explore models. the potential parameters are infinite.

courtenay: the end game doesn't have to be classification. the field of machine learning is driven by prediction, but the techniques are statistical techniques. there are other ways of seeing if things are correlated.

bernadette: last night i thought about farming data. labor has been low on the farm until the summer youth came; now it's spic and span. number of workers, with hours put in, against crop outputs.

jojo: when you said farming data, i thought you were talking about the labor of preparing data for use later on: the workers come in and clean it up and it is ready for harvesting.

joanne: the wikileaks data is CSV.

seda: text analysis will be interesting.

SLIDES / courtenay presentation:

courtenay: touching on what correlations are, using the spurious correlations site: everyone knows correlation doesn't imply causation, but it doesn't necessarily even mean correlation.

martha: but is it predictive?

courtenay: there is no reason to believe that they would be.

seda: google food: all sorts of debates. world bank discussions. google food trends worked because of years and years of data collected by scientists.
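a sketch of the hold-one-out idea courtenay describes above (leave-one-out cross-validation; toy made-up per-country numbers, predicting each held-out country's value from a line fit on the rest):

```python
import numpy as np

# toy stand-in for a cross-country dataset: one feature x, one value y per country
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1])

errors = []
for held_out in range(len(x)):
    train = np.arange(len(x)) != held_out      # every country except one
    w, b = np.polyfit(x[train], y[train], 1)   # fit the model without it
    prediction = w * x[held_out] + b           # predict the "missing" country
    errors.append(abs(prediction - y[held_out]))

print("mean leave-one-out error:", np.mean(errors))
```

the held-out value is artificially obscured, exactly as courtenay says: you know the answer, but the model doesn't, which is what makes it a test.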
how good is prediction without another kind of ground truth?

courtenay: you can go out and look, and a couple are going to look really great and all the rest won't work. you just pick the ones that look good.

lilly: isn't the point that you don't need common sense?

courtenay: maybe they're both correlated to other things. maybe there are other variables. you convince yourself that they are correlated; maybe they are correlated to other things, but you have convinced yourself that this is the correlation. the correlation will be spurious because they...

seda: constant is right now doing a workshop: how do we create common sense with machine learning?

berns: we make lists when we hit problems.

courtenay: human brains are good at making spurious correlations.

berns: cognitive therapists.

courtenay showing correlations of different types and strengths. if your classifier doesn't work, you might just not have enough information.

courtenay: sometimes you just have data, and maybe you won't be able to predict what you want to predict.

seda: is there any data on data that confuses classifiers?

courtenay: if it is uncorrelated with the class, it shouldn't throw off your classifier; your classifier will learn to weight it as zero.

from the discussion yesterday: if the length and the number of stripes of fish are correlated, a model that assumes they are independent might not work very well, because you count the same information twice: it's repeated in two places and the model doesn't take this into account. no double counting!! bad! solution: you could switch to a model that doesn't assume independent features.

the other broad philosophical point was the fight between statisticians and machine learning. i was vaguely aware of the fight, aware that there was some tension; maybe yesterday one of you asked: is this any different from statistics? here is a joke i found: a table of differences between the two, mostly terminological (weights vs. parameters, etc.), but a large grant in ml is 1,000,000 whereas in statistics a large grant is 50,000. lots of overlap and lots of cultural differences; the practices have evolved into different standards.

andrew gelman says: maybe we should remove models and assumptions, because then we can solve the problems that the machine learning people can solve.

courtenay: there are people who believe more or less in one or the other dogma. one commentator on stackexchange says ml experts do not spend enough time on fundamentals, and many of them do not understand optimal decision making and proper accuracy scoring rules; statisticians spend too little time learning good programming practice and new computational languages.

martha: can you explain the second statement, about statisticians?

courtenay: humans aren't super into change. the discipline evolved in a specific way, before computers were around, in a culture in which people don't jump to the most immediate new software. fewer people in statistics departments know how to program.

seda: ML comes from CS; statisticians come from mathematics.

lilly: chalkboards, slow proofs (math) vs. prototypes (CS), fast moving.

seda: a mathematician: if you don't understand what your algorithm is doing, it's wrong. one big issue: giant data sets. efficiency is about quantifying results.

when social scientists look at this debate, they ask whether it is right or wrong, but the test by which something is successful in the world is not whether it is right or wrong, but whether it "works". it is hard to make it stick, but it is working: the ml person says "it is working", and the statistician says "it is wrong".

jojo: it depends on what you mean by what matters?
lilly: machine learning and statistics are competing for legitimacy over what is the right way to work with this data. it could be that in the debates about what is right and wrong, by participating in those debates, the ml people are legitimizing their discipline.

martha: for some of these guys what is at stake is not publishing a paper, but having a successful company. if they say all that matters is that they have a correlation... different social worlds, different stakes.

courtenay: techniques developed in academia get adopted elsewhere. a lot of the techniques get developed in the academic setting, but in many cases, outside of academia, if it works, it works. columbia was very mathy and proof-oriented; that is the academic thing. in practice it is a very computer science and engineering mindset: i built it, it works.

lilly: some friends would consult for the cia and such. for intelligence vs. ad prediction, there may be different standards?

courtenay: i don't know, theoretically, what the standards are behind that wall [of intelligence].

martha: you were trained pre-data-science?

courtenay: i finished at the end of 2012. i was in machine learning courses in 2007-2008. i was taught the hot methods of the time (which were not neural networks, and a lot of that has since been taken over), by people concerned with theory and proofs.

kavita: the real-timeness of data: ml people have access to data that is constantly coming in and being optimized on, while statisticians deal with more static data?

courtenay: it is less about real time than about dealing with larger datasets, which data scientists have been doing for a long time; statisticians may not be as comfortable with that.

lilly: would twitter search be one of these computational processes? twitter search has a real-time problem: topics are cultural context that are not indexable terms, so they hire mechanical turkers to find something very quickly. timing matters, limitations. turk workers to bootstrap.

courtenay: detecting the density of topics.

martha: i just thought of something: the credit scoring had 12 items, and that is how many items someone working on paper could add up. she would be doing the computation live, and that is about computational efficiency. the debate today has machine learners saying these people are archaic, but it was computational efficiency; i did not think of it as computational efficiency because it has been transformed by infrastructures.

courtenay: it is not necessarily that 12 variables is a bad system. it is more about the number of data points than the number of variables (features).

martha: given all the data that could be credit data, it looks archaic.

courtenay: there is such a thing as too many variables, and there are too-complex models as well.

berns: that sounds like plain bullshit to me. you cannot have enough considerations in this case.

courtenay: i agree that you may need more than 12 factors, but for some things it may be enough.
martha: we are having this debate because of computational infrastructure. when we phrase the debate, the only reason we are considering more than 12 is because these guys have amplified their capacities in the last 50 years.

lilly: in a lab, machine learning: we are storing so much more data, we need to gain more financial value from this data. we need to get more value because we have more data, not: we want more data because we can get more value.

courtenay: it is cheap enough that you can store everything.

martha: the debate that you are pointing out, between statisticians who are cheap and ml people who can maximize: the debate is created by the economics of the environment.

courtenay: statisticians use computers; they may not be up to par with the latest computational infrastructure. the data is probably generated by a tech company who is interested in doing this thing on its own data. as far as academic departments go, they could be trying to solve the same problems.

lilly: are you saying the difference is that cs people need grants to get machines? entrepreneurial grant-getting.

martha: the million dollars have to be for something.

courtenay: they are also being snarky about it being a fad. it is cutting edge and popular, and statistics has a marketing problem. ML gets the money because...

berns: there is a race issue there, too, as to who teaches you statistics and computer science. my statistics teachers were people of color.

seda: within computer science there are layers of people who are more proofy. clean definitions.

courtenay: then there are the ones who hack.

seda: the upper echelons -- it's class. the middle-class belt: less lofty, more likely to do applied stuff, don't mind being engaged with money. privacy is upper, surveillance is middle. ML gets new folks: physicists and biochemists, who need the techniques. they go into hedge funds, big data systems. how do physicists deal with complex social issues?

lilly: physics is such a male-dominated field.

seda: except in Iran. a hard crowd for me to read; they tend to be polymaths in my experience.

courtenay: engineering mindset: but practically, it worked! i don't know what he means by "worked", but the proof is in the pudding.

martha: usually managerial.

courtenay: lots of complaints to be had about this attitude.

broad takeaway, two outlooks:
* classical stats: hypothesis testing
* ML: getting predictions to work, even in the face of a lack of interpretability of the models

lilly: as an ethnographer i now feel aligned with classical stats.

courtenay: if ML is more "successful", it comes from the large-scale resources: wringing the last bits of success out of those things rather than doing something more profound.

martha: could there be a synthesis?

courtenay: most things are black boxes, but there's a real interest in doing this; run models backward; google deep dream. no one likes black boxes. people like to know how things work, also because they want to improve them.

feature normalization: what if we have this data: much less variability in # stripes than in length, a big difference in scales. it is a problem if you are trying to calculate distance between things. change each feature to have mean = 0, standard deviation = 1. an obvious thing that people in intro cs classes don't do. (see the sketch below.)

DATA SHARING (EMAIL PHOTOS!)
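a minimal sketch of that normalization (z-scoring; made-up fish again, where length ranges over hundreds of units and stripes over a handful):

```python
import numpy as np

# made-up features on very different scales: length, stripe_count
X = np.array([[203.0, 4], [251.0, 9], [198.0, 5], [287.0, 8], [222.0, 6]])

# z-score each column: subtract its mean, divide by its standard deviation
mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_norm = (X - mu) / sigma

# every feature now has mean 0 and std 1, so no single feature dominates distances;
# keeping mu and sigma around lets you transform new raw data and also back it out
print(X_norm.mean(axis=0).round(6), X_norm.std(axis=0))
```

keeping mu and sigma is the point made just below: the transformation is reversible, nothing is obfuscated.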
courtenay: something can look extremely small: you are looking at particles, at how toxic something is. the numbers may look very small to you, but you want to stretch them out to be able to evaluate their significance.

martha: do we know that the difference is equal? it will then become testable as to whether it is meaningful.

courtenay: it depends on your classification model; some models will need normalized data. you are not changing the information in that variable, you have made it easier for algorithms to work with, and maybe for you to view as a human. there could be a diagonal relationship that you can see better in context, because of the resolution.

martha: if the variable is useless, after the transformation it is still useless.

courtenay: there are other normalizations that you can do; that is statistical normalization of data.

martha: is there a relationship between normalization and the inability to reverse-engineer?

courtenay: no. you usually have the raw data, and you know the mean, so you can go back. it depends on the final model: the final classification model somewhere inside of it has the value of the mean that it needs to subtract off. that value is a parameter in the model, and you know what it is, so that you can make the transformation on the raw data coming in. that also means that you can back it out. you are not obfuscating anything.

feature selection and dimensionality reduction: we might not even need all the features we have to do well on prediction; we might need something that we don't have. sometimes the important thing is to figure out which ones to throw away:
* features with no correlation to the class
* features that are redundant with each other: if their correlation is one to one, you can throw one away
* 2 features are redundant if they are highly correlated with each other: you're really getting the same information from both, and overcomplicating the model
* you could do this manually

dimensionality reduction: a way to compress data so that you can extract a smaller set of uncorrelated features. there is a set of mathematical transformations to do this. you end up with a whole new set of features, each a function of the features you put in: you have x, y, and z, and you end up with a, b, and c, where a, b, and c are functions of combinations of x, y, and z, such that they are all orthogonal to each other. the output variables don't have correlations with each other. it is a form of mathematical projection; you are changing your axes. i can't give you an intuition.

martha: you perform something on each data point and transform it into something else.

courtenay: it is an automatic way of compressing the correlated relationships into uncorrelated variables.

martha: you take variables that are correlated...

courtenay: ...and now the features are uncorrelated.

ri: you need to know what is correlated.

courtenay: you do a mathematical transformation that does it.

lilly: whether two things are correlated is a statistical relationship, right, so the stats does the job for you?

courtenay: it is factor analysis; there are a bunch of ways to do it. principal components analysis. (a small sketch follows below.)
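a small sketch of the principal components idea just named (not the workshop's own code; made-up correlated x, y, z columns, projected onto uncorrelated axes with numpy's SVD):

```python
import numpy as np

rng = np.random.default_rng(1)
# made-up data: z is nearly a copy of x, so the three columns are correlated
x = rng.normal(size=200)
y = rng.normal(size=200)
z = x + 0.05 * rng.normal(size=200)
X = np.column_stack([x, y, z])

# PCA via SVD: center the data, then project onto the principal axes
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Xc @ Vt.T              # the new features a, b, c: uncorrelated by construction
print(S**2 / (len(X) - 1))          # variance carried by each new feature

# keep only the top 2 components: the redundant x/z pair collapses into one axis
reduced = components[:, :2]
```

the printed variances drop off sharply after the second component, which is the cue for throwing the bottom ones away, as described next.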
martha: compressed?

courtenay: you are assuming there is a lossless representation, and you do something so that the features are now uncorrelated, so that they are easier to feed into a model.

lilly: we thought these were correlated.

courtenay: after the transformation, you take the top x values and throw away the less important, less informative values at the bottom. it is a purposeful transformation in that way.

ri: i thought we were trying to find what does correlate; how come we can now all of a sudden identify what is correlated?

martha: is compression like making juice out of vegetables? these are dimensionality-reduced points, because it is too slow to eat carrots?

courtenay: basically, you are going to feed the top few dimensions to a classifier. because you know these things are uncorrelated, you don't have bad feature correlations fucking up your models. if you have two variables that are really correlated, all the information that was contained in those two is compressed into one feature; you are not going to have the statistical problem of overweighting those features.

lilly: combine marriage and margarine into one feature.

courtenay: there is something to be said about the distance on the y axis; it does not mean anything. the slight difference between the shapes.

lilly: a band of difference is acceptable?

courtenay: yes.

lilly: but the band can matter.

courtenay: the mathematical transformation will not take semantics into account.

balanced datasets: sometimes you notice your classifier is doing suspiciously well (95 percent accuracy), and then you notice that your data looks like this: 95 percent class A, 5 percent class B. maybe you can go collect more examples of class B and make your model better. you could also use re-sampling methods to feed your classifier more balanced (if slightly synthetic) data. it is good to be aware of the relative balance of classes in your data and think about how it might be affecting your predictions.

cross-validation: standard machine learning practice. you need twice as much data, a training and a test set. instead of 2 fixed datasets, you split the data into train/test multiple times, for multiple experiments, and take the average results. more samples -> results more likely to be statistically valid. weka: 10-fold validation means that it built 10 different classifiers. (a sketch follows below.)

martha: is that the kind of validation you do if you have lots of data?

courtenay: if you have tons of data, you can do a single split and that is ok. this is going to help you more if you don't have a lot of data. it is generally a good thing to do; it is sampling more.

berns: what do ml people call statistical significance?

courtenay: you say that you did ten-fold cross-validation. the part of your paper where you prove that there is statistical validity in the results is a little bit more lax.
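a minimal sketch of k-fold cross-validation as just described (the toy fish-style data again, 5 folds instead of weka's 10 so each fold holds a few points; train on the rest, test on the held-out fold, average):

```python
import numpy as np

rng = np.random.default_rng(2)
# toy labeled data: two gaussian blobs, labels 0 and 1
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

k = 5
order = rng.permutation(len(X))   # shuffle, then cut into k folds
folds = np.array_split(order, k)

def nearest_neighbor_accuracy(train_idx, test_idx):
    correct = 0
    for i in test_idx:
        dists = np.linalg.norm(X[train_idx] - X[i], axis=1)
        correct += y[train_idx][np.argmin(dists)] == y[i]
    return correct / len(test_idx)

# each fold takes one turn as the test set; the other k-1 folds train the model
scores = [nearest_neighbor_accuracy(np.concatenate(folds[:i] + folds[i + 1:]), folds[i])
          for i in range(k)]
print("per-fold accuracy:", scores, "mean:", np.mean(scores))
```

averaging over k different splits is the "sampling more" courtenay mentions: one lucky split can flatter a model, five are harder to fool.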
overfitting: sylvia was describing the picture yesterday. model complexity >> training examples.
* often happens when # features >> # training examples
* your model is too complex: you didn't have enough training examples to justify the complexity of your model
* a lot of the models weigh each feature: your model is the weighted version of those 12 features, but maybe you only saw 6 examples
* you have few data points and you are trying to fit a model with way more parameters, and you are going to overfit
* sort of: overfitting lets your model be more creatively wrong
* occam's razor: simpler models are usually better
* classic visual demonstration: a simple linear model vs. a more complicated model that goes through every point. which would you expect to be more correct for new examples?
* the canonical way of seeing if you overfit: you look at your prediction performance on the training set. if you are doing really well on your training data and shitty on the test data, you know you have done something wrong!! :)

feedback and future discussion:

berns: people like one-on-one; i like it when we are in a group. i am not even good with breaking out into groups.

courtenay: this was a good size group for hands-on stuff today. the group was a little bigger yesterday, which made it harder for hands-on, but better for discussion.

kavita: the pace was good; at no point was i dragging.

ri: the structure was well thought out.

joanne: great. i didn't think yesterday was overwhelming with the larger class size; i thought you were going over a lot of vocabulary. structurally, to add: i wasn't sure what i was going to learn. i see machine learning all the time; one thing that could help is if there were 5 questions that would be answered.

courtenay: in the context that you see ml all the time, did it cater to your expectations?

joanne: i was worried that it would be too technical. i am glad that i came. there might have been a way to point out what we would learn.

courtenay: there was an initial description that was more technical. it is good to think about where we went and what language would be friendly to the people that we want to attract.

berns: my friend was worried that it would be way over her head.

ri: what worked well: "if you know about machine learning, you should come." that added to it, the people explaining; that was good dynamic management. it is difficult when you are teaching a technical subject to put it on a level that keeps everyone interested.

kavita: i would love to have a discussion on the social and cultural significance of machine learning taking over certain functions.

ri: i would say the opposite: let's get our datasets.

courtenay: this was great; i was terrified. it was kind of tough, a long road from "do you want to do an ml workshop" to what that would look like, asking that question again and again. i am so sorry we didn't get to the societal implications or get to the data. this has been really fun: i got all kinds of perspectives and questions that i hadn't thought about, and taught myself things that i didn't know or had forgotten.

ri: if you want to choose different spaces, this was a wonderful space, trans-inclusive.

joanne: i don't want to call things "all women": what if someone transitions? invalidating.

berns: i have had people not feel included.

joanne: you just can't say "no cis guys". eyebeam: i could talk with them if you need space; it is a nice space. new inc might be open, too; they might be good. eyebeam would be open.

berns: because it is new york, we have our hands in such cool things. you wouldn't want to spam, but if we could email a central person: "i am putting out this event on thursday", "i heard about this event and it might be of interest". a monthly newsletter.

joanne: if you could have ela come and do some basic security, if she would be up for that, that would be amazing, especially since she has been doing threat modeling. she would be happy to test out most of her talks; she is often talking about these things.