Algoliterary workshop - Towards Collective Gentleness?
http://constantvzw.org/site/Towards-Collective-Gentleness.html?lang=en
Notes Algoliterary lectures http://pad.constantvzw.org/p/algoliterary.lectures
Notes workshop: Variations on a glance http://pad.constantvzw.org/p/algoliterary.workshop.a-glance
--
planning
12:00 - 13:30 introduction / walk through
- general: algolit - sentiment - simple/complex algorithms - machine learning - neural networks - supervised/unsupervised
- topic: bias in data/flattening - how/where is the knowledge produced - look at the process
- we are a sentiment thermometer
13:30 - 14:30 lunch break
14:30 - 18:00 possibilities of working on:
* with graphical interface:
scoring sentences we-are-a-sentiment-thermometer - exhibition
reverse algebra with word2vec - exhibition
word-embedding projector (Tensorboard/Tensorflow) - exhibition
*The projection of the GloVe dataset with the 10,000 most common words is now on http://192.168.9.205:5001/#projector (best viewed in Chromium; works in Firefox)
*The projection of the GloVe dataset with the words used for the sentiment analysis is now on http://192.168.9.205:5002/#projector - check lexicon words - paper lists
create new lexicons i-could-have-written-that - exhibition writing machine
* in the terminal:
visualisation GloVe graphs (with Matplotlib)
nearest neighbour graphs (with Gensim)
reverse algebra with GloVe -on Cristina's computer / local install
revisit the sentiment thermometer scripts with different lexicons
links
Algoliterary Encounters
*Algolit wiki http://www.algolit.net/
*Notes of the workshop 'Variations on a Glance' yesterday: http://pad.constantvzw.org/p/algoliterary.workshop.a-glance
Workshop
*blogpost Rob Speer 'How to make a racist AI without really trying' https://blog.conceptnet.io/2017/07/13/how-to-make-a-racist-ai-without-really-trying/ & https://gist.github.com/rspeer/ef750e7e407e04894cb3b78a82d66aed
*variation on the tutorial: http://pad.constantvzw.org/public_pad/neural_networks_maisondulivre_workshop_script
*Git repository: https://gitlab.constantvzw.org/algolit/algolit/tree/master/algoliterary_encounter
*GloVe datasets: https://nlp.stanford.edu/projects/glove/
*CommonCrawl: https://commoncrawl.s3.amazonaws.com/cc-index/collections/index.html
References
*Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings, July 2016 https://arxiv.org/abs/1607.06520
*Semantics derived automatically from language corpora necessarily contain human biases, August 2016 https://arxiv.org/abs/1608.07187
Case studies
*Perspective API http://www.algolit.net/index.php/Crowd_Embeddings & https://www.perspectiveapi.com/
NOTES START HERE
Looking at 'sentiments' - positive, negative ...
Text is not numbers: an extra step is needed. How do you translate words into numbers?
Q: Sentiment means feeling?
A: Example -- the stock market. People buying and selling. When there is a sense of a company doing well, it is seen as a 'positive' sentiment. Also marketing. The word is used in the field of biometrics too.
https://en.wikipedia.org/wiki/Sentiment_analysis
https://cloud.google.com/prediction/docs/sentiment_analysis
Same technique in different fields.
Wondering about the vocabulary of "sentiment": is it proper to the field, or a term used by Algolit?
Proposal: make a round to check definitions.
Working with neural networks is hard. We are working with a tutorial by Rob Speer, in which he showed how to make "a racist AI without really trying".
Rob Speer: "I want de-biasing to become one of those Lego bricks. I want there to be no reason not to do it."
https://blog.conceptnet.io/2017/07/13/how-to-make-a-racist-ai-without-really-trying/
[the problem/issue/potential of bias - ref. Variations on a glance]
There are different steps in creating a model. Algolit developed interfaces to parts of the process, so you can explore it through those.
1. word embeddings
2. an additional layer of machine learning
using stereotypically typed sentences to test (a sketch follows below):
a. I like Italian food (scored as positive sentiment)
b. I like Mexican food (scored as negative sentiment)
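A minimal sketch of these two steps, loosely following Rob Speer's tutorial: pre-trained GloVe embeddings plus a small classifier trained on the positive/negative word lists. File names, the choice of classifier and the sentence scoring are assumptions, not the exact exhibition code:

import numpy as np
from gensim.models import KeyedVectors
from sklearn.linear_model import LogisticRegression

# 1. word embeddings: GloVe vectors, converted to word2vec format beforehand (hypothetical file name)
vectors = KeyedVectors.load_word2vec_format('glove.42B.300d.w2v.txt')

# 2. an additional layer of machine learning: learn 'positive' vs 'negative'
#    from the labelled word lists (file names as used elsewhere in these notes)
pos_words = [w for w in open('positive.txt').read().split() if w in vectors]
neg_words = [w for w in open('negative.txt').read().split() if w in vectors]
X = np.vstack([vectors[w] for w in pos_words + neg_words])
y = [1] * len(pos_words) + [0] * len(neg_words)
classifier = LogisticRegression(max_iter=1000).fit(X, y)

def sentiment(sentence):
    """Average the probability of 'positive' over the words the model knows."""
    words = [w for w in sentence.lower().split() if w in vectors]
    return classifier.predict_proba(vectors[words])[:, 1].mean()

print(sentiment("I like Italian food"))  # tends to score higher ...
print(sentiment("I like Mexican food"))  # ... than this one: the bias the tutorial exposes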
What does it mean to use these examples? How can we look at other biases, as in expressions (again: yesterday's workshop)?
How can we work with biases?
1. Do better science (so bias is 'solved')
2. Refuse to use this tech (so we avoid bias)
3. What if we consider biases as a way to look obliquely at these technologies? Bias is inherent to human communication and human language.
De-binarizing bias?
Stay with the trouble/bias
What is bias? Nicolas used the example of a 'bias cut' (cutting diagonally across the grain)
Different responses to certain sets of words: 'spider' + 'flower' -- 'snake' + 'beautiful' ... 'musical instrument' + 'cowboy'
Q: cultural?!
A: it is a test to develop a model. The idea is to separate stereotypes from prejudices. (and to correct for that? so that is actually going for option 1?)
The text of the 'we are a sentiment thermometer': http://pad.constantvzw.org/public_pad/neural_networks_maisondulivre_workshop_script
The script is based on the English language. It is important to keep this in mind, because the first results are embedded in a US context.
Common crawl = http://commoncrawl.org/
US " We build and maintain an open repository of web crawl data that can be accessed and analyzed by anyone."
YOU "Need years of free web page data to help change the world."
Q: when are the islands created exactly?
A: already in the GloVe dataset. The way the 'raw data' is written defines the way the 'islands' are constructed. GloVe embeddings have 300 dimensions ('sideway words').
The projection of the GloVe dataset with the 10,000 most common words is now on http://192.168.9.205:5001/#projector (best viewed in Chromium; works in Firefox)
The projection of the GloVe dataset with the words used for the sentiment analysis is now on http://192.168.9.205:5002/#projector
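The same 'islands' can also be explored from the terminal, as in the 'nearest neighbour graphs (with Gensim)' item of the planning. A small sketch, assuming the GloVe vectors were converted to word2vec format first (file name hypothetical):

from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format('glove.6B.300d.w2v.txt')

# the ten nearest neighbours of a word, by cosine similarity: the 'island' around it
for word, similarity in vectors.most_similar('flower', topn=10):
    print(word, round(similarity, 3))

# the 'reverse algebra' piece from the exhibition works on the same vectors:
# king - man + woman lands close to queen
print(vectors.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))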
Metaphors of word embeddings:
*words have windows of varying size
*'GloVe' - the idea of fitting a word like a glove
*gold standard
TensorBoard: https://github.com/tensorflow/tensorboard/blob/master/README.md
TensorBoard: Visualizing Learning https://www.tensorflow.org/get_started/summaries_and_tensorboard
Can we think about this as a kind of 'computational etymology'?
Q: What is a Gold standard?
A: Annotated by humans. Disagreement list: researchers decide, or discard. These lists are well developed in English, not so much in French or Dutch. -> social economy around language tools
Interesting that the word comes from... economic theory [ah?]
Q: What other pos-neg datasets are around?
A: The fact that the dataset has more neg than pos words has an effect. It is an easy one. "the same one used by the Deep Averaging Networks paper"
Q: I had a question about the humans! Who are they?
A: Students score the words. Sometimes Mechanical Turk workers. It is not part of the scientific description where/when/under what conditions these 'separations' are made (NM: "how is this knowledge produced")
The gold standard dataset combined with the positive/negative one is used to train the algorithm.
PP: Interesting that there are so many relations between vocabulary and locations of economy.(and where "experts and software coders" get their income in the end)
Trying to understand that there are more negative than positive words. Now: the probability of negativity is much larger!
(linear regression for those who forget college statistics : https://en.wikipedia.org/wiki/Linear_regression)
At what point does the multi-dimensional become a binary, positive or negative? Where is the line drawn?
Different narratives mix: statistics, linguistics, mathematics. It is in-between all these worlds. NOT having vocabulary means we can try to understand in a different way.
Reference point: the "majority baseline". 70%
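A rough sketch of what that majority baseline means: the accuracy you get by always predicting the most frequent class. With far more negative than positive words in the lexicon, that comes out around 70% (file names hypothetical, as used elsewhere in these notes):

# the majority baseline: always predict the biggest class and see how often you are right
pos = open('positive.txt').read().split()
neg = open('negative.txt').read().split()
majority_baseline = max(len(pos), len(neg)) / (len(pos) + len(neg))
print(round(majority_baseline, 2))  # around 0.70 when negative words dominate the lexicon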
Appearance of French researchers on the scene -- anti-Google-dominance: Qwant https://www.qwant.com/, a European GPS, Europeana. And government-based projects around language.
(machine learning geopolitics: David Cournapeau of http://scikit-learn.org/stable/ has been working for https://en.wikipedia.org/wiki/Enthought and now for https://www.cogent.co.jp/)
A lot of different intentions around machine learning: making money, making science, changing the world...
a mix of hard and soft science.
Q: AI is driving cars, so the machine needs to learn to see as a driver. What is the application here?
A: Marketing, politics, ... decision making in situations that go super-fast. Decision-helpers?
Amir: sentiment analysis is difficult for now on WP, but they are testing and trying.
Tensorboard presentation
allows you to visualise text datasets
PCA: principal component analysis: shows differences (variances, to be more precise: https://en.wikipedia.org/wiki/Variance)
a method for dimensionality reduction
you can watch PCA outcome - algorithm giving general overview
or you can give an axis between 2 words you're interested in, f.ex. human/creature in Frankenstein
difference between looking at 'terrorist' and 'islam' - different angles on the same word
If you look at the nearest point, it comes from a body of text that is not far "geographically" from the word you are looking for (actually not geographical but mathematical; it has nothing to do with physical distance).
mixing metaphors:
"distance is just a mathematical way of measuring difference"
Ideas/things to work on:
what would a non-binary list (?) of sentiments (?) be [domain specific? context aware?]?
develop a set of bias-enhancing techniques (what bias?)
understanding how conventions are iterated in machine learning, point at where/how this happens
think (more) about the importance of binary separations/2D flattenings
figure out how this graph was made https://www.nytimes.com/interactive/2017/11/07/upshot/modern-love-what-we-write-when-we-write-about-love.html and why it looks so much like other graphs we know https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3783449/figure/pone-0073791-g003/
understand what is meant by separating 'prejudice' from 'stereotype' [good bias/bad bias?]
- What is meant with bias? How do we criticise it? And how do we use it / can we use it? [does this link to: "understand what is meant by separating 'prejudice' from 'stereotype'"]
- Look at other biases -- which other binaries (is this what we tried to test with the Paternalist classifier?)
How could we work better with stereotypes? Not just try to get rid of them. For example a tool for discourse analysis. For example fake news writers know which words to use to touch a certain public.
Try a different lexicon - other binaries
- Explore the dimensionalities of these tech, see what comes out of it
- look at model with balanced lexicon
- take model as starting point for literary creation, not just diagnosis.
- how is the dataset constructed, how do we construct knowledge and biases, and what can we do with this idea
- to give this algorithm & dataset a voice
- what can 'combined knowledge' be? a combination of images, poetry
- bias and a connection to your own memory, a way to create your own matrix of positivity and negativity
- Reclaim the biases, instead of being segregated by them. Being in power of the biases. [trying to understand what 'oblique machine learning' could mean?]
- Or without computers: which kind of algorithm are we training ourselves? Taking the word algorithm by giving it a body outside the computer.
- Reference to https://arxiv.org/abs/1608.07187. Educate about big-data processing, how to do that outside specialist circles. Create conditions to speak about them. As a way to speak about opening biases up.
- understanding the way labelling 'good faith' and 'bad faith' vandalism is made transparent (and discussable?) in ORES
Just browsing an article about the "utility" for real life, a consumerist application to diversify the lexicon of negative-positive tags; just read the beginning (:-( )
https://blogs.sas.com/content/sgf/2017/03/01/how-to-extract-domain-specific-sentiment-lexicons/
?bias?
ML without bias is not possible
- Donna Haraway, in the context of philosophy of science. Knowledge production from a feminist perspective. Economies are always conditioning the knowledge being produced. Situated knowledge would express the conditions that were at stake. The bias is when the situatedness is not expressed. Bias is problematic when it's not visible to the reader.
Including doubt and visibility in order to catch negative implications.
It's necessary to make fast biased judgements all the time.
Going to check and correct the bias, because that's what we would do with humans who make mistakes.
The situated knowledge experiment relates to the paternalism experiment from Cqrrelations.
Reference to filter bubbles online, alternative news.
'embracing' the bias? reclaiming the bias? What does this mean?
Instil(?) some doubt in how the data is selected for these algorithms.
Neutrality is not neutral. It's important that people gather and resist this 'neutrality'.
Reclaiming the bias means to open up the racist or sexist component. Instead of being the subject, to take it into your own hands.
Use this model of ML to create algorithms which filter and arrive at conclusions. How do we educate them?
How do we contribute? To which algorithm are we contributing?
How to reclaim the algorithm and its problem?
If you look at the captions of Google, you see what they are doing.
A lot has to do with advertisement and marketing.
If you want to influence the data, you could campaign globally, for instance, to stop practicing tourism. Because tourism reforms the data a lot. Maybe the data will be more relevant after.
An idea is to generate a lot of tweets (with bots) from a particular perspective to tweak the twitter-text-database. (yes)
Do we need to develop our own tools to counter-act the targeted machines? Bots to counter-click?
Lost in scales. Between the local, global, very near things or very far. We can alter this automatic way of relating to big-data. It's not about denouncing the system, but about reformulating your own.
You can also relate to the method. Because even on a smaller scale, a reformulation relates to the same method.
acts of reclaiming
- make visible situatedness
- exposing the nature of the documents and files we use
- describe the collective of people that is present now
main problem is that these word-embeddings are used a lot, and considered to work well
CommonCrawl is an effort that tries to make the data available for people other than Google.
It would be an idea to document how CommonCrawl is created, and under what conditions.
You need conversations in order to know how to debias.
develop a set of bias-enhancing techniques (what bias?)
How to situate a dataset?
ORES, not looking at a user making an edit, but at the edit itself.
Looking at other types of metadata around the edit that is made.
Generating different types of metadata that don't relate to the user.
For example: ORES is looking at 70 features of a comment, and plots it in a space of 70 dimensions (a sketch follows after this list):
- the comment contains erased words
- if it is increasing the number of characters
- does it contain informal words
- does it contain swear words
- is the comment made on a protected page?
As an alternative way to define the surrounding space of one word.
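A sketch of asking ORES for scores on a single edit. The endpoint and model names ('damaging', 'goodfaith') are as documented for English Wikipedia; the revision id is made up:

import requests

# ORES scores a revision (an edit), not the person who made it
url = 'https://ores.wikimedia.org/v3/scores/enwiki/'
response = requests.get(url, params={'models': 'damaging|goodfaith', 'revids': '123456789'})
scores = response.json()['enwiki']['scores']['123456789']

print(scores['damaging']['score']['probability'])   # how likely the edit damages the article
print(scores['goodfaith']['score']['probability'])  # how likely it was made in good faith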
"Another Analytics is Possible" - Cristina
"Staying with the trouble." - Donna Harraway (present time) ( from the book 'Staying with the Trouble' / EXPERIMENTAL FUTURES : TECHNOLOGICAL LIVES, SCIENTIFIC ARTS, ANTHROPOLOGICAL VOICES
A series edited by Michael M. J. Fischer and Joseph Dumit / Duke University Press
"Staying with the bias" - Pierre
"A real beer for a virtual world" - beer glass"
"Cloud is positive now" - a false positive from the 'How to make a racist AI without really trying' tutorial
"the knowledge is not an object, it's a process" - Nicolas
"It smells good" - a comment on industrial food
"we are still at beta level" "and at some day we will grow up" - an often heared expression in Machine Learning reactions on painful errors
"Shut up and train" - Donna Harraway (Reagan era)
"Run fast bite hard" - Donna Harraway (Bush era)
Final discussion!
Juliane: play with lexicon, look at script
cut negative text in 2 using random function
not trusting random function of Python
the computer is too slow to process the balanced lexicon
An will upload script to gitlab
small contribution of Juliane: randomly cutting the negative text of the lexicon in half
import random

# read the negative lexicon, one entry per line
with open('negative.txt') as f:
    lines = f.read().splitlines()

longueur = len(lines)
moitie = longueur // 2  # integer division: half of the list

# pick a random half of the lines, keeping their original order
demi = [lines[i] for i in sorted(random.sample(range(longueur), moitie))]

# append the halved lexicon to a new file
with open('moi2.txt', 'a') as f2:
    f2.writelines(["%s\n" % item for item in demi])
BIAS CLUB
300 dimensions on Bias
Could you have a non-biased dataset?
How could you situate a dataset?
Getting biases fixed will be bias (aaaaaargh. iteration) Also, speaking about bias is biased
CommonCrawl is difficult to discuss, as the dataset is created for universal use. A case like Wikipedia is much more interesting, perhaps.
Wikipedia is a good platform to discuss.
Working with communities is easier than working on a global scale because there are people to respond to the algorithm immediately and 'correct' it according to the community's set of values
what about the mirror image - "algorithms reflect human bias"
should we look at algorithms as a mirror, and what does it mean to 'get' a bias out of the data?
disturbed by the paper/image of the mirror
the image of the mirror implies that by removing the bias from the algorithm, you can remove the bias from human language
It puts the problem back to society: society should deal with it. And it separates ML from society. [it relates to the 'infant' argument]
it takes responsibility away, doesn't look at process (?) - it doesn't look at machine learning as an iterative process
the environments in which these technologies are produced, are not neutral at all
using this metaphor, the possibility to look at the mechanisms and processes that generate knowledge is considered redundant
backstory to the paper, relation to a Microsoft team that published a similar story, but published earlier
"scooping"
discrimination and bias become currency in machine learning communities
taking the politics out of the discussion by making it a 'syntaxable' issue? re-marginalizing knowledge of discrimination/racism and how it works.
cfr Alison Adam
the problem with using comparisons between languages with gendered and non-gendered articles is that it suggests syntax is everything
Using the non-gender syntax in sentences doesn't imply that there is neutrality of the gender referred to.
The problem is big!
A formatted construction of technology.
Operational processes to create technology.
One thing is .. to stay with the bias. Or could we conceive another option? Staying with it ... what does it mean when you stay with ... weapons? Can we start again? Or is that impossible?
How to show "everything that is around the bed"
categorization is a problem of construction
More data? More computing power? More dimensions?
Response Google to the image of two black people as a gorilla: "We’re appalled and genuinely sorry that this happened. We are taking immediate action to prevent this type of result from appearing. There is still clearly a lot of work to do with automatic image labeling, and we’re looking at how we can prevent these types of mistakes from happening in the future.”
Another response from Google when it came to tagging the words 'gay' and 'Jew' negatively 25/10/2017: https://motherboard.vice.com/en_us/article/j5jmj8/google-artificial-intelligence-bias (sorry for motherboard article) "We dedicate a lot of efforts to making sure the NLP API avoids bias, but we don't always get it right. This is an example of one of those times, and we are sorry. We take this seriously and are working on improving our models. We will correct this specific case, and, more broadly, building more inclusive algorithms is crucial to bringing the benefits of machine learning to everyone."
Bias is a political position?
word associations and the 'goodness of something'
The problem of the mirror is not perhaps reflected in the way it reproduces language, but in the conditions around it (annotations, crawling, etc).
mirror becomes a problem when it is considered as a truth/oracle
mirror IS NOT truth
Do we know what the internet is? Juliane has been extracting the GloVe database ... it takes a lot of time, so it is a black box.
strong becomes stronger.
Miroir déformant. Deforming mirror.
So, is there any space for who/what is not connected?
So in what way is this special/specific from other systems of power/knowledge production?
Another perspective:
Look at this technology not as a diagnosis but as a generator.
How to use pos-neg words for/in text generators.
Looking at text generators.
http://publicdomainday.constantvzw.org/#1941
Looking at where words are used in a different way. Negative words: a lot of chemistry, nature, ... so let's look at them and see if it comes out differently?
How would the sentiment-thermometer speak in graphs and example print outs?
See what words fall out of the pos/neg wordlists, because they are not in GloVe, or because they were not in CommonCrawl?
Could we use a generative act to create an act of feedback? Biased feedback?
An idea to adjust the word-embeddings: if a word should be more positive, find a way to rewrite the word-embeddings (a speculative sketch follows below).
Is CommonCrawl dealing with different languages? Can we compare the outputs from them?
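One speculative way to read the idea of rewriting the word-embeddings: compute a positive-negative direction from the lexicon and nudge a word's vector along it, then look at its new neighbourhood. A sketch only; the step size, file names and the word 'cloud' (the false positive from the tutorial) are arbitrary choices:

import numpy as np
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format('glove.6B.300d.w2v.txt')  # hypothetical converted file
pos_words = [w for w in open('positive.txt').read().split() if w in vectors]
neg_words = [w for w in open('negative.txt').read().split() if w in vectors]

# a 'sentiment direction': from the average negative vector towards the average positive one
direction = np.mean([vectors[w] for w in pos_words], axis=0) - np.mean([vectors[w] for w in neg_words], axis=0)
direction /= np.linalg.norm(direction)

# push 'cloud' a little towards the positive side and look at its new neighbourhood
adjusted = vectors['cloud'] + 0.5 * direction
print(vectors.most_similar(positive=[adjusted], topn=5))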
ORES
70 features for ORES project
describe the edit not in terms of the person who made it, but in terms of reducing/expanding the number of characters, whether it contains swear words or not
Paternalist classifier Cqrrelations http://snelting.domainepublic.net/affiliation-cat/constant/the-annotator
search for word clouds around it
Moroccan misspellings cloud
Good faith / Bad faith vandalism
If you think about bias as something that you want to take on, then you need to distinguish between what is 'good' bias and what is 'bad' bias
difference between stereotype - prejudice
connecting it with the exercise from Cqrrelations, The Annotator
Legal situation around the page of Chelsea Manning
no consensus whether the page should be named Bradley or Chelsea Manning
'court' system for community to decide
bias-consciousness training
by-as
SLOGANS FOR BUILDING THE ALGOBIAS CONSCIOUSNESS
"Another Analytics is Possible" - Cristina
"Staying with the trouble." - Donna Harraway -see ref above
"Staying with the bias" - Pierre
"A real beer for a virtual world" - beer glass
"Cloud is positive now" - a false positive from the 'How to make a racist AI without really trying' tutorial
"the knowledge is not an object, it's a process" - Nicolas
"It smells good" - a comment on industrial food
"we are still at beta level" "and at some day we will grow up" - an often heared expression in Machine Learning reactions on painful errors
"Shut up and train" - Donna Harraway -see ref above
"Run fast bite hard" - Donna Harraway -see ref above
"Everything starts with a Y" - YSL advert campaign 2017
"That's Y" - YSL advert campaign 2017
Isabel would like to share this song with all of you: https://www.youtube.com/watch?v=V1bFr2SWP1I
see you at http://algolit.net !