Notes workshop Nicolas Malevé: Variations on a Glance
http://constantvzw.org/site/Variations-on-a-Glance.html

In the framework of Algoliterary Encounters Nicolas Malevé proposes a workshop on computer vision. 
Language, words, writing, descriptions and formulations are intimately linked to the way the millions of images on the internet are organised. Over the years, algorithmic techniques have evolved, creating a new articulation of the relations between vision, information and knowledge. The recent breed of algorithms that power computer vision makes heavy use of machine learning techniques. Like other algorithms, machine learning algorithms need to be programmed, but they also need to be trained. Contemporary artificial intelligence aims to “teach” machines the cognitive abilities of humans.
But how do computer scientists understand human vision, and how do they translate it into a concept they can work with? They are interested in a very specific aspect of human vision: the glimpse, the glance, the moment in perception that allows immediate decisions to be taken, a near-reflex perception.
Nicolas introduces a method to assign relationships between images and words, as described in « What do we perceive in a glance of a real-world scene? » (Fei-Fei et al., 2007). By proposing a variation on this method, he shifts the focus of the experiment: not so much to collect quantitative data from the participants, but to discuss with the participants what is at stake in the experiment and how it models vision.

Notes from yesterday's lectures: http://pad.constantvzw.org/p/algoliterary.lectures
Notes from workshop Algolit: http://pad.constantvzw.org/p/algoliterary.workshop.collective-gentleness

--

The same kind of relations are at work when 500ms are given to interpret the images.
But the experiment shows that the time constraint is actually not a problem.
interesting to note that copyright is waived when it comes to gathering material to train algorithms, but only for academics and commercial researchers (so not for artists, activists, ...)

cultural assumptions from the algorithm: it is not just a bedroom, it is a hotel room
economy & social world around machine learning
a lot of manual work
LabelMe as a tool to include cultural knowledge into machine vision techniques (http://labelme.csail.mit.edu/Release3.0/)
the work of an annotator

The computer analyses the configurations in the labelled shapes and connects them to the label.
the annotation process needs to happen before the code can be run
rarely discussed as work, while it gives maths a root in the daily world / cheap labour
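
A minimal sketch of what such a labelled shape could look like as data - a polygon outline paired with a label, in the spirit of LabelMe (field names and values here are illustrative, not the actual LabelMe schema):

```python
# Illustrative only: an annotator's polygon outline paired with a label.
# Field names are hypothetical, not the exact LabelMe format.
annotation = {
    "image": "bedroom_042.jpg",
    "objects": [
        {"label": "bed", "polygon": [(120, 310), (480, 300), (485, 520), (118, 530)]},
        {"label": "lamp", "polygon": [(40, 80), (95, 78), (90, 210), (42, 212)]},
    ],
}

def bounding_box(polygon):
    """Reduce a hand-drawn outline to the rectangle a vision model typically consumes."""
    xs, ys = zip(*polygon)
    return min(xs), min(ys), max(xs), max(ys)

for obj in annotation["objects"]:
    print(obj["label"], bounding_box(obj["polygon"]))
```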

"sometimes it seems that algorithms exists high above the earth, but no, the actual work is very much on the ground"

Notes on Image Annotation - Adela Barriuso & Antonio Torralba, 2012
https://arxiv.org/abs/1210.3448
Torralba, pioneer of image sets for annotation
Adela Barriuso, shop owner in Mallorca; during the low season she does annotation work - she was a champion of the annotation world / nowadays it would be considered normal
Now a lot of ML work is done in a commercial context, but LabelMe was still developed in a research context
Adela Barriuso: labeling the images during work in a shop "gives you a different perspective on the act of seeing"
"you are especially bothered by occlusions"
occlusion is a central theme, because labelling contours asks for full objects

do you then label a bed as "a part of a bed"? There are many papers and discussions about this, and no consensus. 

SUN database (ancestor of these databases)
http://groups.csail.mit.edu/vision/SUN/
SUN = Scene UNderstanding [oh!]
it is happening at huge scale
make annotations & adopt a classification for the labels (making sense of the labels)
-> this questions the notions of image + words + classes...

ImageNet
http://image-net.org/explore
current state of the art
linked to WordNet
WordNet as a system to standardize connections between objects, words, etc.
The limit of the dataset is the limit of what can be said in the tools that are built on top of it.
14 million images
found on the internet with crawlers > algorithmic curation
relies on what Google thinks, e.g. that an image is 'a flower'
the algorithm is already inside the selection (through the crawler)
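
ImageNet categories are WordNet synsets, so the hierarchy can be inspected directly. A small sketch using NLTK's WordNet interface (requires nltk and its wordnet corpus; the exact chain printed depends on the WordNet version):

```python
# Sketch: walking up the WordNet hierarchy that ImageNet categories hang from.
# Requires: pip install nltk, then nltk.download('wordnet').
from nltk.corpus import wordnet as wn

synset = wn.synsets("chair")[0]   # first sense of 'chair'
print(synset.definition())

# Each hypernym step is a more general category that an image labelled
# 'chair' is implicitly attached to.
node = synset
while node.hypernyms():
    node = node.hypernyms()[0]
    print("->", node.name())
```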

Visual Genome
http://visualgenome.org/
on top of Imagenet
33,000 unique workers, doing annotations and validating the annotations of others
4 cents ($) per annotation (Adela Barriuso added 250,000 labels)
if you want to make money with this, your rhythm will need to be fast
why and how is it possible that computer vision needs to rely so much on this labour system - that was the question that started Nicolas's PhD research
how to relate the 4 cents to a gold standard, two ways of referring to the same thing
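
Back-of-the-envelope arithmetic on that rate (the target wage below is an assumption for illustration, not a figure from the workshop):

```python
# What rhythm does $0.04 per annotation impose?
rate_per_annotation = 0.04   # dollars, as quoted above
target_hourly_wage = 10.00   # hypothetical target wage, in dollars

annotations_per_hour = target_hourly_wage / rate_per_annotation
seconds_per_annotation = 3600 / annotations_per_hour
print(annotations_per_hour, "annotations/hour ->", seconds_per_annotation, "seconds each")
# 250 annotations/hour, i.e. about 14 seconds per annotation:
# the time constraint is built into the pay rate.
```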

the Gold standard vs the 4 cents
Li Fei-Fei, the person behind ImageNet - superstar of computer vision
you need a common benchmark to judge algorithms
ImageNet = benchmark
but stays always behind the scene
"The real star is Google."

Fei-Fei in the beginning: did experiments without knowing what they would do with the practices/information
also the intention was not focused on machine learning, but came more from a psychology context
'seeing' - a hot topic in psychology (she was part of the psychology team that worked on this project)
immediate perception: what can you say of a building when you open the door and look
description after half second

part of the experiment is to explore what a half-of-a-second is (500ms)

experiment the score
an experiment dating from 2007
two stages
cfr paper that is on the table, copy for each one of you

nowadays it is more difficult to get "naive test subjects" because administrative bodies require you to describe in detail what will happen and ask for permission

2007: cameras appeared that had a built-in face-recognition function
PTs: presentation times

stage 1: 
22 students from California Institute
unpaid job

stage 2: 
scoring of the descriptions
this is a paid position executed by 5 volunteer students from schools in the LA area (18 - 35 years old)

this experiment inspired people who wanted to build models for computer vision
it breaks from previous experiments in that:
    Micro-timing: perception is not uniform during the first 500ms of vision
    Images: they come from Google images
    Free recall: the subjects describe the images with their own words vs multiple choice before
    Taxonomy: a hierarchical tree of terms is produced and used to evaluate the descriptions
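
The taxonomy-based scoring is not detailed in these notes; as a toy sketch of the general idea, a free-recall word can be credited against a hierarchical tree of terms, so that a response naming a more general category than the target still counts for something (tree and credit values below are invented for illustration):

```python
# Toy sketch of taxonomy-based evaluation; tree and scores are illustrative,
# not the rules of the 2007 paper.
taxonomy = {   # child -> parent
    "poodle": "dog", "dog": "animal", "cat": "animal",
    "animal": "object", "chair": "furniture", "furniture": "object",
}

def lineage(term):
    """The term plus all its ancestors in the toy tree."""
    chain = [term]
    while chain[-1] in taxonomy:
        chain.append(taxonomy[chain[-1]])
    return chain

def credit(response, target):
    """Full credit for the exact term, partial credit for naming an ancestor."""
    if response == target:
        return 1.0
    return 0.5 if response in lineage(target) else 0.0

print(credit("poodle", "poodle"), credit("animal", "poodle"), credit("chair", "poodle"))
# 1.0 0.5 0.0
```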

images from "the internet", not especially created for the experiment
researchers: "it means the images are less biased"
there is an assumption that the images from the internet are a product of collective knowledge (do not contain bias?)
'it is all free but in the end it must match'

we will replay the experiment with a difference
with small modifications in different steps
we will try to sense where it resists

Questions:
Vision? 
The context is cognitive science, computer science, optics, neurology
Within the cognitive science discourse: the vision community, which excludes artists, media theorists, etc.
interesting to note that copyright is waived when it comes to gathering material to train algorithms, but only for academics and commercial researchers (so not artists, activists, ...)

The worlds of computer vision and of cognition research refer to each other to validate their work.
"cognition works this way, as algorithms work the same way" - and the other way around

===END OF PART ONE===

EXPERIMENT
stimulus = what is happening on the screen
the rectangle helps to focus / a cross as fixation point

orienting the gaze to the task
cancellation image, "an image to cancel the memory" (blue rubber) (sometimes turns green)

duration of the exposure to the image 27ms to 500ms

looking at vision as if looking were a camera operation
this set-up is an 'analog distributed camera'
there is a script to create the random noise images
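
The noise script itself is not in these notes; a minimal sketch of generating such a cancellation mask with numpy and Pillow (dimensions arbitrary):

```python
# Minimal sketch of a random-noise mask used to "cancel" the visual memory
# between stimulus and recall. The original script is not reproduced here.
import numpy as np
from PIL import Image

width, height = 800, 600
noise = np.random.randint(0, 256, size=(height, width, 3), dtype=np.uint8)
Image.fromarray(noise, mode="RGB").save("mask.png")
```

Presentation times are necessarily quantized to the screen's refresh: at a hypothetical 75 Hz display, one frame lasts about 13.3 ms, so ~27 ms corresponds to roughly two frames.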

27 milliseconds
the maximum refresh rate of screens in 2007! [so vision is measured against the capacity of the machine]

("don't worry" the images will be different this time)

It's not about being good or bad. 

keep in mind your level of comfort in doing this within this setup

(Ready?) (X_X)

27ms is much too fast!!

first: individual descriptions of 3 images

then: descriptions in a group of 3

Interesting how the descriptions float between three perspectives.
There is also much more cultural interpretation included in the descriptions. 

experiences of the experiment
normally the experiment is executed individually in a dark room, excluded from any other form of input

different reactions when in a group
less authentic in a group
reports of others are built into your memory
group discussions create false memories

interpretations depend on time and on the group

always a mention of the absence of people in a picture

in collective descriptions, the construction of the image became more important and included more details, lighting, quality, position of the photographer
could also relate to time, and getting used to the procedure, could be a combination
also because we are not looking at vision, but representation. We're not testing our eyes. 

False memories started forming in a collective setting
a detail was not noticed in an individual setting, but when it was mentioned, it was possible to form an opinion
triggered to think of aspects

individual part
*the training part was important, I went first
*afterwards you realise there are no people in the image
*the more time passes, the more expectation there is
collective
*convinced opinions
descriptions don't come across as correct or not
the description defines the memory, but also models the memory
the description of the image became better, but no longer relates to what you saw

different opinions are all correct
you could recognize each other's interpretations in the image afterwards

individual: focused on specific elements (focal)
collective: focus on peripheral (?)
you think more about the story

concept of vision
concept of imagination (especially when doing collective observations)

the role of errors was different
collective: errors as a way to identify the ambiguous elements
errors signify an ambiguous image

more interesting to do it collectively

it was fun to do it alone
it was not fun to do it alone
in general, my visual observation is not so good, lower than most people's
My landscapes are very vague. I can describe sounds and conversations.
So I know, I probably have it wrong.
I just don't see it, I can't even make anything out of it.

Speed of transcription can influence.
Is this experiment done with sounds?

If I were typing myself, I would write lists of keywords.
The role of the transcriber is very active.

Relation to training a machine
The next step is to reduce the descriptions to a list of words to feed to the computer. 
In the original example, nearly 2000 descriptions were made, and then reduced to 60 words.
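
A hedged sketch of that kind of reduction: free-recall sentences tokenized and collapsed into a small shared vocabulary of frequent words (the descriptions, stopword list and vocabulary size below are illustrative; the original 60-word list is not reproduced here):

```python
# Sketch of the reduction step: many free-recall descriptions collapsed into
# a small vocabulary of frequent words. All data here is illustrative.
from collections import Counter
import re

descriptions = [
    "a dark bedroom with a bed and a lamp, no people",
    "maybe a hotel room, a bed, curtains, very dim light",
    "an empty room with a large bed",
]
stopwords = {"a", "an", "the", "with", "and", "no", "very", "maybe"}

counts = Counter(
    word
    for text in descriptions
    for word in re.findall(r"[a-z]+", text.lower())
    if word not in stopwords
)
vocabulary = [word for word, _ in counts.most_common(5)]
print(vocabulary)   # the most frequent remaining words, e.g. 'bed', 'room', ...
```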

For ImageNet, they applied this experiment to all the images at least 3 times.
They didn't use the different intervals.
They use Mechanical Turk workers without any time constraints, but the time constraint is implied by the work conditions. Fast work is the only way to make money.

Thesis: there are a series of relations that you establish, and you need to deal with its consequences.

But the experiment shows that the time constraint is actually not a problem.

When there is no agreement, the image falls out of the process.
Reduction of language, reduction of operable images. Models are learning from clichés/stereotypes/the most common images (and words)
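
A sketch of that agreement filter: an image only stays operable if enough annotators converge on one label (the threshold is illustrative, not taken from any specific dataset):

```python
# Agreement filter: keep the majority label only if agreement is high enough,
# otherwise the image falls out of the process. Threshold is illustrative.
from collections import Counter

def consensus_label(labels, min_agreement=0.66):
    label, count = Counter(labels).most_common(1)[0]
    return label if count / len(labels) >= min_agreement else None

print(consensus_label(["bedroom", "bedroom", "hotel room"]))       # 'bedroom'
print(consensus_label(["bedroom", "hotel room", "living room"]))   # None -> discarded
```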

We created a lot of bias during the experiment
Discussion of bias -- producing biased descriptions and unbiasing it after
Root of the word bias:
related to the 'grain' used in textile
when you cut on the bias, you cut on ???
bias = cut diagonally across the grain of the thread (which makes the textile more bendable, used to finish round edges for example) http://www.fabrics-store.com/blog/wp-content/uploads/2016/05/bandbinding_body10.jpg

bias as (not) a on/off or good/bad thing

Threadsmagazine.com
"To become a grain rebel, you first need to identify and understand fabric grain (...) that is truly unique"
http://www.threadsmagazine.com/2008/11/23/go-against-the-grain

recognizing bias as part and parcel of the plasticity of the description, so how to work WITH bias.
"To become a grain rebel, you first need to identify and understand  fabric grain (see the box below). Then take a fabric and tug it in all  directions to test its stability and stretch. Once you know how  different fabrics stretch and drape, you can start playing with grain to  make a garment that is truly unique."

(sidenote: textile techniques are generous sources of metaphors. it seems to me that the two main metaphors in technology are to war machines and weaving machines. and the brain)
"if we want the description to be plastic elastic, we need to work with a bias"

3 ways to approach the experiment:
1. Follow the grain direction: "it is science. It is tech"
2. crosswise, against the grain: it is bullshit, old manipulation, don't engage with it (AI will never be as good as actual humans)
3. on the bias: "not technically a grain, it refers to any line diagonal to the lengthwise and crosswise grains"
going at 45 degrees to the problem (?) So take the bias into account, but go with it, use its plasticity without erasing it -- use its strength.

'learn to ride/write on the bias, make it productive, develop oblique relations, not just frontal ones'
find a way to not follow it as it is
- ... objective ....
- stand outside and looking at it, and say that it is biased and therefore bad
- complex way: accept bias

Take bias into account as an interesting dimension.

- follow the grain
- 45 degrees
- "collapse of air pockets" Allows a square piece of fabric to morph into a diamond shape

ref to conversation with Mike Kestemont before this event: machine learning is bias, the bias is doing the work [but how to process multiple biases into a system that often outputs one result? And
also, how racism would be a strength as a bias ... so very touched by this 'third way' but not sure how/where to begin -- maybe it is about opening up spaces for conversation (NM: "making a variety of responses possible"), rather than to 'make efficient', ref. Zach Blas. But how to be fed from this conversation if you are only confronted with a machine learning output that is prepared for your 'profile', form of isolation/segregation that is immediately operative as a consequence]

MK: "precise bias is precise tasks" (?)

Nicolas trying to formulate a critique, by engaging into a process. Not a truth from a distance. That is why we need to experience the process, like today.
to find what is possible to do with it, you need to do a session like this; only reading the paper is not enough. You wouldn't experience the richness of the descriptions.

http://www.zachblas.info/works/facial-weaponization-suite/
http://www.zachblas.info/writings/facial-weaponization-suite/
http://median.newmediacaucus.org/caa-conference-edition-2013/escaping-the-face-biometric-facial-recognition-and-the-facial-weaponization-suite/
Subtitles: http://possiblebodies.constantvzw.org/inventory/?023

paid students have to say for each description:
1. whether it is correct
2. what class the image is part of
3. only the keywords stay

[hands out a description of the process of further processing an image after labeling]

replacement exercise: only words in the pre-defined vocabulary are kept. Abstraction process = reduction process. Needed to train the machine. This is less precise than ImageNet; this experiment was more crude.
This is not WordNet but there are similarities  https://en.wikipedia.org/wiki/WordNet

Eleanor Rosch, work on basic categories https://en.wikipedia.org/wiki/Eleanor_Rosch !!! http://psychology.berkeley.edu/people/eleanor-h-rosch
example: parents and a teenager; the teenager comes back after going out; parents: what did you do?; teenager: 'something'. The parents do not expect to hear many details. There is a level of expectations. Basic categories belong to the economy of language. You describe something in the way that takes the least effort.
expecting certain levels of description, economy of language -- less effort to describe. First: chair, then: it is made of plastic.
Ref to information theory, where the least amount of coding effort is connected to the letter 'e' (most common letter).
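
The information-theory point can be made concrete: the more frequent a symbol, the shorter its ideal code, roughly -log2(p) bits for a symbol of probability p. A small sketch counting letters in an arbitrary sentence:

```python
# Frequent symbols get short codes: an optimal code spends about -log2(p) bits
# on a symbol with probability p. The sample text is arbitrary.
import math
from collections import Counter

text = "the quick brown fox jumps over the lazy dog and then sleeps near the fire"
letters = [c for c in text if c.isalpha()]
counts = Counter(letters)
total = sum(counts.values())

for letter, n in counts.most_common(5):
    p = n / total
    print(letter, f"p={p:.2f}", f"ideal code length ~ {-math.log2(p):.1f} bits")
# frequent letters like 'e' end up with the shortest ideal codes
```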

Making difference between harp, piano, saxophone ... they are all instruments.
great difference between what goes through and what is left out
sensuous experience is reduced to 5 elements, while furniture is very extended (sink/toilet)

Because the distribution of specificity is out of balance in the graph of categories, there is already a bias in the categorization. Specifically, everything that is related to the senses/body is already reduced. Reasons for this? Cartesian split.
COCO database (?)

It's not an exception that we have a small vocabulary for body-related experiences / abstractions (...?)
WordNet: 'transsexual' under 'anomaly' -- there are specific visions at work, obviously.
Q: isn't there any organisation that oversees this?
A: This is the widespread database used in MANY applications. There are not a lot of people addressing the problems with vocabularies. 
It is also often related to what is available, what starts to circulate. Different contributions from different universities. Long history, slowly growing very big.

WordNet is the standard classification for computing
its thickness becomes a reason for many people to see it as something neutral

How you describe something ..., is complex. The filters are very crude.
Wordnet becomes the filter of exclusion of physical impressions and ambiguities.
An: some things have not been updated since 1985
Q: what about law, should this be surveyed?
Should it first be in line with human rights before it can be included in the system?
This example is terrible, but it is a sign of an even greater horror of flattening. Showing and hiding. Bias in itself is not the problem, but how can we engage with it!

The process is about flattening. And highlighting certain things.
The process is about biasing. The question is what is the nature of a bias. 

Perceptual performance. Detaching perception from the subject, and attaching it to the taxonomy. The 'perceptual performance of a term', but never of the people.

perceptual performance of a term. 
not the perception of people, but in the paper they refer to perception as a response.

response - stimulus (so it is a mechanical view of vision, again). It means 'rocks' have more perceptual performance than 'gay'.

only when the response is translated into a list of terms, perception comes in.

chair, rock have a different grade of performance ....
emphasis towards unambiguously described objects

so when recognizing something: unambiguous objects are easier to describe than ambiguous ones
think of the perceptual performance of the words 'arab', 'gay' versus 'chair', 'table', 'AirBnB room' ;)
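
The paper's exact definition of 'perceptual performance' is not recorded in these notes. As a hedged stand-in, one can read it as: how consistently a term shows up across subjects' responses to the same image - a property of the term and the taxonomy, not of any individual perceiver:

```python
# Hedged stand-in, not the paper's metric: the fraction of subjects whose
# response for an image contains a given term.
def term_performance(term, responses):
    hits = sum(1 for words in responses if term in words)
    return hits / len(responses)

responses = [
    {"rocks", "beach", "sea"},
    {"rocks", "coast"},
    {"stones", "water", "people"},
]
print(term_performance("rocks", responses))    # ~0.67: high agreement
print(term_performance("people", responses))   # ~0.33: low agreement
```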

Q: what are the applications of this biasing system?
A: security, attention economy, advertising (google ads)
HansL: economical use of language, but the difficulties become obvious in security-related usage
Facebook advertising is an interesting case study to relate to.

NM: "ontological gymnastics"
taking perception out of the human, and attaching it to classification?
HansL: it is correct from an information theory point of view: the simplest code to get the most efficiency out; these simplified categories make sense.

Everything goes towards/matches an economy of language, and also an economy of science (note: earlier AM mentioned that only academics and commercial researchers have access to this amount of info).
Where is the knowledge?

knowledge in the room [indoor household]
knowledge in the hands of the academic researcher

the annotators are anonymized so they cannot claim the knowledge produced.

NM: "the knowledge is not an object, it's a process"
if you want to perform the knowledge, you need to recreate the situation.
How can the room migrate in the different settings where you want to use this knowledge?
We just don't care about the room.
Or, you need to migrate to (?) the room once you want to access and use the knowledge again. 
[situatedness]

reproducing conditions of work, singular subjects, ... it is how algorithms become concrete / matter.

And so you can think of algorithms in terms of 
*setups
*practical conditions
*labour

if we target the researcher, we miss what is necessary to create/make these algorithms?

Femke: In the narration of these technologies, it is often said that with more data, computing power and so on, we will overcome these threats.
Can we follow the dataset mantra?
NM: I don't think the quantity of data can overcome this.
the economy of language is not going away

all the circumstances in which relationships existing here are reproduced, are proliferating
we could address the different settings in which the knowledge is produced

For example: What was different between doing the experiment of this afternoon alone or together?
find different ways to find convergences / keep differences/diversity (?) in the descriptions

Hans: partially disagrees. Greater computing power means other levels of language. More complicated models of language.
Working with letters instead of words is in a way similar to working with simplified categories instead of many categories.
We have to find ways to assume political agency.
Always needed: find ways to take responsibility for making these decisions. How do you take your responsibility? And not delegate it to the system.

filter & bias are necessary to be able to express something.

Hans: reference to legal language, where also very abstract language is used.

Q: there is something concrete in front of you, but also a situation in which this concrete thing appears; what is more important, the concrete thing or the situation which permits it to exist? ... what allows the thing to exist.
Compare industrial food. "It smells good"

The critique alone is not enough. The problems of course need to be addressed.
You're only able to catch the most obvious problems.

It is not just in 'their' hands, but how can it be in ours. [how can we connect to 'their' hands ... that is what we are trying these days/algolit is trying? How to actually change these extractive relationships?]
NM: bias is a way of expression

What about nouns that do not have clear visual representations, like 'patriot'? Things that cannot be imaged are discarded from the taxonomy. A double reduction. "Fluffy nouns" are not "physical entities".
taking out the disease subtree.

You would expect deep, serious concerns in this tech, but is it really serious to take out the disease sub-tree?
3000 people work on WordNet, and they decide that 'patriot' is not imaginable.

FS: very often the response to concerns is "we are still at beta level" and "some day we will grow up"
now we can recognize a face in a crowd. Have we grown up? Or are we still at baby level and will it become better?
NM: It always comes back to the question: What counts as knowledge? 
If it (the applied algorithm?) is the product, there is no reason it will improve, because you need to hide too much of the cultural decisions that were taken where (and how) the knowledge is produced. Only when you take them seriously and start from there is a change possible.

verbs are thrown out, unless they indicate action
the problem is often delegated to the annotator
delegated to the annotator: if they say it is not / if there is disagreement, then the object will be discarded
Is this fascist -- economic language without ambiguity. It takes the conflict out, but opens space for resistance?
NM: as an annotator, if you are too much in disagreement, you will be out of a job.

So the question remains: how to make processes/tech that value disagreement, and how it can enrich performance. (annotator note: maybe emphasise the bias?)
it is not all bad if we place it in a different context where we value disagreement
We need to be on the bias.
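
One way to picture tech that values disagreement (a thought-sketch, not an existing pipeline): instead of discarding images without consensus, keep the whole label distribution and its entropy as information about ambiguity:

```python
# Thought-sketch: keep disagreement as data instead of discarding it.
# High label entropy marks an ambiguous (interesting) image rather than a rejected one.
import math
from collections import Counter

def label_distribution(labels):
    counts = Counter(labels)
    total = sum(counts.values())
    dist = {label: n / total for label, n in counts.items()}
    entropy = -sum(p * math.log2(p) for p in dist.values())
    return dist, entropy

dist, entropy = label_distribution(["bedroom", "hotel room", "bedroom", "stage set"])
print(dist)      # the competing readings, all kept
print(entropy)   # 1.5 bits: a measure of how contested the image is
```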

NM: 19th century -- no reservation to make claims about the world.

Hans: Are we not reading too much into it?

NM: the classification systems are universalist ... they are plugged into these systems. https://www.researchgate.net/publication/283356710_An_Analysis_of_WordNet%27s_Coverage_of_Gender_Identity_Using_Twitter_and_The_National_Transgender_Discrimination_Survey

Hans: cybernetic-like systems that look at the behavior of users and apply the results to, for example, search results - are they less crude? ref. Google Analytics

"it is so available" 

the pervasiveness is dangerous

Hans: are we too dark? It does not have that many applications? It will take a while, but more data WILL help in the end.
Making bias political. We are just at the start.

constructing an interesting political relation with the positioning of bias.

Femke: paying attention to the creation of knowledge, 
bias is interesting, and inherent to language and communication
but to see the machineries at work that create biases and clichés
how to work with biases, but also with racism/sexism
concern about the reiteration of convention, and the inability to deal with differences, but at the same time being super excited to work with the diagonal and the grey

NM: Where do you put the emphasis in the process:

- separation between data and algorithm is a problem
So: think of practices where the intimate relation between the two is positive. Resist the separation, and the pressure to make them operate independently.
they are symbiotic.
- insisting on the embodiment
computer vision exists because there are bodies that do this work of vision. "It's not a piece of software that does it all." -> the eyes of the human subjects that make up computer vision
there are many eyes, of people that have been trained to see in certain ways.

"Adobe guy": a 'normal' picture
(not much happens in a 'normal picture')
"in a normal picture, the most important thing that happens stands in the middle"

Pierre: political project, as it puts the problem of class at the center
artist 'elites' can create not-normal images, with complexity, because they have knowledge of image-making histories

FS: normal vision and class vision are connected, but how to go from there?
NM: stretch the cloth

Pierre's grandma: "oh, the bias, so difficult but it's so interesting"

HL: do you have an idea on current annotation practices? eg: CAPTCHAs
NM: synthetic annotation: from little data, the algorithm expands the annotation
eg: Mike's example of throwing away the decoding part of the decoding/encoding process