__NOPUBLISH__

Algolit session - word embeddings & word2vec
13 April 2018

Previous sessions: https://pad.constantvzw.org/p/180317_algolit_word2vec
Table image: https://virtualprivateserver.space/~mb/files/algolit-table-as-resource.jpg

proposal:
grammar
space & organisation
just as a text has an organisation

set of objects to create a unity
we need common objects, as word2vec only works with the most common words

word2vec as a system that derives meaning

looking for patterns in the space that we are in
find the grammar of a space
going to words, then words to numbers
can we perform the machine until the end of the process?

what about going from space to colors?

continuous system, not a discrete system
there is a continuous relation between 'very' & 'much', relating to 'a lot'

Ref. to King - Man + Woman = Queen
Computer Programmer as Woman is to Homemaker?
http://papers.nips.cc/paper/6228-man-is-to-computer-programmer-as-woman-is-to-homemaker-debiasing-word-embeddings.pdf
What is the database that it is trained on?
Why is this hidden?
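The King - Man + Woman = Queen arithmetic can be sketched with toy vectors. A minimal sketch, assuming invented 3-dimensional embeddings (real word2vec vectors are learned from a corpus and have hundreds of dimensions):

```python
from math import sqrt

# Toy 3-dimensional embeddings, invented for illustration only.
vectors = {
    'king':   [0.9, 0.8, 0.1],
    'queen':  [0.9, 0.1, 0.8],
    'man':    [0.1, 0.9, 0.1],
    'woman':  [0.1, 0.1, 0.9],
    'cup':    [0.2, 0.5, 0.5],
    'laptop': [0.5, 0.2, 0.2],
}

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b))
    return dot / norm

# king - man + woman = ?
target = add(sub(vectors['king'], vectors['man']), vectors['woman'])

# The nearest remaining word (by cosine similarity) is the "answer":
# what is left after the subtraction is the relationship.
candidates = {w: v for w, v in vectors.items() if w not in ('king', 'man', 'woman')}
answer = max(candidates, key=lambda w: cosine(vectors[w], target))
print(answer)   # queen
```

With gensim and a real model this is `model.wv.most_similar(positive=['king', 'woman'], negative=['man'])`.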

another proposal:
alternative systems to go from words to numbers
words to numbers by giving each word values/properties on x/y axes

exercise:
define a corner of the room: table
define a rule system for oneself to establish relations for the objects on the table
challenge : how to model not only the objects themselves, but their positional / geographical relationships ?

Mäel
matrix, array of arrays
the table as a grid, 0 if empty, 'string' if there is an object
to use to predict where humans are located at the table

The table is the surface, the table is the 0
The table is divided into a 6x6 grid
someone was in the middle, here Mäel cheated a bit

const table = [
    ['notebook', 'bowl', 'laptop', 0],
    ['laptop', 'computer', 'paper', 'glasses case'],
    ['laptop', 'charger', 'cake', 'laptop'],
    ['glass', 'laptop', 'laptop', 'notebook']
];

const tableWithSurroundings = [
    [0, 0, 'human', 'human', 0, 0],
    [0, 0, 'notebook', 'laptop', 0, 0],
    ['human', 'laptop', 'computer', 'paper', 'glasses case', 0],
    ['human', 'laptop', 'charger', 'cake', 'laptop', 0],
    [0, 0, 'laptop', 'laptop', 'notebook', 'human'],
    [0, 0, 'human', 'human', 0, 0],
];

Javi
looked at the transcript of the hearing of Mark Zuckerberg and put it through word2vec
there is a chart!
but no conclusions

committee & hate
white house & targeting
members & million
someone & order


Manetta
looking at the table as a graph at first

- is the table a graph (result)?
- or a dataset (input)?

as Graph 
Plate is to An's laptop as paper sheet is to the Etherbox box
An's cup is to An's Lenovo laptop as Mäel's bottle is to Mäel's Dell laptop
cup is to Lenovo as bottle is to Dell
Lenovo - cup + bottle = Dell

as Dataset 
(with abstraction level 'loose' applied, captured from an object level, window size 1, in describing format)
The box is placed just at the left of the Lenovo laptop while the cup is standing on the right.
The cable crosses over the charger on the left with the plate as its company on its right.
The cup stands on the left of the Apple laptop with another Apple laptop on the right.

as Dataset 
(with abstraction level 'loose' applied, captured from an object level, window size 1, in left to right format while approaching the table as a continuous text)

empty, bankcard, pen holder
bankcard, pen holder, notebook
pen holder, notebook, keys
notebook, keys, hand
keys, hand, charger
hand, charger, box
charger, box, laptop
box, laptop, cup
laptop, cup, mouse
cup, mouse, human
mouse, human, laptop
human, laptop, plate
laptop, plate, cable
plate, cable, laptop
cable, laptop, human
laptop, human, bowl
human, bowl, sheet
bowl, sheet, charger
sheet, charger, bottle
charger, bottle, keys
bottle, keys, extension block
keys, extension block, lego piece
extension block, lego piece, laptop
lego piece, laptop, human
laptop, human, human
human, human, notebook
human, notebook, cup
notebook, cup, phone
cup, phone, speaker
phone, speaker, charger
speaker, charger, cup
charger, cup, laptop
cup, laptop, charger
laptop, charger, cup
charger, cup, laptop
cup, laptop, laptop
laptop, laptop, cup
laptop, cup, human
cup, human, human
human, human, phone

An
created sentences with the objects from left to right and top to bottom
choosing different diagonals to create lists of words
For example: backpack, table, cup, pen, human, computer, cable, human, computer, mouse, human

Then counted all the 'words'
making a freq dictionary

An chose to disregard words that only appear once or twice, annotating them with UNK
Things that fall out are: speaker, ...
Things that are kept in: cup, paper, backpack, human, ...

Following the word2vec steps, giving ID numbers to the objects
cable appears 25 times, the most common word, so cable = 1 (ID number)
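An's steps (frequency dictionary, UNK for rare words, ID numbers by frequency) can be sketched like this. The object list is a shortened, invented stand-in for the real readings; the convention of UNK = 0 and most common word = 1 is the one used in word2vec_basic.py:

```python
from collections import Counter

# A shortened, invented stand-in for An's left-to-right / diagonal
# readings of the table (the real list was much longer).
words = ['backpack', 'table', 'cup', 'pen', 'human', 'computer',
         'cable', 'human', 'computer', 'mouse', 'human', 'cable',
         'cup', 'cable', 'speaker']

counts = Counter(words)

# Words that only appear once or twice are replaced by UNK.
words = [w if counts[w] > 2 else 'UNK' for w in words]

# word2vec_basic.py convention: UNK gets ID 0, then IDs by
# descending frequency, so the most common word gets ID 1.
vocab = ['UNK'] + [w for w, _ in Counter(w for w in words if w != 'UNK').most_common()]
ids = {w: i for i, w in enumerate(vocab)}
data = [ids[w] for w in words]
print(ids)
print(data)
```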

Cristina
vicinity between objects, nearness (in Dutch: naburigheid, or 'in de buurt', nearby)
relations between objects, except the Etherbox box (oops forgotten!)
placed in a spreadsheet 
>>> better viewing here: https://ethercalc.org/byh3zayzqjea

computer left 1
charger with hearts
An's banking card
An's glasses box
An's pencil box
An's wallet
computer top 1
notebook
cup
An's keys
computer right 1
water bottle
mouse
computer right 2
cup
pen
cup
computer bottom 1
computer medium
computer bottom 2
cup
cup
cup
phone
Gijs' keys
speaker
A3 paper
extension cable
cake plate
notebook left
bowl

Distance of the objects from the sides of the table
https://ethercalc.org/btu2g14n9kth

hypothesis: objects that are more often used are closer to the table sides
always choosing the smallest margin 

This calculates use as hand reach
ears have a whole other reach
Could that be other columns in the spreadsheet?
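The "smallest margin" rule above could be sketched as the minimum distance of an object to any of the four table sides. A minimal sketch, assuming a hypothetical 180 x 90 cm table and invented object coordinates:

```python
# Hypothetical table of 180 x 90 cm; (x, y) positions in cm
# are invented for illustration.
TABLE_W, TABLE_H = 180, 90

objects = {
    'cup':     (20, 10),
    'laptop':  (90, 45),   # dead centre of the table
    'charger': (170, 80),
}

def smallest_margin(x, y):
    # distance to the nearest table side: "use as hand reach"
    return min(x, TABLE_W - x, y, TABLE_H - y)

for name, (x, y) in objects.items():
    print(name, smallest_margin(x, y))
```

Other reaches (ears, eyes) could become extra columns computed with different distance functions.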

Gijs
looking at vicinity
the charger is difficult, as it also has a cable that is a bit in the circle of vicinity and also not
Gijs decided to not record humans
do you count from the edge of the object? where do you start counting from?
computers are omnipresent
recorded 34 objects

Oh no, Gijs & Cristina & An didn't record the lego brick!
the lego brick was the trap of the exercise ;) trolling

David
less methodological approach
looking at what words we connect to the objects
concentric approach

then the color codes could be used to come to numbers
giving the words a sort of property
Are humans included (as an object)? Yes of course!
Do humans also get a color?

Ref. to how to color the human skin
project by Wendy, using fabric threads, asking people to create their skin color with multiple threads
which were in the end woven into a scarf

color linking to a world, culturally defined, as a way to encode

>>> https://virtualprivateserver.space/~mb/files/algolit-table-as-resource.jpg

After discussion:
From words to numbers
weighing words is subjective

Relating to subjectivity:
RGB codes, for example, follow a standard; they're a translation into something machine-understandable
There are colors that fall outside the spectrum: gold, neon
And then screens also render them differently > enter calibration

This exercise was still visual-based
We approach the table with our eyes
What happens if we do something similar to a text?

we think from the model instead of from the tools, free
many ways to look at 1 table: scale, time, movement, agency of objects....
many levels of information you can do something with
what else can we try to capture? Rhythm?
going back to a text as a departure point.

How to work with the table as a semantic field?
- describe the situation in full sentences (and process it into a graph, what can the graph reflect to us about the situation?)
- how to include non-verbal elements into a dataset?
- time as a parameter: some events trigger change dynamics
- adding words in the space, in middle of relationship between 2 words

Feed algorithm with any type of data: colours, non verbal elements, ....
and see what comes out of it, maybe not so interesting to do?

Focus on the word2vec calculations, explore what it means to calculate with words
if you subtract man from king, what are you left with? if you subtract, the relationship is what's left

to accept or not to accept the model?

what if we accept?

by working with word2vec (even if in a restrained manner), we might already be reinforcing it

Gensim tutorial: https://rare-technologies.com/word2vec-tutorial/
https://gitlab.constantvzw.org/algolit/algolit/tree/master/algologs/gensim-word2vec
+
https://gitlab.constantvzw.org/algolit/algolit/tree/master/algoliterary_encounter


An revisiting word2vec
difference between stochastic gradient descent & one-hot-encoding
https://pad.constantvzw.org/p/neural_networks_3
*one-hot vectors
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0] 
with one position for each word in the vocabulary
where the single 1 marks the place of that word in the vector
> In this kind of vector representation none of the words are similar to each other: each is just a 1 in its own position.
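A minimal sketch of such one-hot vectors, using a few of the table objects as a toy vocabulary:

```python
vocabulary = ['cup', 'laptop', 'human', 'charger', 'notebook']

def one_hot(word):
    # one position per word in the vocabulary; a single 1 marks the word
    return [1 if w == word else 0 for w in vocabulary]

print(one_hot('human'))   # [0, 0, 1, 0, 0]

# The dot product of any two different one-hot vectors is 0:
# in this representation no two words are similar at all.
dot = sum(a * b for a, b in zip(one_hot('cup'), one_hot('laptop')))
print(dot)                # 0
```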

https://pad.constantvzw.org/p/neural_networks_4
co-occurrence matrix as a preparation for :
f.ex. stochastic gradient descent - a way to calculate grammar positions ;-)
sentence = "I like deep learning","I like NLP", "I enjoy flying"
words = ['I', 'NLP', 'deep', 'enjoy', 'flying', 'learning', 'like']
matrix = 
[[0 0 0 1 0 0 2] 
 [0 0 0 0 0 0 1] > center word (NLP)
 [0 0 0 0 0 1 1] > outer word (deep)
 [1 0 0 0 1 0 0]
 [0 0 0 1 0 0 0]
 [0 0 1 0 0 0 0]
 [2 1 1 0 0 0 0]]
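The matrix above can be reproduced in a few lines of Python, with window size 1: count each pair of directly neighbouring words, in both directions:

```python
sentences = ["I like deep learning", "I like NLP", "I enjoy flying"]
words = ['I', 'NLP', 'deep', 'enjoy', 'flying', 'learning', 'like']

index = {w: i for i, w in enumerate(words)}
matrix = [[0] * len(words) for _ in words]

for sentence in sentences:
    tokens = sentence.split()
    # window size 1: only directly neighbouring pairs
    for a, b in zip(tokens, tokens[1:]):
        matrix[index[a]][index[b]] += 1
        matrix[index[b]][index[a]] += 1

for row in matrix:
    print(row)
```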

using 13 words in word2vec_basic.py
output corresponds to most frequent words
 Average loss at step  0 :  9.304950714111328
Nearest to keys: jacket, pens, notebook, cup, paper, human, table, UNK,
Nearest to computer: jacket, cable, chair, pens, notebook, cup, backpack, UNK,
Nearest to pens: notebook, computer, cable, keys, jacket, backpack, cup, UNK,
Nearest to human: table, UNK, keys, jacket, chair, cup, notebook, cable,
Nearest to jacket: computer, table, keys, UNK, cable, human, backpack, notebook,
Nearest to table: jacket, human, backpack, cable, chair, keys, UNK, cup,
Nearest to paper: chair, UNK, notebook, cable, cup, keys, jacket, table,
Nearest to chair: paper, computer, notebook, cable, table, human, cup, backpack,
Nearest to UNK: cable, paper, jacket, human, cup, table, keys, computer,
Nearest to cable: computer, UNK, pens, backpack, paper, jacket, table, chair,
Nearest to backpack: cable, table, notebook, jacket, computer, pens, chair, keys,
Nearest to notebook: pens, chair, computer, cup, paper, backpack, keys, jacket,
Nearest to cup: notebook, paper, keys, UNK, computer, jacket, chair, human,

Average loss at step  10000 :  nan
Nearest to keys: cable, computer, human, cup, table, chair, jacket, keys,
Nearest to computer: cable, computer, human, cup, table, chair, jacket, keys,
Nearest to pens: cable, computer, human, cup, table, chair, jacket, keys,
Nearest to human: cable, computer, human, cup, table, chair, jacket, keys,
Nearest to jacket: cable, computer, human, cup, table, chair, jacket, keys,
Nearest to table: cable, computer, human, cup, table, chair, jacket, keys,
Nearest to paper: cable, computer, human, cup, table, chair, jacket, keys,
Nearest to chair: cable, computer, human, cup, table, chair, jacket, keys,
Nearest to UNK: cable, computer, human, cup, table, chair, jacket, keys,
Nearest to cable: cable, computer, human, cup, table, chair, jacket, keys,
Nearest to backpack: cable, computer, human, cup, table, chair, jacket, keys,
Nearest to notebook: cable, computer, human, cup, table, chair, jacket, keys,
Nearest to cup: cable, computer, human, cup, table, chair, jacket, keys,

set print values to smallest possible

co-occurrence matrix
'backpack', 'cable', 'chair', 'computer', 'cup', 'human', 'jacket', 'keys', 'notebook', 'paper', 'table', 'unk'
 [[ 0  0  0  0  0  0  0  0  0  0  2  1]
 [ 0 12  0 11  4  4  0  1  0  1  5  9]
 [ 0  0  0  0  0  7  5  0  0  0  1  0]
 [ 0 11  0  2  3  7  0  0  2  1  2  5]
 [ 0  4  0  3  4  1  0  0  1  1  2  3]
 [ 0  4  7  7  1  0  1  2  1  0  1  2]
 [ 0  0  5  0  0  1  0  0  0  0  0  1]
 [ 0  1  0  0  0  2  0  0  1  1  1  2]
 [ 0  0  0  2  1  1  0  1  0  0  1  2]
 [ 0  1  0  1  1  0  0  1  0  0  0  2]
 [ 2  5  1  2  2  1  0  1  1  0  0  0]
 [ 1  9  0  5  3  2  1  2  2  2  0 24]]

https://raw.githubusercontent.com/RaRe-Technologies/gensim/develop/gensim/test/test_data/questions-words.txt

Manetta's inspector of the table-embeddings:
> input: the left-to-right triples listed above (empty, bankcard, pen holder, ...)

from collections import Counter
import pprint
import random

pp = pprint.PrettyPrinter(indent=4)

text = open('embedded-objects.txt', 'r').readlines()
# ['empty, bankcard, pen holder\n',
#  'bankcard, pen holder, notebook\n',
#  'pen holder, notebook, keys\n']

main = {}
# main = {
#     'human': Counter({
#         'cup': 1,
#         'laptop': 1,
#     })
# }

for ngram in text:
    ngram = ngram.strip().split(', ')
    if len(ngram) < 3:
        continue  # skip empty or incomplete lines
    left, center, right = ngram[0], ngram[1], ngram[2]

    if center not in main:
        main[center] = Counter()
    main[center][left] += 1
    main[center][right] += 1

pp.pprint(main)
print('************')

determiners = {1: 'any', 2: 'some', 3: 'a few', 4: 'enough', 5: 'many', 6: 'most', 7: 'a lot of'}

sentences = []
x = 0
for cw, c in main.items():
    x += 1
    wwords = []
    wwordsandcounts = []
    wwordsanddeterminers = []
    for word, count in c.items():
        wwords.append(word)

        if count > 1:
            word = word + 's'
        # counts above 7 fall back to 'a lot of'
        determiner = determiners.get(count, 'a lot of')

        wwordsandcounts.append(str(count) + ' ' + word)
        wwordsanddeterminers.append(determiner + ' ' + word)

    templates = [
        'What makes a {0} a {0}, is its closeness to {1} and {2}.'.format(cw, ', '.join(wwordsanddeterminers[:-1]), wwordsanddeterminers[-1]),
        'You can identify a {} when it appears in the company of {} and {}.'.format(cw, ', '.join(wwordsandcounts[:-1]), wwordsandcounts[-1]),
        'A {} can be recognized when it lies next to a {} and a {}.'.format(cw, ', '.join(wwords[:-1]), wwords[-1])
    ]

    if x == 5:
        sentences.append('\nMeanwhile ...\n')
        x = 0

    s = random.choice(templates)
    sentences.append(s)

for sentence in sentences:
    print(sentence)

* * *

What makes a cable a cable, is its closeness to any plate and any laptop.
What makes a hand a hand, is its closeness to any keys and any charger.
You can identify a bankcard when it appears in the company of 1 empty and 1 pen holder.
You can identify a notebook when it appears in the company of 1 cup, 1 human, 1 pen holder and 1 keys.

Meanwhile ...

What makes a plate a plate, is its closeness to any cable and any laptop.
A phone can be recognized when it lies next to a speaker and a cup.
A lego piece can be recognized when it lies next to a laptop and a extension block.
What makes a laptop a laptop, is its closeness to any cable, enough cups, any box, any lego piece, any plate, any charger, some laptops and a few humans.
You can identify a keys when it appears in the company of 1 hand, 1 bottle, 1 notebook and 1 extension block.

Meanwhile ...

You can identify a human when it appears in the company of 1 cup, 1 notebook, 1 mouse, 1 bowl, 3 laptops, 1 phone and 4 humans.
You can identify a cup when it appears in the company of 1 notebook, 2 chargers, 4 laptops, 1 mouse, 1 phone and 1 human.
You can identify a bottle when it appears in the company of 1 keys and 1 charger.
You can identify a sheet when it appears in the company of 1 charger and 1 bowl.
What makes a box a box, is its closeness to any laptop and any charger.

Meanwhile ...

You can identify a charger when it appears in the company of 1 hand, 2 cups, 1 bottle, 1 box, 1 speaker, 1 laptop and 1 sheet.
What makes a speaker a speaker, is its closeness to any phone and any charger.
What makes a extension block a extension block, is its closeness to any keys and any lego piece.
What makes a mouse a mouse, is its closeness to any cup and any human.
You can identify a pen holder when it appears in the company of 1 bankcard and 1 notebook.

Meanwhile ...

You can identify a bowl when it appears in the company of 1 sheet and 1 human.

* * *

Situation embeddings (a recipe):
1. Embed yourself into a situation that you would like to encode.
2. Think of a structure to go through all the objects that surround you.
3. Encode each object into a list with its neighbour objects on the left and right, in the following format: left neighbour, object, right neighbour (one line per object).
4. Run your encodings through the inspector script.
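Step 3 of the recipe could be sketched like this, reusing the first objects of the left-to-right reading above:

```python
# A minimal sketch: turn a left-to-right reading of a situation
# into 'left, centre, right' triples (window size 1).
# The object list is a small excerpt for illustration.
objects = ['empty', 'bankcard', 'pen holder', 'notebook', 'keys']

triples = [', '.join(objects[i:i + 3]) for i in range(len(objects) - 2)]

for t in triples:
    print(t)
# empty, bankcard, pen holder
# bankcard, pen holder, notebook
# pen holder, notebook, keys
```

Writing these triples to a file gives the input format the inspector script expects.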


Python online courses:
https://github.com/mikekestemont/ghent1516
https://github.com/mikekestemont/lot2016

> Familie opstellingen (family constellations)
A psychological method for working with community problems
in which a person places people in a room, gives people 
systemic thinking as a way to analyse a situation and step back to look at it

next meeting - possible themes:
    tf-idf (can be done through pattern, gensim, scikit-learn etc)
    stochastic gradient descent
    other statistical methods of converting words-to-numbers > and how are they used within NLP?

encoding a situation, including:
- objects (table, cup, laptop)
- properties (color)
- relations (distance in cm, sound)
- time (???)