Welcome to Constant Etherpad!

These pads are archived each night (around 4AM CET) @ http://etherdump.constantvzw.org/
An RSS feed from the etherdump also appears on http://constantvzw.org/

To prevent your public pad from appearing in the archive and RSS feed, put or just leave the following (including the surrounding double underscores) anywhere in the text of your pad:

    __NOPUBLISH__

Changes will be reflected after the next update at 4AM.

Notes Algoliterary lectures
http://constantvzw.org/site/Algoliterary-Lectures,2852.html

Notes workshop Nicolas Maleve: Variations on a glance http://constantvzw.org/site/Variations-on-a-Glance.html
Notes from workshop Algolit: http://pad.constantvzw.org/p/algoliterary.workshop.collective-gentleness

--
Introduction An Mertens: algolit = resourcing
Mike Kestemont taught me Python :-)

--
Generative Models and the Digital Humanities

Authorship in medieval texts
Steven Pinker: "Literary criticism is a joke"

Who has the cultural capital?

YSL on Python -- sexy programmers (Y) http://www.refinery29.com/2017/08/170514/alexandre-robicquet-ysl-fragrance
Maybe an answer to this lack of cool: an Yves Saint Laurent advertisement. 
"Why? it makes everything possible."
"Everything starts with a Y"

Programmers are cool and sexy.

It is an actual researcher (not 'just' a model)! 

Y = mathematical symbol for ground truth. MK "the clip is actually real"

Hype of Deep Learning

Deep learning is a form of AI

mimics our own intellectual capabilities

applications: face recognition, autonomous cars, machine translation. 
"this is a neural network , a very simple one "

information structure contain information balls 'neurons'
input is transformed in layers and comes out as stgh else
hidden layers are special

"the hidden layers are what is special about it"

"Why are hidden layers a good idea?"

Example from face recognition
from primitive shapes, to higher-level shapes, to 'recognition'.
primitive features at the end of the network are combined into a face
"The neurons learn to represent the features in different ways."

it maps the way mammals understand/see (let's check with Nicolas tomorrow!) 

Mike shows an analogy to the way the human brain is processing vision.
"what is the state of the art"
inference from posture: Man Holding A Cellphone

"Why is this amazing?"

The results have improved a lot, from recognizing apples and pears 10 years ago. 

generating language. Is the algorithm actually writing? 
"It's still challenging to have neural networks work on cellphones."

"insane economic potential" (it is all about hardware) http://www.decisionproblem.com/
neural networks were 'dead' in the 90s

As hardware got better after the 90s, results of DL improved. 
the renaissance of neural networks.

There is a lot of money going around in the DL world.
the head of AI at Tesla: after finishing his PhD at Stanford, Google offered him a job at 1 million dollars/year

3 influential 'tenors'
The geography of AI: Toronto, New York, Montreal
Geoffrey Hinton, Toronto, Google
Yann LeCun, NY, Facebook
Yoshua Bengio, Montreal

It's interesting to see that these three people advocate open data, open science and open software a lot
work at universities + companies
always publish work in public domain

Yann LeCun influenced by the French Enlightenment (what about humans?)

Facebook post after #PrayForParis, responding to the many people reacting to this hashtag.

head of AI research at Fb
Open data is important for being a good citizen according to LeCun.

AI supporting democracy

"AI should promote informed participation in public life, cooperation and democratic debate". 
> a reference to the French Enlightenment

computers still learn from human knowledge
Algorithms learning from humans: Labeled data, for example ImageNet. - supervised learning

Y is often used as a symbol to represent labels (for example names to images)
ImageNet is a labeled image dataset (70 million)
An example way to create labeled data.

"Can we still do learning if we don't have information provided by humans."
Autoencoder
you feed in an image, it goes to the encoder, which tries to compress the image; the decoder will try to reconstruct the original image

compressing - decompressing information. Network forced to generalize. (Jacotot! http://www.sup.org/books/title/?id=3009)
reconstruction will not be perfect
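The compress-then-reconstruct idea can be sketched in a few lines of NumPy. This is a toy linear autoencoder, not the lecture's model: the data, sizes, and learning rate are all illustrative choices. With a bottleneck smaller than the input, the reconstruction error shrinks during training but never reaches zero.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, h = 200, 8, 3                          # samples, input dim, bottleneck dim
X = rng.normal(size=(n, d))                  # toy "images": random vectors

W_enc = rng.normal(scale=0.1, size=(d, h))   # encoder: compress d -> h
W_dec = rng.normal(scale=0.1, size=(h, d))   # decoder: reconstruct h -> d
lr = 0.01

def loss(X, W_enc, W_dec):
    Z = X @ W_enc                            # encode (compress)
    X_hat = Z @ W_dec                        # decode (reconstruct)
    return ((X - X_hat) ** 2).mean()         # reconstruction error

initial = loss(X, W_enc, W_dec)
for _ in range(500):                         # plain gradient descent
    Z = X @ W_enc
    err = Z @ W_dec - X                      # reconstruction error per sample
    W_dec -= lr * (Z.T @ err) / n
    W_enc -= lr * (X.T @ (err @ W_dec.T)) / n

final = loss(X, W_enc, W_dec)
print(initial, final)                        # error drops, but stays above zero
```

Because the network is forced through the 3-dimensional bottleneck, it must generalize: it keeps the directions that explain most of the data and gives up on the rest.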
'Cat Paper', Quoc V. Le et al., 2012
watch YouTube images (10 million 200x200 images), visualising what individual neurons 'see'

next step: leave out encoder

new artificial data - generative. computer generated images it had never seen
worked well for digits

"it did not look like faces" came up with 'Gans', Generative Adverserial Networks , or GANs

start: random noise
feed it to a network: generator
"r eal " data
D: detective/discriminator > decides which one is the real data and which one is the fake one
forger competes with detective, generating more and more 'real' images
Mike illustrates this with a Mona Lisa example. 

Example: Generated bedrooms based on AirBnB. "They sort of look real right?"
Another example: Generated celebs, a project by Nvidia called CelebA-HQ https://futurism.com/these-people-never-existed-they-were-made-by-an-ai/
(Karras et al.)
People are amazed by GANs, generating "new" things. (this is super-conservative/confirming, right?)
"Why would you want to generate new images."
"How evaluate this?"

"Do not ask what GANs can do, but think of what you can do for GANs" (tweet)

"what I cannot create, I do not understand" (?!?!) R. Feynman (but this means you need to somehow invent it to understand it. So what about things you might not understand? Anything alien, is that thinkable/imaginable?)
Mike's version: "what i cannot generate, i do not understand"

Come to Literature
Interest in T.S. Eliot - making criticism concrete through his work.

GANs don't work yet for text. 
pixels are numbers, language is made of symbols; this symbolic nature makes it difficult for GANs
Other models work very well 
Karpathy, now works at Tesla, blogpost on rnn-effectiveness
takes a character and tries to predict the next character
Recurrent Neural Network, has a memory (LSTM), e.g. after an open quote it expects at some point to close the quote
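The prediction task itself can be shown with something far simpler than an LSTM: character bigram counts. This sketch only illustrates "given a character, predict the next one"; it has none of the long-range memory (open quote, later close quote) that the RNN adds.

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count, for every character, which character follows it."""
    counts = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts

def predict_next(counts, char):
    """Predict the most frequent continuation seen in training."""
    return counts[char].most_common(1)[0][0]

model = train_bigrams("hello world, hello pad")
print(predict_next(model, "h"))  # 'e' -- 'h' is always followed by 'e' here
```

A character-level RNN does the same job, but conditions on the whole history instead of a single preceding character.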

"this is a bit technical, but the result is cool"
"it looks convincing" 
"the line breaks are also generated by the algorithm"

CPNB - Collective for the Promotion of the Dutch Book
theme 'I, Robot', Asimov
asked them to have a novel generated by computer
impossible
different suggestion: re-issue 'I, Robot', add a 10th chapter that Giphart writes together with a robot

2 weeks of training with Recurrent Neural Network
4392 novels by 1600 authors
"literary autocompletion"
different voices

co-creative system
creativity temperature 'button'
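What such a "creativity temperature" button typically does (an assumption about this system, but standard practice in text generators) is rescale the model's next-character probabilities before sampling: low temperature sharpens the distribution toward the most predictable choice, high temperature flattens it toward surprising ones.

```python
import numpy as np

def apply_temperature(probs, temperature):
    """Rescale a probability distribution by a sampling temperature."""
    logits = np.log(np.asarray(probs, dtype=float))
    scaled = logits / temperature
    exp = np.exp(scaled - scaled.max())   # numerically stable softmax
    return exp / exp.sum()

probs = [0.6, 0.3, 0.1]                   # model's next-character distribution
cold = apply_temperature(probs, 0.1)      # nearly all mass on the top choice
hot = apply_temperature(probs, 5.0)       # close to uniform: more "creative"
print(cold, hot)
```

This is why "predictable is not creative" (noted below in the discussion): at temperature near zero the bot always autocompletes with the most probable continuation.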

now they analyse writing process
'very interesting' (?)
change tenses from present to past

a forensics of the creative process. More data than you normally have. How to reconstruct the process? 
commented with highlights. 

Predictable IS NOT creative
"Linguistically perfect" - semantics is another thing.
ok for 256 characters, when longer, the semantics/narrative goes wrong

Interesting: a sort of inspiration
characters in story have been introduced by bot
now: writing competition by Algemeen Dagblad - asibot.nl

writing with a bot. The robot itself is judging?
trying to come up with evaluation measures
the robot selects 10, Giphart chooses from the ten

conclusion
AIs offer tools that we don't know how to use yet
"Humans still need to learn to interact with the bots."

Potential. Learning how to interact? They are renting servers at Amazon (the bot will eat itself)
current costs are 1000€/month

Another idea is to commercialise this, for example to write job applications.

The code will be opened and published. 

But the models are of course difficult to train yourself.

Questions

Hans: models are trained on letters - could you try it on longer elements?
A: the hardware is not fast enough

the character set is limited, word vocabularies are endless
you can create new words
James Joyce's Finnegans Wake: try to see if they can model the generation of these new words

Femke: trying to understand how to think of this as generative
is it true or not? 
struck by the use of the phrase 'if I cannot generate, I do not understand'

A possessive way of grasping understanding?
How do you deal with things that you do not understand?
How is that thought in these models?

Answer: Sentence of characters, and when you feed a non-normal character ....
would assign low probability

Femke: A type of writing that has not been thought of yet.
how to validate/recognize it?
Mike: the space of literature that we can generate is limited, a finite space
Creativity: generating and selecting
'writing with the robot', how can it be different from the writing we already know
It would be difficult to generate anything new; this system is something very conservative
It's cool and conservative at the same time.
(So the writing stays within the bounds of the corpus; it copies patterns but not turns of phrase. It is more variations, but not without end. Probable, not possible) <- not sure how this works with co-writing, if there is a generative element here.

Yvonne
what makes a good human writer is that they are limited; that creates their style
this is unlimited
can you foresee limitations being integrated into these systems in the future?

algorithm to come up with its own voice? or multiple voices, but what is creativity? what is a voice?
Reference back to ways how writing styles develop, as a way to think how it would be possible to implement it in computer science.

Q: is there a flow between synthetic literature - synthetic biology
Neurotic Networks!

An organic basis is a reference point -- is there an equivalent? The math behind it:

The math behind NNs is based on the way neurons behave.
It's an inspiration.

MK: "The brain is a big network of neurons. This is how Neural Networks work"

Brains have plasticity: physical change. Matter that matters.
text prewritten in different styles; one can look at the differences
bring authors back from the dead: a Charles Dickens bot

Q: Montreal declaration -- usage in politics etc. Is there any regulation in place?
For example, the internet as a corpus generates quite a bit of offensive language. 

MK: For self-driving cars there is regulation, 
it differs from region to region
the act of creation is the bot's, and it is not a legal person yet
a computer does not have 'rechtspersoonlijkheid' (legal personhood); you can't sue a computer

In agriculture, there are more self-driving cars, because they drive on private land, not in public space

[how to align the 'Enlightenment' fandom with the authorship and responsibility issues]

The Beginning of The End of Copyright

[ presentation will be on Algolit soon http://algolit.net/ ]

---------------------------------------------------------------------------

Amir Sarabadani https://wikimediafoundation.org/wiki/User:Ladsgroup
ORES, Ethical and Open source AI for Good

German chapter of Wikipedia
Wikipedia vs Wikimedia vs Wikimedia movement vs Wikimedia foundation

Works for Wikimedia Deutschland 
Movements + projects
Wikimedia Foundation has budget, runs servers, pays engineers, does not interfere in community
Improving language support for example.

Scoring platform team = AI team of Wikimedia

This team works on ORES

NLP = Natural Language Processing
API = Application Programming Interface

Lots of inputs-outputs and a blackbox in between. 

AI is a press/comm word. 
used to promote production

ORES is an API (application programming interface)

an API is designed to empower other projects that can be built on top of it
when you interact with Wikipedia, you will interact with something of ORES

JSON-formatted results 
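A sketch of what consuming those results looks like. The JSON below is hand-written in the shape of the ORES v3 scores API as I understand it (a real call would be an HTTP GET against ores.wikimedia.org); the revision id and probabilities are made up.

```python
import json

# Illustrative response for one revision scored by the "damaging" model.
response = json.loads("""
{
  "enwiki": {
    "scores": {
      "12345": {
        "damaging": {
          "score": {
            "prediction": false,
            "probability": {"false": 0.93, "true": 0.07}
          }
        }
      }
    }
  }
}
""")

# Walk the nesting: wiki -> scores -> revision id -> model -> score.
score = response["enwiki"]["scores"]["12345"]["damaging"]["score"]
print(score["prediction"], score["probability"]["true"])
```

"It just gives a number": ORES only returns the probability, and each tool built on top decides what to do with it.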

Too many edits for people to review. Need help of bots.
20 per second
review is subjective ("There are a lot of fights on Wikipedia")

Fighting over 'Aluminium', how to write it -- UK or US spelling.

Discussion bigger than the size of The Great Gatsby

Using AI to blindly revert damaging edits is not the solution. 
ClueBot works in this way. It drives people away, as it demotivates people from making edits.
'We don't know the rules' is a concern of the community. 
When humans correct, it motivates new users

"We might fall in trap of profiling users"

Google + Facebook are facing criticism -- they are profiling users.

ORES: " curate and don't revert " -- watershedding between probably ok and needing attention
'it just gives a number'
tries to collect any damaging edits that might happen
5% of edits in the English Wikipedia are vandalism
Q: Motives for vandalists?
3 categories of vandals: 
    newcomers, cyber-warriors (pushing their agendas/messages), for fun

ORES tries not to look at WHO is editing: names, other edits, ... 
ORES encodes natural language into features. Feature engineering. Manual selection of features that indicate vandalism.

neural networks: very slow, cannot be run on Wikipedia all the time
70 features is not enough, more like 2000 features
everything the user did is fed to the machine to find patterns
f.ex. nrs of characters that are written/removed
We have to make a list of swearwords in different languages
a vocabulary of swearwords, kept updated with human review; some tricks are used to bypass the system (like writing a word in reverse) -- still caught by ORES
in the Dutch Wikipedia: 'izan' ('nazi' written backwards)
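The feature-engineering idea can be sketched like this: turn an edit into a handful of numbers, including badword matches both forward and reversed to catch the write-it-backwards trick. The word list and the features are illustrative inventions, not ORES's actual feature set.

```python
BADWORDS = {"nazi", "idiot"}   # tiny stand-in for the curated word lists

def edit_features(old_text, new_text):
    """Map an edit (old text -> new text) to numeric features."""
    new_words = new_text.lower().split()   # a real system would diff old vs new
    return {
        "chars_added": max(len(new_text) - len(old_text), 0),
        "chars_removed": max(len(old_text) - len(new_text), 0),
        "badwords_added": sum(1 for w in new_words if w in BADWORDS),
        # catch words written in reverse, e.g. 'izan' -> 'nazi'
        "reversed_badwords_added": sum(1 for w in new_words if w[::-1] in BADWORDS),
    }

feats = edit_features("a fine article", "a fine article izan")
print(feats)
```

A classifier is then trained on vectors like these rather than on raw text, which is what keeps scoring fast enough to run on every edit.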

How to find the data?
a crowdsourced platform for gathering data

what is vandalism? crowdsourcing intelligence on that, labelling.
separating between damaging (but well-meant) and damaging (and badly meant)
options in the interface: 
more 'objective' results on what vandalism means

"quality is ever changing and subjective"
lots of classification on wikipedia https://en.wikipedia.org/wiki/Wikipedia:WikiProject_assessment
WP 1.0: users are assessing the quality of articles (FA, GA, A, B, C, Start, Stub)
ORES extracts this data and tries to use it.

the 'Keilana effect'
campaigns for women scientists
https://blog.wikimedia.org/2017/03/07/the-keilana-effect/

graph: Women Scientist quality gap increasing
Impact: seeing the quality of articles change dramatically from substandard to +

not saving edits, different revisions of articles
the graphs shows the quality of the revisions, so including the edit and the rest of the article

Gijs: The List of Swearwords! Is this a public page?
Is ORES a security mechanism ... does it need to be secret, for security

How do you deal with the openness of the mechanisms of Wikipedia and the mechanisms of ORES. Does it then still work?
A: Openness over security (?) 
Never train on non-public data

writing LOL in an article: it gets reverted but not deleted

Trying to find harassment. Predictions. 

Problem of ethics

An: ORES is a platform. Who is working there, how does it work? 
[shows recent changes, all done this minute]

An example of a project built on ORES is a tool that works with the RecentChanges page on the English Wikipedia.
it uses tools to classify edits, 'show only ten edits per minute' vs only edits from the same minute
tools for setting priorities

How do you discuss your approaches with each other within the ORES project?

inspection tool for ORES is on https://www.wikidata.org/wiki/Wikidata:ORES/Report_mistakes

Who is we?
Experience of creating a new page ... already so many interferences!

Meta meta work: tool builders build upon their work which ends up being used by the community
finding highly visited but low quality articles
new pages are problematic because it is assumed that they should already be very good articles, even if they are just a start
patrollers too active!

Q: more examples of applications built on WP
Is ORES part of the Wikimedia foundation?
Three people work on ORES. Two work for the Wikimedia Foundation, and Amir works for Wikimedia Deutschland.

Wikidata as a shared data warehouse for all Wikipedia projects/languages. It is important to protect this resource from vandalism.
example of the Bulgaria 'Despacito' vandalism, which was then picked up by Siri
Wikidata feeds Siri

Femke: try to imagine digital systems that can deal with ambiguity and dissent
is there any work on multiplicity of use?
to foster ambiguity rather than taking it out

legibility
there is work on Wikipedia around ambiguity in the sense of legible articles

ex: there are pages that are discussed because there is no agreement on description
instead of 'solving' and coming up with 1 answer, make space for multiple views

A! 

The design of Wikipedia is made for a 'neutral' point of view
not to evade certain points of view
The earth is round, but WP should remember there are people who think the earth is flat

Mike: how accurate is the vandalism detection model, what are the features?
97 or 98% accurate
we care about recall, to divide what needs to be reviewed from what doesn't
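Why recall rather than accuracy: ORES triages edits for human review, so missing a damaging edit (a false negative) is worse than flagging a good one. The counts below are made-up numbers for illustration.

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from confusion counts."""
    precision = tp / (tp + fp)   # of flagged edits, how many were damaging
    recall = tp / (tp + fn)      # of damaging edits, how many were flagged
    return precision, recall

# e.g. 90 damaging edits caught, 60 good edits flagged, 10 damaging missed
p, r = precision_recall(tp=90, fp=60, fn=10)
print(p, r)  # 0.6 0.9
```

A modest precision is acceptable here because humans re-check everything that is flagged; a low recall would let vandalism through unreviewed.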

An: What features are used?
70 features
https://github.com/wiki-ai/editquality/blob/master/editquality/feature_lists/enwiki.py (features are here: lines 56-59)
cfr page revision scores
https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service/Word_lists/en
based on reverted edits that happened previously

How did it learn about the backwards 'nazi'?
How could it learn about newly invented tricks to do racism, for example?

github/wiki-ai
https://github.com/wiki-ai
edit quality / feature list

"the problem is, there are a lot of bad words".
https://github.com/wiki-ai/editquality/blob/master/editquality/feature_lists/enwiki.py

some words are bad for an article, but ok in a discussion (contextual).
so the words are categorized into bad words and informal words

Reminder: ORES tries to get info based on revisions, not on individual users. It needs to go fast!!
tries to handle everything in a timely manner
https://commons.wikimedia.org/wiki/File:Ores.celery_memory_usage_over_time.rss.svg

ORES is based on the MediaWiki software. NASA used it on their wiki; this works.
it is difficult when there is no wiki structure