Notes Algoliterary lectures http://constantvzw.org/site/Algoliterary-Lectures,2852.html
Notes workshop Nicolas Maleve: Variations on a glance http://constantvzw.org/site/Variations-on-a-Glance.html
Notes from workshop Algolit: http://pad.constantvzw.org/p/algoliterary.workshop.collective-gentleness

--

Introduction
An Mertens: Algolit = resourcing
Mike Kestemont taught me Python :-)

--

Generative Models and the Digital Humanities

Authorship in medieval texts
Steven Pinker: "Literary criticism is a joke"
Who has the cultural capital?

YSL on Python -- sexy programmers (Y)
http://www.refinery29.com/2017/08/170514/alexandre-robicquet-ysl-fragrance
Maybe an answer to this lack of cool: the Yves Saint Laurent advertisement.
"Why? It makes everything possible." "Everything starts with a Y."
Programmers are cool and sexy. The man in the clip is an actual researcher (not 'just' a model)!
Y = the mathematical symbol for the ground truth.
MK: "the clip is actually real"

Hype of Deep Learning
Deep learning is a form of AI that mimics our own intellectual capabilities.
Applications: face recognition, autonomous cars, machine translation.
"This is a neural network, a very simple one."
An information structure: the balls ('neurons') contain information.
Input is transformed in layers and comes out as something else.
The hidden layers are special: "the hidden layers are what is special about it." "Why are hidden layers a good idea?"
Example from face recognition: from primitive shapes, to higher-level shapes, to 'recognition'.
Primitive features at the end of the network are combined into a face.
"The neurons learn to represent the features in different ways."
It maps the way mammals understand/see (let's check with Nicolas tomorrow!).
Mike shows an analogy with the way the human brain processes vision.

"What is the state of the art?"
Inference from posture: Man Holding A Cellphone. "Why is this amazing?"
Results have improved a lot since recognizing apples and pears 10 years ago.
Generating language. Is the algorithm actually writing?
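The "very simple" neural network described above (input transformed through a hidden layer, coming out as something else) can be sketched as a toy forward pass. Everything here is invented for illustration: the layer sizes, and the random, untrained weights.

```python
import numpy as np

def relu(x):
    # a common non-linearity: negative values become 0
    return np.maximum(0, x)

def forward(x, w_hidden, w_out):
    """Input -> hidden layer of intermediate features -> output."""
    hidden = relu(x @ w_hidden)  # the hidden layer re-represents the input
    return hidden @ w_out        # the output layer combines those features

rng = np.random.default_rng(0)
x = rng.normal(size=4)              # a toy 4-number input
w_hidden = rng.normal(size=(4, 8))  # 4 inputs -> 8 hidden 'neurons'
w_out = rng.normal(size=(8, 2))     # 8 hidden -> 2 outputs
print(forward(x, w_hidden, w_out))  # a 2-number output vector
```

Training would adjust `w_hidden` and `w_out` so the hidden neurons learn useful features (primitive shapes first, higher-level shapes deeper in, as in the face-recognition example).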
"It's still challenging to have neural networks work on cellphones." "insane economic potential" (it is all about hardware) http://www.decisionproblem.com/ neural networks was 'dead' in the 90s As hardware got better after the 90s, results of DL improved. the renaissance of neural networks. There is a lot of money going around in the DL world. head of Ai of Tesla, after finishing his phd at Stanford, Google offered him job of 1 million dollars/year 3 influential 'tenors' The geography of AI: Torronto, Google - Facebook, New York, Montreal Yann LeCun, NY, Facebook Yoshua Bengio, Montreal It's interesting to see that these three people advocate open data, open science and open software a lot work at universities + companies always publish work in public domain Yann Lecun influenced by French enlightement (what about humans?) Facebook post after the PrayForParis, responding to lots of people responding to this hashtag. head of AI research at Fb Open data is important for being a good citizen according to LeCun. AI supporting democracy -- "AI should informed participation in public life" "AI should promote informed participation in public life, cooperation and democratic debate". > A ref to the French enlightment computers still learn from human knowledge Algorithms learning from humans: Labeled data, for example ImageNet. - supervised learning Y is often used as a symbol to represent labels (for example names to images) Imagenet is labeld image dataset (70 million) An example way to create labeled data. "Can we still do learning if we don't have information provided by humans." Autoencoder you feed image, goes to encoder, tries to compress image, decoder will try to reconstruct original image compressing - decompressing information. Network forced to generalize. (Jacotot! http://www.sup.org/books/title/?id=3009) reconstruction will not be perfect 'Cat Paper' Quoc et al. 
2012: watching YouTube images (10 million 200x200 images), visualising what individual neurons 'see'.
Next step: leave out the encoder. New artificial data - generative: computer-generated images it had never seen.
This worked well for digits; "it did not look like faces".
Then came GANs: Generative Adversarial Networks.
Start with random noise and feed it to a network, the generator.
Next to it: "real" data. D, the detective/discriminator > decides which one is the real data and which one is the fake.
The forger competes with the detective, generating more and more 'real' images. Mike illustrates this with a Mona Lisa example.
Example: generated bedrooms, based on AirBnB. "They sort of look real, right?"
Another example: generated celebs, a project by Nvidia using the CelebA-HQ dataset (Karras et al.) https://futurism.com/these-people-never-existed-they-were-made-by-an-ai/
People are amazed by GANs generating "new" things. (This is super-conservative/confirming, right?)
"Why would you want to generate new images?" "How to evaluate this?"
"Do not ask what GANs can do, but think of what you can do for GANs" (tweet)
"What I cannot create, I do not understand" (?!?!) - R. Feynman
(But this means you need to somehow invent it to understand it. So what about things you might not understand? Anything alien - is that thinkable/imaginable?)
Mike's version: "What I cannot generate, I do not understand."

Coming to literature
Interest in T.S. Eliot - making criticism concrete through his work.
GANs don't work yet for text. Pixels are numbers, language is symbols; this symbolic nature makes it difficult for GANs.
Other models work very well.
Karpathy (who now works at Tesla) wrote a blogpost on RNN effectiveness: a model that takes a character and tries to predict the next character.
A Recurrent Neural Network has a memory (LSTM), f.ex.
it opens quotes and at some point expects to close them.
"This is a bit technical, but the result is cool." "It looks convincing." "The line breaks are also generated by the algorithm."

CPNB - Collective for the Promotion of the Dutch Book
Theme 'I Robot' (Asimov); they asked for a novel generated by computer. Impossible.
A different suggestion: re-issue 'I, Robot' and add a 10th chapter that Giphart writes together with a robot.
2 weeks of training with a Recurrent Neural Network on 4392 novels by 1600 authors.
"Literary autocompletion": different voices, a co-creative system, a creativity 'temperature' button.
Now they analyse the writing process, 'very interesting' (?): for example changing tenses from present to past.
A forensics of the creative process: more data than you normally have. How to reconstruct the process? Commented with highlights.
Predictable IS NOT creative.
"Linguistically perfect" - semantics is another thing. OK for 256 characters; when it gets longer, the semantics/narrative go wrong.
Interesting: a sort of inspiration; characters in the story have been introduced by the bot.
Now: a writing competition by Algemeen Dagblad - asibot.nl - writing with a bot. The robot itself is judging?
Trying to come up with evaluation measures: the robot selects 10, Giphart chooses from those ten.

Conclusion
AIs offer tools that we don't know how to use yet. "Humans still need to learn to interact with the bots." Potential. Learning how to interact?
They are renting servers at Amazon (the bot will eat itself); current costs are 1000€/month.
Another idea is to commercialise this, for example to write job applications.
The code will be opened and published. But the models are of course difficult to train yourself.

Questions
Hans: the models are trained on letters - could you try it on longer elements?
A: Not enough/fast hardware. The character set is limited; word vocabularies are endless. You can create new words.
James Joyce, Finnegans Wake: try to see if they can model the generation of these new words.

Femke: trying to understand how to think of this as generative - is it true or not? Struck by the use of the phrase 'if I cannot generate, I do not understand'. A possessive way of grasping understanding? How do you deal with things that you do not understand? How is that thought in these models?
Answer: given a sentence of characters, when you feed it a non-normal character it would assign a low probability.
Femke: a type of writing that has not been thought of yet - how to validate/recognize it?
Mike: the space of literature that we can generate is limited, a finite space. Creativity: generating and selecting.
'Writing with the robot': how can it be different from the writing we already know? It would be difficult to generate anything new; this system is something very conservative. It's cool and conservative at the same time.
(So the writing stays within the bounds of the corpus; it copies patterns but not turns of phrase. It is more variations, but not without end. Probable, not possible.) <- not sure how this works with co-writing, if there is a generative element here.

Yvonne: what makes a good human writer is that they are limited; that creates their style. This is unlimited. Can you foresee, in the future, limitations integrated in the systems - an algorithm that comes up with its own voice? Or multiple voices? But what is creativity? What is a voice?
Reference back to the ways writing styles develop, as a way to think about how to implement this in computer science.

Q: is there a flow between synthetic literature and synthetic biology? Neurotic Networks! An organic basis is a reference point -- is there an equivalent?
The math behind neural networks is based on the way neurons behave. It's an inspiration.
MK: "The brain is a big network of neurons. This is how Neural Networks work."
Brains have plasticity.
Physical change. Matter that matters.
Text prewritten in different styles: you can look at the differences. Bringing authors back from the dead: a Charles Dickens bot.

Q: the Montreal declaration -- usage in politics etc. Is there any regulation in place? For example, the internet as a corpus generates quite a bit of offensive language.
MK: for self-driving cars there is regulation, and it differs from region to region.
The act of creation is the bot's, but the bot is no legal person yet; a computer does not have 'rechtspersoonlijkheid' (legal personhood), you can't sue a computer.
In agriculture there are more self-driving cars, because they drive on private land, not in public space.
[how to align the 'Enlightenment' fandom with the authorship and responsibility issues]
The Beginning of The End of Copyright
[the presentation will be on Algolit soon http://algolit.net/]

---------------------------------------------------------------------------

Amir Sarabadani https://wikimediafoundation.org/wiki/User:Ladsgroup
ORES, Ethical and Open Source AI for Good

German chapter of Wikipedia. Wikipedia vs Wikimedia vs the Wikimedia movement vs the Wikimedia Foundation.
Amir works for Wikimedia Deutschland. Movements + projects.
The Wikimedia Foundation has the budget, runs the servers, pays engineers, but does not interfere in the community. Improving language support, for example.
The Scoring Platform team = the AI team of Wikimedia. This team works on ORES.
NLP = Natural Language Processing. API = Application Programming Interface.
Lots of inputs-outputs and a black box in between. 'AI' is a press/comms word, used to promote production.
ORES is an API, designed to empower other projects that can be built on top of it. When you interact with Wikipedia, you will interact with something of ORES.
JSON-formatted results:
*result: vandalism
*probability:
  * true: 0.05 (for example)
  * false: 0.88 (for example)
Too many edits for people to review; they need the help of bots.
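The JSON-formatted results above are what tools built on ORES consume. A hedged sketch of what that looks like: the response below is hand-written in roughly the shape the ORES scores API returned at the time (field names approximate, probabilities taken from the example in the notes), not an actual API call.

```python
import json

# Hand-written example response; the structure and field names are
# approximate, modelled on ORES's damaging-edit output, not fetched live.
response_text = """
{
  "enwiki": {
    "scores": {
      "123456": {
        "damaging": {
          "score": {
            "prediction": false,
            "probability": {"true": 0.05, "false": 0.88}
          }
        }
      }
    }
  }
}
"""

data = json.loads(response_text)
score = data["enwiki"]["scores"]["123456"]["damaging"]["score"]
# 'it just gives a number': ORES scores, a human (or tool) decides
needs_review = score["probability"]["true"] > 0.5
print(score["prediction"], needs_review)
```

The point of the API design: the decision threshold lives in the tool, not in ORES, which is what makes the "curate, don't revert" approach possible.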
20 edits per second. Review is subjective ("there are a lot of fights on Wikipedia").
Fighting over 'Aluminium' and how to write it -- UK or US spelling. The discussion is bigger than the size of The Great Gatsby.
Using AI to blindly revert damaging edits is not the solution. ClueBot works in this way. It drives people away, as it demotivates people from making edits. 'We don't know the rules' is a concern of the community. When humans correct, it motivates new users.
"We might fall into the trap of profiling users." Google + Facebook are facing criticism -- they are profiling users.
ORES: "curate, don't revert" -- a watershed between probably OK and needing attention. 'It just gives a number.' It tries to catch any damaging edits that might happen.
5% of the edits on the English Wikipedia is vandalism.
Q: motives of vandals? Three categories of vandals: newcomers, cyber-warriors (pushing their agendas/messages), and people doing it for fun.
ORES tries not to look at WHO is editing: names, other edits, ...
ORES encodes natural language into features: feature engineering, a manual selection of features that signal vandalism.
Neural networks are very slow and cannot be run on Wikipedia all the time; 70 features is not enough, rather 2000 features. Everything the user did is fed to the machine to find patterns, f.ex. the number of characters written/removed.
We have to make a list of swearwords in different languages: a vocabulary of swearwords, kept updated with human review. Some tricks are used to bypass the system (like writing a word in reverse) -- still caught by ORES: on the Dutch Wikipedia, 'izan'.
How to find the data? A crowdsourced data-gathering platform. What is vandalism? Crowdsourcing intelligence on that: labelling.
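The feature engineering described above can be sketched as a tiny extractor that turns an edit's added text into numbers, including a check for reversed swearwords (the 'izan' trick). The word list and the feature names here are invented placeholders, not ORES's real ones.

```python
# BADWORDS stands in for the maintained per-language swearword list.
BADWORDS = {"nazi", "idiot"}

def edit_features(added_text):
    """Turn an edit's added text into numeric features for a classifier."""
    words = added_text.lower().split()
    return {
        "chars_added": len(added_text),
        "badwords": sum(w in BADWORDS for w in words),
        # the reverse-spelling trick ('izan') is caught by also
        # checking each word read backwards
        "reversed_badwords": sum(w[::-1] in BADWORDS for w in words),
        "uppercase_ratio": sum(c.isupper() for c in added_text)
                           / max(len(added_text), 1),
    }

feats = edit_features("He was an izan IDIOT")
print(feats)  # badwords: 1 ('idiot'), reversed_badwords: 1 ('izan' -> 'nazi')
```

Note that every feature is computed from the revision text itself, matching the point that ORES avoids looking at WHO is editing.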
Separating between damaging (but well-meant) and damaging (and badly meant) edits. Options in the interface:
*damaging
*not damaging
*bad faith
*good faith
More 'objective' results on what vandalism means. "Quality is ever changing and subjective."
Lots of classification on Wikipedia: https://en.wikipedia.org/wiki/Wikipedia:WikiProject_assessment
WP 1.0: users assess the quality of articles. +, *, A, B, C, start, stub. ORES extracts this data and tries to use it.
The Keilana effect: campaigns for women scientists https://blog.wikimedia.org/2017/03/07/the-keilana-effect/
Graph: the quality of women-scientists articles rising. Impact: seeing the quality of articles change dramatically from substandard to +.
Not counting saved edits but the different revisions of articles: the graph shows the quality of the revisions, so including the edit and the rest of the article.

Gijs: the list of swearwords! Is this a public page? Is ORES a security mechanism... does it need to be secret, for security? How do you deal with the openness of the mechanisms of Wikipedia versus the mechanisms of ORES? Does it then still work?
A: openness over security (?). Never train on non-public data.
Writing LOL in an article: it gets reverted, but not deleted.
Trying to find harassment. Predictions. A problem of ethics.

An: ORES is a platform. Who is working there, how does it work?
[shows Recent Changes, all edits done this minute]
An example of a project built on ORES is a tool that works with the RecentChanges page on the English Wikipedia. It uses ORES to classify edits: showing 'only ten edits per minute' versus every edit of that minute. Tools for setting priorities.
How do you discuss your approaches with each other within the ORES project?
The inspection tool for ORES is on https://www.wikidata.org/wiki/Wikidata:ORES/Report_mistakes
Who is 'we'? The experience of creating a new page... already so many interferences!
Meta-meta work: tool builders build upon their work, which ends up being used by the community.
Finding highly visited but low-quality articles.
New pages are problematic because it is assumed that they should already be very good articles, even if they are just a start. Patrollers are too active!

Q: more examples of applications built on Wikipedia? Is ORES part of the Wikimedia Foundation?
Three people work on ORES: two work for the Wikimedia Foundation, and Amir works for Wikimedia Deutschland.
Wikidata is a shared data warehouse for all Wikipedia projects/languages. It is important to protect this resource from vandalism.
Example: the capital of Bulgaria was vandalised to 'Despacito', which was then picked up by Siri. Wikidata feeds Siri.

Femke: try to imagine digital systems that can deal with ambiguity and dissent. Is there any work on multiplicity of use, to foster ambiguity rather than taking it out? Legibility.
There is work on Wikipedia around ambiguity in the sense of legible articles. Ex: there are pages that are discussed because there is no agreement on the description. Instead of 'solving' it and coming up with 1 answer, make space for multiple views.
A: the design of Wikipedia is made for a 'neutral' point of view, not to evade certain points of view. The earth is round, but Wikipedia should remember there are people who think the earth is flat.

Mike: how accurate is the vandalism detection model, and what are its features?
97 or 98% accurate. We care about recall: to divide what needs to be reviewed from what doesn't.

An: what features are used? 70 features.
https://github.com/wiki-ai/editquality/blob/master/editquality/feature_lists/enwiki.py (features are here: lines 56-59)
cf. page revision scores https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service/Word_lists/en
Based on reverted edits that happened previously.
How did it learn about the backwards 'nazi'? How could it learn about newly invented tricks to do racism, for example?
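Why recall matters more than raw accuracy here: with only ~5% of edits being vandalism, a model that flags nothing at all is already 95% "accurate". A toy calculation with invented confusion counts:

```python
# tp = vandalism caught, fp = false alarms,
# tn = good edits left alone, fn = vandalism that slipped through.
def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)

def recall(tp, fn):
    return tp / (tp + fn)

# 1000 edits, 50 of them vandalism; a lazy model flags nothing:
print(accuracy(0, 0, 950, 50))  # 0.95: looks accurate...
print(recall(0, 50))            # 0.0: ...but catches no vandalism at all

# a useful model: catches 45 of the 50, at the cost of 20 false alarms
print(accuracy(45, 20, 930, 5), recall(45, 5))
```

This is why the team optimises for recall when dividing edits into "probably OK" and "needs attention": missed vandalism is costlier than an extra human review.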
github.com/wiki-ai https://github.com/wiki-ai
Edit quality / feature list: https://github.com/wiki-ai/editquality/blob/master/editquality/feature_lists/enwiki.py
"The problem is, there are a lot of bad words."
Some words are bad in an article but OK in a discussion (contextual), so the words are categorized into bad words and informal words.
Reminder: ORES tries to get info based on revisions, not on individual users.
It needs to go fast!! It tries to handle everything in a timely manner.
https://commons.wikimedia.org/wiki/File:Ores.celery_memory_usage_over_time.rss.svg
ORES is based on MediaWiki software. NASA used it on their wiki; this works. It is difficult when there is no wiki structure.