2. The nearest neighbours inherit all qualities from their gold1000 parent

Starting point is a tool, and the culture around it, used in scientific research.

Proposal workshop 20-1 with Guy De Pauw

* Beforehand
- an email to participants asking what type of texts they want to work with and what kind of classifiers they want to try out
- download of Wikipedia, available on the Constant server
- collection of literary texts from Gutenberg (maybe this is also easy to do with Pattern? But what about the cleaning? Gutenberg novels, for example, start and end with an enormous amount of text from Gutenberg about their project, licences, etc.)
- checklist for participants: which OS their machine runs, possibly a request to install Python 2.7 and Pattern in advance?

* Workshop (20-1, 10:00-18:00, at deBuren, Leopoldstraat 6, behind the Munt)
- introduction to the problems and challenges of Computational Linguistics/CLiPS
- compiling a corpus
- creating classifiers for gender, age, deception, sentiment analysis, elements of style -> can we also distil the rules here, so that less technical people can get creative with them?
- introducing a feature on deep learning

* Mention of CLiPS in the communication

Will you let us know about this?

Proposal text

[NL, translated] Computational linguists of the research centre CLiPS at UA have already shown how you can compose profiles of a writer, or detect lies, on the basis of speech and language analysis of large text files of known authors. The question that intrigues us is how this knowledge can flow back to the users and enrich us as human beings. How does it feel to be able to write like the 'average user' of a database? Or like a liar? And what if we apply that recognition software to literary texts, for example George Orwell's 1984?
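Picking up the preparation note above about cleaning the Gutenberg downloads: Project Gutenberg e-texts usually wrap the actual novel in `*** START OF ...` / `*** END OF ...` marker lines. A minimal sketch of a stripper, assuming those markers (older files use other formats, so this is an illustration, not a robust parser):

```python
# Minimal sketch: strip Project Gutenberg boilerplate from an e-text.
# Assumes the common "*** START OF ..." / "*** END OF ..." markers;
# older files use other header formats, so this is not a robust parser.

def strip_gutenberg_boilerplate(text):
    lines = text.splitlines()
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        if line.startswith("*** START OF"):
            start = i + 1          # body begins after the START marker
        elif line.startswith("*** END OF"):
            end = i                # body ends before the END marker
            break
    return "\n".join(lines[start:end]).strip()

sample = """The Project Gutenberg EBook of Example
*** START OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***
It was a bright cold day in April.
*** END OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***
End of the Project Gutenberg EBook"""

print(strip_gutenberg_boilerplate(sample))
```

Something like this could run over the whole Gutenberg collection before it goes on the Constant server, so participants start from clean novels.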
Could we then choose his profile for a next literary creation? In other words, we investigate what non-pragmatic potential lies hidden in tools that were designed for economic and surveillance purposes. We dutifully look in the mirror and ask ourselves what fantasies we dare to project onto the magic recipes for code and language.

[ENG, version 1] Computational linguists of the research centre CLiPS at UA showed how you can create author profiles or detect lies by analysing large corpora of text using speech and language processing tools. As makers, we are intrigued by the question of how this knowledge can empower users. How would it feel to be able to write as the 'average user' of a database? Or as a liar? And what would happen if we applied the tools to the texts of well-known literary authors, for example George Orwell's 1984? Would we be able to adopt his profile for a next literary creation? In other words, we look closely at and experiment with the non-pragmatic potential of tools that were designed for economic and surveillance ends. We dutifully look at ourselves in the mirror and ask what kind of fantasies we dare to project onto the magic recipes for code and language.

[ENG, version 2] Computational linguists of the research centre CLiPS at UA work on methods to profile the authors of written text or to detect lies by analysing large corpora of text using speech and language processing tools. From our perspective, we are intrigued by how these methods and tools can open up different speculations around written production and identity. How would it feel to be able to write as the 'average user' of a database? Or as a perfect liar? And what would happen if we worked with the texts of well-known literary authors? For example, would we be able to adopt George Orwell's profile for a next literary creation?
In other words, we want to experiment with the potential of tools that were designed with a mainly pragmatic application in mind, for example an economic one, or one related to surveillance. We dutifully look at ourselves in the algorithmic mirror, and ask what kind of unexpected results can come from an uncanny use of the magic recipes for code and language.

Notes

"text"

your artistic/literary imagination can work if you understand the principles, even if you don't know the code
needs to be in co-relation, like in Algolit :-) cf. radio show: http://www.paramoulipist.be/algolit/Radio_algolit_1.wav
common ground? why are we still interested? we know it is biased / we see the problems
Learn how to lie / to change your gender / to talk as a teenager, or as the average user of a database? How to use it to empower yourself?
use it for e.g. http://www.genderartnet.eu (worth mapping, but the tools they had to do so had limitations)
Use the tool to trick the tool: lie perfectly to the detector / or show the playful sides of lying
Need to address a large amount of text that I cannot manage with organic tools / on my own -> look at its limitations
"Glasses are framing my reading, but if I want to read my fucking book, I need them."
attitude of 'purity': software is "wrong", it filters, it distorts -> I don't want to use it (not so interesting) VERSUS recognising that it is very loaded, and that you will not be cured -> seeing other potentials in the tool -> magic of creation with language and code

// CLiPS, the Research Center for Computational Linguistics and Psycholinguistics, created a tool that allows for crawling the web, data mining, opinion mining, text analysis and data visualisation. In the many PDFs that exist online around machine learning methods for text analysis, some of the strategies that private companies use are clearly described.
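The profiling these notes keep returning to (writing as a teenager, as a liar, as the average user) typically starts from simple surface statistics of a text. A sketch of such features; the feature set here is illustrative, not the one CLiPS actually uses:

```python
# Sketch of the kind of surface features stylometric profiling builds on:
# average sentence length, vocabulary richness, function-word rate.
# Illustrative only; not the actual CLiPS feature set.
import re

FUNCTION_WORDS = {"the", "of", "and", "to", "a", "in", "that", "i", "it", "is"}

def style_features(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    return {
        "avg_sentence_length": len(words) / len(sentences),
        "type_token_ratio": len(set(words)) / len(words),   # vocabulary richness
        "function_word_rate": sum(w in FUNCTION_WORDS for w in words) / len(words),
    }

print(style_features("It was a bright cold day in April. The clocks were striking thirteen."))
```

A classifier then learns which regions of this feature space belong to which gender, age group or author; writing "as" someone else means steering your text towards their region.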
It opens up the mystery of 'organising by numbers' and allows for a space for experimentation, critique and playfulness.

In computational linguistics research there is a general problem of the non-availability of corpora to work with, because of problems of anonymisation or non-disclosure agreements. UA invested a lot of time in creating different types of corpora to work with, including their metadata; they're published under the GPL; but they're mainly Dutch.
-> we could use the same techniques for shadowy practices of blurring data, or to learn how to write as an 'average user'.

Some ideas by Martino & An:
* a website that uses SEO techniques ( http://en.wikipedia.org/wiki/Search_engine_optimization#White_hat_versus_black_hat_techniques ) to make its own content noisy with generated content, and so valueless for data mining; a parsing confuser
-> shadowy practices (noise sites, spam) use the same capitalist techniques and get high rankings, which shows the absurdity -> learn how to lie properly
** towards a web written by machines for other machines to read...
* from the average user of their Stylometric Computation Corpus (student, 20 years old, female, from Antwerp): see if we can generate new texts based on the patterns of the average... how the emotional comes back into the quantifiers

-> Links
http://www.clips.ua.ac.be/pattern
Pattern.web is very well documented, works smoothly, and is quite complete in terms of sources to parse data from.
Pattern for Python: http://dl.acm.org/citation.cfm?id=2343710
-> Class at UA: http://www.clips.uantwerpen.be/cl1415/
-> needs work to open it up to non-text and non-Python participants
-> note on licences when we decide to use this in the worksession: "Default license keys used by pattern.web.SearchEngine to contact different API's. Google and Yahoo are paid services for which you need a personal license + payment method. The default Google license is for testing purposes (= 100 daily queries). Wikipedia, Twitter and Facebook are free.
Bing, Flickr and ProductWiki use licenses shared among all Pattern users. //what does this mean??//"

-> Case studies
http://www.clips.ua.ac.be/pages/modeling-creativity-with-a-semantic-network-of-common-sense
http://www.clips.ua.ac.be/pages/pattern-examples-100days
http://www.cnts.ua.ac.be/~walter/papers/2012/dd12-2.pdf

Walter Daelemans
Authorship attribution and verification with many authors and little data: http://dl.acm.org/citation.cfm?id=1599146
Measuring the complexity of writing systems: http://www.tandfonline.com/doi/abs/10.1080/09296179408590015#.VB2yLK3l2Bs
Predicting age and gender in online social networks: http://dl.acm.org/citation.cfm?id=2065035
Personae, a corpus for author and personality prediction from text: http://www.clips.uantwerpen.be/sites/default/files/LD08lrec.pdf

Trying out scripts:
* 01-web/15-sort.py: Google and Bing give very different results, how come?
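The idea above of generating new texts "based on the patterns of the average" can be sketched with a tiny word-level Markov chain, trained on whatever corpus the worksession assembles. This is a minimal stand-in, not the method Pattern or CLiPS uses:

```python
# Minimal word-level Markov chain, as a sketch of 'generating new texts
# based on the patterns' of a corpus. Not the actual Pattern/CLiPS method.
import random
from collections import defaultdict

def train(text):
    """Map each word to the list of words that follow it in the corpus."""
    words = text.split()
    chain = defaultdict(list)
    for a, b in zip(words, words[1:]):
        chain[a].append(b)
    return chain

def generate(chain, start, length=8, seed=0):
    rng = random.Random(seed)   # fixed seed so the sketch is reproducible
    out = [start]
    for _ in range(length - 1):
        followers = chain.get(out[-1])
        if not followers:
            break                # dead end: no word ever followed this one
        out.append(rng.choice(followers))
    return " ".join(out)

corpus = "the clocks were striking thirteen and the clocks were loud"
chain = train(corpus)
print(generate(chain, "the"))
```

Trained on texts by the "average user" of the Stylometric Computation Corpus, even something this crude would echo her most frequent word patterns; a classifier could then be turned back on the output to check how "average" it scores.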