Playing around with style & FreeLing
19-3
Host: Olivier Perriquet

Introduction by Olivier
I will discuss how algorithmic methods coming from the analysis of genetic sequences in computational biology, by working with large text corpora and by inferring meaning from the analysis of syntax alone, could bring a new approach to literary text analysis and transformation. I would like to explore a personal hypothesis on the cognitive limits inherent in the act of writing, reminiscent of Kolmogorov complexity - namely, the use and recycling of a presumably small, finite number of writing templates that might work as an unconscious signature of the writer. The orientation of the discussion toward style modeling, based on this hypothesis, is an invitation to play around with compaction, expansion, rewriting, crossing over, viral contamination, visualization and related ideas.
This discussion is also an introduction to FreeLing, an open source language analysis tool suite, released under the GNU General Public License, developed and maintained at the TALP Research Center at the Universitat Politècnica de Catalunya, which also benefits from many external contributions from a wide community.

--------------------------------------------

When writing, depending on the type of text (f.ex. literary), it is 'not done' to reuse words/formulations

2 approaches:
*semantics: infer information from content, cluster words in semantic fields, make statistics of that, try to get some meaningful output about content + style
*syntax: more structural, how text/grammar is organised (paragraphs, POS), no attention for the meaning of the words (the most difficult part of linguistic analysis)

Hypothesis: from syntax we can infer many things
ex Please do not hesitate to contact me if/should you need information
-> POS analysis is the same: the shallow structure would be the same, but the deep structure will be different

Computational biology
we look at text/symbols that have no meaning: the alphabet of the genome
all meaning comes from syntax; the symbols have projected meaning, no intrinsic meaning
-> similar to numbers? I don't think so (intuition), there is always intentionality
cfr Hjelmslev/(f?): plane of expression and plane of meaning

Game of life, cellular automata
-> black/white cells (alive or dead), a cell switches colour depending on its surrounding cells: the neighbourhood.
From simple rules you get different behavioural patterns / results. Even though the rules are simple, unexpected, seemingly complex patterns emerge.
In the 'gun' pattern shapes seem to move: an emergent pattern on which we 'project' meaning.
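A minimal sketch of the neighbourhood rule, to make the 'moving shape' concrete. It uses the standard Conway rules (a dead cell with 3 live neighbours is born, a live cell with 2 or 3 live neighbours survives) and the small 'glider' pattern instead of the much larger 'gun'; the function name `step` and the coordinate convention are choices made for this illustration only.

```python
from collections import Counter


def step(live):
    """Advance a set of live (row, col) cells by one generation."""
    # Count, for every cell, how many live neighbours it has.
    counts = Counter(
        (r + dr, c + dc)
        for r, c in live
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    # Birth with exactly 3 neighbours, survival with 2 or 3.
    return {cell for cell, n in counts.items()
            if n == 3 or (n == 2 and cell in live)}


# The glider: five cells that seem to 'walk' diagonally across the grid.
glider = {(0, 1), (1, 2), (2, 0), (2, 1), (2, 2)}

cells = glider
for _ in range(4):
    cells = step(cells)

# After 4 generations the same shape reappears one cell down and one cell right.
shifted = {(r + 1, c + 1) for r, c in glider}
print(cells == shifted)
```

Nothing in the rule mentions movement: the 'walking' is only in how we read successive generations.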
https://en.wikipedia.org/wiki/Conway's_Game_of_Life
cfr swarms in robotics: agents with simple rules, if you put them together < biomimetics
cfr Boids, '80s: imitate the flight of birds in graphics, each bird mostly only reacting to its close surroundings

DNA: double stranded / the backbone is composed of sugar (red), bases are attached to it
4 types of bases (nucleotide types: A, T, C, G)
They pair together with a specific complementarity: A-T, C-G
untwist the molecule = ladder, one strand is the exact negative of the other, the information is duplicated
23 chromosomes, all together about 3 billion bases, a copy in each of our cells
in human beings: information is stored linearly (vs neural networks)
cfr Mendel, 19th century (peas)
cfr discovery of the structure of DNA: 1953
the architecture of a base is always the 'same' structurally speaking, we can infer a lot from this unrealistic static view
in the human genome the length of the word = 3 billion bases ~ 10 phone books, a copy of these in each cell, constantly expressed & mutating
smoking increases the number of mutations - cancer inducing
Now it is easy to sequence genomes

DNA: repository of information
RNA: transient single stranded molecule between DNA & proteins - shape conditioned by its sequence
a copy of the DNA is taken into the machinery and read 3 by 3; the triplets are combined into a protein (ASCII code of DNA :-))
64 combinations, redundancy in the code, life is not elegant! Probably stronger in terms of resistance to mutations
http://www.geek.com/wp-content/uploads/2013/12/genetic-code.jpg
cfr wheel representation http://blocs.xtec.cat/ferrerfrancesch/files/2008/05/genetic-code.jpg
Proteins: building blocks of life - another alphabet, 20 characters: the amino acids
there is a correspondence between the alphabets of DNA/proteins
-> not known how they fold?
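A small sketch of the two 'syntactic' operations mentioned above: taking the complementary strand (A-T, C-G) and reading a sequence 3 by 3. The codon table here is deliberately partial (only a handful of the 64 triplets, picked for the example), and the function names `complement` and `translate` are made up for this illustration.

```python
from itertools import product

# Base pairing: one strand is the 'negative' of the other.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand):
    """Return the complementary strand, base by base."""
    return "".join(COMPLEMENT[base] for base in strand)


# Partial codon table (DNA alphabet), for illustration only.
# Several triplets map to the same amino acid: the redundancy of the code.
CODONS = {
    "ATG": "M",                                      # methionine, also 'start'
    "TTT": "F", "TTC": "F",                          # phenylalanine
    "GGT": "G", "GGC": "G", "GGA": "G", "GGG": "G",  # glycine
    "TAA": "*", "TAG": "*", "TGA": "*",              # stop
}


def translate(dna):
    """Read the sequence 3 by 3 until a stop codon; '?' marks triplets not in the partial table."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino = CODONS.get(dna[i:i + 3], "?")
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)


all_codons = ["".join(p) for p in product("ATCG", repeat=3)]
print(len(all_codons))                # 64 possible triplets for 20 amino acids
print(complement("ATGC"))             # TACG
print(translate("ATGTTTGGATTCTAA"))   # MFGF
```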
-> databases with annotated sequences of DNA, RNA, proteins - from different sources
various techniques from different fields, a.o. computational linguistics
in the vocabulary to speak about genomes: open reading frame/translation
a gene is a part of the genome that is translated into proteins: only about 1% of the genome codes for proteins, the rest is called 'junk DNA', some patterns are very much repeated
Some RNAs are called 'non coding': not aimed to be translated into proteins, they have the same kind of function as a protein
(in RNA, T is replaced by U)
simplified rules: tendency to stack & twist, form stems
RNA folding is believed to be partially hierarchical
intermediate representations of the molecule (only focus on pairings): 90% of the sequence can be represented without overlap in an artificial unfolding, by taking out 'pseudo-knots'
direct translation to mathematical trees is possible
try to find a consensus structure in segmentation: if there is a mutation in a pairing, the other part of the pairing will mutate as well to keep the pairing
-> it works well if you have molecules that align well; if you have too many mutations, it is difficult to align (Olivier has been experimenting with methods to align using alignment constraints in the intermediate levels, trying to use the common structure/intersection of 2 molecules)
-> you can have mutations + insertions + deletions

Language that expresses the subgraph/intermediate/secondary structure/folding of the molecule: it uses a grammar, a stochastic context-free grammar
context-free grammar: using rewriting rules
stochastic: each rule has a specific weight
S -> Sa | Su | Sc | Sg (extends with a base at the right)
S -> aSu | uSa | etc (extends using the pairings)
S -> SS (extends by branching)
S -> e (epsilon, empty symbol = termination)

RNA-world hypothesis: in the early stage of life, there was no DNA & protein, the first organisms were RNA only
-> still now: many viruses are RNA only
virus attack - 'machinery would express virus only, not genome'
-> try to kill the molecule, or block the expression of the molecule
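Going back to the rewriting rules for RNA structure: a rough sketch of how such a stochastic grammar can generate a sequence together with its pairings. The rule weights are invented for the example, the 'etc' of the pairing rule is assumed here to mean c-g and g-c, and the depth cap is only a safety net; this is not the grammar of any particular RNA tool.

```python
import random

# Hypothetical rule weights (they sum to 1); a real model would estimate
# them from known RNA structures.
RULES = {"unpaired": 0.30, "pair": 0.35, "branch": 0.10, "end": 0.25}
PAIRS = [("a", "u"), ("u", "a"), ("c", "g"), ("g", "c")]  # assumed pairings


def derive(rng, depth=0, max_depth=30):
    """Expand S once; return (sequence, dot-bracket structure)."""
    if depth >= max_depth:          # safety cap: force S -> e
        return "", ""
    rule = rng.choices(list(RULES), weights=list(RULES.values()))[0]
    if rule == "end":               # S -> e
        return "", ""
    if rule == "unpaired":          # S -> Sa | Su | Sc | Sg
        seq, struct = derive(rng, depth + 1, max_depth)
        return seq + rng.choice("aucg"), struct + "."
    if rule == "pair":              # S -> aSu | uSa | cSg | gSc
        left, right = rng.choice(PAIRS)
        seq, struct = derive(rng, depth + 1, max_depth)
        return left + seq + right, "(" + struct + ")"
    seq1, struct1 = derive(rng, depth + 1, max_depth)   # S -> SS
    seq2, struct2 = derive(rng, depth + 1, max_depth)
    return seq1 + seq2, struct1 + struct2


def pairs_are_complementary(seq, struct):
    """Check that every '(' ... ')' pair joins two complementary bases."""
    stack = []
    for i, symbol in enumerate(struct):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack or (seq[stack.pop()], seq[i]) not in PAIRS:
                return False
    return not stack


rng = random.Random(19)
for _ in range(5):
    seq, struct = derive(rng)
    print(seq or "(empty)", struct)
```

The brackets in the second string are the pairings (stems), the dots are unpaired bases; nested brackets are what makes the direct translation to trees possible.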
'epigenetic': everything that is not encoded in the genome, dependent on context
cfr biochemical pathways / metabolic pathways http://www.uz.zgora.pl/~jleluk/animacje/show_thumbnails.pl.htm - possibilities of transformation
cfr Krebs cycle
-> expression of some molecules will favour/inhibit the expression of others, explains why genetic twins will evolve differently

Playing part:
-> uses linguistic methods
-> everything is done on the syntactic level
-> in the translation of RNA to protein: splicing, like in film editing: you cut/remove/paste
you can interpret the result as a sentence or a word
process of alternative splicing: from a single RNA you can produce various proteins
-> work with large corpora of texts, automate the process on large databases

Computational biology can be inspiring for computational linguistics, writing
but there is not only syntax in linguistics...
syntax as an approach to look at the unconscious use of language (cfr gender classifier)

quantum mechanical model of the atom https://en.wikipedia.org/wiki/Atomic_theory#/media/File:Helium_atom_QM.svg
vs Rutherford model (http://byjus.com/chemistry/rutherfords-model-of-atoms-and-its-limitations/)

other interests in the genome:
pattern matching
repetition scanning
similarities between words/corpora of texts

FreeLing
http://nlp.lsi.upc.edu/freeling/
installing it
compares each word to a dictionary & creates POS trees (function of the words in the sentence, from context) with a probability for each tag
ex 'This is a present' -> present = JJ
ex 'This present is bad' -> present = NN but with very low probability -> maybe 'present' as a NN is missing in the dictionary
Uses Hidden Markov Models/conditional probabilities? Check.
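Whether FreeLing really works this way still has to be checked, but here is a toy sketch of how a Hidden Markov Model tagger would pick a tag 'from context'. Every probability below is invented for the example, the tag set is reduced to four tags, and nothing here reproduces FreeLing's dictionary or models; it only shows why the same word ('present') can get a different tag depending on its neighbours.

```python
TAGS = ["DT", "NN", "JJ", "VB"]

# All numbers are hypothetical, chosen by hand for this illustration.
START = {"DT": 0.8, "NN": 0.1, "JJ": 0.05, "VB": 0.05}   # P(first tag)
TRANS = {                                                # P(next tag | tag)
    "DT": {"NN": 0.6, "JJ": 0.3, "VB": 0.1},
    "NN": {"VB": 0.8, "NN": 0.1, "JJ": 0.1},
    "JJ": {"NN": 0.8, "JJ": 0.1, "VB": 0.1},
    "VB": {"DT": 0.5, "JJ": 0.4, "NN": 0.1},
}
EMIT = {                                                 # P(word | tag): the 'dictionary'
    "DT": {"this": 0.5, "a": 0.5},
    "NN": {"present": 0.5, "gift": 0.5},
    "JJ": {"present": 0.5, "bad": 0.5},
    "VB": {"is": 1.0},
}


def viterbi(words):
    """Return the most probable tag sequence for a list of lowercase words."""
    # best[tag] = (probability of the best path ending in tag, that path)
    best = {t: (START.get(t, 0.0) * EMIT[t].get(words[0], 0.0), [t]) for t in TAGS}
    for word in words[1:]:
        new = {}
        for t in TAGS:
            # Best previous tag to arrive in t from.
            p, path = max(
                ((best[prev][0] * TRANS[prev].get(t, 0.0), best[prev][1]) for prev in TAGS),
                key=lambda candidate: candidate[0],
            )
            new[t] = (p * EMIT[t].get(word, 0.0), path + [t])
        best = new
    return max(best.values(), key=lambda candidate: candidate[0])[1]


print(viterbi("this present is bad".split()))   # ['DT', 'NN', 'VB', 'JJ']
print(viterbi("this is present".split()))       # ['DT', 'VB', 'JJ']
```

If 'present' were missing from the NN row of the toy dictionary, the first sentence could only be tagged with 'present' as JJ, which is the kind of effect suspected in the second example above.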
# analyses POS from a text file and keeps only the 3rd column (POS tags), takes out hard returns, replaces Fc by commas / Fp by dots
# (depending on the installation, analyze may also need a config file, e.g. analyze -f en.cfg)
analyze < sturgeon.txt | awk '{print $3}' | tr '\n' ' ' | sed -e 's/Fc/,/g' -e 's/Fp/./g'
-> can be nup
-> output looks like DNA code

now we can look for repetitions in the structure
hypothesis: we have very few structure patterns and recycle them
the architecture of the syntax says something about the author
result: could be just rules of grammar?
forget the sentence structure / take the entire text as a 'genome': define repetitions as markers, does it say something about the author?
we could look for the frequency of n-grams
-> rewriting by POS analysis of 1 text, into clusters per category (NN, JJ, VB...), and use them to rewrite another text
reducing complexity & hoping that what's left is meaningful
reducing it into its most compact form, that could be seen as the 'signature' of the text, cfr charcoal (output could be a shape or a specific lay-out)
and try to expand it again

recuperer.lu
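A possible first step for the n-gram idea, sketched in Python: count the n-grams in a sequence of POS tags and keep the ones that repeat. The tag sequence below is a made-up example, not the output for any real text, and the helper names `ngram_counts` and `repeated` are hypothetical.

```python
from collections import Counter


def ngram_counts(tags, n):
    """Count every run of n consecutive tags."""
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))


def repeated(tags, n, min_count=2):
    """Return the n-grams that occur at least min_count times, most frequent first."""
    return [(gram, count)
            for gram, count in ngram_counts(tags, n).most_common()
            if count >= min_count]


# Made-up tag sequence standing in for the one-line 'genome' of a text.
tags = "DT NN VBZ JJ , DT NN VBZ DT JJ NN .".split()

print(repeated(tags, 3))   # [(('DT', 'NN', 'VBZ'), 2)]
```

To try it on real output, the one-line tag file produced by the pipeline above can be read and split on whitespace. Comparing the most frequent n-grams of two authors would be one way to test whether the repetitions say something about the writer or are just rules of grammar.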