Playing around with style & freeLing 19-3
Host: Olivier Perriquet
Introduction by Olivier
I will discuss how algorithmic methods coming from the analysis of
genetic sequences in computational biology, by working with large text
corpora and by inferring meaning from the analysis of syntax alone,
could bring a new approach to literary text analysis and
transformation.
I would like to explore a personal hypothesis on the cognitive limits
inherent in the act of writing, reminiscent of Kolmogorov complexity -
namely, the use and recycling of a presumably small, finite number of
writing templates, that might work as an unconscious signature for the
writer.
The orientation of the discussion toward style modeling, based on this
hypothesis, is an invitation to play around with compaction,
expansion, rewriting, crossing over, viral contamination,
visualization and related ideas.
This discussion is also an introduction to Freeling, an open source
language analysis tool suite, released under the GNU General Public
License, developed and maintained at the TALP Research Center of the
Universitat Politècnica de Catalunya, which also benefits from many
external contributions from a wide community.
--------------------------------------------
When writing, depending on the type of text (e.g. literary), it is 'not done' to reuse words/formulations
2 approaches:
*semantics: infer information from content, cluster words in semantic fields, make statistics of that, try to get some meaningful output about content + style
*syntax: more structural, how text/grammar is organised (paragraphs, POS)
no attention for meaning of the words (most difficult part of linguistic analysis)
Hypothesis: from syntax we can infer many things
ex Please do not hesitate to contact me if/should you need information
-> POS analysis is the same
shallow structure would be the same, but deep structure will be different
Computational biology
we look at text/symbols that have no meaning: the alphabet of the genome. All meaning comes from syntax; the symbols have projected meaning, no intrinsic meaning
-> similar to numbers? I don't think so (intuition), always intentionality
cfr Hjelmslev: plane of expression and plane of meaning (content)
Game of Life, cellular automata -> black/white cells (alive or dead); a cell switches on or off depending on its surrounding cells: the neighbourhood. From simple rules you get different behavioural patterns / results. Even though the rules are simple, unexpected, seemingly complex patterns emerge. In the 'gun' pattern, shapes seem to move: an emergent pattern onto which we 'project' meaning.
https://en.wikipedia.org/wiki/Conway's_Game_of_Life
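A minimal sketch of the rules described above, assuming the standard Conway rules (a dead cell with exactly 3 live neighbours is born, a live cell with 2 or 3 live neighbours survives). The function name `step` and the choice of the small 'glider' pattern instead of the 'gun' are only for illustration:

```python
from collections import Counter

def step(live):
    """Compute one generation; `live` is a set of (x, y) coordinates of live cells."""
    # Count, for every cell, how many live neighbours it has.
    counts = Counter(
        (x + dx, y + dy)
        for x, y in live
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    # Birth with exactly 3 neighbours, survival with 2 or 3.
    return {cell for cell, n in counts.items()
            if n == 3 or (n == 2 and cell in live)}

# The 'glider': five cells that seem to travel diagonally across the grid.
glider = {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}

state = glider
for _ in range(4):
    state = step(state)

# After 4 generations the same shape reappears, shifted by one cell in x and y.
shifted = {(x + 1, y + 1) for x, y in glider}
print(state == shifted)
```

Nothing in the rules mentions movement; the 'moving' glider only exists in the eye of the observer.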
cfr swarms in robotics: agents with simple rules, if you put them together < biomimetics
cfr Boids ('80s): imitates the flight of birds in graphics, each bird mostly only reacting to its close surroundings
DNA: double stranded / backbone is composed of sugar (red), bases are attached to it
4 types of bases (nucleotides: A, T, C, G)
All can pair together, specific complementarity between A-T, C-G
untwist molecule = ladder, one strand is exact negative of the other, information is duplicated
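The 'negative' relation between the two strands can be sketched as a simple letter substitution. The function names are made up for this example; the reversal in `reverse_complement` reflects the fact that the two strands are read in opposite directions:

```python
# Pairing rules: A-T and C-G
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand):
    """Return the base-by-base 'negative' of a strand."""
    return "".join(COMPLEMENT[base] for base in strand)

def reverse_complement(strand):
    """Return the opposite strand, read in its own direction."""
    return complement(strand)[::-1]

strand = "ATCGGA"
print(complement(strand))
print(reverse_complement(strand))
# Taking the negative twice gives back the original: the information is duplicated.
print(complement(complement(strand)) == strand)
```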
23 chromosomes, altogether 3 billion bases, a copy in each of our cells
in human beings: information is stored linearly (vs neural networks)
cfr Mendel 19th century (peas)
cfr discovery of structure of DNA: 1953
the architecture of the bases is always the 'same', structurally speaking; we can infer a lot from this unrealistic static view
in the human genome the length of the 'word' = 3 billion bases ~ 10 phone books, with a copy in each cell
constantly expressed & mutating
smoking increases number of mutations - cancer inducing
Now easy to sequence genomes
DNA: repository of information
RNA: transient single stranded molecule between DNA & proteins - shape conditioned by sequences, copy of DNA
taken into the machinery 3 by 3 (codons) and combined into a protein (ASCII code of DNA :-)): 64 combinations, redundancy in the code, life is not elegant! The redundancy probably makes the code more resistant to mutations
http://www.geek.com/wp-content/uploads/2013/12/genetic-code.jpg
cfr wheel representation http://blocs.xtec.cat/ferrerfrancesch/files/2008/05/genetic-code.jpg
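The tables linked above can be written out as a small lookup, as a sketch. It assumes the standard genetic code, written with DNA letters (T instead of U) and one-letter amino acid codes, with '*' for a stop signal. The names `CODON_TABLE` and `translate` are made up for this example:

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon, in the order TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# 4 * 4 * 4 = 64 codons
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna):
    """Read the sequence 3 by 3 and look up each codon; a trailing partial codon is dropped."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)

print(len(CODON_TABLE))                # number of codons
print(len(set(CODON_TABLE.values())))  # number of distinct outputs (amino acids + stop)
print(translate("ATGGCCTAA"))
```

The redundancy is visible in the two counts: many codons map to the same amino acid, so a mutation in the third letter of a codon often changes nothing.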
Proteins: building blocks of life - other alphabet, 20 characters, amino acids
there is correspondence between alphabet of DNA/proteins -> not known how they fold?
-> databases with annotated sequences of DNA, RNA, proteins - from different sources
various techniques from different fields, among others computational linguistics
in vocabulary to speak about genomes: open reading frame/translation
a gene is a part of the genome that is translated into proteins: only about 1% of the genome codes for proteins, the rest is called 'junk DNA'; some patterns are repeated very often
Some RNAs are called 'non-coding': they are not meant to be translated into proteins and can have functions similar to those of proteins (in RNA, T is replaced by U)
simplified rules
tendency to stack & twist, form stems, RNA folding is believed to be partially hierarchical
intermediate representations of the molecule (focus only on pairings): 90% of the sequence can be represented without overlap in an artificial unfolding by taking out 'pseudo-knots'
direct translation to mathematical trees is possible
try to find consensus structure in segmentation
if there is a mutation in a pairing, the other part of pairing will mutate as well to keep the pairing -> it works well if you have molecules that align well, if you have too many mutations, difficult to align (Olivier has been experimenting on methods to align using alignment constraints in the intermediate levels, trying to use the common structure/intersection of 2 molecules)
-> you can have mutations + insertions + deletions
Language that expresses subgraph/intermediate/secondary structure/folding of the molecule:
it uses a grammar: stochastic context-free grammar
context-free grammar: uses rewriting rules
S -> Sa | Su | Sc | Sg (extends with an unpaired base at the right)
stochastic: each rule has a specific weight
S -> aSu | uSa | etc (extends using the pairings)
S -> SS (extends by branching)
S -> e (epsilon, empty symbol = termination)
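A rough sketch of how such a grammar can generate a sequence together with its folding. The weights below are invented for illustration (real models estimate them from data), only the a-u and c-g pairings are included, and the structure is written in dot-bracket style ('.' = unpaired, matching brackets = paired), which is an assumption of this example and not something from the session:

```python
import random

UNPAIRED = "aucg"
PAIRS = [("a", "u"), ("u", "a"), ("c", "g"), ("g", "c")]

# Hypothetical rule weights (they only need to be positive).
WEIGHTS = {"unpaired": 0.35, "pair": 0.35, "branch": 0.10, "end": 0.20}

def derive(rng, depth=0, max_depth=30):
    """Expand the symbol S once; returns (sequence, structure)."""
    if depth >= max_depth:          # safety net: force S -> epsilon
        return "", ""
    rule = rng.choices(list(WEIGHTS), weights=list(WEIGHTS.values()))[0]
    if rule == "unpaired":          # S -> S x
        seq, struct = derive(rng, depth + 1, max_depth)
        return seq + rng.choice(UNPAIRED), struct + "."
    if rule == "pair":              # S -> x S y, with x-y a valid pairing
        left, right = rng.choice(PAIRS)
        seq, struct = derive(rng, depth + 1, max_depth)
        return left + seq + right, "(" + struct + ")"
    if rule == "branch":            # S -> S S
        seq1, struct1 = derive(rng, depth + 1, max_depth)
        seq2, struct2 = derive(rng, depth + 1, max_depth)
        return seq1 + seq2, struct1 + struct2
    return "", ""                   # S -> epsilon

rng = random.Random(1)
for _ in range(5):
    seq, struct = derive(rng)
    print(seq or "(empty)")
    print(struct)
```

Because the pairing rule always writes both partners at once, every bracket pair in the output is complementary by construction.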
RNA-world hypothesis: in the early stage of life, there was no DNA & protein, the first organisms were RNA only -> still now: many viruses are RNA only
virus attack - 'machinery would express virus only, not genome'
-> try to kill the molecule, or block the expression of the molecule
'epigenetic': everything that is not encoded in the genome, dependent on context
cfr biochemical pathways-metabolic pathways http://www.uz.zgora.pl/~jleluk/animacje/show_thumbnails.pl.htm - possibilities of transformation
cfr /krebs cycle
-> expression of some molecules will favour/inhibit expression of others, explains why genetic twins will evolve differently
Playing part:
-> uses linguistics methods
-> everything is done on syntactic level
-> in translation of RNA to protein: splicing, like in film editing: you cut/remove/paste
you can interpret the result as a sentence or a word
process of alternative splicing: from single RNA you can produce various proteins
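As a toy model of this cut/remove/paste idea: take a sequence cut into segments and list every variant obtained by keeping some segments, in their original order, and dropping the others. This is a deliberate simplification (real splicing is far more constrained), and `splice_variants` is a made-up name:

```python
from itertools import combinations

def splice_variants(segments):
    """List all results of keeping at least one segment, in the original order."""
    variants = []
    for k in range(1, len(segments) + 1):
        # combinations() keeps the original left-to-right order of the segments
        for kept in combinations(segments, k):
            variants.append("".join(kept))
    return variants

segments = ["AUG", "GCC", "UAA"]
for variant in splice_variants(segments):
    print(variant)
```

The same function works on a list of words or phrases, which is where the analogy with film editing and text rewriting comes in.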
-> work with large corpora of texts, automate process on large databases
Computational biology can be inspiring for computational linguistics writing
but there is not only syntax in linguistics... syntax as an approach to look at unconscious use of language (cfr gender classifier)
quantum mechanics molecule model https://en.wikipedia.org/wiki/Atomic_theory#/media/File:Helium_atom_QM.svg
vs rutherford model (http://byjus.com/chemistry/rutherfords-model-of-atoms-and-its-limitations/)
other interests in genome:
pattern matching
repetition scanning
similarities between words/corpora of texts
Freeling
http://nlp.lsi.upc.edu/freeling/
installing
it compares word to dictionary & creates POS trees (function of the words in the sentence, from context) with probability for each tag
ex 'This is a present' -> present = JJ
ex 'This present is bad' -> present = NN but with very low probability
-> maybe 'present' as a NN is missing in the dictionary
Uses Hidden Markov Models/conditional probabilities? Check.
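# analyses POS from the text file and keeps only the 3rd column (POS tags), takes out hard returns, replaces Fc by commas and Fp by dots
# (depending on the installation, analyze may also need a configuration file, e.g. -f en.cfg)
analyze < sturgeon.txt | awk '{print $3}' | tr '\n' ' ' | sed 's/Fc/,/g; s/Fp/./g'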
-> can be nup
-> output looks like DNA code
now we can look for repetitions in the structure
hypothesis: we have very little structure patterns and recycle them
the architecture of syntax says something about the author
result: could be just rules of grammar?
forget the sentence structure/take the entire text as a 'genome': define repetitions as markers, does it say something about the author?
we could look for frequency of n-grams
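A minimal sketch of such an n-gram count over a sequence of POS tags. The tag sequence below is a hand-made example, not actual Freeling output, and `ngram_counts` is a made-up name:

```python
from collections import Counter

def ngram_counts(tags, n):
    """Count every run of n consecutive tags."""
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))

# Hypothetical tag sequence, as it could come out of the pipeline above.
tags = "DT NN VBZ DT JJ NN , DT NN VBZ JJ .".split()

for ngram, count in ngram_counts(tags, 2).most_common(3):
    print(" ".join(ngram), count)
```

Running this for n = 2, 3, 4... on a whole text gives the repeated structural patterns; comparing the counts across authors is one way to test whether they say something beyond the plain rules of grammar.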
-> rewriting by POS analysis of 1 text, into clusters per category (NN, JJ, VB...) and use them to rewrite another text
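A small sketch of this rewriting idea, assuming both texts are already available as (word, tag) pairs. The pairs below are tagged by hand for illustration, and `clusters` and `rewrite` are made-up names:

```python
import random
from collections import defaultdict

def clusters(tagged):
    """Group the words of a tagged text per POS category."""
    pools = defaultdict(list)
    for word, tag in tagged:
        pools[tag].append(word)
    return pools

def rewrite(target, pools, rng):
    """Keep the tag sequence of `target`, but draw each word from the source pools.
    Words whose tag has no pool are left unchanged."""
    return [rng.choice(pools[tag]) if tag in pools else word
            for word, tag in target]

# Hand-tagged toy texts (hypothetical).
source = [("the", "DT"), ("sturgeon", "NN"), ("swims", "VBZ"),
          ("a", "DT"), ("slow", "JJ"), ("river", "NN")]
target = [("this", "DT"), ("present", "NN"), ("is", "VBZ"),
          ("very", "RB"), ("bad", "JJ")]

rng = random.Random(0)
print(" ".join(rewrite(target, clusters(source), rng)))
```

The syntax of the second text stays intact while its vocabulary is taken over by the first, a bit like the viral contamination mentioned in the introduction.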
reducing complexity & hoping that what's left is meaningful
reducing it into its most compact form that could be seen as 'signature' of text, cfr charcoal (output could be shape or specific lay-out)
and try to expand it again
recuperer.lu