General: http://pad.constantvzw.org/p/certainty
Questions: http://pad.constantvzw.org/p/certainty_questions
modality.py close reading: http://pad.constantvzw.org/p/certainty_modality.py_close_reading
Modality paper notes: http://pad.constantvzw.org/public_pad/certainty_notes_Modality-and-Negation
Wednesday 6th-Friday 8th April 2016
-1. orientation on modality
* READING LIST
- General
* Modality and Negation: An Introduction to the Special Issue (2012) - Morante, Roser, and Caroline Sporleder. http://www.anthology.aclweb.org/J/J12/J12-2001.pdf
* Jessica's notes from the course 'Modality in English', UA
- Modality in Scientific texts:
* Hyland, Hedging in Scientific Research Articles: https://benjamins.com/#catalog/books/pbns.54/main
-> p. 227: 'he proposes a pragmatic classification of hedge expressions based on an exhaustive analysis of a corpus. Catalogue of hedging cues includes: modal auxiliaries, epistemic lexical verbs/adjectives/nouns, a variety of non-lexical cues'
* Light, Mark & co, 2004. The language of bioscience http://www.aclweb.org/anthology/W04-3103.pdf
-> 'pioneers in analyzing the use of speculative language in scientific texts'
* Wilbur & co, 2006. New directions in biomedical text annotations http://europepmc.org/articles/PMC1559725 -> specifically about annotation: 'motivated by the need to identify and characterize parts of scientific documents where reliable information can be found'
* Vincze & co, 2008, Bioscope corpus (freely available) http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2586758/pdf/1471-2105-9-S11-S9.pdf
* Morante, Daelemans, Memory-Based Resolution of In-Sentence Scopes of Hedge Cues https://aclweb.org/anthology/W/W10/W10-3006.pdf
* Annotating Modality and Negation for a Machine Reading Evaluation, Roser Morante and Walter Daelemans, 2011
http://www.clips.ua.ac.be/bibliography/annotating-modality-and-negation-for-a-machine-reading-evaluation
http://www.clef-initiative.eu/documents/71612/86377/CLEF2011wn-QA4MRE-MoranteEt2011.pdf
* CONTACT PERSONS
- Roser Morante, Universiteit Amsterdam, currently analysing news articles for modality over time
http://kyoto.let.vu.nl/~morante/ r.morantevallejo@vu.nl
-> ask whether we could work with her data???
- Tom De Smedt, Sint-Lukas Antwerpen, tomdesmedt@gmail.com
- Johan Vander Auwera, module on modality in English, UA
https://www.uantwerpen.be/en/staff/johan-vanderauwera/my-website/
* RESEARCH DESCRIPTION
- Motivation: The word 'text mining' sets the starting point of this project. By invoking the practice of mining, it is suggested that the data lie ready as 'objective minerals' to be picked up by the machines we design for the purpose. If you look more closely at how automated techniques for text analysis work, you see that the choice of data, the way it is annotated, and the fine-tuning of the parameters are of great importance for how and what the machine will learn.
The whole process contains a large number of 'grey' zones, which result from compromises between people, and between people and machines. This raises the question of how we deal with the 'imperfection' of a script, with the experimental nature of the techniques, and with the human impact on the whole process, in terms of energy, time and uncertainty.
- Research: If an algorithm could tell its story from its own perspective, what would that look like? And more precisely, is it possible to make the 'grey' zones visible in that algorithmic narrative perspective? To investigate this, we compare different rule-based and supervised ML classifiers for uncertainty in news articles (where, and to what degree), trained and tested on the same corpora.
-> modality.py as baseline
-> build ML classifier that performs better on same corpus
-> output: certainty yes/no or degree of certainty (%)
An: maybe leave this part out?
[It would be nice to write code in which those grey zones, as they are described in decent scientific papers, are also readable and contextualised. We see a parallel with the way news reports cover applications of automated language-processing techniques. These reports contain vague descriptions and hardly mention the specificity of the techniques used. We consider the language used to describe machine learning techniques (e.g. 'mining') to be misleading.
As a first step in the research we look at modality in scientific articles, and compare our results with those of modality.py.]
WORK PLAN
- definition of modality:
logical or linguistic tradition?
- selection of corpora (MPQA, NewsReader, Wikipedia?)
- selection of existing scripts as examples (Daelemans, Morante, Tom De Smedt, others?)
- setting up a small 'factory' in scikit-learn (see the sketch after this list)
- making mini-corpora out of the existing corpora to get the factory running
- feature generation
- instance generation
- train with different models in scikit-learn
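A minimal sketch of what such a scikit-learn 'factory' could look like: bag-of-words feature generation plus training with different models. The sentences, labels and feature settings below are invented placeholders, not the project's actual mini-corpus or pipeline.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# placeholder mini-corpus: hedged vs. factual sentences
sentences = ["These results may suggest a possible link.",
             "We believe this could perhaps be improved.",
             "The corpus contains 70 documents.",
             "The results are shown in Table 2."]
labels = ["uncertain", "uncertain", "certain", "certain"]

for model in (MultinomialNB(), LogisticRegression()):
    factory = Pipeline([
        ("features", CountVectorizer(ngram_range=(1, 2))),  # feature / instance generation
        ("classifier", model),                               # train with different models
    ])
    factory.fit(sentences, labels)
    print(type(model).__name__, factory.predict(["This might be the case."]))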
Output
ideas:
plugin?
scripts with Markdown?
-> rules/lexicon highlighting
-> highlight top 10 features in the text
-> have sentences of 3rd texts visible as actors
-----------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------
-1.1. Rule-based Example: Understanding modality.py (Pattern for Python)
* close-reading of script modality.py
http://pad.constantvzw.org/p/certainty_modality.py_close_reading
the following variables are used in modality.py (usage sketch below):
n = (k1 * weight1) + (k2 * weight2) + (k3 * weight3) + ...   (a weighted sum)
m = weight1 + weight2 + weight3 + ...   (the sum of the weights)
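A minimal usage sketch, assuming the pattern library is installed; the example sentence is invented. modality() returns a score between -1.0 and +1.0.

from pattern.en import parse, Sentence, modality

s = "These results may suggest a possible link."
s = Sentence(parse(s, lemmata=True))  # modality() expects a parsed sentence with lemmata
print(modality(s))                    # between -1.0 and +1.0; roughly, values > 0.5 read as factual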
* sources of script:
- Celle, A. (2009). Hearsay adverbs and modality, in: Modality in English, Mouton.
Book on English grammar
http://www.degruyter.com/view/product/182779?format=EBOK
- Tseronis, A. (2009). Qualifying standpoints. LOT Dissertation Series: 233.
https://openaccess.leidenuniv.nl/bitstream/handle/1887/14265/LOT%20233.pdf?sequence=2
- Morante, R., Van Asch, V., Daelemans, W. (2010): Memory-Based Resolution of In-Sentence Scopes of Hedge Cues http://www.aclweb.org/anthology/W/W10/W10-3006.pdf
* understanding the context in which modality.py was created
- CoNLL-2010 Shared Task 1 description
http://rgai.inf.u-szeged.hu/conll2010st/tasks.html#task1
- description of weaseling on Wikipedia
https://en.wikipedia.org/wiki/Wikipedia:Writing_better_articles#Avoid_peacock_and_weasel_terms
--> 'Articles including weasel words should ideally be rewritten such that they are supported by reliable sources, or they may be tagged with the {{weasel}} or {{by whom}} or similar templates so as to identify the problem to future readers (who may elect to fix the issue).'
- the choice of Wikipedia weasel words as training data for hedging and modality: is this a choice that contains a compromise, because Wikipedia data is available in large quantities and the weasel words are already annotated by the Wikipedia community?
'We adopt Wikipedia's notion of weasel words which we argue to be closely related to hedges and private states.'
http://www.aclweb.org/anthology/P/P09/P09-2044.pdf
-----------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------
0. data collection
* training set / corpora
ANNOTATED NEWS ARTICLES
MPQA Corpus
This corpus contains 70 documents, news articles and other text documents manually annotated for opinions and other private states (i.e., beliefs, emotions, sentiments, speculations, etc.).
(for more info see also: Modality and Negation: An Introduction to the Special Issue, p. 232)
http://mpqa.cs.pitt.edu/corpora/mpqa_corpus/
paper on this work: http://www-personal.umich.edu/~ebreck/publications/wiebe-aaai-2003.pdf
(downloadable, 4 MB)
version 3.0
last modified December 13, 2015
This release of the corpus contains 70 documents, including the subset of the original MPQA corpus that comes from English-language sources (i.e., that are not translations) and a subset of the OPQA subset
Authors: Lingjia Deng, Janyce Wiebe, Yuhuan Jiang at University of Pittsburgh.
Lingjia Deng email: lid29@pitt.edu
Janyce Wiebe email: wiebe@cs.pitt.edu
Roser Morante's NewsReader Project
http://www.newsreader-project.eu/results/data/
a series of corpora with news articles, among others Wikinews and TechCrunch articles on start-ups :-)
the project also contains a series of tools...
Abstract Meaning Representation (AMR): a sembank (semantic treebank) of over 13,000 English natural language sentences from newswire, weblogs and web discussion forums.
AMR captures “who is doing what to whom” in a sentence. Each sentence is paired with a graph that represents its whole-sentence meaning in a tree-structure. AMR utilizes PropBank frames, non-core semantic roles, within-sentence coreference, named entity annotation, modality, negation, questions, quantities, and so on to represent the semantic structure of a sentence largely independent of its syntax.
https://catalog.ldc.upenn.edu/LDC2014T12
NON-ANNOTATED NEWS CORPORA
- Tipster: https://catalog.ldc.upenn.edu/LDC93T3A
- The English Gigaword Corpus is a comprehensive archive of newswire text data that has been acquired over several years by the Linguistic Data Consortium (LDC) at the University of Pennsylvania. This is the 5th edition of the English Gigaword Corpus https://catalog.ldc.upenn.edu/LDC2011T07
Annotated version 2012 for knowledge extraction and distributional semantics : https://catalog.ldc.upenn.edu/LDC2012T21
WIKIPEDIA
Category with pages containing weasel tags:
https://en.wikipedia.org/wiki/Category:Articles_with_weasel_words
SCIENTIFIC
BioScope training set: 22,000 sentences from biomedical scientific articles, annotated for negation and speculation keywords and their scope within the sentence; provided for CoNLL-2010 Shared Task 1
http://pad.constantvzw.org/p/certainty_modality.py_bioscope_corpus
CoNLL-2010 training sets (BioScope + Wikipedia weasel words):
http://rgai.inf.u-szeged.hu/conll2010st/download.html
TimeBank corpus
An: is this modality??
http://www.timeml.org/timebank/timebank.html
The TimeBank 1.2 Corpus contains 183 news articles that have been annotated with temporal information, adding events, times and temporal links between events and times following the TimeML 1.2.1 specification.
TimeBank 1.2 is free and is distributed by the Linguistic Data Consortium. https://catalog.ldc.upenn.edu/LDC2006T08
--> I tried to download this and created an account, but I have no idea where to actually get the corpus from
* Other Resources
modality lexicon
http://www.umiacs.umd.edu/~bonnie/ModalityLexicon.txt
Our annotation scheme is based on identifying three components of modality: a trigger, a target and a holder.
paper: http://www.lrec-conf.org/proceedings/lrec2010/pdf/446_Paper.pdf
-----------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------
-1.0. intuition
0. data collection for classifier 2
* source 1 -> we decided to narrow our scope and choose texts on 'text mining'
But these are definitely interesting to read :-)
Antoinette Rouvroy: http://works.bepress.com/antoinette_rouvroy/64/
Antoinette 2007: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1013984
Solon Barocas 2015: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2477899
Solon Barocas contribution to book: https://www.nyu.edu/projects/nissenbaum/papers/BigDatasEndRun.pdf
Solon Barocas, Data Mining and the Discourse on Discrimination: https://dataethics.github.io/proceedings/DataMiningandtheDiscourseOnDiscrimination.pdf
Solon Barocas, Big Data’s End Run Around Procedural Privacy Protections (pdf via mail)
Solon Barocas, A Critical Look at Decentralized Personal Data Architectures, 2012 (via mail)
New York Times:
- http://www.nytimes.com/2015/04/07/upshot/if-algorithms-know-all-how-much-should-humans-help.html?_r=0
Soccer Players: https://osf.io/gvm2z/wiki/home/
'Critical' readinglist (Microsoft Lab): http://socialmediacollective.org/reading-lists/critical-algorithm-studies/
* source 2
Specific academic papers that deal with text processing techniques, i.e. annotation, machine learning, rule-based systems, ... Is there a way to find the most cited papers? Balance male/female? Commercial/academic?
Gold standard = papers we already know / were given
Training/Test set = most cited papers on Google
*http://www.cs.cornell.edu/home/llee/papers/pang-lee-stars.pdf Pang & Lee, Seeing stars: exploiting class relationships for sentiment categorization with respect to rating scales (2005)
*http://nmis.isti.cnr.it/sebastiani/Publications/ACMCS02.pdf Machine learning in automated text categorization - F. Sebastiani (2002)
*Daelemans, explanation in stylometry, 2013 http://www.clips.ua.ac.be/~walter/papers/2013/d13.pdf
*https://sites.sas.upenn.edu/danielpr/files/pers2015clpsych.pdf The role of personality, age and gender in tweeting about mental illnesses (World Well Being Project) (2015)
*http://wwbp.org/papers/assessment2013_openvocab.pdf The online social self, an open vocabulary approach to personality (World Well Being Project)
*http://delivery.acm.org/10.1145/2760000/2750548/p1745-trummer.pdf?ip=62.235.68.201&id=2750548&acc=OA&key=4D4702B0C3E38B35.4D4702B0C3E38B35.4D4702B0C3E38B35.5945DC2EABF3343C&CFID=598489658&CFTOKEN=44991827&__acm__=1460117697_ad2ad89e9a1043b6124e9750f2278aeb Mining Subjective Properties on the Web (2015)
NLP research groups
*Google https://research.google.com/pubs/NaturalLanguageProcessing.html
*Microsoft http://research.microsoft.com/en-us/groups/nlp/
* pdf -> txt
- using pdftotext from the command line (a minimal batch sketch below)
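A minimal sketch of that conversion step, assuming the pdftotext tool (poppler-utils) is installed and the collected papers sit in a papers/ folder (the folder name is an assumption):

import glob
import subprocess

for pdf in glob.glob("papers/*.pdf"):
    txt = pdf[:-len(".pdf")] + ".txt"
    subprocess.check_call(["pdftotext", pdf, txt])  # pdftotext <input.pdf> <output.txt>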
0.0. intuition
We read the text Seeing stars: exploiting class relationships for sentiment categorization with respect to rating scales by Pang & Lee (2005) for 30 minutes, and mark the sentences that we think contain modality. We mark them with 'YES, modal'. The other sentences are left untouched.
categories of words that add to the modality of a sentence (according to our intuitive reading):
- adding vagueness
- describing a possibility
- adding clarity
- confirming
- describing a necessity
* jargon
words we rated for modality, based on intuition:
- ...
1. modality.py - classifier 1
* set a hypothesis
* apply modality.py (batch sketch after this list) on:
** 1 text
** 10 texts
** 50 texts
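A minimal sketch for applying modality.py to a batch of texts, assuming the texts are already converted to plain .txt files in a texts/ folder (the folder name and the per-text averaging are assumptions, not the agreed method):

import glob
from pattern.en import parsetree, modality

for path in glob.glob("texts/*.txt"):
    with open(path) as f:
        text = f.read()
    # one modality score per parsed sentence, then a simple average per text
    scores = [modality(sentence) for sentence in parsetree(text, lemmata=True)]
    if scores:
        print(path, sum(scores) / len(scores))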
1.0. intuition
2. start classifier 2
* select a classifier and set a goal
* annotation/golden standard
* training
* validation / testing (see the split-and-evaluate sketch after this list)
** 80% - training
** 20% - testing
** annotation / golden standard
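A minimal sketch of the 80/20 split and evaluation step, assuming a small annotated gold-standard list of sentences and labels; the data below are invented placeholders, not the real gold standard.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# placeholder gold standard
sentences = ["These results may suggest a possible link.",
             "It could perhaps indicate a trend.",
             "This seems to imply an effect.",
             "The corpus contains 70 documents.",
             "The results are shown in Table 2.",
             "We used a linear classifier."]
labels = ["uncertain", "uncertain", "uncertain", "certain", "certain", "certain"]

# 80% training / 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    sentences, labels, test_size=0.2, random_state=0, stratify=labels)

clf = Pipeline([("features", CountVectorizer()),
                ("classifier", LogisticRegression())])
clf.fit(X_train, y_train)                                   # training
print(classification_report(y_test, clf.predict(X_test)))   # validation / testing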
2.0. intuition