THE ANNOTATOR

This report started out as a README to accompany the published feature pattern.en.paternalism, an addition we wanted to propose to the Pattern natural language processing toolkit for Python. The feature would detect whether, and to what extent, a text could be considered 'paternalist'.

“Machine-learning algorithms that partially automate data processing still need to be trained for every new form, or every new kind of topic the algorithm might deal with. [...] Such work of alignment is not a bug — it is the condition of possibility for keeping humans and automation working in the same world.” [http://www.publicbooks.org/nonfiction/justice-for-data-janitor]

Motivation
We developed the pattern.en.paternalism feature during Cqrrelations, a worksession that offered poetry to the statistician, science to the dissident and detox to the data-addict. [http://www.cqrrelations.constantvzw.org] Artists, academics, programmers and designers worked on impure, missing, invisible, broken or suspicious data. As we slowly got to grips with the practice of data-mining, and more specifically with the text-mining software package Pattern [http://www.clips.ua.ac.be/pattern], we understood that this practice typically depends on the following interrelated elements:


Pattern
Pattern is a popular text mining module for the Python programming language. The module is developed by CLiPS (Computational Linguistics & Psycholinguistics research center), associated with the Linguistics department of the Faculty of Arts of the University of Antwerp. CLiPS states on its website that 'Most of the CLiPS research is based on competitively acquired research funding' and that its goal is 'to produce internationally recognized top research'.

[IMAGE: http://www.clips.ua.ac.be/media/pattern_schema.gif]

Pattern offers tools to process on-line sources such as Google search results, Tweets and Wikipedia pages. It includes tools for data mining, natural language processing, machine learning, network analysis and visualization. The module is licensed under a BSD license [http://www.linfo.org/bsdlicense.html] and comes with many examples, such as 'Summarization', 'Style detection' and 'Finding negation and speculation', which allowed us to interrogate many of the elements involved in the practice of text-mining.

The Annotator

From the start we were interested in how a Gold Standard is established, a paradoxical situation in which human input is both considered a source of truth and made invisible. Annotation here means the manual work of 'scoring' large amounts of data that can then be used for 'training' algorithms. This scored data becomes the reference against which the algorithm is trained and tested. The Annotator is typically a student or a Mechanical Turk worker. Sometimes the work has already been done for another reason, as in the case of the sentiment analysis algorithm, where the Gold Standard for deciding between positive and negative language patterns is based on a large corpus of movie reviews, each paired with an explicit rating of the movie described.
Amid the solution-oriented and mystifying descriptions of the text-mining algorithms we looked at, the actual conditions, context and work of annotation felt surprisingly undervalued and under-documented. Only in a few cases, often hidden deep in the software sources, did we find descriptions of the method of annotation.

It seems that annotation always implies a contextual perspective. Scoring sources is also time-consuming and boring; it can only speed up when the annotator does not doubt her opinions. Through the development of pattern.en.paternalism we wanted to both experience and challenge this practice. Our decision to work with a contested 'polarity' such as paternalism was of course deliberate.

We wanted to:


pattern.en.paternalism

The Annotators decided to work on a controversial topic, one that would produce disagreement and force each of the Annotators to question his or her understanding of the 'common sense' around it. The Annotators therefore took into account that the sources had to be interesting enough to spend time on, realizing that the desired outcome related to the subject we chose and the sources we selected. The data itself had to trigger discussion and debate. Paternalism had different connotations in each of the Annotators' native languages, but the fact that the subject worked so well with the library's name was the deciding factor.

The Annotators selected 20 sources for their dataset (see appendix). From these sources, 600 paragraphs were selected. We did not use the ingestion tools available through Pattern, because we were interested in specific data. For Gutenberg sources, paragraphs were scraped automatically. For Wikipedia sources, Annotators copy-pasted the paragraphs into a spreadsheet by hand. Paragraph titles and graphic elements were ignored.
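The report does not record how the Gutenberg paragraphs were scraped. As a minimal sketch, assuming the plain-text files separate paragraphs with blank lines, the step could look like the following. The function name and the min_words threshold (a rough way to skip titles) are hypothetical and not taken from the project.

```python
import re

def split_paragraphs(text, min_words=20):
    """Split plain text into paragraphs on blank lines.

    min_words is a hypothetical threshold used to skip headings and
    other short fragments; it is not taken from the original project.
    """
    # Normalize line endings, then split wherever blank lines occur.
    blocks = re.split(r"\n\s*\n", text.replace("\r\n", "\n"))
    paragraphs = []
    for block in blocks:
        # Collapse hard-wrapped lines into a single-line paragraph.
        paragraph = " ".join(block.split())
        if len(paragraph.split()) >= min_words:
            paragraphs.append(paragraph)
    return paragraphs

# Toy input, invented for illustration.
sample = "CHAPTER I\n\nFirst paragraph, wrapped\nover two lines.\n\n\nSecond paragraph here."
print(split_paragraphs(sample, min_words=3))
```

A word-count threshold is a blunt instrument: it would also drop short but meaningful paragraphs, which is itself one of the small normalizations this report is about.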

We did not anonymize the dataset, and relied on the fact that Annotators could link individual records with specific authors. We decided to ask the Annotators to take the context of the source (date of publication, author etc.) into account while scoring the data for paternalism.

Instructions for The Annotators

The annotation process was guided by two desires. We wanted to produce, in one day, enough results to be usable in Pattern, and also to create space for discussing the interpretation of every paragraph and for documenting these discussions. To this end we came up with the following guidelines:


About The Annotators

We made an attempt to anonymize The Annotators without ignoring their specific cultural backgrounds and particular interests. This means that we wanted to link individual annotations with additional information about the annotator's point of view, to provide context to their scores.


Method

[IMAGE: ANNOTATORS SEEN FROM BACK]

Once The Annotators settled on the selected sources, guidelines and make-up of the annotation team, we started scoring the dataset. Some meta-notes on the annotation process:


The Removal of Pascal

Once the dataset was scored we could start establishing a classifier for detecting pat(t)ernalism. #step-by-step description of the process of training and retraining#
While training our K-Nearest Neighbour algorithm, results seemed skewed towards a few French terms in the sources, most notably 'autre'. Closer scrutiny revealed the term was part of a quote of Blaise Pascal used in 'How to observe morals and manners' by Harriet Martineau, one of the sources used. Since our algorithm was not performing according to our expectations, we decided to remove the paragraph that created the unwanted result. This is the sentence that was removed:


Unfortunately, The Removal of Pascal did not improve the performance of our algorithm.
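How a single rare term can steer a nearest-neighbour classifier can be shown with a small sketch. This is not the classifier we trained with Pattern; it is a plain-Python illustration assuming bag-of-words vectors and cosine similarity, and all training texts below are invented toy data.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words vector: lowercase word -> count."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_classify(text, training, k=3):
    """Return the majority label among the k most similar training texts."""
    query = vectorize(text)
    ranked = sorted(training,
                    key=lambda item: cosine(query, vectorize(item[0])),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy data, invented for illustration; labels follow the -1/0/1 scheme.
training = [
    ("the state must protect the people from themselves", 1),
    ("workers organize freely without masters", -1),
    ("machines were built in the factory", 0),
    ("autre temps autre lieu", 1),
]

# A query that shares only the rare word 'autre' is pulled to that paragraph.
print(knn_classify("un autre exemple", training, k=1))

# Removing the offending paragraph is a plain filter on the training data.
filtered = [(text, label) for text, label in training
            if "autre" not in text.split()]
print(len(training), len(filtered))
```

With so little data, a word that occurs in only one paragraph becomes a near-perfect predictor of that paragraph's label, whatever the word means.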

A process of normalization

When writing up this report a few months later, we remember how many times we were told that in text-mining too, 'there is no free lunch'. Even when algorithms promise universal and indisputable outcomes, there is always a need to tailor data and its treatment to achieve them. In other words, while the practice of text-mining seems full of normalizing processes, out there is supposedly a treasure trove of discoveries that we could not have dreamt up on our own.

Looking back on our modest experiment, we start to see how the process of creating an algorithm feeds into a self-fulfilling narrative of necessity and relevance in relation to desired, possible and applicable results. The removal of Pascal is just one example in our own process, which included many moments of normalization.

For text-mining to work, normalization needs to happen on many interconnected levels. The available dataset needs to be aligned with the desired outcome (or the desired outcome needs to be aligned with the available sources). The Gold Standard needs to validate the training data, while the training data needs to validate the Gold Standard. For example, available sources include online reviews of goods, and desired outcomes include sentiment analysis of what people think of products.

Text-mining is an industry aimed at producing predictable, conventional and plausible results. In other words it is about avoiding exceptions, uncertainties and surprises.  At the same time it promises to have overcome ideology and the need for models, but relies on the extrapolation of the common sense of The Annotator.


pattern.en.paternalism was developed by Catherine Lenoble, Anne Laforet, Femke Snelting, Roel Roscam Abbing, Manetta Berends, Julie Boschat Thorez, Cristina Cochior, Maxigas and Johnny xxx

Report delivered by Roel Roscam Abbing and Femke Snelting

APPENDIX

Definitions of paternalism

From: https://en.wikipedia.org/wiki/Paternalism
Paternalism (or parentalism) is behavior, by a person,  organization or state, which limits some person or group's liberty or  autonomy for that person's or group's own good. Paternalism can also imply that the behavior is against or regardless  of the will of a person, or also that the behavior expresses an attitude  of superiority. 
The word paternalism is from the Latin pater for father, though paternalism should be distinguished from patriarchy. Some, such as John Stuart Mill,  think paternalism to be appropriate towards children: "It is, perhaps,  hardly necessary to say that this doctrine is meant to apply only to  human beings in the maturity of their faculties. We are not speaking of  children, or of young persons below the age which the law may fix as  that of manhood or womanhood." Paternalism towards adults is sometimes thought to treat them as if they were children.
Examples of paternalism include laws requiring the use of motorcycle helmets, a parent forbidding their children to engage in dangerous  activities, and a psychiatrist confiscating sharp objects from someone  who is suicidally depressed.

From: https://fr.wikipedia.org/wiki/Paternalisme
Paternalism is a political doctrine that holds it to be morally desirable for a private or public agent to be able to decide on behalf of another for that person's own good. This doctrine is opposed to liberalism.
For example, when the state forbids agents to smoke or drink, it pursues a paternalist policy. From a liberal point of view, one cannot seek to do good to an individual against his or her will.
Paternalism is an attitude that consists in behaving like a father towards other people over whom one exercises or attempts to exercise authority. This attitude can be deliberate, or involuntary and unconscious.
The term is used notably in fields such as economics, morality or politics. One then speaks of economic, moral, political or social paternalism, etc.
The paternalist attitude amounts to treating adults as children. A paternalist infantilizes those over whom he exercises, or seeks to exercise, authority. Conversely, it can be argued that it is because they are already childlike that a paternalist tendency is provoked in return.

From: https://nl.wikipedia.org/wiki/Paternalisme
Paternalism refers to an attitude or policy comparable to the hierarchical family pattern in which the father (pater in Latin) is at the head of the family and makes decisions for the other family members (wife and children), even when that decision does not agree with what they wish.
Paternalism is the conduct of a government towards the people, or of a dominant people in foreign territory (a colony or former colony), or of an authority figure acting as a father or guardian who means well towards the people, his children or wards, but gives them no meaningful influence over their own affairs.

From: http://dexonline.ro/definitie/paternalism (there is no entry for Paternalism in the Romanian Wikipedia)
Paternalism, n. 1. (Political economy) A conception denoting the interest that employers show in the welfare of workers or in the family atmosphere within the enterprise; relations between employers and workers characterized by mutual affection, authority and respect. 2. Excessive protection or guardianship of one's own child. From Fr. paternalisme.

Meta-mining

After one day, 244 paragraphs were classified and ready for training.
Annotators disagreed on whether a paragraph was paternalist on 49 occasions, bringing the annotator disagreement rate to approximately 20.08% (49 of 244).

Group A (001, 004, 007):  
Paragraphs scored: 174. 20 of those paragraphs were annotated by 1 person (and not taken into account) 
Disagreements: 18
Noise: 7

Group B (002, 005, 008):  
Paragraphs scored: 55. 5 of those paragraphs were annotated by 1 person (and not taken into account)
Disagreements: 21
Noise: 2

Group C (003, 006, 009): 
Paragraphs scored: 61. 10 of those paragraphs were annotated by 1 person (and not taken into account)
Disagreements: 10
Noise: 2
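
The totals reported at the top of this section can be reproduced from the per-group figures. The sketch below assumes that the 244 classified paragraphs are those scored by more than one annotator and not marked as noise; this reading is inferred from the numbers and is not stated elsewhere in the report.

```python
# Per-group figures copied from the lists above.
groups = {
    "A": {"scored": 174, "single": 20, "disagreements": 18, "noise": 7},
    "B": {"scored": 55, "single": 5, "disagreements": 21, "noise": 2},
    "C": {"scored": 61, "single": 10, "disagreements": 10, "noise": 2},
}

def total(key):
    """Sum one figure over all three groups."""
    return sum(group[key] for group in groups.values())

# Assumption: classified = scored - annotated by one person - noise.
classified = total("scored") - total("single") - total("noise")
disagreements = total("disagreements")
rate = 100.0 * disagreements / classified

print(classified, disagreements, round(rate, 2))
```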

Annotation files

A : unique ID
B : url of the source
C : title of the source
D : year of publication
E : paragraph (content)
F : the ID number of the annotator
G : classifier (-1/0/1/x)
H : comment

name: main-the-annotator-paragraphs-[ID-number].ods 
example: main-the-annotator-paragraphs-005.ods

Annotation results

x = noise 
d = disagreement 
n = not annotated 
p = annotated by 1 person
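
One way to read this legend is as the result of merging the scores that the annotators in a group gave to the same paragraph. The sketch below is an assumption about how that merge could work, not a record of the script we used. In particular, treating a single 'x' as enough to mark a paragraph as noise, and treating any difference in score as a disagreement, are hypothetical rules.

```python
def merge_scores(scores):
    """Combine one paragraph's scores from a group of annotators.

    scores holds one entry per annotator: -1, 0, 1, "x" (noise),
    or None when the annotator did not score the paragraph.
    """
    given = [s for s in scores if s is not None]
    if not given:
        return "n"   # not annotated
    if len(given) == 1:
        return "p"   # annotated by 1 person
    if "x" in given:
        return "x"   # noise (assumed: one 'x' is enough)
    if len(set(given)) > 1:
        return "d"   # disagreement
    return given[0]  # all annotators agree on this score

print(merge_scores([1, 1, None]))
print(merge_scores([1, 0, -1]))
```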

All annotations:
https://gitorious.org/cqrrelations/cqrrelations/source/f86b6aec968a58103f59a931a31939d92906897f:share/the-annotator/all-annotations-abc.html

Comments

List of paragraphs that are classified as paternalistic, combined with the notes that were taken during the annotation process:
https://gitorious.org/cqrrelations/cqrrelations/source/f86b6aec968a58103f59a931a31939d92906897f:share/the-annotator/paternalism-classifications.html

Examples:
annotator 004 on paragraph #442 : "What gives Bernard Shaw the aptitude to reveal the deep nature of men and woman?" 
annotator 007 on paragraph #549 : "Reduces questions of political agency to physiological problems." 
annotator 008 on paragraph #334 : "Analysis of individual's history and philosophical outlook." 
annotator 003 on paragraph #416 : "strenously controlling sex" 
annotator 006 on paragraph #416 : "For 1908 it raises feminist issues" 
annotator 009 on paragraph #416 : "I write 0 not because it's neutral but as a kind of balance as i  couldn't choose between -1 and 1. there are elements that can be  considered emancipatory, against paternalism (the text is from 1908),  but there are also  elements which are paternalist as well."

Disagreements

List of paragraphs that were disagreed on by The Annotators, and so are not taken into account in the training, combined with the notes that were taken during the annotation process:
https://gitorious.org/cqrrelations/cqrrelations/source/f86b6aec968a58103f59a931a31939d92906897f:share/the-annotator/disagreement-list-selection.html

Data

Gutenberg project

J. B. Bury, The Idea Of Progress, 1920, http://www.gutenberg.org/cache/epub/4557/pg4557.txt
Maud Churton Braby, Modern Marriage and How To Bear It, 1908, https://www.gutenberg.org/files/31529/31529-0.txt
Harriet Martineau, How to Observe Morals and Manners, 1838, http://www.gutenberg.org/cache/epub/33944/pg33944.txt
Irwin Edman, Human Traits and their Social Significance, 1920, http://www.gutenberg.org/cache/epub/22306/pg22306.txt
James Hayden Tufts, The Ethics of Cooperation, 1918, http://www.gutenberg.org/cache/epub/29508/pg29508.txt
James Harvey Robinson, The Mind in the Making: The Relation of Intelligence to Social Reform, 1921, http://www.gutenberg.org/cache/epub/8077/pg8077.txt
Helen Kendrick Johnson, Woman And The Republic, 1897, https://www.gutenberg.org/cache/epub/7300/pg7300.txt
Charles Darwin, On the Origin of species, 1859, http://www.gutenberg.org/cache/epub/1228/pg1228.txt
Emma Goldman, Anarchism and other essays, 1910, http://www.gutenberg.org/cache/epub/2162/pg2162.txt
John F. Hume, The Abolitionists (Together With Personal Memories Of The Struggle For Human Rights), 1830-1864, http://www.gutenberg.org/cache/epub/13176/pg13176.txt

Wikipedia

Mining: https://en.wikipedia.org/wiki/Mining
Textile Industry: https://en.wikipedia.org/wiki/Textile_industry
History of computing hardware: https://en.wikipedia.org/wiki/History_of_computing_hardware
Marissa Mayer: https://en.wikipedia.org/wiki/Marissa_Mayer
Larry Page: https://en.wikipedia.org/wiki/Larry_Page
Liberty: https://en.wikipedia.org/wiki/Liberty
Choice: https://en.wikipedia.org/wiki/Choice
Sabotage: http://en.wikipedia.org/wiki/Sabotage
Social Darwinism: http://en.wikipedia.org/wiki/Social_Darwinism
Anarchism: https://en.wikipedia.org/wiki/Anarchism