190125_algolit_propaganda_detection

__NOPUBLISH__

Algolit 25 January 2019
Tim, Guillaume, Hans, Gijs, An, Cristina, Javier

Propaganda detection
https://www.datasciencesociety.net/hack-news-datathon-case-propaganda-detection/
dataset: https://s3.us-east-2.amazonaws.com/propaganda-datathon/dataset/datasets-v3.zip
What is propaganda according to the organisers of this hackathon: http://propaganda.qcri.org/annotations/definitions.html

Towards automatic censoring, creates different public sphere
How they define propaganda: rhetorical figures, 18 differentiations
Is it possible to start classifying on ways of representing/rhetorical figures
Critical reading: check dataset

3 tasks for the hackathon:
    - sort entire article as propaganda or not: 60.000 articles automatically labeled by source
    - labeling on sentence level: 450 articles annotated by hand on different types, using flowchart http://propaganda.qcri.org/annotations/
    - recognize the rhetorical construct and locate it in the sentence
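
What "labeled by source" in the first task amounts to can be sketched as follows. This is only an illustration of the principle, not the organisers' code: the source names and the function label_by_source are made up.

```python
# Hypothetical source list: these names are placeholders, not the datathon's sources
PROPAGANDA_SOURCES = {"source-a.example", "source-b.example"}


def label_by_source(articles):
    """Give every article the label of its source, without reading the text."""
    return [
        (text, "propaganda" if source in PROPAGANDA_SOURCES else "non-propaganda")
        for source, text in articles
    ]


# Made-up articles as (source, text) pairs
articles = [
    ("source-a.example", "A neutral weather report."),
    ("source-c.example", "A heavily loaded opinion piece."),
]
for text, label in label_by_source(articles):
    print(label, "|", text)
```

In this sketch the neutral text gets the label 'propaganda' purely because of where it was published, which is the weakness of labeling at source level.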

Based on research from Qatar Computing Research Institute (QCRI): http://propaganda.qcri.org/
It can be used in a lot of different contexts, but these are sensitive areas
demo: http://proppy.qcri.org/
annotation guide: http://propaganda.qcri.org/annotations/
results of experiments: https://docs.google.com/spreadsheets/d/1Ni5DILQhHIRk_Jy-QNMwT9aCTO9SLFUKa5eLnPthLno/edit#gid=327118274

https://www.vox.com/world/2017/6/6/15739606/saudi-arabia-ties-qatar-trump

"Despite Al Jazeera being considered to be one of the Middle East's most open media outlets,[3] Qatari authorities enforce stringent restrictions on freedom of local media, including censoring internet services and outlawing criticism of the ruling family in the media."
https://en.wikipedia.org/wiki/Media_of_Qatar

Tim: Is length of article taken into account? They vary in length a lot

Gijs: Qatar in press war with Saudi Arabia, which is supported by US
It seems they want to highlight the techniques used
Hans: knew institute from before, based on digital humanitarianism, using Twitter streams and detecting where emergency actions are needed
In EU, you would go for research tools to screen paedophilia...
US works with Thinktanks, helping FB to label 'Russian activists'... in that context, this is sophisticated way of doing it
Even if you get it working automatically, it might not be a good idea
-> the annotation is crucial, if it is done by US white middle class men, it will have a serious bias; should be done by diverse group and each article annotated by different people
-> you need to have a background in order to be able to do this, f.ex. sentence 7 in article111111112.txt
-> list of propaganda techniques is interesting: square boxes = 'we enter territory of....'

Propaganda & fake news go together, since last elections in US, an issue

Exercise:
    * we look at article in tasks 2-3 /train /article111111112.txt
    * check task2_labels where they use propaganda: the 5th sentence is propaganda 'Pamela Geller and Robert Spencer co-founded anti-Muslim group Stop Islamization of America.'
    -> empty lines are also counted as sentences
    * check task3_labels to see in which position of the text which technique is used (character count)

    ~~They missed one:~~
        ~~line 7 is declared as propaganda, but character count points at sentence 11~~
        the character counting is not in order
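
To compare the line numbers in task2_labels with the character counts in task3_labels, the offsets can be sorted and translated to line numbers. A minimal sketch, assuming a labels line is tab-separated as document id, technique, start, end (the order used by the script further down); the text and labels here are made up, and the function names are ours.

```python
def line_of_offset(text, offset):
    """Return the 1-based line number that contains character `offset`."""
    return text.count("\n", 0, offset) + 1


def spans_in_reading_order(label_lines):
    """Parse 'doc<TAB>label<TAB>start<TAB>end' lines and sort them by start offset."""
    spans = []
    for line in label_lines:
        if not line.strip():
            continue
        doc, label, start, end = line.rstrip("\n").split("\t")[0:4]
        spans.append((int(start), int(end), label))
    return sorted(spans)


# Made-up text and labels, only to show the mechanics.
# The empty second line counts as a line, as in the dataset.
text = "First line.\n\nThird line is dead.\nFourth line.\n"
labels = ["x\tSlogans\t33\t45\n", "x\tLoaded_Language\t27\t31\n"]

for start, end, label in spans_in_reading_order(labels):
    print(line_of_offset(text, start), label, repr(text[start:end]))
```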

Looking at flowchart
'do you need external information to judge?' - yes - stop judging : too much expertise needed in order to make interpretation?

Is this flowchart taking into account the case where fake information is presented in a rational way? No, but then again wrong theories discarded by the evolution of sciences were not propaganda, just wrong.

Next: line 15
'Ms Geller, of the Atlas Shrugs blog, and Mr Spencer, of Jihad Watch, are also co-founders of the American Freedom Defense Initiative, best known for a pro-Israel "Defeat Jihad" poster campaign on the New York subway.
On both of their blogs the pair called their bans from entering the UK "a striking blow against freedom" and said the "the nation that gave the world the Magna Carta is dead".'
https://en.wikipedia.org/wiki/Magna_Carta
the character count goes to "the nation that gave the world the Magna Carta is dead": This is a statement, 'it does not expose a rational argument', it is as a conclusion/consequence to the first?
"a striking blow against freedom" is considered non propaganda, because it is a rational argument? the ban is a blow against freedom
First example where several members of the team would not have labelled it the same way it is in the dataset.
Is the flowchart a rigid tool or more of a guide? Presented as a guide.

Is the sentence-level dataset labelled article by article, or are labellers given sentences independently?

Can you say whether something is a rational argument, when you have just one sentence?
In logics: yes

In our text, there are no sentences that are left unlabelled, no NaN.
Is task number 2 made with the same flowchart as task number 3? No info on this, so it might be that they did not use the same tool for both tasks.

Next document article111111113.txt
propaganda:
    pinned blame for Steinle's death on illegal immigration and insufficiently aggressive deportation policies.

G: over-simplification, we do it all the time
T: important to check the tendency of the entire article
J: non propagandistic at the end is intentional
H: indication of certain rhetorical technique does not mean that it is propaganda

Hard to use the propagandistic techniques classification to classify an article, because the author could talk about/ quote/ use propagandistic techniques in order to explain or contradict them. These would be spotted by a sentence classifier, but it would not mean that the whole article is propaganda.
Is it a goal to create filter? could be browser plugin highlighting sentence you read as using certain rhetorical technique
But they frame it as 'propaganda techniques', not 'rhetorical techniques'
what they list are 'false arguments'
slippery business

In some aspects, the idea of filtering propaganda seems "too simple"/ crude/ "simplistic" and is described as such in the first task on the hackathon.

Goal of datathon?
promotion of research institute? recruiting? get different perspectives on topic
Google is flagging websites, using wikipedia as reference - puts pressure on Wikipedia, because people go there to edit
Task 1 is obvious, simple way to go: annotating by source; task 2 and 3 show where it can go - they can use results for next model generation

Interesting to think about the context of this hackathon and how they frame it. They don't seem to try to come up with a final propaganda/non-propaganda filter, but rather with tools to assess any text with regard to rhetorical, grammatical and lexical techniques that are often used in propaganda.
The way they advertise it is by explaining how effective fake news is in spreading. So it frames it in a certain way.

Annotating interface:
https://www.tanbih.org/

No differentiation between propaganda & critique on propaganda
every scientist is using 'causal oversimplification'...
can we find other similar tools

script to look at what is indicated as propaganda (task 3):
#!/usr/bin/python
# Print each annotated span of one article next to its task 3 label.

import sys

if len(sys.argv) < 3:
    sys.exit("Usage: %s <spans tsv file> <input text 1>" % (sys.argv[0]))

span_file = sys.argv[1]
file1 = sys.argv[2]

# each line of the labels file: document id, label, start offset, end offset
with open(span_file, "r") as f:
    spans = [line.rstrip().split("\t")[0:4] for line in f if line.strip()]

#with open(file1, "r", encoding="latin-1") as f:
with open(file1, "r") as f:
    s1 = f.read()

for doc, label, start, end in spans:
    print("%s\t%s\t%s\t%s\t%s" % (doc, label, start, end, s1[int(start):int(end)]))

-------

#!/usr/bin/python
# Same extraction for every article in the current directory,
# written to ../output.txt instead of printed.

import glob
import os.path

with open('../output.txt', 'w') as out:
    for textfile in glob.glob('*.txt'):
        # article111111112.txt goes with article111111112.task3.labels
        basename = os.path.splitext(os.path.basename(textfile))[0]
        span_file = '{}.task3.labels'.format(basename)

        with open(span_file, "r") as f:
            spans = [line.rstrip().split("\t")[0:4] for line in f if line.strip()]

        #with open(textfile, "r", encoding="latin-1") as f:
        with open(textfile, "r") as f:
            s1 = f.read()

        for doc, label, start, end in spans:
            out.write("%s\t%s\t%s\t%s\t%s\n" % (doc, label, start, end, s1[int(start):int(end)]))

##################
####Sort by label####
##################

#!/usr/bin/python
# Write every span with its label to ../output.txt, then the same spans
# grouped per label to ../output-grouped.txt

import glob
import os.path

grouped = []

with open('../output.txt', 'w') as out:
    for textfile in glob.glob('*.txt'):
        basename = os.path.splitext(os.path.basename(textfile))[0]
        span_file = '{}.task3.labels'.format(basename)

        with open(span_file, "r") as f:
            spans = [line.rstrip().split("\t")[0:4] for line in f if line.strip()]

        #with open(textfile, "r", encoding="latin-1") as f:
        with open(textfile, "r") as f:
            s1 = f.read()

        for doc, label, start, end in spans:
            sample = s1[int(start):int(end)]
            out.write("{} → {}\n".format(label, sample))
            # the label is split on commas, so a combined label is filed under each part
            for l in label.split(','):
                grouped.append((l, sample))

grouped.sort(key=lambda l: l[0])

with open('../output-grouped.txt', 'w') as out:
    prev = None
    for (label, sample) in grouped:
        # start a new section each time the label changes
        if label != prev:
            prev = label
            out.write('\n\n{}\n===========================\n'.format(label))
        out.write('{}\n'.format(sample))

After lunch:
Algolit podcast stories https://pad.constantvzw.org/p/algolit-exhibition-mons-podcasts
Gender bias tweeterbot
Eliza cfr automatist.org

Memo Akten word of math
http://www.memo.tv/portfolio/word-of-math-word-of-math-bias/
https://twitter.com/wordofmathbias

Artwork where machines "create" their own language:
http://www.dasfremde.world/#about

http://www.crowdsourcedintel.org/
http://www.derekcurry.com/projects/publicdissentiment.html

Internet initially text based, now images, going to voice
sthg on research Nicolas: images & ML also go through text, annotations, image descriptions

influence on analog writing, ex graffiti
tattoo of wifi sign on hand - public space Schaarbeek

automatic cv selection // illusion of avoiding human judgement
embeds bias
ex Amazon: most predictive feature was if you are man or woman, in trainingdata/previous knowledge, most hired people were men
belief that anything newly made will be an improvement
argumentation is done through scientific research (?)
ex Cambridge Analytica, based on sociologists doing research in human behaviour/Personality profiling and implementing it
vectorized humans linked to specific behaviour, creates truth, companies selling the truth

Jorge Luis Borges, Chinese encyclopedia listing weird categories
Foucault, Les mots et les Choses, talks about it as a space where basic common categorization space/understanding is broken, like aphasia patient not able to categorize anymore
cfr artists who used the categories and trained a model with it: http://ssbkyh.com/works/animal_classifier/
-> you can create any model as long as you have categories and create the training data for it.

See for rest of discussion: https://pad.constantvzw.org/p/algolit-exhibition-mons-podcasts

---Back to Propaganda---

Types of propaganda:

Appeal_to_Authority
Appeal_to_fear-prejudice
Bandwagon
Black-and-White_Fallacy
Causal_Oversimplification
Doubt
Exaggeration,Minimisation
Flag-Waving
Loaded_Language
Name_Calling,Labeling
Obfuscation,Intentional_Vagueness,Confusion
Red_Herring
Reductio_ad_hitlerum
Repetition
Slogans
Straw_Men
Thought-terminating_Cliches
Whataboutism

we run script that was posted by someone on mailinglist of Datathon (line 109) feeding it with task3.labels & full txt
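
To see how often each of the techniques listed above occurs, the spans can also be counted per label. A minimal sketch under the same assumption about the labels file (tab-separated: document id, technique, start, end); the sample lines are made up and count_techniques is our own name.

```python
from collections import Counter


def count_techniques(label_lines):
    """Count spans per technique; the label is kept exactly as written in the file."""
    counts = Counter()
    for line in label_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 4:
            counts[fields[1]] += 1
    return counts


# Made-up label lines, only to show the mechanics
sample = [
    "111\tLoaded_Language\t10\t20\n",
    "111\tName_Calling,Labeling\t30\t42\n",
    "112\tLoaded_Language\t5\t9\n",
]

for technique, n in count_techniques(sample).most_common():
    print(n, technique)
```

Unlike the sort-by-label script, this keeps a label such as Name_Calling,Labeling as one category instead of splitting it on the comma.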

16. Bandwagon
Linked to the rise of behavioral economics, or nudges applied in public policy.
will you do the same as the majority?
cfr FB 'go to vote now!'

QUESTIONS
- where is the source of the article? who has written it? Authorship in annotated data?
- credits of annotations?

___ROUND_UP___
How did they do the scraping/cleaning? Guillaume found ads in the texts
It also seems like many of them are far-right articles: is this a bias in the dataset? Answer: it's a US dataset, with topics related to Trump
This would bias the way propaganda is understood.
How did they choose the article/source? Far left texts could also be interesting.

A lot of the sources are texts that quote/refer to propaganda material and are labelled as propaganda as a result, although they are reports about facts.
The selected texts are only referring to a particular moment (and place) in time
Annotation is very contextual, complex to automate that

similarity with advertising tactics / type of language

idea to sort by label

narrative feeling to labels, you can easily make story out of it

---
Hans' follow up email:

Hello,

as promised a bit of follow-up on the propaganda hackathon:

Solutions of the participating teams can be found at
https://www.datasciencesociety.net/category/learn/. The articles are
varying in quality and not always easy to understand, but in general
contain links to their code.

The contributions of the winning teams are:

1.
https://www.datasciencesociety.net/datathon-hacknews-solution-pig-propaganda-identification-group/

2.
https://www.datasciencesociety.net/detecting-propaganda-on-sentence-level/

You can look to presentations about the 10 finalist teams on
https://www.youtube.com/watch?v=Z3MG4JNgAT8, but it is a pain to watch
and listen ;-)

In general teams used different variants of word embeddings as basis to
represent the data. Several of them used deep learning methods, but
often also other techniques. The whole shows that the word embeddings
landscape is becoming more diverse and in full development.

One new embedding method that is used by several teams, and which I did
not know before, is BERT (Bidirectional Encoder Representations from
Transformers). This is developed by Google and pre-trained multilingual
models are available on https://github.com/google-research/bert. A
background article can be found on https://arxiv.org/abs/1810.04805.
Other pieces of software allowing you to use it more easily are
https://bert-as-service.readthedocs.io/en/latest/index.html and
https://github.com/huggingface/pytorch-pretrained-BERT/

Best,

Hans