General: http://pad.constantvzw.org/p/certainty
Questions: http://pad.constantvzw.org/p/certainty_questions
modality.py close reading: http://pad.constantvzw.org/p/certainty_modality.py_close_reading
Modality paper notes: http://pad.constantvzw.org/public_pad/certainty_notes_Modality-and-Negation

*BIOSCOPE CORPUS*
The BioScope corpus is used to test (or train?) the modality.py script, in the context of CoNLL-2010 shared task 1.

links
* Description
CoNLL-2010 shared task 1 description: http://rgai.inf.u-szeged.hu/conll2010st/tasks.html#task1
official BioScope corpus page: http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-S11-S9
BioScope annotation guidelines: http://rgai.inf.u-szeged.hu/project/nlp/bioscope/Annotation%20guidelines2.1.pdf
paper about the annotation of the BioScope corpus: http://www.clips.ua.ac.be/NeSpNLP2010/nespnlp2010-proceedings.pdf#page=40
Made in 2008
"This article reports on a corpus annotation project that has produced a freely available resource for research on handling negation and uncertainty in biomedical texts (we call this corpus the BioScope corpus)... The dataset contains annotations at the token level for negative and speculative keywords and at the sentence level for their linguistic scope."

"The annotation process was carried out by two independent linguist annotators and a chief linguist – also responsible for setting up the annotation guidelines – who resolved cases where the annotators disagreed. The resulting corpus consists of more than 20.000 sentences that were considered for annotation and over 10% of them actually contain one (or more) linguistic annotation suggesting negation or uncertainty. ... The corpus consists of texts taken from 4 different sources and 3 different types in order to ensure that it captures the heterogeneity of language use in the biomedical domain. We decided to add clinical free-texts (radiology reports), biological full papers and biological paper abstracts (texts from Genia)."

"Apart from the intended goal of serving as a common resource for the training, testing and comparing of biomedical Natural Language Processing systems, the corpus is also a good resource for the linguistic analysis of scientific and clinical texts."
see also paper as pdf: http://rgai.inf.u-szeged.hu/project/nlp/bioscope/bioscope_cameraready.pdf

* Downloads
-> convert xml to csv: http://askubuntu.com/questions/174143/convert-xml-to-csv-shell-command-line

annotated BioScope corpus - abstracts only: http://rgai.inf.u-szeged.hu/project/nlp/bioscope/abstracts_pmid.xml (annotated for negation and speculation cues at the token level, and for the linguistic scope of these cues at the sentence level)
annotated BioScope corpus - full articles: http://rgai.inf.u-szeged.hu/project/nlp/bioscope/full_papers.xml (annotated for negation and speculation cues at the token level, and for the linguistic scope of these cues at the sentence level)
a sample of the BioScope dataset that is used for task 1 of the CoNLL-2010 shared task: http://rgai.inf.u-szeged.hu/conll2010st/trial_Task1.zip (each sentence is annotated as certain or uncertain; specific cues are marked only in the uncertain sentences) --> is this a test or a training set? can be both :-)
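
-> a possible Python alternative to the shell conversion linked above. This is only a sketch under assumptions that should be checked against the downloaded files: that the corpus XML wraps each sentence in a <sentence id="..."> element, marks scopes with <xcope id="..."> and cues with <cue type="..." ref="...">, and that the CoNLL sample uses a certainty attribute on <sentence> plus <ccue> elements. The two sample sentences are made up for illustration and are not taken from the corpus; the function names are our own.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sentences in the ASSUMED BioScope layout (not real corpus data).
BIOSCOPE_SAMPLE = """<Annotation>
  <sentence id="S1.1">These results <xcope id="X1.1.1"><cue type="speculation" ref="X1.1.1">suggest</cue> that the protein binds DNA</xcope>.</sentence>
  <sentence id="S1.2">The protein was purified.</sentence>
</Annotation>"""

# Hypothetical sentences in the ASSUMED CoNLL-2010 task 1 layout.
CONLL_SAMPLE = """<Annotation>
  <sentence id="S1.1" certainty="uncertain">These results <ccue>suggest</ccue> that the protein binds DNA.</sentence>
  <sentence id="S1.2" certainty="certain">The protein was purified.</sentence>
</Annotation>"""


def bioscope_rows(xml_text):
    """Yield one row per cue: sentence id, cue type, cue text, scope text."""
    root = ET.fromstring(xml_text)
    for sentence in root.iter("sentence"):
        # Map each scope id to its text so a cue can be joined to its scope.
        scopes = {x.get("id"): "".join(x.itertext()) for x in sentence.iter("xcope")}
        for cue in sentence.iter("cue"):
            yield [sentence.get("id"), cue.get("type"), cue.text,
                   scopes.get(cue.get("ref"), "")]


def conll_rows(xml_text):
    """Yield one row per sentence: id, certainty label, cues, plain text."""
    root = ET.fromstring(xml_text)
    for sentence in root.iter("sentence"):
        cues = [c.text for c in sentence.iter("ccue")]
        yield [sentence.get("id"), sentence.get("certainty"),
               "|".join(cues), "".join(sentence.itertext())]


def to_csv(header, rows):
    """Serialize a header and rows to a CSV string."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    print(to_csv(["sentence_id", "cue_type", "cue", "scope"],
                 bioscope_rows(BIOSCOPE_SAMPLE)))
    print(to_csv(["sentence_id", "certainty", "cues", "text"],
                 conll_rows(CONLL_SAMPLE)))
```

Note that in the first table, sentences without any cue produce no row, while the second table keeps every sentence with its certain/uncertain label. To run this on the real files, ET.parse(path).getroot() could replace ET.fromstring, after checking the element names.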

example of the annotated corpus (from the 'abstracts only' version):

example of the CoNLL sample dataset: