certainty_modality.py_bioscope

General: http://pad.constantvzw.org/p/certainty
Questions: http://pad.constantvzw.org/p/certainty_questions
modality.py close reading: http://pad.constantvzw.org/p/certainty_modality.py_close_reading
Modality paper notes: http://pad.constantvzw.org/public_pad/certainty_notes_Modality-and-Negation

*BIOSCOPE CORPUS*
The BioScope corpus is used to test the modality.py script, in the context of the CoNLL-2010 shared task 1.

links
* Description
CoNLL-2010 shared task 1 description: http://rgai.inf.u-szeged.hu/conll2010st/tasks.html#task1
BioScope corpus article (BMC Bioinformatics): http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-S11-S9
BioScope annotation guidelines: http://rgai.inf.u-szeged.hu/project/nlp/bioscope/Annotation%20guidelines2.1.pdf
paper about the annotation of the BioScope corpus: http://www.clips.ua.ac.be/NeSpNLP2010/nespnlp2010-proceedings.pdf#page=40
Published in 2008
"This article reports on a corpus annotation project that has produced a freely available resource for research on handling negation and uncertainty in biomedical texts (we call this corpus the BioScope corpus)... The dataset contains annotations at the token level for negative and speculative keywords and at the sentence level for their linguistic scope."

"The annotation process was carried out by two independent linguist annotators and a chief linguist – also responsible for setting up the annotation guidelines – who resolved cases where the annotators disagreed. The resulting corpus consists of more than 20.000 sentences that were considered for annotation and over 10% of them actually contain one (or more) linguistic annotation suggesting negation or uncertainty. ... The corpus consists of texts taken from 4 different sources and 3 different types in order to ensure that it captures the heterogeneity of language use in the biomedical domain. We decided to add clinical free-texts (radiology reports), biological full papers and biological paper abstracts (texts from Genia)."

"Apart from the intended goal of serving as a common resource for the training, testing and comparing of biomedical Natural Language Processing systems, the corpus is also a good resource for the linguistic analysis of scientific and clinical texts."
see also paper as pdf: http://rgai.inf.u-szeged.hu/project/nlp/bioscope/bioscope_cameraready.pdf

* Downloads
-> convert xml to csv: http://askubuntu.com/questions/174143/convert-xml-to-csv-shell-command-line

annotated BioScope corpus - abstracts only: http://rgai.inf.u-szeged.hu/project/nlp/bioscope/abstracts_pmid.xml (annotated for negation and speculation cues at token level, and for the linguistic scope of these cues at sentence level)
annotated BioScope corpus - full articles: http://rgai.inf.u-szeged.hu/project/nlp/bioscope/full_papers.xml (annotated for negation and speculation cues at token level, and for the linguistic scope of these cues at sentence level)
a sample of the BioScope dataset that is used for task 1 of the CoNLL-2010 competition: http://rgai.inf.u-szeged.hu/conll2010st/trial_Task1.zip (each sentence is annotated as certain or uncertain; specific cues are marked only in the uncertain sentences) --> is this a test or training set? can be both :-)

example of the annotated corpus (from the 'abstracts only' version):

<sentence id="S1.6">When U937 cells were infected with HIV-1, <xcope id="X1.6.3"><cue type="negation" ref="X1.6.3">no</cue> induction of NF-KB factor was detected</xcope>, whereas high level of progeny virions was produced, <xcope id="X1.6.2"><cue type="speculation" ref="X1.6.2">suggesting</cue> that this factor was <xcope id="X1.6.1"><cue type="negation" ref="X1.6.1">not</cue> required for viral replication</xcope></xcope>.</sentence>
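
A minimal sketch of how the cue/xcope markup in the example above could be read with Python's standard library. It relies on what the example shows: each cue carries a ref attribute that matches the id of the xcope holding its scope. The function name cue_scopes and the SAMPLE string are illustrative only; they are not part of modality.py or of the corpus tools.

```python
import xml.etree.ElementTree as ET

# The example sentence from the 'abstracts only' file, closed so it parses on its own.
SAMPLE = (
    '<sentence id="S1.6">When U937 cells were infected with HIV-1, '
    '<xcope id="X1.6.3"><cue type="negation" ref="X1.6.3">no</cue> '
    'induction of NF-KB factor was detected</xcope>, whereas high level of '
    'progeny virions was produced, <xcope id="X1.6.2">'
    '<cue type="speculation" ref="X1.6.2">suggesting</cue> that this factor '
    'was <xcope id="X1.6.1"><cue type="negation" ref="X1.6.1">not</cue> '
    'required for viral replication</xcope></xcope>.</sentence>'
)


def cue_scopes(sentence_xml):
    """Return (cue type, cue word, scope text) for every cue in one sentence."""
    sentence = ET.fromstring(sentence_xml)
    # Map each xcope id to the plain text it spans (nested markup flattened).
    scopes = {x.get("id"): "".join(x.itertext()) for x in sentence.iter("xcope")}
    # Cues come out in document order; ref links a cue to its scope.
    return [
        (cue.get("type"), cue.text, scopes.get(cue.get("ref")))
        for cue in sentence.iter("cue")
    ]


for cue_type, word, scope in cue_scopes(SAMPLE):
    print(cue_type, "|", word, "|", scope)
```

Note that scopes can nest: the scope of "suggesting" contains the whole scope of "not".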

example of the CoNLL sample dataset:

<sentence id="S7.105" certainty="uncertain">To distinguish which tissues require ADGF-A expression for proper development, we <ccue>tested</ccue> for rescue of adgf-a lethality by expressing ADGF-A in specific subsets of larval tissues.</sentence>
<sentence id="S7.205" certainty="certain">We produced a loss-of-function mutation in the ADGF-A gene, which produces a product (ADGF-A) with ADA activity.</sentence>
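
A hedged sketch of the xml-to-csv step for the CoNLL sample format, using only Python's standard library instead of a shell command. It reads the certainty attribute and any ccue words from each sentence and writes one CSV row per sentence. The wrapping <sentences> element, the column names and the function names are assumptions made for this illustration; the layout of the real trial file may differ.

```python
import csv
import io
import xml.etree.ElementTree as ET

# The two example sentences, wrapped in a hypothetical root element so they parse.
SAMPLE = """<sentences>
<sentence id="S7.105" certainty="uncertain">To distinguish which tissues require ADGF-A expression for proper development, we <ccue>tested</ccue> for rescue of adgf-a lethality by expressing ADGF-A in specific subsets of larval tissues.</sentence>
<sentence id="S7.205" certainty="certain">We produced a loss-of-function mutation in the ADGF-A gene, which produces a product (ADGF-A) with ADA activity.</sentence>
</sentences>"""

FIELDS = ["id", "certainty", "cues", "text"]  # assumed column layout


def sentence_rows(xml_text):
    """Return one dict per sentence: id, certainty label, cue words, plain text."""
    root = ET.fromstring(xml_text)
    rows = []
    for sentence in root.iter("sentence"):
        cues = [cue.text for cue in sentence.iter("ccue")]
        rows.append({
            "id": sentence.get("id"),
            "certainty": sentence.get("certainty"),
            "cues": "|".join(cues),               # empty for certain sentences
            "text": "".join(sentence.itertext()),  # sentence without markup
        })
    return rows


def to_csv(rows):
    """Serialise the rows to a CSV string with a header line."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


print(to_csv(sentence_rows(SAMPLE)))
```

The certainty column gives the per-sentence label that a script such as modality.py could be compared against; the cues column is kept so the marked words stay available.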