General: http://pad.constantvzw.org/p/certainty
Questions: http://pad.constantvzw.org/p/certainty_questions
modality.py close reading: http://pad.constantvzw.org/p/certainty_modality.py_close_reading
Modality paper notes: http://pad.constantvzw.org/public_pad/certainty_notes_Modality-and-Negation

*BIOSCOPE CORPUS*

The BioScope corpus is used to test/train(?) the modality.py script, in the context of the CoNLL-2010 shared task 1.

links

* Description

CoNLL-2010 shared task 1 description: http://rgai.inf.u-szeged.hu/conll2010st/tasks.html#task1
official BioScope corpus page: http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-S11-S9
BioScope annotation guidelines: http://rgai.inf.u-szeged.hu/project/nlp/bioscope/Annotation%20guidelines2.1.pdf
paper about the annotation of the BioScope corpus: http://www.clips.ua.ac.be/NeSpNLP2010/nespnlp2010-proceedings.pdf#page=40

Made in 2008

"This article reports on a corpus annotation project that has produced a freely available resource for research on handling negation and uncertainty in biomedical texts (we call this corpus the BioScope corpus)... The dataset contains annotations at the token level for negative and speculative keywords and at the sentence level for their linguistic scope."

"The annotation process was carried out by two independent linguist annotators and a chief linguist – also responsible for setting up the annotation guidelines – who resolved cases where the annotators disagreed. The resulting corpus consists of more than 20.000 sentences that were considered for annotation and over 10% of them actually contain one (or more) linguistic annotation suggesting negation or uncertainty. ... The corpus consists of texts taken from 4 different sources and 3 different types in order to ensure that it captures the heterogeneity of language use in the biomedical domain. We decided to add clinical free-texts (radiology reports), biological full papers and biological paper abstracts (texts from Genia)."

"Apart from the intended goal of serving as a common resource for the training, testing and comparing of biomedical Natural Language Processing systems, the corpus is also a good resource for the linguistic analysis of scientific and clinical texts."

see also paper as pdf: http://rgai.inf.u-szeged.hu/project/nlp/bioscope/bioscope_cameraready.pdf

* Downloads

-> convert xml to csv: http://askubuntu.com/questions/174143/convert-xml-to-csv-shell-command-line

annotated BioScope corpus - abstracts only: http://rgai.inf.u-szeged.hu/project/nlp/bioscope/abstracts_pmid.xml
(annotated for negation and speculation (at token level) and for the linguistic scope of these words (at sentence level))

annotated BioScope corpus - full articles: http://rgai.inf.u-szeged.hu/project/nlp/bioscope/full_papers.xml
(annotated for negation and speculation (at token level) and for the linguistic scope of these words (at sentence level))

a sample of the BioScope dataset that is used for task 1 of the CoNLL competition: http://rgai.inf.u-szeged.hu/conll2010st/trial_Task1.zip
(annotated as certain or uncertain per sentence; specific cues are marked only in the uncertain sentences)
--> is this a test or a training set? can be both :-)

example of the annotated corpus (from the 'abstracts only' version):
*When U937 cells were infected with HIV-1, no induction of NF-KB factor was detected, whereas high level of progeny virions was produced, suggesting that this factor was not required for viral replication.

example of the CoNLL sample dataset:
*To distinguish which tissues require ADGF-A expression for proper development, we tested for rescue of adgf-a lethality by expressing ADGF-A in specific subsets of larval tissues.
*We produced a loss-of-function mutation in the ADGF-A gene, which produces a product (ADGF-A) with ADA activity.