Previous meeting: http://pad.constantvzw.org/p/algolit_reboot

Thu 3rd November
Algorithmic Models for text analysis & connecting them to the context where they are used & the metaphors around them & visualisation

Presentation: Uncertainty Detected
the creation of a model to predict uncertain sentences in scientific articles

*general introduction
*Research field: World Well Being Project (commercial & very controversial applications, a.o. gender and personality wordclouds/Wordle)
*Released the differential analysis toolkit http://dlatk.wwbp.org/ with built-in wordcloud visualization
*the wordcloud is a beautiful algorithm: it starts from the center and works outwards in a spiral to position the words
*problem: long words in large letters seem more important, depending on the surface they occupy
*
*In a previous collective study moment, An and Manetta focused on the recognition of "certainty" in texts. The starting point was the script "modality.py" in Pattern, a rule-based script tested on the BioScope corpus.
*
*Corpus presentation
*http://pad.constantvzw.org/p/certainty_modality.py_bioscope_corpus
*The BioScope corpus, developed at a university in Eastern Europe in 2008, was built by 2 linguist annotators and a chief linguist. The corpus contains 20,000 sentences, 10% of which are annotated as uncertain.
*"the corpus is also a good resource for the linguistic analysis of scientific and clinical texts"
*4 different sources, 3 different types
*Concrete question for which the corpus was made: because of the very high number of biomedical papers, researchers requested a tool to scan all the available papers
*For the uncertain class, An used the cue types "uncertain" and "speculation"; for the certain class she used all the elements without "negation". She equalized the training corpus by tweaking the balance between certain/uncertain sentences to 50/50 (see the sketch below).
*The corpus is also annotated with scopes of uncertainty/negation. An ignored this. She decided to work with straightforward tools in order to pay most attention to the process, understand it, and work towards an end. She chose to work on sentence level, which seemed easier than chunks. There are some 1000 papers written on how to analyse scopes of uncertainty in sentences with machine learning, but that looks more complex.
*This corpus comes with inter-annotator agreement, an official validation.
*An chose to work with 2 classes (certain/uncertain); multiclass would have been an option (uncertain/negation/certain) but again one level more complex. Could be something to do in the future.
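A minimal sketch of the 50/50 balancing step mentioned above, assuming the sentences already carry a "certain"/"uncertain" label. The helper function and variable names are hypothetical, not taken from An's script:

    # balance the training corpus 50/50 by undersampling the majority class
    # (hypothetical helper, not part of uncertainty.py)
    import random

    def balance_corpus(sentences, labels, seed=0):
        random.seed(seed)
        uncertain = [s for s, l in zip(sentences, labels) if l == "uncertain"]
        certain = [s for s, l in zip(sentences, labels) if l == "certain"]
        n = min(len(uncertain), len(certain))          # size of the smaller class
        balanced = [(s, "uncertain") for s in random.sample(uncertain, n)] + \
                   [(s, "certain") for s in random.sample(certain, n)]
        random.shuffle(balanced)                        # mix the two classes again
        return balanced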
*Have a look at, for example: http://www.lrec-conf.org/proceedings/lrec2016/pdf/469_Paper.pdf
*
*creation of a lexicon on the basis of the cues of the corpus, lemmatized, from the full papers and abstracts, only from the "uncertain" + "speculation" tags
*perhaps it's interesting to visualise how much of the total data the lexicon is currently using; for researchers this information characterises the eventual model
*the same reduction is an interesting way to look at 'writing' and language, and how the 'essence' of a text can be shown depending on what you're looking for
*http://algolit.net/scripts/uncertainty_detected%3f/lemma_lexicon_clean.txt --> lexicon containing 96 words
*read through the script: please do not use/distribute as such, it still has small bugs in it: http://algolit.net/scripts/uncertainty_detected%3f/
*uncertainty.py
*300 sentences + vectors are created
*the sentences are divided into 80% / 20% train and test data
*numerization of the vectors
*retro-move: reconstruct the list of feature names from the vectors --> scikit-learn only saves weights and scores, no feature names
*numerization of the class labels: uncertain & certain --> only 2 labels
*normalization of the features: scaling the scores of the features between 0 and 1
*check whether absolute counting has been removed from the vector; only relative counting, no tf-idf
*for every sentence, the following features in the vector are plotted:
*token
*pos-tag
*chunk tag
*position of token
*lemma
*appearance in lexicon yes/no
*word frequency
*bigram freq
*trigram freq
*char bigrams
*char trigrams
*chunk --> light grammatical analysis to cut up a sentence
*note on deciding on a set of features:
*An created a spreadsheet, annotating various papers on the decisions that are made to use a specific set of features. An's intuition was: the more features the better. But this is not certain, it is the big data mentality; the choice of features is also connected to CPU, time, etc.
*Each feature is a dimension; a selection of 300 sentences in combination with this set of features gives 50,000 features for each vector. Each sentence is transformed into a vector of features, which gives a lot of zero values (also a choice of the author: a zero appears when a word does not appear another time in the text).
*
*This process can also be executed in sklearn, where there is a function to vectorize a set of sentences and apply a TF-IDF counter (see the sketch below). But knowing how the sklearn function differs from this manual vectorization process is difficult; reading the source code is required here.
*
*1.0 is the maximum value in these plots; what this value exactly means is unclear.
*
*This plot counts on sentence level (in contrast to the TF-IDF counting method). The features represent the sentence, and are the result of various counting systems relative to the length of the sentence.
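A small, hedged sketch of that sklearn route mentioned above: CountVectorizer and TfidfVectorizer are sklearn's own vectorizers, the example sentences are made up, and this is not the manual pipeline of uncertainty.py:

    # sklearn's built-in way to turn sentences into (sparse) count or tf-idf vectors
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    sentences = [
        "These results suggest that the protein might be involved.",
        "The protein is involved in cell division.",
    ]

    count_vec = CountVectorizer(ngram_range=(1, 2))   # word unigrams + bigrams
    tfidf_vec = TfidfVectorizer(ngram_range=(1, 2))   # same counts, weighted by tf-idf

    X_counts = count_vec.fit_transform(sentences)     # sparse matrix, mostly zeros
    X_tfidf = tfidf_vec.fit_transform(sentences)

    # the vocabulary learned from the sentences (in older sklearn versions: get_feature_names())
    print(count_vec.get_feature_names_out()[:10])
    print(X_counts.toarray())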
*It would be good to create a plot in which you can hover over a dot and read the feature name & value
*
*feature reduction
*selecting the best 100 features using chi2 (chi2 is a statistical test; its distribution is a relative of the normal distribution)
*which features decide the most, i.e. which features have the highest influence, aka the places where the correlation with the class is strongest
*+ output plots where the selected features are coloured in red
** output plots with the 100 best features and their labeling: blue dots are uncertain, red are certain sentences
*
*"the results seem to show that a rule-based system for uncertainty works", as the results of these feature plots start to overlap with rule-based systems (looking at the words used in modality.py in Pattern)
*
*Rule-based systems are much more intense in terms of human work and time, compared to the time a programmer spends writing a supervised machine learning system. Intuitively a rule-based system doesn't feel so 'heavy', as the code is limited and readable, in contrast to the complex and large amount of lines of code that a supervised ML system needs.
*
*KbestFit is a filter that saves the 'mode' of filtering as a sort of recipe (see the sketch below)
*so you can apply it to the test data as well
*result: the vector is still the same, but shortened to 100 features. Every vector (sentence) from now on has the same set of 100 selected features.
*saving the feature scores, as not all the features have the same 'weight'
*plotting the graphs with matplotlib + numpy
*to plot using lists and list-transformations
*baselines with Gridsearch
*vectors with the best features = training data
*the best training data to train the classifier
*
*weighted random baseline
*a statistical baseline
*
*majority baseline
*a statistical baseline
*the ratio between certain and uncertain sentences, in this case 50/50; there is a 50% chance that the result is "uncertain", so the baseline is 50%
*your own correct results will have to be higher than this to show that your model works
*human informed baseline
*modality.py
STILL TO DO: write functions and save them in separate scripts, f.ex. vectorize, plotting
*Support Vector Machine (SVM)
*'only' here do we start to train our classifier. As the vectors contained many '0's and many, many dimensions, all the steps of vector creation and filtering are needed to make a model that is digestible for the computer and informative at the same time. The vectors are reduced to a "learnspace".
*
*The test data that we will use to make predictions on needs to be processed in the exact same way as the training data. This means that for the test data we need to:
*create the set of features to make one vector per sentence
*keep the 100 most effective features that were detected in the training data
*(more?)
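A minimal sketch of the chi2 selection recipe mentioned above, using sklearn's SelectKBest. The variable names X_train, y_train, X_test are assumptions, standing in for the vectors and labels built earlier:

    # fit the chi2 filter on the training data only, then apply the same recipe to the test data
    from sklearn.feature_selection import SelectKBest, chi2

    # chi2 expects non-negative feature values (the features were scaled between 0 and 1 above)
    selector = SelectKBest(chi2, k=100)                      # keep the 100 best features
    X_train_100 = selector.fit_transform(X_train, y_train)   # learn the recipe on the training data
    X_test_100 = selector.transform(X_test)                  # shorten the test vectors with the same recipe

    print(selector.scores_[:10])                  # chi2 score per feature: not every feature weighs the same
    print(selector.get_support(indices=True))     # indices of the 100 selected features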
From here on, the algorithm (in this case SVM) will search for patterns in the model (the model = all the vectors together)
*testing during the training (using the training data)
*10-fold cross validation
*used here to improve the quality of your model, before continuing to the test phase
*evaluation of the training of a classifier using 10-fold cross validation (see the sketch at the end of these notes)
*accuracy report
*classification report
*confusion matrix
*counting true positives + false positives + true negatives + false negatives
*testing (on the test data)
*is the final test of your model
*
*the classifier needs an input that is in the exact same vector format as the training data, otherwise you cannot compare the two
*
*the test data is already processed into features and saved into a vector before the corpus is split into 80% training and 20% test data
*the feature reduction is applied after the division, i.e. the selected feature set is decided by looking only at the training data
*
*question: if we were to use the tf-idf counting system, when would the vectors then be made? Before the splitting of training and test data?
*
*in the test results: it would be interesting to connect the classifier's mistakes back to the detected features
*if we were to do tf-idf we would have to split the corpus into train/test before the vectorization? we need to check this!
*there are different ways of counting frequency! cf. Wikipedia - this reflects the 'interpretation' of how you transform language into a workable set of numbers
---
Proposal: spend another meeting on only the correlating action in training a model. We could prepare by watching a few videos and further discuss them in another Algolit session.
Side proposal: use a set of 5 software packages to compare the results, using the same algorithm and (if possible) the same reduced set of vectors
plot the plotters
using the plotter as a way to see how the machine thinks
try a 3D representation

INTERMEZZO after lunch: Gijs shows his latest version of 'Your weekly address'. A Markov chain generated text is read through video fragments of Obama's YouTube speeches. The machine reads fast. Gijs discovered that bigrams & trigrams work well to generate the video, unigrams make us dizzy :-) = a new Big Data grammar for the legibility of machines for humans
Proposal to create a list of these works on the algolit wiki.

side note:
*unsupervised learning
*- bootstrapping: training a model on random selections of the data
*
*how text is transformed into numbers using text analysis tools
*visualisation of the vector of a sentence & feature selection
*http://pyalgoviz.appspot.com/
*http://setosa.io/ev/principal-component-analysis/
*statistical applications & interpretations
*methodology of scoring (F-score, true/false positives/negatives, confusion matrix)
*references to papers
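The sketch referenced above: a minimal sklearn version of the training and evaluation steps (SVM, 10-fold cross validation, classification report, confusion matrix). The variables X_train_100, y_train, X_test_100, y_test are assumptions, standing in for the reduced vectors and labels from the previous steps; this is not a copy of uncertainty.py:

    # train an SVM on the reduced training vectors and evaluate it
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score
    from sklearn.metrics import classification_report, confusion_matrix

    clf = SVC(kernel="linear")

    # testing during the training: 10-fold cross validation on the training data only
    scores = cross_val_score(clf, X_train_100, y_train, cv=10)
    print("10-fold accuracy:", scores.mean())

    # the final test, on the held-out 20% test data
    clf.fit(X_train_100, y_train)
    predictions = clf.predict(X_test_100)
    print(classification_report(y_test, predictions))  # precision, recall, F-score per class
    print(confusion_matrix(y_test, predictions))        # true/false positives and negatives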