----------------------------------------------------------------------------
# KDD step 2: data preparation
syntax:
*tokenization/normalization (98%)*
*the simplest step, but an important one
*identifying the units in your text
*requires interpreting the punctuation, e.g.:
*- dr. (the period marks an abbreviation)
*- This is a sentence. (the period ends the sentence)
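The period problem above can be sketched in a few lines of Python. This is a minimal illustration, not how Pattern tokenizes: the abbreviation list and the function names `tokenize` and `normalize` are assumptions made up for this example.

```python
import re

# Illustrative abbreviation list (an assumption, far from exhaustive).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof."}

def tokenize(text):
    """Split text into word and punctuation tokens, keeping known abbreviations whole."""
    tokens = []
    for chunk in text.split():
        if chunk.lower() in ABBREVIATIONS:
            # The period belongs to the abbreviation, so keep one token.
            tokens.append(chunk)
        else:
            # Runs of word characters, or single punctuation marks.
            tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens

def normalize(tokens):
    """Lowercase every token so that 'This' and 'this' count as the same unit."""
    return [t.lower() for t in tokens]
```

With these definitions, `tokenize("dr. Smith arrived. This is a sentence.")` gives `['dr.', 'Smith', 'arrived', '.', 'This', 'is', 'a', 'sentence', '.']`: the period after "dr" stays attached, the sentence-final periods become separate tokens.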
lemmatization:
*reduce word forms to their dictionary form (the lemma)
*is/been/was/be
*--> belongs to 'to be'
*+ plurals --> singulars
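A toy lemmatizer makes the two cases above concrete: irregular forms go through a lookup table, plurals through suffix rules. This is only a sketch under stated assumptions: the table, the rules and the function name `lemmatize` are hypothetical, and a real lemmatizer relies on a full dictionary. Plural stripping is only applied when the caller passes the Penn Treebank tag `NNS`, so that words like "this" are left alone.

```python
# Irregular forms of "to be" (illustrative lookup table).
IRREGULAR = {
    "am": "be", "is": "be", "are": "be",
    "was": "be", "were": "be", "been": "be", "be": "be",
}

def lemmatize(word, pos=None):
    """Return a naive lemma for word; pos is an optional Penn Treebank tag."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    if pos == "NNS":  # plural noun: reduce to the singular
        if w.endswith("ies"):
            return w[:-3] + "y"          # stories -> story
        if w.endswith(("ches", "shes", "xes", "sses")):
            return w[:-2]                # boxes -> box
        if w.endswith("s") and not w.endswith("ss"):
            return w[:-1]                # plurals -> plural
    return w
```

For example, `lemmatize("was")` returns `'be'` and `lemmatize("plurals", "NNS")` returns `'plural'`. Irregular plurals such as "mice" are not handled.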
syntactical:
*part-of-speech tagging
*important elements for objective text-mining
*--> nouns
*for subjective text-mining
*--> adjectives
*word sense disambiguation
*bank / bank
*--> river bank / money bank
*semantic role labeling
*
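Two of the steps above can be sketched with plain Python. The first function selects words by part-of-speech tag (nouns for objective content, adjectives for subjective content), assuming the input is already tagged with Penn Treebank tags. The second is a simplified Lesk-style disambiguation of "bank" that picks the sense whose gloss shares the most words with the context. The glosses, the `SENSES` table and all function names are made up for illustration and are not taken from Pattern or WordNet.

```python
def select_by_tag(tagged, prefix):
    """Return the words whose tag starts with prefix ('NN' = nouns, 'JJ' = adjectives)."""
    return [word for word, tag in tagged if tag.startswith(prefix)]

# Hypothetical mini sense inventory with hand-written glosses.
SENSES = {
    "bank": {
        "river bank": "sloping land beside a river or lake water",
        "money bank": "financial institution that accepts deposits and lends money",
    }
}

def disambiguate(word, context):
    """Pick the sense whose gloss overlaps most with the context words.

    On a tie, the sense listed first in SENSES wins.
    """
    context_words = set(context.lower().split())
    senses = SENSES[word]
    return max(senses, key=lambda s: len(context_words & set(senses[s].split())))

# Hand-tagged example sentence (the tags are an assumption, not tagger output).
TAGGED = [("The", "DT"), ("bank", "NN"), ("offers", "VBZ"),
          ("excellent", "JJ"), ("service", "NN")]
```

Here `select_by_tag(TAGGED, "NN")` returns `['bank', 'service']` and `select_by_tag(TAGGED, "JJ")` returns `['excellent']`. `disambiguate("bank", "She deposits money at the bank")` returns `'money bank'`, while `disambiguate("bank", "We walked along the river to the bank")` returns `'river bank'`.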
pragmatics: (?)
*named entity recognition
*co-reference resolution (50%)*
*<-- meaning output
*
*(percentages refer to accuracy)
from : CLiPS Guy de Pauw, Pattern workshop — Cqrrelations, January 2015
----------------------------------------------------------------------------
pattern.en |es|de|fr|it|nl
*- text preparation
*- sentiment analysis tool
*- WordNet interface
*- wordlists interface
pattern.search
*- a pattern matching system similar to regular expressions, that can be used to search a string by syntax (word function) or by semantics (word meaning).
*- e.g.: ('{NP} be * than {NP}')
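To give an idea of what a pattern like '{NP} be * than {NP}' expresses, here is a standard-library sketch that does not use Pattern itself. It assumes tokens that are already annotated by hand as (word, lemma, chunk) triples, encodes each token as one symbol, and runs an ordinary regular expression over the symbols: a noun phrase, a form of "be", anything, the word "than", a noun phrase. The names `encode`, `find_comparisons` and `TOKENS` are hypothetical.

```python
import re

def encode(tokens):
    """Map each (word, lemma, chunk) token to one symbol: B, T, N or x."""
    symbols = []
    for word, lemma, chunk in tokens:
        if lemma == "be":
            symbols.append("B")      # any form of "to be"
        elif word.lower() == "than":
            symbols.append("T")
        elif chunk == "NP":
            symbols.append("N")      # token inside a noun phrase
        else:
            symbols.append("x")      # anything else
    return "".join(symbols)

# NP, "be", any tokens, "than", NP; the two NPs are captured as groups.
COMPARISON = re.compile(r"(N+)B.*?T(N+)")

def find_comparisons(tokens):
    """Return (left NP, right NP) pairs for every comparison found."""
    results = []
    for m in COMPARISON.finditer(encode(tokens)):
        left = " ".join(w for w, _, _ in tokens[m.start(1):m.end(1)])
        right = " ".join(w for w, _, _ in tokens[m.start(2):m.end(2)])
        results.append((left, right))
    return results

# Hand-annotated example (the annotations are assumptions, not parser output).
TOKENS = [
    ("The", "the", "NP"), ("river", "river", "NP"), ("bank", "bank", "NP"),
    ("is", "be", "VP"), ("wider", "wide", "ADJP"), ("than", "than", "PP"),
    ("the", "the", "NP"), ("old", "old", "NP"), ("road", "road", "NP"),
]
```

`find_comparisons(TOKENS)` returns `[('The river bank', 'the old road')]`. Matching on the lemma means "is", "was" and "were" are all caught by the same pattern, which is the advantage over a character-level regular expression. One simplification: two adjacent noun phrases would be merged into a single run of `N` symbols.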