----------------------------------------------------------------------------

# KDD step 2: data preparation

syntax:
* tokenization/normalization (98%)
    * the simplest step, but an important one
    * identifying the units in your text
    * requires reading the punctuation correctly, e.g.:
        * dr.
        * This is a sentence.

lemmatization:
* reduce word forms to their dictionary entry
* is/been/was/be
    * --> all belong to 'to be'
* plurals --> singulars

syntactical:
* part-of-speech tagging
    * important elements for objective text mining
        * --> nouns
    * for subjective text mining
        * --> adjectives
* word sense disambiguation
    * bank / bank
    * --> river bank / money bank
* semantic role labeling

pragmatics: (?)
* named entity recognition
* co-reference resolution (50%)
    * <-- meaning output

(% refers to accuracy)

from: CLiPS, Guy de Pauw, Pattern workshop, Cqrrelations, January 2015

----------------------------------------------------------------------------

pattern.en|es|de|fr|it|nl
* text preparation
* sentiment analysis tool
* WordNet interface
* wordlists interface

pattern.search
* a pattern matching system, similar to regular expressions, that can be used to search a string by syntax (word function) or by semantics (word meaning)
* e.g.: ('{NP} be * than {NP}')
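
A rough sketch of the tokenization and lemmatization steps from the data preparation notes above, using only the Python standard library rather than Pattern itself. The abbreviation list, the lemma table, the list of words exempt from the plural rule, and the function names `tokenize` and `lemmatize` are all hypothetical and far too small for real text. They only illustrate why "dr." and "This is a sentence." need different treatment of the period, and how is/been/was map to 'be'.

```python
import re

# Hypothetical, tiny abbreviation list; a real tokenizer needs a much larger one.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e."}

# Hypothetical lookup table: word form -> dictionary entry (lemma).
LEMMAS = {"is": "be", "been": "be", "was": "be",
          "am": "be", "are": "be", "were": "be"}

# Words ending in "s" that the naive plural rule below must leave alone.
INVARIANT = {"this", "his", "its", "has"}


def tokenize(text):
    """Split text into tokens, keeping known abbreviations in one piece."""
    tokens = []
    for chunk in text.split():
        if chunk.lower() in ABBREVIATIONS:
            # The period belongs to the abbreviation, not to the sentence.
            tokens.append(chunk)
        else:
            # Separate word characters from individual punctuation marks.
            tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens


def lemmatize(token):
    """Lowercase a token and reduce it to a dictionary form where known."""
    word = token.lower()
    if word in LEMMAS:
        return LEMMAS[word]
    if word in INVARIANT:
        return word
    # Naive plural rule: drop a final "s" (but not "ss") from longer words.
    # This over-applies to words such as "bus"; it is only an illustration.
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


if __name__ == "__main__":
    tokens = tokenize("Dr. Smith was here. This is a sentence.")
    print(tokens)
    print([lemmatize(t) for t in tokens])
```

The lookup-table approach is the simplest possible design. Dictionary-based lemmatizers work along the same lines with much larger tables, plus part-of-speech information to separate cases the surface form alone cannot decide.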