----------------------------------------------------------------------------
# KDD step 2: data preparation
syntax:
- tokenization/normalization (98%)*
- the simplest step, but an important one
- identifying the units in your text
- requires interpreting the punctuation, e.g.:
- - dr. (the period marks an abbreviation)
- - This is a sentence. (the period ends the sentence)
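
A minimal sketch of the tokenization step above, using only the Python standard library. The abbreviation list is a hypothetical stand-in; it is there to show why "dr." keeps its period while the period after "sentence" becomes a token of its own.

```python
import re

# Hypothetical abbreviation list; a real tokenizer would use a much larger one.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "e.g.", "i.e."}

def tokenize(text):
    """Split text into word and punctuation tokens, keeping known abbreviations whole."""
    tokens = []
    for chunk in text.split():
        if chunk.lower() in ABBREVIATIONS:
            # The period belongs to the abbreviation, so keep the chunk intact.
            tokens.append(chunk)
            continue
        # Words (allowing internal apostrophes/hyphens) or single punctuation marks.
        tokens.extend(re.findall(r"\w+(?:['-]\w+)*|[^\w\s]", chunk))
    return tokens

print(tokenize("Dr. Smith is here. This is a sentence."))
```

An abbreviation glued to other punctuation, such as "(dr.", is not recognised by this sketch.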
lemmatization:
- reduce word forms to their dictionary form (lemma)
- is/been/was --> be
- + plurals --> singulars
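
A rough sketch of lemmatization as described above: a lookup table for irregular forms plus a crude plural-stripping heuristic. Both the table and the suffix rules are illustrative assumptions; real lemmatizers rely on a full lexicon and on part-of-speech information.

```python
# Hypothetical lookup table for irregular forms; real lemmatizers use a full lexicon.
IRREGULAR = {
    "is": "be", "been": "be", "was": "be", "were": "be", "am": "be", "are": "be",
    "children": "child", "mice": "mouse",
}

def lemmatize(word):
    """Return a guessed dictionary form for a word (noun plurals and forms of 'be' only)."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    # Naive plural rules: "-ies" -> "-y", "-es" after some sibilants, then plain "-s".
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "y"
    if w.endswith(("sses", "xes", "ches", "shes")):
        return w[:-2]
    if w.endswith("s") and not w.endswith(("ss", "us", "is")):
        return w[:-1]
    return w

for form in ["is", "been", "was", "be", "banks", "stories"]:
    print(form, "-->", lemmatize(form))
```

The suffix rules are wrong for many words (verbs ending in "-s", for example), which is why this is only a sketch.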
syntactical:
- part-of-speech tagging
- important elements for objective text-mining
- and for subjective text-mining
- word sense disambiguation
- bank / bank
- --> river bank / money bank
- semantic role labeling
-
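
The bank / bank example above can be sketched with a simplified Lesk-style overlap count: pick the sense whose gloss shares the most words with the context. The sense inventory and stopword list below are made up for illustration; a real system would draw glosses from a resource such as WordNet.

```python
# Hypothetical mini sense inventory; a real system would use WordNet glosses.
SENSES = {
    "bank": {
        "river bank": "sloping land beside a river or lake water shore",
        "money bank": "financial institution that accepts deposits money loan account",
    }
}

# Small illustrative stopword list.
STOPWORDS = {"a", "an", "the", "of", "or", "that", "on", "in", "at", "to", "i", "my", "we"}

def disambiguate(word, context):
    """Pick the sense whose gloss shares the most words with the context."""
    context_words = {w.strip(".,!?").lower() for w in context.split()} - STOPWORDS
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(context_words & set(gloss.split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(disambiguate("bank", "We sat on the bank of the river."))
print(disambiguate("bank", "I opened an account at the bank to deposit money."))
```

On a tie the first listed sense wins, which is an arbitrary choice in this sketch.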
pragmatics: (?)
- named entity recognition
- co-reference resolution (50%)*
*(% refers to accuracy)
from: Guy de Pauw (CLiPS), Pattern workshop, Cqrrelations, January 2015
----------------------------------------------------------------------------
pattern.en |es|de|fr|it|nl
- - text preparation
- - sentiment analysis tool
- - WordNet interface
- - wordlists interface
pattern.search
- - a pattern matching system similar to regular expressions, that can be used to search a string by syntax (word function) or by semantics (word meaning).
- - e.g.: ('{NP} be * than {NP}')
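
To show what a pattern like '{NP} be * than {NP}' does, here is a toy matcher in plain Python. It is not Pattern's implementation and does not use its API: the sentence is hand-chunked into (text, tag, lemma) units, whereas in Pattern the tags and lemmas would come from the parser. All function names are hypothetical.

```python
def match_at(pattern, units, i=0, j=0, groups=()):
    """Try to match pattern[i:] against units[j:]; return captured texts or None."""
    if i == len(pattern):
        return list(groups)
    elem = pattern[i]
    if elem == "*":
        # Wildcard: try skipping 0, 1, 2, ... units before the next element.
        for k in range(j, len(units) + 1):
            result = match_at(pattern, units, i + 1, k, groups)
            if result is not None:
                return result
        return None
    if j == len(units):
        return None
    text, tag, lemma = units[j]
    capture = elem.startswith("{") and elem.endswith("}")
    wanted = elem.strip("{}")
    # An element matches by syntax (the tag) or by dictionary form (the lemma).
    if wanted == tag or wanted == lemma:
        new_groups = groups + (text,) if capture else groups
        return match_at(pattern, units, i + 1, j + 1, new_groups)
    return None

def search(pattern, units):
    """Return the captured {..} groups of the first match, or None if nothing matches."""
    elems = pattern.split()
    for start in range(len(units)):
        result = match_at(elems, units, 0, start)
        if result is not None:
            return result
    return None

# Hand-tagged example sentence: (text, chunk/POS tag, lemma).
sentence = [
    ("Chuck Norris", "NP", "chuck norris"),
    ("is", "VB", "be"),
    ("cooler", "JJ", "cool"),
    ("than", "IN", "than"),
    ("Dolph Lundgren", "NP", "dolph lundgren"),
]
print(search("{NP} be * than {NP}", sentence))
```

Here "be" matches "is" through its lemma, "*" skips "cooler", and the two {NP} groups capture the noun phrases being compared.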