# KDD step 2: data preparation
tokenization/normalization (98%)*
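the simplest step, and an important one
identifying the units (words, sentences) in your text
deciding how to read punctuation, e.g. whether a period marks an abbreviation or the end of a sentence: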
- dr.
- This is a sentence.
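
The punctuation problem above can be sketched in a few lines of Python. This is a minimal illustration, not the tokenizer used in the workshop: the abbreviation list and the function name `tokenize` are assumptions made for this example, and a real tokenizer needs a much larger abbreviation list and more rules.

```python
import re

# Hypothetical, deliberately tiny abbreviation list (assumption for this sketch).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof."}

def tokenize(text):
    """Split text into sentences, each represented as a list of tokens."""
    sentences, current = [], []
    for raw in text.split():
        # A known abbreviation keeps its period and never ends a sentence.
        if raw.lower() in ABBREVIATIONS:
            current.append(raw)
            continue
        # Separate trailing punctuation from the word itself.
        word, punct = re.match(r"^(.*?)([.!?,;:]*)$", raw).groups()
        if word:
            current.append(word)
        current.extend(punct)
        # Sentence-final punctuation closes the current sentence.
        if punct and punct[-1] in ".!?":
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences

for sentence in tokenize("Dr. Smith is here. This is a sentence."):
    print(sentence)
```

The period in "Dr." stays attached to the token, while the period after "sentence" becomes its own token and closes the sentence.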
reduce word forms to their dictionary form (lemma)
+ plurals --> singulars
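
As a rough sketch of the plural to singular step, the rules below strip common English plural endings. The rule list, the irregular-form table and the name `naive_singularize` are assumptions for illustration only. The approach is knowingly incomplete (it would turn "bus" into "bu", for instance), which is why real lemmatizers rely on larger rule sets and lexicons.

```python
import re

# Hypothetical table of irregular plurals (far from complete).
IRREGULAR = {"children": "child", "mice": "mouse", "men": "man",
             "women": "woman", "feet": "foot"}

# Ordered suffix rules: the first rule that matches is applied.
RULES = [
    (r"ies$", "y"),               # cities  -> city
    (r"(ch|sh|ss|x)es$", r"\1"),  # boxes   -> box, classes -> class
    (r"s$", ""),                  # banks   -> bank
]

def naive_singularize(word):
    """Return a guessed singular form of an English noun."""
    lower = word.lower()
    if lower in IRREGULAR:
        return IRREGULAR[lower]
    if lower.endswith("ss"):      # "glass" is already singular
        return lower
    for pattern, replacement in RULES:
        if re.search(pattern, lower):
            return re.sub(pattern, replacement, lower)
    return lower

for noun in ["banks", "cities", "boxes", "children", "glass"]:
    print(noun, "-->", naive_singularize(noun))
```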
part-of-speech tagging
important elements for objective text mining
and for subjective text mining
word sense disambiguation
bank / bank
--> river bank / money bank
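
One classic way to choose between the two senses is to count the words shared between the sentence and a short description of each sense (a simplified Lesk-style overlap). The sketch below assumes a toy sense inventory: the glosses, the stopword list and the function name `disambiguate` are invented for this example and do not come from WordNet or from Pattern.

```python
# Hypothetical sense inventory: each sense has a bag of descriptive words.
SENSES = {
    "bank": {
        "river bank": "sloping land beside river lake water",
        "money bank": "financial institution accepts deposits money loans account",
    }
}

# Small stopword list (assumption) so function words do not count as evidence.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "at", "on", "he", "she"}

def disambiguate(word, sentence):
    """Pick the sense whose gloss shares the most words with the sentence."""
    context = {w.strip(".,!?").lower() for w in sentence.split()}
    context -= STOPWORDS | {word}
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(context & set(gloss.split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(disambiguate("bank", "He sat on the bank of the river and watched the water."))
print(disambiguate("bank", "She opened an account at the bank to deposit money."))
```

The first sentence shares "river" and "water" with the first gloss, the second shares "account" and "money" with the second gloss, so each gets a different sense.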
semantic role labeling
pragmatics: (?)
named entity recognition
co-reference resolution (50%)*
*(% refers to accuracy)
from: Guy de Pauw (CLiPS), Pattern workshop at Cqrrelations, January 2015
pattern.en | es | de | fr | it | nl
- text preparation
- sentiment analysis tool
- WordNet interface
- wordlists interface
- a pattern matching system similar to regular expressions, that can be used to search a string by syntax (word function) or by semantics (word meaning).
- e.g.: '{NP} be * than {NP}'
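
To give a feel for how a pattern such as '{NP} be * than {NP}' could be matched, here is a toy matcher in plain Python. It is not Pattern's implementation and does not use its API. It assumes the sentence has already been split into units annotated with a lemma and a chunk tag, and it assumes these semantics: an uppercase element matches a chunk tag, a lowercase element matches a lemma, `*` matches exactly one unit, and curly braces mark a captured group. The function name `match` and the sample annotations are hypothetical.

```python
def match(pattern, units):
    """Return the captured texts of the first match, or None.

    units is a list of (text, lemma, chunk_tag) tuples.
    """
    elements = pattern.split()
    n = len(elements)
    for start in range(len(units) - n + 1):
        groups = []
        for element, (text, lemma, chunk) in zip(elements, units[start:start + n]):
            capture = element.startswith("{") and element.endswith("}")
            constraint = element[1:-1] if capture else element
            if constraint == "*":
                ok = True                   # wildcard: any single unit
            elif constraint.isupper():
                ok = chunk == constraint    # search by syntax (chunk tag)
            else:
                ok = lemma == constraint    # search by lemma
            if not ok:
                break
            if capture:
                groups.append(text)
        else:
            return groups                   # every element matched
    return None

# Hand-annotated example sentence (assumption: tags supplied by a parser).
units = [
    ("The river bank", "the river bank", "NP"),
    ("is", "be", "VP"),
    ("wider", "wide", "ADJP"),
    ("than", "than", "PP"),
    ("the road", "the road", "NP"),
]
print(match("{NP} be * than {NP}", units))
```

Because the pattern refers to the lemma "be" rather than the surface form, the same pattern would also match "was" or "are" once those are lemmatized.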