# KDD step 2: data preparation
syntax:
- tokenization/normalization (98%)*
  - the simplest thing, but an important one
  - identifying the units in your text
  - reading the punctuation correctly, e.g. the period in an abbreviation vs. the period that ends a sentence:
    - dr.
    - This is a sentence.
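The punctuation problem above can be sketched in a few lines of Python. This is only a minimal illustration and not how Pattern tokenizes: the `ABBREVIATIONS` set and the `tokenize` function are hypothetical names made up for this note, and a real tokenizer would need a far larger abbreviation list.

```python
import re

# Hypothetical, deliberately tiny abbreviation list (an assumption of this sketch).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "e.g.", "i.e."}

def tokenize(text):
    """Split text into word and punctuation tokens, keeping known abbreviations whole."""
    tokens = []
    for chunk in text.split():
        if chunk.lower() in ABBREVIATIONS:
            # The period belongs to the abbreviation, so keep the chunk as one unit.
            tokens.append(chunk)
        else:
            # Otherwise separate runs of word characters from single punctuation marks.
            tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens

print(tokenize("Dr. Smith is here. This is a sentence."))
```

Here "Dr." stays one token, while the period after "sentence" becomes a token of its own.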
lemmatization:
- reduce word forms to their dictionary entry
  - is/been/was --> be
  - plurals --> singulars
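A rough sketch of the two lemmatization cases above (irregular verb forms and plurals), assuming a small lookup table plus naive suffix rules. The `IRREGULAR` table and the `lemmatize` function are hypothetical; real lemmatizers rely on full dictionaries and part-of-speech information.

```python
# Hypothetical lookup table for irregular forms (an assumption of this sketch).
IRREGULAR = {"am": "be", "is": "be", "are": "be", "was": "be", "were": "be", "been": "be"}

def lemmatize(word):
    """Return a naive dictionary form for a single English word."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    # Naive plural rules; they will be wrong for many real words.
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "y"            # cities -> city
    if w.endswith(("ches", "shes", "sses", "xes")):
        return w[:-2]                  # boxes -> box
    if w.endswith("s") and not w.endswith(("ss", "us", "is")):
        return w[:-1]                  # banks -> bank
    return w

for form in ["is", "been", "was", "be", "banks", "cities"]:
    print(form, "-->", lemmatize(form))
```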
syntactical:
- part-of-speech tagging
  - important elements for object text-mining
  - for subjective text-mining
- word sense disambiguation
  - bank / bank --> river bank / money bank
- semantic role labeling
pragmatics: (?)
- named entity recognition
- co-reference resolution (50%)*
*(% refers to accuracy)
from : CLiPS Guy de Pauw, Pattern workshop — Cqrrelations, January 2015
----------------------------------------------------------------------------
pattern.en |es|de|fr|it|nl
- text preparation
- sentiment analysis tool
- WordNet interface
- wordlists interface
pattern.search
- a pattern matching system similar to regular expressions that can be used to search a string by syntax (word function) or by semantics (word meaning)
- e.g.: ('{NP} be * than {NP}')
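The idea behind a query like '{NP} be * than {NP}' can be sketched without Pattern itself. The toy matcher below works on hand-tagged tokens and is not the pattern.search API. The `(text, lemma, chunk)` tuple format and the `match_comparison` function are hypothetical, and treating `*` as "any number of tokens" is an assumption that may differ from Pattern's own wildcard rules.

```python
def match_comparison(tokens):
    """Find '{NP} be * than {NP}' in a list of (text, lemma, chunk) tuples.

    Assumes multi-word noun phrases are already merged into one tuple.
    Returns the two captured noun phrases, or None if nothing matches.
    """
    for i, (text, lemma, chunk) in enumerate(tokens):
        if chunk != "NP":
            continue
        # 'be' is matched on the lemma, so is/was/are all qualify.
        if i + 1 >= len(tokens) or tokens[i + 1][1] != "be":
            continue
        # '*': skip zero or more tokens until 'than' followed by a noun phrase.
        for j in range(i + 2, len(tokens) - 1):
            if tokens[j][1] == "than" and tokens[j + 1][2] == "NP":
                return text, tokens[j + 1][0]
    return None

# Hand-tagged example sentence (the tags are written by hand, not produced by a parser).
sentence = [
    ("Chuck Norris", "chuck norris", "NP"),
    ("is", "be", "VP"),
    ("cooler", "cool", "ADJP"),
    ("than", "than", "PP"),
    ("Dolph Lundgren", "dolph lundgren", "NP"),
]
print(match_comparison(sentence))
```

The braces in the query mark the parts to capture, which is why the function returns only the two noun phrases.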