----------------------------------------------------------------------------

# KDD step 3: data mining

document > Document.vector --- A Document is a bag-of-words representation of a text, i.e., unordered words plus their word counts. The Document.vector maps the words (or features) to their weights (absolute or relative word count, tf-idf, ...). The weight of a word represents its relevance in the text, so we can compare how similar two documents are by measuring whether they have relevant words in common. Given an unlabeled document, a classifier yields the label of the most similar document(s) in its training set. This implies that a larger training set with more features (and fewer labels) gives better performance. From: http://www.clips.ua.ac.be/pages/pattern-vector#classification
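
To make the idea above concrete, here is a minimal sketch in plain Python. It does not use pattern.vector and is not how that library is implemented: the helper names `vector`, `cosine_similarity` and `classify`, the choice of relative word count as the weight, and the toy training set are all assumptions made for illustration. It builds a bag-of-words vector per text, compares vectors with cosine similarity, and returns the label of the single most similar training document.

```python
import math
from collections import Counter

def vector(text):
    """Bag-of-words: map each lowercase word to its relative word count."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def cosine_similarity(v1, v2):
    """Similarity of two sparse vectors: 1.0 = same direction, 0.0 = no shared words."""
    dot = sum(w * v2.get(f, 0.0) for f, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def classify(text, training):
    """Return the label of the most similar training document (1-nearest neighbour)."""
    v = vector(text)
    best = max(training, key=lambda item: cosine_similarity(v, vector(item[0])))
    return best[1]

# Hypothetical toy training set: (text, label) pairs.
training = [
    ("the cat sat on the mat", "animals"),
    ("dogs and cats are pets", "animals"),
    ("stocks fell as markets closed", "finance"),
    ("the bank raised interest rates", "finance"),
]

print(classify("markets and stocks rallied", training))
```

Note that the common word "and" creates some similarity with an "animals" document here; weighting schemes such as tf-idf exist to reduce the influence of words like that.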


----------------------------------------------------------------------------


pattern.vector


----------------------------------------------------------------------------


clustering: unsupervised learning --> groups similar items together
classification: supervised learning --> maps items into predefined classes
from: Data Mining and Profiling in Large Databases, Bart Custers, Toon Calders, Bart Schermer, and Tal Zarsky (Eds.) (2013) --> in resource folder
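
The difference between the two can be sketched with a few 2-D points. This is a hedged illustration only: k-means and nearest-centroid are just one possible algorithm for each task (the source names neither), and the function names, the naive initialisation and the data are assumptions. The clustering function receives no labels and only groups nearby points; the classification function is given predefined classes and maps a new point to one of them.

```python
def mean(points):
    """Component-wise mean of a list of equal-length tuples."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def dist2(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=10):
    """Unsupervised: group points into k clusters; no labels are given."""
    centroids = list(points[:k])  # naive deterministic initialisation (assumption)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return clusters

def nearest_centroid(point, labeled):
    """Supervised: map a point to one of the predefined classes."""
    centroids = {label: mean(pts) for label, pts in labeled.items()}
    return min(centroids, key=lambda label: dist2(point, centroids[label]))

points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 9.5), (0.5, 1.5)]
clusters = kmeans(points, k=2)  # groups are discovered, but have no names

labeled = {"low": [(1.0, 1.0), (1.5, 2.0)], "high": [(9.0, 9.0), (8.0, 9.5)]}
print(clusters)
print(nearest_centroid((0.5, 1.5), labeled))  # one of the predefined labels
```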