----------------------------------------------------------------------------
# KDD step 3: data mining
document > Document.vector --- A Document is a bag-of-words representation of a text, i.e., unordered words + word counts. Document.vector maps the words (or features) to their weight (absolute or relative word count, tf-idf, ...). The weight of a word represents its relevance in the text, so we can compare how similar two documents are by measuring whether they have relevant words in common. Given an unlabeled document, a classifier yields the label of the most similar document(s) in its training set. This implies that a larger training set with more features (and fewer labels) gives better performance. — from: http://www.clips.ua.ac.be/pages/pattern-vector#classification
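A minimal sketch of the above, assuming the Document and Model classes of pattern.vector (CLiPS Pattern); the example sentences are invented for illustration:

```python
# Sketch only: assumes the CLiPS Pattern library is installed (pattern.vector).
from pattern.vector import Document, Model, TFIDF

d1 = Document('Tigers are big cats with orange fur and dark stripes.')
d2 = Document('Lions are big cats; the males have manes.')
d3 = Document('House cats purr and like to sleep all day.')

print(d1.words)    # bag-of-words: unordered words + word count (stopwords filtered by default)
print(d1.vector)   # word (feature) -> weight

# In a Model the weights become tf-idf, and similarity() is the cosine of two
# document vectors, i.e. a measure of the relevant words they have in common.
m = Model(documents=[d1, d2, d3], weight=TFIDF)
print(m.similarity(d1, d2))
print(m.similarity(d1, d3))
```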
----------------------------------------------------------------------------
pattern.vector
- machine learning tools:
  - word count functions
  - bag-of-word documents
  - a vector space model
  - latent semantic analysis (context analysis)
  - algorithms for:
    - clustering (see the clustering sketch at the end of this section)
      - k-means (similar clusters)
      - hierarchical (nested clusters)
    - classification (see the classification sketch after this list)
      - NB (Naive Bayes)
      - KNN (k-nearest neighbor)
      - SLP (Single-layer perceptron)
      - SVM (Support vector machine)
  - genetic algorithm

from: http://www.clips.ua.ac.be/pages/pattern-vector
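A short classification sketch for the NB and KNN classifiers listed above, assuming the pattern.vector classifier API; the training texts and the labels 'sports' and 'economy' are toy data:

```python
# Sketch only: NB and KNN from pattern.vector, trained on a few labeled Documents.
from pattern.vector import Document, NB, KNN

train = [
    Document('the striker scored two goals in the second half', type='sports'),
    Document('the keeper saved a penalty before the final whistle', type='sports'),
    Document('the central bank raised interest rates again', type='economy'),
    Document('stock markets fell after the inflation report', type='economy'),
]

nb = NB(train=train)         # Naive Bayes
knn = KNN(train=train, k=3)  # k-nearest neighbor

# An unlabeled document gets the label of the most similar training document(s).
test = Document('the team scored a late goal to win the match')
print(nb.classify(test))     # expected: 'sports'
print(knn.classify(test))    # expected: 'sports'
```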
----------------------------------------------------------------------------
clustering = unsupervised learning --> group by similarity
classification = supervised learning --> map into predefined classes
from: Data Mining and Profiling in Large Databases, Bart Custers, Toon Calders, Bart Schermer, and Tal Zarsky (Eds.) (2013) --> in resource folder
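A clustering sketch to go with this distinction: unlike the classifiers above, k-means never sees any labels and only groups documents by similarity. This assumes pattern.vector's Model.cluster API; the documents are toy data.

```python
# Sketch only: unsupervised clustering with pattern.vector (k-means, hierarchical).
from pattern.vector import Document, Model, TFIDF, KMEANS, HIERARCHICAL

docs = [
    Document('the striker scored a goal in the match',        name='sports1'),
    Document('the match ended after a late goal',             name='sports2'),
    Document('interest rates and inflation worry investors',  name='economy1'),
    Document('investors fear inflation and rising rates',     name='economy2'),
]
m = Model(documents=docs, weight=TFIDF)

# Clustering = unsupervised: no predefined classes, groups emerge from similarity.
for cluster in m.cluster(method=KMEANS, k=2):
    print([d.name for d in cluster])

# Hierarchical clustering returns nested clusters instead of k flat groups.
print(m.cluster(method=HIERARCHICAL, k=1))
```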