----------------------------------------------------------------------------
# KDD step 3: data mining
document > Document.vector
A Document is a bag-of-words representation of a text, i.e., unordered words + word count. The Document.vector maps the words (or features) to their weight (absolute or relative word count, tf-idf, ...). The weight of a word represents its relevance in the text, so we can compare how similar two documents are by measuring whether they have relevant words in common. Given an unlabeled document, a classifier yields the label of the most similar document(s) in its training set. This implies that a larger training set with more features (and fewer labels) generally gives better performance.
from:
http://www.clips.ua.ac.be/pages/pattern-vector#classification
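
The similarity idea quoted above can be sketched in a few lines of plain Python. This is only an illustration and not pattern.vector's implementation: the helper names (bag_of_words, idf_table, weigh, cosine, classify), the whitespace tokenizer, the tf-idf formula (relative count times log(N / df)) and the two-document training set are all assumptions made for the example.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Unordered words + word count (lowercased, split on whitespace)."""
    return Counter(text.lower().split())


def idf_table(bags):
    """Inverse document frequency per word: log(N / number of docs containing it)."""
    n = len(bags)
    df = Counter(word for bag in bags for word in bag)  # each bag counts a word once
    return {word: math.log(n / df[word]) for word in df}


def weigh(bag, idf):
    """Map each known word to its tf-idf weight; words unseen in training are dropped."""
    total = sum(bag.values())
    return {w: (c / total) * idf[w] for w, c in bag.items() if w in idf}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def classify(text, training):
    """Return the label of the most similar training document (1-nearest neighbor)."""
    bags = [bag_of_words(doc) for doc, _ in training]
    idf = idf_table(bags)
    query = weigh(bag_of_words(text), idf)
    scored = [(cosine(query, weigh(bag, idf)), label)
              for bag, (_, label) in zip(bags, training)]
    return max(scored, key=lambda pair: pair[0])[1]


# Hypothetical toy training set: (text, label) pairs.
training = [
    ("the cat purrs and the cat sleeps", "cat"),
    ("the dog barks and the dog fetches", "dog"),
]
print(classify("a sleepy cat purrs", training))
```

Words that occur in every training document ("the", "and") get an idf of 0 and so contribute nothing to the similarity; only the distinguishing words decide the label.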
----------------------------------------------------------------------------
pattern.vector
- machine learning tools:
  - word count functions
  - bag-of-words documents
  - a vector space model
  - latent semantic analysis
  - algorithms for
    * clustering
      - k-means (similar clusters)
      - hierarchical (nested clusters)
    * and classification
      - NB (Naive Bayes)
      - KNN (k-nearest neighbor)
      - SLP (single-layer perceptron)
      - SVM (support vector machine)
  - genetic algorithm
from:
http://www.clips.ua.ac.be/pages/pattern-vector
----------------------------------------------------------------------------
clustering: unsupervised learning --> group similar items together
classification: supervised learning --> map items into predefined classes
from: Data Mining and Profiling in Large Databases, Bart Custers, Toon Calders, Bart Schermer, and Tal Zarsky (Eds.) (2013) --> in resource folder
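
To make the unsupervised side concrete, here is a rough sketch of k-means, one of the clustering algorithms listed for pattern.vector. It is written in plain Python and is not taken from the library: the function name kmeans, the 2-D points standing in for document vectors, and the simplistic initialization (the first k points become the starting centroids) are assumptions for illustration. Note that no labels are passed in; the groups emerge from the distances alone.

```python
import math


def kmeans(points, k, iterations=10):
    """Group points into k clusters by repeatedly assigning each point to its
    nearest centroid and then moving each centroid to the mean of its cluster."""
    # Assumption: use the first k points as initial centroids. Real
    # implementations usually pick them at random or with k-means++.
    centroids = [points[i] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(axis) / len(c) for axis in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return clusters


# Hypothetical unlabeled data: two visibly separate groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (0.9, 1.1), (8.2, 7.9), (7.8, 8.1)]
for cluster in kmeans(points, k=2):
    print(cluster)
```

A classifier, by contrast, would need each of these points to arrive with a predefined class label before it could be trained.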