# KDD step 3: data mining
document > Document.vector --- A Document is a bag-of-words representation of a text, i.e., its words without their order, plus a count for each word. The Document.vector maps each word (or feature) to its weight (absolute or relative word count, tf-idf, ...). The weight of a word represents its relevance in the text, so we can compare how similar two documents are by measuring whether they have relevant words in common. Given an unlabeled document, a classifier yields the label of the most similar document(s) in its training set. This implies that a larger training set with more features (and fewer labels) gives better performance. From:
http://www.clips.ua.ac.be/pages/pattern-vector#classification
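
The idea above (word weights, similarity through shared words, label of the most similar training document) can be sketched in plain Python. This is only a minimal illustration under assumptions of our own: it does not use pattern.vector, it weights words by relative count rather than tf-idf, it compares vectors with cosine similarity, and the function names and toy training set are made up for the example.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Return a {word: relative count} vector; word order is discarded."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def cosine_similarity(v1, v2):
    """Similarity of two sparse vectors: 1.0 = same direction, 0.0 = no shared words."""
    dot = sum(weight * v2.get(word, 0.0) for word, weight in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


def classify(text, training_set):
    """Return the label of the single most similar training document."""
    vector = bag_of_words(text)
    scored = [
        (cosine_similarity(vector, bag_of_words(doc)), label)
        for doc, label in training_set
    ]
    best_score, best_label = max(scored, key=lambda pair: pair[0])
    return best_label


# Hypothetical toy training set: (text, label) pairs.
training_set = [
    ("the cat chased the mouse", "animals"),
    ("the dog barked at the cat", "animals"),
    ("the stock market fell sharply", "finance"),
    ("investors sold stock and bonds", "finance"),
]

print(classify("a cat and a dog", training_set))    # animals
print(classify("stock prices fell", training_set))  # finance
```

Note that common words such as "the" and "and" also contribute to the similarity here; a weighting such as tf-idf is one way to reduce their influence.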
----------------------------------------------------------------------------
pattern.vector
- machine learning tools:
  - word count functions
  - bag-of-words documents
  - a vector space model
  - latent semantic analysis
  - algorithms for
    * clustering
      - k-means (similar clusters)
      - hierarchical (nested clusters)
    * and classification
      - NB (Naive Bayes)
      - KNN (k-nearest neighbor)
      - SLP (Single-layer perceptron)
      - SVM (Support vector machine)
  - genetic algorithm
from:
http://www.clips.ua.ac.be/pages/pattern-vector
----------------------------------------------------------------------------
clustering: unsupervised learning --> group similar items together
classification: supervised learning --> map items into predefined classes
from: Data Mining and Profiling in Large Databases, Bart Custers, Toon Calders, Bart Schermer, and Tal Zarsky (Eds.) (2013) --> in resource folder
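
To make the clustering side of this distinction concrete, here is a rough k-means sketch in plain Python. It is an illustration under our own assumptions, not code from the book or from pattern.vector: the points are made-up 2-D data, and the first k points are used as starting centroids so the run is reproducible. No labels are given as input; the algorithm only groups points that lie close together, and the cluster numbers it returns are arbitrary rather than predefined classes.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def kmeans(points, k, iterations=10):
    """Group 2-D points into k clusters; return one cluster index per point."""
    # Assumption for reproducibility: start from the first k points.
    centroids = [points[i] for i in range(k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return assignment


# Hypothetical unlabeled data: two visually separate groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 0.5), (8.5, 9.0)]
clusters = kmeans(points, k=2)
print(clusters)  # [0, 1, 0, 1, 0, 1]
```

A classifier, by contrast, would need training examples that already carry class labels, and would assign one of those labels to each new item.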