----------------------------------------------------------------------------
# KDD step 3: data mining

document --> Document.vector
---
A Document is a bag-of-words representation of a text, i.e., unordered words + word counts.

The Document.vector maps the words (or features) to their weights (absolute or relative word count, tf-idf, ...). The weight of a word represents its relevance in the text. So we can compare how similar two documents are by measuring whether they have relevant words in common.

Given an unlabeled document, a classifier yields the label of the most similar document(s) in its training set. This implies that a larger training set with more features (and fewer labels) gives better performance.

from: http://www.clips.ua.ac.be/pages/pattern-vector#classification
----------------------------------------------------------------------------
pattern.vector - machine learning tools:

* word count functions
* bag-of-words documents
* a vector space model
* latent semantic analysis (context analysis)
* algorithms for
    * clustering
        * k-means (similar clusters)
        * hierarchical (nested clusters)
    * classification
        * NB (Naive Bayes)
        * KNN (k-nearest neighbor)
        * SLP (single-layer perceptron)
        * SVM (support vector machine)
* a genetic algorithm

(usage sketches for classification and clustering follow at the end of these notes)

from: http://www.clips.ua.ac.be/pages/pattern-vector
----------------------------------------------------------------------------
clustering
    unsupervised learning --> group by similarity

classification
    supervised learning --> map into predefined classes

from: Data Mining and Profiling in Large Databases, Bart Custers, Toon Calders, Bart Schermer, and Tal Zarsky (Eds.) (2013) --> in resource folder
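----------------------------------------------------------------------------
A minimal classification sketch with pattern.vector, assuming Pattern is installed (pip install pattern); the training texts and labels below are invented for illustration, not taken from the notes above.

```python
from pattern.vector import Document, NB

# Each Document is a bag-of-words; type=... is its label in the training set.
train = [
    Document("A tiger is a big striped cat.",        type="tiger"),
    Document("Tigers hunt alone in the forest.",     type="tiger"),
    Document("A lion is a big cat with a mane.",     type="lion"),
    Document("Lions hunt in groups on the savanna.", type="lion"),
]

# Document.vector maps each word (feature) to its weight in the document.
print(train[0].vector)

# Train a Naive Bayes classifier and label an unseen document:
# it yields the label of the most similar training document(s).
nb = NB(train=train)
print(nb.classify(Document("a striped cat in the forest")))  # most likely "tiger"
```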
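A similar sketch for the unsupervised side: Documents bundled in a Model (called Corpus in older Pattern releases), weighted with tf-idf, compared by cosine similarity, and grouped by clustering; the toy documents are again invented.

```python
from pattern.vector import Document, Model, TFIDF, KMEANS, HIERARCHICAL

docs = [
    Document("Tigers and lions are big cats.",    name="cats1"),
    Document("Cats purr, hunt and sleep a lot.",  name="cats2"),
    Document("Python is a programming language.", name="code1"),
    Document("Java and Python are languages.",    name="code2"),
]

# The Model weights each Document.vector, here with tf-idf.
m = Model(documents=docs, weight=TFIDF)

# Cosine similarity between two document vectors (0.0 to 1.0).
print(m.similarity(docs[0], docs[1]))

# Unsupervised: k-means groups similar documents into k flat clusters.
print(m.cluster(method=KMEANS, k=2))

# Hierarchical clustering nests similar documents into a tree of clusters.
print(m.cluster(method=HIERARCHICAL, k=1))
```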