commonsense_kdd

----------------------------------------------------------------------------

# KDD step 3: data mining

document > Document.vector --- A Document is a bag-of-words representation of a text, i.e., unordered words + word counts. The Document.vector maps the words (or features) to their weights (absolute or relative word count, tf-idf, ...). The weight of a word represents its relevance in the text, so we can compare how similar two documents are by measuring whether they have relevant words in common. Given an unlabeled document, a classifier yields the label of the most similar document(s) in its training set. This implies that a larger training set with more features (and fewer labels) generally gives better performance.
from: http://www.clips.ua.ac.be/pages/pattern-vector#classification
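
The idea above (word weights, similarity through shared relevant words, label of the most similar training document) can be sketched in plain Python. This is a minimal sketch using only the standard library, not the pattern.vector API: the helper names `bag_of_words`, `cosine_similarity` and `classify`, the choice of relative word count as weight, cosine similarity as the measure, and the toy training set are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Return a {word: relative count} vector for a text (word order is discarded)."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def cosine_similarity(v1, v2):
    """Cosine between two sparse vectors: 1.0 = same direction, 0.0 = no shared words."""
    dot = sum(weight * v2.get(word, 0.0) for word, weight in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


def classify(text, training_set):
    """Return the label of the most similar training document (1-nearest neighbor)."""
    vector = bag_of_words(text)
    best_label, _ = max(
        ((label, cosine_similarity(vector, bag_of_words(doc))) for doc, label in training_set),
        key=lambda pair: pair[1],
    )
    return best_label


# Hypothetical toy training set: (text, label) pairs.
TRAINING_SET = [
    ("the cat purrs and the cat sleeps", "cat"),
    ("the dog barks and the dog runs", "dog"),
]

print(classify("a cat sleeps on the mat", TRAINING_SET))
```

The query shares "the", "cat" and "sleeps" with the first training document but only "the" with the second, so the first document's label is returned.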

----------------------------------------------------------------------------

pattern.vector

- machine learning tools:
  - word count functions
  - bag-of-words documents
  - a vector space model
  - latent semantic analysis (context analysis)
  - algorithms for
    - clustering
      - k-means (similar clusters)
      - hierarchical (nested clusters)
    - classification
      - NB (Naive Bayes)
      - KNN (k-nearest neighbor)
      - SLP (single-layer perceptron)
      - SVM (support vector machine)
  - genetic algorithm
from: http://www.clips.ua.ac.be/pages/pattern-vector
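
To make the word count functions and the vector space model in the list above more concrete, here is a small tf-idf sketch. It is an assumption-laden illustration, not pattern.vector code: the function names `tokenize` and `tf_idf_vectors`, the toy corpus, and the exact formula (relative term frequency times log(N / df)) are my own choices, and the library may compute its weights differently.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf_vectors(texts):
    """Return one {word: tf-idf weight} dict per text.

    tf  = relative word count within the document
    idf = log(N / df), where df is the number of documents containing the word
    """
    token_lists = [tokenize(t) for t in texts]
    n_docs = len(token_lists)
    # Document frequency: in how many documents does each word occur?
    df = Counter(word for tokens in token_lists for word in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            word: (n / total) * math.log(n_docs / df[word])
            for word, n in counts.items()
        })
    return vectors


# Hypothetical toy corpus.
CORPUS = [
    "the cat purrs",
    "the dog barks",
    "the bird sings",
]

vectors = tf_idf_vectors(CORPUS)
for text, vector in zip(CORPUS, vectors):
    print(text, vector)
```

A word that occurs in every document ("the") gets weight 0 under this formula, while a word unique to one document ("cat") gets a positive weight, which is the sense in which the weight expresses relevance.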

----------------------------------------------------------------------------

clustering: unsupervised learning --> groups similar items together
classification: supervised learning --> maps items into predefined classes
from: Data Mining and Profiling in Large Databases, Bart Custers, Toon Calders, Bart Schermer, and Tal Zarsky (Eds.) (2013) --> in resource folder
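
As a rough illustration of the unsupervised side, the sketch below runs k-means on a handful of 2-D points: no labels are supplied, and the groups emerge from distances alone. Everything here is assumed for illustration (the function names, the sample points, and using the first k points as initial centroids to keep the run deterministic); it is not taken from the book or from pattern.vector.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def k_means(points, k, iterations=10):
    """Group points into k clusters; returns a list of clusters (lists of points)."""
    # Assumption: the first k points serve as initial centroids (deterministic sketch).
    centroids = list(points[:k])
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            mean_point(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments are stable, so further iterations change nothing
        centroids = new_centroids
    return clusters


# Hypothetical sample data: two visually separate groups of points.
POINTS = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]

clusters = k_means(POINTS, k=2)
for cluster in clusters:
    print(cluster)
```

A classifier, by contrast, would need each training point to come with a predefined class label and would assign new points to one of those classes.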