Wednesday, 22 December 2010

Data mining with Ubuntu.

Ubuntu offers many cool tools for data mining. Below is a list of them with a short description of each.

python-cluster is a "simple" package that lets you create several groups (clusters) of objects from a list. It's meant to be flexible and able to cluster any object. To ensure this kind of flexibility, you need to supply not only the list of objects, but also a function that calculates the similarity between two of those objects. For simple datatypes, like integers, this can be as simple as a subtraction, but more complex calculations are possible. Right now, it is possible to generate the clusters using hierarchical clustering and the popular K-Means algorithm. For the hierarchical algorithm there are different "linkage" methods available (single, complete, average and uclus).
The module's features include: 
  •  computing distance matrices from observation vectors
  •  generating hierarchical clusters from distance matrices
  •  computing statistics on clusters
  •  cutting linkages to generate flat clusters
  •  visualizing clusters with dendrograms
The interface is very similar to MATLAB's Statistics Toolbox API.
The core implementation of this library is in C for efficiency.
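The ideas described above (a user-supplied distance function, a distance matrix, linkage methods, and cutting the hierarchy into flat clusters) can be sketched in a few lines of plain Python. This is only an illustration, not the API of any of the packages listed here: the function names `distance_matrix`, `linkage_distance` and `flat_clusters`, the `threshold` parameter and the sample data are all assumptions made for the example.

```python
from itertools import combinations


def distance_matrix(items, distance):
    """Map every index pair (i, j) with i < j to distance(items[i], items[j])."""
    return {(i, j): distance(items[i], items[j])
            for i, j in combinations(range(len(items)), 2)}


def linkage_distance(a, b, dist, linkage):
    """Distance between two clusters (lists of indices) under a linkage rule."""
    values = [dist[(min(i, j), max(i, j))] for i in a for j in b]
    if linkage == "single":
        return min(values)
    if linkage == "complete":
        return max(values)
    if linkage == "average":
        return sum(values) / len(values)
    raise ValueError("unknown linkage: %r" % linkage)


def flat_clusters(items, distance, threshold, linkage="single"):
    """Merge the two closest clusters until the closest pair exceeds threshold."""
    dist = distance_matrix(items, distance)
    clusters = [[i] for i in range(len(items))]
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        (a, b), best = min(
            (((x, y), linkage_distance(clusters[x], clusters[y], dist, linkage))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1])
        if best > threshold:
            break  # cutting the hierarchy here yields the flat clusters
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [[items[i] for i in sorted(c)] for c in clusters]


# For integers the distance can be as simple as an absolute difference.
data = [12, 34, 23, 32, 46, 96, 13]
groups = flat_clusters(data, lambda x, y: abs(x - y), threshold=5)
print(groups)
```

Because the distance is passed in as a function, the same sketch works for any kind of object as long as you can say how far apart two of them are.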
AutoClass solves the problem of automatic discovery of classes in data (sometimes called clustering, or unsupervised learning), as distinct from the generation of class descriptions from labeled examples (called supervised learning). It aims to discover the "natural" classes in the data. AutoClass is applicable to observations of things that can be described by a set of attributes, without referring to other things. The data values corresponding to each attribute are limited to be either numbers or the elements of a fixed set of symbols. With numeric data, a measurement error must be provided.
Libtextcat is a library with functions that implement the classification technique described in Cavnar & Trenkle, "N-Gram-Based Text Categorization". It was primarily developed for language guessing, a task on which it is known to perform with near-perfect accuracy. This package contains 'createfp' for generating fingerprints.
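To give a feel for the technique, here is a rough Python sketch of n-gram fingerprints and the "out-of-place" rank distance in the spirit of Cavnar & Trenkle. It is a simplification and not libtextcat's own code or interface: the names `ngram_profile`, `out_of_place` and `guess`, the underscore padding, the profile size of 300 and the tiny training sentences are all assumptions made for the illustration.

```python
from collections import Counter


def ngram_profile(text, max_n=5, size=300):
    """Return the most frequent character n-grams (length 1..max_n), best first."""
    counts = Counter()
    for word in text.lower().split():
        padded = "_" + word + "_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending frequency; break ties alphabetically for determinism.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return ranked[:size]


def out_of_place(doc_profile, category_profile):
    """Sum of rank differences; n-grams missing from the category get a fixed penalty."""
    rank = {g: r for r, g in enumerate(category_profile)}
    penalty = len(category_profile)
    return sum(abs(r - rank[g]) if g in rank else penalty
               for r, g in enumerate(doc_profile))


def guess(text, fingerprints):
    """Return the name of the fingerprint closest to the text's profile."""
    doc = ngram_profile(text)
    return min(fingerprints, key=lambda name: out_of_place(doc, fingerprints[name]))


# Toy training data; real fingerprints are built from much larger samples.
english_sample = ("the quick brown fox jumps over the lazy dog and then "
                  "the dog chases the fox through the woods")
german_sample = ("der schnelle braune fuchs springt auf den faulen hund und dann "
                 "jagt der hund den fuchs durch den wald")
fingerprints = {
    "english": ngram_profile(english_sample),
    "german": ngram_profile(german_sample),
}
print(guess("the dog and the fox", fingerprints))
```

With such small samples the guess is only indicative; the accuracy mentioned above depends on fingerprints generated from sizeable training text.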
automatic classification or clustering 
multimix fits a mixture of multivariate distributions to a set of observations by maximum likelihood using the EM algorithm. The emphasis is less on parameter estimation than on the use of the estimated component distributions to cluster the data. The program is designed to cluster multivariate data with categorical and continuous variables.
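The EM idea can be sketched on a much simpler case than multimix handles: a one-dimensional mixture of Gaussians, where the fitted components are then used to assign each observation to a cluster. This is not multimix's code or input format; the function `em_gaussian_mixture`, its parameters, the starting means and the data are all assumptions chosen for the example.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gaussian_mixture(data, means, iterations=50, min_var=1e-6):
    """Fit a 1-D Gaussian mixture by EM, starting from the given means."""
    k = len(means)
    means = list(means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    resp = []
    for _ in range(iterations):
        # E-step: responsibility of each component for each observation.
        resp = []
        for x in data:
            dens = [weights[j] * normal_pdf(x, means[j], variances[j])
                    for j in range(k)]
            total = sum(dens)
            if total == 0.0:
                resp.append([1.0 / k] * k)  # guard against numerical underflow
            else:
                resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances from responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(min_var, var)  # keep variances strictly positive
    return weights, means, variances, resp


data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
weights, means, variances, resp = em_gaussian_mixture(data, means=[0.0, 6.0])

# Cluster each observation by its most responsible component.
labels = [r.index(max(r)) for r in resp]
print(means, labels)
```

As in the description above, the parameter estimates matter less here than the final step, where the component with the highest responsibility decides the cluster of each observation.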
Python data processing framework. Implemented algorithms include: Principal Component Analysis (PCA), Independent Component Analysis (ICA), Slow Feature Analysis (SFA), Independent Slow Feature Analysis (ISFA), Growing Neural Gas (GNG), Factor Analysis, Fisher Discriminant Analysis (FDA), and Gaussian Classifiers.
Machine learning algorithms for data mining tasks
Weka is a collection of machine learning algorithms in Java that can either be used from the command-line, or called from your own Java code. Weka is also ideally suited for developing new machine learning schemes. 
Implemented schemes cover decision tree inducers, rule learners, model tree generators, support vector machines, locally weighted regression, instance-based learning, bagging, boosting, and stacking. Also included are clustering methods, and an association rule learner. Apart from actual learning schemes, Weka also contains a large variety of tools that can be used for pre-processing datasets. 
This package contains the binaries and examples. 
Toolkit for multivariate data analysis 
The Toolkit for Multivariate Analysis (TMVA) provides a ROOT-integrated environment for the parallel processing and evaluation of MVA techniques to discriminate signal from background samples. It presently includes (ranked by complexity):
  • Rectangular cut optimisation 
  • Correlated likelihood estimator (PDE approach) 
  • Multi-dimensional likelihood estimator (PDE - range-search approach) 
  • Fisher (and Mahalanobis) discriminant
  • H-Matrix (chi-squared) estimator
  • Artificial Neural Network (two different implementations) 
  • Boosted Decision Trees 
The TMVA package includes an implementation for each of these discrimination techniques, their training and testing (performance evaluation). In addition all these methods can be tested in parallel, and hence their performance on a particular data set may easily be compared. 
ROOT web-site: http://root.cern.ch 
TMVA web-site: http://tmva.sourceforge.net 
nondeterministic quartet tree search library for unrooted trees 
qsearch is a library for universal hierarchical clustering using an arbitrary distance matrix as input. It searches through the space of all possible unrooted trees of a given size and finds the closest match based on a weighted quartet cost function determined by the distance matrix. When used in combination with other feature extraction libraries such as libcomplearn this system can be used for fast and accurate phylogenetic reconstruction, linguistic analysis, or stemmatology.
The MCL package is an implementation of the MCL algorithm, and offers utilities for manipulating sparse matrices (the essential data structure in the MCL algorithm) and conducting cluster experiments. MCL is currently being used in sciences like biology (protein family detection, genomics), computer science (node clustering in Peer-to-Peer networks), and linguistics (text analysis).
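For readers curious about what the algorithm does, here is a small sketch of the Markov Cluster process using dense Python lists: it alternates expansion (squaring the column-stochastic matrix) and inflation (raising entries to a power and re-normalizing columns). It is not the MCL package's code or command-line interface, and it ignores the sparse-matrix handling that makes the real tool scale. The names `adjacency_from_edges`, `normalize_columns` and `mcl`, the inflation value of 2.0 and the example graph are assumptions.

```python
def adjacency_from_edges(n, edges):
    """Build a symmetric n x n adjacency matrix from a list of undirected edges."""
    m = [[0.0] * n for _ in range(n)]
    for a, b in edges:
        m[a][b] = 1.0
        m[b][a] = 1.0
    return m


def normalize_columns(m):
    """Scale every column so that it sums to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def mcl(adjacency, inflation=2.0, iterations=50, eps=1e-6):
    """Run a fixed number of expansion/inflation rounds and read off clusters."""
    n = len(adjacency)
    # Add self-loops, then turn the graph into a column-stochastic matrix.
    m = [[float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        # Expansion: matrix squaring spreads flow along longer paths.
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: entrywise power strengthens strong flows, weakens weak ones.
        m = normalize_columns([[v ** inflation for v in row] for row in m])
    # Each row that still carries flow defines a cluster of the columns it reaches.
    clusters = set()
    for row in m:
        members = frozenset(j for j, v in enumerate(row) if v > eps)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Two triangles joined by a single bridge edge (2-3).
graph = adjacency_from_edges(
    6, [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)])
clusters = mcl(graph)
print(clusters)
```

A higher inflation value tends to produce more, smaller clusters; a lower one tends to produce fewer, larger ones.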
high-performance Python package for predictive modeling 
mlpy provides high level procedures that support, with few lines of code, the design of rich Data Analysis Protocols (DAPs) for preprocessing, clustering, predictive classification and feature selection. Methods are available for feature weighting and ranking, data resampling, error evaluation and experiment landscaping. 
mlpy includes: SVM (Support Vector Machine), KNN (K Nearest Neighbor), FDA, SRDA, PDA, DLDA (Fisher, Spectral Regression, Penalized, Diagonal Linear Discriminant Analysis) for classification and feature weighting, I-RELIEF, DWT and FSSun for feature weighting, *RFE (Recursive Feature Elimination) and RFS (Recursive Forward Selection) for feature ranking, DWT, UWT, CWT (Discrete, Undecimated, Continuous Wavelet Transform), KNN imputing, DTW (Dynamic Time Warping), Hierarchical Clustering, k-medoids, Resampling Methods, Metric Functions, Canberra indicators.
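Of the methods in that list, Dynamic Time Warping is easy to show in a few lines. The sketch below is the textbook dynamic-programming version written in plain Python, not mlpy's implementation or interface; the name `dtw_distance` and the default absolute-difference cost are assumptions made for the example.

```python
def dtw_distance(a, b, cost=lambda x, y: abs(x - y)):
    """Dynamic Time Warping distance between two sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # table[i][j] = cheapest alignment of the first i items of a
    # with the first j items of b.
    table = [[inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = cost(a[i - 1], b[j - 1])
            table[i][j] = step + min(table[i - 1][j],      # a[i-1] repeats a match
                                     table[i][j - 1],      # b[j-1] repeats a match
                                     table[i - 1][j - 1])  # both sequences advance
    return table[n][m]


# The second series is the first one with one sample held longer,
# so warping the time axis aligns them perfectly.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0
print(dtw_distance([1, 2, 3], [2, 3, 4]))     # 2.0
```

Unlike a point-by-point comparison, this distance tolerates series of different lengths and local stretching along the time axis.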
Large Scale Machine Learning Toolbox 
SHOGUN is a new machine learning toolbox that focuses on large scale kernel methods, especially Support Vector Machines (SVM), with an emphasis on bioinformatics. It provides a generic SVM object interfacing to several different SVM implementations. Each of the SVMs can be combined with any of the many kernels implemented. It can deal with weighted linear combinations of a number of sub-kernels, each of which does not necessarily work on the same domain, where an optimal sub-kernel weighting can be learned using Multiple Kernel Learning. Apart from SVM 2-class classification and regression problems, a number of linear methods like Linear Discriminant Analysis (LDA), Linear Programming Machine (LPM), (Kernel) Perceptrons and also algorithms to train Hidden Markov Models are implemented. The input feature-objects can be dense, sparse or strings and of type int/short/double/char and can be converted into different feature types. Chains of preprocessors (e.g. subtracting the mean) can be attached to each feature object, allowing for on-the-fly pre-processing. SHOGUN comes in different flavours: a stand-alone version and versions with interfaces to Matlab(tm), R, Octave, Readline and Python. This is the core library all interfaces are based on.
State of the art machine learning library - runtime library 
Torch is a machine-learning library written in C++. Its aim is to provide state-of-the-art implementations of the best machine-learning algorithms. It includes:
  • Many gradient-based methods, including multi-layered perceptrons, radial basis functions, and mixtures of experts. Many small "modules" (Linear module, Tanh module, SoftMax module, ...) can be plugged together. 
  • Support Vector Machine, for classification and regression. 
  • Distribution package, includes Kmeans, Gaussian Mixture Models, Hidden Markov Models, and Bayes Classifier, and classes for speech recognition with embedded training. 
  • Ensemble models such as Bagging and Adaboost. 
  • Non-parametric models such as K-nearest-neighbors, Parzen Regression and Parzen Density Estimator.
This package is the Torch runtime library.
Python bindings for support vector machine library
Python dynamic loading extension module for the LIBSVM library.
multivariate pattern analysis with Python 
Python module to ease pattern classification analyses of large datasets. It provides high-level abstraction of typical processing steps (e.g. data preparation, classification, feature selection, generalization testing), a number of implementations of some popular algorithms (e.g. kNN, Ridge Regressions, Sparse Multinomial Logistic Regression), and bindings to external machine learning libraries (libsvm, shogun). While it is not limited to neuroimaging data (e.g. fMRI, or EEG) it is eminently suited for such datasets.
