User:Ian Helmke/Sandbox
Clustering Additions
This is text I'd like to add to the clustering page (pending heavy editing)
Overfitting
Overfitting occurs when a classifier fits its training data so closely that it is no longer useful for classifying data it has not seen. The classifier is extremely accurate on the training data, but when it is applied to new data it often classifies that data incorrectly.
This can occur when a classifier picks up on irrelevant details in the training data. A feature value that appears only rarely in the training data can end up carrying a lot of weight, so new data that happens to share that uncommon value is grouped with the training examples that had it, even when the value has nothing to do with the category. This produces inaccurate results.
Machine learning techniques can reduce overfitting to some extent by penalizing overly complicated models. A simpler model often generalizes better to data outside the training set, even if it fits the training data slightly less closely.
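As a rough illustration of the points above, the following sketch compares a model that memorizes its training data with a one-parameter threshold rule, and then scores both with a complexity penalty. The data, the two toy models, the parameter counts, and the penalty weight of 0.05 are all assumptions made up for this example; they are not taken from any particular algorithm.

```python
# Toy 1-D data (invented for illustration): the true label is x >= 5,
# but two training labels (x = 2 and x = 7) are noisy.
TRAIN = [(0, False), (1, False), (2, True), (3, False), (4, False),
         (5, True), (6, True), (7, False), (8, True), (9, True)]
TEST = [(0.4, False), (1.8, False), (2.3, False), (3.6, False),
        (5.4, True), (6.7, True), (7.2, True), (8.6, True)]


def fit_threshold(train):
    """Simple model: predict True when x >= t, with t chosen from the
    training values so that training errors are minimized."""
    best_t = min((x for x, _ in train),
                 key=lambda t: sum((x >= t) != y for x, y in train))
    return lambda x: x >= best_t


def fit_memorizer(train):
    """Complex model: copy the label of the nearest training point,
    which reproduces every training label, noise included."""
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]


def accuracy(model, data):
    """Fraction of (x, label) pairs the model labels correctly."""
    return sum(model(x) == y for x, y in data) / len(data)


def penalized_score(model, train, n_params, penalty=0.05):
    """Training error plus a charge per parameter (lower is better).
    The penalty weight is an arbitrary assumption."""
    return (1 - accuracy(model, train)) + penalty * n_params


simple = fit_threshold(TRAIN)
memorizer = fit_memorizer(TRAIN)

print("memorizer train/test:", accuracy(memorizer, TRAIN), accuracy(memorizer, TEST))
print("threshold train/test:", accuracy(simple, TRAIN), accuracy(simple, TEST))
# The threshold rule has 1 parameter; the memorizer stores every training point.
print("penalized scores:", penalized_score(simple, TRAIN, 1),
      penalized_score(memorizer, TRAIN, len(TRAIN)))
```

On this toy data the memorizer is perfect on the training set but wrong on half of the held-out points, while the threshold rule misses the two noisy training labels and gets every held-out point right. With the assumed penalty, the simpler model also receives the better (lower) score.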
Imbalance of Data
At times, the data within a training set is imbalanced: one category has many more data points than another. Often the category with less available data is the more interesting one, and we want to find characteristics that are common to the minority group and not shared by the majority group. Many classifiers, however, assume that the ratio of categories in the training data is roughly the same as the ratio of real data that will belong in each category. When the minority category is rare, such a classifier tends to favor the majority category and performs poorly on the very cases we care about.
One way to reduce the effect of this imbalance is a technique called active example selection. Active example selection builds the classifier model slowly, adding a few pieces of sample data from each class at a time. The model is tested at every stage; examples that improve its accuracy are kept in the classifier, while examples that leave the model unchanged or make its performance worse are removed. This ensures that only meaningful pieces of data are used in the classifier.
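The keep-or-discard loop described above might be sketched as follows. This is a simplified illustration, not a reference implementation: it adds one candidate at a time rather than a few per class, uses a toy nearest-neighbour classifier on one-dimensional data, and all names (select_examples, SEED, CANDIDATES, VALIDATION) and data values are invented for the example.

```python
def predict(kept, x):
    """Toy classifier: label of the nearest kept example (1-nearest-neighbour)."""
    return min(kept, key=lambda p: abs(p[0] - x))[1]


def accuracy(kept, validation):
    """Fraction of validation points the current model labels correctly."""
    return sum(predict(kept, x) == y for x, y in validation) / len(validation)


def select_examples(seed, candidates, validation):
    """Greedy selection: tentatively add each candidate and keep it only
    if validation accuracy strictly improves."""
    kept = list(seed)
    best = accuracy(kept, validation)
    for example in candidates:
        score = accuracy(kept + [example], validation)
        if score > best:
            kept.append(example)
            best = score
        # Otherwise the candidate changed nothing or hurt accuracy: discard it.
    return kept, best


# Invented data: "common" for x < 7, "rare" for x >= 7.
SEED = [(2.0, "common"), (9.0, "rare")]
CANDIDATES = [(1.5, "common"), (6.6, "common"), (7.3, "rare"), (2.5, "common"),
              (5.0, "rare"),  # a mislabelled point that should be rejected
              (0.5, "common"), (3.0, "common")]
VALIDATION = [(1, "common"), (3, "common"), (4.5, "common"), (6, "common"),
              (6.5, "common"), (7.5, "rare"), (8, "rare"), (9, "rare")]

kept, final_accuracy = select_examples(SEED, CANDIDATES, VALIDATION)
print(kept, final_accuracy)
```

In this run only the two candidates near the class boundary are kept. The redundant majority-class examples and the mislabelled point are discarded, so the final model is small and roughly balanced between the two classes. Note that a greedy loop like this depends on the order in which candidates are tried.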
Evaluating Results
There are a number of techniques for evaluating the results of a machine learning algorithm. Some of these techniques are also used in natural-language processing. Machine learning techniques are generally evaluated by their results, and some methods (such as neural networks) are considered "black box" forms of classification, since it is not easy to understand how or why the underlying model classifies data the way it does.
In some cases, particularly with classifiers, the output of the machine learning algorithm is compared to a set of data classified by experts. The algorithm is given a set of training data and then classifies a second, separate sample set. A group of experts also annotates this second set, marking each item according to how they believe it should be classified (not according to how they believe a machine would classify it). The training and evaluation data is generally a tiny subset of the data available; if a classifier produces good results for this subset, it should also be successful at classifying a larger set of similar data. The results of the algorithm are compared to the results of the experts and summarized as two scores: precision and recall. Precision is the fraction of the items the classifier put in a category that actually belong there, so it falls when the classifier puts something in a category where it does not belong. Recall is the fraction of the items belonging to a category that the classifier actually put there, so it falls when the classifier fails to put something in a category where it should have.
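A minimal sketch of how the two scores can be computed for one category is shown below. The function name precision_recall and the sample labels are assumptions chosen for illustration; the expert labels play the role of the annotated evaluation set.

```python
def precision_recall(predicted, expert, category):
    """Compare classifier output with expert labels for a single category."""
    pairs = list(zip(predicted, expert))
    # Correctly placed in the category.
    true_pos = sum(p == category and e == category for p, e in pairs)
    # Placed in the category, but the experts say it does not belong.
    false_pos = sum(p == category and e != category for p, e in pairs)
    # Belongs in the category, but the classifier put it elsewhere.
    false_neg = sum(p != category and e == category for p, e in pairs)

    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Invented evaluation set of eight items.
expert = ["relevant", "relevant", "relevant", "other",
          "other", "other", "other", "relevant"]
predicted = ["relevant", "relevant", "other", "relevant",
             "other", "other", "other", "other"]

precision, recall = precision_recall(predicted, expert, "relevant")
print(precision, recall)
```

Here the classifier labels three items "relevant", two of them correctly, so precision is 2/3. The experts marked four items "relevant" and the classifier found two of them, so recall is 1/2.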
Classification and clustering algorithms can also be measured against other algorithms. This is useful when an algorithm attempts to improve performance (speed, for example) while providing levels of precision and recall similar to those of another algorithm. It can also be used to show that an algorithm is an improvement over a previous generation, or to show which algorithm is most useful for organizing data for a particular problem.
Scalability
It is important for machine learning algorithms to take advantage of modern computers, which can run many tasks at once by using multiple cores or CPUs. Since these algorithms are often used to process large quantities of data, an algorithm that can run in parallel and scale accordingly (so that, ideally, it runs twice as fast on two cores as on one) can process larger quantities of data in the same amount of time.
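One common way to make a computation parallel is to split the data into chunks, process each chunk independently, and combine the partial results. The sketch below assumes a made-up per-chunk task (counting values above a threshold) and hypothetical helper names. It uses Python's standard concurrent.futures module with a thread pool to keep the example simple; for CPU-bound work in CPython, a process pool (ProcessPoolExecutor) is usually needed to see a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split a list into at most n_chunks contiguous pieces of similar size."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division, at least 1
    return [items[i:i + size] for i in range(0, len(items), size)]


def count_above_threshold(chunk, threshold=0.5):
    """Per-chunk work (an invented stand-in for a real learning step)."""
    return sum(1 for value in chunk if value > threshold)


def parallel_count(items, n_workers, executor_cls=ThreadPoolExecutor):
    """Process each chunk in its own worker and add up the partial counts."""
    chunks = split_into_chunks(items, n_workers)
    with executor_cls(max_workers=n_workers) as pool:
        return sum(pool.map(count_above_threshold, chunks))


data = [i / 10 for i in range(10)]  # 0.0, 0.1, ..., 0.9
print(parallel_count(data, n_workers=1), parallel_count(data, n_workers=2))
```

The result is the same regardless of the number of workers, because each chunk is handled independently and the partial counts are simply summed. Algorithms whose steps depend on one another are harder to split this way and will not scale as cleanly.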
Biclustering
The following material would go on a new page of the above title (probably linked to the ML page)
Biclustering is an unsupervised machine learning method which searches for a set of common features across the input data. It differs from conventional clustering because it searches for similarities in parts of the data, instead of putting each piece of data into a single group based on all of its features.
Biclustering was first introduced in the early 1970s. Today, it is a commonly used technique in bioinformatics, particularly in the analysis of gene expression, where it is used to identify groups of genes that behave similarly across different people.
Processing
Clustering algorithms take vectors as input. They sort the vectors according to how similar they are by comparing all of the features (data) in each vector, and the result of the clustering is several groups, each containing vectors that are (ideally) similar to one another. Biclustering instead treats all of the vectors as a single input matrix, with one row per vector and one column per feature, and attempts to find regions of that matrix (a subset of rows together with a subset of columns) which look similar.
Biclustering is a useful technique for finding trends in data when each vector is very large, because it can spot trends in specific parts of the data that clustering cannot. A normal clustering algorithm groups pieces of data according to their similarity across all features. Biclustering is able to group pieces of data that are similar in only some of their features, even when they differ everywhere else.
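To make the idea of a similar region concrete, the sketch below searches a small binary matrix for the largest block of rows and columns that contains only ones. This brute-force search is an illustrative assumption, not the method used by any particular biclustering algorithm: it tries every combination of columns, so it is only practical for tiny inputs. The matrix values and the function name largest_constant_bicluster are invented for the example.

```python
from itertools import combinations


def largest_constant_bicluster(matrix, min_rows=2, min_cols=2):
    """Return (rows, cols, area) for the largest all-ones submatrix.

    Brute force: for every set of columns, collect the rows that contain
    a 1 in all of those columns, and keep the block with the largest area.
    """
    n_cols = len(matrix[0])
    best = ((), (), 0)
    for k in range(min_cols, n_cols + 1):
        for cols in combinations(range(n_cols), k):
            rows = tuple(r for r, row in enumerate(matrix)
                         if all(row[c] == 1 for c in cols))
            area = len(rows) * len(cols)
            if len(rows) >= min_rows and area > best[2]:
                best = (rows, cols, area)
    return best


# Invented data: each row is one input vector, each column one feature.
MATRIX = [
    [1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 1, 0],
    [0, 0, 1, 0, 1],
    [1, 0, 1, 1, 1],
]

rows, cols, area = largest_constant_bicluster(MATRIX)
print(rows, cols, area)
```

Rows 0, 2 and 4 are not identical as whole vectors, since they differ in columns 1 and 4, but they agree on columns 0, 2 and 3. The search reports that block of three rows and three columns as the bicluster.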
Applications
Biclustering is particularly useful in the medical field, where it can be used, for example, to find genes related to a specific disease in a group of patients. If each vector represents how strongly one person expresses each of a set of genes, biclustering can be used to find a set of genes that is expressed similarly across the patients who have cancer, which suggests that those genes are associated with the disease. It can even be used to find similarities and differences between different varieties of cancer.