A random forest algorithm is a supervised machine learning algorithm used mainly for classification and regression. It is also a form of ensemble modeling. If that language sounds complicated for people outside the field, here it is again in simple terms: a random forest is an artificial intelligence program that sorts things into categories or predicts outcomes in certain situations.
Random forest algorithms consist of a large collection of decision trees, each trained on a randomly drawn sample of the training data so that the combined model is more robust than any single tree. What is a decision tree?
A decision tree is a model that branches out into different possibilities and therefore looks a little like a tree when drawn for humans. The basic problem that you are trying to solve, such as a classification problem or a regression problem, becomes the trunk of the tree, what is called the root node.
From there, each test on the data, a true-or-false question about one of its features, splits the path in one direction or the other. That fork becomes a branch of the tree and is known as a decision node. When the algorithm has run its full course and there are no more possibilities and it stops splitting, the path ends in a leaf node, which holds the prediction.
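The root, decision, and leaf nodes described above can be sketched as nested true/false tests. This is a hand-rolled toy, not a learned tree: the feature names and thresholds are hypothetical, chosen just to make the structure visible.

```python
# A tiny hand-built decision tree (hypothetical fruit thresholds).
# Each `if` is a true/false split; each `return` is a leaf node.

def classify_fruit(weight_g, smooth_skin):
    if weight_g > 150:          # root node: first true/false split
        if smooth_skin:         # decision node: a further split
            return "apple"      # leaf node: no more splitting
        return "orange"         # leaf node
    return "plum"               # leaf node

print(classify_fruit(180, True))   # apple
print(classify_fruit(120, False))  # plum
```

A real learning algorithm chooses which feature to test and where to put each threshold; the branching structure, however, is exactly this shape.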
An individual decision tree, however, can have high variance and be a poor predictor or classifier. Therefore, in order to get predictions with both low bias and low variance, we need a large number of trees, each trained on its own randomly drawn sample. That is the origin of the name: forest, because there is a large number of decision trees, and random, because the samples are drawn at random so that the trees do not all repeat the same mistakes.
This is the power of consensus! A good analogy would be citing news sources. If I were to cite a source from Fox News, people might question my source because of its politically conservative bias. However, if I corroborate my source with a politically liberal news outlet, like MSNBC, and both sources agree, I have cross-checked two different sources and my claim becomes more credible. A random forest reaches consensus the same way: just go with the majority vote from a vast number of trees.
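The majority vote itself is a one-liner. Here is a minimal sketch, where `tree_predictions` stands in for the per-tree outputs of an already-trained forest (the class labels are made up for illustration):

```python
from collections import Counter

def forest_predict(tree_predictions):
    """Majority vote: the class predicted by the most trees wins."""
    votes = Counter(tree_predictions)
    return votes.most_common(1)[0][0]

# Hypothetical votes from five trees for a single sample:
print(forest_predict(["cat", "dog", "cat", "cat", "dog"]))  # cat
```

For regression, the aggregation step averages the trees' numeric outputs instead of counting votes.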
If we want to increase the robustness of our model even further, with low variance and low bias, we can bag the decision trees. To bag trees means to take bootstrap samples and aggregate the results (bootstrap aggregating, or "bagging"). Concretely, we draw several random samples, with replacement, from the training set, fit a tree to each sample, and then average the trees' predictions (or take their majority vote).
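The two halves of bagging, bootstrap sampling and aggregation, can be shown with a deliberately simple stand-in model. In this sketch the "model" fit to each bootstrap sample is just the sample mean (a real forest would fit a decision tree there); the numbers are made up for illustration.

```python
import random
import statistics

def bootstrap_sample(data, rng):
    """Draw len(data) points WITH replacement: a bootstrap sample."""
    return [rng.choice(data) for _ in data]

def bagged_estimate(data, n_models=100, seed=0):
    """Fit a trivially simple 'model' (the mean) to each bootstrap
    sample, then aggregate by averaging the per-model estimates."""
    rng = random.Random(seed)
    estimates = [statistics.mean(bootstrap_sample(data, rng))
                 for _ in range(n_models)]
    return statistics.mean(estimates)

data = [2.0, 3.5, 4.0, 5.5, 6.0]
print(round(bagged_estimate(data), 2))
```

Because each sample is drawn with replacement, some points appear twice and others not at all, which is what makes each tree's view of the data slightly different.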
On top of bagging, a random forest also reduces the number of features each tree considers at any split. This keeps the trees from all leaning on the same dominant features, and it increases the speed of each training cycle, which means we have time to train more trees.
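Picking the per-split feature subset is itself simple. A common rule of thumb for classification forests is to consider roughly the square root of the total number of features; this sketch assumes that convention (the feature count of 16 is arbitrary):

```python
import math
import random

def random_feature_subset(n_features, rng):
    """Pick about sqrt(n_features) distinct feature indices at
    random, a common default for random forest classifiers."""
    k = max(1, int(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), k))

rng = random.Random(42)
# Each split considers only its own small, random subset of features:
for split in range(3):
    print(random_feature_subset(16, rng))
```

Because every split sees a different subset, no single strong feature can dominate every tree, which decorrelates the forest and lowers its variance.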
The Merit of Randomness
Randomness protects your model from the quirks of any single sample. You do not want dirty data driving your test results. The final output from your learning algorithm will only be as good as the data you feed it. As your datasets get larger, as they would within a Big Data data lake with billions of data points, any systematic prediction error gets magnified by the sheer size of the dataset. The information gain could be tainted if you were to rely on a single decision tree.
Even though algorithms do a lot of the work, machine learning is a science for patient people. It takes discipline to build good datasets and to run training and test sets with a reasonable level of reliability. Python is one of the easiest coding languages to learn, and it is fortunate that most machine learning libraries use Python, but coding is never completely easy.