In this lecture, we will cover two popular machine learning algorithms: random forests and gradient boosted tree models. Both are ensemble methods, which means they combine multiple weak learners to create a single strong learner. Random forests are made up of full decision trees trained independently, while gradient boosted tree models are built from shallow trees (often decision stumps) trained sequentially.
We will discuss the advantages and disadvantages of both algorithms, as well as how to choose the right one for your particular problem. We will also provide some guidance on how to use these algorithms in practice.
Random forests are a type of ensemble method that uses decision trees as the weak learners.
Each tree in a random forest is trained on a randomly selected subset of the data, and the predictions from all of the trees are then combined to make a final prediction.
Random forests are known for their accuracy and robustness.
They can be used for both classification and regression tasks.
Random forests work by building a number of decision trees on a randomly selected subset of the data. The trees are then combined to make a final prediction.
The trees are built recursively. At each step, a node is split into two branches using the best split found among a randomly selected subset of the features. The splitting process continues until all of the data points in a node have the same label or another stopping criterion (such as a maximum depth or minimum node size) is reached.
The predictions from all of the trees are then combined to make a final prediction: a majority vote for classification, or an average of the trees' outputs for regression.
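To make the bootstrapping and voting concrete, here is a minimal sketch of the same idea built by hand from scikit-learn decision trees; the number of trees, the max_features setting, and the vote-counting code are illustrative choices, not the internals of RandomForestClassifier.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

# Grow each tree on a bootstrap sample, considering a random subset of features at each split
trees = []
for i in range(25):
    idx = rng.integers(0, len(X), size=len(X))  # sample rows with replacement
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Combine the trees by majority vote over their individual predictions
all_preds = np.stack([t.predict(X) for t in trees])  # shape (n_trees, n_samples)
forest_pred = np.array([np.bincount(col).argmax() for col in all_preds.T])
print("training accuracy:", (forest_pred == y).mean())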
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
from sklearn.model_selection import train_test_split
#Load dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3) # 70%/30% split
# Create a random forest classifier with 100 trees
clf = RandomForestClassifier(n_estimators=100)
# Train the model using the training set
clf.fit(X_train, y_train)
# Predict the labels of the test set
y_pred = clf.predict(X_test)
print(y_pred)
Gradient Boosted Decision Tree (GBDT) models are a type of ensemble method that uses either decision stumps or full decision trees as the weak learners.
Each tree in a gradient boosted tree model is trained to correct the errors made by the previous trees.
Gradient boosted tree models are known for their accuracy.
They may be used for classification and regression tasks.
Gradient boosted tree models work by building a sequence of small trees, often decision stumps, on the data. The trees are then combined to make a final prediction.
The trees are built sequentially. At each step, a new tree is trained to correct the errors made by the previous trees. This is done by fitting the new tree to the negative gradient of the loss function with respect to the current predictions; for squared-error loss, this is the same as fitting the residuals left by the previous trees.
The final prediction is the sum of the contributions from all of the trees, with each tree's contribution typically scaled by a learning rate, rather than a simple average.
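As a rough sketch of the "fit the residuals" idea for regression with squared-error loss (where the residuals are exactly the negative gradient), the boosting loop looks something like the following; the synthetic data, learning rate, and number of rounds are illustrative choices, not the defaults of any library.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full(len(y), y.mean())  # start from a constant prediction
stumps = []
for _ in range(100):
    residuals = y - prediction                      # negative gradient of squared-error loss
    stump = DecisionTreeRegressor(max_depth=1)      # a decision stump
    stump.fit(X, residuals)                         # each stump corrects the remaining errors
    prediction += learning_rate * stump.predict(X)  # contributions are summed, scaled by the learning rate
    stumps.append(stump)

print("training MSE:", np.mean((y - prediction) ** 2))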
Image: Hands-On Machine Learning with R, Bradley Boehmke & Brandon Greenwell, Figure 12.1 (upscaled in Pixelmator Pro).
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier
# Generate a synthetic binary classification dataset
X, y = make_hastie_10_2(random_state=0)
# Use the first 2,000 samples for training and the rest for testing
X_train, X_test = X[:2000], X[2000:]
y_train, y_test = y[:2000], y[2000:]
# 100 boosting rounds of depth-1 trees (decision stumps)
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0)
clf.fit(X_train, y_train)
print(clf.predict(X_test))
In R, a gradient boosted model can be fit with the gbm package:
library(gbm)
library(rsample) # data splitting
set.seed(123)
ames_split <- initial_split(AmesHousing::make_ames(), prop = .7)
ames_train <- training(ames_split)
ames_test <- testing(ames_split)
# train GBM model
set.seed(123)
gbm.fit <- gbm(
  formula = Sale_Price ~ .,
  distribution = "gaussian",
  data = ames_train,
  n.trees = 10000,
  interaction.depth = 1,
  shrinkage = 0.001,
  cv.folds = 5,
  n.cores = NULL, # will use all cores by default
  verbose = FALSE
)
# print results
print(gbm.fit)
So, which algorithm should you choose?
It depends on your particular problem. Simple decision trees are the easiest to interpret. Random forests are robust, require relatively little tuning, and build their trees independently (so they parallelize well), which makes them a strong default. Gradient boosted trees often achieve higher accuracy on harder problems, but they are more sensitive to hyperparameters such as the learning rate and the number of trees, and they must be trained sequentially. A practical approach is to compare both with cross-validation, as in the sketch below.
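For example, a minimal cross-validated comparison might look like this; the dataset and hyperparameters here are placeholders that you would swap for your own problem.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
models = {
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient boosting": GradientBoostingClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validated accuracy
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")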
Keep in mind the scale of model complexity:
Decision Tree \(\rightarrow\) Random Forest \(\rightarrow\) GBDT