
Building Random Forest Algorithm from Scratch in Python

Without relying on high-level libraries

ANSHUMAN JHA

Table of Contents
1. Introduction to Random Forest
2. Building Blocks of a Random Forest
   a. Decision Tree
   b. Bootstrap Aggregation (Bagging)
   c. Random Feature Selection
3. The Structure of a Random Forest
4. Implementing Random Forest from Scratch in Python
   a. Implementing the Decision Tree
      i. Calculate the Gini Index
      ii. Split a Dataset Based on an Attribute and an Attribute Value
      iii. Select the Best Split Point for a Dataset
      iv. Create a Terminal Node Value
      v. Create Child Splits for a Node or Make Terminal
      vi. Build a Decision Tree
      vii. Make a Prediction with a Decision Tree
      viii. Create a Random Subsample from the Dataset with Replacement
      ix. Prediction with a List of Bagged Trees
   b. Building the Random Forest
      i. Create a Random Forest
      ii. Evaluate the Algorithm Using Cross-Validation
      iii. Split a Dataset into K Folds
      iv. Calculate Accuracy Percentage
      v. Test the Random Forest Algorithm
      vi. Load Dataset Function
      vii. Main Script to Run the Random Forest Algorithm
5. Conclusion


1. Introduction to Random Forest Algorithm


Random Forest is a popular ensemble learning method for classification and regression tasks. It builds multiple
decision trees and combines their outputs to get a more accurate and stable prediction than a single tree, which also
reduces the risk of overfitting. Each tree is built using a different subset of the training data, and the final
prediction is made by averaging the predictions of all trees (for regression) or by majority voting (for classification).

In this post, we will build a Random Forest algorithm from scratch in Python, without relying on high-level libraries
like scikit-learn.
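
As a small illustration of the two aggregation rules, the sketch below combines made-up per-tree outputs for a single input row. The helper names `majority_vote` and `average` are hypothetical and are not part of the implementation that follows.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most frequent label among the per-tree predictions."""
    return Counter(predictions).most_common(1)[0][0]

def average(predictions):
    """Return the mean of the per-tree numeric predictions."""
    return sum(predictions) / len(predictions)

# Made-up outputs from five trees for one input row
class_votes = [1, 0, 1, 1, 0]
regression_outputs = [2.0, 3.0, 4.0, 3.0, 3.0]

print(majority_vote(class_votes))   # 1
print(average(regression_outputs))  # 3.0
```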

2. Building Blocks of a Random Forest


Decision Tree
A decision tree is a flowchart-like structure where each internal node represents a decision based on a feature, each
branch represents the outcome of the decision, and each leaf node represents a class label (for classification) or a
continuous value (for regression).

Bootstrap Aggregation (Bagging)


Bagging involves randomly sampling with replacement from the training data to create multiple datasets. Each decision
tree is trained on a different dataset, reducing variance and improving robustness.
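
To make sampling with replacement concrete, here is a minimal sketch that draws row indices for one bootstrap sample and lists the rows that were never drawn. The function name `bootstrap_indices` and the seed are illustrative assumptions. For large datasets a bootstrap sample of the same size contains roughly 63% of the distinct rows, so every tree sees a somewhat different view of the data.

```python
from random import Random

def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

rng = Random(0)  # fixed seed so the run is repeatable
n_rows = 10
drawn = bootstrap_indices(n_rows, rng)
# Rows that never appear in the sample are often called "out-of-bag"
out_of_bag = sorted(set(range(n_rows)) - set(drawn))

print("drawn indices:", drawn)
print("never drawn:", out_of_bag)
```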

Random Feature Selection


Random feature selection means that each node in a decision tree is split using a random subset of features. This
process helps to decorrelate the trees and improve the model's performance.
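
One way to picture this step is shown below: at every node a fresh subset of column indices is drawn, and only those columns are considered for the split. This is a sketch under assumptions; `sample_features` is a hypothetical helper, and the square root of the feature count is used as the subset size because the main script later computes `n_features` that way.

```python
from math import sqrt
from random import Random

def sample_features(n_total, n_features, rng):
    """Pick n_features distinct column indices out of n_total."""
    return rng.sample(range(n_total), n_features)

rng = Random(1)
n_total = 9                      # assumed number of input columns
n_features = int(sqrt(n_total))  # 3 candidate columns per split
for node in range(3):
    candidates = sorted(sample_features(n_total, n_features, rng))
    print(f"node {node}: candidate features {candidates}")
```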


3. The Structure of a Random Forest Algorithm


The structure of the algorithm is laid out as steps and sub-steps with labels and connections. Each step corresponds
to a function or a key part of the implementation described in Section 4.


4. Implementation in Python
Let's implement a simple Random Forest Algorithm in Python.

Step 1: Implementing the Decision Tree


Let's start by implementing the core component of our Random Forest: the decision tree.

Calculate the Gini Index


The Gini index is a measure of impurity or diversity used to evaluate splits in decision trees.

import numpy as np

def gini_index(groups, classes):
    # Count all samples at split point
    n_instances = float(sum([len(group) for group in groups]))
    # Sum weighted Gini index for each group
    gini = 0.0
    for group in groups:
        size = float(len(group))
        # Avoid division by zero
        if size == 0:
            continue
        score = 0.0
        # Score the group based on the score for each class
        for class_val in classes:
            p = [row[-1] for row in group].count(class_val) / size
            score += p * p
        # Weight the group score by its relative size
        gini += (1.0 - score) * (size / n_instances)
    return gini
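
A worked example can help calibrate the numbers. For a group with class proportions p_k the impurity is 1 - sum(p_k^2), and the split score is the size-weighted sum over its groups. The sketch below is a compact, self-contained variant (the name `gini_of_split` is hypothetical and it infers the classes from the rows instead of taking them as an argument). It scores a perfect two-class split and the worst possible one.

```python
from collections import Counter

def gini_of_split(groups):
    """Weighted Gini impurity of a split; each row's last item is its class."""
    n_total = sum(len(group) for group in groups)
    gini = 0.0
    for group in groups:
        if not group:
            continue  # an empty group contributes nothing
        counts = Counter(row[-1] for row in group)
        purity = sum((count / len(group)) ** 2 for count in counts.values())
        gini += (1.0 - purity) * len(group) / n_total
    return gini

# Each group holds rows of the form [feature, class]
perfect = [[[1.0, 0], [2.0, 0]], [[3.0, 1], [4.0, 1]]]
worst = [[[1.0, 0], [2.0, 1]], [[3.0, 0], [4.0, 1]]]

print(gini_of_split(perfect))  # 0.0
print(gini_of_split(worst))    # 0.5
```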

Split a Dataset Based on an Attribute and an Attribute Value


This function splits the dataset into two groups based on a feature index and a threshold value.

def test_split(index, value, dataset):
    # Rows below the threshold go left, all other rows go right
    left, right = list(), list()
    for row in dataset:
        if row[index] < value:
            left.append(row)
        else:
            right.append(row)
    return left, right

Select the Best Split Point for a Dataset


This function evaluates all potential splits and selects the one with the lowest Gini index.

def get_split(dataset):
    class_values = list(set(row[-1] for row in dataset))
    # The Gini index never exceeds 1, so 999 is a safe starting score
    b_index, b_value, b_score, b_groups = 999, 999, 999, None
    # Try every value of every feature (the last column is the class)
    for index in range(len(dataset[0])-1):
        for row in dataset:
            groups = test_split(index, row[index], dataset)
            gini = gini_index(groups, class_values)
            if gini < b_score:
                b_index, b_value, b_score, b_groups = index, row[index], gini, groups
    return {'index': b_index, 'value': b_value, 'groups': b_groups}


Create a Terminal Node Value


This function determines the output value for a terminal node, which is the most common class in the group.

def to_terminal(group):
    # Collect the class labels and return the most frequent one
    outcomes = [row[-1] for row in group]
    return max(set(outcomes), key=outcomes.count)
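
One detail worth knowing: when two classes are equally common, `max(set(outcomes), key=outcomes.count)` returns whichever label the set happens to yield first, which is not tied to the order of the rows. If a predictable tie-break matters, a variant based on `collections.Counter` is one option; the name `majority_class` below is hypothetical. Since Python 3.7, `most_common` orders labels with equal counts by first appearance.

```python
from collections import Counter

def majority_class(group):
    """Most common class label; ties go to the label seen first in group."""
    outcomes = [row[-1] for row in group]
    return Counter(outcomes).most_common(1)[0][0]

# Two 'a' rows and two 'b' rows: 'b' appears first, so it wins the tie
tied_group = [[2.7, 'b'], [1.3, 'a'], [3.1, 'a'], [0.4, 'b']]
print(majority_class(tied_group))  # b
```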

Create Child Splits for a Node or Make Terminal


This recursive function splits nodes into child nodes or makes them terminal nodes if stopping criteria are met.

def split(node, max_depth, min_size, depth):
    left, right = node['groups']
    del(node['groups'])
    # Check for no split
    if not left or not right:
        node['left'] = node['right'] = to_terminal(left + right)
        return
    # Check for max depth
    if depth >= max_depth:
        node['left'], node['right'] = to_terminal(left), to_terminal(right)
        return
    # Process left child
    if len(left) <= min_size:
        node['left'] = to_terminal(left)
    else:
        node['left'] = get_split(left)
        split(node['left'], max_depth, min_size, depth+1)
    # Process right child
    if len(right) <= min_size:
        node['right'] = to_terminal(right)
    else:
        node['right'] = get_split(right)
        split(node['right'], max_depth, min_size, depth+1)

Build a Decision Tree


This function builds a decision tree by recursively splitting nodes.

def build_tree(train, max_depth, min_size):
    # Find the root split, then grow the tree recursively from depth 1
    root = get_split(train)
    split(root, max_depth, min_size, 1)
    return root


Make a Prediction with a Decision Tree


This function predicts the class label for a given row of data using a decision tree.

def predict(node, row):
    # Follow the left branch when the row's value is below the threshold
    if row[node['index']] < node['value']:
        if isinstance(node['left'], dict):
            return predict(node['left'], row)
        else:
            return node['left']
    else:
        if isinstance(node['right'], dict):
            return predict(node['right'], row)
        else:
            return node['right']
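
The tree is a nested dictionary in which an internal node has the keys 'index', 'value', 'left' and 'right', and a leaf is a bare class label. The sketch below walks a small hand-built tree with a loop instead of recursion. `predict_iterative` is a hypothetical alternative for illustration, and the tree values are made up.

```python
def predict_iterative(node, row):
    """Walk from the root to a leaf without recursion."""
    while isinstance(node, dict):
        branch = 'left' if row[node['index']] < node['value'] else 'right'
        node = node[branch]
    return node  # a leaf is the class label itself

# Hand-built tree: split on feature 0 at 5.0, then on feature 1 at 2.0
tree = {
    'index': 0, 'value': 5.0,
    'left': 0,
    'right': {'index': 1, 'value': 2.0, 'left': 0, 'right': 1},
}

print(predict_iterative(tree, [3.0, 9.0]))  # 0
print(predict_iterative(tree, [7.0, 1.0]))  # 0
print(predict_iterative(tree, [7.0, 4.0]))  # 1
```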

Create a Random Subsample from the Dataset with Replacement


This function creates a bootstrap sample from the dataset, which is used to train each decision tree in the
Random Forest.

from random import seed, randrange

def subsample(dataset, ratio):
    sample = list()
    n_sample = round(len(dataset) * ratio)
    # Draw rows at random; the same row may be picked more than once
    while len(sample) < n_sample:
        index = randrange(len(dataset))
        sample.append(dataset[index])
    return sample

Make a Prediction with a List of Bagged Trees


This function predicts the class label for a given row of data by aggregating predictions from multiple decision
trees.

def bagging_predict(trees, row):
    # Each tree votes; the most common prediction wins
    predictions = [predict(tree, row) for tree in trees]
    return max(set(predictions), key=predictions.count)

Step 2: Building the Random Forest


With our decision tree implementation ready, we can now build the Random Forest.

Create a Random Forest

This function builds the Random Forest by creating multiple decision trees, each trained on a different
bootstrap sample of the training data.

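def random_forest(train, max_depth, min_size, sample_size, n_trees, n_features):
    # Note: n_features is accepted but not passed to build_tree, so as written
    # every tree considers all features at each split.
    trees = list()
    for _ in range(n_trees):
        # Train each tree on its own bootstrap sample
        sample = subsample(train, sample_size)
        tree = build_tree(sample, max_depth, min_size)
        trees.append(tree)
    return trees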
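
The function above receives `n_features`, but the split search shown earlier evaluates every feature. A minimal sketch of how per-split feature sampling could look is given below. It is an assumption about one possible extension, not the author's code: `get_split_sampled`, `split_rows` and `weighted_gini` are hypothetical names, and the helpers are redefined here so the block runs on its own.

```python
from collections import Counter
from random import Random

def split_rows(index, value, dataset):
    """Rows below the threshold go left, the rest go right."""
    left = [row for row in dataset if row[index] < value]
    right = [row for row in dataset if not row[index] < value]
    return left, right

def weighted_gini(groups):
    """Size-weighted Gini impurity; each row's last item is its class."""
    n_total = sum(len(group) for group in groups)
    gini = 0.0
    for group in groups:
        if group:
            counts = Counter(row[-1] for row in group)
            purity = sum((c / len(group)) ** 2 for c in counts.values())
            gini += (1.0 - purity) * len(group) / n_total
    return gini

def get_split_sampled(dataset, n_features, rng):
    """Best split over a random subset of n_features columns."""
    n_columns = len(dataset[0]) - 1  # the last column is the class
    features = rng.sample(range(n_columns), n_features)
    best = {'index': None, 'value': None, 'groups': None}
    best_score = float('inf')
    for index in features:
        for row in dataset:
            groups = split_rows(index, row[index], dataset)
            score = weighted_gini(groups)
            if score < best_score:
                best_score = score
                best = {'index': index, 'value': row[index], 'groups': groups}
    return best

# Made-up rows: [feature 0, feature 1, class]
rows = [[1.0, 7.0, 0], [2.0, 3.0, 0], [8.0, 6.0, 1], [9.0, 2.0, 1]]
node = get_split_sampled(rows, n_features=1, rng=Random(3))
print(node['index'], node['value'])
```

To use something like this in the forest, `build_tree` and `split` would also need to accept `n_features` and pass it down to every split search.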


Evaluate the Algorithm Using Cross-Validation


This function evaluates the performance of the Random Forest algorithm using cross-validation.

def evaluate_algorithm(dataset, algorithm, n_folds, *args):
    folds = cross_validation_split(dataset, n_folds)
    scores = list()
    for fold in folds:
        # Train on every fold except the held-out one
        train_set = list(folds)
        train_set.remove(fold)
        train_set = sum(train_set, [])
        # Copy the held-out rows and hide their labels
        test_set = list()
        for row in fold:
            row_copy = list(row)
            test_set.append(row_copy)
            row_copy[-1] = None
        # The algorithm needs both the training rows and the rows to predict
        predicted = algorithm(train_set, test_set, *args)
        actual = [row[-1] for row in fold]
        accuracy = accuracy_metric(actual, predicted)
        scores.append(accuracy)
    return scores

Split a Dataset into K Folds


This function splits the dataset into k folds for cross-validation.

def cross_validation_split(dataset, n_folds):
    dataset_split = list()
    dataset_copy = list(dataset)
    fold_size = int(len(dataset) / n_folds)
    for _ in range(n_folds):
        fold = list()
        # Move randomly chosen rows into the fold until it is full
        while len(fold) < fold_size:
            index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(index))
        dataset_split.append(fold)
    return dataset_split
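
Because `fold_size` is rounded down, rows left over after filling the folds are never used: with 23 rows and 5 folds, each fold gets 4 rows and 3 rows are dropped. If every row should take part, one alternative is to shuffle the indices and deal them out in turn. The sketch below shows that idea; `k_fold_indices` is a hypothetical helper that returns row indices, not rows.

```python
from random import Random

def k_fold_indices(n_rows, n_folds, rng):
    """Shuffle row indices and deal them into n_folds folds, leaving none out."""
    indices = list(range(n_rows))
    rng.shuffle(indices)
    # Fold i takes every n_folds-th index starting at position i
    return [indices[i::n_folds] for i in range(n_folds)]

folds = k_fold_indices(n_rows=23, n_folds=5, rng=Random(1))
print([len(fold) for fold in folds])  # [5, 5, 5, 4, 4]
```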

Calculate Accuracy Percentage


This function calculates the accuracy percentage of the predictions.

def accuracy_metric(actual, predicted):
    # Count matching labels and express them as a percentage
    correct = 0
    for i in range(len(actual)):
        if actual[i] == predicted[i]:
            correct += 1
    return correct / float(len(actual)) * 100.0

Test the Random Forest Algorithm


This function trains a Random Forest on the training rows and returns its predictions for the test rows.

def random_forest_algorithm(train, test, max_depth, min_size, sample_size, n_trees, n_features):
    # Build the forest, then let the trees vote on each test row
    trees = random_forest(train, max_depth, min_size, sample_size, n_trees, n_features)
    predictions = [bagging_predict(trees, row) for row in test]
    return predictions


Load Dataset Function


This function loads a dataset from a CSV file.

def load_dataset(filename):
    dataset = list()
    with open(filename, 'r') as file:
        for line in file:
            # Skip blank lines; every column, including the class, becomes a float
            if line.strip():
                dataset.append(list(map(float, line.split(','))))
    return dataset
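
The loader above expects a purely numeric file with no header row. For files that do start with a header, a variant built on the standard `csv` module could look like the sketch below. `load_csv_rows` and its `has_header` flag are hypothetical, and the in-memory sample stands in for a real file.

```python
import csv
import io

def load_csv_rows(file_obj, has_header=False):
    """Parse numeric CSV rows from an open file; optionally skip one header row."""
    reader = csv.reader(file_obj)
    if has_header:
        next(reader, None)  # discard the header line if there is one
    # csv.reader yields an empty list for blank lines, which we skip
    return [[float(cell) for cell in row] for row in reader if row]

sample = io.StringIO("x1,x2,label\n2.5,1.0,0\n\n7.1,3.2,1\n")
rows = load_csv_rows(sample, has_header=True)
print(rows)  # [[2.5, 1.0, 0.0], [7.1, 3.2, 1.0]]
```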

Main Script to Run the Random Forest Algorithm


The main script ties everything together and runs the Random Forest algorithm on a given dataset.

seed(1)
filename = 'data.csv'
dataset = load_dataset(filename)
n_folds = 5
max_depth = 10
min_size = 1
sample_size = 1.0
n_trees = 10
n_features = int(np.sqrt(len(dataset[0])-1))
scores = evaluate_algorithm(dataset, random_forest_algorithm, n_folds, max_depth, min_size,
                            sample_size, n_trees, n_features)
print(f'Scores: {scores}')
print(f'Mean Accuracy: {sum(scores)/float(len(scores)):.3f}%')
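
The script expects a numeric CSV file whose last column is the class label. If no dataset is at hand, a small synthetic file is enough to try it out. The sketch below writes one into the system temporary directory; the function name, the two-feature layout, and the labeling rule are all made-up assumptions for illustration.

```python
import csv
import os
import tempfile
from random import Random

def write_synthetic_csv(path, n_rows, rng):
    """Write rows of two features plus a 0/1 label that depends on the first feature."""
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        for _ in range(n_rows):
            x1 = round(rng.uniform(0.0, 10.0), 3)
            x2 = round(rng.uniform(0.0, 10.0), 3)
            label = 1 if x1 >= 5.0 else 0
            writer.writerow([x1, x2, label])

path = os.path.join(tempfile.gettempdir(), 'data.csv')
write_synthetic_csv(path, n_rows=100, rng=Random(1))
print("wrote", path)
```

Pointing `filename` at the printed path lets the main script run on this file.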

By breaking down each function and providing detailed comments, we have a clear understanding of how each part of
the Random Forest algorithm is implemented from scratch in Python.

5. Conclusion
Building a Random Forest from scratch involves understanding and implementing several key components: decision
trees, bootstrap aggregation, and random feature selection. By combining these elements, we can create a powerful
ensemble model that improves prediction accuracy and reduces overfitting.

This implementation provides a foundational understanding of how Random Forest works and allows you to customize
and extend the algorithm for specific use cases.

Constructive comments and feedback are welcome.

