
How to split a Dataset into Train and Test Sets using Python

Last Updated : 18 Apr, 2025

One of the most important steps in preparing data for training an ML model is splitting the dataset into training and testing sets. This simply means dividing the data into two parts: one to train the machine learning model (the training set) and another to evaluate how well it performs on unseen data (the testing set). The model is fit on the training set, whose statistics it is allowed to learn; the test set is held out and used only for evaluating predictions.

We’ll see how to split a dataset into train and test sets using Python, relying on the scikit-learn library to perform the split efficiently. Whether you’re working with numerical data, text, or images, this is an essential part of any supervised machine learning workflow.

Installation:

The scikit-learn library can be installed using pip:

Shell
pip install scikit-learn

Alternatively, it can be downloaded from the official scikit-learn website.
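
To confirm the installation succeeded, you can print the installed version (a minimal check; the exact version string you see depends on your environment):

Python
import sklearn

# a successful import plus a version string confirms the installation
print(sklearn.__version__)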

Dataset Splitting

Scikit-learn is one of the most widely used machine learning libraries in Python. It provides a range of tools for building models, pre-processing data, and evaluating performance. For splitting datasets, it provides a handy function called train_test_split() within the model_selection module, making it simple to divide your data into training and testing sets.

Syntax:

train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)

Parameters:

  • *arrays: The data you want to split. This can be lists, NumPy arrays, pandas DataFrames, or sparse matrices; several arrays can be passed at once and are split consistently.
  • test_size: The portion of the data that should go into the test set. A float between 0.0 and 1.0 is treated as a proportion (for example, 0.2 means 20% of the data is used for testing), while an int is an absolute number of samples.
  • train_size: The portion of the data that should go into the training set, in the same format as test_size. If not set, it is automatically calculated as the complement of test_size.
  • random_state: A number that seeds the shuffle so the split is the same every time you run the code.
  • shuffle: If True (the default), the data is shuffled before splitting, so the two sets are random samples rather than contiguous slices of the data.
  • stratify: An array of labels (usually your target y) used to keep the same class distribution in both the train and test sets. This is especially useful for classification problems with imbalanced classes, as the sketch after this list shows.
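
Before moving to a real dataset, here is a minimal sketch of these parameters in action. The toy data below is purely illustrative: ten samples with an 80/20 class imbalance, split in half while preserving the class ratio via stratify.

Python
from sklearn.model_selection import train_test_split

# toy data: 8 samples of class 0 and 2 samples of class 1 (80/20 imbalance)
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2

# stratify=y preserves the 80/20 class ratio in both halves of the split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42, stratify=y)

print(y_train)  # four 0s and one 1
print(y_test)   # four 0s and one 1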

Example

Let us take a sample dataset to perform the split on. The data is a real-estate dataset in CSV form (Real-estate.csv in the code below).


In the example, we first import pandas and the required scikit-learn functions. Then we load the CSV file using the read_csv() function, which stores the data in a DataFrame called df. We want to predict the house price, which is in the last column, so we set that as y (the target). All the other columns are used as features, stored in X.

We use train_test_split() to split the data:

  • test_size=0.05 means 5% of the data is used for testing, and 95% for training.
  • random_state=0 ensures the split is the same every time we run the code.
Python
# import modules
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# read the dataset into a DataFrame
df = pd.read_csv('Real-estate.csv')

# features: every column except the last; target: the last column (house price)
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

# split the dataset: 95% train, 5% test, reproducible via random_state
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.05, random_state=0)

# confirm the sizes of the resulting sets
print(X_train.shape, X_test.shape)

Output:

The shapes of the training and test feature sets are printed; the exact numbers depend on how many rows the CSV file contains.
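
Since LinearRegression is already imported, a natural next step is to fit it on the training set and evaluate it on the held-out test set. This is a minimal sketch continuing from the code above, not part of the split itself:

Python
# fit the model on the training set only
model = LinearRegression()
model.fit(X_train, y_train)

# evaluate on unseen data; score() returns the R^2 value for regressors
print(model.score(X_test, y_test))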

Hence, we have split our dataset into training and testing sets. If you want to learn more about improving your machine learning workflow, you may explore:

  1. Stratified sampling
  2. Cross validation (a brief sketch follows this list)
  3. Handling imbalanced datasets
  4. Pre-processing before splitting
  5. Machine Learning Models
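
As a pointer for item 2, here is a brief sketch of cross-validation, continuing with the X and y from the example above: instead of relying on a single split, cross_val_score evaluates the model on five different train/test partitions, so every sample is used for testing exactly once.

Python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation: the default score for regressors is R^2
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print(scores.mean(), scores.std())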

