How to split a Dataset into Train and Test Sets using Python

Last Updated : 7 Apr, 2026

To build and evaluate a machine learning model, the dataset must be divided into two parts: one for training the model and another for testing its performance. This split measures how well the model works on unseen data and gives a realistic picture of how it will perform in real-world scenarios.

  • The training set is used to learn patterns from the data.
  • The test set is used to evaluate how well the model performs on new data.
  • It prevents overfitting by avoiding training and testing on the same data.
  • It provides a realistic estimate of model accuracy.
  • It allows fair comparison between different models.

Method 1: Splitting Dataset Using train_test_split()

The train_test_split() function from scikit-learn is the most common and easiest way to split a dataset.

Here:

  • test_size=0.2 reserves 20% of the data for testing
  • The remaining 80% is used for training
  • random_state ensures the same split on every run
Python
from sklearn.model_selection import train_test_split
import pandas as pd

data = {
    'Age': [22, 25, 47, 52, 46],
    'Salary': [25000, 32000, 48000, 52000, 50000],
    'Purchased': [0, 1, 1, 0, 1]
}

df = pd.DataFrame(data)

X = df[['Age', 'Salary']]
y = df['Purchased']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train, '\n', X_test)
print(y_train, '\n', y_test)

Output:

(Printed train and test splits: X_train and y_train hold four rows, X_test and y_test hold the remaining row.)

This shows how the dataset is split. Now let's check the model's accuracy using a logistic regression model.

Python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Creating and training the model
model = LogisticRegression()
model.fit(X_train, y_train)

# Making predictions on test data
y_pred = model.predict(X_test)

# Evaluating model performance
acc = accuracy_score(y_test, y_pred)
print("Accuracy:", acc)

Output:

Accuracy: 1.0

We can see the model performs well after the train-test split. Keep in mind that with only one test sample here, accuracy is not a reliable measure; larger datasets give more meaningful estimates.
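For classification data, train_test_split() also accepts a stratify argument that keeps the class proportions roughly equal in both splits. A minimal sketch on a slightly larger, made-up toy dataset:

```python
from sklearn.model_selection import train_test_split
import pandas as pd

# Toy dataset with an imbalanced target: 8 zeros, 4 ones
df = pd.DataFrame({
    'Age': [22, 25, 47, 52, 46, 31, 29, 40, 35, 28, 55, 60],
    'Purchased': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
})

X = df[['Age']]
y = df['Purchased']

# stratify=y preserves the 2:1 class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print("train class counts:\n", y_train.value_counts())
print("test class counts:\n", y_test.value_counts())
```

With 12 rows and test_size=0.25, the test set gets 3 rows: 2 from class 0 and 1 from class 1, matching the 2:1 ratio of the full dataset.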

Method 2: Manual Splitting Using Indexing

Manual splitting means dividing a dataset into training and testing parts without using built-in ML functions like train_test_split(). This approach gives full control over how data is shuffled and split.

Here:

  • Dataset is shuffled first
  • 80% rows are selected for training
  • Remaining rows are used for testing
Python
import pandas as pd

data = {
    'Age': [22, 25, 47, 52, 46],
    'Salary': [25000, 32000, 48000, 52000, 50000],
    'Purchased': [0, 1, 1, 0, 1]
}

df = pd.DataFrame(data)

df = df.sample(frac=1, random_state=42).reset_index(drop=True)  # shuffle rows; random_state makes the shuffle reproducible

split = int(0.8 * len(df))

train = df[:split]
test = df[split:]

print(train)
print(test)

Output:

(Printed train and test DataFrames: four shuffled rows for training, one for testing.)
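The manual steps above can be wrapped in a small reusable helper. Note that manual_train_test_split is our own function name for illustration, not a library API:

```python
import pandas as pd

def manual_train_test_split(df, train_frac=0.8, seed=42):
    """Shuffle a DataFrame and split it by row position."""
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    split = int(train_frac * len(shuffled))
    return shuffled[:split], shuffled[split:]

df = pd.DataFrame({
    'Age': [22, 25, 47, 52, 46],
    'Salary': [25000, 32000, 48000, 52000, 50000],
    'Purchased': [0, 1, 1, 0, 1]
})

train, test = manual_train_test_split(df)
print(len(train), len(test))  # 4 rows for training, 1 for testing
```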

Method 3: Splitting Using NumPy

NumPy can also be used when working with arrays instead of DataFrames.

  • Data is split based on index position
  • Suitable for numerical array-based datasets
Python
import numpy as np

arr = np.arange(20)

print("original array: ",arr)

train, test = np.split(arr, [16])

print("train: ",train)
print("test: ", test)

Output:

original array:  [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]
train:  [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15]
test:  [16 17 18 19]
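np.split() also accepts multiple indices, so the same call can produce a train/validation/test split in one step. A sketch using a 70/15/15 split of the same 20-element array:

```python
import numpy as np

arr = np.arange(20)

# Indices 14 and 17 give a 14/3/3 (70/15/15) three-way split
train, val, test = np.split(arr, [14, 17])

print("train:", train)  # elements 0..13
print("val:  ", val)    # elements 14..16
print("test: ", test)   # elements 17..19
```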

Choosing the Right Split Ratio

Dataset Size    Recommended Split
Small           70:30
Medium          80:20
Large           90:10
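These ratios map directly onto the test_size parameter. An illustrative sketch showing how each ratio divides a 1,000-row array (the row count is made up for demonstration):

```python
from sklearn.model_selection import train_test_split
import numpy as np

X = np.arange(1000).reshape(-1, 1)

# Each recommended ratio expressed as a test_size value
for test_size in (0.30, 0.20, 0.10):
    X_tr, X_te = train_test_split(X, test_size=test_size, random_state=42)
    print(f"test_size={test_size}: {len(X_tr)} train rows, {len(X_te)} test rows")
```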

Best Method to Use

  • Use train_test_split() for most ML tasks
  • Use manual splitting for learning or custom logic
  • Use NumPy split for array-based workflows

Common Mistakes to Avoid

  • Not shuffling data before splitting
  • Using test data during training
  • Choosing very small test size
  • Forgetting to set random_state
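The last pitfall is easy to demonstrate: without random_state every run can shuffle differently, while fixing it makes the split reproducible. A minimal sketch:

```python
from sklearn.model_selection import train_test_split
import numpy as np

X = np.arange(10).reshape(-1, 1)

# Same random_state -> identical splits across separate calls
a_train, a_test = train_test_split(X, test_size=0.2, random_state=42)
b_train, b_test = train_test_split(X, test_size=0.2, random_state=42)

print(np.array_equal(a_test, b_test))  # True
```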