
How to split a Dataset into Train and Test Sets using Python

Last Updated : 18 Apr, 2025

One of the most important steps in preparing data for training an ML model is splitting the dataset into training and testing sets. This simply means dividing the data into two parts: one to train the machine learning model (the training set) and another to evaluate how well it performs on unseen data (the testing set). The model is fit on the training set, whose statistics it is allowed to learn; the test set is held out and used only for evaluating predictions.

We’ll see how to split a dataset into train and test sets using Python, relying on the scikit-learn library to perform the split efficiently. Whether you’re working with numerical data, text, or images, this is an essential part of any supervised machine learning workflow.

Installation:

The scikit-learn library can be installed using pip:

Shell
pip install scikit-learn

Alternatively, it can be downloaded from the official scikit-learn website.
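
To confirm the installation succeeded, you can print the installed version (a minimal check; the exact version string you see depends on your environment):

Python
import sklearn

# a successful import plus a version string confirms the installation
print(sklearn.__version__)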

Dataset Splitting

Scikit-learn is one of the most widely used machine learning libraries in Python. It provides a range of tools for building models, pre-processing data, and evaluating performance. For splitting datasets, it provides a handy function called train_test_split() within the model_selection module, making it simple to divide your data into training and testing sets.

Syntax:

train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)

Parameters:

  • *arrays: The data you want to split. This can be lists, NumPy arrays, pandas DataFrames, or sparse matrices; several arrays can be passed at once and are split consistently.
  • test_size: The portion of the data that should go into the test set. A float between 0.0 and 1.0 is treated as a proportion (for example, 0.2 means 20% of the data is used for testing), while an int is an absolute number of samples.
  • train_size: The portion of the data that should go into the training set, in the same format as test_size. If not set, it is automatically calculated as the complement of test_size.
  • random_state: A number that seeds the shuffle so the split is the same every time you run the code.
  • shuffle: If True (the default), the data is shuffled before splitting, so the two sets are random samples rather than contiguous slices of the data.
  • stratify: An array of labels (usually your target y) used to keep the same class distribution in both the train and test sets. This is especially useful for classification problems with imbalanced classes, as the sketch after this list shows.
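
Before moving to a real dataset, here is a minimal sketch of these parameters in action. The toy data below is purely illustrative: ten samples with an 80/20 class imbalance, split in half while preserving the class ratio via stratify.

Python
from sklearn.model_selection import train_test_split

# toy data: 8 samples of class 0 and 2 samples of class 1 (80/20 imbalance)
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2

# stratify=y preserves the 80/20 class ratio in both halves of the split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42, stratify=y)

print(y_train)  # four 0s and one 1
print(y_test)   # four 0s and one 1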

Example

Let us take a sample dataset to perform the split on. The data is a real-estate dataset in CSV form (Real-estate.csv in the code below).


In the example, we first import pandas and the required scikit-learn functions. Then we load the CSV file using the read_csv() function, which stores the data in a DataFrame called df. We want to predict the house price, which is in the last column, so we set that as y (the target). All the other columns are used as features, stored in X.

We use train_test_split() to split the data:

  • test_size=0.05 means 5% of the data is used for testing, and 95% for training.
  • random_state=0 ensures the split is the same every time we run the code.
Python
# import modules
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# read the dataset into a DataFrame
df = pd.read_csv('Real-estate.csv')

# features: every column except the last; target: the last column (house price)
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

# split the dataset: 95% train, 5% test, reproducible via random_state
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.05, random_state=0)

# confirm the sizes of the resulting sets
print(X_train.shape, X_test.shape)

Output:

The shapes of the training and test feature sets are printed; the exact numbers depend on how many rows the CSV file contains.
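
Since LinearRegression is already imported, a natural next step is to fit it on the training set and evaluate it on the held-out test set. This is a minimal sketch continuing from the code above, not part of the split itself:

Python
# fit the model on the training set only
model = LinearRegression()
model.fit(X_train, y_train)

# evaluate on unseen data; score() returns the R^2 value for regressors
print(model.score(X_test, y_test))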

Hence, we have split our dataset into training and testing sets. If you want to learn more about improving your machine learning workflow, you may explore:

  1. Stratified sampling
  2. Cross validation (a brief sketch follows this list)
  3. Handling imbalanced datasets
  4. Pre-processing before splitting
  5. Machine Learning Models
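
As a pointer for item 2, here is a brief sketch of cross-validation, continuing with the X and y from the example above: instead of relying on a single split, cross_val_score evaluates the model on five different train/test partitions, so every sample is used for testing exactly once.

Python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation: the default score for regressors is R^2
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print(scores.mean(), scores.std())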

