
Extract first and last N rows from PySpark DataFrame

Last Updated : 22 Apr, 2025

In data analysis, inspecting the start and end of a dataset helps you understand its structure and content. PySpark, widely used for big data processing, provides several ways to extract the first and last N rows of a DataFrame. In this article, we’ll demonstrate simple methods to do this using built-in DataFrame methods such as head(), first(), limit(), tail() and collect(). Let’s start by creating a sample DataFrame.

Python
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('app').getOrCreate()

d = [
    ["1", "sravan", "company 1"],
    ["2", "ojaswi", "company 2"],
    ["3", "bobby", "company 3"],
    ["4", "rohith", "company 2"],
    ["5", "gnanesh", "company 1"]
]

cols = ['ID', 'Name', 'Company']
df = spark.createDataFrame(d, cols)
df.show()

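Output

+---+-------+---------+
| ID|   Name|  Company|
+---+-------+---------+
|  1| sravan|company 1|
|  2| ojaswi|company 2|
|  3|  bobby|company 3|
|  4| rohith|company 2|
|  5|gnanesh|company 1|
+---+-------+---------+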

Extracting first N rows

When working with large datasets, it is often helpful to quickly inspect the first few rows to gain insights into the structure and content of the data. We can extract the first N rows using several methods, which are discussed below with the help of examples.

Using head()

The head(n) method returns the first n rows of a DataFrame as a list of Row objects, which you can index and iterate like any regular Python list. Because the rows are brought back to the driver, keep n small.

Python
# extract top 2 rows
a = df.head(2)
print(a)

# extract top 1 row
a = df.head(1)
print(a)

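Output

[Row(ID='1', Name='sravan', Company='company 1'), Row(ID='2', Name='ojaswi', Company='company 2')]
[Row(ID='1', Name='sravan', Company='company 1')]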

Using first()

The first() method returns only the first row of the DataFrame. It is similar to head(1) but returns a single Row object rather than a list (or None if the DataFrame is empty).

Python
a = df.first()
print(a)

Output

Row(ID='1', Name='sravan', Company='company 1')

Using limit(n).collect()

The limit(n) method is a transformation that returns a new DataFrame containing at most n rows. Calling collect() on the result then brings those rows to the driver as a list of Row objects.

Python
a = df.limit(3).collect()
print(a)

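Output

[Row(ID='1', Name='sravan', Company='company 1'), Row(ID='2', Name='ojaswi', Company='company 2'), Row(ID='3', Name='bobby', Company='company 3')]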

Extracting last N rows

When working with large datasets, it is often useful to quickly inspect the last few rows to gain insights into the most recent entries or the tail end of the data. We can extract the last N rows using several methods, which are discussed below with the help of examples.

Using tail()

The tail(n) method, available since Spark 3.0, returns the last n rows of the DataFrame as a list of Row objects, similar to head() but taken from the end of the DataFrame. As with head(), the rows are brought to the driver, so keep n small.

Python
# extract last 3 rows
a = df.tail(3)
print(a)

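Output

[Row(ID='3', Name='bobby', Company='company 3'), Row(ID='4', Name='rohith', Company='company 2'), Row(ID='5', Name='gnanesh', Company='company 1')]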

Using collect()

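You can also use the collect() method to retrieve all the rows of the DataFrame as a list of Row objects, then slice the list to get the last N rows. Note that collect() loads the entire DataFrame into driver memory, so this approach is only suitable for small DataFrames.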

Python
# collect all rows, then keep the last 3
a = df.collect()[-3:]
print(a)

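Output

[Row(ID='3', Name='bobby', Company='company 3'), Row(ID='4', Name='rohith', Company='company 2'), Row(ID='5', Name='gnanesh', Company='company 1')]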


