PySpark dataframe foreach to fill a list
Last Updated: 28 Apr, 2025
In this article, we are going to learn how to fill a Python list with the rows of a PySpark dataframe using the foreach() function.
PySpark is a powerful open-source library for working with large datasets in Python. It is designed for distributed computing and is commonly used for data manipulation and analysis tasks. By using parallel processing techniques, it allows users to process large amounts of data easily and efficiently.
The data frame structure is one of the key features of PySpark, making it easy to manipulate and analyze data in a tabular format. A dataframe is similar to a traditional spreadsheet or SQL table and provides a variety of functions and methods for manipulating and analyzing data.
Dataframes and their importance in PySpark
In PySpark, a data frame is a distributed collection of data organized into rows and columns, similar to a spreadsheet or a SQL table, and it is an essential part of PySpark for data manipulation and analysis. Dataframes allow users to easily manipulate, filter, and transform data, and provide a wide range of functions and methods for working with data. A key advantage of data frames is their ability to scale to large amounts of data: because a data frame is distributed across a cluster of machines, it can handle very large datasets without hitting the memory limits of a single machine. This makes it ideal for working with big data and for running complex queries and operations on large datasets.
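As a quick illustration, here is a minimal sketch of the kind of tabular operations a data frame supports; the app name, column names, and values here are illustrative and not part of the examples that follow.
Python3
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Create a local SparkSession (illustrative app name)
spark = SparkSession.builder.appName("DataframeOps").getOrCreate()

# A small dataframe with illustrative data
df = spark.createDataFrame([("Alice", 25), ("Bob", 30)], ["name", "age"])

# Filter rows and derive a new column, much like a SQL table
df.filter(F.col("age") > 25).withColumn(
    "age_next_year", F.col("age") + 1).show()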
Creating a PySpark data frame from a list
Here we are going to create a PySpark data frame from a list of tuples by defining its schema with StructType() and then building the data frame with the createDataFrame() function. These are the steps to create a PySpark data frame from a list.
Python3
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName(
    "PySpark fill list").getOrCreate()

# Create a list of tuples
data = [(1, "John"), (2, "Mike"), (3, "Sara")]

# Define the schema of the dataframe
schema = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType())
])

# Create a dataframe from the list
df = spark.createDataFrame(data, schema)

# Show the dataframe
df.show()
Output:
+---+----+
| id|name|
+---+----+
|  1|John|
|  2|Mike|
|  3|Sara|
+---+----+
Using foreach to fill a list from a PySpark data frame
foreach() is an action that runs a function once for each row of a PySpark data frame, and here we use it to try to append the data from each row to a list. Note that the function passed to foreach() is executed on the worker processes, not on the driver, so side effects such as appending to a driver-side list are not visible back on the driver: the local result list below stays empty. For this reason, collect() is the reliable way to bring the rows of a distributed data frame back into a Python list, as the example shows. Here are the steps for filling a list with data from a PySpark data frame:
Python3
# Import the SparkSession class from the pyspark.sql module
from pyspark.sql import SparkSession

# Create a SparkSession with the specified app name
spark = SparkSession.builder.appName('Example').getOrCreate()

# Create a DataFrame with three rows, containing
# the names and ages of three people
df = spark.createDataFrame(
    [('Alice', 25), ('Bob', 30), ('Charlie', 35)],
    ['name', 'age'])

# Initialize an empty list to store the results
result = []

# Run an action on each row of the DataFrame with foreach().
# The lambda tries to append the name and age of each row to the
# result list, but it executes in the worker processes, so the
# appends never reach this driver-side list
df.foreach(lambda row: result.append((row.name, row.age)))

# Collect the rows of the DataFrame into a list on the driver
# with collect(); this is the reliable way to fill the list
result = df.collect()

# Print the resulting list of rows
print(result)
Output:
[Row(name='Alice', age=25), Row(name='Bob', age=30), Row(name='Charlie', age=35)]
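If you need plain Python values instead of Row objects, or the data frame is too large to collect() in one go, the following sketch shows two common variations; it assumes the same df as in the example above.
Python3
# Build a list of plain (name, age) tuples from the collected rows
result = [(row.name, row.age) for row in df.collect()]
print(result)

# For very large dataframes, toLocalIterator() streams rows to the
# driver one partition at a time instead of materializing them all at once
result = [(row.name, row.age) for row in df.toLocalIterator()]
print(result)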