PySpark sampleBy using multiple columns
Last Updated :
28 Apr, 2025
In this article, we are going to learn about PySpark sampleBy using multiple columns in Python.
While processing big data, there are many cases where we need only a sample of the data. In PySpark, we can obtain such a sample using the sampleBy() function. In this article, we are going to learn how to take stratified samples over multiple columns using the sampleBy()/sampleByKey() approach.
sampleBy() function:
sampleBy() returns a stratified sample without replacement, based on the fraction given for each stratum. The strata are defined by a column, so sampling is performed separately for each distinct value of that column.
Syntax: DataFrame.sampleBy(col, fractions, seed=None)
Parameters:
- col: A Column or a column name (string) that defines the strata.
- fractions: A dict mapping each stratum value to its sampling fraction (between 0 and 1). A stratum not present in the dict is treated as having fraction 0.
- seed: Random seed (optional).
Returns: A new DataFrame that represents the stratified sample.
Steps of PySpark sampleBy using multiple columns
Step 1: First of all, import the SparkSession library. The SparkSession library is used to create the session.
from pyspark.sql import SparkSession
Step 2: Now, create a spark session using getOrCreate() function.
spark_session = SparkSession.builder.getOrCreate()
Step 3: Then, either create the data frame using the createDataFrame() function or read the CSV file.
data_frame = spark_session.read.csv('#Path of CSV file',
                                    sep=',', inferSchema=True, header=True)
or
data_frame = spark_session.createDataFrame([(column_data_1), (column_data_2), (column_data_3)],
                                           ['column_name_1', 'column_name_2', 'column_name_3'])
Step 4: Later on, store the data frame in another variable as it will be used during sampling.
df=data_frame
Step 5: Further, apply a transformation on every element by defining the columns as well as sampling percentage as an argument in the map() function.
fractions = df.rdd.map(lambda x: (x[column_index_1], x[column_index_2])) \
              .distinct().map(lambda x: (x, fraction)).collectAsMap()
Step 6: Moreover, turn each record into a (key, record) pair using the keyBy() function, where the key is the tuple of the chosen columns.
key_df = df.rdd.keyBy(lambda x: (x[column_index_1],x[column_index_2]))
Step 7: Finally, extract the random sample through the sampleByKey() function, passing a boolean (withReplacement), the fractions dict, and optionally a seed; then keep only the record half of each (key, record) pair and display the data frame.
key_df.sampleByKey(False, fractions).map(lambda x: x[1]).toDF(data_frame.columns).show()
Example 1:
In this example, we have created a data frame with the columns 'Roll_Number', 'Fees' and 'Fine', and then extracted a sample from it through the sampleByKey() function, keying on the two columns 'Roll_Number' and 'Fees' and passing a boolean (withReplacement=False) and a fraction of 0.4 per key. We have extracted the random sample twice through the sampleByKey() function to see whether we get the same rows each time. Since no seed is passed, the two samples are not guaranteed to match and may differ from run to run.
Python3
# Pyspark program to sampleBy using multiple columns

# Import the SparkSession library
from pyspark.sql import SparkSession

# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Create a data frame with three columns 'Roll_Number', 'Fees' and 'Fine'
data_frame = spark_session.createDataFrame(
    [(1, 10000, 400),
     (2, 14000, 500),
     (3, 12000, 800)],
    ['Roll_Number', 'Fees', 'Fine'])

# Store the data frame in another variable
# as it will be used during sampling
df = data_frame
print("Data frame:")
df.show()

# Apply a transformation on every element by defining the columns
# (first and second) as well as the sampling percentage as an
# argument in the map function
fractions = df.rdd.map(lambda x: (x[0], x[1])) \
              .distinct().map(lambda x: (x, 0.4)).collectAsMap()

# Create (key, record) pairs using the keyBy function
key_df = df.rdd.keyBy(lambda x: (x[0], x[1]))

# Extract a random sample through the sampleByKey function
# using a boolean and the fractions dict as arguments
print("Sample 1: ")
key_df.sampleByKey(False, fractions).map(lambda x: x[1]) \
      .toDF(data_frame.columns).show()

# Again extract a random sample through the sampleByKey function
# using a boolean and the fractions dict as arguments
print("Sample 2: ")
key_df.sampleByKey(False, fractions).map(lambda x: x[1]) \
      .toDF(data_frame.columns).show()
Output:
Data frame:
+-----------+-----+----+
|Roll_Number| Fees|Fine|
+-----------+-----+----+
| 1|10000| 400|
| 2|14000| 500|
| 3|12000| 800|
+-----------+-----+----+
Sample 1:
+-----------+-----+----+
|Roll_Number| Fees|Fine|
+-----------+-----+----+
| 3|12000| 800|
+-----------+-----+----+
Sample 2:
+-----------+-----+----+
|Roll_Number| Fees|Fine|
+-----------+-----+----+
| 3|12000| 800|
+-----------+-----+----+
Example 2:
In this example, we have read the data frame from a CSV file (link) and then extracted a sample from it through the sampleByKey() function, keying on the three columns 'class', 'fees' and 'discount' and passing a boolean (withReplacement=True), a fraction of 0.4 per key, and a seed. We have extracted the random sample twice through the sampleByKey() function to see whether we get the same rows each time. Since a fixed seed is passed, both samples are identical.
Python3
# Pyspark program to sampleBy using multiple columns

# Import the SparkSession library
from pyspark.sql import SparkSession

# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Read the CSV file
data_frame = spark_session.read.csv(
    '/content/drive/MyDrive/Colab Notebooks/class_data.csv',
    sep=',', inferSchema=True, header=True)

# Store the data frame in another variable
# as it will be used during sampling
df = data_frame
print("Data frame: ")
df.show()

# Apply a transformation on every element by defining the columns
# (third, fourth and fifth) as well as the sampling percentage
# as an argument in the map function
fractions = df.rdd.map(lambda x: (x[2], x[3], x[4])) \
              .distinct().map(lambda x: (x, 0.4)).collectAsMap()

# Create (key, record) pairs using the keyBy function
key_df = df.rdd.keyBy(lambda x: (x[2], x[3], x[4]))

# Extract a random sample through the sampleByKey function using a
# boolean, the fractions dict and a seed (value=4) as arguments
print("Sample 1: ")
key_df.sampleByKey(True, fractions, 4).map(lambda x: x[1]) \
      .toDF(data_frame.columns).show()

# Again extract a random sample through the sampleByKey function using a
# boolean, the fractions dict and the same seed (value=4) as arguments
print("Sample 2: ")
key_df.sampleByKey(True, fractions, 4).map(lambda x: x[1]) \
      .toDF(data_frame.columns).show()
Output:
Data frame:
+-------+--------------+-----+-----+--------+
| name| subject|class| fees|discount|
+-------+--------------+-----+-----+--------+
| Arun| Maths| 10|12000| 400|
| Aniket|Social Science| 11|15000| 600|
| Ishita| English| 9| 9000| 0|
|Pranjal| Science| 12|18000| 1000|
|Vinayak| Computer| 12|18000| 500|
+-------+--------------+-----+-----+--------+
Sample 1:
+------+-------+-----+----+--------+
| name|subject|class|fees|discount|
+------+-------+-----+----+--------+
|Ishita|English| 9|9000| 0|
+------+-------+-----+----+--------+
Sample 2:
+------+-------+-----+----+--------+
| name|subject|class|fees|discount|
+------+-------+-----+----+--------+
|Ishita|English| 9|9000| 0|
+------+-------+-----+----+--------+