Sort By Value in PySpark
PySpark is the Python API for Apache Spark, a distributed engine for large-scale data processing that can handle petabyte-scale datasets. PySpark provides built-in functions and methods such as orderBy(), sort(), sortBy(), createDataFrame(), collect(), and asc_nulls_last() that can be used to sort values.
Syntax
The following syntax is used in the examples:
createDataFrame()
This SparkSession method creates a DataFrame from local data, such as a list of tuples, together with the column names.
orderBy()
This DataFrame method sorts one or more columns in ascending or descending order.
sort()
This DataFrame method sorts in ascending order by default; passing a descending expression such as desc("column") makes it sort in descending order.
sortBy()
This RDD method sorts the records of an RDD by a key extracted from each record.
parallelize()
This sparkContext method distributes a local collection across the nodes of the cluster as an RDD.
collect()
This RDD and DataFrame method retrieves all the records of the dataset to the driver program.
asc_nulls_last("column_name")
This built-in function returns a sort expression that orders the column in ascending order and places null values at the end.
Installation Required
pip install pyspark
This command installs PySpark so that the programs below can run.
Example 1
In the following example, we show how to sort a DataFrame by a single column. First, import SparkSession from pyspark.sql and create a SparkSession object. Then store a list of tuples in the variable stu_data. Next, create a DataFrame using spark.createDataFrame(), providing the data and the column names. Then use the orderBy() method on the DataFrame to sort it by the desired column, in this case "Age". Finally, display the sorted DataFrame with the show() method.
from pyspark.sql import SparkSession

# Creation of SparkSession
spark = SparkSession.builder.getOrCreate()

# Creation of DataFrame
stu_data = [("Akash", 25), ("Bhuvan", 23), ("Peter", 18), ("Mohan", 26)]
df = spark.createDataFrame(stu_data, ["Name", "Age"])

# Sorting of DataFrame column (Age) in ascending order
sorted_df = df.orderBy("Age")

# Show the sorted DataFrame
sorted_df.show()
Output
+------+---+
|  Name|Age|
+------+---+
| Peter| 18|
|Bhuvan| 23|
| Akash| 25|
| Mohan| 26|
+------+---+
Example 2
In the following example, we show how to sort a DataFrame by multiple columns. Here the orderBy() method accepts two parameters: a list of column names to sort by, and an ascending list that sets True for each column to sort it in ascending order. The result is stored in the variable sorted_df, which is then displayed with the show() method.
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.getOrCreate()

# Create a DataFrame
data = [("Umbrella", 125), ("Bottle", 20), ("Colgate", 118)]
df = spark.createDataFrame(data, ["Product", "Price"])

# Sort DataFrame by price and product in ascending order
sorted_df = df.orderBy(["Price", "Product"], ascending=[True, True])

# Show the sorted DataFrame
sorted_df.show()
Output
+--------+-----+
| Product|Price|
+--------+-----+
|  Bottle|   20|
| Colgate|  118|
|Umbrella|  125|
+--------+-----+
Example 3
In the following example, we show how to sort a DataFrame in descending order. Here the built-in method createDataFrame() builds the DataFrame from a list of tuples. Then the variable sorted_df stores the result of the sort() method, which accepts the built-in function desc("Age") to sort by the "Age" column in descending order. Finally, we print the result using sorted_df.show().
from pyspark.sql import SparkSession
from pyspark.sql.functions import desc

# Creation of SparkSession
spark = SparkSession.builder.getOrCreate()

# Creation of DataFrame
Emp_data = [("Abhinav", 25, "Male"), ("Meera", 32, "Female"), ("Riya", 18, "Female"), ("Deepak", 33, "Male"), ("Elon", 50, "Male")]
df = spark.createDataFrame(Emp_data, ["Name", "Age", "Gender"])

# Sort DataFrame by Age in descending order
sorted_df = df.sort(desc("Age"))

# Show the sorted DataFrame
sorted_df.show()
Output
+-------+---+------+
|   Name|Age|Gender|
+-------+---+------+
|   Elon| 50|  Male|
| Deepak| 33|  Male|
|  Meera| 32|Female|
|Abhinav| 25|  Male|
|   Riya| 18|Female|
+-------+---+------+
Example 4
In the following example, we show how to sort an RDD by value. Here we create an RDD from a list of tuples. Then we call the sortBy() method on the RDD, providing a lambda function that extracts the sort key, namely the second element of each tuple. Finally, we collect and iterate over the sorted RDD to print the sorted records.
# Sorting RDD by Value
from pyspark.sql import SparkSession

# Creation of SparkSession
spark = SparkSession.builder.getOrCreate()

# Creation of RDD
data = [("X", 25), ("Y", 32), ("Z", 18)]
rdd = spark.sparkContext.parallelize(data)

# Sort RDD by value in ascending order
sorted_rdd = rdd.sortBy(lambda x: x[1])

# Print the sorted RDD
for record in sorted_rdd.collect():
    print(record)
Output
('Z', 18)
('X', 25)
('Y', 32)
Example 5
In the following example, we show how to sort a DataFrame that contains null values (None in Python). Here the sort() method is called on the DataFrame with asc_nulls_last("Price") as its argument, which sorts the "Price" column in ascending order and places the null rows at the end. The sorted DataFrame is stored in the variable sorted_df, which is then displayed with the show() method.
# Sorting DataFrame with Null Values
from pyspark.sql import SparkSession
from pyspark.sql.functions import asc_nulls_last

# Creation of SparkSession
spark = SparkSession.builder.getOrCreate()

# Creation of DataFrame with null values
data = [("Charger", None), ("Mouse", 320), ("PEN", 18), ("Bag", 1000), ("Notebook", None)]  # None = null
df = spark.createDataFrame(data, ["Product", "Price"])

# Sorting of DataFrame column (Price) in ascending order with null values last
sorted_df = df.sort(asc_nulls_last("Price"))

# Show the sorted DataFrame
sorted_df.show()
Output
+--------+-----+
| Product|Price|
+--------+-----+
|     PEN|   18|
|   Mouse|  320|
|     Bag| 1000|
| Charger| null|
|Notebook| null|
+--------+-----+
Conclusion
We discussed different ways to sort values in PySpark, using built-in functions and methods such as orderBy(), sort(), and asc_nulls_last(). Sorting arranges records in a defined sequence, whether ascending or descending. PySpark itself is used in applications such as real-time analytics, large-scale data processing, and building data APIs.