In this article, we are going to see how to convert a data frame to a JSON array using PySpark in Python.
In Apache Spark, a data frame is a distributed collection of data organized into named columns. It is similar to a spreadsheet or a SQL table, with rows and columns. You can use a data frame to store and manipulate tabular data in a distributed environment. DataFrames are designed to be expressive, efficient, and flexible, and they are a key component of Spark's SQL (structured data) API.
What is a JSON array?
A JSON (JavaScript Object Notation) array is a data structure that consists of an ordered list of values. It is often used to transmit data between a server and a web application, or between two different applications. JSON arrays are written in a syntax similar to that of JavaScript arrays, with square brackets containing a list of values separated by commas.
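For example, a JSON array holding two records, each an object with three key-value pairs, looks like this:
[
  {"id": 1, "name": "Alice", "age": 10},
  {"id": 2, "name": "Bob", "age": 20}
]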
Methods to convert a data frame to a JSON array in PySpark:
- Using the toJSON() method
- Using the toPandas() method
- Using the write.json() method
Method 1: Using the toJSON() method
The toJSON() method in PySpark converts each row of a Spark data frame into a JSON string. It returns an RDD of strings, one JSON document per row, which can be gathered into a Python list with collect(). Note that it operates on the Spark data frame directly; unlike the pandas to_json() method, it does not accept formatting arguments such as orient.
Stepwise implementation:
Step 1: First of all, import the required library, i.e., SparkSession. The SparkSession class is used to create the session.
from pyspark.sql import SparkSession
Step 2: Create a Spark session using the getOrCreate() method.
spark = SparkSession.builder.appName("MyApp").getOrCreate()
Step 3: Create a data frame with sample data.
df = spark.createDataFrame([(1, "Alice", 10), (2, "Bob", 20), (3, "Charlie", 30)],
["id", "name", "age"])
Step 4: Print the data frame.
df.show()
Step 5: Use the toJSON() method to convert the data frame to a JSON array.
json_array = df.toJSON().collect()
Step 6: Finally, print the JSON array.
print("JSON array:",json_array)
Example:
In this example, we have created a data frame with three columns, id, name, and age, and converted it to a JSON array using the toJSON() method. In the output, the data frame as well as the JSON array are printed.
Python3
# import the required library
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MyApp").getOrCreate()

# create a data frame with sample data
df = spark.createDataFrame([(1, "Alice", 10),
                            (2, "Bob", 20),
                            (3, "Charlie", 30)],
                           ["id", "name", "age"])

print("Dataframe: ")
df.show()

# convert each row to a JSON string and collect into a Python list
json_array = df.toJSON().collect()
print("JSON array:", json_array)
Output:
Dataframe:
+---+-------+---+
| id| name|age|
+---+-------+---+
| 1| Alice| 10|
| 2| Bob| 20|
| 3|Charlie| 30|
+---+-------+---+
JSON array: ['{"id":1,"name":"Alice","age":10}', '{"id":2,"name":"Bob","age":20}', '{"id":3,"name":"Charlie","age":30}']
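Each element of json_array is itself a JSON string, one per row. If a single JSON array string is needed instead, the per-row strings can be parsed and re-serialized with Python's standard json module; a minimal sketch, assuming the json_array variable from the example above:
import json

# parse each row's JSON string into a dict, then serialize the
# whole list as one JSON array string
records = [json.loads(row) for row in json_array]
print(json.dumps(records))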
Method 2: Using the toPandas() method
In this method, we first convert the Spark data frame, which has three columns id, name, and age, to a pandas data frame using the toPandas() method, and then convert it to a JSON string using the pandas to_json() method. Note that toPandas() collects the entire distributed data frame onto the driver, so it is only suitable for data that fits in the driver's memory.
Stepwise implementation:
Step 1: First of all, import the required library, i.e., SparkSession. The SparkSession class is used to create the session.
from pyspark.sql import SparkSession
Step 2: Create a Spark session using the getOrCreate() method.
spark = SparkSession.builder.appName("MyApp").getOrCreate()
Step 3: Create a data frame with sample data.
df = spark.createDataFrame([(1, "Alice", 10), (2, "Bob", 20), (3, "Charlie", 30)],
["id", "name", "age"])
Step 4: Print the data frame.
df.show()
Step 5: Convert the Spark data frame to a pandas data frame.
pandas_df = df.toPandas()
Step 6: Convert the pandas data frame to a JSON array string.
json_data = pandas_df.to_json(orient='records')
Step 7: Finally, print the JSON array.
print("JSON array:", json_data)
Example:
Python3
# import the required library
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MyApp").getOrCreate()

# create a data frame with sample data
df = spark.createDataFrame([(1, "Joseph", 10),
                            (2, "Jack", 20),
                            (3, "Elon", 30)],
                           ["id", "name", "age"])

print("Dataframe: ")
df.show()

# convert to pandas, then serialize as a JSON array string
pandas_df = df.toPandas()
json_data = pandas_df.to_json(orient='records')
print("JSON array:", json_data)
Output:
Dataframe:
+---+-------+---+
| id| name|age|
+---+-------+---+
| 1| Joseph| 10|
| 2| Jack| 20|
| 3| Elon| 30|
+---+-------+---+
JSON array: [{"id":1,"name":"Joseph","age":10},{"id":2,"name":"Jack","age":20},{"id":3,"name":"Elon","age":30}]
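Unlike toJSON(), which yields one JSON string per row, to_json(orient='records') returns the entire result as a single string that is already a valid JSON array. A minimal sketch of parsing it back into Python objects, assuming the json_data variable from the example above:
import json

# json_data is one string containing the whole JSON array,
# so a single json.loads call yields a list of dicts
records = json.loads(json_data)
print(records[0]["name"])  # Joseph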
Method 3: Using the write.json() method
In this method, we will use write.json() to write the JSON data to disk. Note that this creates a directory called data.json containing a set of part files with the JSON data, rather than a single file. If we want the JSON data in a single file, we can use the coalesce(1) method to reduce the data frame to one partition and then call write.json().
Stepwise implementation:
Step 1: First of all, import the required library, i.e., SparkSession. The SparkSession class is used to create the session.
from pyspark.sql import SparkSession
Step 2: Create a Spark session using the getOrCreate() method.
spark = SparkSession.builder.appName("MyApp").getOrCreate()
Step 3: Create a data frame with sample data.
df = spark.createDataFrame([(1, "Alice", 10),
(2, "Bob", 20),
(3, "Charlie", 30)],
["id", "name", "age"])
Step 4: Use write.json() to write the data frame to a JSON directory.
df.write.json('data.json')
Step 5: Finally, coalesce the data frame to a single partition and write it again so the output lands in one JSON file.
df.coalesce(1).write.json('data_merged.json')
Example:
In this example, we created a data frame with three columns, id, name, and age. The first write creates a directory with one JSON part file per partition, while the coalesced write stores the output in a single part file inside a directory named data_merged.json.
Python3
# import the required library
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MyApp").getOrCreate()

# create a data frame with sample data
df = spark.createDataFrame([(1, "Donald", 51),
                            (2, "Riya", 23),
                            (3, "Vani", 22)],
                           ["id", "name", "age"])

# write a directory of JSON part files, then a single-part copy
df.write.json('data.json')
df.coalesce(1).write.json('data_merged.json')

df.show()
Output:
[{"id":1,"name":"Donald","age":51},
{"id":2,"name":"Riya","age":23},
{"id":3,"name":"Vani","age":22}]