Data Manipulation

This document discusses performing various operations on DataFrames in Spark SQL such as filtering rows, selecting columns, grouping data, running SQL queries, and stopping the Spark session. It provides code examples for each step.

Step 4: Data Manipulation

Perform various DataFrame operations such as filtering, selecting columns, grouping, and
aggregating.

1. Filtering Data:

```python
# Filter rows where age > 21
df_filtered = df.filter(df.age > 21)
```

o filter(condition): Filters rows based on the given condition.

2. Selecting Specific Columns:

```python
# Select specific columns
df_selected = df_filtered.select("name", "age", "city")
```

o select(*columns): Selects specified columns from the DataFrame.

3. Grouping and Aggregating Data:

```python
# Group by city and count the number of occurrences
df_grouped = df_selected.groupBy("city").count()
```

o groupBy(*cols): Groups the DataFrame using the specified columns.

o count(): Counts the number of rows for each group.

4. Displaying Results:

```python
df_filtered.show()
df_selected.show()
df_grouped.show()
```

o show(): Displays the first 20 rows of the DataFrame by default.

Step 5: Run SQL Queries

Register the DataFrame as a temporary SQL view and execute SQL queries on it.

1. Registering the DataFrame as a SQL Temporary View:

```python
# Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")
```

o createOrReplaceTempView(viewName): Registers the DataFrame as a temporary view with the given name.

2. Executing SQL Queries:

```python
# Execute SQL query to count the number of people in each city where age > 21
sql_result = spark.sql("SELECT city, COUNT(*) as count FROM people WHERE age > 21 GROUP BY city")
```

o spark.sql(query): Executes the specified SQL query and returns the result as a DataFrame.

3. Displaying SQL Query Results:

```python
sql_result.show()
```

o show(): Displays the first 20 rows of the DataFrame by default.

Step 6: Stop the SparkSession

After completing the operations, stop the SparkSession to free up resources.

```python
# Stop the Spark session
spark.stop()
```

o stop(): Stops the SparkSession.

Complete Example Code

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Step 1: Initialize SparkSession
spark = SparkSession.builder.appName("End-to-End DataFrame Workflow").getOrCreate()

# Step 2: Create DataFrame from JSON file
df = spark.read.json("path/to/json/file.json")

# Step 3: Explore the DataFrame
df.printSchema()
df.show()

# Step 4: Data Manipulation
df_filtered = df.filter(col("age") > 21)
df_selected = df_filtered.select("name", "age", "city")
df_grouped = df_selected.groupBy("city").count()
df_filtered.show()
df_selected.show()
df_grouped.show()

# Step 5: Run SQL Queries
df.createOrReplaceTempView("people")
sql_result = spark.sql("SELECT city, COUNT(*) as count FROM people WHERE age > 21 GROUP BY city")
sql_result.show()

# Step 6: Stop the SparkSession
spark.stop()
```
