Data Manipulation
Perform various DataFrame operations such as filtering, selecting columns, grouping, and
aggregating.
1. Filtering Data:
# Filter rows where age > 21
df_filtered = df.filter(df.age > 21)
2. Selecting Columns:
# Select specific columns
df_selected = df_filtered.select("name", "age", "city")
3. Grouping and Aggregating:
# Group by city and count the number of occurrences
df_grouped = df_selected.groupBy("city").count()
4. Displaying Results:
df_filtered.show()
df_selected.show()
df_grouped.show()
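To make the expected result of the three steps concrete, the sketch below applies the same filter, column selection, and per-city count to a few rows in plain Python. It is a stand-in for illustration, not Spark code: the sample rows and the extra `email` field are hypothetical, and the contents of `df` in the walkthrough may differ.

```python
from collections import Counter

# Hypothetical sample rows standing in for the contents of `df`.
rows = [
    {"name": "Ana", "age": 34, "city": "Lisbon", "email": "ana@example.com"},
    {"name": "Ben", "age": 19, "city": "Lisbon", "email": "ben@example.com"},
    {"name": "Chen", "age": 27, "city": "Osaka", "email": "chen@example.com"},
    {"name": "Dee", "age": 45, "city": "Lisbon", "email": "dee@example.com"},
]

# Step 1: keep rows where age > 21 (mirrors df.filter(df.age > 21)).
filtered = [r for r in rows if r["age"] > 21]

# Step 2: keep only name, age, and city (mirrors select("name", "age", "city")).
columns = ("name", "age", "city")
selected = [{c: r[c] for c in columns} for r in filtered]

# Step 3: count rows per city (mirrors groupBy("city").count()).
grouped = Counter(r["city"] for r in selected)

print(dict(grouped))  # {'Lisbon': 2, 'Osaka': 1}
```

One difference worth keeping in mind: in Spark, filter, select, and groupBy are lazy transformations, and nothing is computed until an action such as show() is called.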
SQL Queries
Register the DataFrame as a temporary SQL view and execute SQL queries on it.
# Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")
o createOrReplaceTempView(viewName): Registers the DataFrame as a temporary
view with the given name, replacing any existing view of that name. The view
is tied to the current Spark session.
# Count the people older than 21 in each city
sql_result = spark.sql(
    "SELECT city, COUNT(*) AS count FROM people "
    "WHERE age > 21 GROUP BY city"
)
o spark.sql(query): Executes the specified SQL query and returns the result as a
DataFrame.
sql_result.show()
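The query is standard SQL, so its result can be sanity-checked without a cluster. The sketch below runs the same statement against an in-memory table using Python's built-in sqlite3 module as a stand-in for Spark SQL. The sample rows are hypothetical, and an ORDER BY clause is added only to make the printed order deterministic.

```python
import sqlite3

# In-memory database standing in for the "people" temporary view.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER, city TEXT)")

# Hypothetical sample rows; the real DataFrame may hold different data.
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [
        ("Ana", 34, "Lisbon"),
        ("Ben", 19, "Lisbon"),
        ("Chen", 27, "Osaka"),
        ("Dee", 45, "Lisbon"),
    ],
)

# Same query as in the walkthrough, plus ORDER BY for a stable row order.
query = (
    "SELECT city, COUNT(*) AS count FROM people "
    "WHERE age > 21 GROUP BY city ORDER BY city"
)
result = conn.execute(query).fetchall()
conn.close()

print(result)  # [('Lisbon', 2), ('Osaka', 1)]
```

The counts match what the DataFrame pipeline (filter, then groupBy and count) would give for the same rows, since both express the same computation.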
# Stop the Spark session
spark.stop()