Transform Data with Databricks and S3

The document outlines three scenarios for data manipulation using Databricks and Spark. Scenario 1 creates a table from inline CSV data, adds a phone number column, and displays the result. Scenario 2 reads a CSV from Databricks local storage (DBFS), adds the same column, and writes the result back to DBFS; Scenario 3 does the same against AWS S3.

Uploaded by

Guru Sathiya
Scenario 1:
Create a table from CSV data in a Databricks workspace using Python and apply a transformation:


from io import StringIO
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

# Create a raw string of your CSV
csv_data = """EMPLOYEE_ID,FIRST_NAME,LAST_NAME,SALARY
101,John,Doe,50000
102,Jane,Smith,60000
103,Ravi,Kumar,55000"""

# Read CSV as pandas
pdf = pd.read_csv(StringIO(csv_data))

# Convert pandas to Spark DataFrame
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(pdf)

# Transform: add a phone number column
df_transformed = df.withColumn("Phone_Number", lit("9999999999"))

df_transformed.show()
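Outside of a Spark cluster, the effect of the transformation above can be rehearsed with pandas alone. This is a minimal sketch (not part of the original notebook) that mirrors `withColumn` with a plain column assignment:

```python
from io import StringIO
import pandas as pd

# Same sample CSV as in Scenario 1
csv_data = """EMPLOYEE_ID,FIRST_NAME,LAST_NAME,SALARY
101,John,Doe,50000
102,Jane,Smith,60000
103,Ravi,Kumar,55000"""

pdf = pd.read_csv(StringIO(csv_data))

# pandas equivalent of df.withColumn("Phone_Number", lit("9999999999"))
pdf["Phone_Number"] = "9999999999"

print(pdf.columns.tolist())
```

Because the column is assigned from a single scalar, every row gets the same constant value, just as `lit()` broadcasts a constant in Spark.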
Scenario 2:
Save the data to Databricks local storage (DBFS), apply a transformation, and load the transformed data locally:
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

# Create Spark session
spark = SparkSession.builder.getOrCreate()

# Use temporary directory paths
input_path = "dbfs:/tmp/employees.csv"  # example input file name
output_path = "dbfs:/tmp/employees_transformed"

# Step 1: Read CSV from DBFS
df = spark.read.option("header", True).csv(input_path)

# Step 2: Transform: add a new column
df_transformed = df.withColumn("Phone_Number", lit("9999999999"))

# Step 3: Write transformed data back to the DBFS tmp directory
df_transformed.write.mode("overwrite").option("header", True).csv(output_path)

print("Transformed file saved to:", output_path)
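The read, transform, and write round trip in Scenario 2 can be tried outside Databricks on an ordinary filesystem. This sketch uses pandas and a temporary directory standing in for `dbfs:/tmp` (the file names and sample rows are illustrative, not from the original):

```python
import os
import tempfile
import pandas as pd

tmp = tempfile.mkdtemp()
input_path = os.path.join(tmp, "employees.csv")
output_path = os.path.join(tmp, "employees_transformed.csv")

# Stand-in for the CSV already sitting in DBFS
pd.DataFrame(
    {"EMPLOYEE_ID": [101, 102], "FIRST_NAME": ["John", "Jane"], "SALARY": [50000, 60000]}
).to_csv(input_path, index=False)

# Step 1: read the CSV; Step 2: add the column; Step 3: write it back out
df = pd.read_csv(input_path)
df["Phone_Number"] = "9999999999"
df.to_csv(output_path, index=False)

print("Transformed file saved to:", output_path)
```

The structure matches the Spark version line for line: a read, a single column addition, and a header-preserving write to a separate output path so the input is never overwritten.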


Scenario 3:
Extract data from AWS S3, apply a transformation in the Databricks workspace, and load the data back to AWS S3:

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

# Step 1: Set your AWS credentials here (placeholders; never hard-code real keys)
access_key = "<YOUR_AWS_ACCESS_KEY>"
secret_key = "<YOUR_AWS_SECRET_KEY>"

# Step 2: Define S3 input/output paths
input_path = "s3a://3marchtest/customers.csv"  # example input file name
output_path = "s3a://3marchtestoutput/customers_transformed"

# Step 3: Set up Spark session with AWS S3 credentials
spark = SparkSession.builder.appName("S3DatabricksDemo").getOrCreate()

hadoop_conf = spark._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", access_key)
hadoop_conf.set("fs.s3a.secret.key", secret_key)
hadoop_conf.set("fs.s3a.endpoint", "s3.amazonaws.com")
hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3a.path.style.access", "true")

# Step 4: Read from S3
df = spark.read.option("header", True).option("inferSchema", True).csv(input_path)

# Step 5: Add Phone Number column
df_transformed = df.withColumn("Phone_Number", lit("9999999999"))

# Step 6: Write back to S3
df_transformed.write.mode("overwrite").option("header", True).csv(output_path)

print("Data written to:", output_path)
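Hard-coding AWS keys in a notebook is risky. One common alternative is to read them from environment variables and collect the same `fs.s3a.*` settings in one place; this is a sketch under that assumption (the environment variable names follow the AWS SDK convention, and the placeholder values are illustrative):

```python
import os

# Assumed environment variable names; set these in your cluster or shell config
os.environ.setdefault("AWS_ACCESS_KEY_ID", "<YOUR_AWS_ACCESS_KEY>")
os.environ.setdefault("AWS_SECRET_ACCESS_KEY", "<YOUR_AWS_SECRET_KEY>")

# Same Hadoop S3A keys as used in Scenario 3, gathered into one dict
s3a_conf = {
    "fs.s3a.access.key": os.environ["AWS_ACCESS_KEY_ID"],
    "fs.s3a.secret.key": os.environ["AWS_SECRET_ACCESS_KEY"],
    "fs.s3a.endpoint": "s3.amazonaws.com",
    "fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    "fs.s3a.path.style.access": "true",
}

# On a live cluster you would then apply the settings in one loop:
# for key, value in s3a_conf.items():
#     spark._jsc.hadoopConfiguration().set(key, value)

print(sorted(s3a_conf))
```

Keeping the configuration in a dict makes it easy to swap the credential source later (for example, Databricks secrets) without touching the loop that applies it.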
