This document is a cheat sheet on ELT with PySpark, organized into numbered sections covering basic and advanced DataFrame operations, data transformation, data profiling, data visualization, data import/export, machine learning, graph processing, performance tuning, and related operational topics.
● UDF (User Defined Function): from pyspark.sql.functions import udf; udf_function = udf(lambda z: custom_function(z))
● String Operations: from pyspark.sql.functions import lower, upper; df.select(upper(df["column"]))
● Date and Time Functions: from pyspark.sql.functions import current_date, current_timestamp; df.select(current_date())
● Numeric Functions: from pyspark.sql.functions import abs, sqrt; df.select(abs(df["column"]))
● Conditional Expressions: from pyspark.sql.functions import when; df.select(when(df["column"] > value, "true").otherwise("false"))
● Type Casting: df.withColumn("column", df["column"].cast("new_type"))
● Explode Function (Array to Rows): from pyspark.sql.functions import explode; df.withColumn("exploded_column", explode(df["array_column"]))
● Pandas UDF: from pyspark.sql.functions import pandas_udf; @pandas_udf("return_type") def pandas_function(col1, col2): return operation
● Aggregating with Custom Functions: df.groupBy("column").agg(custom_agg_function(df["another_column"]))
● Window Functions (Rank, Lead, Lag): from pyspark.sql.window import Window; from pyspark.sql.functions import rank, lead, lag; windowSpec = Window.orderBy("column"); df.withColumn("rank", rank().over(windowSpec))
● Handling JSON Columns: from pyspark.sql.functions import from_json; df.withColumn("parsed_json", from_json(df["json_column"], json_schema)) (json_schema is a StructType or DDL string describing the JSON)
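A minimal end-to-end sketch combining several of the transformations above (a UDF with an explicit return type, when/otherwise, explode, from_json with an explicit schema, and a window rank). The sample data and column names are illustrative, and a running SparkSession is assumed:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, upper, when, explode, from_json, rank
from pyspark.sql.types import StringType, StructType, StructField, IntegerType
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("transform-demo").getOrCreate()

df = spark.createDataFrame(
    [("alice", 34, ["a", "b"], '{"score": 10}'),
     ("bob", 17, ["c"], '{"score": 7}')],
    ["name", "age", "tags", "json_column"],
)

# UDF with an explicit return type
shout = udf(lambda s: s + "!", StringType())

# An explicit schema is safer than inferring one for from_json
json_schema = StructType([StructField("score", IntegerType())])

result = (
    df.withColumn("name_upper", upper(df["name"]))
      .withColumn("is_adult", when(df["age"] >= 18, "true").otherwise("false"))
      .withColumn("tag", explode(df["tags"]))
      .withColumn("parsed_json", from_json(df["json_column"], json_schema))
      .withColumn("age_rank", rank().over(Window.orderBy("age")))
      .withColumn("greeting", shout(df["name"]))
)
result.show(truncate=False)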
5. Data Profiling
● Column Value Counts: df.groupBy("column").count()
● Summary Statistics for Numeric Columns: df.describe()
● Correlation Between Columns: df.stat.corr("column1", "column2")
● Crosstabulation and Contingency Tables: df.stat.crosstab("column1", "column2")
● Frequent Items in Columns: df.stat.freqItems(["column1", "column2"])
● Approximate Quantile Calculation: df.approxQuantile("column", [0.25, 0.5, 0.75], relativeError)
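A short profiling sketch tying these statistics together; the sample DataFrame and the 1% relative error for approxQuantile are illustrative, and a SparkSession named spark is assumed:

# Small illustrative DataFrame for profiling
df = spark.createDataFrame(
    [(1, 10.0, "x"), (2, 20.0, "y"), (3, 30.0, "x"), (4, 40.0, "y")],
    ["id", "value", "category"],
)

df.groupBy("category").count().show()          # column value counts
df.describe("value").show()                    # summary statistics
print(df.stat.corr("id", "value"))             # Pearson correlation
df.stat.crosstab("category", "id").show()      # contingency table
print(df.stat.freqItems(["category"]).collect())
# Quartiles with a 1% allowed relative error
print(df.approxQuantile("value", [0.25, 0.5, 0.75], 0.01))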
6. Data Visualization (Integration with other libraries)
● Convert to Pandas for Visualization: df.toPandas().plot(kind='bar')
● Histograms using Matplotlib: df.toPandas()["column"].hist()
● Box Plots using Seaborn: import seaborn as sns; sns.boxplot(x=df.toPandas()["column"])
● Scatter Plots using Matplotlib: df.toPandas().plot.scatter(x='col1', y='col2')
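A sketch of the collect-then-plot pattern above. It assumes a small Spark DataFrame df with numeric columns col1 and col2 (collecting to pandas pulls everything to the driver, so filter or sample first), plus matplotlib and seaborn installed:

import matplotlib.pyplot as plt
import seaborn as sns

pdf = df.select("col1", "col2").toPandas()   # collects to the driver: keep it small

pdf["col1"].hist()                           # histogram
plt.figure()
sns.boxplot(x=pdf["col1"])                   # box plot on a fresh figure
pdf.plot.scatter(x="col1", y="col2")         # scatter plot (pandas opens its own figure)
plt.show()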
7. Data Import/Export
● Reading Data from JDBC Sources: spark.read.format("jdbc").options(url="jdbc_url", dbtable="table_name").load()
● Writing Data to JDBC Sources: df.write.format("jdbc").options(url="jdbc_url", dbtable="table_name").save()
● Reading Data from HDFS: spark.read.text("hdfs://path/to/file")
● Writing Data to HDFS: df.write.save("hdfs://path/to/output")
● Creating DataFrames from Hive Tables: spark.table("hive_table_name")
● Coalesce Partitions: df.coalesce(numPartitions)
● Reading Data in Chunks (Structured Streaming): spark.readStream.schema(schema).option("maxFilesPerTrigger", 1).csv("path/to/csv_dir")
● Optimizing Data for Skewed Joins: df.repartition("skewed_column")
● Handling Data Skew in Joins: df1.join(df2.hint("broadcast"), "column")
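A sketch of the skew-handling ideas above using the broadcast hint. The names df_large, df_small, and customer_id are hypothetical, and the broadcast side must fit in executor memory:

from pyspark.sql.functions import broadcast

# Option 1: explicit broadcast function
joined = df_large.join(broadcast(df_small), "customer_id")

# Option 2: SQL-style hint on the DataFrame
joined_hint = df_large.join(df_small.hint("broadcast"), "customer_id")

# Repartitioning on the join key before a shuffle join can also even out skew
df_large_repart = df_large.repartition("customer_id")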
9. Spark SQL
● Running SQL Queries on DataFrames: df.createOrReplaceTempView("table"); spark.sql("SELECT * FROM table")
● Registering UDF for SQL Queries: spark.udf.register("udf_name", lambda x: custom_function(x))
● Using SQL Functions in DataFrames: from pyspark.sql.functions import expr; df.withColumn("new_column", expr("SQL_expression"))
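A small sketch showing a temp view, a registered UDF, and a SQL query over it; the column names are illustrative and a SparkSession named spark is assumed:

from pyspark.sql.types import StringType

df.createOrReplaceTempView("people")
# Register a Python function for use inside SQL, with an explicit return type
spark.udf.register("initials", lambda s: s[:1].upper() if s else None, StringType())

spark.sql("""
    SELECT name, initials(name) AS initial, age
    FROM people
    WHERE age > 18
""").show()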
10. Machine Learning and Advanced Analytics
● VectorAssembler for Feature Vectors: from pyspark.ml.feature import VectorAssembler; assembler = VectorAssembler(inputCols=["col1", "col2"], outputCol="features")
● StandardScaler for Feature Scaling: from pyspark.ml.feature import StandardScaler; scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures")
● Building a Machine Learning Pipeline: from pyspark.ml import Pipeline; pipeline = Pipeline(stages=[assembler, scaler, ml_model])
● Train-Test Split: train, test = df.randomSplit([0.7, 0.3])
● Model Fitting and Predictions: model = pipeline.fit(train); predictions = model.transform(test)
● Cross-Validation for Model Tuning: from pyspark.ml.tuning import CrossValidator; crossval = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=evaluator)
● Hyperparameter Tuning: from pyspark.ml.tuning import ParamGridBuilder; paramGrid = ParamGridBuilder().addGrid(model.param, [value1, value2]).build()
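A sketch of the full flow these bullets describe, from feature assembly through cross-validated tuning. The feature columns, label column, and grid values are illustrative:

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import BinaryClassificationEvaluator

assembler = VectorAssembler(inputCols=["col1", "col2"], outputCol="features")
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures")
lr = LogisticRegression(featuresCol="scaledFeatures", labelCol="label")

pipeline = Pipeline(stages=[assembler, scaler, lr])
train, test = df.randomSplit([0.7, 0.3], seed=42)

# Grid over the regularization parameter, evaluated with 3-fold cross-validation
paramGrid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
crossval = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=paramGrid,
    evaluator=BinaryClassificationEvaluator(labelCol="label"),
    numFolds=3,
)

cv_model = crossval.fit(train)
predictions = cv_model.transform(test)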
11. Graph and Network Analysis
● Creating a GraphFrame: from graphframes import GraphFrame; g = GraphFrame(vertices, edges)
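A sketch of building and querying a GraphFrame. It assumes the graphframes package is available (for example via --packages) and uses illustrative vertex and edge data:

from graphframes import GraphFrame

# Vertex DataFrame needs an "id" column; edges need "src" and "dst"
vertices = spark.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"]
)
edges = spark.createDataFrame(
    [("a", "b", "follows"), ("b", "c", "follows")], ["src", "dst", "relationship"]
)

g = GraphFrame(vertices, edges)
g.inDegrees.show()                      # simple degree statistics
g.bfs("id = 'a'", "id = 'c'").show()    # breadth-first search between vertices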
● Time Series Window Functions: from pyspark.sql.functions import window; df.groupBy(window("timestamp", "1 hour")).mean()
21. Advanced Machine Learning Operations
● Custom Machine Learning Models with MLlib: from pyspark.ml.classification import LogisticRegression; lr = LogisticRegression()
● Text Analysis with MLlib: from pyspark.ml.feature import Tokenizer; tokenizer = Tokenizer(inputCol="text", outputCol="words")
● Model Evaluation and Metrics: from pyspark.ml.evaluation import BinaryClassificationEvaluator; evaluator = BinaryClassificationEvaluator()
● Model Persistence and Loading: model.save("path"); ModelType.load("path")
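A sketch tying these pieces together: tokenize text, hash it into features, fit a model, evaluate it, and persist/reload it. The DataFrames train_df and test_df, the column names, and the save path are illustrative:

from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

tokenizer = Tokenizer(inputCol="text", outputCol="words")
tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(labelCol="label")

# train_df / test_df: hypothetical DataFrames with "text" and "label" columns
model = Pipeline(stages=[tokenizer, tf, lr]).fit(train_df)
preds = model.transform(test_df)
print(BinaryClassificationEvaluator(labelCol="label").evaluate(preds))

model.write().overwrite().save("/tmp/text_model")   # persist the fitted pipeline
reloaded = PipelineModel.load("/tmp/text_model")    # reload it later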
22. Graph Analysis with GraphFrames
● Creating GraphFrames for Network Analysis: from graphframes import GraphFrame; g = GraphFrame(vertices_df, edges_df)
custom_udf("column")) ● Vector Operations for ML Features: from pyspark.ml.linalg import Vectors; df.withColumn("vector_col", Vectors.dense("column"))
24. Logging and Monitoring
● Logging Operations in Spark: spark.sparkContext.setLogLevel("WARN")
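A sketch of combining Spark's log level control with ordinary Python logging for job-level messages; the logger name and message are illustrative and a DataFrame df is assumed:

import logging

spark.sparkContext.setLogLevel("WARN")   # raise Spark's own log threshold

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl_job")       # hypothetical job logger
log.info("Loaded %d rows", df.count())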
25. Best Practices and Patterns
● Following Data Partitioning Best Practices: (Optimizing partition strategy for data size and operations)
● Efficient Data Serialization: (Using Kryo serialization for performance)
● Optimizing Data Locality: (Ensuring data is close to computation resources)
● Error Handling and Recovery Strategies: (Implementing try-catch logic and checkpointing)
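A sketch of these practices in one place: Kryo serialization, a checkpoint directory, explicit repartitioning, and try/except around a write. The app name, paths, and partition count are illustrative choices, not prescriptions:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("etl-best-practices")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

df = spark.range(1_000_000)
df = df.repartition(8)      # partition count sized to data volume and cores
df = df.checkpoint()        # truncate a long lineage; returns a new DataFrame

try:
    df.write.mode("overwrite").parquet("/tmp/output")
except Exception as exc:    # recover or fail loudly instead of silently
    print(f"Write failed: {exc}")
    raise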
26. Security and Compliance
● Data Encryption and Security: (Configuring Spark with encryption and security features)
● GDPR Compliance and Data Anonymization: (Implementing data masking and anonymization)
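A sketch of column-level masking with a salted hash; the email column and salt are hypothetical, and the encryption settings shown in comments are standard Spark configuration keys normally set at cluster or submit time (whether any of this satisfies GDPR is a policy question, not something code alone guarantees):

from pyspark.sql.functions import sha2, concat, lit

# Example encryption-related settings (set at cluster/submit time):
#   spark.io.encryption.enabled=true   # encrypt shuffle/spill files
#   spark.ssl.enabled=true             # encrypt RPC/UI traffic

salt = "per-project-secret"            # hypothetical salt
anonymized = (
    df.withColumn("email_hash", sha2(concat(df["email"], lit(salt)), 256))
      .drop("email")                   # drop the raw identifier
)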
27. Advanced Data Science Techniques
● Deep Learning Integration (e.g., with TensorFlow): (Using Spark with TensorFlow for distributed deep learning)
● Complex Event Processing in Streams: (Using structured streaming for event pattern detection)
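A sketch of simple event-pattern detection with Structured Streaming (counting error events per window and flagging bursts); the source path, schema, and threshold are illustrative. Distributed deep learning typically goes through an additional library (for example spark-tensorflow-distributor) and is not shown here:

from pyspark.sql.functions import window, col, count
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

schema = StructType([
    StructField("event_time", TimestampType()),
    StructField("level", StringType()),
])

events = spark.readStream.schema(schema).json("/tmp/events")   # hypothetical source

alerts = (
    events.filter(col("level") == "ERROR")
          .withWatermark("event_time", "10 minutes")
          .groupBy(window("event_time", "5 minutes"))
          .agg(count("*").alias("errors"))
          .filter(col("errors") > 100)                          # hypothetical threshold
)

query = alerts.writeStream.outputMode("update").format("console").start()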
28. Cloud Integration
● Running Spark on Cloud Platforms (e.g., AWS, Azure, GCP): (Setting up Spark clusters on cloud services)
● Integrating with Cloud Storage Services: (Reading and writing data to cloud storage like S3, ADLS, GCS)
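A sketch of reading from and writing to S3 over the s3a connector; the bucket, paths, and credentials are placeholders, hadoop-aws must be on the classpath, and on managed platforms instance roles usually replace access keys:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cloud-io")
    .config("spark.hadoop.fs.s3a.access.key", "AKIA...")   # hypothetical credentials
    .config("spark.hadoop.fs.s3a.secret.key", "...")
    .getOrCreate()
)

df = spark.read.parquet("s3a://my-bucket/raw/orders/")          # hypothetical bucket
df.write.mode("overwrite").parquet("s3a://my-bucket/curated/orders/")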