Demonstrates using Google Sheets as a data source for PySpark sessions on Google Cloud Dataproc Serverless. Includes Jupyter notebooks for data processing and an Airflow demo for scheduling notebook execution.
## Project Structure

```
.
├── .gitignore
├── drive-api.json            # (Git-ignored) Google Service Account credentials
├── pyproject.toml            # Project dependencies and configuration
├── sheets_pyspark.ipynb      # Main Jupyter notebook for data analysis
├── sheets_bigquery.ipynb     # BigQuery notebook example
├── uv.lock                   # Lock file for the uv package manager
├── airflow-demo/             # Airflow notebook scheduling demo
│   ├── README.md             # Demo documentation
│   ├── dags/                 # Airflow DAG definitions
│   ├── docker/               # Vertex AI container files
│   ├── notebooks/            # Production-ready notebooks
│   ├── scripts/              # Auto-generated scripts
│   └── setup/                # Setup scripts
├── sample_data/              # Sample CSV files
│   ├── legacy_charges.csv
│   ├── merchant_excluded.csv
│   └── merchant_send_mid_label.csv
└── utils/                    # Utility scripts
    ├── seed_gsheets.py
    └── test_gspread_access.py
```
## Prerequisites

- Python 3.11
- uv (fast Python package installer)
- Access to a Google Cloud project with Dataproc Serverless enabled
## Credentials Setup

This project requires Google Service Account credentials to access Google Sheets and Google Drive.

1. Obtain a Service Account key in JSON format from the Google Cloud Console.
2. Rename the file to `drive-api.json` and place it in the project root.

The `drive-api.json` file is included in `.gitignore` and should never be committed.
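Sheets read through a service account must be shared with that account's email address. One quick way to find it (illustrative, not part of this repo's scripts) is to read `client_email` from the key file:

```python
# Print the service-account email from the key file; share your Google
# Sheets with this address so the notebooks can read them.
import json

with open("drive-api.json") as f:
    print(json.load(f)["client_email"])
```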
## Installation

1. Install uv:

   ```bash
   pip install uv
   ```

2. Create the virtual environment:

   ```bash
   uv sync
   ```

3. Activate the virtual environment:

   ```bash
   source .venv/bin/activate
   ```

## Usage

The `sheets_pyspark.ipynb` notebook demonstrates how to:
- Connect to a Dataproc Serverless Spark session
- Authenticate with Google Sheets using service account credentials
- Read data from multiple Google Sheets into Spark DataFrames
- Run SQL queries and analysis on the data
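The core pattern is: authenticate with gspread, pull the worksheet into pandas, and promote it to a Spark DataFrame. A minimal sketch follows; the sheet title `legacy_charges` and the query are illustrative assumptions, not taken from the notebook:

```python
# Minimal sketch: read a Google Sheet into a Spark DataFrame via gspread.
# Assumes a Spark session is available (in the notebook this would be the
# Dataproc Serverless session) and the sheet is shared with the service account.
import gspread
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Authenticate with the service-account key in the project root.
gc = gspread.service_account(filename="drive-api.json")

# get_all_records() returns a list of dicts keyed by the header row.
rows = gc.open("legacy_charges").sheet1.get_all_records()
df = spark.createDataFrame(pd.DataFrame(rows))

# Register the DataFrame so it can be queried with Spark SQL.
df.createOrReplaceTempView("legacy_charges")
spark.sql("SELECT COUNT(*) AS n FROM legacy_charges").show()
```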
## Airflow Demo

The `airflow-demo/` directory contains production-ready solutions for scheduling notebook execution using Cloud Composer (Airflow).

Three execution options:

- `PythonVirtualenvOperator` - Cost-effective execution on Composer workers (sketched below)
- Vertex AI Custom Training - Flexible execution with custom containers
- Dataproc Serverless - Spark-native execution with auto-scaling
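To make the first option concrete, a hypothetical DAG could wrap a notebook in papermill inside an isolated virtualenv. Everything here (DAG id, paths, schedule, package pins) is illustrative; the real DAGs live in `airflow-demo/dags/`:

```python
# Hypothetical sketch: run a notebook with papermill in a per-task virtualenv.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonVirtualenvOperator

def run_notebook():
    # Import inside the callable so it resolves in the task's virtualenv.
    import papermill as pm
    pm.execute_notebook(
        "/home/airflow/gcs/data/notebooks/sheets_pyspark.ipynb",  # illustrative path
        "/home/airflow/gcs/data/output/sheets_pyspark_out.ipynb",
    )

with DAG(
    dag_id="sheets_notebook_demo",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    PythonVirtualenvOperator(
        task_id="execute_notebook",
        python_callable=run_notebook,
        requirements=["papermill", "ipykernel"],  # isolated packages per task
        system_site_packages=False,
    )
```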
Features:
- One-command setup scripts
- Isolated environments per task
- Package isolation
- Cost-effective execution
- Idempotent deployment
See `airflow-demo/README.md` for complete setup and usage instructions.
## Utility Scripts

The `utils/` directory contains helper scripts:

- `test_gspread_access.py`: Verifies `drive-api.json` credentials and Google Sheets API access
- `seed_gsheets.py`: Populates Google Sheets with sample data from the `sample_data/` directory
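As a rough idea of what such a verification involves (the actual script may differ), authenticating and listing the spreadsheets visible to the service account is enough to prove the key works:

```python
# Illustrative sketch of a credentials check like test_gspread_access.py;
# the real script in utils/ may do more or differ in detail.
import gspread

gc = gspread.service_account(filename="drive-api.json")

# openall() returns every spreadsheet shared with the service account.
sheets = gc.openall()
print(f"Credentials OK; {len(sheets)} spreadsheet(s) visible:")
for sh in sheets:
    print(" -", sh.title)
```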