Candidate Submittal Cover Page
Candidate Name: Giri Reddy
Phone Number: (350) 227-2441
Current location: Sacramento, California, United States
Willing to Relocate? Local to Sacramento, California and comfortable working onsite in
Columbus, OH
Former TCS employee / contractor? NO
(Please specify employment type and when)
Interview Availability
(Dates listed must be 48-72 hours after candidate submission; include timezone)
Time Slot 1 (Date / Time): 31 July / 12 PM PST
Time Slot 2 (Date / Time): 1 Aug / 11 AM to 4 PM PST
Time Slot 3 (Date / Time): 4 Aug / 2 PM PST
Availability to Start: 2 weeks
Relevant Skills
Mandatory Skills (As listed in JD) | # of Years Experience | Candidate's relevant hands-on experience
AWS Datalake | 4 | Design, build, and manage scalable Data Lake solutions on AWS. Ingest structured and unstructured data from various sources into the Data Lake using tools like AWS Glue, Lambda, or Kinesis.
AWS Services | 7 | Manage and monitor cloud resources for performance, cost, and availability using tools like CloudWatch and CloudTrail. Automate deployments and infrastructure using AWS CloudFormation or Terraform.
ETL | 8 | Extract data from various sources like databases, APIs, and files. Clean, transform, and validate data to meet business and quality requirements. Load processed data into data warehouses, data lakes, or reporting systems.
Professional Summary
Lead AWS Data Engineer with 10+ years of experience in designing, architecting, and optimizing cloud-native data
platforms using AWS services such as Glue, Lambda, S3, and Redshift for large-scale Data Lake and Warehouse
implementations.
Experience in building metadata-driven, reusable ingestion and transformation pipelines using PySpark on AWS EMR
and Databricks, enabling scalable, distributed processing of structured and semi-structured data at petabyte scale.
Designed and implemented modular, serverless ETL pipelines using AWS Lambda, Step Functions, and EventBridge,
supporting robust orchestration, fault tolerance, and near-real-time processing across cloud-native platforms.
Experience with Delta Lake and Lakehouse architecture on Databricks (AWS), enabling ACID-compliant data
layers (Bronze, Silver, Gold) and supporting scalable governance, quality, and analytical consumption.
Extensive experience with AWS Glue, developing PySpark-based dynamic scripts with parameterization, error
handling, and integration with Crawlers, Catalogs, and Workflows to support automated schema discovery and
metadata management (a minimal, illustrative job skeleton follows this summary).
Designed high-performance, secure ingestion pipelines for diverse sources (API, CSV, Parquet, Avro, JSON, and IoT)
into S3-based Data Lakes and Redshift staging zones with consistent schema enforcement and DQ validations.
Architected Terraform and CloudFormation-based Infrastructure as Code for provisioning scalable AWS environments
for data engineering workloads, reducing manual errors and enabling version-controlled deployments.
Developed custom Python-based Data Quality (DQ) frameworks for schema validation, null/type checks, anomaly
detection, and business rule enforcement, integrated across ETL pipelines to ensure reliability and trust in data.
Built reusable orchestration logic using Glue Workflows and Step Functions to support incremental loads, pipeline
recovery, SCD Type 2 handling, and historical archival in batch and streaming scenarios.
Applied knowledge of Redshift internals and optimization strategies (VACUUM, distribution/sort keys, Redshift Spectrum)
to support hybrid analytical workloads across Data Lake and Warehouse zones.
Integrated Databricks with Glue Catalog, Delta Live Tables, and Jobs API to build lineage-aware, governed workflows,
supporting data curation for reporting, ML pipelines, and real-time analytics.
Implemented CI/CD pipelines for Glue, EMR, and Lambda deployments using GitHub Actions, Jenkins, and Terraform
modules, ensuring automated environment promotion and rollback strategies.
Implemented secure-by-design patterns across ingestion and transformation layers using IAM policies, VPC
endpoints, S3 encryption/KMS, and CloudWatch monitoring, ensuring compliance with HIPAA, SOC2, and GDPR.
Delivered scalable solutions for customer analytics, claims processing, and fraud detection by leading the design and
deployment of cloud-native Data Lakes, supporting high-throughput, SLA-bound data ingestion and curation.
Migrated legacy ETL systems (Informatica, Talend) to Spark-based cloud-native solutions using Glue and EMR,
improving performance, reducing TCO, and enabling more agile, reusable pipelines.
Created monitoring and alerting dashboards using CloudWatch, SNS, and Lambda to ensure pipeline reliability,
latency tracking, and proactive resolution of data pipeline issues.
Worked extensively with SQL to build and tune complex queries, stored procedures, and data models on Redshift and
Databricks SQL to support BI dashboards, regulatory reporting, and ML pipelines.
Integrated streaming and batch workloads using Spark Structured Streaming and AWS Kinesis, handling event time
windows, watermarks, and deduplication for real-time analytics use cases.
Designed cloud-native data lakes using S3 lifecycle policies, partitioned folder structures, and Parquet compression to
minimize storage costs while maintaining high analytical performance.
Led efforts in data quality profiling, duplicate detection, drift monitoring, and schema evolution handling using Glue,
Python, and Great Expectations across multi-source datasets.
Mentored junior data engineers on AWS best practices, Terraform scripting, PySpark development, and debugging
large-scale distributed jobs using Spark UI and logs.
Delivered high-impact data platforms for business-critical applications such as customer analytics, claims processing,
real-time fraud detection, and enterprise reporting across finance and healthcare domains.
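As a minimal, illustrative sketch of the parameterized Glue job pattern mentioned above (assumes an AWS Glue PySpark job environment; the parameter names and S3 paths are hypothetical placeholders, not drawn from any client project):

import sys
from pyspark.context import SparkContext
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job

# Resolve job parameters passed in by the Glue workflow or trigger.
# Parameter names (SOURCE_PATH, TARGET_PATH, PARTITION_KEY) are illustrative only.
args = getResolvedOptions(sys.argv, ["JOB_NAME", "SOURCE_PATH", "TARGET_PATH", "PARTITION_KEY"])

sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

try:
    # Read the raw input (CSV assumed here) and write it back as partitioned Parquet.
    df = (spark.read
          .option("header", "true")
          .csv(args["SOURCE_PATH"]))
    (df.write
       .mode("overwrite")
       .partitionBy(args["PARTITION_KEY"])
       .parquet(args["TARGET_PATH"]))
    job.commit()  # records the job bookmark so incremental runs skip processed data
except Exception as exc:
    # Surface the failure to CloudWatch logs before re-raising so the workflow marks the run failed.
    print(f"Job {args['JOB_NAME']} failed: {exc}")
    raise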
Professional Experience
CVS Health Care, Sacramento, CA/Remote Sep 2021 to Present
Lead AWS Data Engineer
Architected and deployed a scalable, metadata-driven Lakehouse platform on AWS using S3, Glue, Redshift, and
Delta Lake on Databricks, supporting daily ingestion and processing of multi-domain healthcare data.
Developed reusable PySpark Glue jobs to ingest, cleanse, and transform structured and semi-structured data (claims,
prescriptions, billing) from flat files, APIs, and Kafka into optimized Parquet files with enforced schema consistency.
Built centralized Glue Catalog and implemented Lake Formation policies for fine-grained, role-based access control
and metadata governance across functional teams.
Orchestrated automated ETL pipelines using Glue Workflows, Job Bookmarks, and Crawlers to support schema
evolution, incremental loading, and conditional triggering.
Implemented a Medallion (Bronze/Silver/Gold) architecture using Delta Lake on AWS-hosted Databricks to support
curated, reliable, and consumption-ready datasets for analytics and machine learning.
Integrated external data from EHR systems and partner APIs using Lambda-triggered S3 ingestion zones with
deduplication, format validation, and automated metadata tagging.
Designed modular Terraform templates for provisioning AWS resources (Lambda, Redshift, S3, IAM), enforcing
tagging standards, security policies, and environment consistency.
Tuned high-complexity Redshift queries, materialized views, and WLM configurations to optimize performance for
Tableau, Power BI, and executive dashboards.
Migrated legacy Talend-based pipelines to scalable, modular PySpark-based ETL on Glue, achieving better
maintainability, performance gains, and reduced processing time.
Built CI/CD pipelines using GitHub Actions and Terraform Cloud for automated testing, validation, and deployment of
Glue and Lambda jobs across environments.
Enabled near-real-time ingestion from Kafka to Redshift via Kinesis Firehose with Lambda enrichment for latency-
sensitive applications.
Implemented cost optimization strategies including lifecycle policies for S3, concurrency scaling for Redshift, and Glue
job scheduling aligned with data volume patterns.
Developed complex business logic using PySpark and Redshift merge operations to support SCD Type 2, historical
tracking, and CDC-based transformations (a simplified Delta Lake sketch of this pattern appears at the end of this section).
Created end-to-end monitoring using CloudWatch, Lambda, and SNS for job health checks, latency spikes, volume
anomalies, and SLA alerts.
Defined access strategies with IAM and Lake Formation tags, aligning data access policies with data classification
and departmental roles.
Coordinated with InfoSec to enforce encryption (KMS), secure transfers, and audit logging across S3, Redshift, and
inter-service communication.
Documented pipeline metadata, job flows, and operational runbooks using GitHub + Confluence for audit readiness
and team onboarding.
Collaborated with QA and analytics teams to validate data lineage, latency SLAs, and transformation integrity across
the ETL lifecycle.
Automated Redshift UNLOAD/LOAD flows using Step Functions for movement and transformation between S3 zones
and data marts.
Leveraged Redshift Advisor and system tables to optimize performance via WLM tuning, distribution/sort key choices,
and workload segmentation.
Mentored junior engineers in PySpark best practices, Glue job tuning, Terraform scripting, and cost-aware pipeline
design.
Owned incident triaging, failure resolution, and cross-environment provisioning for Glue and Redshift jobs in on-call
rotations.
Contributed to architectural review boards focusing on job dependency modelling, reusability standards, and
performance baselining.
Enabled machine learning integration by exposing curated Delta Lake and Redshift views to data scientists via
Databricks Notebooks.
Supported audit processes with lineage tracking, encryption verification, user access logs, and data retention policies.
Built SLA dashboards to track ETL health, data freshness, and success metrics for key stakeholders.
Developed real-time pipelines using Kinesis and Spark Structured Streaming on EMR, supporting sub-minute latency
analytics to S3 and Redshift.
Reduced redundancy by integrating Redshift Spectrum for direct querying over S3-based Parquet datasets, cutting
storage duplication.
Used Athena and CloudTrail to monitor access patterns, detect anomalies, and support internal and external audit
activities.
Migrated legacy RDBMS workloads (Oracle, SQL Server) into AWS Redshift using SCT, DMS, and PySpark-based
transformations with data validation workflows.
Supported sprint execution by clarifying technical dependencies, reviewing stories, and coordinating cross-team
engineering efforts.
Cognizant/TD Bank, Hyderabad, India Sep 2020 to Aug 2021
Senior AWS Data Engineer
Led the design and implementation of secure, scalable ETL pipelines on AWS to ingest and process financial data
from transactional systems, third-party APIs, and internal legacy platforms into a centralized S3-based data lake.
Migrated batch jobs from on-prem Oracle and Netezza systems into AWS Glue-based PySpark pipelines, achieving a
reduction in job runtime and improved scalability.
Built highly modular Glue workflows integrating Crawlers, Triggers, and PySpark jobs to orchestrate multi-stage
transformations with schema evolution and SCD Type 2 handling for historical banking data.
Designed Redshift-based data marts to support risk analysis, regulatory compliance, and audit reporting; optimized
schema design with distribution keys, sort keys, and late-binding views.
Implemented data ingestion using AWS Lambda and EventBridge to automate file detection, validation, and ingestion
into S3 from internal SFTP and external financial vendors.
Provisioned infrastructure using Terraform scripts to deploy IAM roles, S3 buckets, Glue jobs, Redshift clusters, and
Lake Formation policies across QA and production environments.
Built Spark Structured Streaming pipelines in Databricks to process trade and transaction event streams in real time,
with windowed aggregations and watermarking for accurate latency handling.
Applied encryption and access control policies using AWS KMS, Lake Formation tags, and custom IAM policies to
protect PII and meet FFIEC regulatory compliance.
Collaborated with the DevOps team to create CI/CD pipelines for Glue, Lambda, and Terraform modules using GitHub
Actions, enabling fully automated testing and promotion workflows.
Designed Python-based data validation layers with null checks, referential integrity validations, and threshold-based
anomaly detection on financial transactions (a condensed example follows this section).
Integrated AWS CloudTrail and CloudWatch for monitoring data access, job failures, execution latency, and building
dashboards with alerts for support teams.
Supported Power BI and Tableau dashboards by exposing Redshift and S3 (via Athena) as data sources, publishing
semantic views and certified datasets.
Developed lookup enrichment logic in PySpark to map transaction codes to business categories and risk tiers,
enabling better segmentation in downstream analytics.
Worked closely with audit and compliance teams to define data retention policies, backup strategies, and recovery
plans using S3 versioning and Redshift snapshots.
Built custom Python SDK to simplify onboarding of new data sources into the ETL framework, allowing business
teams to self-serve onboarding with parameterized YAML configs.
Created Spark jobs with robust error handling, retry logic, and audit trails by integrating logging with DynamoDB and
CloudWatch.
Actively participated in Agile ceremonies, including sprint planning, grooming, and retrospectives, contributing story
estimates and technical risk assessments.
Delivered knowledge transfer and training sessions for junior engineers on Delta Lake implementation patterns, Glue
job tuning, and CI/CD workflows.
Developed fallback and quarantine zones for corrupt data files and schema mismatches using Lambda triggers and
tagging workflows in S3.
Wrote optimized SQL using Redshift and Spark SQL for multi-level aggregations and joins across financial, risk, and
product datasets.
Enabled data scientists to pull training features from curated Silver/Gold tables and integrated model scoring into
downstream batch scoring pipelines.
Ensured high availability by designing cross-region S3 replication and integrating Redshift backups with AWS Backup
and CloudFormation.
Created reusable Terraform modules for standard provisioning of Redshift clusters, IAM roles, and Glue job
permissions across departments.
Managed code repository with GitHub branching strategy and enabled pull request enforcement and automated
testing.
Delivered documentation including system architecture, ETL data flow diagrams, job-level readme files, and
infrastructure provisioning guides.
Conducted weekly knowledge-sharing sessions and architecture reviews with data engineering, security, and cloud
operations teams.
Built automated data reconciliation framework using Python and Spark to compare source vs. target datasets across
S3 and Redshift.
Recognized for leading the migration of legacy workflows into modern, cost-effective AWS-native pipelines that
improved SLA adherence and governance.
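A condensed, illustrative sketch of the kind of Python validation layer described above; the column names (txn_id, account_id, amount) and the amount threshold are invented for the example:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

def validate_transactions(df, ref_accounts, null_check_cols, max_amount=1_000_000):
    """Return a dict of rule name -> violation count for a batch of transactions."""
    results = {}

    # Null checks on mandatory columns.
    for col in null_check_cols:
        results[f"null_{col}"] = df.where(F.col(col).isNull()).count()

    # Referential integrity: every account_id must exist in the reference table.
    orphans = df.join(ref_accounts, "account_id", "left_anti")
    results["orphan_account_id"] = orphans.count()

    # Threshold-based anomaly check on transaction amounts.
    results["amount_over_threshold"] = df.where(F.col("amount") > max_amount).count()

    return results

# Usage: fail the pipeline (or route the batch to quarantine) if any rule is violated.
# checks = validate_transactions(txn_df, accounts_df, ["txn_id", "account_id", "amount"])
# assert all(v == 0 for v in checks.values()), f"Data quality violations: {checks}"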
Cognizant/ Duke Energy, Hyderabad, India Jul 2018 – Aug 2020
AWS Data Engineer
Designed and implemented a robust and scalable ETL framework using PySpark on AWS EMR to process batch and
streaming data from smart meters, sensor devices, and billing systems into structured formats for real-time and
historical analysis.
Created a secure, partitioned S3-based data lake to store raw, processed, and curated data layers with lifecycle
policies and Glacier integration to control storage costs and meet retention policies.
Developed modular Glue jobs using PySpark to perform complex data transformations such as nested JSON parsing,
data normalization, and data quality checks before loading into Redshift.
Built and managed multiple AWS Glue workflows with built-in triggers and error handling to automate nightly and real-
time ingestion pipelines, with job status tracked in a centralized logging table.
Configured Redshift clusters with appropriate distribution styles, sort keys, vacuum strategy, and WLM queues to
ensure high-performance query execution for downstream reporting applications.
Designed and implemented streaming data ingestion using AWS Kinesis Data Streams for real-time telemetry data,
integrated with Lambda and Firehose to push data into S3 and Redshift in near real-time.
Architected and deployed infrastructure-as-code (IaC) templates using Terraform to provision Redshift clusters, IAM
roles, VPC endpoints, S3 buckets, Glue jobs, and security groups across dev, test, and prod environments.
Used CloudFormation and custom Terraform modules to standardize and reuse provisioning patterns across multiple
teams, improving consistency and reducing time-to-deliver.
Developed custom Python utilities to validate schema conformity, detect null anomalies, compare record counts, and
generate summary statistics post-ingestion.
Integrated Spark Structured Streaming with Kafka topics for processing near real-time grid usage events, applying
watermarking and windowing to manage late data arrival and duplication (sketched briefly at the end of this section).
Converted legacy Informatica workflows into Glue PySpark equivalents, removing license dependencies and
improving job execution time through Spark optimizations.
Designed CDC logic using AWS Glue bookmarks, custom hash columns, and timestamp-based filtering to support
incremental loads from transactional sources.
Used AWS Lake Formation for access control management on S3 and Redshift data, creating fine-grained table- and
column-level security policies with federated SSO integration.
Integrated CloudWatch with Glue and Lambda for pipeline monitoring and alerting, including setting up custom
metrics, email/SMS notifications, and automatic retries.
Led the implementation of Delta Lake format on Databricks for high-performance analytics on time-series data,
enabling ACID transactions and schema evolution.
Developed Redshift federated queries and materialized views to support unified reporting over transactional (RDS)
and historical (S3) datasets without data duplication.
Created data mart models in Redshift and published curated views to Tableau and Power BI for stakeholder reporting
on consumption patterns and cost insights.
Assisted with POC and rollout of Databricks Lakehouse environment using S3 and Delta Lake for managing high-
frequency IoT data processing.
Participated in sprint planning and architectural design reviews, providing subject-matter expertise on AWS Glue and
EMR performance tuning.
Delivered technical documentation, runbooks, and support handover guides as part of the operationalization and
support transition plan for production systems.
Capgemini/AXA Life Insurance, Hyderabad, India Jul 2015 – May 2018
Python Automation Engineer
Automated ETL workflows using Python and shell scripting to ingest and transform structured and semi-structured
data from APIs, FTP servers, and relational databases.
Designed and implemented custom Python scripts and batch jobs for scheduled data extraction, transformation, and
loading, improving pipeline reliability and maintainability.
Developed modular data validation and cleansing scripts using Python and pandas to ensure data quality before
loading into target data stores (a small example follows this section).
Created automated workflows using cron and custom Python orchestration scripts to schedule, monitor, and recover
ETL jobs, ensuring continuity after failures.
Integrated unit testing frameworks such as unittest and pytest into Python automation scripts to validate code
functionality and data consistency before deployment.
Built reusable Python utilities for processing CSV and Excel files, enabling easy onboarding of new data sources by
business teams.
Developed Python-based logging and alerting mechanisms using email and Slack APIs to notify stakeholders of ETL
job status and errors.
Migrated legacy batch ETL processes from manual and script-based executions to more automated and scheduled
Python workflows, enhancing efficiency and traceability.
Built automated validation scripts in Python to test configuration files, API responses, and system health metrics.
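A small, illustrative example of the pandas-based cleansing utilities and pytest checks mentioned above; the column names are made up for the sketch:

import pandas as pd

def cleanse_policies(df: pd.DataFrame) -> pd.DataFrame:
    """Basic cleansing: trim names, drop rows missing the key, normalize dates.
    Column names (policy_id, holder_name, start_date) are illustrative."""
    out = df.copy()
    out["holder_name"] = out["holder_name"].str.strip()
    out = out.dropna(subset=["policy_id"])
    out["start_date"] = pd.to_datetime(out["start_date"], errors="coerce")
    return out

# pytest-style unit test kept alongside the utility.
def test_cleanse_policies_drops_missing_keys():
    raw = pd.DataFrame({
        "policy_id": ["P1", None],
        "holder_name": ["  Alice ", "Bob"],
        "start_date": ["2017-01-01", "bad-date"],
    })
    cleaned = cleanse_policies(raw)
    assert len(cleaned) == 1
    assert cleaned.iloc[0]["holder_name"] == "Alice"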
Education:
Bachelor of Technology | G Pulla Reddy Engineering College (2011-2015)