Hruthik Reddy - Senior Data Engineer
[email protected]
+1 (913) 326-9754
linkedin.com/in/hruthik-reddy-2356bb296
PROFESSIONAL SUMMARY:
▪ 9+ years of overall IT experience as a Senior Data Engineer, spanning project development, implementation,
deployment, and maintenance using Hadoop ecosystem technologies.
▪ Experienced Data Engineer with extensive expertise in Azure, including managing subscriptions, virtual
machines, SQL Azure Instances, and HDInsight clusters to create robust data solutions.
▪ Experienced in ETL processes in AWS, including data migration from AWS S3 and Parquet files into AWS
Redshift, and utilizing AWS EMR for large-scale data transformations.
▪ Skilled in optimizing ETL performance through Informatica and Autosys, designing complex workflows, and
resolving bottlenecks to enhance data processing efficiency.
▪ Adept at deploying and managing Talend jobs, utilizing the Talend Job Conductor for scheduling and
independent development of ETL processes.
▪ Experienced in Python development, leveraging libraries such as Pandas, NumPy, and SciPy for advanced data
manipulation and analysis throughout the development lifecycle.
▪ Proficient in Spark and Scala, developing applications for high-performance data extraction, transformation,
and aggregation, yielding actionable insights from complex datasets.
▪ Experienced in configuring Kafka security, including SSL and Kerberos, to ensure advanced data security and
seamless integration with external data systems.
▪ Expert in PySpark performance optimization, focusing on memory management, partitioning, and caching
strategies to enhance distributed computing tasks.
▪ Proficient in data migration using SQL, SQL Azure, and Azure Data Factory, streamlining data processes for
Azure subscribers and improving data accessibility and management.
▪ Experienced in data ingestion and analysis using Hadoop tools like Flume, Sqoop, Pig, and MapReduce,
integrating customer behavioural data into HDFS for comprehensive insights.
▪ Proficient in designing and managing MySQL databases, optimizing schemas for high-performance data
ingestion, storage, and retrieval for large-scale applications.
▪ Expert in PostgreSQL database management, focusing on performance tuning, query optimization, and
maintaining data integrity through well-designed schemas and indexing strategies.
▪ Adept at automating data pipeline scheduling with Apache Airflow, utilizing task dependencies and retry
mechanisms to ensure high availability and fault tolerance (an illustrative DAG sketch follows this summary).
▪ Experienced in Snowflake environment management, including data migration from SQL Server and designing
enterprise-level solutions to support scalable data operations.
▪ Skilled in creating and configuring Power BI reports, implementing Row-Level Security (RLS) to restrict data
access and integrating various data sources for comprehensive reporting.
▪ Skilled in real-time data processing using Spark-Streaming APIs, integrating with Kinesis and Cassandra for
near real-time data modeling and persistence.
▪ Proficient in Tableau and SAS for data visualization and reporting, creating custom reports and optimizing
scripts to enhance data analysis and decision-making.
▪ Experienced in SQL, MySQL, and Oracle database management, focusing on schema design, performance
optimization, and ETL integration to improve data management and accessibility.
▪ Proficient in GitHub and JIRA for version control and project management, automating data workflows with
custom scripts and ensuring efficient collaboration and issue tracking.
▪ Team player and self-starter with effective communication, motivation, and organizational skills, combined
with attention to detail and a focus on business process improvement; a hard worker able to meet deadlines on
or ahead of schedule.
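Below is a minimal, illustrative sketch of the Airflow scheduling pattern referenced above (task dependencies with retries). The DAG name, schedule, and task callables are hypothetical placeholders, not taken from any specific project.

# Minimal Airflow DAG sketch: task dependencies with retries (illustrative only).
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("extracting")      # placeholder: pull source data (hypothetical step)


def transform():
    print("transforming")    # placeholder: clean/enrich extracted data (hypothetical step)


def load():
    print("loading")         # placeholder: write results to the warehouse (hypothetical step)


default_args = {
    "retries": 3,                         # retry failed tasks for fault tolerance
    "retry_delay": timedelta(minutes=5),  # wait between retries
}

with DAG(
    dag_id="example_pipeline",            # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Task dependencies: extract -> transform -> load
    t_extract >> t_transform >> t_load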
TECHNICAL SKILLS:
PROFESSIONAL EXPERIENCE:
Environment: Azure, SQL DW, Data Factory, HDInsight, Informatica, Autosys, Talend, Python, pysftp, NumPy, SciPy,
Matplotlib, Beautiful Soup, Pandas, Sqoop, Vertica, Spark, Scala, Spark-SQL, Kafka, SSL, Kerberos, PySpark, Flume,
Pig, Hadoop MapReduce, Hive, Oozie, MySQL, PostgreSQL, Airflow, Snowflake, Power BI, GitHub, JIRA.
Responsibilities:
▪ Used Informatica PowerCenter to extract, transform, and load data into the Netezza data warehouse from various
sources such as Oracle and flat files.
▪ Performed performance tuning of Informatica ETL mappings by using caches, overriding SQL queries, and using
parameter files.
▪ Implemented the Medallion Architecture using Azure Databricks to ensure efficient and scalable data
processing across the Bronze, Silver, and Gold layers. Focused specifically on the Silver layer, transforming
raw, unstructured data into clean and enriched datasets by applying deduplication, normalization, and
business rules (an illustrative PySpark sketch follows this section).
▪ Designed and executed data workflows using Azure Data Factory to migrate on-premises data to the cloud,
leveraging activities like Copy Data, Conditional, Loop, and Web to load data from multiple sources into
Synapse Analytics.
▪ Applied partitioning, PySpark's in-memory capabilities, broadcast variables, and effective, efficient joins
and other transformations during the ingestion process itself.
▪ Transformed raw data from ADLS Gen2 into structured datasets for Azure SQL Server while implementing
secure access management through Azure Key Vault.
▪ Implemented Kafka Connect for seamless integration with external data systems, including databases, cloud
storage, and other data sources.
▪ Optimized Kafka message serialization and deserialization, improving throughput and reducing latency in data
processing workflows.
▪ Analyzed, strategized, and implemented the Azure migration of applications and databases to the cloud.
▪ Troubleshot and identified performance, connectivity, and other issues for applications hosted on the Azure
platform.
▪ Developed PySpark scripts to encrypt raw data by applying hashing algorithms to client-specified columns
(an illustrative sketch follows this section).
▪ Involved in Storm batch-mode processing over massive data sets, which is analogous to a Hadoop job that runs
as a batch process over a fixed data set.
▪ Extracted, transformed, and loaded data using an Access database and conducted data analysis through process
modelling, data mining, data modelling, and data mapping using Azure Databricks.
▪ Designed, developed, and maintained MySQL databases to support data storage, retrieval, and management
needs for large-scale applications, ensuring optimized performance and data integrity.
▪ Developed stored procedures, triggers, and functions in PostgreSQL to automate data processing tasks and
improve workflow efficiency.
▪ Implemented row-level security in Azure Synapse to help prevent unauthorized access when users query
data from the same tables but must see different subsets of data.
▪ Optimized Power BI reports by simplifying data models and reducing load times, enhancing user experience
and report performance.
▪ Stored and retrieved data from data warehouses using Microsoft Azure Synapse Analytics.
▪ Managed sprint planning, backlog refinement, and task prioritization for data engineering projects using Jira
Agile boards.
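As referenced in the Medallion Architecture bullet above, the following is a minimal PySpark sketch of a Bronze-to-Silver transformation on Databricks. The paths, column names, and use of the Delta format are illustrative assumptions only.

# Minimal PySpark sketch of a Bronze-to-Silver transformation (illustrative only).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("silver_layer_example").getOrCreate()

# Hypothetical Bronze-layer path and column names.
bronze_df = spark.read.format("delta").load("/mnt/bronze/customers")

silver_df = (
    bronze_df
    .dropDuplicates(["customer_id"])                       # deduplication
    .withColumn("email", F.lower(F.trim(F.col("email"))))  # normalization
    .withColumn("load_date", F.current_date())             # simple audit column
    .filter(F.col("customer_id").isNotNull())               # basic business rule
)

# Write the cleaned, enriched dataset to the Silver layer.
silver_df.write.format("delta").mode("overwrite").save("/mnt/silver/customers")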
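The following is a minimal sketch of the kind of PySpark column hashing described above. The input path, output path, and list of client-specified columns are hypothetical.

# Minimal PySpark sketch: hashing client-specified columns (illustrative only).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("column_hashing_example").getOrCreate()

# Hypothetical input path and client-specified column list.
df = spark.read.parquet("/mnt/raw/input_data")
sensitive_columns = ["ssn", "email", "phone"]

for column in sensitive_columns:
    # Replace each raw value with its SHA-256 digest.
    df = df.withColumn(column, F.sha2(F.col(column).cast("string"), 256))

df.write.mode("overwrite").parquet("/mnt/processed/hashed_data")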
Environment: Azure, S3, Parquet, Redshift, EMR, DynamoDB, Informatica PowerCenter, Netezza, Oracle, flat files,
Talend, APIs, cloud storage, Python, HDFS, Spark-Streaming, Kinesis, Cassandra, PySpark, Spark SQL, Kafka
Connect, MySQL, PostgreSQL, Airflow, Kubernetes, Snowflake, Power BI, GitHub, Jira.
Responsibilities:
▪ Designed and developed ETL processes in AWS using SQL to migrate campaign data from external sources such as
AWS S3 and Parquet text files into AWS Redshift.
▪ Designed and developed complex ETL mappings using Informatica PowerCenter to integrate and transform
data from various sources, ensuring high data quality and accuracy.
▪ Optimized Talend jobs for performance improvements, ensuring efficient data processing for real-time and
batch operations.
▪ Designed and maintained databases and developed Python-based applications using Flask, SQLAlchemy, PL/SQL,
and PostgreSQL.
▪ Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and
databases, such as AWS S3 and DynamoDB.
▪ The system is a full microservices architecture written in Python, utilizing distributed message passing via
Kafka with JSON as the data exchange format (an illustrative sketch follows this section).
▪ Applied Spark best practices such as partitioning, caching, and checkpointing for performance, along with UDFs.
▪ Designed and maintained automated PySpark workflows for batch processing, ensuring high data quality and
reliability through thorough testing, logging, and monitoring mechanisms.
▪ Involved in designing and deploying a multitude of applications utilizing almost the entire AWS stack (including
EC2, S3, AMI, Route 53, RDS, SNS, SQS, IAM), focusing on high availability, fault tolerance, and auto-scaling with
AWS CloudFormation.
▪ Managed and reviewed Hadoop log files as part of administration for troubleshooting purposes; communicated
and escalated issues appropriately.
▪ Developed and maintained complex SQL queries, stored procedures, and triggers to streamline data processing
and transformation tasks across multiple MySQL databases.
▪ Designed and implemented ETL jobs using AWS Glue to ingest vendor data from diverse sources, performing
data cleaning, imputation, and mapping, and storing results in S3 buckets for querying via AWS Athena (an
illustrative Glue job sketch follows this section).
▪ Implemented Airflow's role-based access control (RBAC) for managing user permissions and securing DAG
access within the team.
▪ Developed data intelligence solutions around Snowflake Data Warehouse and architected Snowflake solutions
as a developer.
▪ Created on-demand tables from S3 files using AWS Lambda functions and Glue with Python and PySpark.
▪ Established AWS RDS as a Hive metastore, consolidating EMR cluster metadata to prevent data loss upon EMR
termination.
▪ Transferred data from HDFS to MongoDB using Pig, Hive, and MapReduce scripts, and visualized the streaming
data in dashboards.
▪ Built and maintained AWS data pipelines utilizing Change Data Capture (CDC) strategies to manage incremental
data loads and maintain data accuracy across various environments.
▪ Monitored and resolved merge conflicts in Git repositories, ensuring smooth integration of code from multiple
branches.
▪ Created custom Tableau reports and visualizations tailored to specific business requirements, improving data
accessibility and understanding.
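As noted in the microservices bullet above, here is a minimal sketch of JSON message passing over Kafka using the kafka-python client. The broker address and topic name are illustrative assumptions.

# Minimal sketch: passing JSON messages between services via Kafka (illustrative only).
import json
from kafka import KafkaProducer, KafkaConsumer   # kafka-python client

# Hypothetical broker address and topic name.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda payload: json.dumps(payload).encode("utf-8"),
)
producer.send("campaign-events", {"campaign_id": 123, "status": "processed"})
producer.flush()

consumer = KafkaConsumer(
    "campaign-events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
for message in consumer:
    print(message.value)   # each value is the original JSON payload as a dict
    break                  # stop after one message for the sketch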
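As referenced in the AWS Glue bullet above, the following is a minimal Glue PySpark job sketch: read vendor data from the Glue Data Catalog, apply simple cleaning, imputation, and column mapping, and write Parquet to S3 for querying with Athena. The database, table, bucket, and column names are hypothetical.

# Minimal AWS Glue PySpark job sketch (illustrative only).
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Hypothetical Glue Data Catalog database/table for incoming vendor files.
vendor_dyf = glue_context.create_dynamic_frame.from_catalog(
    database="vendor_db", table_name="raw_vendor_data"
)
vendor_df = vendor_dyf.toDF()

cleaned_df = (
    vendor_df
    .dropDuplicates()                                 # basic cleaning
    .fillna({"country": "UNKNOWN"})                   # simple imputation example
    .withColumnRenamed("vendor_nm", "vendor_name")    # mapping to a target schema
)

# Write Parquet to S3 so the data can be queried with Athena (hypothetical bucket).
cleaned_df.write.mode("overwrite").parquet("s3://example-bucket/clean/vendor_data/")

job.commit()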
Environment: Data Factory, AWS, Informatica PowerCenter, Talend, Python, Flask, SQLAlchemy, PL/SQL,
PostgreSQL, Kafka, JSON, Spark, PySpark, Hadoop, SQL, MySQL, Airflow, Snowflake Data Warehouse, HDFS,
MongoDB, Pig, Hive, MapReduce, Git, Tableau.
Info Logitech Systems - Hyderabad, India July 2015 to Oct 2017
Data Engineer
Responsibilities:
▪ Designed and implemented Amazon Web Services solutions as a passionate advocate of AWS within Gracenote,
migrating from a physical data centre environment.
▪ Handled Hadoop cluster installation, configuration, and maintenance, as well as cluster monitoring,
troubleshooting, and certifying environments for production readiness.
▪ Installed and configured a multi-node cluster on AWS EC2 and managed it using AWS tools like CloudWatch
and CloudTrail, storing log files in S3.
▪ Developed views and templates with Python and Django's view controller and templating language to create a
user-friendly website interface.
▪ Developed and implemented solutions using advanced AWS components such as EMR, EC2, Redshift, S3,
Athena, Glue, Lambda, and Kinesis.
▪ Developed routine SAS macros to create tables, graphs, and listings for inclusion in clinical study reports and
regulatory submissions, and maintained existing macros.
▪ Developed and implemented data ingestion and storage solutions with AWS S3 and AWS Redshift.
▪ Integrated MySQL databases with ETL pipelines, automating the extraction, transformation, and loading of data
across various sources for streamlined reporting and analytics.
▪ Integrated AWS DynamoDB with AWS Lambda to store item values and back up DynamoDB streams (an illustrative
Lambda sketch follows this section).
▪ Connected Tableau to various data sources such as PostgreSQL, SQL Server, and cloud databases, ensuring
accurate and up-to-date data representation.
▪ Optimized existing SAS scripts and processes, significantly improving the efficiency of data pipelines and
reducing processing time.
▪ Led the migration of a quality monitoring tool from AWS EC2 to AWS Lambda and built logical datasets for
quality monitoring in Snowflake warehouses.
▪ Created and developed ETL processes with Informatica 10.4 to extract and load data from various sources such
as Oracle, flat files, Salesforce, and AWS Cloud.
▪ Developed and maintained complex Excel spreadsheets for tracking, analyzing, and reporting on large-scale
data sets, optimizing operational efficiency.
▪ Created comprehensive project reports and status updates in MS Word, communicating progress, challenges,
and solutions to stakeholders.
▪ Managed user permissions and security roles within SQL databases to control access and protect sensitive data.
▪ Applied Kimball dimensional modeling and data warehousing concepts to design scalable data warehouses on
AWS Redshift, Snowflake, and other data storage platforms, supporting complex reporting and analysis
requirements.
▪ Designed and implemented Oracle database schemas to meet business requirements, including tables, views,
indexes, and constraints.
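As referenced in the DynamoDB/Lambda bullet above, the following is a minimal sketch of a Lambda handler that backs up DynamoDB stream records to S3. The bucket name, key prefix, and environment variable are hypothetical assumptions.

# Minimal Lambda sketch: back up DynamoDB stream records to S3 (illustrative only).
import json
import os
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
BACKUP_BUCKET = os.environ.get("BACKUP_BUCKET", "example-dynamodb-backups")  # hypothetical bucket


def lambda_handler(event, context):
    # Each invocation receives a batch of DynamoDB stream records.
    records = [
        record["dynamodb"].get("NewImage", {})
        for record in event.get("Records", [])
        if record.get("eventName") in ("INSERT", "MODIFY")
    ]
    if not records:
        return {"backed_up": 0}

    # Write the batch as a single JSON object keyed by a UTC timestamp.
    key = "backups/{}.json".format(datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%f"))
    s3.put_object(Bucket=BACKUP_BUCKET, Key=key, Body=json.dumps(records))
    return {"backed_up": len(records)}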
Environment: Amazon Web Services, AWS S3, AWS Redshift, Hadoop, Python, Django, SAS, MySQL, ETL, Tableau,
PostgreSQL, SQL Server, Excel, MS Word, SQL, Oracle.