Deepak (Sr. Data Engineer)
TECHNICAL SKILLS:
Big Data Technologies: Apache Hadoop, Apache Spark (Core, SQL, Streaming), MapReduce, HDFS, Hive, Kafka, Sqoop,
Oozie, AWS EMR, Azure HDInsight, Azure Databricks.
Programming Languages: Python, Scala, Java, SQL
Data Analytics and Machine Learning: NumPy, Pandas, Matplotlib, PySpark, TensorFlow, PyTorch
Cloud Platforms: Amazon Web Services (AWS): EC2, S3, Redshift, Glue, Athena; Microsoft Azure: HDInsight, Databricks,
Data Lake, Blob Storage, Data Factory, SQL DB, SQL DWH, Cosmos DB
Data Warehousing and Database Technologies: DynamoDB, HBase, Azure Synapse Analytics (DW), Azure SQL DB, Oracle,
MySQL, Teradata, PostgreSQL
Data Ingestion and Integration: AWS Kinesis, Apache Airflow, AWS Data Firehose, Apache Kafka, Azure Event Hub, Azure
Data Factory
Data Visualization and Reporting: Power BI, Tableau
DevOps and CI/CD: Git, Jenkins, Docker, Maven, Azure DevOps
Scripting and Automation: Shell Scripting, AWS Lambda, Azure Functions
Data Formats and Compression: Avro, Parquet, ORC, JSON, Snappy, zlib
Data Governance and Security: AWS IAM, Azure Active Directory, Sentry, Ranger, Kerberos
Monitoring and Logging: Prometheus, Grafana, ELK Stack
Deployment and Orchestration: Kubernetes, Terraform, AWS CloudFormation
Agile Methodologies: Scrum, Kanban
Integrated Development Environments (IDEs): IntelliJ IDEA, Eclipse
Version Control Systems: Git, GitHub, Azure Repos, Clear Case
Tools and Technologies: UNIX/Linux, Maven, IBM WebSphere, IBM MQ, Toad, SoapUI
PROFESSIONAL EXPERIENCE:
AWS Data Engineer
Truist Bank, Charlotte, NC Sep 2022 to Present
Responsibilities:
Participated with the technical staff, business managers, and practitioners in the business unit to determine the
requirements and functionality needed for each project.
Performed wide and narrow transformations and actions such as filter, lookup, join, and count on Spark DataFrames.
Worked with Parquet and ORC files using PySpark, and with Spark Streaming over DataFrames.
Developed batch and streaming processing apps using Spark APIs for functional pipeline requirements.
Automated data storage from streaming sources to AWS data stores such as S3, Redshift, and RDS by configuring AWS
Kinesis Data Firehose.
Performed analytics on streamed data using the real-time integration capabilities of AWS Kinesis Data Streams.
Created PySpark code that uses Spark SQL to generate DataFrames from the Avro-formatted raw layer and writes them to
data-service-layer internal tables in Parquet format (see the illustrative sketch after this role's Environment line).
Generated workflows in Apache Airflow and Apache Oozie to schedule the Hadoop jobs that control large data
transformations.
Imported and exported data between HDFS/Hive and relational databases, including Teradata, using Sqoop.
Involved in creating Hive tables, loading data, and analyzing it with Hive queries.
Created and configured EC2 instances on AWS (Amazon Web Services) to establish clusters in the cloud.
Worked on a CI/CD solution using Git, Jenkins, and Docker to set up and configure the big data architecture on the AWS cloud platform.
Configured Kafka brokers and clusters to ensure high availability and fault tolerance.
Developed Kafka producers and consumers in Java, Python, and Scala for real-time data processing.
Implemented Kafka Streams for stream processing applications, enabling near real-time analytics.
Designed and optimized Kafka topics, partitions, and offsets for efficient data ingestion and distribution.
Integrated Kafka with various data sources and sinks such as databases, message queues, and file systems.
Deployed scalable applications on Google Cloud Platform (GCP) using Kubernetes Engine.
Managed GCP resources efficiently to optimize costs for projects.
Implemented CI/CD pipelines on GCP using tools like Cloud Build and Jenkins.
Utilized GCP's BigQuery for data analytics and generating actionable insights.
Configured and maintained Google Cloud Storage for secure and scalable data storage solutions.
Developed comprehensive technical documentation for architecture, deployment processes, and system configurations.
Collaborated with cross-functional teams to create user-friendly documentation for end-users, ensuring effective
knowledge transfer.
Implemented robust monitoring solutions using tools like Prometheus, Grafana, and ELK stack to ensure the availability
and performance of critical systems.
Conducted regular performance assessments and capacity planning to optimize resource utilization and cost efficiency.
Designed and implemented infrastructure to support the deployment and scaling of machine learning models in
production environments.
Integrated GCP services with third-party applications for seamless workflow automation.
Leveraged GCP's Compute Engine to provision and manage virtual machines as per project requirements.
Implemented Identity and Access Management (IAM) policies on GCP for robust security measures.
Integrated machine learning pipelines with CI/CD processes to automate model training, testing, and deployment.
Implemented end-to-end CI/CD pipelines, utilizing Jenkins, GitLab CI, and other automation tools to streamline software
delivery processes.
Established a culture of collaboration between development and operations teams, fostering continuous improvement
and efficiency.
Established and maintained a comprehensive data governance framework, defining data ownership, stewardship, and
lifecycle management processes.
Implemented data quality monitoring tools and processes to proactively identify and address data inconsistencies and
anomalies.
Developed and enforced data security policies and procedures to ensure compliance with industry regulations and
standards, such as GDPR and HIPAA.
Implemented robust encryption mechanisms for data in transit and at rest, enhancing overall data protection measures.
Led the design and implementation of a scalable and resilient cloud architecture, ensuring high availability and fault
tolerance.
Orchestrated the migration of legacy on-premises systems to cloud environments, optimizing resource utilization and
reducing operational costs.
Analyzed the SQL scripts and designed the solution to implement using PySpark.
Experienced in writing Spark Applications in Scala and Python (PySpark).
Implemented Spark applications in Python to perform advanced procedures such as text analytics and processing, using
DataFrames and the Spark SQL API with Spark's in-memory computing capabilities for faster data processing.
Wrote Spark-Streaming applications to consume the data from Kafka topics and write the processed streams to HBase.
Used Spark APIs in place of MapReduce to perform analytics on data in Hive and HBase tables.
Worked with AWS Lambda to run code without managing servers and to trigger execution from S3 and SNS events.
Integrated a Kafka publisher into Spark jobs to capture errors from the Spark application and push them into a database.
Orchestrated containerized applications using Kubernetes, ensuring high availability, scalability, and fault tolerance.
Utilized Kubernetes to automate deployment, scaling, and management of containerized microservices in production
environments.
Configured Kubernetes clusters for optimal resource utilization and workload distribution, maximizing efficiency and cost-
effectiveness.
Implemented advanced networking features in Kubernetes to enable seamless communication between microservices
across clusters.
Applied NLTK for text preprocessing tasks such as tokenization, stemming, and lemmatization to prepare text data for
analysis.
Utilized spaCy for advanced text processing tasks including named entity recognition, part-of-speech tagging, and
dependency parsing.
Implemented Gensim for topic modeling and document similarity tasks, extracting meaningful insights from textual data.
Developed custom NLP pipelines using NLTK, spaCy, and Gensim to address specific business requirements and
challenges.
Proficient in Plotly for creating interactive and dynamic visualizations, enhancing data exploration and presentation.
Utilized Seaborn to generate statistical graphics, facilitating data analysis and interpretation.
Experienced in leveraging D3.js to develop custom and intricate visualizations tailored to specific data requirements.
Employed Plotly, Seaborn, and D3.js collectively to create comprehensive and visually compelling dashboards for
stakeholders.
Ensured compliance with GDPR regulations by implementing data protection policies, conducting privacy impact
assessments, and ensuring data subjects' rights were upheld.
Implemented HIPAA-compliant data handling procedures to safeguard protected health information (PHI), ensuring
confidentiality, integrity, and availability of healthcare data.
Utilized data governance tools such as Collibra and Informatica to establish data quality standards, metadata
management, and data lineage documentation.
Conducted data audits to assess compliance with regulatory requirements and internal data governance policies,
identifying areas for improvement and risk mitigation.
Applied various statistical methods including hypothesis testing, regression analysis, and ANOVA to analyze relationships
and patterns within datasets.
Developed predictive models using regression techniques such as linear regression, logistic regression, and polynomial
regression to infer insights and make informed decisions.
Utilized Bayesian statistics to quantify uncertainty and incorporate prior knowledge into modeling processes, improving
the robustness of statistical analyses.
Employed survival analysis techniques like Kaplan-Meier estimation and Cox proportional hazards model to analyze time-
to-event data in medical research and customer churn analysis.
Implemented decision trees to analyze and classify complex datasets, optimizing decision-making processes in various
applications.
Developed neural networks for predictive modeling tasks, enhancing accuracy and efficiency in pattern recognition and
forecasting.
Utilized clustering techniques such as K-means and hierarchical clustering to identify underlying structures within data,
facilitating segmentation and targeted marketing strategies.
Employed ensemble learning methods combining decision trees, neural networks, and other algorithms to improve model
robustness and predictive performance.
Deployed applications and services on Google Cloud Platform (GCP) utilizing Compute Engine and Kubernetes Engine.
Leveraged GCP's BigQuery for data warehousing and analytics to handle large datasets efficiently.
Utilized Google Cloud Storage (GCS) for scalable and durable object storage of data assets.
Implemented serverless computing with Google Cloud Functions for event-driven applications.
Communicated effectively with stakeholders to gather requirements and provide project updates.
Collaborated with team members to brainstorm solutions and address challenges.
Demonstrated strong problem-solving skills by analyzing issues and proposing effective solutions.
Acted as a mediator in resolving conflicts and fostering a positive team environment.
Orchestrated data pipelines using tools like Apache Airflow to automate data workflows.
Implemented data versioning using Git or Apache Parquet to track changes and ensure reproducibility.
Managed schema evolution using tools like Apache Avro or Protobuf to maintain data consistency.
Utilized data lineage tracking to understand the origin and transformation of data across pipelines.
Implemented parallel processing using Apache Spark to optimize big data processing.
Utilized data partitioning strategies in Hadoop MapReduce for efficient data processing.
Employed compression techniques such as snappy and gzip to reduce storage and enhance processing speed.
Applied caching mechanisms like Redis or Memcached to store frequently accessed data for faster retrieval.
Environment: Hadoop, Spark, Hive, HDFS, Kafka, UNIX, Shell, AWS services, Python, Scala, Glue, Oozie, SQL.
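Illustrative sketch (not project code), assuming Spark with Hive support and the spark-avro package: a minimal PySpark job of the kind described in this role, reading the Avro raw layer, applying a Spark SQL transformation, and writing a Parquet service-layer table. All bucket, database, table, and column names are hypothetical placeholders.

    from pyspark.sql import SparkSession

    # Minimal sketch; paths, table names, and columns are hypothetical.
    spark = (
        SparkSession.builder
        .appName("raw-to-service-layer")
        .enableHiveSupport()
        .getOrCreate()
    )

    # Read Avro files from the raw layer on S3 (requires spark-avro).
    raw_df = spark.read.format("avro").load("s3://example-raw-bucket/events/")

    # Apply a Spark SQL transformation before publishing.
    raw_df.createOrReplaceTempView("raw_events")
    service_df = spark.sql("""
        SELECT event_id,
               event_type,
               CAST(event_ts AS timestamp) AS event_ts,
               to_date(event_ts)           AS event_date
        FROM raw_events
        WHERE event_type IS NOT NULL
    """)

    # Write to a data-service-layer internal table in Parquet format.
    (service_df.write
        .mode("overwrite")
        .partitionBy("event_date")
        .format("parquet")
        .saveAsTable("service_layer.events"))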
Data Engineer
Homesite Insurance, Boston, MA June 2020 to Aug 2022
Responsibilities:
Responsible for building scalable distributed data solutions using Hadoop.
Worked on Apache Spark using Scala and Python.
Extensively used the Spark Core and Spark SQL libraries to perform transformations on the data.
Used Azure HDInsight clusters to process Spark applications.
Designed and developed Azure Data Lake Storage, placing files from various sources into the data lake, and built AWS
Glue jobs that read datasets from various data sources and perform transformations.
Developed data infrastructure and tools, with familiarity with current large-scale data processing technologies such as
TensorFlow and PyTorch.
Stored data from source systems in the data lake and processed it using Spark.
Collected business requirements and designed and implemented Spark applications in Scala, using the IntelliJ IDE with
Maven builds.
Utilized GCP's Pub/Sub for real-time messaging and event-driven architectures.
Designed and implemented serverless applications on GCP using Cloud Functions.
Configured networking services like VPC and Cloud DNS to establish secure communication within GCP environments.
Employed GCP's Dataflow for stream and batch processing of large datasets.
Developed and deployed microservices architecture on GCP using Cloud Run.
Conducted performance optimization and tuning of GCP services for enhanced efficiency.
Implemented logging and monitoring solutions on GCP using Stackdriver.
Configured load balancing and auto-scaling features on GCP for high availability and reliability.
Used a variety of Spark transformations and functions to shape the data.
Worked on performance tuning of the existing Spark applications.
Developed enterprise-level utilities on Spark.
Continuously monitored and managed the Hadoop cluster through the YARN UI.
Experienced in performance tuning of Spark Applications for the correct level of Parallelism and memory tuning.
Experienced in writing shell scripts to process the jobs.
Extensively used Accumulators and Broadcast variables to fine-tune the spark applications and to monitor the spark jobs.
Performed advanced procedures like text analytics and processing, using the in-memory computing capabilities of Spark.
Conducted regular security audits and vulnerability assessments to identify and mitigate potential risks to sensitive data.
Collaborated with cross-functional teams to ensure alignment between security measures and overall business objectives.
Facilitated data profiling and metadata management initiatives to enhance the overall quality and reliability of
organizational data.
Collaborated with business units to define and implement data classification and categorization policies.
Implemented cloud-native solutions, leveraging services such as AWS Lambda, Azure Functions, and Google Cloud
Functions for efficient and cost-effective application development.
Designed and executed cloud deployment strategies, utilizing Infrastructure as Code (IaC) tools like Terraform and AWS
CloudFormation.
Implemented security features like SSL/TLS encryption and SASL authentication for securing Kafka clusters.
Monitored Kafka clusters using tools like Kafka Manager, Prometheus, and Grafana to ensure optimal performance.
Implemented Kafka Connect for seamless integration with external systems, enabling data pipelines.
Managed schema evolution using Schema Registry to maintain compatibility in Kafka data streams.
Configured Kafka MirrorMaker for data replication and disaster recovery across multiple data centers.
Integrated automated testing processes into the CI/CD pipelines to ensure code quality and reduce the risk of deployment
failures.
Implemented version control best practices, branching strategies, and code review processes for efficient and
collaborative development.
Collaborated with data scientists to deploy models in cloud environments, ensuring optimal performance and resource
utilization.
Implemented model monitoring and logging solutions to track model performance and detect deviations over time.
Implemented automated scaling strategies for applications based on real-time performance metrics and user demand.
Conducted root cause analysis for performance issues and implemented corrective actions to improve system reliability.
Facilitated regular knowledge-sharing sessions and workshops to promote best practices in cloud architecture and
deployment.
Acted as a liaison between technical and non-technical stakeholders, translating complex technical concepts into
understandable terms.
Handled large datasets during the ingestion process itself using partitions, Spark's in-memory capabilities, broadcast
variables, and effective and efficient joins and transformations.
Upgraded the existing HDInsight code to use Azure Databricks for better performance, since Databricks provides
optimized clusters.
Used Spark Streaming with Kafka and Azure Event Hubs services (see the streaming sketch after this role's Environment line).
Used Azure Data Factory to extract and copy data from on-premises relational databases into the data lake, and accessed
the stored data from the lake for downstream processing.
Scheduled Azure Databricks jobs and notebooks from Azure Data Factory.
Performed basic transformations on the data using Data Factory.
Triggered Data Factory pipelines and created the required connections.
Configured and managed Azure Databricks clusters; extensively used Delta Lake storage.
Used the reporting tool Power BI to connect to Hive and generate daily data reports.
Implemented indexing on large datasets to improve query performance in distributed systems.
Leveraged query optimization techniques in databases like Hive or Impala for faster data retrieval.
Employed clustering algorithms to distribute data evenly across nodes for better load balancing.
Employed data quality checks and monitoring to ensure accuracy and reliability of datasets.
Implemented data partitioning strategies to optimize querying performance in distributed systems.
Leveraged columnar storage formats like Apache ORC or Apache Arrow for efficient data storage and retrieval.
Adapted communication style to effectively convey technical concepts to non-technical stakeholders.
Demonstrated empathy and active listening to understand and address team members' concerns.
Prioritized tasks and managed time effectively to meet project deadlines.
Utilized Google Cloud Pub/Sub for real-time messaging and event-driven architectures.
Leveraged Alibaba Cloud ECS for scalable and flexible cloud computing resources.
Utilized Alibaba Cloud Object Storage Service (OSS) for secure and reliable data storage.
Conducted feature engineering to enhance the performance of machine learning models, utilizing techniques like
Principal Component Analysis (PCA) and feature scaling.
Fine-tuned hyperparameters of machine learning models using techniques like grid search and cross-validation to
optimize performance and generalization.
Implemented anomaly detection algorithms to identify outliers and irregularities in data, enhancing fraud detection and
risk mitigation strategies.
Conducted A/B testing to assess the effectiveness of interventions or changes, ensuring data-driven decision-making in
experimental settings.
Applied multivariate analysis techniques such as factor analysis and principal component analysis (PCA) to reduce
dimensionality and identify underlying patterns in complex datasets.
Developed econometric models to analyze economic relationships and forecast future trends in financial markets and
macroeconomic indicators.
Implemented access control mechanisms and encryption techniques to protect sensitive data and prevent unauthorized
access or data breaches.
Developed data retention policies in accordance with regulatory requirements and business needs, ensuring appropriate
data lifecycle management.
Collaborated with legal and compliance teams to interpret and apply regulatory requirements to data management
practices, mitigating legal and reputational risks.
Integrated Plotly, Seaborn, and D3.js into data pipelines to automate visualization processes, ensuring efficiency and
consistency.
Customized visualization elements using Plotly, Seaborn, and D3.js to convey complex insights effectively to diverse
audiences.
Conducted training sessions to educate team members on advanced features and best practices of Plotly, Seaborn, and
D3.js.
Integrated NLTK, spaCy, and Gensim into machine learning workflows for text classification and sentiment analysis tasks.
Conducted sentiment analysis on social media data using NLTK, spaCy, and Gensim to gauge public opinion and trends.
Leveraged NLTK, spaCy, and Gensim for information extraction from unstructured text sources such as news articles and
customer reviews.
Leveraged Kubernetes for blue-green deployments and canary releases, minimizing downtime and risk during application
updates.
Managed Kubernetes configurations and secrets to ensure secure storage and access control for sensitive data and
credentials.
Monitored and optimized Kubernetes clusters using built-in monitoring tools and third-party solutions to ensure
performance and reliability.
Environment: Python, Spark, Spark SQL, Scala, Azure HDInsight, Azure Data Lake, Azure Databricks, Azure Data Factory,
Azure Event Hub, Kafka, Data Flow & Data Lineage, Data Modeling, Power BI Desktop, Oracle, SQL Server, HDFS, YARN.
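Illustrative sketch (not project code), assuming a Databricks/Delta Lake runtime and the Kafka-compatible endpoint of Azure Event Hubs: a minimal PySpark Structured Streaming job of the kind described in this role. The namespace, topic, payload schema, and storage paths are hypothetical placeholders, and the SASL options needed for Event Hubs authentication are omitted for brevity.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StringType, StructField, StructType, TimestampType

    # Minimal sketch; endpoint, topic, schema, and paths are placeholders.
    spark = SparkSession.builder.appName("eventhub-stream-to-delta").getOrCreate()

    payload_schema = StructType([
        StructField("policy_id", StringType()),
        StructField("event_type", StringType()),
        StructField("event_ts", TimestampType()),
    ])

    # Read from the Event Hubs Kafka-compatible endpoint (SASL auth omitted).
    stream_df = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "example-ns.servicebus.windows.net:9093")
        .option("subscribe", "policy-events")
        .option("startingOffsets", "latest")
        .load()
    )

    # Parse the JSON value column into typed fields.
    parsed_df = (
        stream_df
        .select(from_json(col("value").cast("string"), payload_schema).alias("payload"))
        .select("payload.*")
    )

    # Append the parsed stream to a Delta Lake table with checkpointing.
    query = (
        parsed_df.writeStream
        .format("delta")
        .option("checkpointLocation", "/mnt/datalake/checkpoints/policy_events")
        .outputMode("append")
        .start("/mnt/datalake/delta/policy_events")
    )
    query.awaitTermination()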
Hadoop Developer
Careator Technologies Pvt Ltd, Hyderabad, India Mar 2017 to July 2018
Responsibilities:
Involved in importing data from Microsoft SQL Server, MySQL, Teradata into HDFS using Sqoop.
Developed workflow in Oozie to automate the tasks of loading the data into HDFS.
Used Hive to analyze the partitioned and bucketed data to compute various reporting metrics.
Involved in creating Hive tables, loading data, and writing queries that run internally as MapReduce jobs.
Involved in creating Hive external tables for HDFS data.
Solved performance issues in Hive and PySpark scripts with an understanding of joins, grouping, and aggregation, and
how they execute as MapReduce jobs.
Worked with Spark to improve the performance and optimize existing Hadoop algorithms using SparkContext, Spark SQL,
Spark MLlib, DataFrames, pair RDDs, and Spark on YARN.
Implemented end-to-end ETL pipelines using Python and SQL for high-volume analytics. Reviewed use cases before
onboarding to HDFS.
Automated deployments and routine tasks using UNIX shell scripting.
Used Spark for transformations, event joins and some aggregations before storing the data into HDFS.
Troubleshot and resolved data quality issues to maintain a high level of accuracy in the data being reported.
Analyzed large datasets to determine the optimal way to aggregate them.
Worked on the Oozie workflow to run multiple Hive jobs.
Worked on creating custom Hive UDFs.
Developed automated shell script to execute Hive Queries.
Involved in processing ingested raw data using Python.
Monitored continuously and managed the Hadoop cluster using Cloudera manager.
Worked with different file formats such as JSON, Avro, ORC, and Parquet, and compression codecs such as Snappy, zlib, and LZ4.
Involved in converting Hive/SQL queries into Spark transformations using DataFrames (see the sketch after this role's Environment line).
Gained knowledge of creating Tableau dashboards for reporting on analyzed data.
Expertise with NoSQL databases like HBase.
Experienced in managing and reviewing the Hadoop log files.
Used GitHub as repository for committing code and retrieving it and Jenkins for continuous integration.
Environment: HDFS, MapReduce, Sqoop, Hive, Spark, Oozie, MySQL, Eclipse, Git, GitHub, Jenkins.
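Illustrative sketch (not project code), assuming Spark with Hive support: converting a Hive/SQL aggregation into equivalent Spark DataFrame transformations, as described in this role. The database, table, and column names are hypothetical placeholders.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Minimal sketch; database, table, and column names are placeholders.
    spark = (
        SparkSession.builder
        .appName("hive-to-dataframe")
        .enableHiveSupport()
        .getOrCreate()
    )

    # Hive-style query, runnable unchanged through Spark SQL.
    sql_result = spark.sql("""
        SELECT region, COUNT(*) AS orders, SUM(amount) AS revenue
        FROM sales_db.orders
        WHERE order_date >= '2018-01-01'
        GROUP BY region
    """)
    sql_result.show()

    # Equivalent DataFrame transformations on the same Hive table.
    df_result = (
        spark.table("sales_db.orders")
        .filter(F.col("order_date") >= "2018-01-01")
        .groupBy("region")
        .agg(F.count("*").alias("orders"), F.sum("amount").alias("revenue"))
    )

    df_result.write.mode("overwrite").saveAsTable("sales_db.orders_by_region")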
Application Developer
Couth Infotech Pvt. Ltd, Hyderabad, India Sep 2015 to Feb 2017
Responsibilities:
Involved in various stages of Enhancements in the Application by doing the required analysis, development, and testing.
Prepared the high- and low-level design documents and worked on digital signature generation.
Created use case, class, and sequence diagrams for the analysis and design of the application.
Developed the logic and code for registration and validation of enrolling customers.
Developed web-based user interfaces using the Struts framework.
Handled client-side validations using JavaScript.
Wrote SQL queries and stored procedures, and enhanced performance by running explain plans.
Involved in integration of various Struts actions in the framework.
Used the Validation Framework for server-side validations.
Created test cases for the Unit and Integration testing.
The front end was integrated with the Oracle database using the JDBC API through the JDBC-ODBC bridge driver on the server side.
Designed project related documents using MS Visio which includes Use case, Class and Sequence diagrams.
Wrote the end-to-end flow, i.e., controller classes, service classes, and DAO classes per the Spring MVC design, and wrote
business logic using the core Java API and data structures.
Used Spring JMS message-driven beans (MDBs) to receive messages from other teams, with IBM MQ for queuing.
Developed presentation-layer code using JSP, HTML, AJAX, and jQuery.
Developed the business layer using Spring (IoC, AOP), DTOs, and JTA.
Developed application service components and configured beans using Spring IoC; implemented the persistence layer and
configured Ehcache to load the static tables into a secondary storage area.
Involved in the development of the user interfaces using HTML, JSP, JavaScript, CSS, and AJAX.
Created tables, triggers, stored procedures, SQL queries, joins, integrity constraints, and views for multiple Oracle 11g
databases using the Toad tool.
Developed the project using industry-standard design patterns such as Singleton, Business Delegate, and Factory for better
code maintainability and reusability.
Environment: Java, J2EE, Spring, Spring Batch, Spring JMS, MyBatis, HTML, CSS, AJAX, jQuery, JavaScript, JSP, XML, UML,
JUNIT, IBM WebSphere, Maven, Clear Case, SoapUI, Oracle 11g, IBM MQ