
DEEPAK

Email: [email protected] PH: 316-285-0375


Data Engineer
PROFESSIONAL SUMMARY:
 Data Engineer with 9+ years of IT experience and exceptional expertise in the Big Data/Hadoop ecosystem and data analytics techniques.
 Hands-on experience working with the Big Data/Hadoop ecosystem, including Apache Spark, MapReduce, Spark Streaming, PySpark, Hive, HDFS, AWS Kinesis, Airflow DAGs, and Oozie.
 Proficient in Python scripting; worked with statistical functions in NumPy, visualization with Matplotlib, and Pandas for organizing data.
 Experience working with NoSQL databases including DynamoDB and HBase.
 Experience in tuning and debugging Spark applications and applying Spark optimization techniques.
 Experience in building PySpark and Spark-Scala applications for interactive analysis, batch processing and stream
processing.
 Hands-on experience creating real-time data streaming solutions using Apache Spark Core, Spark SQL, and DataFrames.
 Extensive knowledge in implementing, configuring, and maintaining Amazon Web Services (AWS) like EC2, S3, Redshift,
Glue and Athena.
 Experienced in designing for distributed processing, high availability, fault tolerance, and scalability.
 Expertise in developing Spark applications for interactive analysis, batch processing, and stream processing using PySpark, Scala, and Java.
 Advanced knowledge of Hadoop-based data warehousing (Hive) and database connectivity (Sqoop).
 Ample experience using Sqoop to ingest data from RDBMS - Oracle, MS SQL Server, Teradata, PostgreSQL, and MySQL.
 Experience working with various streaming ingestion services for batch and real-time processing using Spark Streaming and Kafka.
 Proficient in using Spark API for streaming real-time data, staging, cleaning, applying transformations, and preparing data
for machine learning needs.
 Extensive knowledge in working with Amazon EC2 to provide a solution for computing, query processing, and storage
across a wide range of applications.
 Expertise in using AWS S3 to stage data and to support data transfer and data archival. Experience in using AWS Redshift
for large scale data migrations using AWS DMS and implementing CDC (change data capture).
 Strong experience developing AWS Lambda functions in Python to automate data ingestion and routine tasks (a minimal sketch appears at the end of this summary).
 Working knowledge of Azure cloud components (HDInsight, Databricks, Data Lake, Blob Storage, Data Factory, Storage Explorer, SQL DB, SQL DWH, Cosmos DB).
 Experienced in building data pipelines using Azure Data Factory, Azure Databricks, and loading data to Azure Data Lake,
Azure SQL Database, Azure SQL Data Warehouse, and controlling database access.
 Extensive experience with Azure services like HDInsight, Stream Analytics, Active Directory, Blob Storage, Cosmos DB, and
Storage Explorer.
 Good understanding of security requirements and their implementation using Azure Active Directory, Sentry, Ranger, and Kerberos for authentication and authorization of resources.
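
Illustrative sketch (an assumption for illustration, not taken from any project below): a minimal Python AWS Lambda handler of the kind described in the summary, staging newly landed S3 objects for downstream ingestion. The bucket names, prefixes, and S3 trigger wiring are hypothetical.

import urllib.parse

import boto3

s3 = boto3.client("s3")
STAGING_BUCKET = "example-staging-bucket"  # hypothetical placeholder

def lambda_handler(event, context):
    """Assumed to be triggered by an S3 ObjectCreated event notification."""
    records = event.get("Records", [])
    for record in records:
        src_bucket = record["s3"]["bucket"]["name"]
        src_key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        # Copy the raw object into a staging prefix for downstream ingestion.
        s3.copy_object(
            Bucket=STAGING_BUCKET,
            Key=f"staged/{src_key}",
            CopySource={"Bucket": src_bucket, "Key": src_key},
        )
    return {"staged": len(records)}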

TECHNICAL SKILLS:

Big Data Technologies: Apache Hadoop, Apache Spark (Core, SQL, Streaming), MapReduce, HDFS, Hive, Kafka, Sqoop,
Oozie, AWS EMR, Azure HDInsight, Azure Databricks.
Programming Languages: Python, Scala, Java, SQL
Data Analytics and Machine Learning: NumPy, Pandas, Matplotlib, PySpark, TensorFlow, PyTorch
Cloud Platforms: Amazon Web Services (AWS): EC2, S3, Redshift, Glue, Athena; Microsoft Azure: HDInsight, Databricks,
Data Lake, Blob Storage, Data Factory, SQL DB, SQL DWH, Cosmos DB
Data Warehousing and Database Technologies: DynamoDB, HBase, Azure Synapse Analytics (DW), Azure SQL DB, Oracle,
MySQL, Teradata, PostgreSQL
Data Ingestion and Integration: AWS Kinesis, Apache Airflow, AWS Data Firehose, Apache Kafka, Azure Event Hub, Azure
Data Factory
Data Visualization and Reporting: Power BI, Tableau
DevOps and CI/CD: Git, Jenkins, Docker, Maven, Azure DevOps
Scripting and Automation: Shell Scripting, AWS Lambda, Azure Functions
Data Formats and Compression: Avro, Parquet, ORC, JSON, Snappy, zlib
Data Governance and Security: AWS IAM, Azure Active Directory, Sentry, Ranger, Kerberos
Monitoring and Logging: Prometheus, Grafana, ELK Stack
Deployment and Orchestration: Kubernetes, Terraform, AWS CloudFormation
Agile Methodologies: Scrum, Kanban
Integrated Development Environments (IDEs): IntelliJ IDEA, Eclipse
Version Control Systems: Git, GitHub, Azure Repos, Clear Case
Tools and Technologies: UNIX/Linux, Maven, IBM WebSphere, IBM MQ, Toad, SoapUI

PROFESSIONAL EXPERIENCE:
AWS Data Engineer
Truist Bank, Charlotte, NC Sep 2022 to Present
Responsibilities:
 Participated with the technical team, business managers, and practitioners in the business unit to determine the requirements and functionality needed for each project.
 Performed wide and narrow transformations and actions such as filter, lookup, join, and count on Spark DataFrames.
 Worked with Parquet and ORC files using PySpark, and with Spark Streaming on DataFrames.
 Developed batch and streaming processing apps using Spark APIs for functional pipeline requirements.
 Automated data storage from streaming sources to AWS data stores such as S3, Redshift, and RDS by configuring Amazon Kinesis Data Firehose.
 Performed analytics on streamed data using the real-time integration capabilities of Amazon Kinesis Data Streams.
 Created PySpark code that uses Spark SQL to generate DataFrames from the Avro-formatted raw layer and writes them to data-service-layer internal tables in Parquet format (see the sketch after this section's Environment line).
 Generated workflows through Apache Airflow and Apache Oozie for scheduling the Hadoop jobs that control large data transformations.
 Imported and exported data into HDFS/Hive from relational databases and Teradata using Sqoop.
 Involved in creating Hive tables, loading data, and analyzing it using Hive queries.
 Worked on creating and configuring EC2 instances on AWS (Amazon Web Services) to establish clusters in the cloud.
 Worked on a CI/CD solution using Git, Jenkins, and Docker to set up and configure the big data architecture on the AWS cloud platform.
 Configured Kafka brokers and clusters to ensure high availability and fault tolerance.
 Developed Kafka producers and consumers using Java, Python, or Scala for real-time data processing.
 Implemented Kafka Streams for stream processing applications, enabling near real-time analytics.
 Designed and optimized Kafka topics, partitions, and offsets for efficient data ingestion and distribution.
 Integrated Kafka with various data sources and sinks such as databases, message queues, and file systems.
 Deployed scalable applications on Google Cloud Platform (GCP) using Kubernetes Engine.
 Managed GCP resources efficiently to optimize costs for projects.
 Implemented CI/CD pipelines on GCP using tools like Cloud Build and Jenkins.
 Utilized GCP's BigQuery for data analytics and generating actionable insights.
 Configured and maintained Google Cloud Storage for secure and scalable data storage solutions.
 Developed comprehensive technical documentation for architecture, deployment processes, and system configurations.
 Collaborated with cross-functional teams to create user-friendly documentation for end-users, ensuring effective
knowledge transfer.
 Implemented robust monitoring solutions using tools like Prometheus, Grafana, and ELK stack to ensure the availability
and performance of critical systems.
 Conducted regular performance assessments and capacity planning to optimize resource utilization and cost efficiency.
 Designed and implemented infrastructure to support the deployment and scaling of machine learning models in
production environments.
 Integrated GCP services with third-party applications for seamless workflow automation.
 Leveraged GCP's Compute Engine to provision and manage virtual machines as per project requirements.
 Implemented Identity and Access Management (IAM) policies on GCP for robust security measures.
 Integrated machine learning pipelines with CI/CD processes to automate model training, testing, and deployment.
 Implemented end-to-end CI/CD pipelines, utilizing Jenkins, GitLab CI, and other automation tools to streamline software
delivery processes.
 Established a culture of collaboration between development and operations teams, fostering continuous improvement
and efficiency.
 Established and maintained a comprehensive data governance framework, defining data ownership, stewardship, and
lifecycle management processes.
 Implemented data quality monitoring tools and processes to proactively identify and address data inconsistencies and
anomalies.
 Developed and enforced data security policies and procedures to ensure compliance with industry regulations and
standards, such as GDPR and HIPAA.
 Implemented robust encryption mechanisms for data in transit and at rest, enhancing overall data protection measures.
 Led the design and implementation of a scalable and resilient cloud architecture, ensuring high availability and fault
tolerance.
 Orchestrated the migration of legacy on-premises systems to cloud environments, optimizing resource utilization and
reducing operational costs.
 Analyzed the existing SQL scripts and designed the solution to implement them using PySpark.
 Experienced in writing Spark Applications in Scala and Python (PySpark).
 Implemented Spark applications in Python to perform advanced procedures such as text analytics and processing, utilizing DataFrames and the Spark SQL API with Spark's in-memory computing capabilities for faster data processing.
 Wrote Spark-Streaming applications to consume the data from Kafka topics and write the processed streams to HBase.
 Used Spark API over MapReduce FS to perform analytics on data in Hive tables and HBase Tables.
 Worked on AWS Lambda to run code without managing servers, triggering runs from S3 and SNS events.
 Worked on integrating a Kafka publisher into the Spark job to capture errors from the Spark application and push them into a database.
 Orchestrated containerized applications using Kubernetes, ensuring high availability, scalability, and fault tolerance.
 Utilized Kubernetes to automate deployment, scaling, and management of containerized microservices in production
environments.
 Configured Kubernetes clusters for optimal resource utilization and workload distribution, maximizing efficiency and cost-
effectiveness.
 Implemented advanced networking features in Kubernetes to enable seamless communication between microservices
across clusters.
 Applied NLTK for text preprocessing tasks such as tokenization, stemming, and lemmatization to prepare text data for
analysis.
 Utilized spaCy for advanced text processing tasks including named entity recognition, part-of-speech tagging, and
dependency parsing.
 Implemented Gensim for topic modeling and document similarity tasks, extracting meaningful insights from textual data.
 Developed custom NLP pipelines using NLTK, spaCy, and Gensim to address specific business requirements and
challenges.
 Proficient in Plotly for creating interactive and dynamic visualizations, enhancing data exploration and presentation.
 Utilized Seaborn to generate statistical graphics, facilitating data analysis and interpretation.
 Experienced in leveraging D3.js to develop custom and intricate visualizations tailored to specific data requirements.
 Employed Plotly, Seaborn, and D3.js collectively to create comprehensive and visually compelling dashboards for
stakeholders.
 Ensured compliance with GDPR regulations by implementing data protection policies, conducting privacy impact
assessments, and ensuring data subjects' rights were upheld.
 Implemented HIPAA-compliant data handling procedures to safeguard protected health information (PHI), ensuring
confidentiality, integrity, and availability of healthcare data.
 Utilized data governance tools such as Collibra and Informatica to establish data quality standards, metadata
management, and data lineage documentation.
 Conducted data audits to assess compliance with regulatory requirements and internal data governance policies,
identifying areas for improvement and risk mitigation.
 Applied various statistical methods including hypothesis testing, regression analysis, and ANOVA to analyze relationships
and patterns within datasets.
 Developed predictive models using regression techniques such as linear regression, logistic regression, and polynomial
regression to infer insights and make informed decisions.
 Utilized Bayesian statistics to quantify uncertainty and incorporate prior knowledge into modeling processes, improving
the robustness of statistical analyses.
 Employed survival analysis techniques like Kaplan-Meier estimation and Cox proportional hazards model to analyze time-
to-event data in medical research and customer churn analysis.
 Implemented decision trees to analyze and classify complex datasets, optimizing decision-making processes in various
applications.
 Developed neural networks for predictive modeling tasks, enhancing accuracy and efficiency in pattern recognition and
forecasting.
 Utilized clustering techniques such as K-means and hierarchical clustering to identify underlying structures within data,
facilitating segmentation and targeted marketing strategies.
 Employed ensemble learning methods combining decision trees, neural networks, and other algorithms to improve model
robustness and predictive performance.
 Deployed applications and services on Google Cloud Platform (GCP) utilizing Compute Engine and Kubernetes Engine.
 Leveraged GCP's BigQuery for data warehousing and analytics to handle large datasets efficiently.
 Utilized Google Cloud Storage (GCS) for scalable and durable object storage of data assets.
 Implemented serverless computing with Google Cloud Functions for event-driven applications.
 Communicated effectively with stakeholders to gather requirements and provide project updates.
 Collaborated with team members to brainstorm solutions and address challenges.
 Demonstrated strong problem-solving skills by analyzing issues and proposing effective solutions.
 Acted as a mediator in resolving conflicts and fostering a positive team environment.
 Orchestrated data pipelines using tools like Apache Airflow to automate data workflows.
 Implemented data versioning using Git or Apache Parquet to track changes and ensure reproducibility.
 Managed schema evolution using tools like Apache Avro or Protobuf to maintain data consistency.
 Utilized data lineage tracking to understand the origin and transformation of data across pipelines.
 Implemented parallel processing using Apache Spark to optimize big data processing.
 Utilized data partitioning strategies in Hadoop MapReduce for efficient data processing.
 Employed compression techniques such as snappy and gzip to reduce storage and enhance processing speed.
 Applied caching mechanisms like Redis or Memcached to store frequently accessed data for faster retrieval.
Environment: Hadoop, Spark, Hive, HDFS, Kafka, UNIX, Shell, AWS services, Python, Scala, Glue, Oozie, SQL.
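
Illustrative sketch (hypothetical paths and column names; assumes the spark-avro package is available on the cluster): the Avro-to-Parquet pattern referenced in the bullet above, reading the raw layer, reshaping it with Spark SQL, and writing Parquet to the data service layer.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("raw-to-service-layer").getOrCreate()

# Read the Avro-formatted raw layer (placeholder path; requires spark-avro).
raw_df = spark.read.format("avro").load("s3://example-raw-bucket/events/")
raw_df.createOrReplaceTempView("raw_events")

# Shape the data with Spark SQL (hypothetical columns).
service_df = spark.sql("""
    SELECT event_id,
           CAST(event_ts AS timestamp) AS event_ts,
           customer_id,
           amount
    FROM raw_events
    WHERE event_id IS NOT NULL
""").withColumn("ingest_date", F.to_date("event_ts"))

# Write to the data-service-layer internal table location as Parquet, partitioned by date.
(service_df.write
    .mode("overwrite")
    .partitionBy("ingest_date")
    .parquet("s3://example-service-bucket/internal/events/"))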

Data Engineer
Homesite Insurance, Boston, MA June 2020 to Aug 2022
Responsibilities:
 Responsible for building scalable distributed data solutions using Hadoop.
 Worked on Apache Spark using Scala and Python.
 Extensively used the Spark Core and Spark SQL libraries to perform transformations on the data.
 Used an Azure HDInsight cluster for processing Spark applications.
 Designed and developed Azure Data Lake Storage, placing files from various sources into the Data Lake, and built AWS Glue jobs that read datasets from various data sources and perform transformations.
 Experience developing data infrastructure and tools, and familiarity with current large-scale data processing technologies, e.g., TensorFlow or PyTorch.
 Stored data from source systems in the Data Lake and processed it using Spark.
 Collected business requirements, then designed and implemented Spark applications in Scala using the IntelliJ IDE with Maven builds.
 Utilized GCP's Pub/Sub for real-time messaging and event-driven architectures.
 Designed and implemented serverless applications on GCP using Cloud Functions.
 Configured networking services like VPC and Cloud DNS to establish secure communication within GCP environments.
 Employed GCP's Dataflow for stream and batch processing of large datasets.
 Developed and deployed microservices architecture on GCP using Cloud Run.
 Conducted performance optimization and tuning of GCP services for enhanced efficiency.
 Implemented logging and monitoring solutions on GCP using Stackdriver.
 Configured load balancing and auto-scaling features on GCP for high availability and reliability.
 Used different Spark transformations and functions as needed.
 Worked on performance tuning of the existing Spark applications.
 Developed enterprise-level utilities on Spark.
 Continuously monitored and managed the Hadoop cluster through the YARN UI.
 Experienced in performance tuning of Spark Applications for the correct level of Parallelism and memory tuning.
 Experienced in writing shell scripts to process the jobs.
 Extensively used Accumulators and Broadcast variables to fine-tune the spark applications and to monitor the spark jobs.
 Performed advanced procedures like text analytics and processing, using the in-memory computing capabilities of Spark.
 Conducted regular security audits and vulnerability assessments to identify and mitigate potential risks to sensitive data.
 Collaborated with cross-functional teams to ensure alignment between security measures and overall business objectives.
 Facilitated data profiling and metadata management initiatives to enhance the overall quality and reliability of
organizational data.
 Collaborated with business units to define and implement data classification and categorization policies.
 Implemented cloud-native solutions, leveraging services such as AWS Lambda, Azure Functions, and Google Cloud
Functions for efficient and cost-effective application development.
 Designed and executed cloud deployment strategies, utilizing Infrastructure as Code (IaC) tools like Terraform and AWS
Cloud Formation.
 Implemented security features like SSL/TLS encryption and SASL authentication for securing Kafka clusters.
 Monitored Kafka clusters using tools like Kafka Manager, Prometheus, and Grafana to ensure optimal performance.
 Implemented Kafka Connect for seamless integration with external systems, enabling data pipelines.
 Managed schema evolution using Schema Registry to maintain compatibility in Kafka data streams.
 Configured Kafka MirrorMaker for data replication and disaster recovery across multiple data centers.
 Integrated automated testing processes into the CI/CD pipelines to ensure code quality and reduce the risk of deployment
failures.
 Implemented version control best practices, branching strategies, and code review processes for efficient and
collaborative development.
 Collaborated with data scientists to deploy models in cloud environments, ensuring optimal performance and resource
utilization.
 Implemented model monitoring and logging solutions to track model performance and detect deviations over time.
 Implemented automated scaling strategies for applications based on real-time performance metrics and user demand.
 Conducted root cause analysis for performance issues and implemented corrective actions to improve system reliability.
 Facilitated regular knowledge-sharing sessions and workshops to promote best practices in cloud architecture and
deployment.
 Acted as a liaison between technical and non-technical stakeholders, translating complex technical concepts into
understandable terms.
 Experienced in handling large datasets during the ingestion process itself using partitioning, Spark in-memory capabilities, broadcast variables, and effective and efficient joins and transformations.
 Upgraded the existing HDInsight code to use Azure Databricks for better performance, since Databricks provides an optimized Spark cluster.
 Used Spark Streaming with Kafka and the Azure Event Hubs service.
 Extracted data from relational databases using Azure Data Factory and stored it in the Data Lake.
 Stored data in the Data Lake and accessed it from there for downstream processing.
 Worked on Data Factory pipelines to copy data from on-prem relational databases into the Data Lake.
 Scheduled Databricks runs from Azure Data Factory.
 Performed basic transformations on the data using Data Factory.
 Triggered Data Factory pipelines and created the required connections.
 Scheduled Azure Databricks jobs and notebooks.
 Configured and managed the Azure Databricks cluster; extensively used Delta Lake storage (a minimal sketch follows this section's Environment line).
 Used the Power BI reporting tool to connect to Hive and generate daily data reports.
 Implemented indexing on large datasets to improve query performance in distributed systems.
 Leveraged query optimization techniques in databases like Hive or Impala for faster data retrieval.
 Employed clustering algorithms to distribute data evenly across nodes for better load balancing.
 Employed data quality checks and monitoring to ensure accuracy and reliability of datasets.
 Implemented data partitioning strategies to optimize querying performance in distributed systems.
 Leveraged columnar storage formats like Apache ORC or Apache Arrow for efficient data storage and retrieval.
 Adapted communication style to effectively convey technical concepts to non-technical stakeholders.
 Demonstrated empathy and active listening to understand and address team members' concerns.
 Prioritized tasks and managed time effectively to meet project deadlines.
 Utilized Google Cloud Pub/Sub for real-time messaging and event-driven architectures.
 Leveraged Alibaba Cloud ECS for scalable and flexible cloud computing resources.
 Utilized Alibaba Cloud Object Storage Service (OSS) for secure and reliable data storage.
 Conducted feature engineering to enhance the performance of machine learning models, utilizing techniques like
Principal Component Analysis (PCA) and feature scaling.
 Fine-tuned hyperparameters of machine learning models using techniques like grid search and cross-validation to
optimize performance and generalization.
 Implemented anomaly detection algorithms to identify outliers and irregularities in data, enhancing fraud detection and
risk mitigation strategies.
 Conducted A/B testing to assess the effectiveness of interventions or changes, ensuring data-driven decision-making in
experimental settings.
 Applied multivariate analysis techniques such as factor analysis and principal component analysis (PCA) to reduce
dimensionality and identify underlying patterns in complex datasets.
 Developed econometric models to analyze economic relationships and forecast future trends in financial markets and
macroeconomic indicators.
 Implemented access control mechanisms and encryption techniques to protect sensitive data and prevent unauthorized
access or data breaches.
 Developed data retention policies in accordance with regulatory requirements and business needs, ensuring appropriate
data lifecycle management.
 Collaborated with legal and compliance teams to interpret and apply regulatory requirements to data management
practices, mitigating legal and reputational risks.
 Integrated Plotly, Seaborn, and D3.js into data pipelines to automate visualization processes, ensuring efficiency and
consistency.
 Customized visualization elements using Plotly, Seaborn, and D3.js to convey complex insights effectively to diverse
audiences.
 Conducted training sessions to educate team members on advanced features and best practices of Plotly, Seaborn, and
D3.js.
 Integrated NLTK, spaCy, and Gensim into machine learning workflows for text classification and sentiment analysis tasks.
 Conducted sentiment analysis on social media data using NLTK, spaCy, and Gensim to gauge public opinion and trends.
 Leveraged NLTK, spaCy, and Gensim for information extraction from unstructured text sources such as news articles and
customer reviews.
 Leveraged Kubernetes for blue-green deployments and canary releases, minimizing downtime and risk during application
updates.
 Managed Kubernetes configurations and secrets to ensure secure storage and access control for sensitive data and
credentials.
 Monitored and optimized Kubernetes clusters using built-in monitoring tools and third-party solutions to ensure
performance and reliability.
Environment: Python, Spark, Spark SQL, Scala, Azure HDInsight, Azure Data Lake, Azure Databricks, Azure Data Factory,
Azure Event Hub, Kafka, Data Flow & Data Lineage, Data Modeling, Power BI Desktop, Oracle, SQL Server, HDFS, YARN.
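
Illustrative sketch (hypothetical storage account, container, and table names; Delta Lake support is assumed from the Databricks runtime): the Databricks/Delta Lake pattern referenced above, reading curated data from Azure Data Lake Storage and persisting it as a Delta table.

from pyspark.sql import SparkSession

# On Databricks the SparkSession already exists; getOrCreate() reuses it.
spark = SparkSession.builder.getOrCreate()

source_path = "abfss://curated@exampleaccount.dfs.core.windows.net/sales/"             # placeholder
delta_path = "abfss://servicelayer@exampleaccount.dfs.core.windows.net/delta/sales/"   # placeholder

# Read curated Parquet data from the data lake.
sales_df = spark.read.parquet(source_path)

# Persist it in Delta format for ACID upserts and time travel.
(sales_df.write
    .format("delta")
    .mode("overwrite")
    .save(delta_path))

# Register a table over the Delta location so it can be queried with SQL or from Power BI.
spark.sql(f"CREATE TABLE IF NOT EXISTS sales_delta USING DELTA LOCATION '{delta_path}'")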

Big Data Engineer
Genesis, Beaverton, OR Oct 2018 to May 2020
Responsibilities:
 Worked closely with stakeholders to understand business requirements and design quality technical solutions that align with business and IT strategies and comply with the organization's architectural standards.
 Developed multiple applications required for transforming data across multiple layers of Enterprise Analytics Platform and
implement Big Data solutions to support distributed processing using Big Data technologies.
 Responsible for data identification and extraction using third-party ETL and data-transformation tools or scripts. (e.g.,
SQL, Python)
 Worked on migrating data from on-prem SQL Server to cloud databases (Azure Synapse Analytics (DW) & Azure SQL DB).
 Developed and managed Azure Data Factory pipelines that extracted data from various data sources and transformed it according to business rules, using Python scripts that utilized PySpark and consumed APIs to move data into an Azure SQL database.
 Utilized GCP's AI and Machine Learning services for predictive analytics and automation.
 Collaborated with cross-functional teams to architect solutions leveraging GCP's capabilities.
 Conducted disaster recovery planning and implementation on GCP for business continuity.
 Designed and implemented multi-region deployments on GCP for geo-distributed applications.
 Implemented containerization using Docker and managed container orchestration with GKE.
 Conducted performance testing and optimization of applications deployed on GCP.
 Integrated GCP services with version control systems like Git for seamless development workflows.
 Created a new data quality check framework project in Python that utilized pandas.
 Implemented source control and development environments for Azure Data Factory pipelines utilizing Azure Repos.
 Created Hive/Spark external tables for each source table in the Data Lake and wrote Hive SQL and Spark SQL to parse the logs and structure them in tabular format to facilitate effective querying of the log data.
 Designed and developed ETL and ELT frameworks using Azure Data Factory and Azure Databricks.
 Created generic Databricks notebooks for performing data cleansing.
 Created Azure Data Factory pipelines to refactor on-prem SSIS packages into Data Factory pipelines.
 Worked with Azure Blob and Data Lake Storage for loading data into Azure Synapse (SQL DW).
 Ingested and transformed source data using Azure Data Flows and Azure HDInsight.
 Created Azure Functions to ingest data at regular intervals.
 Created Databricks notebooks for performing complex transformations and integrated them as activities in ADF pipelines.
 Played a key role in incident response and resolution, implementing improvements based on lessons learned from
security incidents.
 Developed and conducted training programs to promote data governance awareness and adherence across the
organization.
 Optimized CI/CD workflows for microservices architecture, ensuring seamless integration and deployment across
distributed systems.
 Facilitated cross-functional collaboration between data science and engineering teams to drive innovation in machine
learning applications.
 Established continuous improvement processes based on monitoring data to enhance system performance and user
experience.
 Tuned Kafka broker and consumer configurations for optimal throughput and latency.
 Implemented message partitioning strategies to achieve balanced load distribution across Kafka brokers.
 Designed and implemented custom Kafka producers and consumers for specific use cases and performance requirements (a Python sketch follows this section).
 Troubleshot and resolved performance issues, such as high latency or message loss, in Kafka clusters.
 Implemented end-to-end data pipelines with Kafka, including data ingestion, transformation, and delivery.
 Utilized data pruning techniques to reduce unnecessary data scans and improve processing efficiency.
 Optimized ETL processes by batching small data transactions into larger ones to minimize overhead.
 Implemented resource management techniques such as containerization with Docker or Kubernetes to efficiently allocate
computing resources.
 Integrated data governance practices to ensure compliance with regulatory standards and policies.
 Implemented automated testing and validation of data pipelines to detect and prevent errors early.
 Collaborated with cross-functional teams to design and implement scalable and reliable data architectures.
 Provided constructive feedback and mentorship to team members to facilitate their growth.
 Demonstrated flexibility in adapting to changing project requirements and priorities.
 Exhibited leadership qualities by taking initiative and inspiring teamwork to achieve common goals.
 Implemented Alibaba Cloud Function Compute for serverless computing and event-driven applications.
 Leveraged Alibaba Cloud MaxCompute for big data processing and analytics.
 Integrated Alibaba Cloud services with existing infrastructure to create hybrid cloud solutions.
 Leveraged reinforcement learning techniques to develop adaptive systems capable of learning and optimizing decision-
making processes in dynamic environments.
 Applied natural language processing (NLP) algorithms such as sentiment analysis and named entity recognition to extract
insights from unstructured text data.
 Implemented time-series forecasting models like ARIMA and LSTM networks to predict future trends and patterns in
sequential data.
 Utilized machine learning algorithms for feature selection and variable importance analysis to identify key drivers
influencing outcomes in predictive models.
 Conducted Monte Carlo simulations to assess risk and uncertainty in financial portfolios, informing investment strategies
and risk management decisions.
 Employed time series analysis techniques like autoregressive integrated moving average (ARIMA) and seasonal
decomposition to forecast trends and patterns in sequential data.
 Conducted regular training sessions and awareness programs to educate employees on data governance best practices
and compliance obligations.
 Established data classification schemes to categorize data based on sensitivity and regulatory requirements, guiding
appropriate handling and protection measures.
 Implemented data anonymization and pseudonymization techniques to protect privacy while preserving data utility for
analysis and research purposes.
 Collaborated with UI/UX designers to integrate Plotly, Seaborn, and D3.js visualizations seamlessly into user interfaces.
 Implemented interactive charts and graphs using Plotly, Seaborn, and D3.js to enhance user engagement and data
exploration.
 Utilized Plotly, Seaborn, and D3.js to visualize large datasets, optimizing performance and scalability of visualizations.
 Implemented text summarization techniques using NLTK, spaCy, and Gensim to condense large volumes of text into
concise summaries.
 Developed custom NLP models using NLTK, spaCy, and Gensim to address domain-specific challenges in industries such as
healthcare and finance.
 Collaborated with linguists and subject matter experts to fine-tune NLTK, spaCy, and Gensim models for improved
accuracy and performance.
 Integrated Kubernetes with CI/CD pipelines to automate the deployment process and achieve continuous delivery of
containerized applications.
 Implemented custom resource definitions (CRDs) in Kubernetes to extend its functionality and support custom application
requirements.
 Contributed to the Kubernetes community by sharing best practices, troubleshooting tips, and insights through forums
and documentation.
 Established best practices for cloud security, including identity and access management, encryption, and network security
configurations.
 Maintained up-to-date documentation on security policies, compliance measures, and data governance practices,
ensuring audit readiness.
 Wrote complex SQL queries for data analysis and extraction of data in the required format.
 Created Power BI datamarts and reports for various stakeholders in the business.
 Created CI/CD pipelines using Azure DevOps.
 Enhanced the functionality of existing ADF pipeline by adding new logic to transform the data.
 Worked on Spark jobs for data preprocessing, validation, normalization, and transmission.
 Optimized code and configurations for performance tuning of Spark jobs.
 Worked with unstructured and semi structured data sets to aggregate and build analytics on the data.
 Work independently with business stakeholders with strong emphasis on influencing and collaboration.
 Daily participation in Agile based Scrum team with tight deadlines.
 Created complex data transformations and manipulations using ADF and Scala.
 Worked on cloud deployments using Maven, Docker, and Jenkins.
 Experienced in using Avro, Parquet, ORC, and JSON file formats; developed UDFs in Hive.
Environment: Spark, Spark Streaming, Spark SQL, AWS EMR, S3, EC2, MapR, HDFS, Hive, Pig, Apache Kafka, Sqoop, Python,
Scala, PySpark, Shell scripting, Linux, MySQL, NoSQL.
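
Illustrative sketch (hypothetical broker, topic, and group names; the confluent-kafka Python client is one possible choice alongside the Java and Scala clients mentioned above): a custom Kafka producer and consumer pair of the kind referenced in this section.

import json

from confluent_kafka import Consumer, Producer

BROKERS = "localhost:9092"   # placeholder
TOPIC = "example-events"     # placeholder

def produce_events(events):
    """Publish dict events as JSON, keyed so partitions stay balanced."""
    producer = Producer({"bootstrap.servers": BROKERS, "acks": "all"})
    for event in events:
        producer.produce(TOPIC, key=str(event["id"]), value=json.dumps(event))
    producer.flush()  # block until all messages are acknowledged

def consume_events(limit=10):
    """Read up to `limit` messages and return the decoded payloads."""
    consumer = Consumer({
        "bootstrap.servers": BROKERS,
        "group.id": "example-group",
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe([TOPIC])
    payloads = []
    try:
        while len(payloads) < limit:
            msg = consumer.poll(timeout=1.0)
            if msg is None or msg.error():
                continue
            payloads.append(json.loads(msg.value()))
    finally:
        consumer.close()
    return payloads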

Hadoop Developer
Careator Technologies Pvt Ltd, Hyderabad, India Mar 2017 to July 2018
Responsibilities:
 Involved in importing data from Microsoft SQL Server, MySQL, Teradata into HDFS using Sqoop.
 Developed workflow in Oozie to automate the tasks of loading the data into HDFS.
 Used Hive to analyze the partitioned and bucketed data to compute various metrics for reporting.
 Involved in creating Hive tables, loading data, and writing queries that run internally as MapReduce jobs.
 Involved in creating Hive external tables for HDFS data.
 Solved performance issues in Hive and PySpark scripts with an understanding of joins, grouping, and aggregation and how they map to MapReduce jobs.
 Worked with Spark to improve the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, Spark MLlib, DataFrames, pair RDDs, and Spark on YARN.
 Implemented end-to-end ETL pipelines using Python and SQL for high-volume analytics. Reviewed use cases before
onboarding to HDFS.
 Automated deployments and routine tasks using UNIX Shell Scripting
 Used Spark for transformations, event joins and some aggregations before storing the data into HDFS.
 Troubleshot and resolved data quality issues and maintained a high level of accuracy in the data being reported.
 Analyzed large datasets to determine the optimal way to aggregate them.
 Worked on the Oozie workflow to run multiple Hive jobs.
 Worked on creating Custom Hive UDF's.
 Developed automated shell script to execute Hive Queries.
 Involved in processing ingested raw data using Python.
 Monitored continuously and managed the Hadoop cluster using Cloudera manager.
 Worked on different file formats like JSON, Avro, ORC, and Parquet, and compression codecs like Snappy, zlib, and LZ4.
 Involved in converting Hive/SQL queries into Spark transformations using DataFrames (see the sketch at the end of this section).
 Gained knowledge in creating Tableau dashboards for reporting on analyzed data.
 Expertise with NoSQL databases like HBase.
 Experienced in managing and reviewing the Hadoop log files.
 Used GitHub as repository for committing code and retrieving it and Jenkins for continuous integration.
Environment: HDFS, MapReduce, Sqoop, Hive, Spark, Oozie, MySQL, Eclipse, Git, GitHub, Jenkins.
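
Illustrative sketch (hypothetical database, table, and column names): converting a Hive aggregation query into equivalent Spark DataFrame transformations, as referenced in the bullet above.

from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
    .appName("hive-to-dataframe")
    .enableHiveSupport()   # read tables registered in the Hive metastore
    .getOrCreate())

# Original Hive query, for reference:
#   SELECT region, COUNT(*) AS orders, SUM(amount) AS revenue
#   FROM sales.orders GROUP BY region;

orders_df = spark.table("sales.orders")   # hypothetical Hive table

report_df = (orders_df
    .groupBy("region")
    .agg(F.count("*").alias("orders"),
         F.sum("amount").alias("revenue")))

report_df.show()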

Application Developer
Couth Infotech Pvt. Ltd, Hyderabad, India Sep 2015 to Feb 2017
Responsibilities:
 Involved in various stages of Enhancements in the Application by doing the required analysis, development, and testing.
 Prepared the high- and low-level design documents and worked on digital signature generation.
 Created use case, class, and sequence diagrams for the analysis and design of the application.
 Developed the logic and code for registration and validation of enrolling customers.
 Developed web-based user interfaces using the Struts framework.
 Handled client-side validations using JavaScript.
 Wrote SQL queries, stored procedures and enhanced performance by running explain plans.
 Involved in integration of various Struts actions in the framework.
 Used the Validation Framework for server-side validations.
 Created test cases for the Unit and Integration testing.
 Front-end was integrated with Oracle database using JDBC API through JDBC-ODBC Bridge driver at server side.
 Designed project related documents using MS Visio which includes Use case, Class and Sequence diagrams.
 Wrote the end-to-end flow, i.e., controller classes, service classes, and DAO classes, per the Spring MVC design, and wrote business logic using the core Java API and data structures.
 Used Spring JMS message-driven beans (MDBs) to receive messages from another team, with IBM MQ for queuing.
 Developed presentation layer code using JSP, HTML, AJAX, and jQuery.
 Developed the business layer using Spring (IoC, AOP), DTOs, and JTA.
 Developed application service components and configured beans using Spring IoC. Implemented the persistence layer and configured Ehcache to load static tables into a secondary storage area.
 Involved in the development of the User Interfaces using HTML, JSP, JS, CSS and AJAX
 Created tables, triggers, stored procedures, SQL queries, joins, integrity constraints and views for multiple databases,
Oracle 11g using Toad tool.
 Developed the project using industry-standard design patterns like Singleton, Business Delegate, and Factory for better code maintainability and reusability.
Environment: Java, J2EE, Spring, Spring Batch, Spring JMS, MyBatis, HTML, CSS, AJAX, jQuery, JavaScript, JSP, XML, UML,
JUNIT, IBM WebSphere, Maven, Clear Case, SoapUI, Oracle 11g, IBM MQ
