Akash Resume
SUMMARY:
9+ years of experience in the Information Technology industry, including 5+ years as a
Hadoop/Spark Developer using Big Data technologies (Hadoop and Spark ecosystems)
and 3+ years in Java/J2EE technologies and SQL.
Hands-on experience in installing, configuring, and using Hadoop ecosystem components
such as HDFS, MapReduce, Hive, Pig, YARN, Sqoop, Flume, HBase, Impala,
Oozie, ZooKeeper, Kafka, and Spark.
In-depth understanding of Hadoop architecture, including YARN and components such
as HDFS, Resource Manager, Node Manager, NameNode, DataNode, and MR v1 & v2 concepts.
In-depth understanding of Spark architecture, including Spark Core, Spark SQL, DataFrames, Spark
Streaming, Spark MLlib, and real-time stream processing.
Hands-on experience in the Analysis, Design, Coding, and Testing phases of the Software Development Life
Cycle (SDLC).
Hands-on experience with AWS (Amazon Web Services): Elastic MapReduce (EMR), S3 storage, EC2
instances, and data warehousing.
Worked extensively with Amazon Web Services (AWS) cloud services such as EC2, S3, and EBS.
Migrated an existing on-premises application to AWS, using services such as EC2 and S3 for processing and
storage of small data sets; experienced in maintaining Hadoop clusters on AWS EMR.
Hands-on experience across Big Data application phases such as data ingestion, data
analytics, and data visualization.
Experience with Hadoop distributions such as Cloudera, Hortonworks, and Amazon AWS (EMR).
Experience in transferring data from RDBMS to HDFS and Hive tables using Sqoop (a sketch of an equivalent transfer follows this summary).
Migrated code from Hive to Apache Spark and Scala using Spark SQL and RDDs.
Experience in working with Flume to load log data from multiple sources directly into HDFS.
Well versed in workflow scheduling and monitoring tools such as Oozie, Hue, and ZooKeeper.
Good knowledge of Impala, Mahout, Spark SQL, Storm, Avro, Kafka, Hue, and AWS, and of
IDE and build tools such as Eclipse, NetBeans, and Maven.
Installed and configured MapReduce, Hive, and HDFS; implemented CDH5 and HDP clusters
on CentOS and assisted with performance tuning, monitoring, and troubleshooting.
Experience in data processing tasks such as collecting, aggregating, and moving data from various sources using
Apache Flume and Kafka.
Proficient in data manipulation and analysis using Pandas, a powerful Python library for data handling
and transformation.
Experience in working with Pandas' data structures, including Series and DataFrame, for efficient data
organization and analysis.
Strong knowledge of version control systems such as SVN and GitHub.
Experience in processing streaming data into clusters using Kafka and Spark Streaming.
Experience in analyzing data using HiveQL, Pig Latin, and custom MapReduce programs in Java.
Basic knowledge of Kudu, NiFi, Kylin, and Zeppelin with Apache Spark.
Experience with NoSQL column-oriented databases such as HBase and Cassandra and their integration
with Hadoop clusters.
Involved in cluster coordination services through ZooKeeper.
Good experience in Core Java and J2EE technologies such as JDBC, Servlets, and JSP.
Experienced with Robotic Process Automation (RPA) platforms such as UiPath, Automation Anywhere, and
Blue Prism to automate repetitive tasks and increase operational efficiency.
Hands-on experience in automating processes by interacting with various systems and applications
using UiPath, including web scraping, data extraction, and report generation.
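A minimal sketch of the RDBMS-to-Hive transfer described above, shown here with Spark's JDBC reader rather than the Sqoop CLI; the connection URL, credentials, and table names are hypothetical placeholders.

# Sketch: pull an RDBMS table over JDBC and land it as a Hive table.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("rdbms-to-hive-sketch")
         .enableHiveSupport()          # requires a configured Hive metastore
         .getOrCreate())

# Read the source table over JDBC (URL, table, and credentials are placeholders).
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:mysql://db-host:3306/sales")
          .option("dbtable", "orders")
          .option("user", "etl_user")
          .option("password", "etl_password")
          .load())

# Land the data in HDFS-backed storage as a managed Hive table.
orders.write.mode("overwrite").saveAsTable("staging.orders")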
TECHNICAL SKILLS:
Web Technologies: HTML, HTML5, XML, XHTML, CSS3, JSON, AJAX, XSD, WSDL, ExtJS
Server-Side Frameworks and Libraries: Spring 2.5/3.0/3.2, Hibernate 3.x/4.x, MyBatis, Spring MVC, Spring Web Flow, Spring Batch, Spring Integration, Spring-WS, Struts, Jersey RESTful Web Services, XFire, Apache CXF, Mule ESB, ZooKeeper, Curator, Apache POI, JUnit, Mockito, PowerMock, SLF4J, Log4j, Gson, Jackson, UML, Selenium, Crystal Reports
UI Frameworks and Libraries: ExtJS, jQuery, jQuery UI, AngularJS, Thymeleaf, PrimeFaces, Bootstrap
Build Tools and IDEs: Maven, Ant, IntelliJ, Eclipse, Spring Tool Suite, NetBeans, Jenkins
Tools: SVN, JIRA, Toad, SQL Developer, Serena Dimensions, SharePoint, ClearCase, Perforce
Process & Concepts: Agile, SCRUM, SDLC, Object-Oriented Analysis and Design, Test-Driven Development, Continuous Integration
Education Details:
PROFESSIONAL EXPERIENCE:
Walmart Sunnyvale, CA October 2021 to July 2023
Sr Data Engineer
Responsibilities:
Extracted and analyzed data with Spark from varied sources; created and developed
algorithms to keep the system functioning smoothly and to eliminate false data.
Designed, created, tested, and maintained complete data management and processing
systems on Airflow.
Constructed intricate, fully automated pipelines; built JARs, tested, and deployed
using Git-based CI/CD.
Developed Spark code using Scala and Spark-SQL for faster processing and testing.
Strengthened unit tests to minimize issues and deliver a quality product; created unit
tests comparing the expected and actual behavior of Python functions.
Implemented Spark using Scala and Java, utilizing DataFrames and the Spark SQL API for faster
processing of data.
Built data processing pipelines and ETL (Extract, Transform, Load) workflows using Java and
related frameworks.
Created utilities in Scala to automate manual work; reused operators to reduce
redundancy.
Developed data ingestion processes by implementing custom Java applications to extract
data from various sources, such as databases, APIs, and file systems.
Expertise in setting up and managing Dataproc clusters to process large-scale data
workloads efficiently, including configuring cluster specifications, scaling resources based
on workload demands, and optimizing cluster performance for faster data processing.
Structured highly scalable, robust, and fault-tolerant systems.
Identified data acquisition opportunities and prospects; found ways to extract
value from existing data.
Utilized Java libraries for data serialization formats (e.g., Avro, Parquet) and worked with
schema evolution to ensure compatibility and flexibility in data storage.
Implemented performance optimizations, including indexing, caching, and query tuning, to
enhance data retrieval and processing efficiency in Java applications.
Familiarity with Java frameworks for web development (e.g., Spring, Java Servlets) and
RESTful API design, enabling seamless integration of data services with other systems.
Created multi-node Hadoop and Spark clusters in cloud instances to generate terabytes of
data and stored it in HDFS on GCP.
Configured Spark Streaming to receive data from Kafka and store the data in HDFS using
Scala.
Used Spark-Streaming APIs to perform necessary transformations and actions on the fly for
building the common learner data model which gets the data from Kafka in near real time.
Developed a Kafka consumer using Spark Structured Streaming to read from Kafka topics and
write to GCP (see the sketch at the end of this section).
Refined the data quality, reliability, and efficiency of individual components and of the
complete system.
Built complete solutions by integrating a variety of programming languages and tools.
Developed interactive dashboards using Kibana and Looker for data visualization, analytics,
and reporting purposes.
Introduced new data management tools and technologies into the existing system to make it
more effective.
Developed dashboards and reports using Looker's data modeling and visualization
functionalities, including LookML modeling, dimensions, measures, and custom
visualizations.
Environment: HDFS, Spark, Hive, Sqoop, SQL, HBase, Scala, Python, GCP, Kafka, Airflow,
Shell Scripting, Looker, Kibana.
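A minimal PySpark outline of the Kafka-to-GCP structured streaming flow described above; the production job was written in Scala, and the broker addresses, topic name, event schema, and bucket paths below are hypothetical placeholders.

# Sketch: read events from Kafka with Spark Structured Streaming and land them in GCS as Parquet.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("kafka-to-gcs-sketch").getOrCreate()

# Assumed payload schema for the learner data model; adjust to the real messages.
event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_time", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder brokers
       .option("subscribe", "learner-events")               # placeholder topic
       .option("startingOffsets", "latest")
       .load())

events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), event_schema).alias("e"))
          .select("e.*"))

query = (events.writeStream
         .format("parquet")
         .option("path", "gs://example-bucket/learner-events/")      # placeholder GCS path
         .option("checkpointLocation", "gs://example-bucket/_chk/")  # placeholder checkpoint
         .trigger(processingTime="1 minute")
         .start())

query.awaitTermination()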
Responsibilities:
Worked as a senior Big Data Cloud engineer for one of the largest cable and broadband
suppliers in the US.
Led various improvement projects utilizing Databricks and AWS services
such as Kinesis, Apache Spark Streaming, DynamoDB, and Elasticsearch.
Developed and deployed data processing pipelines on the Amazon Web Services (AWS)
platform using Python and Spark Streaming technologies.
Implemented data processing and analysis workflows using AWS Glue, Amazon EMR, and
Apache Spark with Python, handling large-scale datasets and performing complex
transformations.
Followed best practices in code organization, documentation, and version control using Git,
resulting in maintainable and scalable Flask and FastAPI applications.
Implemented data transformations and business logic using Java libraries and frameworks,
ensuring data integrity, quality, and compliance with business requirements.
Utilized Spark Streaming's windowing and sliding window operations to handle time-based
aggregations and analytics on streaming data.
Conducted performance tuning and optimization of Spark Streaming jobs to improve
throughput, reduce latency, and enhance overall system efficiency.
Developed several highly complex Databricks jobs utilizing Spark Streaming
(PySpark and Spark SQL) to process real-time data from AWS Kinesis, storing
the final output in S3 buckets, DynamoDB, and AWS Elasticsearch.
Developed a testing framework known as DQ Checks using PySpark; the framework
validates real-time data arriving from SFTP or AWS Kinesis (see the sketch following this section).
Built a CI/CD pipeline on Jenkins using a GitHub repository to manage release
deployments.
The framework is additionally equipped to email results and store the final output in an Athena table
for further analysis.
Environment: Amazon EC2, Spark, Python, AWS SDK for Python, Spark Streaming applications, HBase,
ZooKeeper, MapReduce, Postman, Flume, NoSQL, HDFS, Avro, Airflow, Sqoop, DynamoDB.
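A minimal PySpark sketch of the kind of per-micro-batch validation the DQ Checks framework performs; the column names, rules, and results path are hypothetical, and the real framework additionally handles SFTP/Kinesis ingestion, email alerts, and Athena output.

# Sketch: validate each streaming micro-batch and persist simple pass/fail metrics.
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("dq-checks-sketch").getOrCreate()

def run_dq_checks(batch_df: DataFrame, batch_id: int) -> None:
    """Apply completeness and uniqueness checks to one micro-batch."""
    total = batch_df.count()
    null_ids = batch_df.filter(col("account_id").isNull()).count()   # completeness rule
    dupes = total - batch_df.dropDuplicates(["event_id"]).count()    # uniqueness rule

    results = spark.createDataFrame(
        [(batch_id, total, null_ids, dupes, null_ids == 0 and dupes == 0)],
        ["batch_id", "row_count", "null_account_ids", "duplicate_events", "passed"],
    )
    # Stand-in sink for the Athena-backed results table used by the real framework.
    results.write.mode("append").parquet("s3://example-bucket/dq_results/")  # placeholder path

# Usage: stream_df.writeStream.foreachBatch(run_dq_checks).start()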
Lowes Mooresville, NC Oct 2018 to April 2021
Sr. Big Data Developer/Engineer
Responsibilities:
Involved in story-driven Agile development methodology and actively participated in daily scrum
meetings.
Worked on all activities related to the development, implementation and support for Hadoop.
Designed custom reusable templates in NiFi for code reusability and interoperability.
Involved in installing and configuring the Hadoop ecosystem and Cloudera Manager using the CDH4 distribution.
Built frameworks using Python in Airflow to orchestrate data science pipelines (see the DAG sketch at the end of this section).
Worked with teams to set up AWS EC2 instances using AWS services such as S3, EBS,
Elastic Load Balancing, Auto Scaling groups, VPC subnets, and CloudWatch.
Responsible for managing data coming from different sources; involved in HDFS maintenance and
loading of structured and unstructured data.
Worked with the Kafka streaming tool to load data into HDFS and exported it to a MongoDB database.
Created partitions and buckets based on state for further processing using bucket-based Hive joins.
Installed and Configured Apache Hadoop clusters for application development and Hadoop tools like
Hive, Pig, HBase, Zookeeper and Sqoop.
Implemented multiple MapReduce jobs in Java for data cleansing and pre-processing.
Wrote complex Hive queries and UDFs in Java and Python.
Worked on AWS provisioning EC2 Infrastructure and deploying applications in Elastic load balancing.
Generated data analysis reports using Matplotlib and Tableau; successfully delivered and presented the
results to C-level decision makers.
Worked with Hadoop eco system covering HDFS, HBase, YARN and MapReduce.
Used Scala and Spark SQL to develop Spark code for faster processing and testing, and performed complex
Hive queries on Hive tables.
Worked on Kerberization to secure the applications, using SSL and SAML authentication.
Wrote and executed SQL queries to work with structured data in relational databases and to
validate the transformation/business logic.
Used Flume to move data from individual data sources into the Hadoop system.
Used the MRUnit framework to test MapReduce code.
Responsible for building scalable distributed data solutions using Hadoop Eco system and Spark.
Performance-tested APIs using Postman.
Involved in data acquisition and pre-processing of various types of source data using
StreamSets.
Responsible for design & development of Spark SQL Scripts using Scala/Java based on Functional
Specifications.
Analyzed the data using Hive queries (HiveQL), Pig scripts, Spark SQL, and Spark
Streaming.
Developed tools using Python, shell scripting, and XML to automate routine tasks.
Wrote scripts in Python to extract data from HTML files.
Implemented MapReduce jobs in Hive by querying the available data.
Configured the Hive metastore with MySQL, which stores the metadata for Hive tables.
Performed data analytics in Hive and then exported those metrics back to Oracle Database using
Sqoop.
Performance tuning of Hive queries, MapReduce programs for different applications.
Proactively involved in ongoing maintenance, support and improvements in Hadoop cluster.
Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
Used Cloudera Manager for installation and management of Hadoop Cluster.
Environment: NiFi 1.1, Hadoop 2.6, JSON, XML, Avro, HDFS, Airflow, Teradata r15, Sqoop, Kafka, MongoDB,
Hive 2.3, Pig 0.17, HBase, ZooKeeper, MapReduce, Postman, Java, Python 3.6, YARN, Flume, NoSQL, Cassandra
3.11.
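A minimal sketch of the kind of Airflow DAG used to orchestrate the data science pipelines mentioned above (Airflow 2-style imports); the DAG id, schedule, and task callables are hypothetical placeholders.

# Sketch: a two-step Airflow DAG chaining an ingestion task and a modeling task.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_data():
    # Placeholder for the real ingestion step (e.g., pulling source data into HDFS).
    print("extract source data")

def train_model():
    # Placeholder for the data science step (feature prep, training, scoring).
    print("train and score model")

with DAG(
    dag_id="ds_pipeline_sketch",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract_data", python_callable=extract_data)
    train_task = PythonOperator(task_id="train_model", python_callable=train_model)

    extract_task >> train_task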
US Bank Minneapolis, MN Dec 2017 to Oct 2018
Hadoop Developer
Responsibilities:
In-depth understanding of Hadoop architecture and components such as HDFS,
Application Master, Node Manager, Resource Manager, NameNode, DataNode, and MapReduce
concepts.
Imported required tables from RDBMS into HDFS using Sqoop, and used Storm and Kafka for
real-time streaming of data into HBase.
Good experience with the NoSQL database HBase, creating HBase tables to load large sets of
semi-structured data coming from various sources.
Wrote Hive and Pig scripts as ETL tools to perform transformations, event joins, traffic filtering, and
pre-aggregations before storing data in HDFS.
Developed data pipelines using Flume, Sqoop, Pig, and MapReduce to ingest customer behavioral
data and purchase histories into HDFS for analysis.
Developed Spark code using Scala and Spark-SQL for faster testing and processing of data.
Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
Involved in moving all log files generated from various sources to HDFS for further processing
through Flume.
Developed Java code to generate, compare & merge AVRO schema files.
Prepared the validation report queries, executed after every ETL runs, and shared the resultant values
with business users in different phases of the project.
Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting;
applied Hive optimization techniques during joins and best practices when writing Hive
scripts in HiveQL.
Imported and exported data into HDFS and Hive using Sqoop, and wrote Hive queries to extract
the processed data.
Developed and ran MapReduce jobs on YARN and Hadoop clusters to produce daily and
monthly reports per user needs.
Teamed up with architects to design a Spark model for the existing MapReduce model and migrated
MapReduce models to Spark models using Scala.
Implemented Spark using Scala, utilizing Spark Core, Spark Streaming, and the Spark SQL API for faster
processing of data than MapReduce in Java.
Used Spark SQL to load JSON data, create a SchemaRDD, and load it into Hive tables, and
handled structured data using Spark SQL (see the sketch after this section).
Handled importing of data from various sources, performed transformations using Hive and
MapReduce, loaded data into HDFS, and extracted data from MySQL into HDFS using Sqoop.
Integrated Apache Storm with Kafka to perform web analytics and to move clickstream data
from Kafka to HDFS.
Used the Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types
of Hadoop jobs such as Java MapReduce, Hive, Pig, and Sqoop.
Analyzed large and critical datasets using Cloudera, HDFS, HBase, MapReduce, Hive, Pig, Sqoop,
Spark and Zookeeper.
Expert knowledge of MongoDB NoSQL data modeling, tuning, disaster recovery, and backup.
Environment: Apache Hadoop, HDFS, MapReduce, HBase, Hive, Yarn, Pig, Sqoop, Flume, Zookeeper, Kafka,
Impala, SparkSQL, Spark Core, Spark Streaming, NoSQL, MySQL, Cloudera, Java, JDBC, Spring, ETL, WebLogic,
Web Analytics, Avro, Cassandra, Oracle, Shell Scripting, Ubuntu.
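A minimal PySpark sketch of the Spark SQL flow described above: loading JSON, querying it through a temporary view, and writing the result into a Hive table. The original work used Scala and the SchemaRDD API; the HDFS path and table names here are hypothetical placeholders.

# Sketch: JSON -> Spark SQL -> Hive table.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("json-to-hive-sketch")
         .enableHiveSupport()          # requires a configured Hive metastore
         .getOrCreate())

# Load semi-structured JSON into a DataFrame (schema is inferred).
events = spark.read.json("hdfs:///data/raw/events/")       # placeholder HDFS path

# Handle structured data with Spark SQL via a temporary view.
events.createOrReplaceTempView("events_raw")
daily = spark.sql("""
    SELECT to_date(event_time) AS event_date, event_type, COUNT(*) AS cnt
    FROM events_raw
    GROUP BY to_date(event_time), event_type
""")

# Persist the aggregate as a Hive table for downstream reporting.
daily.write.mode("overwrite").saveAsTable("analytics.daily_event_counts")  # placeholder table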
Zenmonics Hyderabad, India May 2012 to Nov 2015
Data Analyst
Responsibilities:
Worked on data cleaning and reshaping; generated segmented subsets using NumPy and Pandas in
Python (see the sketch at the end of this section).
Wrote and optimized complex SQL queries involving multiple joins and advanced analytical functions to
perform data extraction and merging from large volumes of historical data stored in Oracle 11g,
validating the ETL-processed data in the target database.
Good understanding of Teradata SQL Assistant, Teradata Administrator, and data loading. Experience with
data analytics, data reporting, ad-hoc reporting, graphs, scales, pivot tables, and OLAP reporting.
Identified the variables that significantly affect the target.
Continuously collected business requirements during the whole project life cycle.
Conducted model optimization and comparison using stepwise selection based on AIC values.
Developed Python scripts to automate the data sampling process; ensured data integrity by checking
for completeness, duplication, accuracy, and consistency.
Generated data analysis reports using Matplotlib and Tableau; successfully delivered and presented the
results to C-level decision makers.
Generated a cost-benefit analysis to quantify the impact of the model implementation compared with the
previous approach.
Worked on model selection based on confusion matrices, minimizing Type II error.
Environment: Tableau 7, Python 2.6.8, NumPy, Pandas, Matplotlib, scikit-learn, MongoDB, Oracle 10g, SQL
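A minimal Pandas/NumPy sketch of the cleaning, reshaping, and segmentation work described above; the file name, column names, and segmentation rule are hypothetical placeholders.

# Sketch: clean a raw extract, reshape it, and generate segmented subsets.
import numpy as np
import pandas as pd

# Load the raw extract and drop obvious problems.
df = pd.read_csv("historical_extract.csv")                 # placeholder file
df = df.drop_duplicates().dropna(subset=["customer_id", "balance"])

# Simple NumPy transformation: a log-scaled balance for skewed amounts.
df["log_balance"] = np.log1p(df["balance"])

# Reshape: one row per customer, one column per month.
monthly = df.pivot_table(index="customer_id", columns="month",
                         values="balance", aggfunc="mean")

# Generate segmented subsets, e.g. by balance quartile.
df["segment"] = pd.qcut(df["balance"], q=4, labels=["low", "mid", "high", "top"])
segments = {name: grp for name, grp in df.groupby("segment")}

# Basic integrity checks: completeness and duplication.
assert df["customer_id"].notna().all()
print("rows:", len(df), "segments:", {k: len(v) for k, v in segments.items()})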