iCEDQ Ebooks - DataOps Implementation Guide
Implementation Guide
DATAOPS FOR BIG DATA, ETL, DATA MIGRATION,
BUSINESS INTELLIGENCE REPORTING
Sandesh Gawande
CTO-ICEDQ | TORANA, INC. STAMFORD CT USA | 203 666 4442 |
[email protected]
Contents
Abstract
Problem Statement
Solution:
    What is DataOps?
    How to implement DataOps?
    Why DataOps with iCEDQ results in better Data Quality?
Conclusion
Appendix A: Enable DataOps
Appendix B: Testing and Monitoring Rule Patterns
Abstract
Data projects in the form of data warehouses, data lakes, big data, cloud data migration, BI
reporting and analytics, and machine learning are manifesting in every organization. While
project timelines are shrinking, the number of data projects is increasing, as is their complexity.
We have observed that data-centric applications lack the rigor and discipline required
to execute these large and complex projects. While general software projects have adopted
CICD and DevOps principles, data integration and migration projects are still living
under a rock. With the advent of big data and cloud technology, this has become a huge
problem.
Time-to-market for a data project has become critical in organizations of all sizes. This paper
discusses the adoption of DataOps methodologies for data and big data projects, to improve the
success of the project as well as to speed up time-to-market. We further analyze some of the
bottlenecks, such as organizational culture and data test automation, and how they
hinder the implementation of DataOps. Ultimately, we propose a DataOps
solution that improves both the delivery of the data project and the quality of the data.
Problem Statement
Data-centric projects are becoming bigger in both size and complexity, which
makes execution that much more difficult. This not only creates delays in project execution but
also results in poor data quality. More and more projects are facing:
Longer Time to Market - The time required for projects is increasing, with many
cloud data migration projects having multi-year timelines.
Delayed or Failed Projects - Data teams underestimate the complexity of
data projects, resulting in last-minute surprises as well as cost overruns.
Poor Data Quality - Projects are delayed due to testing issues that are discovered too
late in the project lifecycle.
User Dissatisfaction and Complaints - Data quality is an afterthought, resulting in high
rates of user dissatisfaction.
Costly Production Fixes - Lack of test automation results in lots of
refactoring and patchwork in production.
Testing on Big Data Volumes - The large volumes have made it generally impossible to
test the data manually.
Regression Testing Nearly Impossible - After the delivery of the project, any code revision
to the ETL processes requires complete regression testing. However, these concepts are
missing on the data engineering side.
Costly Manpower - Manual and repetitive tasks are still not automated and
require either manual work or custom coding, which often takes highly skilled talent
off other critical work.
While there are many macro and micro issues affecting the delivery of data engineering projects,
the following are some of the underlying causes:
1. Siloed Teams: The team is usually divided into development, QA, operations, and business
users. In almost all data integration projects, development teams try to build and test ETL
processes and reports as fast as possible and throw the code over the wall to the operations teams
and business users. However, when data issues start appearing in production, the business
users become unhappy. They point fingers at the operations people, who in turn point fingers at the QA
people. The QA group then puts the blame on the development teams.
2. Lack of a Code Repository: ETL jobs, database procedures, schemas, schedules, and reports are not
treated as code. When the ETL and reporting tools came into existence in the early nineties, they
created custom ETL objects or reports, and these were never treated as code.
3. Lack of a Data Management Repository: Configuration data, reference data, and test data are
not managed. A data project requires test data; however, test data is neither created in advance nor
linked to the test cases.
Reference data is required to initialize the database. For example, default values for customer
types have no upstream data source and must be created in advance. If the reference data is
missing, none of the ETL processes will work.
Configuration table data must also be prepopulated. Some of the configuration data is used for
incremental or delta processing; some data values are used to populate metadata about the
processes.
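As an illustration, a managed repository can reduce reference and configuration data to a versioned script that runs before any ETL process. The sketch below is a minimal example using SQLite; the customer_type table and its values are hypothetical stand-ins for real reference data.

```python
# Minimal sketch: seeding versioned reference data before any ETL runs.
# The customer_type table and its values are hypothetical; in practice
# this script would live in the code repository next to the ETL code so
# every environment is initialized identically.
import sqlite3

REFERENCE_DATA = {
    "customer_type": [
        (1, "RETAIL"),
        (2, "CORPORATE"),
        (3, "UNKNOWN"),  # default value the ETL falls back to
    ],
}

def seed_reference_data(conn: sqlite3.Connection) -> None:
    for table, rows in REFERENCE_DATA.items():
        conn.execute(
            f"CREATE TABLE IF NOT EXISTS {table} "
            "(id INTEGER PRIMARY KEY, code TEXT NOT NULL)"
        )
        # Idempotent load: re-running the script leaves the same state.
        conn.executemany(
            f"INSERT OR REPLACE INTO {table} (id, code) VALUES (?, ?)", rows
        )
    conn.commit()

if __name__ == "__main__":
    seed_reference_data(sqlite3.connect("warehouse.db"))
```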
4. Lack of Test Automation: The way data processes (ETL) and reports are tested is very
different from how software applications are tested. To test, the ETL process is
executed first, and the resulting data is then compared against the source to certify the ETL process.
This is because quality is determined by comparing the expected with the actual: the actual data is
the data added or updated by the ETL process, and the expected data is the input data plus the data
transformation rule(s).
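A minimal sketch of this expected-vs-actual pattern, assuming hypothetical src_orders and tgt_orders tables and a transformation rule that converts amounts using an exchange rate:

```python
# Expected-vs-actual data test. "Expected" is derived by applying the
# transformation rule to the source data; "actual" is what the ETL
# process wrote to the target. All table/column names are hypothetical.
import sqlite3

def test_amount_transformation(conn: sqlite3.Connection) -> None:
    # Expected: the business rule says target amount = amount * fx_rate.
    expected = dict(
        conn.execute("SELECT order_id, amount * fx_rate FROM src_orders")
    )
    # Actual: whatever the ETL process actually loaded.
    actual = dict(conn.execute("SELECT order_id, amount_usd FROM tgt_orders"))

    missing = expected.keys() - actual.keys()
    mismatched = {
        k for k in expected.keys() & actual.keys()
        if abs(expected[k] - actual[k]) > 0.01  # rounding tolerance
    }
    assert not missing, f"rows never loaded: {missing}"
    assert not mismatched, f"transformation rule violated for: {mismatched}"
```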
5. Lack of Automated Build and Deployment: Since most ETL and report developers use GUI
tools to create their processes, the code is not visible; the ETL tool stores it directly in its
repository. This creates a false narrative that since there is no code, there is no need to manage,
version, or integrate it. The majority of ETL tools now provide APIs to import and deploy code
into different environments, yet this functionality is often ignored.
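For example, a build job could script the import step through such a utility. The etl-cli command and its flags below are hypothetical stand-ins; vendor utilities differ in syntax, but the automation has the same shape.

```python
# Sketch of scripted deployment via an ETL tool's command-line import
# utility. "etl-cli" and its flags are hypothetical; the versioned
# export file comes from the code repository's release branch.
import subprocess

def deploy(export_file: str, target_env: str) -> None:
    # Import the versioned export file into the target environment;
    # check=True fails the build if the import fails.
    subprocess.run(
        ["etl-cli", "import", "--file", export_file, "--env", target_env],
        check=True,
    )

if __name__ == "__main__":
    deploy("release-1.4.xml", "qa")
```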
6. Lack of Agile & Test-Driven Development (TDD): While data transformation rules are
provided to developers, the business doesn't share testing and monitoring requirements during
development. Only after the developers have completed development does the focus shift to
testing. This is late in the process, and quite often this is when users start complaining.
Data monitoring requirements are only considered at this late stage.
7. Lack of Regression Testing: After the system goes live, if any data issues are found, the
development team must go back and fix the code. This creates a big regression testing challenge,
since the previous test cases must be re-run to certify the ETL flow. If a test automation tool
that stores the rules in a repository was never used, those test cases will not exist.
Solution:
Many of the problems described above have already been solved in the software development
world through concepts such as Agile development, CICD, test automation, and DevOps.
It's time the data world borrowed these ideas and adopted them as well.
What is DataOps?
DataOps is the application of Agile development, Continuous Integration, Continuous Deployment,
and Continuous Testing methodologies and DevOps principles, with the addition of some
data-specific considerations, to a data-centric project. The project can be any data integration
or data migration effort: a data warehouse, data lake, big data platform, ETL, data migration,
BI reporting, or cloud migration.
DataOps = Culture + Tools + Practices
How to implement DataOps?
A. Identify the people and their culture - In a data project there are many types of resources;
however, their roles also define their boundaries. Developers and testers sit on one side of the wall,
while business users, operations staff, and data stewards sit on the other.
DataOps is about removing this wall, and the first cultural change required for DataOps is to:
Tell the development team that they are responsible for the data quality issues that
appear in production environments.
Tell the business users that it is their responsibility to provide the data transformation
rules along with the testing and monitoring requirements.
Now, instead of sequential steps, developers can create the design and develop the tests in
parallel with the development of the data pipeline. With this non-linear timeline, time-to-market
is 33% faster.
B. Get the automation tools for DataOps - DataOps is not possible without proper automation
tools. The organization must acquire multiple software platforms to support DataOps, such as:
a. Code Repository, Ex. Git
b. QA software for Data Test Automation, Ex. iCEDQ
c. Test Data Repository, Ex. Stored in a dedicated database or file server
d. CICD software, Ex. Jenkins
e. Production Data Monitoring Software, Ex. iCEDQ
f. Issue management software, Ex. Jira, ServiceNow
The idea is to continuously integrate, deploy, test and monitor the data and processes in an
automated fashion. The purpose of each tool will be clearer with the process diagrams in the
section below.
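Before walking through those steps, here is a bird's-eye sketch of the loop these tools enable. It is a minimal skeleton, not a real pipeline: every function body is a placeholder standing in for one of the tools above (Git, Jenkins, iCEDQ, Jira/ServiceNow), and the stage names are assumptions that mirror steps 1 through 6 described below.

```python
# Skeleton of the automated DataOps loop; each stage would normally be a
# Jenkins pipeline step. All bodies are placeholder prints standing in
# for the tools listed above.
def continuous_integration(release: str) -> None:
    print(f"select and merge branches for {release} from the code repository")

def continuous_deployment(release: str) -> None:
    print(f"deploy {release} code and initialization data to the target env")

def initialization_tests() -> None:
    print("validate data structures and seeded reference/config data")

def pipeline_execution() -> None:
    print("trigger the scheduler to run ETL processes and reports")

def data_tests() -> None:
    print("run data rules to certify ETL output and reports")

def production_monitoring() -> None:
    print("keep running the same rules as production audits")

def dataops_cycle(release: str) -> None:
    continuous_integration(release)
    continuous_deployment(release)
    initialization_tests()
    pipeline_execution()
    data_tests()
    production_monitoring()

if __name__ == "__main__":
    dataops_cycle("release-1.4")
```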
C. Define the DataOps practice - Assuming the people and the tools are in place, define the
requirements process, development process, data testing process, test data management,
production data monitoring, and defect tracking.
a. Develop and Integrate in a Code Repository
1. Continuous Integration - As the previous section makes clear, all code must be stored in
a repository and available for DevOps automation. With code in a repository, it becomes easy to
manage the various branches and versions. Based on the release plan, code can be
selected and integrated with the help of CICD tools like Jenkins.
2. Continuous Deployment - The integrated code is pulled by Jenkins and deployed with the
help of APIs or command-line import and export utilities. Depending on the code type, the
code is pushed to a database, ETL, or reporting platform. The CICD tool will also
deploy initialization data into the database. This creates the necessary QA or production
environment, ready for further execution.
3. Initialization Tests - Once the environment is ready with code and data, the CICD tool will
execute iCEDQ rules to validate the data structures (database objects, tables, columns,
datatypes, etc.) as well as the initial data.
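A minimal sketch of such an initialization test, assuming a hypothetical tgt_orders table; an iCEDQ rule would express the same check declaratively:

```python
# Initialization test using SQLite's metadata: the deployed table must
# match the expected columns and datatypes before any ETL process runs.
import sqlite3

EXPECTED = {
    "tgt_orders": {"order_id": "INTEGER", "amount_usd": "REAL"},
}

def test_structures(conn: sqlite3.Connection) -> None:
    for table, expected_cols in EXPECTED.items():
        # PRAGMA table_info rows: (cid, name, type, notnull, default, pk)
        actual_cols = {
            row[1]: row[2]
            for row in conn.execute(f"PRAGMA table_info({table})")
        }
        assert actual_cols == expected_cols, (
            f"{table} structure drift: {actual_cols} != {expected_cols}"
        )
```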
4. ETL/Report Execution - The next step for the CICD tool is to trigger the scheduler, which
orchestrates execution of the ETL processes and reports.
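As an illustration, the trigger could be a simple API call from the CICD job. The scheduler endpoint and job name below are hypothetical; real schedulers (Airflow, Control-M, or the ETL tool's own) each have their own APIs.

```python
# Illustration only: the CICD job kicks off a hypothetical scheduler
# REST endpoint to run the nightly load.
import json
import urllib.request

def trigger_job(job_name: str) -> None:
    req = urllib.request.Request(
        "http://scheduler.example.com/api/jobs/run",  # hypothetical endpoint
        data=json.dumps({"job": job_name}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        if resp.status != 200:
            raise RuntimeError(f"scheduler rejected job {job_name}")

if __name__ == "__main__":
    trigger_job("nightly_load")
```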
5. ETL/Report Testing - Once the data is loaded by the ETL and the reports are executed, iCEDQ
can run the tests and verify the validity of both the ETL output and the report quality. (This step is
unique to DataOps because without first executing the ETL or the reports, there is no way
to do the data testing.)
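One common post-load check is source-to-target reconciliation. A minimal sketch, again with hypothetical table and column names:

```python
# Reconciliation rule: every source row must reach the target, and the
# row counts must match. src_orders/tgt_orders are hypothetical.
import sqlite3

def reconcile(conn: sqlite3.Connection) -> None:
    src_n = conn.execute("SELECT COUNT(*) FROM src_orders").fetchone()[0]
    tgt_n = conn.execute("SELECT COUNT(*) FROM tgt_orders").fetchone()[0]
    assert src_n == tgt_n, f"row count drift: {src_n} vs {tgt_n}"

    # Anti-join: source rows that never made it to the target.
    leaked = conn.execute(
        "SELECT COUNT(*) FROM src_orders s "
        "LEFT JOIN tgt_orders t ON s.order_id = t.order_id "
        "WHERE t.order_id IS NULL"
    ).fetchone()[0]
    assert leaked == 0, f"{leaked} source rows never reached the target"
```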
6. Production Monitoring - Once the system is live, the hooks left by the development and
QA teams are used for monitoring the production systems, which is sometimes
referred to as white-box monitoring. The business also benefits, as the hooks
(testing rules) developed by the QA teams are now available to monitor the production data pipeline
on an ongoing basis.
a. Once the system is online and running on its schedules, the Audit Rules
in iCEDQ will also start running.
b. When iCEDQ notices any discrepancy in the data, it will identify the specific
data issues and raise alerts.
c. The issue logging system then drives either changes to the data
pipeline or simple corrections of the data.
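As an illustration, a QA test rule reused as a production audit might look like the sketch below. The orphan-order rule and the issue-logging stub are hypothetical; a real deployment would create tickets via the Jira or ServiceNow API.

```python
# A QA rule reused as a production audit: every order must reference a
# known customer. Table names are hypothetical; log_issue stands in for
# an issue-tracker integration.
import sqlite3

def audit_orphan_orders(conn: sqlite3.Connection) -> list:
    return conn.execute(
        "SELECT o.order_id FROM tgt_orders o "
        "LEFT JOIN dim_customer c ON o.customer_id = c.customer_id "
        "WHERE c.customer_id IS NULL"
    ).fetchall()

def log_issue(summary: str, sample: list) -> None:
    # Placeholder: a real system would create a ticket and notify the
    # on-call data steward.
    print(f"ALERT: {summary}; sample rows: {sample}")

def run_audit(conn: sqlite3.Connection) -> None:
    orphans = audit_orphan_orders(conn)
    if orphans:  # raise an alert only when a discrepancy appears
        log_issue(
            f"{len(orphans)} orders reference unknown customers",
            sample=orphans[:5],
        )
```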
If there is a change in the code, whether due to defects found in the data or a newly discovered
business requirement, the DataOps cycle repeats.