Data Fabric: Smart Data Engineering, Operations, and Orchestration
Dave Wells
September 2019
Research Sponsored by Infoworks
Get more value from your data. Put an expert on your side.
Learn what Eckerson Group can do for you!
Table of Contents
Executive Summary
  Key Takeaways
  Recommendations
The Trouble with Data Management
  Data Silos
  Data Engineering
  Data Operations
  Data Orchestration
What is Data Fabric?
  Data Fabric Defined
  Data Fabric Concepts and Principles
Why a Data Fabric?
  Data Management Complexities
Data Fabric Functions and Features
  A Single Data Management Platform
  Data Fabric Components
  Data Management across the Analytics Lifecycle
  Data Infrastructure Management
State of the Market
  Data Fabric Use Cases
  Current State
  The Future of Data Fabric
Getting Started with Data Fabric
  Do You Need Data Fabric?
  Recommendations
About Eckerson Group
About Infoworks
Executive Summary
Data fabric is a combination of architecture and technology designed to streamline the
complexity of managing many different kinds of data, stored in multiple database management
systems, and deployed across a variety of platforms. A typical data management organization
today has data deployed in on-premises data centers and multiple cloud environments. It has
data in flat files, tagged files, relational databases, document stores, graph databases,
and more. Processing spans technologies from batch ETL to changed data capture, stream
processing, and complex event processing. This variety of tools, technologies, platforms, and
data types makes it difficult to manage processing, access, security, and integration. Data
fabric provides a consolidated data management platform. It is a single platform to manage
disparate data and divergent technologies deployed across multiple data centers, both cloud
and on-premises.
The complexities of modern data management expand rapidly as new technologies, new kinds
of data, and new platforms are introduced. As data becomes increasingly distributed across
in-house and cloud deployments, the work of moving, storing, protecting, and accessing data
becomes fragmented with different practices depending on data locations and technologies.
Changing and bolstering data management methods with each technological shift is difficult
and disruptive, and will quickly become unsustainable as technology innovation accelerates.
Data fabric can serve to minimize disruption by creating a highly adaptable data management
environment that can quickly be adjusted as technology evolves.
Key Takeaways
• Data fabric is an emerging solution to the complexities of modern data
management. It combines architecture, technology, and services to automate
much of data engineering, operations, and orchestration.
• Data fabric provides a single, unified platform for data management across
multiple technologies and deployment platforms.
• No single vendor provides a complete data fabric solution today. Choose the right
technologies to weave your data fabric. Interoperability is a key consideration.
Recommendations
• Don’t ask if you need data fabric. Ask when you’ll need data fabric.
• Make the case for data fabric in three dimensions—business case, technical case,
and operational case.
• Identify your data fabric use cases. They may include managing complex data
systems, modernizing data management architecture, migrating to cloud, moving
to DataOps, and more.
• Leverage existing technology when weaving your data fabric, but don’t be
anchored by it.
• Don’t forget the human side of data fabric. Data management has many
stakeholders and you’ll need to engage them all.
Data Silos
In the age of data-driven business, data is everywhere. That is a good thing for data-hungry
processes and analytics but a challenge for data management. When data is siloed across
multiple cloud platforms and also stored in on-premises databases, it becomes difficult to find,
blend, and integrate when needed. The complex deployment landscape shown in figure 1
illustrates typical deployments today. This landscape has four separate cloud environments as
well as several systems operated in an on-premises data center.
These kinds of data deployments can’t be avoided today. With legacy systems, SaaS
applications, data warehouses, and data lakes, data is spread widely across platforms and
technologies. Although data is abundant, it is isolated and difficult to find, access, and
integrate. One goal of data fabric is to “connect the dots” across data silos.
Data Engineering
Data engineering is a critically important part of analytics that receives little attention
compared to data science. Recent research shows 12 times as many unfilled data engineer
jobs as data scientist positions. The breadth and depth of required skills limit the number of
qualified people to work as data engineers. Clearly the demand for data engineers outstrips
the supply, and the gap continues to grow. The large number of unfilled jobs reflects the
complexity of data engineering. Breadth of knowledge ranges from relational databases to
NoSQL, from batch ETL to data stream processing, and from traditional data warehousing
to data lakes. Depth of skills includes hands-on work with Hadoop; programming in Java,
Python, R, Scala, or other languages; and data modeling from relational and star-schema to
document stores and graph databases. The data engineer is part database engineer (building
the databases that implement data warehouses, data lakes, and analytic sandboxes) and part
software engineer (building the processes, pipelines, and services that move data through the
ecosystem and make it accessible to data consumers). One goal of data fabric is to automate
much of data engineering to increase reuse and repeatability, and to expand data engineering
capacity.
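To ground the software-engineering half of the role, here is a minimal sketch in Python of the kind of extract-transform-load process a data engineer builds, and that data fabric aims to make reusable and repeatable. All function names, and the in-memory source and store, are illustrative assumptions rather than any product’s API:

```python
# Minimal ETL sketch: the data engineer builds both the store and the
# process that moves data into it. Names here are illustrative.

def extract(source_rows):
    """Pull raw records from a source (here, an in-memory stand-in)."""
    return list(source_rows)

def transform(rows):
    """Harmonize records: normalize fields and drop incomplete rows."""
    cleaned = []
    for row in rows:
        if row.get("customer_id") is None:
            continue  # incomplete record; a real pipeline would quarantine it
        cleaned.append({"customer_id": row["customer_id"],
                        "amount": float(row.get("amount", 0))})
    return cleaned

def load(rows, store):
    """Write harmonized records into a destination keyed by customer."""
    for row in rows:
        store.setdefault(row["customer_id"], 0.0)
        store[row["customer_id"]] += row["amount"]
    return store

def run_pipeline(source_rows, store):
    """Run the full extract-transform-load sequence."""
    return load(transform(extract(source_rows)), store)
```

A real pipeline would read from databases or files and write to a warehouse or lake, but the structure is the same, and that sameness is what makes much of this work automatable.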
Data Operations
Rapid, reliable, and repeatable delivery of production-ready data for reporting, analytics, and
data science is an ongoing challenge. Operationalizing data pipelines is difficult. When data
is spread across multiple platforms, pipeline processing must be able to span on-premises,
cloud, multi-cloud, and hybrid environments. Sustaining data pipelines is equally challenging
as business needs, data sources, and technologies continuously change. Fault tolerance
is critical to data operations. Entire analytics supply chains are disrupted when a data
pipeline fails, and repairs are especially slow and difficult when everything is done manually.
Data operations must also be attentive to data protection, data governance, data lineage,
metadata, and auditability.
DataOps practices answer these challenges with automation across the full lifecycle of
data pipelines and analytic models. One goal of data fabric is to fully support the automation
needed for DataOps success, with ability to automate across on-premises, cloud, and hybrid
data ecosystems.
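Fault tolerance of the kind described above is usually achieved through automated retry rather than manual repair. The following is a minimal, hypothetical sketch in Python; production DataOps tooling would add logging, alerting, and checkpointing on top of this pattern:

```python
import time

def run_with_retry(step, retries=3, backoff_seconds=0.0):
    """Run a pipeline step, retrying on failure so one transient fault
    does not halt the whole analytics supply chain."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return step()
        except Exception as exc:  # a real system would log and alert here
            last_error = exc
            time.sleep(backoff_seconds * attempt)  # simple linear backoff
    raise RuntimeError(f"step failed after {retries} attempts") from last_error
```

Wrapping each pipeline stage this way turns a fragile, manually repaired chain into one that recovers from transient faults on its own.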
Data Orchestration
Execution environments have many of the same challenges as data environments. Few
organizations today have a single execution environment. Pushing processing to data
locations results in on-premises, cloud, multi-cloud, hybrid, and edge environments
for runtime processing of data. Separating computation from data and scaling each
independently is fundamental to operating at this extreme of distributed and parallel processing.
End-to-end data pipelines often span multiple execution environments. Managing data access
and processing across these complex environments requires attention to configuration and
coordination, workflow and scheduling, cross-platform interoperability, fault tolerance, and
performance optimization. Data orchestration is a multi-faceted and complex job that can’t be
done without automation. One goal of data fabric is to support automation across the many
dimensions of data orchestration.
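The coordination, workflow, and scheduling concerns above reduce, at their core, to running dependent tasks in the right order. Here is a minimal sketch using Python’s standard-library topological sort; real orchestration tools add cross-platform execution, fault tolerance, and performance optimization on top of this idea, and the task and dependency names below are illustrative:

```python
from graphlib import TopologicalSorter

def orchestrate(tasks, dependencies):
    """Run tasks in dependency order.

    `tasks` maps a task name to a callable that receives the results of
    earlier tasks; `dependencies` maps a task name to the set of task
    names it depends on.
    """
    results = {}
    for name in TopologicalSorter(dependencies).static_order():
        results[name] = tasks[name](results)
    return results
```

In a real data fabric, each task might run in a different environment (on-premises, cloud, or edge), which is why configuration and cross-platform interoperability dominate the orchestration problem.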
• A new way to manage and integrate data that promises to unlock the power of
data in ways that shatter the limits of previous generations of technology such as
data warehouses and data lakes
• Based on a graph data model … able to absorb, integrate, and maintain the
freshness of vast quantities of data in any number of formats
• A system that provides seamless, real-time integration and access across the
multiple data silos
Each of these items is quoted directly from software vendors offering data fabric solutions.
In the interest of remaining vendor-neutral, specific vendor attribution is omitted here.
The fabric metaphor also applies from the perspective that fabrics are woven from fibers. With
data fabric we seek to weave data management into the activities and workflow of everyday
business processes—to operationalize and orchestrate in a way that makes data an integral
part of doing business.
Beyond metaphor and concept, the following principles are fundamental to data fabric:
• Ability to find, access, and combine data from all sources regardless of type and
location.
• Support for the speed, scale, and reliability needed for enterprise-grade data
systems.
• Ability to process and provision data at all velocities from streaming to batch.
• Support for multiple processing engines including Hadoop, Spark, Samza, Flink,
and Storm.
• Ability to move data from one platform to another without extensive refactoring.
• Unified data access: Providing a single and seamless point of access to all data
regardless of structure, database technology, and deployment platform creates a
cohesive analytics experience working across data storage silos.
• Cloud mobility and portability: Minimizing the technical differences that lead
to cloud service lock-in and enabling quick migration from one cloud platform to
another supports the goal of a true cloud-hybrid environment.
Beyond a single data management platform, data fabric should provide a smart data
management platform that uses algorithms to understand data characteristics, collect
metadata, inform and advise data consumers, protect sensitive data, automate repetitive and
complex processes, and much more. Machine learning should enable adaptive data fabric
capable of adjusting to changing data, changing business needs, and changing user behavior.
• Varied applications and use cases including reporting, analytics, and data
science.
• Common shared data stores such as data warehouses and data lakes.
• Data ingestion has undergone dramatic change with growth of data volume,
variety, and velocity. Modern ingestion methods must support batch, real-time,
and stream processing. Relational data must coexist with big data formats such
as JSON, time-series, Apache Avro, and Apache Parquet and, most notably of late,
with data in cloud object stores—particularly AWS S3. Smart data ingestion tools will
recognize each incoming data type and format, understand the structure of
datasets with self-describing schema, recognize previously known data sources
and process them as needed, and detect and respond to schema changes. When
disruptive data source and schema changes require human intervention, the
fabric will employ machine learning to improve its ability to automatically adapt
to similar changes in the future.
• Data transport moves data across networks, from one storage location to
another or from a storage location to a processing location. With cloud data
storage or cloud-based processing, this typically means transporting data across
the internet—exposing data on public networks. Securing data in motion is an
important consideration. Ideally, smart data transport technology automatically
recognizes sensitive data assets and protects them with encryption or secure
data transport protocols.
• Data storage technology advances are among the most important drivers of
change in data architecture today. In-memory and combinations of in-memory
and disk methods have altered the economics of data stores for both structured
and unstructured data. Ultralow-cost cloud object storage technologies are
experiencing rapid adoption. Smart data fabric will support the full variety of
data storage options, automatically applying the best mix of storage technologies
depending on use cases. The trend toward stateful microservices and containers
adds yet another dimension to data storage evolution, and emphasizes the role of
data fabric in “future-proofing” data architecture and technology infrastructure.
• Data pipelines consist of data flow and processing that moves data from an
origin (a data source or data store) to a destination (a data store or consuming
application) and transforms the data to meet requirements of the destination.
Modern data pipelines must handle high-velocity processing of messages, logs,
and data streams, in real-time and with scheduled or triggered batch processing.
The new generation of data pipeline technology applies AI/ML for smart data
pipelines able to recognize incoming data, know how to process that data and
where to deliver it, project data volumes and dynamically scale, and detect and
respond to data and process anomalies. The pipelines often use microservices
and containers. They are decoupled from specific execution platforms and
technologies, providing a high level of portability and ease of migration between
execution environments. (See figure 4.)
• Data access is provided through query, APIs, and data services. A wide variety of
human and digital data consumers, combined with diverse use cases, requires
support for many data access protocols. Query continues to be a common and
widely used form of data access. Smart data fabric may include intelligent query
optimization. APIs and data services, based on remote procedure call (RPC),
SOAP, and REST protocols often build semantics, intelligence, and rules into data
access mechanisms.
• Metadata management helps data managers and data consumers assess data
quality, prepare and provision data, protect and govern data, trace data lineage,
and trust data and analysis results. Opportunities for automated metadata
extraction and collection exist throughout the data fabric. Metadata is the means
by which data managers and data consumers come to know the data, and knowing
the data is fundamental to deriving value from it.
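The schema-change detection described under smart data ingestion can be sketched as a comparison of an incoming record’s inferred schema against the last known schema for its source. The function names and the dictionary-based schema registry here are illustrative assumptions, not a vendor API:

```python
def detect_schema(record):
    """Infer a simple schema (field name -> type name) from one record."""
    return {key: type(value).__name__ for key, value in record.items()}

def check_schema_drift(known_schemas, source, record):
    """Compare an incoming record against the last known schema for a
    source; report added and removed fields so ingestion can adapt or
    escalate instead of silently failing."""
    new_schema = detect_schema(record)
    old_schema = known_schemas.get(source)
    known_schemas[source] = new_schema
    if old_schema is None:
        return {"status": "new_source"}
    added = sorted(set(new_schema) - set(old_schema))
    removed = sorted(set(old_schema) - set(new_schema))
    if added or removed:
        return {"status": "drift", "added": added, "removed": removed}
    return {"status": "ok"}
```

A smart fabric would go further, learning from how humans resolve each drift event so that similar changes can be handled automatically in the future.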
Data fabric brings all of these components together as a single data management platform.
Orchestration and cataloging are the foundation. Ingestion, transport, and storage manage
data in motion and data at rest. Pipelines, preparation, and provisioning blend, harmonize,
integrate, and otherwise make data ready for analysis. Query, APIs, and services make the
data accessible. It is all supported by the critical administrative and oversight capabilities of
security and protection, governance, and metadata management.
Revisiting some of the data fabric components described earlier helps to understand many of
the stakeholder needs and expectations.
• Data ingestion—Chief data officers expect that any data needed by the business,
regardless of source and format, can be ingested at the speed needed by the
business. Data owners and stewards expect that security- and privacy-sensitive
data is recognized as soon as it enters the data ecosystem. Data engineers need
the right technologies to ingest data of all types at any velocity from streaming to
batch. Smart data fabric helps to ensure reliable data ingestion processes without
disruption from schema and data source changes.
• Data Catalog—Data stewards need to have data cataloged and expect that all
of the right metadata is recorded in the catalog to accurately describe the data
and to help data consumers know how to use it. Data engineers use the catalog
to find reusable datasets and reusable processes that promote consistency and
reduce redundancy in data engineering work. Scientists, analysts, and report
writers expect the catalog to provide robust search capabilities, and to help
them to find, understand, evaluate, and access data. They also expect to find
data wherever it resides without needing to know or care about deployment
databases and platforms. Business managers expect faster analysis and greater
analytics capacity with the catalog radically reducing the time that analysts
spend finding and understanding data. Smart data fabric uses the AI/ML features
of data catalogs to automate metadata extraction and collection.
• Data Storage—Data engineers need the ability to mix and match storage
technologies depending on data types, data formats, and expected uses of the
data. They need to store relational data, documents, images, geospatial data,
property graphs, knowledge graphs, and more. Their expectations include
storing data both on premises and in the cloud, using relational databases,
NoSQL databases, and object stores. Smart data fabric helps data engineers
make informed choices for data storage.
• Data Access—Data owners and stewards need to ensure that data access
works according to data security and protection requirements. Data engineers
need quick and easy access to data for provisioning, and they need to take full
advantage of reusable processes and pipelines when building query, API, and
data services capabilities. Data consumers expect data to be readily accessible
when needed. Smart data fabric supports authorized access connected with
existing security infrastructure for authentication and authorization. For data
consumers, it connects data catalog and data access to support seamless data
searching, understanding, evaluation, and access.
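The catalog search that analysts, scientists, and report writers expect can be illustrated with a minimal keyword search over catalog entries; commercial catalogs add ranking, semantics, and ML-driven recommendations. The entry fields shown are illustrative assumptions:

```python
def search_catalog(catalog, term):
    """Find datasets whose name, description, or tags mention the term,
    so consumers can locate data without knowing where it is deployed."""
    term = term.lower()
    hits = []
    for entry in catalog:
        haystack = " ".join([entry["name"], entry["description"],
                             " ".join(entry.get("tags", []))]).lower()
        if term in haystack:
            hits.append(entry["name"])
    return hits
```

Note that nothing in the search result exposes the deployment database or platform; resolving the hit to an access path is the fabric’s job, not the consumer’s.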
Yes, anticipating the changes is easy. Responding to those changes is more difficult and filled
with hard questions. How do you continue to increase data storage capacity to keep up with
data growth? How do you continue to increase compute capacity to cope with more data
pipelines, more complex analytics, and more simultaneous processing driven by a growing
population of self-service users? How do you fit new technologies into the existing technology
landscape and architecture? How do you pull the trailing edge of technology forward as you
push the leading edge? How do you commit to technologies that meet today’s needs without
experiencing vendor lock-in that limits your options for the future?
Obviously, these aren’t easy questions to answer. But data fabric holds promise to ease the
pain. With a unified platform for data management, the changes need to be accommodated
in only one place instead of across a disparate collection of data management technologies
with limited interoperability. With smart data fabric and microservices/container architecture,
processing is easily ported between environments and technologies, and the infrastructure
becomes highly resilient.
Manual data management at modern scale isn’t practical. Organizations driving toward self-service and data democratization will
experience many of the challenges that data fabrics address. Those pursuing data-driven
digital transformation will find it difficult to achieve without smart and automated data
engineering, operations, and orchestration.
Current State
Data fabric is a relatively new concept and the technology market is best characterized as
emerging and evolving. Today no single vendor provides a complete data fabric solution in a
single product. Data fabric vendors fit into the following three main categories:
• Single Product—A few vendors offer a single product that provides much, but
not all, data fabric functionality. These are good choices if they offer the functions
that are most essential for your organization and have a strong roadmap for the
future.
Data fabric architecture varies with vendors and products. One vendor sees orchestration as
the umbrella under which all other functions operate; orchestration is the glue that holds it all
together. Another vendor drives data fabric from a graph perspective where the combination
of relationships and AI/ML are the foundation of data fabric capabilities. One positions the
data catalog as the centerpiece, shifting from passive catalog to what they call “active data
hub.” Yet another vendor bases data fabric capabilities on data virtualization, with an abstract
semantic layer as the foundation. Data lake management vendors are also evolving to become
data fabric providers, building on their pipeline management and storage management
functions with expanded automation and orchestration features.
Expect vendors to combine these approaches: a catalog
with orchestration, for example, or a catalog and graph combination. Ultimately, every data
management product will have a role in data fabric, but today’s leaders will shape the future.
Don’t ask if you need data fabric. Ask when you’ll need data fabric.
If these questions reflect the data management challenges you face, then it is time to look
closely at data fabric and how it will fit into your data management processes and practices.
Recommendations
The top-level conclusion from this report is that you will need data fabric to build and sustain
a data-driven organization. Getting to data fabric is a journey that involves several activities.
• Begin by looking at the business case for data fabric. What is the cost/value ratio
of data management in your organization today? How much could you reduce
cost through automation? How many value opportunities go unrealized? How
much can data value be increased? What benefits would unified data governance
offer?
• Take a look at the technical case. Do you need multi-cloud support? Do you
need cloud-hybrid support? Is your data management scalability and elasticity
challenged? Does growth of data, processing, and users routinely degrade system
performance? Do you need technical infrastructure agility? Are you concerned
about vendor lock-in?
• Consider your data fabric use cases. Is your data management environment
highly complex and difficult to manage? What data initiatives do you have
planned or in process? Are you modernizing data architecture? Are you migrating
to cloud? Are self-service and data democratization underway or in your future?
Do you aspire to become a DataOps organization?
• Itemize the tangible benefits of data fabric. What will you gain with unified data
management? How will unified data access provide benefits? What results do you
expect from consolidated data protection? What value will central service level
management provide? What needs can be met with cloud portability? How will
infrastructure resilience help you?
Now you know the why of data fabric. The next steps are to answer what and how.
• Determine your preferred approach to data fabric. Do you want a single product
solution, a single vendor solution, or a multiple vendor solution?
• Plot the roadmap to implementing your preferred solution. If single product, what
is the path from product evaluation to implementation? If single vendor, what
is the path from vendor and product evaluation to implementation? If multiple
vendor, how will you select products, in what sequence will you implement, and
how will you stitch them together as a fabric of interoperating technologies?
• Don’t forget the human side of data fabric. Who are the stakeholders? How will
you get them engaged and committed? What participation is needed from each?
What organizational changes might be needed?
Getting to data fabric isn’t quick or easy. But for every data-driven organization it is necessary,
either now or in the future. Begin today by asking the questions, exploring the possibilities,
and looking toward a future of smart data engineering, operations, and orchestration.
Unlike other firms, Eckerson Group focuses solely on data analytics. Our experts each have
more than 25 years of experience in the field. They specialize in every facet of data analytics—
from data architecture and data governance to business intelligence and artificial intelligence.
Their primary mission is to help you get more value from data and analytics by sharing their
hard-won lessons with you.
Our clients say we are hard-working, insightful, and humble. We take the compliment! It all
stems from our love of data and desire to help you get more value from analytics—we see
ourselves as a family of continuous learners, interpreting the world of data and analytics for
you and others.
About Infoworks
Infoworks provides the first Enterprise Data Operations and Orchestration software system
to automate the development and operationalization of data pipelines from source to
consumption in support of business intelligence (BI), machine learning (ML) and artificial
intelligence (AI) analytics applications. Infoworks’ code-free development environment allows
organizations to develop and manage end-to-end data workflows without requiring an army
of big data experts. The software system automates and simplifies development of data
ingestion, data preparation, query acceleration and ongoing operationalization of production
data pipelines at scale. Infoworks supports cloud, multi-cloud, and on-premises environments,
enabling customers to deploy projects to production within days, dramatically increasing
business agility and accelerating time to value.