
Data Management Guide Checklists


Eight Essential Checklists

for Managing the Analytic Data Pipeline


Contents
Introduction
Checklist 1: Data Connectivity
Checklist 2: Data Engineering
Checklist 3: Data Delivery
Checklist 4: Data Preparation
Checklist 5: Analytics
Checklist 6: Pipeline Automation and Management
Checklist 7: Governance and Security
Checklist 8: Extensibility and Scalability
The Bottom Line
About Pentaho



Introduction
Organizations of every type are racing to use their data better. We all know the importance
of analytics and business intelligence (BI) when it comes to successfully operating a company.
But in an age when the number of data sources and data volumes is exploding, it’s essential
to make certain that your data is analytics-ready at the beginning of the data pipeline, rather than
at the random points where business users may need to use it.

Failing to take a holistic approach to your data pipeline can yield dark, unused data, or worse, it may compel organizations to make critical business decisions based on inaccurate data. Within any analytics pipeline, the right data management processes are paramount to providing accurate and reliable information. These processes help to future-proof your pipeline against changing analytic needs and emerging technologies.

451 Research, an IT research company, validated the importance of data management in their recent report titled “Data Platforms and Analytics Market Map 2016.” The authors note that “Data management is an essential part of the analytics process, and is defined as the management of data using a number of specific tools – or a broader platform combining multiple tools – with the endgame of enabling analytics.”

In other words, it’s important to consider whether the right approach for your company is to try to assemble a patchwork of data management tools, one for each part of the data management process, or if time and cost savings can be obtained through a comprehensive platform. Pentaho discusses the importance of an end-to-end platform in managing the data pipeline in its blog post “The Holy Grail of Analytics.”

This guide provides a checklist of eight essential categories to consider as you evaluate your vendor options, and flags potential pitfalls to guard against as you plan your pipeline. Each category is a critical part of the analytic data pipeline, from data connectivity and preparation to analytics. You must also consider ease of use, flexibility, and governance to ensure that the right people can access the right data at the right time.



Checklist 1
Data Connectivity
Data connectivity is essential to ensure that business-critical data is available for analysis and that your platform is ready to handle newer data sources and types – including big data sources – from which competitive advantage is so often gained.

To manage your data pipeline effectively, your tools must have the right connectivity to both traditional and emerging sources of structured, semi-structured, and unstructured data. When evaluating potential vendors, it’s important to ask questions about their connectivity capabilities when it comes to different types of data sources.

Data Source Types | Questions to Ask Your Vendor | Potential Vendor Pitfalls

Flat Files | Can you access flat files saved via a variety of operating systems using a range of file storage options? | Vendors may support only Linux or Microsoft Windows operating systems. Also, vendors may not support SSDs, flash drives, and other high-performance storage options.

Relational Databases | Do you connect to a wide variety of relational DBMSs via JDBC or ODBC? | Vendors may support only popular databases such as Oracle DB, Microsoft SQL Server, IBM DB2, Sybase, or IBM Informix. They may not support newer databases such as MySQL, PostgreSQL, Hypersonic SQL, and H2.

Legacy / Nonrelational DBMSs | Do you read raw data from legacy or nonrelational data sources? | Vendors may connect to IBM AS/400 but not to mainframes, COBOL environments, and other legacy systems.

ERP, CRM, and SCM Sources | Can you connect to packaged applications? | Vendors may connect to popular systems such as SAP ERP and Salesforce CRM but not to newer systems such as OpenERP or Splunk.

Cloud Sources | Can you connect to cloud data sources? | Vendors may connect to only a subset of Amazon’s data sources that includes Amazon Redshift, Amazon S3, Amazon EMR, Amazon RDS, and Amazon DynamoDB.

SaaS Applications | Can you connect to a variety of SaaS applications? | Vendors may connect to only a few SaaS applications such as Google Analytics or JIRA.

API Inputs | Can you use a variety of APIs to connect to different web services? | Vendors may support only web services such as REST API but not older file transfer protocols such as FTP or HTTP.

Industry Standards | Can you read a variety of industry standard data streams and feeds? | Vendors may support only standard formats for a specific vertical such as SWIFT for financial services and may not support older standards such as EDI or standards in a different vertical such as HL7 in healthcare.

Message Queues / EAI Products | Can you read messages from a variety of queues? | Vendors may support only JMS or IBM MQ server.

Semistructured Data (JSON, XML) | Do you read and process complex XML files? | Vendors may have limitations using XPath specifications, adding XML tags into a stream of data, or reading data from RSS and Atom feeds.


Log Files | Do you integrate and parse complex application logs? | Vendors may use manual coding for parsing infrastructure log files or mobile log files.

Social Data | Do you connect to prominent social media sites? | Vendors may have generic REST APIs and may not be optimized to read from Facebook or Twitter feeds.

Mobile Platforms | Do you collect mobile device information? | Vendors may be limited to Android devices from the top two manufacturers, such as Samsung or LG, and may not support closed systems such as iOS or Windows Phone.

Spatial Data | Do you connect to Esri data and do you offer any GIS features? | Vendors may not offer geolocation data or provide geocoding services.

Structured / Unstructured Data | Do you connect to a variety of structured and unstructured data sources? | Vendors may support a limited set of file types such as PDF and RTF and may not support audio, image, and video formats. The vendor may access email through either POP or IMAP.

Web Clickstream Data | Do you connect to and parse clickstream data? | While vendors may be able to access server log files, they may not be able to access application logs that are often stored in RDBMSs, NoSQL, or other stores. Also, vendors may not provide connectivity to Google Analytics for clickstream data.

Call Detail Records (CDR) | Do you parse CDR data in different common message formats and protocols? | Vendors may not easily adapt to changes in CDR structure and may support only a basic structure.

Big Data Future-Proofing | Can you help provide future-proofing by connecting to different sources of big data? | While vendors may be able to connect to some NoSQL databases such as Apache Cassandra or MongoDB, they might not be able to integrate data from different on-premise and cloud Hadoop distributions.



Checklist 2
Data Engineering
Data engineering requires more than just connecting to or loading data. Rather, it involves managing a changing array of data sources, establishing repeatable processes at scale, and maintaining control and governance. Whether an organization is implementing an ongoing process for ingesting hundreds of data sources into Hadoop or enabling business users to upload diverse data without IT assistance, onboarding projects tend to create major obstacles. Additionally, as technology researcher O’Reilly has noted, data engineering best practices mean that your data pipeline should be reproducible, consistent, and productionizable. Consider vendors’ components and processes that will enable you to go from information delivery to data integration all the way through to analytics and reporting.

Capability | Questions to Ask Your Vendor | Potential Vendor Pitfalls

Drag-and-Drop UI | Do you offer a drag-and-drop capability? | Drag-and-drop steps may be too generic, requiring additional configuration and manual coding.

Repeatability | Do you source data in a repeatable fashion? | Vendors may need to tweak the ingestion process for even small changes in source structure.

Event Framework | Do you deliver data through automated processes initiated by a variety of events? | Vendors may not support event processing to automatically initiate the transformation process.

Operationalizing Self-Service Data Prep | Do you operationalize data preparation rules created by business users so they can be rerun on schedule or on demand? | Vendors may not understand the preparation rules since these may have been created by a different tool from another vendor.

Metadata-Driven | Can you share and reuse data definitions and metadata? | Vendors may require that each step in the transformation process be hard-coded rather than metadata-driven.

Integration | Do you permanently join data from different sources? | Vendors may support only a limited set of joins or joins of similarly structured data.

Prebuilt DI Steps | Do you offer me a range of sort, lookup, and join steps? | Vendors may offer only the most common variations of these steps.

Predictive Model Support | Do you operationalize advanced analytics models? | Vendors may require users to manually feed data to their models with a custom script at a later stage in the pipeline.

Clustering | Do you enable me to cluster servers to improve performance? | Vendors may limit clustering to a homogeneous set of servers having the same operating systems.

Load Balancing | Do you enable me to distribute the transformation process across different nodes? | Vendors’ ability to distribute the transformation process may not consider the current load on the system outside the data engineering process.


High Availability With Automatic Restarts | Do you automatically recover processes from external failures? | Vendors may provide only partial support by restarting the engineering process from the beginning rather than the point of failure.

Data Federation | Do you extract data from multiple sources, integrate the data, and flow that data directly into reports? | Vendors may need to create a staging area for a BI tool to read from directly.

Cloud Processing | Can you process data using a cloud-processing capability? | Vendors’ list of supported cloud infrastructure providers may include Amazon EC2 instances only.
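
To make the "High Availability With Automatic Restarts" row concrete, here is a minimal Python sketch of the difference between restarting from the beginning and resuming at the point of failure. It is an illustration only, not any vendor's implementation: the names `run_pipeline`, `clean`, and `flaky_upper` and the in-memory `checkpoint` dictionary are all assumptions made for this example.

```python
from typing import Any, Callable, Dict, List

Step = Callable[[list], list]


def run_pipeline(steps: List[Step], data: list, checkpoint: Dict[str, Any]) -> list:
    """Run steps in order, resuming after the last step recorded in checkpoint."""
    start = checkpoint.get("next_step", 0)
    current = checkpoint.get("data", data)
    for index in range(start, len(steps)):
        current = steps[index](current)      # may raise on an external failure
        checkpoint["next_step"] = index + 1  # record progress only after success
        checkpoint["data"] = current
    return current


calls = {"clean": 0, "flaky": 0}


def clean(rows: list) -> list:
    calls["clean"] += 1
    return [row.strip() for row in rows]


def flaky_upper(rows: list) -> list:
    # Hypothetical step that fails once to simulate an outage.
    calls["flaky"] += 1
    if calls["flaky"] == 1:
        raise RuntimeError("simulated external failure")
    return [row.upper() for row in rows]


checkpoint: Dict[str, Any] = {}
steps = [clean, flaky_upper]
try:
    result = run_pipeline(steps, [" a ", " b "], checkpoint)
except RuntimeError:
    # The second attempt skips the completed step and resumes at the failed one.
    result = run_pipeline(steps, [" a ", " b "], checkpoint)
```

In this sketch the `clean` step runs once even though the pipeline is invoked twice. A production tool would persist the checkpoint outside the process, which this example does not attempt.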



Checklist 3
Data Delivery
Getting data where it needs to go is essential. Some solutions perform better with traditional data warehouses and some perform better with newer technologies. It’s important to consider how to future-proof your solution so that you can avoid getting stuck with outdated technology when innovative companies and the open source community develop something new, requiring you to adapt to the changing data landscape. Here are some questions to consider to help you ensure your data pipeline strategy is flexible and nimble.

Locations to Deliver Data | Questions to Ask Your Vendor | Potential Vendor Pitfalls

Data Marts | Do you connect to a variety of data marts? | Vendors may support only MySQL or PostgreSQL and may not support popular options such as Microsoft SQL Server.

Enterprise Data Warehouses (EDWs) | Do you connect to a variety of data warehouses and support bulk data movement? | Vendors may support only IBM Netezza, Teradata, and possibly Oracle Exadata and may not support bulk loading for newer data warehouses such as SAP HANA and Greenplum Database.

Analytic Databases | Do you connect to a wide variety of analytic databases that use high-performance capabilities such as columnar storage and MPP? | Vendors may support on-premise analytic databases such as Infobright, Greenplum Database, and HP Vertica while skipping cloud databases such as Amazon Redshift.

NoSQL Databases | Do you connect to schemaless NoSQL stores? | Vendors may support MongoDB and Apache HBase while ignoring Apache Cassandra and Apache CouchDB.

Hadoop | Do you connect to and run transformations natively on a variety of Hadoop distributions and connect with the major components of the Hadoop ecosystem? | Vendors may connect to and run transformations natively as MapReduce or Spark processes on Cloudera and Hortonworks, but they may not support other popular distributions such as MapR and Amazon EMR. Vendors may be unable to connect with Hive or Impala. They may fail to integrate with other Hadoop ecosystem components such as HBase, Oozie, Sqoop, and YARN. They may not support Avro file formats.

In-Memory Data Sources | Do you connect to and take advantage of in-memory databases? | Vendors may support popular SAP HANA databases but not others such as EXASOL, H2, and Infobright.



Checklist 4
Data Preparation
As Forbes noted in 2016, data scientists spend up to 80% of their time simply preparing data – time that could be better spent building analytical models. Stand-alone tools that help with data preparation may lack the flexibility to blend both traditional and new unstructured data sources. The more stand-alone vendors you use, the more likely you are to run into problems in going from one stage of the data pipeline to another. Here are some considerations when choosing the right vendor to help you with an end-to-end data pipeline solution.

Capability | Questions to Ask Your Vendor | Potential Vendor Pitfalls

Data Discovery | Do you allow me to easily search, explore, and discover data sources, tables, columns, and files and to request permission if needed? | Vendors may allow you to discover data sources without considering your security needs.

Data Access | Is the connectivity for self-service users the same as it would be for IT or is it limited to a subset of data sources? | IT assistance may be needed to connect to many data sources, making it impossible to be a self-service tool.

Access Automation | Do you provide capabilities to save and automate access to data? | Access could be very narrowly defined, requiring the process to be repeated for each set of sources.

Visual Examination | Do you offer me tools to visually examine data from different sources to determine fitness for purpose? Is this view limited to the source or to all the steps along the transformation pipeline? Is visualization native to the tool or is it integrated with an analytics tool? | Vendors may provide visibility to just the source or target data, and may have no visibility to the intermediate transformation steps of the data analytics pipeline.

Data Profiling and Sampling | Do you offer a few quality metrics and statistical analysis? | Vendors may fail to highlight missing data. Statistics may be available for only a small subset of the data.

Data Relationships | Do you create relationships among multiple data sources? | Vendors may offer relationships only for structured data sources and not for other types of data.

Data Cleansing | Do you provide an easy way to resolve errors in data? Do you provide smart text-processing algorithms that add users’ actions to resolve data errors to their list of available actions? | Vendors may not provide smart text-processing algorithms that add users’ actions to resolve data errors to their list of available actions.

Definitions | Do you offer a means to label data sets with additional context? | Vendors may limit definitions to technical definitions and may not include broader business definitions.


Business Glossary | Do you offer a glossary of predefined business terms that could be expanded? | Vendors may offer a limited business glossary that cannot be expanded with customized definitions.

Collaboration | Do you offer capabilities to capture tribal knowledge? | Vendors may not offer the ability to share content and commentary with other users.

Data Blending/Enrichment | Do you offer the ability to blend multiple data streams? Can you blend both traditional data and semistructured data? Do you blend dozens of data sources efficiently? Can data blending be accomplished with zero coding? | Vendors may be able to offer prototyping capabilities but may not be able to operationalize them. Vendors may not easily handle semistructured or unstructured data. Vendors may have a proprietary modeling language to blend data that takes time to learn.

Data Shaping | Do you create calculated fields, dimensions, or aggregations? | Vendors may not offer templates for data shaping rules.

Preparation Automation | Do you templatize the entire preparation process? | Vendors may not offer design templates to automate data preparation.

Governance | Do you integrate with other parts of the analytics pipeline to provide a seamless experience? | Point vendors may not offer seamless integration, leading to different definitions, inaccurate data, and loss of security and centralized control.

Advanced Analytics | Do you use prepared test data to improve the predictive component of your analytics models? | Most vendors are not able to incorporate any predictive analytics into their tools.



Checklist 5
Analytics
As your business needs evolve, it’s important to have a platform that can evolve with you. Vendors that provide a fixed library of analytics options may not have the flexibility you need. Being able to leverage the best of predictive analytics and to embed analytics into your existing business processes or even in your software is critical for getting the most business value.

Capability | Questions to Ask Your Vendor | Potential Vendor Pitfalls

Reporting | Do you provide traditional linear-style reporting tools? Can business users create reports or is this function limited to administrators? | Many vendors don’t offer traditional reporting tools, instead choosing more attractive visualization tools.

Dashboards | Do you offer numerous templates for end users to drop into a variety of visualizations and reports? Are changes made in one section of the dashboard reflected in other sections? | Many vendors limit the variety of interactive and analytic components that can be part of the dashboards. Different dashboard sections may not automatically synchronize after changes in one section.

Visualization | Do you offer a variety of bar/column charts, heat grids, geo maps, and scatter plots? Do you offer the ability to create custom charts and templates? | Many vendors limit geo-mapping layers to those they support out of the box, with no extensibility. Vendors may limit customization to their proprietary visualization libraries rather than provide access to open libraries.

Ad Hoc Analysis | Do you offer ad hoc analysis via a web interface in addition to a desktop interface? | Some vendors limit ad hoc analysis to desktop tools rather than web interfaces.

Embedding | Do you offer the ability to custom brand the analytics interface? Do you offer open APIs to easily embed the analytics offering without extensive coding? Can the back-end data integration offering also be embedded along with the analytics? | Vendors may limit custom branding abilities or offer proprietary APIs. Vendors may need to partner with a data integration solution provider that may have a different set of APIs, leading to greater complexity.

Virtual Data Sets | Do you analyze virtual data sets? | Vendors may require data to be staged in a physical table.



Checklist 6
Pipeline Automation and Management
As the saying goes, “Excel ETL” isn’t scalable – and the methodology one person follows may not be followed by colleagues in other business units, which leads to nonstandard reporting. It’s vital to be able to automate as much of your data pipeline as possible so you can make the most of your team’s resources. Consider the following questions to help you choose a vendor that can speed up the pipeline from raw data to analytics and business insights.

Capability | Questions to Ask Your Vendor | Potential Vendor Pitfalls

Monitoring | Do you show step-by-step performance metrics to identify bottlenecks? | Vendors may limit monitoring to just execution history.

Auditing | Do you offer auditing capabilities to analyze usage and trends and plan capacity? | Vendors may not offer an easy way to audit usage history. Therefore, users would have to rely on their homegrown metrics with questionable accuracy to plan capacity.

Automation | Can a user select a custom data set that is facilitated by automated data blending? | Vendors may not be able to initiate data integration jobs based on data set choices selected from the front end.

Error Handling | Are you capable of diverting rows that are in error rather than bringing the entire transformation process to a hard stop? | Vendors that do not separate the transformation from the orchestration typically must use a hard stop to handle any errors.

Stream Processing | Can you execute transformations via microbatches to process streaming data? | Vendors may perform only batch processing.

Native Integration | Do you offer end-to-end capabilities from data ingestion to transformation to analytics? If not, are you seamlessly integrated with your partners? | Lack of tight integration within a product portfolio or across partners’ products leads to greater complexity and longer debugging time.
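
The "Error Handling" and "Stream Processing" rows describe two behaviors that are easy to picture in code: processing a stream in microbatches, and diverting bad rows instead of halting the whole transformation. The Python sketch below is a simplified illustration under stated assumptions. The function names (`micro_batches`, `transform_batch`, `parse_amount`) and the sample rows are hypothetical and do not come from any particular product.

```python
from itertools import islice
from typing import Any, Callable, Iterable, Iterator, List, Tuple


def micro_batches(stream: Iterable[Any], size: int) -> Iterator[List[Any]]:
    """Yield consecutive chunks of at most `size` rows from a stream."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def transform_batch(
    batch: List[Any], transform: Callable[[Any], Any]
) -> Tuple[List[Any], List[Tuple[Any, str]]]:
    """Apply transform to each row; divert failing rows instead of stopping."""
    good, rejected = [], []
    for row in batch:
        try:
            good.append(transform(row))
        except (ValueError, TypeError) as exc:
            rejected.append((row, str(exc)))  # keep the row and the reason
    return good, rejected


def parse_amount(row: dict) -> dict:
    # Hypothetical transformation: convert the text amount to a number.
    return {"id": row["id"], "amount": float(row["amount"])}


rows = [
    {"id": 1, "amount": "10.5"},
    {"id": 2, "amount": "n/a"},  # bad row: diverted, not fatal
    {"id": 3, "amount": "7"},
]
loaded, errors = [], []
for batch in micro_batches(rows, size=2):
    good, rejected = transform_batch(batch, parse_amount)
    loaded.extend(good)
    errors.extend(rejected)
```

Here rows 1 and 3 are loaded while row 2 lands in `errors` with its failure reason, so it can be reviewed and reprocessed later without rerunning the whole job.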



Checklist 7
Governance and Security
Data governance and security are not optional – and it’s best to have a security plan rather than handle damage control after a breach. If you’re working in a regulated industry, it’s especially important to use a data pipeline platform that captures the flow of who did what, with what data, and when. As you’re evaluating vendors, review their capabilities when it comes to data governance and security.

Capability | Questions to Ask Your Vendor | Potential Vendor Pitfalls

Security | Do you integrate with security providers such as LDAP, single sign-on (SSO), or Microsoft’s Active Directory? Do you then further customize their security settings? | Vendors may have their own security framework with no ability to integrate with enterprise security frameworks.

Data Curation | Do you offer the ability to manage the data throughout its lifecycle? | Vendors may not offer the ability to mark data as old or to be archived.

Data Lineage | Do you offer the capability to track where the data came from and how it was prepared? Does the lineage transition across different stages from ingestion to transformation and later to the visualizations and analysis? | Vendors may offer lineage for only their portion of the analytics pipeline, resulting in a loss of history.

Data Protection | Do you have the capability to apply regulatory policies and rules to protect sensitive data? Do you promote data sanctioned by governance? | Vendors who cater to individuals or departments typically lack the ability to discern between sensitive data and broadly shared data.

Multitenancy | Do you deliver the correct data, reporting content, and UI to the appropriate group in a shared cloud-based infrastructure? | Vendors may only offer one customization when it comes to data, content, or UI, resulting in an incomplete solution.



Checklist 8
Extensibility and Scalability
The big data ecosystem surrounding Apache Hadoop includes dozens of tools, which are each constantly evolving. Much of the innovation in the last few years around data management, especially with big data, has taken place in the open source community. Accordingly, if you’re considering a vendor based on proprietary rather than open source code, you might get left behind when tools evolve. To stay flexible, consider a vendor’s extensibility and scalability features.

Capability | Questions to Ask Your Vendor | Potential Vendor Pitfalls

Marketplace | Do you offer the ability to extend the platform with a marketplace of transformation plugins or custom visualizations? | Proprietary “black box” vendors mean you are dependent on their technical expertise rather than open to new developments as technology evolves.

Analytics Scalability | Can you accommodate more than a handful of users? Can there be a combination of internal and external users? | Vendors may not scale beyond a department level, resulting in a sharp drop-off in system responsiveness if many users are accessing it simultaneously.

Data Scalability | Do you transform millions of rows of data by leveraging the existing cluster of servers used for storing the data? | Vendors using enterprise service bus architecture are only good for running small pipes of small packets of data. Vendors may not natively run their processes in storage server clusters and may require additional hardware.

In-Memory Processing | Do you have the flexibility to process data in-memory for speed or push it down to disk when data size increases? | Vendors that are capable only of in-memory data processing hit obstacles to scalability because data must be loaded before processing, which becomes difficult as data sizes increase.

Clustering | Do you take advantage of a dynamic provisioning of nodes to a cluster? | Vendors deployed in the cloud may require node-by-node installs, requiring more maintenance by IT.

Load Balancing | Do you offer load balancing across several servers forwarding traffic in a round-robin fashion, worker server quotas, or some other method? | Vendors without a built-in load balancer need to rely on external load balancers that don’t understand the transformation process.

High Availability | Do you provide uninterrupted access to critical data and reduce errors in data delivery? | Vendors without these capabilities require all transformation processes to be restarted when an error occurs.
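
The "Load Balancing" row mentions round-robin forwarding and worker server quotas. As a rough illustration of how the two can combine, the Python sketch below hands out jobs in rotation and skips any worker that has reached its quota. The function `assign_round_robin`, the node names, and the quota values are assumptions made for this example. A real load balancer would also account for live system load, which this sketch ignores.

```python
from itertools import cycle
from typing import Dict, List


def assign_round_robin(jobs: List[str], quotas: Dict[str, int]) -> Dict[str, List[str]]:
    """Hand out jobs to workers in rotation, skipping workers at their quota."""
    if len(jobs) > sum(quotas.values()):
        raise ValueError("more jobs than total worker capacity")
    assignments: Dict[str, List[str]] = {worker: [] for worker in quotas}
    rotation = cycle(quotas)  # endless rotation over worker names
    for job in jobs:
        worker = next(rotation)
        # Capacity was checked above, so some worker always has room left.
        while len(assignments[worker]) >= quotas[worker]:
            worker = next(rotation)
        assignments[worker].append(job)
    return assignments


# Hypothetical cluster: node-a may take two jobs, node-b only one.
plan = assign_round_robin(["j1", "j2", "j3"], {"node-a": 2, "node-b": 1})
```

With these inputs, `node-a` receives `j1` and `j3` and `node-b` receives `j2`.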



Learn more about Pentaho Analytics at HitachiVantara.com

About Hitachi Vantara


Hitachi Vantara, a wholly owned subsidiary of Hitachi, Ltd., helps data-driven leaders find and use the value in their data to innovate
intelligently and reach outcomes that matter for business and society. We combine technology, intellectual property and industry
knowledge to deliver data-managing solutions that help enterprises improve their customers’ experiences, develop new revenue
streams, and lower business costs. Only Hitachi Vantara elevates your innovation advantage by combining deep information
technology (IT), operational technology (OT) and domain expertise. We work with organizations everywhere to drive data to meaningful
outcomes. Visit us at HitachiVantara.com.

Hitachi Vantara
Corporate Headquarters
2845 Lafayette Street
Santa Clara, CA 95050-2639 USA
www.HitachiVantara.com | community.HitachiVantara.com

Regional Contact Information
Americas: +1 866 374 5822 or [email protected]
Europe, Middle East and Africa: +44 (0) 1753 618000 or [email protected]
Asia Pacific: +852 3189 7900 or [email protected]

HITACHI is a registered trademark of Hitachi, Ltd. Pentaho is a trademark or registered trademark of Hitachi Vantara Corporation. All other trademarks, service marks and company names
are properties of their respective owners.
P-055-A KK May 2018
