Big Data and Hadoop
Senior Product Specialist
The Challenge
Timeline chart: the evolution of enterprise computing, its users, and its value — from back-office automation for a few employees on the mainframe (1960s-1970s, OS/360, ~10^2 users), to front-office productivity for many employees on client-server (1980s, ~10^4 users), e-commerce on the web (1990s, ~10^6 users), line-of-business self-service for customers/consumers in the cloud (2007, ~10^7 users), social engagement across communities & society (2011, ~10^9 users), and real-time optimization of business ecosystems across devices & machines on the Internet of Things (2014, ~10^11 sources).
Data fragmentation becomes the barrier to business success.
Big Data Challenges
Volume, Variety, Velocity, Veracity
Diagram: source data — transactions (OLTP, OLAP), documents and emails, social media and web logs, and machine, device, and scientific data — flows through batch ETL into the analytic systems: an enterprise data warehouse surrounded by a sprawl of data marts.
Business users are left asking "Where is the data I need?" and "Can I trust this data?"
80% of the work in big data projects
is data integration and quality
80% of the work in any data
project is in cleaning the data
70% of my value is an ability
to pull the data, 20% of my
value is using data-science
I spend more than half my time
integrating, cleansing, and
transforming data without doing
any actual analysis.
Why Informatica for Big Data & Hadoop
PowerCenter Big Data Edition
No-Code Productivity
Big Transaction Data
Online Transaction
Processing (OLTP)
Oracle
DB2
Ingres
Informix
Sybase
SQL Server
Online Analytical
Processing (OLAP) &
DW Appliances
Teradata
Redbrick
EssBase
Sybase IQ
Netezza
Exadata
High-Speed Data
Ingestion and
Extraction
HANA
Greenplum
DataAllegro
Asterdata
Vertica
Paraccel
Business-IT
Collaboration
Unified Administration
9.6
Salesforce.com
Concur
Google App Engine
Amazon
Complex Data
Parsing on Hadoop
Social Media & Web Data
Facebook
Twitter
LinkedIn
YouTube
Web applications
Blogs
Discussion forums
Communities
Partner portals
Universal Data Access
Cloud
ETL on Hadoop
Big Interaction Data
the Vibe™ virtual
data machine
Other Interaction Data
Clickstream
Image/text
Scientific
Genomic/pharma
Medical
Medical/Device
Sensors/meters
RFID tags
CDR/mobile
Entity Extraction and
Data Classification on
Hadoop
Big Data Processing
Profiling on Hadoop
Get Data Into and Out of
Hadoop
PowerExchange for Hadoop
Replication to Hadoop
Streaming to Hadoop
Data Archiving to Hadoop
Data Ingestion and Extraction
Moving terabytes of data per hour
Transactions,
OLTP, OLAP
Batch Load
Applications
Documents,
Email
Replicate
Social Media,
Web Logs
Machine Device,
Scientific
Industry
Standards
Extract
Streaming
Archive
Data
Warehouse
MDM
Extract
Low-Cost Store
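For illustration only (not the PowerExchange connector itself): a minimal Python sketch of batch load into and extraction out of HDFS using the open-source HdfsCLI client. The NameNode URL, user, and paths are assumptions.

from hdfs import InsecureClient

# Connect to the HDFS NameNode over WebHDFS (host/port and user are assumptions).
client = InsecureClient('http://namenode.example.com:50070', user='etl')

# Batch load: upload an extracted relational table dump into HDFS.
client.upload('/data/landing/orders.csv', '/local/exports/orders.csv')

# Extract: read processed results back out of HDFS for a downstream warehouse load.
with client.read('/data/published/orders_summary.csv', encoding='utf-8') as reader:
    content = reader.read()  # hand the extracted records to the warehouse loader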
PowerExchange Connectors
Enterprise
Applications,
Software as a
Service (SaaS)
JDE EnterpriseOne
JDE World
Lotus Notes
Oracle E-Business Suite
PeopleSoft Enterprise
Salesforce (salesforce.com)
SAP NetWeaver
SAP NetWeaver BI
SAS
Siebel
Netsuite
Microsoft Dynamics
Databases and
Data
Warehouses
Adabas for UNIX, Windows
C-ISAM
DB2 for LUW
Essbase
EMC/Greenplum
Informix Dynamic Server
Netezza Performance Server
ODBC
Oracle
SQL Server
Sybase
Teradata
Messaging
Systems
JMS
MSMQ
TIBCO
webMethods Broker
WebSphere MQ
Technology
Standards
Email (POP, IMAP)
HTTP(S)
LDAP
Web Services
XML
Mainframe
Adabas for z/OS
Datacom
DB2 for z/OS, z/Linux
IDMS
IMS DB
Oracle for z/Linux
Teradata
WebSphere MQ for z/Linux
VSAM
Asterdata,
Greenplum
Vertica
ParAccel
Microsoft PDW
Kognitio
Facebook, Twitter, LinkedIn
DataSift, Kapow
MongoDB
HDFS
HIVE
HBASE
Big Data
Social
Hadoop
- Accessible in Real-time and/or via Change Data Capture (CDC)
NoSQL Support for HBase
Read from HBase as a standard source
Write to HBase as a standard target
Sample HBase column families (stored in JSON/complex formats)
A complete mapping with HBase source/target can execute on Hadoop
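Purely to illustrate the read/write pattern (not the product's HBase connector): a minimal Python sketch using the open-source happybase Thrift client. The host, table name, and column families are hypothetical.

import happybase

# Connect through the HBase Thrift gateway (host is an assumption).
connection = happybase.Connection('hbase-thrift.example.com')
table = connection.table('customer_profile')

# Write to HBase as a target: one row, columns grouped by column family.
table.put(b'cust-0001', {
    b'demographics:name': b'Jane Doe',
    b'activity:last_login': b'2014-03-01',
})

# Read from HBase as a source: scan a sample of rows and their column families.
for row_key, columns in table.scan(limit=5):
    print(row_key, columns)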
NoSQL Support for MongoDB
Access, integrate,
transform & ingest
MongoDB data into
other analytic
systems (e.g.
Hadoop, data
warehouse)
Sampling
MongoDB data &
flattening it to
relational format
Access, integrate,
transform, & ingest
data into MongoDB
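A minimal sketch of the sampling-and-flattening idea using the open-source pymongo driver (not the Informatica MongoDB connector); the host, database, and collection names are assumptions.

from pymongo import MongoClient

def flatten(doc, prefix=''):
    """Flatten nested sub-documents into dotted, relational-style column names."""
    row = {}
    for key, value in doc.items():
        name = f'{prefix}{key}'
        if isinstance(value, dict):
            row.update(flatten(value, name + '.'))
        else:
            row[name] = value
    return row

client = MongoClient('mongodb://mongo.example.com:27017')  # host is an assumption
collection = client['sales']['orders']                     # db/collection are assumptions

# Sample documents and flatten them to relational format for downstream ingestion.
rows = [flatten(doc) for doc in collection.find().limit(100)]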
IDR for Replicating to Hadoop
Supported
Distributions
HDFS
Source System
Cycle_1.work directory
Intermediate Files
EXTRACT
APPLY
Table 1 File
Table 2 File
Table N File
Schema.ini File
Apache
0.20.203.x
0.20.204.x
0.20.205.x
0.23.x
1.0.x
1.1.x
2.x.x
Cloudera
CDH3
CDH4
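As a rough sketch of the apply step's shape only — one file per replicated table plus a Schema.ini describing them, written into an HDFS work directory. This uses the open-source HdfsCLI client; the paths, table data, and Schema.ini contents are hypothetical, not IDR's actual format.

from hdfs import InsecureClient

client = InsecureClient('http://namenode.example.com:50070', user='idr')  # assumption

work_dir = '/replication/cycle_1.work'   # intermediate-cycle directory (assumption)
tables = {'TABLE1': 'id,amount\n1,10.5\n', 'TABLE2': 'id,status\n7,open\n'}

# Apply step: write one file per replicated table...
for name, rows in tables.items():
    client.write(f'{work_dir}/{name}.txt', data=rows, encoding='utf-8', overwrite=True)

# ...plus a Schema.ini file describing the table files' layout.
schema_ini = '[TABLE1.txt]\nFormat=CSVDelimited\n\n[TABLE2.txt]\nFormat=CSVDelimited\n'
client.write(f'{work_dir}/Schema.ini', data=schema_ini, encoding='utf-8', overwrite=True)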
Real-Time Data Collection and Streaming
Management
and Monitoring
Web Servers,
Operations
Monitors, rsyslog,
SLF4J, etc.
Handhelds, Smart
Meters, etc.
Discrete Data
Messages
Node
Internet of Things,
Sensor Data
Node
Ultra Messaging Bus
Node
Publish / Subscribe
Zookeeper
HDFS, HBase,
Node
Node
Node
Real Time
Analysis, Complex
Event Processing
NoSQL Databases:
Cassandra, Riak,
MongoDB
Targets
Sources
Leverage a high-performance messaging infrastructure: publish with Ultra
Messaging for global distribution without additional staging or landing.
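A generic publish/subscribe sketch of the pattern in Python — a stand-in queue plays the role of the messaging bus (the real design uses Ultra Messaging), and a subscriber appends messages to an HDFS target via the open-source HdfsCLI client. Host and paths are assumptions.

import queue
from hdfs import InsecureClient

bus = queue.Queue()  # stand-in for the publish/subscribe messaging bus

def publish(message):
    bus.put(message)

def subscribe_to_hdfs(client, path, batch_size=100):
    """Drain messages from the bus and append them to an HDFS file as one target."""
    batch = []
    while not bus.empty() and len(batch) < batch_size:
        batch.append(bus.get())
    if batch:
        exists = client.status(path, strict=False) is not None
        client.write(path, data='\n'.join(batch) + '\n',
                     encoding='utf-8', append=exists)

hdfs_client = InsecureClient('http://namenode.example.com:50070', user='stream')  # assumption
publish('meter-42,2014-05-01T10:00:00,13.7')   # e.g. a smart-meter reading
subscribe_to_hdfs(hdfs_client, '/streams/meter_readings.txt')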
Informatica Vibe Data Stream for Machine Data
High performance/efficient
streaming data collection over
LAN/WAN
GUI interface provides ease of
configuration, deployment & use
Continuous ingestion of data generated in
real time (sensors, logs, etc.) from
machine-generated and other data sources
Enable real-time interactions &
response
Real-time delivery directly to
multiple targets (batch/stream
processing)
Highly available; efficient;
scalable
Available ecosystem of lightweight agents (sources & targets)
Predictive Maintenance
with Event Processing and Analytics
United Technologies Aerospace Systems (UTAS)
provides engines and aircraft components to
leading commercial and defense manufacturers,
including the new Airbus A380 and Boeing B787.
The challenge:
5,000+ aircraft in service plus new design wins exponentially
increase the amount of sensor data being generated
Power by the Hour leasing model means the maintenance costs and
service outages fall to UTAS
No proactive capability to predict when a safety issue might occur
Once-per-day sensor readings moving to real-time, over-the-air
Archive to Hadoop
Compression Extends Hadoop Cluster Capacity
Without INFA Optimized Archive compression: 10 TB of data, replicated 3X in HDFS = 30 TB stored
With INFA Optimized Archive (95% compression): 10 TB compressed to 500 GB, replicated 3X = 1.5 TB stored
20X less I/O bandwidth required
20 min vs. 1 min response time
8 hours vs. 24 mins backup window
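The storage math behind the comparison, as a quick back-of-the-envelope check in Python (numbers taken from the slide):

raw_tb = 10.0                 # source data size
replication_factor = 3        # default HDFS replication
compression_ratio = 0.95      # INFA Optimized Archive compression, per the slide

without_compression = raw_tb * replication_factor          # 30 TB on the cluster
compressed_tb = raw_tb * (1 - compression_ratio)           # 0.5 TB (500 GB)
with_compression = compressed_tb * replication_factor      # 1.5 TB on the cluster

print(without_compression, compressed_tb, with_compression)
print(without_compression / with_compression)               # 20x less storage and I/O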
Parse and Prepare Data On
Hadoop
hParser and XMap
Parse and Prepare Data on Hadoop
The broadest coverage for Big Data
Coverage: flat files & documents, industry standards, XML, interaction data (social network, web), and device/sensor and scientific data.
Productivity: a visual parsing environment and predefined translations.
Fits any DI/BI architecture: PIG, EDW, MDM, and applications built on .NET, Java, C++, C, or web services.

How the Data Transformation (DT) engine is embedded:

Engine invocation is via a shared library. The DT engine runs fully within the process of the calling application, completely independent of the calling application's technologies. It is not an external engine, so the overhead of passing data between processes is removed; the engine is dynamically invoked and does not need to be started up or maintained externally. The DT engine is also thread-safe and re-entrant, which allows the calling application to invoke DT in multiple threads simultaneously to increase throughput; a good example is DT's support of PowerCenter partitioning to scale up processing. Though not shown below, the engine fully supports multiple input and output files or buffers as needed by the transformation.

DT can be invoked in two general ways:
1. Filenames can be passed to it, and DT will open the file(s) for processing.
2. The calling application can buffer the data and send the buffers to DT for processing.
On the output side, DT can either write back to memory buffers that are returned to the calling application or write directly to the file system.

Deploying a transformation service:
1. Developer uses the Studio to develop a data transformation service.
2. Developer deploys this service to a local folder (the service repository directory).
3. To deploy to the server, all files needed for the transformation service are moved to the server repository (directory) via FTP, copy, script, etc. NOTE: if the server file system is mountable from the developer machine directly, then step 2 deploys directly to the server.
4. The DT engine can immediately use this service to process data.

The DT engine is fully embeddable and can be invoked using any of the APIs. This means you can develop a transformation once and leverage it in multiple environments, resulting in reduced development and maintenance times and a lower impact of change. For simple integration, a command-line interface is available to invoke DT. Internal custom applications can embed transformation services using the various APIs. PowerCenter leverages DT via the Unstructured Data Transformation (UDT), a GUI transformation widget in PowerCenter that wraps around the DT API and engine. For some other middleware (WBIMB, WebMethods, BizTalk), INFA provides similar GUI widgets (agents) for the respective design environments; for others, the API layer can be used directly.
Example use cases
Call Detail record
Why Hadoop?
CDR: large data sets — every 7 seconds, every mobile phone in
the region creates a record
Desire to analyze behavior and location to personalize and
optimize pricing and marketing
Parse and Prepare Data on Hadoop
How does it work?
hadoop dt-hadoop.jar My_Parser /input/*/input*.txt
1. Define the parser in the HParser
visual studio
2. Deploy the parser on
Hadoop Distributed File
System (HDFS)
3. Run HParser to extract
data and produce tabular
format in Hadoop
Profiling and Discovering
Data
Informatica Profiling & Data Discovery on
Hadoop
Hadoop Data Profiling Results
Value and pattern frequency
to isolate inconsistent/dirty
data or unexpected patterns
Hadoop Data Profiling
results exposed to
anyone in enterprise
via browser
CUSTOMER_ID example
COUNTRY CODE example
1. Profiling Stats:
Min/Max Values, NULLs,
Inferred Data Types, etc.
Stats to identify
outliers and
anomalies in data
2. Value &
Pattern
Analysis of
Hadoop Data
3. Drilldown Analysis (into Hadoop Data)
Drill down into actual
data values to inspect
results across entire
data set, including
potential duplicates
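A minimal Python sketch of the kinds of statistics described above, using pandas on a small sample (the actual profiler runs on Hadoop at scale); the column names and values are hypothetical.

import pandas as pd

# Sample of Hadoop data pulled into a DataFrame for illustration.
df = pd.DataFrame({
    'CUSTOMER_ID': ['C001', 'C002', None, 'c3', 'C001'],
    'COUNTRY_CODE': ['US', 'USA', 'DE', 'US', '99'],
})

# 1. Profiling stats: min/max values, NULL counts, inferred data types.
print(df.describe(include='all'))
print(df.isnull().sum())
print(df.dtypes)

# 2. Value and pattern frequency, to isolate inconsistent/dirty data or unexpected patterns.
print(df['COUNTRY_CODE'].value_counts())
patterns = (df['CUSTOMER_ID'].astype(str)
            .str.replace(r'[A-Z]', 'X', regex=True)
            .str.replace(r'[a-z]', 'x', regex=True)
            .str.replace(r'\d', '9', regex=True))
print(patterns.value_counts())

# 3. Drilldown: inspect the actual rows behind a suspect value, including potential duplicates.
print(df[df.duplicated('CUSTOMER_ID', keep=False)])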
Hadoop Data Domain Discovery
Finding functional meaning of Data in Hadoop
Leverage INFA rules/mapplets to
identify functional meaning of
Hadoop data
Sensitive data
(e.g. SSN, Credit Card number, etc.)
PHI: Protected Health Information
PII: Personally Identifiable Information
Scalable to look for/discover ANY Domain type
View/share report of data domains/
sensitive data contained in Hadoop.
Ability to drill down to see suspect
data values.
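A simplified Python sketch of rule-based domain discovery; the real INFA rules/mapplets are far richer, and these regexes and column values are illustrative assumptions only.

import re

# Simplified domain rules (illustrative only).
DOMAIN_RULES = {
    'SSN': re.compile(r'^\d{3}-\d{2}-\d{4}$'),
    'CREDIT_CARD': re.compile(r'^\d{13,16}$'),
    'EMAIL': re.compile(r'^[^@\s]+@[^@\s]+\.[^@\s]+$'),
}

def discover_domains(column_name, values, threshold=0.8):
    """Return the data domains whose rule matches most of a column's non-null values."""
    values = [v for v in values if v]
    found = []
    for domain, pattern in DOMAIN_RULES.items():
        matches = sum(1 for v in values if pattern.match(v))
        if values and matches / len(values) >= threshold:
            found.append((column_name, domain, matches, len(values)))
    return found

# Example: a Hadoop column that turns out to hold sensitive data (SSNs).
print(discover_domains('col_17', ['123-45-6789', '987-65-4321', None, '078-05-1120']))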
Transforming and Cleansing
Data
PowerCenter on Hadoop
Data Quality on Hadoop
PowerCenter developers are now Hadoop developers
No-code visual
development
environment
Preview results at
any point in the
data flow
Reuse and Import PC Metadata for Hadoop
Import existing
PC artifacts into
Hadoop
development
environment
Validate import
logic before the
actual import
process to ensure
compatibility
Natural Language Processing
Entity Extraction & Data Classification
Train NLP to find and
classify entities in
unstructured data
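To show what entity extraction and classification looks like (this is a generic illustration with the open-source spaCy library, not Informatica's NLP on Hadoop):

import spacy

# Load a pretrained English pipeline (requires: python -m spacy download en_core_web_sm).
nlp = spacy.load('en_core_web_sm')

text = ('United Technologies Aerospace Systems supplies components '
        'for the Airbus A380 and Boeing 787.')

# Find and classify entities in unstructured text; in the product, such extraction
# runs on Hadoop and the classifier can be trained for domain-specific entity types.
for ent in nlp(text).ents:
    print(ent.text, ent.label_)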
Address Validation & Data Cleansing
Configure Mapping for Hadoop Execution
No need to redesign
mapping logic to
execute on either
Traditional or Hadoop
infrastructure.
Configure where the
integration logic should
run: Hadoop or Native
Data Integration & Quality on Hadoop
1. Entire Informatica mapping
translated to Hive Query Language
2. Optimized HQL converted to
MapReduce & submitted to Hadoop
cluster (job tracker).
3. Advanced mapping transformations
executed on Hadoop through User
Defined Functions using Vibe
FROM (
  SELECT
    T1.ORDERKEY1 AS ORDERKEY2, T1.li_count, orders.O_CUSTKEY AS CUSTKEY, customer.C_NAME,
    customer.C_NATIONKEY, nation.N_NAME, nation.N_REGIONKEY
  FROM (
    SELECT TRANSFORM (L_Orderkey.id) USING CustomInfaTx
    FROM lineitem
    GROUP BY L_ORDERKEY
  ) T1
  JOIN orders ON (T1.ORDERKEY1 = orders.O_ORDERKEY)
  JOIN customer ON (orders.O_CUSTKEY = customer.C_CUSTKEY)
  JOIN nation ON (customer.C_NATIONKEY = nation.N_NATIONKEY)
  WHERE nation.N_NAME = 'UNITED STATES'
) T2
INSERT OVERWRITE TABLE TARGET1 SELECT *
INSERT OVERWRITE TABLE TARGET2 SELECT CUSTKEY, count(ORDERKEY2) GROUP BY CUSTKEY;
Hive-QL
MapReduce
UDF
Example Mapping Execution
Mapping logic
translated to HQL
and submitted
to Hadoop Cluster
Cluster of Linux Machines
Repository
Engine
Source External
Relational Data
Local flat file
staged
temporarily
on HDFS
Source External
Flat File
Relational Data
streamed to
Hadoop for
processing
Final
processed data
loaded into
HDFS file
Read HDFS
file data
Source HDFS
File
Temp Staged
Lookup File
Target HDFS
File
Orchestrating and
Monitoring Hadoop
Informatica Workflow &
Monitoring for Hadoop
Metadata Manager for Hadoop
Dynamic Data Masking for Hadoop
Mixed Workflow Orchestration
One workflow running tasks on Hadoop and local environments
MT_Load2Hadoop
+ Parse
MT_Data
Analysis
Cmd_Choose
LoadPath
Cmd_ProfileData
Cmd_Load2
Hadoop
Notification
MT_Cleanse
MT_Parse
List of variables:
Name                         Type     Default Value         Description
$User.LoadOptionPath         Integer                        Load path for workflow, depending on output of cmd task
$User.DataSourceConnection   String   HiveSourceConnection  Source connection object
$User.ProfileResult          Integer  100                   Output from profiling command task
Unified Administration
Single Place to Manage & Monitor
Full traceability from workflow
to MapReduce jobs
View generated
Hive scripts
Data Lineage and Business Glossary
PWX for
PC
Transactions,
OLTP, OLAP
Documents,
Email
PowerCenter SE
Enterprise Grid
PWX for
Mercury
Hadoop Architecture Overview
Social Media,
Web Logs
Mercury Services
PowerCenter Services
Hive
Client
Execution on
Hadoop
PWX
for
HDFS
NameNode
Job Tracker
Machine Device,
Scientific
PowerCenter on Hadoop
Data Quality on Hadoop
DT on Hadoop
Entity Extraction on Hadoop
Profiling on Hadoop
PWX
for
HDFS
DataNode3
Infa-Lib
RDBMS
Clients
HParser
Map Reduce
Hive
PWX
for
Hive
DataNode2
DataNode1
Infa-Lib
RDBMS
Clients
INFA Clients
HDFS
MYSQL
HParser
Infa-Lib
Infa-Lib
RDBMS
Clients
HParser
RDBMS
Clients
HParser
Map Reduce
Map Reduce
Map Reduce
HDFS
HDFS
HDFS