The document discusses big data challenges and how Informatica solutions address them. It highlights that 80% of work in big data projects is data integration and quality. Informatica provides tools to get data into and out of Hadoop, parse and prepare data on Hadoop, and perform data ingestion, extraction and streaming in real-time.


Big Data and Hadoop

Senior Product Specialist

The Challenge

Data fragmentation becomes the barrier to business success.

Era          Technology           Business Users          Value                       Data Sources
1960s-1970s  Mainframe (OS/360)   Few Employees           Back Office Automation      ~10^2
1980s        Client-Server        Many Employees          Front Office Productivity   ~10^4
1990s        Web                  Customers/Consumers     E-Commerce                  ~10^6
2007         Cloud                Line-of-Business        Self-Service                ~10^7
2011         Social               Communities & Society   Social Engagement           ~10^9
2014         Internet of Things   Devices & Machines      Real-Time Optimization      ~10^11

Big Data Challenges

Volume, Variety, Velocity, Veracity


Source data (transactions, OLTP, OLAP; documents and emails; social media and web logs; machine, device, and scientific data) is moved by batch ETL into analytic systems: an enterprise data warehouse surrounded by a sprawl of data marts.

Business users ask: Where is the data I need? Can I trust this data?

"80% of the work in big data projects is data integration and quality."

"80% of the work in any data project is in cleaning the data."

"70% of my value is an ability to pull the data; 20% of my value is using data science."

"I spend more than half my time integrating, cleansing, and transforming data without doing any actual analysis."

Why Informatica for Big Data & Hadoop

PowerCenter Big Data Edition (9.6), built on the Vibe™ virtual data machine

Capabilities:
- No-Code Productivity
- Business-IT Collaboration
- Unified Administration
- Universal Data Access
- High-Speed Data Ingestion and Extraction
- Complex Data Parsing on Hadoop
- ETL on Hadoop
- Entity Extraction and Data Classification on Hadoop
- Profiling on Hadoop
- Big Data Processing

Big Transaction Data:
- Online Transaction Processing (OLTP): Oracle, DB2, Ingres, Informix, Sybase, SQL Server
- Online Analytical Processing (OLAP) & DW Appliances: Teradata, Redbrick, Essbase, Sybase IQ, Netezza, Exadata, HANA, Greenplum, DATAllegro, Aster Data, Vertica, ParAccel
- Cloud: Salesforce.com, Concur, Google App Engine, Amazon

Big Interaction Data:
- Social Media & Web Data: Facebook, Twitter, LinkedIn, YouTube, web applications, blogs, discussion forums, communities, partner portals
- Other Interaction Data: clickstream, image/text, scientific, genomic/pharma, medical devices, sensors/meters, RFID tags, CDR/mobile

Get Data Into and Out of Hadoop

- PowerExchange for Hadoop
- Replication to Hadoop
- Streaming to Hadoop
- Data Archiving to Hadoop
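For orientation, a minimal sketch of getting data into and out of HDFS over WebHDFS using the third-party Python hdfs client. This is a generic stand-in, not PowerExchange for Hadoop, and the host, port, user, and paths are assumptions.

# Minimal sketch: landing a local file and a small in-memory extract in HDFS
# via WebHDFS, using the third-party `hdfs` Python client (pip install hdfs).
# Host, port, user, and paths are illustrative assumptions.
from hdfs import InsecureClient

client = InsecureClient("http://namenode:50070", user="etl")   # WebHDFS endpoint (port assumed)

# Batch load: copy a local extract into HDFS.
client.upload("/raw/orders/orders.csv", "orders.csv", overwrite=True)

# Streaming-style landing of small records (e.g. log lines) into an HDFS file.
lines = ["2014-01-01T00:00:00,device42,OK\n", "2014-01-01T00:00:07,device42,WARN\n"]
client.write("/raw/machine/landing.csv", data="".join(lines),
             encoding="utf-8", overwrite=True)

# Extraction: read the data back out of HDFS.
with client.read("/raw/orders/orders.csv", encoding="utf-8") as reader:
    print(reader.read()[:200])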

Data Ingestion and Extraction
Moving terabytes of data per hour

Sources: transactions, OLTP, OLAP; applications; documents, email; social media, web logs; machine, device, and scientific data; industry standards.

Into Hadoop: batch load, replicate, streaming, archive.
Out of Hadoop: extract to the data warehouse, MDM, or a low-cost store.

PowerExchange Connectors

Enterprise Applications, Software as a Service (SaaS): JDE EnterpriseOne, JDE World, Lotus Notes, Oracle E-Business Suite, PeopleSoft Enterprise, Salesforce (salesforce.com), SAP NetWeaver, SAP NetWeaver BI, SAS, Siebel, NetSuite, Microsoft Dynamics

Databases and Data Warehouses: Adabas for UNIX/Windows, C-ISAM, DB2 for LUW, Essbase, EMC/Greenplum, Informix Dynamic Server, Netezza Performance Server, ODBC, Oracle, SQL Server, Sybase, Teradata, Aster Data, Vertica, ParAccel, Microsoft PDW, Kognitio

Messaging Systems: JMS, MSMQ, TIBCO, webMethods Broker, WebSphere MQ

Technology Standards: Email (POP, IMAP), HTTP(S), LDAP, Web Services, XML

Mainframe: Adabas for z/OS, Datacom, DB2 for z/OS and z/Linux, IDMS, IMS DB, Oracle for z/Linux, Teradata, WebSphere MQ for z/Linux, VSAM

Big Data, Social, Hadoop: Facebook, Twitter, LinkedIn, DataSift, Kapow, MongoDB, HDFS, Hive, HBase

Accessible in real time and/or via Change Data Capture (CDC).

NoSQL Support for HBase

- Read from HBase as a standard source
- Write to HBase as a standard target
- Sample HBase column families (stored in JSON/complex formats)
- Complete mappings with HBase sources/targets can execute on Hadoop
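As an illustration of reading and writing HBase rows whose column families hold JSON, a minimal sketch using the third-party happybase client (HBase Thrift gateway) rather than the Informatica HBase adapter; the host, table name, and column family are assumptions.

# Minimal sketch of HBase as a source and target via happybase.
# Table name, column family, and Thrift host are illustrative assumptions.
import json
import happybase

connection = happybase.Connection("hbase-thrift-host")   # assumes an HBase Thrift server
orders = connection.table("orders")

# Write to HBase as a target: one row per order, JSON stored in a column family.
orders.put(b"order-0001",
           {b"detail:payload": json.dumps({"custkey": 42, "total": 199.0}).encode()})

# Read from HBase as a source: sample a few rows and flatten the JSON payload.
for row_key, columns in orders.scan(limit=5):
    payload = json.loads(columns[b"detail:payload"])
    print(row_key.decode(), payload["custkey"], payload["total"])

connection.close()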

NoSQL Support for MongoDB

- Access, integrate, transform, and ingest MongoDB data into other analytic systems (e.g. Hadoop, a data warehouse)
- Sample MongoDB data and flatten it to relational format
- Access, integrate, transform, and ingest data into MongoDB
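A minimal sketch of what sampling and flattening MongoDB documents to a relational (flat column) shape can look like, using pymongo; the database, collection, and field names are assumptions, and this stands in for, rather than reproduces, the Informatica MongoDB adapter.

# Sample MongoDB documents and flatten nested sub-documents into flat columns.
from pymongo import MongoClient

def flatten(doc, prefix=""):
    """Flatten nested sub-documents into dot-separated column names."""
    row = {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=f"{name}."))
        else:
            row[name] = value
    return row

client = MongoClient("mongodb://localhost:27017")
for doc in client["shop"]["orders"].find().limit(5):   # sample a few documents
    doc.pop("_id", None)                               # drop the ObjectId
    print(flatten(doc))   # e.g. {"custkey": 42, "address.city": "Boston", ...}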

IDR for Replicating to Hadoop

Source system -> EXTRACT -> intermediate files in the Cycle_1.work directory (Table 1 File, Table 2 File, ..., Table N File, Schema.ini File) -> APPLY -> HDFS

Supported distributions:
- Apache Hadoop 0.20.203.x, 0.20.204.x, 0.20.205.x, 0.23.x, 1.0.x, 1.1.x, 2.x.x
- Cloudera CDH3, CDH4

Real-Time Data Collection and Streaming

Sources: web servers, operations monitors, rsyslog, SLF4J, etc.; handhelds, smart meters; discrete data messages; Internet of Things and sensor data.
Transport: the Ultra Messaging bus (publish/subscribe) across nodes, with management and monitoring and ZooKeeper coordination.
Targets: HDFS and HBase; real-time analysis and complex event processing; NoSQL databases such as Cassandra, Riak, and MongoDB.

Leverage the high-performance messaging infrastructure: publish with Ultra Messaging for global distribution without additional staging or landing.
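As a generic illustration of the publish/subscribe fan-out in this architecture (this is plain Python, not the Ultra Messaging API), a minimal in-process sketch: one sensor source publishes to a topic and two independent sinks, standing in for HDFS and a NoSQL store, each receive every message.

# Tiny fan-out topic: every subscriber gets every published message.
import queue
import threading

class Topic:
    def __init__(self):
        self.subscribers = []
    def subscribe(self):
        q = queue.Queue()
        self.subscribers.append(q)
        return q
    def publish(self, msg):
        for q in self.subscribers:
            q.put(msg)

topic = Topic()
hdfs_q, nosql_q = topic.subscribe(), topic.subscribe()
hdfs_sink, nosql_sink = [], []

def consume(q, sink):
    while True:
        msg = q.get()
        if msg is None:            # sentinel: end of stream
            break
        sink.append(msg)

consumers = [threading.Thread(target=consume, args=(hdfs_q, hdfs_sink)),
             threading.Thread(target=consume, args=(nosql_q, nosql_sink))]
for c in consumers:
    c.start()

for i in range(5):                 # the "sensor" source publishes discrete messages
    topic.publish({"device": "meter-7", "reading": 100 + i})
topic.publish(None)

for c in consumers:
    c.join()
print(len(hdfs_sink), "messages to the HDFS sink,", len(nosql_sink), "to the NoSQL sink")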

Informatica Vibe Data Stream for Machine Data

- High-performance, efficient streaming data collection over LAN/WAN
- GUI provides ease of configuration, deployment, and use
- Continuous ingestion of data generated in real time (sensors, logs, etc.), from machine-generated and other data sources
- Enables real-time interactions and responses
- Real-time delivery directly to multiple targets (batch/stream processing)
- Highly available, efficient, and scalable
- Available ecosystem of lightweight agents (sources and targets)

Predictive Maintenance
with Event Processing and Analytics

United Technologies Aerospace Systems (UTAS) provides engines and aircraft components to leading commercial and defense manufacturers, including the new Airbus A380 and Boeing B787.

The challenge:
- 5,000+ aircraft in service, plus new design wins, exponentially increase the amount of sensor data being generated
- The "Power by the Hour" leasing model means maintenance costs and service outages fall to UTAS
- No proactive capability to predict when a safety issue might occur
- Once-per-day sensor readings are moving to real-time, over-the-air collection

Archive to Hadoop
Compression Extends Hadoop Cluster Capacity

Without INFA optimized archive compression: 10 TB of data, replicated 3x = 30 TB stored on the cluster.
With INFA optimized archive compression (95%): 10 TB compressed to 500 GB, replicated 3x = 1.5 TB stored on the cluster.

- 20x less I/O bandwidth required
- 20 min vs. 1 min response time
- 8 hours vs. 24 min backup window
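The storage figures follow from simple arithmetic; a quick sketch of the calculation, using the slide's 95% compression ratio and 3x HDFS replication factor:

# Back-of-the-envelope check of the cluster capacity numbers above.
raw_tb = 10.0
replication = 3
compression = 0.95

without_infa = raw_tb * replication                        # 10 TB * 3 = 30 TB on the cluster
with_infa = raw_tb * (1 - compression) * replication       # 0.5 TB * 3 = 1.5 TB on the cluster
print(without_infa, "TB vs", with_infa, "TB =>",
      round(without_infa / with_infa), "x less storage and I/O")   # 20x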

Parse and Prepare Data on Hadoop
hParser and XMap

Parse and Prepare Data on Hadoop


The broadest coverage for Big Data
How the Data Transformation (DT) engine works:

- The DT engine is invoked as a shared library and runs fully within the process of the calling application. It is not an external engine, which removes the overhead of passing data between server processes or across the network; the engine is dynamically invoked and does not need to be started up or maintained externally.
- The DT engine is thread-safe and re-entrant. This allows the calling application to invoke DT in multiple threads simultaneously to scale up processing; a good example is DT's support of PowerCenter partitioning threads to increase throughput.
- The actual transformation logic is completely independent of any calling application. This means you can develop a transformation once and leverage it in multiple environments, resulting in reduced development and maintenance times and a lower impact of change.
- DT can be invoked in two general ways (see the sketch after this section):
  1. Filenames can be passed to it, and DT will open the file(s) for processing.
  2. The calling application can buffer the data and send the buffers to DT for processing.
- On the output side, DT can write back to memory buffers, which are returned to the calling application, or directly to the file system. Though not shown here, the engine fully supports multiple input and output files or buffers as needed by the transformation.
- Deployment workflow:
  1. The developer uses Studio to develop a transformation.
  2. The developer deploys the transformation service to a local service repository (directory).
  3. To deploy to the server, the service folder is moved to the server repository via FTP, copy, script, etc. (NOTE: if the server file system is mountable from the developer machine directly, then step 2 can deploy directly to the server.)
  4. The DT engine can immediately use this service to process data.
- For simple integration, a command-line interface is available to invoke the transformation services. Internal custom applications can embed transformation services using any of the APIs (Java, C++, C, .NET, web services).
- PowerCenter leverages DT via the Unstructured Data Transformation (UDT), a GUI transformation widget in PowerCenter that wraps around the DT API and engine.
- DT can also be embedded in other middleware technologies. For some (WBIMB, webMethods, BizTalk) INFA provides similar GUI widgets (agents) for the respective design environments; for others, the API layer can be used directly.

Data coverage: XML, industry standards, interaction data, unstructured data, flat files and documents, device/sensor and scientific data.
Productivity: visual parsing environment, predefined translations, service repository.
Fits any DI/BI architecture: PIG, EDW, MDM.
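The two invocation modes can be pictured with a small, purely hypothetical stub; DTEngineStub below is not the real DT API and simply splits delimited records, but it contrasts filename invocation with buffer invocation.

# Conceptual sketch of the two DT invocation styles described above.
# DTEngineStub is a hypothetical stand-in, not the real DT API.
class DTEngineStub:
    def transform_buffer(self, data: bytes) -> list:
        """Buffer invocation: caller passes data in memory, gets results back in memory."""
        text = data.decode("utf-8")
        return [tuple(line.split(",")) for line in text.splitlines() if line]

    def transform_file(self, path: str) -> list:
        """Filename invocation: the engine opens the file(s) itself."""
        with open(path, "rb") as f:
            return self.transform_buffer(f.read())

engine = DTEngineStub()

# 1. Filename invocation
with open("orders.txt", "w", encoding="utf-8") as f:
    f.write("1,42,199.0\n2,7,35.5\n")
print(engine.transform_file("orders.txt"))

# 2. Buffer invocation (e.g. from a calling application that already holds the data)
print(engine.transform_buffer(b"3,13,12.0\n"))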

Example use case: Call Detail Records (CDR)

Why Hadoop?
- Large data sets: every 7 seconds, every mobile phone in the region creates a record
- Desire to analyze behavior and location to personalize and optimize pricing and marketing

Parse and Prepare Data on Hadoop
How does it work?

hadoop dt-hadoop.jar My_Parser /input/*/input*.txt

1. Define the parser in the HParser visual studio
2. Deploy the parser on the Hadoop Distributed File System (HDFS)
3. Run HParser to extract data and produce tabular format in Hadoop

Profiling and Discovering Data
Informatica Profiling & Data Discovery on Hadoop

Hadoop Data Profiling Results
- Value and pattern frequency to isolate inconsistent/dirty data or unexpected patterns
- Hadoop data profiling results exposed to anyone in the enterprise via a browser
- Examples: CUSTOMER_ID, COUNTRY CODE

1. Profiling stats: min/max values, NULLs, inferred data types, etc. These stats identify outliers and anomalies in the data.
2. Value and pattern analysis of Hadoop data.
3. Drilldown analysis (into Hadoop data): drill down into actual data values to inspect results across the entire data set, including potential duplicates.
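As an illustration of the per-column statistics listed above, a minimal sketch in plain Python over a small sample; the real profiling runs as jobs on Hadoop, and the column values here are made up.

# Column profiling sketch: min/max, NULL count, inferred type,
# value frequency, and pattern frequency over a sample of values.
from collections import Counter
import re

def profile_column(values):
    non_null = [v for v in values if v not in (None, "", "NULL")]
    patterns = Counter(re.sub(r"[A-Za-z]", "X", re.sub(r"\d", "9", v)) for v in non_null)
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "min": min(non_null),
        "max": max(non_null),
        "inferred_type": "numeric" if all(v.replace(".", "", 1).isdigit() for v in non_null) else "string",
        "top_values": Counter(non_null).most_common(3),
        "top_patterns": patterns.most_common(3),   # e.g. 'XX' vs 'X9' flags a dirty country code
    }

country_codes = ["US", "US", "DE", "FR", "U5", "", "US", None, "GB"]
print(profile_column(country_codes))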

Hadoop Data Domain Discovery
Finding the functional meaning of data in Hadoop

- Leverage INFA rules/mapplets to identify the functional meaning of Hadoop data
- Sensitive data (e.g. SSN, credit card number, etc.): PHI (Protected Health Information), PII (Personally Identifiable Information)
- Scalable to look for/discover ANY domain type
- View/share a report of the data domains and sensitive data contained in Hadoop, with the ability to drill down to see suspect data values
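A minimal sketch of the idea behind rule-based domain discovery, using simple regular expressions as stand-ins for the INFA rules/mapplets; the domain names, patterns, and match threshold are illustrative assumptions.

# Classify a column into a data domain when most of its values match a rule.
import re

DOMAIN_RULES = {
    "SSN":         re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "CREDIT_CARD": re.compile(r"^\d{4}(-?\d{4}){3}$"),
    "EMAIL":       re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}

def discover_domains(column_values, match_threshold=0.8):
    """Return the domains whose rule matches at least `match_threshold` of non-empty values."""
    values = [v for v in column_values if v]
    hits = {}
    for domain, rule in DOMAIN_RULES.items():
        matched = sum(1 for v in values if rule.match(v))
        if values and matched / len(values) >= match_threshold:
            hits[domain] = matched / len(values)
    return hits

column = ["123-45-6789", "987-65-4321", "000-12-3456", ""]
print(discover_domains(column))   # {'SSN': 1.0}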

Transforming and Cleansing Data
PowerCenter on Hadoop
Data Quality on Hadoop

PowerCenter developers are now Hadoop developers
- No-code visual development environment
- Preview results at any point in the data flow

Reuse and Import PC Metadata for Hadoop

- Import existing PC artifacts into the Hadoop development environment
- Validate import logic before the actual import process to ensure compatibility

Natural Language Processing
Entity Extraction & Data Classification

Train NLP to find and classify entities in unstructured data.
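As an illustration of entity extraction over unstructured text, a short sketch using spaCy as a stand-in NLP engine (not Informatica's extractor); it assumes the en_core_web_sm model is installed via python -m spacy download en_core_web_sm.

# Extract named entities from unstructured text and apply a crude classification rule.
import spacy

nlp = spacy.load("en_core_web_sm")
text = ("United Technologies Aerospace Systems supplies components for the "
        "Airbus A380 and Boeing B787, with sensor data analysed in Boston.")

doc = nlp(text)
for ent in doc.ents:
    print(ent.text, "->", ent.label_)     # e.g. ORG, PRODUCT, GPE

# A crude classification rule on top of the extracted entities:
labels = {ent.label_ for ent in doc.ents}
category = "organization-related" if "ORG" in labels else "other"
print("classified as:", category)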

Address Validation & Data Cleansing

Configure Mapping for Hadoop Execution

- No need to redesign mapping logic to execute on either traditional or Hadoop infrastructure
- Configure where the integration logic should run: Hadoop or native

Data Integration & Quality on Hadoop

1. The entire Informatica mapping is translated to Hive Query Language.
2. The optimized HQL is converted to MapReduce and submitted to the Hadoop cluster (job tracker).
3. Advanced mapping transformations are executed on Hadoop through user-defined functions using Vibe.
FROM (
  SELECT T1.ORDERKEY1 AS ORDERKEY2, T1.li_count, orders.O_CUSTKEY AS CUSTKEY,
         customer.C_NAME, customer.C_NATIONKEY, nation.N_NAME, nation.N_REGIONKEY
  FROM (
    SELECT TRANSFORM (L_Orderkey.id) USING CustomInfaTx
    FROM lineitem
    GROUP BY L_ORDERKEY
  ) T1
  JOIN orders ON (T1.ORDERKEY1 = orders.O_ORDERKEY)
  JOIN customer ON (orders.O_CUSTKEY = customer.C_CUSTKEY)
  JOIN nation ON (customer.C_NATIONKEY = nation.N_NATIONKEY)
  WHERE nation.N_NAME = 'UNITED STATES'
) T2
INSERT OVERWRITE TABLE TARGET1 SELECT *
INSERT OVERWRITE TABLE TARGET2 SELECT CUSTKEY, count(ORDERKEY2) GROUP BY CUSTKEY;

Hive-QL -> MapReduce + UDFs

Example Mapping Execution

Mapping logic is translated to HQL and submitted to the Hadoop cluster (a cluster of Linux machines); the Informatica repository and engine coordinate execution.

- Source external relational data is streamed to Hadoop for processing
- A source external flat file (local flat file, e.g. a lookup) is staged temporarily on HDFS
- Source HDFS file data is read directly
- Final processed data is loaded into the target HDFS file

Orchestrating and Monitoring Hadoop
Informatica Workflow & Monitoring for Hadoop
Metadata Manager for Hadoop
Dynamic Data Masking for Hadoop

Mixed Workflow Orchestration

One workflow running tasks on Hadoop and in local environments

Example workflow tasks: Cmd_Load2Hadoop, MT_Load2Hadoop + Parse, Cmd_ProfileData, Cmd_ChooseLoadPath, MT_Cleanse, MT_Parse, MT_Data Analysis, Notification.

List of variables:

Name                        Type     Default Value         Description
$User.LoadOptionPath        Integer                        Load path for the workflow, depending on the output of the command task
$User.DataSourceConnection  String   HiveSourceConnection  Source connection object
$User.ProfileResult         Integer  100                   Output from the profiling command task


Unified Administration
Single place to manage and monitor

- Full traceability from workflow to MapReduce jobs
- View generated Hive scripts

Data Lineage and Business Glossary

Hadoop Architecture Overview

Sources (transactions, OLTP, OLAP; documents, email; social media, web logs; machine, device, and scientific data) are accessed via PowerExchange (PWX) adapters for PowerCenter and for Mercury.

Informatica domain: PowerCenter SE on an enterprise grid, PowerCenter services and Mercury services, INFA clients, a MySQL-backed repository, and a Hive client for execution on Hadoop via PWX for HDFS and PWX for Hive.

Hadoop cluster: NameNode and Job Tracker; DataNode1, DataNode2, and DataNode3, each running MapReduce and HDFS and carrying the Informatica libraries (Infa-Lib), RDBMS clients, and HParser.

Runs on Hadoop: PowerCenter on Hadoop, Data Quality on Hadoop, DT on Hadoop, Entity Extraction on Hadoop, Profiling on Hadoop.
