ApacheCon Big Data 2015, Christian Tzolov: Federated SQL on Hadoop and Beyond - Leveraging Apache Geode to Build a Poor Man's SAP HANA

Federated SQL on Hadoop and Beyond: Leveraging Apache Geode to Build a Poor Man's SAP HANA
by Christian Tzolov
@christzolov

Whoami
Christian Tzolov
Technical Architect at Pivotal
BigData, Hadoop, SpringXD
Apache Committer, Crunch PMC member
ctzolov@pivotal.io
blog.tzolov.net
@christzolov

Contents
• Data Systems - Principles
• Use Case: OLTP and OLAP Data Systems Integration
• Passive Data Synchronization (Demo)
• Federated Queries With HAWQ
• HAWQ Web Tables
• HAWQ PXF Architecture
• Geode PXF (Demo)

Compute Arbitrary Functions
on Arbitrary Data

Architectural Patterns
• Data Lake
• Lambda
• Kappa
• Tachyon
• …

Integration Stack
• Apache HDFS: Data Lake (PHD or HDP Hadoop)
• Apache HAWQ: SQL on Hadoop (OLAP)
• Apache Geode: In-memory data grid (OLTP)
• Spring XD: Integration and Streaming Runtime
• Apache Ambari: Manages All Clusters
• Apache Zeppelin: Web UI for interaction with Data Systems
[Diagram: Hadoop/HDFS, Geode, HAWQ, SpringXD, Ambari, Zeppelin]

Apache Geode (OLTP)
• Cache - Performance / Consistency / Resiliency
• Region - Highly available, redundant, distributed Map
China Railway Corporation
• 5,700 train stations
• 4.5 million tickets per day
• 20 million daily users
• 1.4 billion page views per day
• 40,000 visits per second
Indian Railways
• 7,000 stations
• 72,000 miles of track
• 23 million passengers daily
• 120,000 concurrent users
• 10,000 transactions per minute

Apache HAWQ (OLAP)
• Built around a Greenplum MPP DB (C and C++)
• Hadoop Native: Parquet, HDFS and YARN
• 100% ANSI SQL compliant: SQL-92/99/2003…
• Extensible - Web Tables, PXF
• Connectivity: ODBC and JDBC
• Access internal store: HAWQ(Parquet)InputFormat

HAWQ - TPC-DS
• Outperforms Impala by 454% overall
• 344% performance improvement over Hive/Tez
• Runs 100% of the TPC-DS queries, unlike Impala or Hive
• References: http://bit.ly/1NUDcLl, https://github.com/dbbaskette/pivbench

Spring XD
Orchestrates and automates all steps across multiple data stream pipelines
Sources
• HTTP
• Tail
• File
• Mail
• Twitter
• Gemfire
• Syslog
• TCP
• UDP
• JMS
• RabbitMQ
• MQTT
• Kafka
• Reactor TCP/UDP
Processors
• Filter
• Transformer
• Object-to-JSON
• JSON-to-Tuple
• Splitter
• Aggregator
• HTTP Client
• Groovy Scripts
• Java Code
• JPMML Evaluator
• Spark Streaming
Sinks
• File
• HDFS
• JDBC
• TCP
• Log
• Mail
• RabbitMQ
• Gemfire
• Splunk
• MQTT
• Kafka
• Dynamic Router
• Counters
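
Spring XD streams chain a source through optional processors into a sink (`source | processor | sink`). The sketch below only imitates that wiring with Python generators; the module functions are hypothetical stand-ins, not Spring XD modules or its DSL.

```python
def source(lines):
    """Source stand-in: emits raw messages one at a time."""
    yield from lines


def filter_nonempty(messages):
    """Processor stand-in: drops blank messages."""
    return (m for m in messages if m.strip())


def to_upper(messages):
    """Processor stand-in: transforms each message."""
    return (m.upper() for m in messages)


def collect_sink(messages):
    """Sink stand-in: gathers messages in a list instead of writing to HDFS."""
    return list(messages)


def run_stream(src, processors, sink):
    """Wire source -> processors -> sink, in the order given."""
    stream = src
    for processor in processors:
        stream = processor(stream)
    return sink(stream)


result = run_stream(source(["a", " ", "b"]), [filter_nonempty, to_upper], collect_sink)
```

Each stage consumes the previous one lazily, so messages flow through one at a time rather than being buffered between stages.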

Use Case: Join OLTP and OLAP Data Systems

Use Case
• Integrate Geode with HAWQ
• Unified data view
• Slowly Changing Dimensions (SCDs)
• Keep the Operational and Historical data in Sync

Passive Sync Improved (gpfdist)

HAWQ Web Tables
• HAWQ Web Table - access dynamic data sources
on a web server or by executing OS scripts
• Leverage Geode REST API and OQL
• SpringBoot Controller to convert JSON into TSV
CREATE EXTERNAL WEB TABLE EMPLOYEE_WEB_TABLE (...)
EXECUTE E'curl http://<hostname>/gemfire-api/v1/queries/adhoc?q=<URLencoded OQL statement>'
ON MASTER FORMAT 'text' (delimiter '|' null 'null' escape E'');
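
The web table above needs two pieces of glue: URL-encoding the OQL statement for the ad hoc query endpoint, and flattening the JSON reply into delimited text that matches the FORMAT clause. A minimal sketch in Python (the talk uses a Spring Boot controller for the conversion). The host name, column names and sample payload are hypothetical, and the reply is assumed to be a plain JSON array of flat objects.

```python
import json
from urllib.parse import quote


def adhoc_query_url(host, oql):
    """Build the ad hoc query URL for the EXECUTE clause (host is a placeholder)."""
    return f"http://{host}/gemfire-api/v1/queries/adhoc?q={quote(oql, safe='')}"


def json_to_delimited(payload, columns, delimiter="|", null="null"):
    """Flatten a JSON array of flat objects into delimited text rows.

    Missing or null values are written as the `null` marker so that they
    match the `null 'null'` option of the external table definition.
    """
    rows = json.loads(payload)
    lines = []
    for row in rows:
        values = [null if row.get(col) is None else str(row[col]) for col in columns]
        lines.append(delimiter.join(values))
    return "\n".join(lines)


# Hypothetical sample data, not taken from the talk.
sample = '[{"id": 1, "name": "Ada", "dept": null}, {"id": 2, "name": "Bob", "dept": "R&D"}]'
url = adhoc_query_url("geode-host:8080", "SELECT * FROM /Employee")
text = json_to_delimited(sample, ["id", "name", "dept"])
```

The sketch does not escape values that contain the delimiter; a real converter would have to.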

HAWQ Web Tables
Architecture
Access dynamic data sources on a web server or by
executing OS scripts.

HAWQ Web Tables
Limitations
• Not Scalable
• No Push Down Predicates
• Static
• No Compression
• Requires Additional Components

P(ivotal) Extension Framework (PXF)
• Java-Based
• Parallel, High Throughput Data Access
• ANSI-compliant SQL On Any Dataset
• Wide variety of PXF plugins

PXF Data Model
• Data Source is modeled as a collection of one or more
Fragments.
• Each Fragment consists of many Rows that in turn are
split into typed Fields.
• Analyzer (optional) provides PXF statistical data for the
HAWQ query optimizer
• Metadata about the data source locations, access
attributes, table schemas, formats, SQL query filters,
etc.
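
The containment hierarchy described above (data source, fragments, rows, typed fields) can be pictured with a few plain data classes. This is an illustrative model only: the class and attribute names are invented for the sketch and are not the PXF Java types.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class Field:
    type_name: str  # illustrative type label, e.g. "INTEGER" or "TEXT"
    value: Any


@dataclass
class Row:
    fields: List[Field]


@dataclass
class Fragment:
    source_name: str  # which part of the data source this fragment covers
    rows: List[Row] = field(default_factory=list)


@dataclass
class DataSource:
    fragments: List[Fragment]

    def row_count(self):
        """Total number of rows across all fragments."""
        return sum(len(fragment.rows) for fragment in self.fragments)


# Hypothetical sample: two fragments holding three rows in total.
ds = DataSource([
    Fragment("part-0", [Row([Field("INTEGER", 1), Field("TEXT", "Ada")])]),
    Fragment("part-1", [Row([Field("INTEGER", 2), Field("TEXT", "Bob")]),
                        Row([Field("INTEGER", 3), Field("TEXT", "Cy")])]),
])
```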

PXF Processors
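[Class diagram: custom plugins extend the Plugin base class (configured through InputData) or implement the interfaces listed below]
• Fragmenter: getFragments() - CustomFragmenter
• Analyzer: getEstimatedStat() - CustomAnalyzer
• ReadAccessor: openForRead(), readNextObject(), closeForRead() - CustomAccessor
• WriteAccessor: openForWrite(), writeNextObject(), closeForWrite() - CustomAccessor
• ReadResolver: getFields(OneRow) - CustomResolver
• WriteResolver: getFields(OneRow) - CustomResolver
Legend: Extend Class, Implement Interface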
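
To show how the read-side processors cooperate, here is a small Python model of the fragmenter, accessor and resolver roles. PXF plugins are written in Java; the snake_case method names mirror the ones on the slide, while the classes, the in-memory data and the driver loop are assumptions made for illustration.

```python
from abc import ABC, abstractmethod


class Fragmenter(ABC):
    @abstractmethod
    def get_fragments(self): ...


class ReadAccessor(ABC):
    @abstractmethod
    def open_for_read(self): ...

    @abstractmethod
    def read_next_object(self): ...

    @abstractmethod
    def close_for_read(self): ...


class ReadResolver(ABC):
    @abstractmethod
    def get_fields(self, one_row): ...


# Hypothetical in-memory "data source": fragment name -> raw records.
DATA = {"frag-0": ["1,Ada", "2,Bob"], "frag-1": ["3,Cy"]}


class ListFragmenter(Fragmenter):
    def get_fragments(self):
        return sorted(DATA)


class ListAccessor(ReadAccessor):
    def __init__(self, fragment):
        self.fragment = fragment
        self._records = None

    def open_for_read(self):
        self._records = iter(DATA[self.fragment])
        return True

    def read_next_object(self):
        return next(self._records, None)  # None signals the end of the fragment

    def close_for_read(self):
        self._records = None


class CsvResolver(ReadResolver):
    def get_fields(self, one_row):
        ident, name = one_row.split(",")
        return [int(ident), name]  # raw record -> typed fields


def read_all():
    """Drive the pipeline: list fragments, read raw rows, resolve them into fields."""
    resolver = CsvResolver()
    records = []
    for fragment in ListFragmenter().get_fragments():
        accessor = ListAccessor(fragment)
        accessor.open_for_read()
        try:
            while (row := accessor.read_next_object()) is not None:
                records.append(resolver.get_fields(row))
        finally:
            accessor.close_for_read()
    return records
```

Splitting access (how to fetch raw records) from resolution (how to type them) is what lets one accessor be paired with different resolvers.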

PXF Deployment Model
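[Diagram: parallel execution of a query over PXF]
• SQL query arrives at the HAWQ Master (Query Dispatcher)
• The master sends a metadata request to the PXF Service on the NameNode and receives the Fragment list
• The master sends the scan plan to the Query Executors on Data Nodes X … Z
• Each Query Executor sends a data request for its Fragment to the PXF Service on its Data Node, which reads from the External (Distributed) Data System and returns pxfwritable records
• The Query Executors return their results to the master, which returns the query result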
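
The fan-out in this deployment model can be sketched as: fetch the fragment list once, then issue one data request per fragment in parallel and merge the records. In the Python sketch below, threads stand in for the query executors, and the fragment names and data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical external data system, already split into fragments.
EXTERNAL = {"X": [1, 2, 3], "Y": [4, 5], "Z": [6]}


def metadata_request():
    """Stand-in for the metadata call that returns the fragment list."""
    return sorted(EXTERNAL)


def data_request(fragment):
    """Stand-in for one executor asking its PXF service for a single fragment."""
    return list(EXTERNAL[fragment])


def run_query():
    """Fetch the fragment list, fan out one data request per fragment, merge results."""
    fragments = metadata_request()
    with ThreadPoolExecutor(max_workers=len(fragments)) as pool:
        partials = list(pool.map(data_request, fragments))  # order follows `fragments`
    return [record for part in partials for record in part]
```

Because each fragment is read independently, throughput in this model grows with the number of fragments that can be served at once.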

PXF External Tables
CREATE EXTERNAL TABLE ext_table_name
<Attribute list, …>
LOCATION('pxf://<host>:<port>/path/to/data?
FRAGMENTER=package.name.FragmenterForX&
ACCESSOR=package.name.AccessorForX&
RESOLVER=package.name.ResolverForX&
<Other custom user options>=<Value>'
)
FORMAT 'custom' (formatter='pxfwritable_import');
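
The LOCATION string is a URI whose query parameters name the three plugin classes plus any custom options. A small Python helper that assembles such a statement from its parts is sketched below. The helper functions, host, port, column list and the `CUSTOM_OPTION` parameter are hypothetical; only the overall shape follows the template above.

```python
def pxf_location(host, port, path, fragmenter, accessor, resolver, **options):
    """Assemble a pxf:// URI with the plugin classes and extra options as parameters."""
    params = [("FRAGMENTER", fragmenter), ("ACCESSOR", accessor), ("RESOLVER", resolver)]
    params += sorted(options.items())
    query = "&".join(f"{key}={value}" for key, value in params)
    return f"pxf://{host}:{port}/{path.lstrip('/')}?{query}"


def external_table_ddl(name, columns, location):
    """Render a CREATE EXTERNAL TABLE statement (no SQL escaping; illustration only)."""
    cols = ", ".join(f"{col} {sql_type}" for col, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {name} ({cols})\n"
        f"LOCATION('{location}')\n"
        f"FORMAT 'custom' (formatter='pxfwritable_import');"
    )


# Placeholder values throughout.
loc = pxf_location(
    "namenode", 51200, "path/to/data",
    "package.name.FragmenterForX",
    "package.name.AccessorForX",
    "package.name.ResolverForX",
    CUSTOM_OPTION="value",
)
ddl = external_table_ddl("ext_table_name", [("id", "int"), ("name", "text")], loc)
```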

PXF Gallery
• HdfsTextSimple
• HdfsTextMulti
• Hive
• HiveRC
• HiveText
• HBase
• Avro
• Accumulo
• Cassandra
• JSON
• Redis
• Geode/Gemfire
• JDBC

Federated Queries with PXF/Geode - Architecture

PXF/Geode Table
CREATE EXTERNAL TABLE <GEMFIRE_TABLE_NAME>
(...)
LOCATION('pxf://<namenode>/<path>?
PROFILE=GEMFIRE &
LOCATORS=<gemfire-server:port> &
REGION=<region-name>')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

Geode Profile
<profile>
<name>GEMFIRE</name>
<description>A profile for reading Gemfire data</description>
<plugins>
<fragmenter>io.pivotal.pxf.plugins.gemfire.GemfireFragmenter</fragmenter>
<accessor>io.pivotal.pxf.plugins.gemfire.GemfireAccessor</accessor>
<resolver>io.pivotal.pxf.plugins.gemfire.GemfireResolver</resolver>
</plugins>
</profile>
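
A profile is shorthand: naming PROFILE=GEMFIRE in the table definition stands for the fragmenter, accessor and resolver classes listed in the profile entry. The Python sketch below resolves a profile name to its plugin classes. The enclosing `<profiles>` root element, the case-insensitive name match and the `resolve_profile` helper are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# The profile entry from above, wrapped in an assumed <profiles> root element.
PROFILES_XML = """
<profiles>
  <profile>
    <name>GEMFIRE</name>
    <description>A profile for reading Gemfire data</description>
    <plugins>
      <fragmenter>io.pivotal.pxf.plugins.gemfire.GemfireFragmenter</fragmenter>
      <accessor>io.pivotal.pxf.plugins.gemfire.GemfireAccessor</accessor>
      <resolver>io.pivotal.pxf.plugins.gemfire.GemfireResolver</resolver>
    </plugins>
  </profile>
</profiles>
"""


def resolve_profile(xml_text, profile_name):
    """Return {PLUGIN_ROLE: class name} for the named profile, or raise KeyError."""
    root = ET.fromstring(xml_text)
    for profile in root.iter("profile"):
        if profile.findtext("name", "").strip().upper() == profile_name.upper():
            plugins = profile.find("plugins")
            return {child.tag.upper(): child.text.strip() for child in plugins}
    raise KeyError(profile_name)


plugins = resolve_profile(PROFILES_XML, "gemfire")
```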

Federated Queries With PXF/Geode - Demo

Stay Connected
• PXF Maven Repository: https://bintray.com/big-data/maven/pxf/view
• PXF Community Plugins: https://bintray.com/big-data/maven/pxf-plugins/view
• Apache HAWQ: https://github.com/apache/incubator-hawq
• Apache Geode: https://github.com/apache/incubator-geode
• Apache Zeppelin: https://zeppelin.incubator.apache.org
• Spring XD: http://projects.spring.io/spring-xd/
