Efficient and portable data processing with Apache Beam and HBase
Eugene Kirpichov, Google
Agenda
1. History of Beam
2. Philosophy of the Beam programming model
3. Apache Beam project
4. Beam and HBase
The Evolution of Apache Beam
[Diagram: Google-internal systems (MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel) evolving into Apache Beam and Google Cloud Dataflow]
(2004) MapReduce: SELECT + GROUPBY
(2008) FlumeJava: high-level API
(2013) Millwheel: deterministic streaming
(2014) Dataflow: batch/streaming agnostic, infinite out-of-order data, portable
(2016) Apache Beam: open ecosystem, community-driven, vendor-independent
Beam model: unbounded, temporal, out-of-order data
● Unified: no concept of "batch" / "streaming" at all
● Time: event time (when it happened, not when we saw it)
● Windowing: aggregation within time windows
● Keys: windows scoped to a key (e.g. user sessions)
● Triggers: when is a window "complete enough", and what to do when late data arrives
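In code, event time is simply a timestamp attached to each element. A minimal sketch of stamping elements with their event time, assuming a hypothetical LogEntry type with a getEventTimeMillis() getter (WithTimestamps is the Beam Java SDK transform for this):

import org.apache.beam.sdk.transforms.WithTimestamps;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Instant;

// Tell Beam when each element actually happened, so that windowing
// and triggers operate on event time rather than on arrival time.
PCollection<LogEntry> stamped = entries.apply(
    WithTimestamps.of((LogEntry e) -> new Instant(e.getEventTimeMillis())));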
What / Where / When / How
● What are you computing? Transforms
● Where in event time? Windowing
● When in processing time? Triggers
● How do refinements relate? Accumulation
What - transforms
Element-wise, Aggregating, Composite
Pipeline p = Pipeline.create(options);
p.apply(TextIO.Read.from("gs://dataflow-samples/shakespeare/*"))
 // Split each line into words.
 .apply(FlatMapElements.via(
     line -> Arrays.asList(line.split("[^a-zA-Z']+"))))
 .apply(Filter.byPredicate(word -> !word.isEmpty()))
 .apply(Count.perElement())
 // Format each (word, count) pair as a line of text.
 .apply(MapElements.via(
     count -> count.getKey() + ": " + count.getValue()))
 .apply(TextIO.Write.to("gs://.../..."));
p.run();
Where - windowing
● Windowing divides data into event-time-based finite chunks.
● Required when doing aggregations over unbounded data.
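A minimal sketch of applying a window before an aggregation, assuming an input PCollection<String> of words (Window, FixedWindows, and Count are from the Beam Java SDK):

import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Assign each word to a one-minute event-time window; the count is
// then computed independently within each window.
PCollection<KV<String, Long>> counts = words
    .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
    .apply(Count.perElement());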
When - triggers
● Control when a window emits the results of aggregation.
● Often relative to the watermark (a promise about the lateness of a source).
[Chart: event time vs. processing time, with the watermark advancing as processing time progresses]
PCollection<KV<String, Integer>> output = input
    .apply(Window.<KV<String, Integer>>into(
            Sessions.withGapDuration(Duration.standardMinutes(1)))
        .triggering(AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(1)))
            .withLateFirings(AfterPane.elementCountAtLeast(1)))
        // Allowed lateness is required once a trigger is set (value illustrative).
        .withAllowedLateness(Duration.standardDays(1))
        // Accumulate across firings (the talk's accumulating-and-retracting
        // mode is not available in Beam 2.0).
        .accumulatingFiredPanes())
    .apply(Sum.integersPerKey());
How do refinements relate?
1. Classic batch
2. Batch with fixed windows
3. Streaming
4. Streaming with speculative + late data
Each mode is obtained by customizing What / Where / When / How.
Apache Beam project
What is Apache Beam?
1. The Beam Model: What / Where / When / How
2. SDKs for writing Beam pipelines -- Java, Python
3. Runners for existing distributed processing backends:
○ Apache Apex
○ Apache Flink
○ Apache Spark
○ Google Cloud Dataflow
○ (WIP) Gearpump and others
○ Local (in-process) runner for testing
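Switching backends is a matter of configuration, not pipeline code. A minimal sketch, assuming the Flink runner artifact is on the classpath (FlinkRunner, SparkRunner, DataflowRunner, and DirectRunner each live in their own runner module):

import org.apache.beam.runners.flink.FlinkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// Pick the execution backend via options; the pipeline itself is unchanged.
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
options.setRunner(FlinkRunner.class); // or SparkRunner, DataflowRunner, DirectRunner
Pipeline p = Pipeline.create(options);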
The Apache Beam Vision
1. End users: who want to write pipelines in a language that's familiar.
2. SDK writers: who want to make Beam concepts available in new languages.
3. Runner writers: who have a distributed processing environment and want to support Beam pipelines.
[Diagram: Beam Java, Beam Python, and other language SDKs construct pipelines against the Beam model; Fn runners hand them to Apache Flink, Apache Spark, or Cloud Dataflow for execution]
Apache Beam ecosystem (layers, top to bottom):
● End-user's pipeline
● Libraries: transforms, sources/sinks, etc.
● Language-specific SDK (Java, Python, …)
● Beam model (ParDo, GBK, Windowing, …)
● Runner
● Execution environment
02/01/2016: Enter Apache Incubator
02/25/2016: 1st commit to ASF repository
Early 2016: Internal API redesign and relative chaos
Mid 2016: Stabilization of new APIs
Late 2016: Multiple runners
Early 2017: Polish and stability
05/2017: Beam 2.0, first stable release
Apache Beam Community
178 contributors
24 committers from 8 orgs (none >50%)
>3300 PRs, >8600 commits, 27 releases
>20 IO (storage system) connectors
5 runners
Beam and HBase
Beam IO connector ecosystem
Many uses of Beam = importing data from one place to another.
● Files: Text, Avro, XML, TFRecord (pluggable FS: local, HDFS, GCS)
● Hadoop ecosystem: HBase, HadoopInputFormat, Hive (HCatalog)
● Streaming systems: Kafka, Kinesis, MQTT, JMS, (WIP) AMQP
● Google Cloud: Pubsub, BigQuery, Datastore, Bigtable, Spanner
● Other: JDBC, Cassandra, Elasticsearch, MongoDB, GridFS
HBaseIO
// Reading: produces one Result per row matched by the scan.
PCollection<Result> data = p.apply(
    HBaseIO.read()
        .withConfiguration(conf)
        .withTableId(table)
        /* … withScan, withFilter … */);

// Writing: applies the mutations grouped per row key.
PCollection<KV<byte[], Iterable<Mutation>>> mutations = …;
mutations.apply(
    HBaseIO.write()
        .withConfiguration(conf)
        .withTableId(table));
IO connectors = just Beam transforms
● Made of Beam primitives: ParDo, GroupByKey, …
● Write = often a simple ParDo
● Read = a couple of ParDos, plus a "Source API" for power users
⇒ straightforward to develop, clean API, very flexible, batch/streaming agnostic
Beam Write with HBase
A bundle is a group of elements processed and committed together; the bundle plays the role of the "transaction".
The DoFn lifecycle maps onto the HBase client as follows (sketched in code below):
● setup() -> creates the Connection
● startBundle() -> gets a BufferedMutator
● processElement() -> applies the Mutation(s)
● finishBundle() -> flushes the BufferedMutator
● tearDown() -> closes the Connection
Mutations must be idempotent, e.g. Put or Delete. Increment and Append should not be used, because a failed bundle may be retried and its mutations replayed.
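A minimal sketch of such a write DoFn against the HBase 1.x client, using Beam's annotation-based DoFn API (the class and its wiring are illustrative, not the actual HBaseIO source):

import java.io.IOException;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Mutation;

class HBaseWriteFn extends DoFn<Mutation, Void> {
  private final String tableId;
  private transient Connection connection;
  private transient BufferedMutator mutator;

  HBaseWriteFn(String tableId) { this.tableId = tableId; }

  @Setup
  public void setup() throws IOException {
    // One connection per DoFn instance, reused across bundles.
    connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
  }

  @StartBundle
  public void startBundle() throws IOException {
    mutator = connection.getBufferedMutator(TableName.valueOf(tableId));
  }

  @ProcessElement
  public void processElement(ProcessContext c) throws IOException {
    mutator.mutate(c.element()); // buffered locally, sent to HBase in batches
  }

  @FinishBundle
  public void finishBundle() throws IOException {
    mutator.flush(); // commit this bundle's mutations
    mutator.close();
  }

  @Teardown
  public void tearDown() throws IOException {
    if (connection != null) connection.close();
  }
}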
Beam Source API
(similar to Hadoop InputFormat, but cleaner / more general)
● Estimate size
● Split into sub-sources (of ~given size)
● Read: iterate, get progress, dynamic split
Note: there is a separate API for unbounded sources, plus (WIP) a new unified API. The shape of the bounded API is sketched below.
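A sketch of that shape, following the Beam Java SDK's BoundedSource / BoundedReader (method names match the 2.x SDK; bodies are elided, so this only shows where each operation lives):

import java.io.IOException;
import java.util.List;
import org.apache.beam.sdk.io.BoundedSource;
import org.apache.beam.sdk.options.PipelineOptions;

abstract class SketchSource<T> extends BoundedSource<T> {
  // Estimate size: lets the runner decide how finely to split.
  @Override public abstract long getEstimatedSizeBytes(PipelineOptions options)
      throws Exception;

  // Split into sub-sources of roughly the given size.
  @Override public abstract List<? extends BoundedSource<T>> split(
      long desiredBundleSizeBytes, PipelineOptions options) throws Exception;

  // Read: create a reader that iterates, reports progress, and can split.
  @Override public abstract BoundedReader<T> createReader(PipelineOptions options)
      throws IOException;
}

abstract class SketchReader<T> extends BoundedSource.BoundedReader<T> {
  @Override public abstract boolean start() throws IOException;   // begin iteration
  @Override public abstract boolean advance() throws IOException; // next record; false at end
  @Override public abstract T getCurrent();
  // Get progress: fraction of this reader's range already consumed.
  @Override public abstract Double getFractionConsumed();
  // Dynamic split: hand off the unread remainder as a new source.
  @Override public abstract BoundedSource<T> splitAtFraction(double fraction);
}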
HBase on Beam Source API
HBaseSource wraps a Scan; each Source API operation maps to an HBase client mechanism:
● Estimate size: RegionSizeCalculator
● Split: region locations (one sub-source per region)
● Read / iterate: ResultScanner (see the client-level sketch below)
● Get progress: key interpolation
● Dynamic split*: RangeTracker
[Diagram: sub-sources reading in parallel from Region Server 1 and Region Server 2]
* Dynamic split for HBaseIO: PR in progress
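For concreteness, here is roughly what reading one such key range looks like in terms of the raw HBase 1.x client (the Beam BoundedReader wrapping and the RangeTracker bookkeeping are elided; variable names are illustrative):

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;

// Scan one [startKey, stopKey) range -- e.g. one region -- and iterate rows.
Scan scan = new Scan();
scan.setStartRow(startKey);
scan.setStopRow(stopKey);
try (Table t = connection.getTable(TableName.valueOf(tableId));
     ResultScanner scanner = t.getScanner(scan)) {
  for (Result result : scanner) {
    // Emit `result` downstream; progress ~ position of result.getRow()
    // interpolated between startKey and stopKey.
  }
}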
Digression: stragglers
[Chart: workers vs. time; a few straggler workers keep running long after the rest have finished]
Beam approach: dynamic splitting*
[Chart: workers vs. time; straggling work is re-split at "now" so all workers finish near the average completion time]
*Currently implemented only by Dataflow
Autoscaling
Learn More!
Apache Beam
https://beam.apache.org
The World Beyond Batch 101 & 102
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
No Shard Left Behind: Straggler-Free Data Processing in Cloud Dataflow
Join the mailing lists!
user-subscribe@beam.apache.org
dev-subscribe@beam.apache.org
Follow @ApacheBeam on Twitter
Thank you
