Tapping into Scientific Data
with Hadoop and Flink
Michael Häusler, ResearchGate
Dec 14, 2016
Tapping into Scientific Data
Big Data Engineering
Fun with Flink
Agenda
The social network gives scientists new tools
to connect, collaborate, and keep up with the
research that matters most to them.
ResearchGate is built
for scientists.
Our mission is to connect the world of science
and make research open to all.
Structured system
We are changing how
scientific knowledge is
shared and discovered.
11,000,000+
Members
109,000,000+
Publications
1,300,000,000+
Citations
Tapping into Scientific Data
Bibliographic Metadata
Bibliographic Metadata – Data Model
Author Asset Derivative
Publication
Journal
Institution
Department
Account
Affiliation
Citation
Authorship Publication Link
Affiliation
Claiming
community, publication service, asset service
Bibliographic Metadata – Services
Author Asset Derivative
Publication
Journal
Institution
Department
Account
Affiliation
Citation
Authorship Publication Link
Affiliation
Claiming
Data Science Opportunities
Author Asset Derivative
Publication
Journal
Institution
Department
Account
Affiliation
Citation
Authorship Publication Link
Affiliation
claiming suggestions
affiliation analysis
citation analysis
publication deduplication
metadata extraction
topic indexing
author analysis
...
Claiming
High impact on individual user experience
Users and algorithms constantly enrich an evolving dataset
Should converge across different executions
User-Facing Features
Tapping into Scientific Data
Author Analysis (2010 / 2011)
Author Analysis – Clustering and Disambiguation
Author Analysis – High Product Impact
Near realtime incremental process
Direct integration with operational services via live database
Runtime for full reclustering: several weeks
Author Analysis – 2010
App Server Live Database
Author Analysis Server
App Server Live Database
Export
Import
AuthorAnalysis
Hadoop
Author Analysis – 2011
Java MapReduce / Hadoop
Runtime for full reclustering: a few hours (incl. import and export)
Key Learning
Continuous delivery in quick iterations is a game changer.
Big Data Engineering
Do I need it?
Compute intensive and/or data intensive
Bibliographic metadata alone is 500+ GB (Snappy-compressed Avro files)
Just reading the data with 100 MB/s takes 85 minutes
Seemingly simple tasks can quickly turn into big data tasks
Flavors of Big Data Tasks
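The 85-minute figure above is simple arithmetic. A small sketch (the class and method names are hypothetical, not from the deck) makes the estimate reproducible:

```java
// Back-of-the-envelope estimate for scanning the bibliographic metadata:
// reading ~500 GB sequentially at 100 MB/s, as on the slide above.
public class ReadTimeEstimate {

    // Minutes needed to read `gigabytes` of data at `mbPerSecond` throughput.
    static long minutesToRead(long gigabytes, long mbPerSecond) {
        long megabytes = gigabytes * 1024;      // GB -> MB
        long seconds = megabytes / mbPerSecond; // MB / (MB/s) -> s
        return seconds / 60;                    // s -> min
    }

    public static void main(String[] args) {
        // 500 GB at 100 MB/s is roughly 85 minutes, matching the slide
        System.out.println(minutesToRead(500, 100) + " minutes");
    }
}
```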
Big Data Engineering
RG Big Data Architecture (2016)
[Architecture diagram] Operational systems serve end users (11+ M scientists); internal users work against the Hadoop side. Two clusters: an “Analytics” cluster (batch) and a “Live” cluster (near realtime), with HBase replication between them. Flows: transactional load (HBase reads / writes), continuous updates (Flink streaming results), batch updates (MR / Hive / Flink results), data ingestion.
Hadoop Clusters
[Diagram] Services behind a load balancer; the entity conveyor (EC) and platform data import (PDI) feed the Hadoop analytics cluster (MR / Hive / Flink).
Batch Processing
[Diagram] service → entity conveyor → Kafka queue → Flink stream processor → Kafka queue → Hadoop live cluster → service
Stream Processing
Big Data Engineering
Picking a framework
“Frameworkitis is the disease that a framework
wants to do too much for you or it does it in a way
that you don’t want but you can’t change it.”
Erich Gamma
Having fun with Hadoop?
“Simple things should be simple, complex things
should be possible.”
Alan Kay
Having fun with Hadoop?
Obvious criteria
• Features
• Performance & Scalability
• Robustness & Stability
• Maturity & Community
Not so obvious
• Is it fun to solve simple, everyday tasks?
How to Evaluate a Framework
Fun with Hadoop and Flink
Apache Flink
“Platform for distributed stream and batch data processing”
Technology
• JVM based
• APIs for Java and Scala
• plays nicely with Hadoop, YARN, HDFS, Kafka, ...
http://flink.apache.org/
Apache Flink
Apache Flink – APIs
http://flink.apache.org/
Fun with Hadoop and Flink
Comparing Frameworks with a Simple Task:
Top 5 Coauthors
publication = {
"publicationUid": 7,
"title": "Foo",
"authorships": [
{
"authorUid": 23,
"authorName": "Jane"
},
{
"authorUid": 25,
"authorName": "John"
}
]
}
authorAccountMapping = {
"authorUid": 23,
"accountId": 42
}
(7, 23)
(7, 25)
(7, "AC:42")
(7, "AU:25")
topCoauthorStats = {
"authorKey": "AC:42",
"topCoauthors": [
{
"coauthorKey": "AU:25",
"coauthorCount": 1
}
]
}
Top 5 Coauthors
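To make the task concrete, here is a framework-independent sketch of the core logic in plain Java (class and method names are hypothetical; the "AC:"/"AU:" key scheme follows the slide): resolve each author to an account key where a claiming mapping exists, emit coauthor pairs per publication, count them, and keep the top 5 per author key.

```java
import java.util.*;
import java.util.stream.*;

// In-memory stand-in for the distributed job: same data model as the
// slide's toy example, but without any MapReduce/Hive/Flink machinery.
public class TopCoauthors {

    // Prefer the claimed account ("AC:<accountId>"), fall back to "AU:<authorUid>".
    static String authorKey(long authorUid, Map<Long, Long> accountByAuthor) {
        Long accountId = accountByAuthor.get(authorUid);
        return accountId != null ? "AC:" + accountId : "AU:" + authorUid;
    }

    // publications: publicationUid -> authorUids on that publication
    static Map<String, List<Map.Entry<String, Long>>> topCoauthors(
            Map<Long, List<Long>> publications,
            Map<Long, Long> accountByAuthor,
            int limit) {
        // Count coauthor pairs per author key.
        Map<String, Map<String, Long>> counts = new HashMap<>();
        for (List<Long> authors : publications.values()) {
            for (long a : authors) {
                for (long b : authors) {
                    if (a == b) continue;
                    String ka = authorKey(a, accountByAuthor);
                    String kb = authorKey(b, accountByAuthor);
                    counts.computeIfAbsent(ka, k -> new HashMap<>())
                          .merge(kb, 1L, Long::sum);
                }
            }
        }
        // Keep only the `limit` most frequent coauthors per key.
        Map<String, List<Map.Entry<String, Long>>> top = new HashMap<>();
        counts.forEach((key, byCoauthor) -> top.put(key,
                byCoauthor.entrySet().stream()
                        .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                        .limit(limit)
                        .collect(Collectors.toList())));
        return top;
    }

    public static void main(String[] args) {
        // The slide's example: publication 7 by authors 23 and 25,
        // where author 23 has claimed account 42.
        Map<Long, List<Long>> pubs = Map.of(7L, List.of(23L, 25L));
        Map<Long, Long> accounts = Map.of(23L, 42L);
        System.out.println(topCoauthors(pubs, accounts, 5));
    }
}
```

The distributed versions discussed next implement the same three steps (join, pair emission, grouped top-N) with very different amounts of ceremony.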
• Map-side joins require knowledge of the distributed cache
• Both map and reduce-side joins require assumptions about data sizes
• Constant type juggling
• Hard to glue together
• Hard to test
• Implementing secondary sorting can be tricky
• Dealing with low level details is no fun
Map Reduce (Java)
Hive
Hive Generic Aggregate UDF
• Join and group by are easy
• Common subexpressions are not optimized yet
• Dealing with denormalized data can be tricky
• UDFs are implemented at a low level and need to be deployed
• UDAFs (aggregation functions)
• Requires expert knowledge
• Especially with UDFs: not so much fun
Hive
Flink – DataSet API
• Fluent API
• Rich set of transformations
• Support for Tuples and POJOs
• With some discipline, separation of business logic is possible
• Fastest and most fun to implement
Flink
[Chart] Execution Time (hh:mm:ss, axis 00:00:00–00:36:00) at input sizes 50 and 100, comparing Hive (Tez), MapReduce, and Flink
Performance and fun with Flink
Performance Comparison
Thank you!
Michael Häusler, Head of Engineering
https://www.researchgate.net/profile/Michael_Haeusler
https://www.researchgate.net/careers
