Tapping into Scientific Data
with Hadoop and Flink
Michael Häusler, ResearchGate
Dec 14, 2016
Tapping into Scientific Data
Big Data Engineering
Fun with Flink
Agenda
The social network gives scientists new tools
to connect, collaborate, and keep up with the
research that matters most to them.
ResearchGate is built
for scientists.
Our mission is to connect the world of science
and make research open to all.
Structured system
We are changing how
scientific knowledge is
shared and discovered.
11,000,000+
Members
109,000,000+
Publications
1,300,000,000+
Citations
Tapping into Scientific Data
Bibliographic Metadata
Bibliographic Metadata – Data Model
Author Asset Derivative
Publication
Journal
Institution
Department
Account
Affiliation
Citation
Authorship Publication Link
Affiliation
Claiming
community, publication service, asset service
Bibliographic Metadata – Services
Author Asset Derivative
Publication
Journal
Institution
Department
Account
Affiliation
Citation
Authorship Publication Link
Affiliation
Claiming
Data Science Opportunities
Author Asset Derivative
Publication
Journal
Institution
Department
Account
Affiliation
Citation
Authorship Publication Link
Affiliation
claiming suggestions
affiliation analysis
citation analysis
publication deduplication
metadata extraction
topic indexing
author analysis
...
Claiming
High impact on individual user experience
Users and algorithms constantly enrich an evolving dataset
Should converge across different executions
User-Facing Features
Tapping into Scientific Data
Author Analysis (2010 / 2011)
Author Analysis – Clustering and Disambiguation
Author Analysis – High Product Impact
Near realtime incremental process
Direct integration with operational services via live database
Runtime for full reclustering: several weeks
Author Analysis – 2010
App Server Live Database
Author Analysis Server
App Server Live Database
Export
Import
AuthorAnalysis
Hadoop
Author Analysis – 2011
Java MapReduce / Hadoop
Runtime for full reclustering: a few hours (incl. import and export)
Key Learning
Continuous delivery in quick iterations is a game changer.
Big Data Engineering
Do I need it?
Compute intensive and/or data intensive
Bibliographic metadata alone is 500+ GB (Snappy-compressed Avro files)
Just reading the data with 100 MB/s takes 85 minutes
Seemingly simple tasks can quickly turn into big data tasks
Flavors of Big Data Tasks
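The 85-minute figure above is simple arithmetic. A small sketch (the class and method names are hypothetical, not from the deck) makes the estimate reproducible:

```java
// Back-of-the-envelope estimate for scanning the bibliographic metadata:
// reading ~500 GB sequentially at 100 MB/s, as on the slide above.
public class ReadTimeEstimate {

    // Minutes needed to read `gigabytes` of data at `mbPerSecond` throughput.
    static long minutesToRead(long gigabytes, long mbPerSecond) {
        long megabytes = gigabytes * 1024;      // GB -> MB
        long seconds = megabytes / mbPerSecond; // MB / (MB/s) -> s
        return seconds / 60;                    // s -> min
    }

    public static void main(String[] args) {
        // 500 GB at 100 MB/s is roughly 85 minutes, matching the slide
        System.out.println(minutesToRead(500, 100) + " minutes");
    }
}
```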
Big Data Engineering
RG Big Data Architecture (2016)
[Architecture diagram] Operational systems serve end users (11+ M scientists); internal users work against the Hadoop side. Two clusters: an “Analytics” cluster (batch) and a “Live” cluster (near realtime), with HBase replication between them. Flows: transactional load (HBase reads / writes), continuous updates (Flink streaming results), batch updates (MR / Hive / Flink results), data ingestion.
Hadoop Clusters
[Diagram] Services behind a load balancer; the entity conveyor (EC) and platform data import (PDI) feed the Hadoop analytics cluster (MR / Hive / Flink).
Batch Processing
[Diagram] service → entity conveyor → Kafka queue → Flink stream processor → Kafka queue → Hadoop live cluster → service
Stream Processing
Big Data Engineering
Picking a framework
“Frameworkitis is the disease that a framework
wants to do too much for you or it does it in a way
that you don’t want but you can’t change it.”
Erich Gamma
Having fun with Hadoop?
“Simple things should be simple, complex things
should be possible.”
Alan Kay
Having fun with Hadoop?
Obvious criteria
• Features
• Performance & Scalability
• Robustness & Stability
• Maturity & Community
Not so obvious
• Is it fun to solve simple, everyday tasks?
How to Evaluate a Framework
Fun with Hadoop and Flink
Apache Flink
“Platform for distributed stream and batch data processing”
Technology
• JVM based
• APIs for Java and Scala
• plays nicely with Hadoop, YARN, HDFS, Kafka, ...
http://flink.apache.org/
Apache Flink
Apache Flink – APIs
http://flink.apache.org/
Fun with Hadoop and Flink
Comparing Frameworks with a Simple Task:
Top 5 Coauthors
publication = {
"publicationUid": 7,
"title": "Foo",
"authorships": [
{
"authorUid": 23,
"authorName": "Jane"
},
{
"authorUid": 25,
"authorName": "John"
}
]
}
authorAccountMapping = {
"authorUid": 23,
"accountId": 42
}
(7, 23)
(7, 25)
(7, "AC:42")
(7, "AU:25")
topCoauthorStats = {
"authorKey": "AC:42",
"topCoauthors": [
{
"coauthorKey": "AU:25",
"coauthorCount": 1
}
]
}
Top 5 Coauthors
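To make the task concrete, here is a framework-independent sketch of the core logic in plain Java (class and method names are hypothetical; the "AC:"/"AU:" key scheme follows the slide): resolve each author to an account key where a claiming mapping exists, emit coauthor pairs per publication, count them, and keep the top 5 per author key.

```java
import java.util.*;
import java.util.stream.*;

// In-memory stand-in for the distributed job: same data model as the
// slide's toy example, but without any MapReduce/Hive/Flink machinery.
public class TopCoauthors {

    // Prefer the claimed account ("AC:<accountId>"), fall back to "AU:<authorUid>".
    static String authorKey(long authorUid, Map<Long, Long> accountByAuthor) {
        Long accountId = accountByAuthor.get(authorUid);
        return accountId != null ? "AC:" + accountId : "AU:" + authorUid;
    }

    // publications: publicationUid -> authorUids on that publication
    static Map<String, List<Map.Entry<String, Long>>> topCoauthors(
            Map<Long, List<Long>> publications,
            Map<Long, Long> accountByAuthor,
            int limit) {
        // Count coauthor pairs per author key.
        Map<String, Map<String, Long>> counts = new HashMap<>();
        for (List<Long> authors : publications.values()) {
            for (long a : authors) {
                for (long b : authors) {
                    if (a == b) continue;
                    String ka = authorKey(a, accountByAuthor);
                    String kb = authorKey(b, accountByAuthor);
                    counts.computeIfAbsent(ka, k -> new HashMap<>())
                          .merge(kb, 1L, Long::sum);
                }
            }
        }
        // Keep only the `limit` most frequent coauthors per key.
        Map<String, List<Map.Entry<String, Long>>> top = new HashMap<>();
        counts.forEach((key, byCoauthor) -> top.put(key,
                byCoauthor.entrySet().stream()
                        .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                        .limit(limit)
                        .collect(Collectors.toList())));
        return top;
    }

    public static void main(String[] args) {
        // The slide's example: publication 7 by authors 23 and 25,
        // where author 23 has claimed account 42.
        Map<Long, List<Long>> pubs = Map.of(7L, List.of(23L, 25L));
        Map<Long, Long> accounts = Map.of(23L, 42L);
        System.out.println(topCoauthors(pubs, accounts, 5));
    }
}
```

The distributed versions discussed next implement the same three steps (join, pair emission, grouped top-N) with very different amounts of ceremony.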
• Map-side joins require knowledge of the distributed cache
• Both map and reduce-side joins require assumptions about data sizes
• Constant type juggling
• Hard to glue together
• Hard to test
• Implementing secondary sorting can be tricky
• Dealing with low level details is no fun
Map Reduce (Java)
Hive
Hive Generic Aggregate UDF
• Join and group by are easy
• Common subexpressions are not optimized yet
• Dealing with denormalized data can be tricky
• UDFs are implemented at a low level and need to be deployed
• UDAFs (aggregation functions)
• Requires expert knowledge
• Especially with UDFs: not so much fun
Hive
Flink – DataSet API
• Fluent API
• Rich set of transformations
• Support for Tuples and POJOs
• With some discipline, separation of business logic is possible
• Fastest and most fun to implement
Flink
[Chart] Execution Time (hh:mm:ss, axis 00:00:00–00:36:00) at input sizes 50 and 100, comparing Hive (Tez), MapReduce, and Flink
Performance and fun with Flink
Performance Comparison
Thank you!
Michael Häusler, Head of Engineering
https://www.researchgate.net/profile/Michael_Haeusler
https://www.researchgate.net/careers
