Scott
Leberknight
Cloudera's
7/9/2013
History
lesson...
Google Map/Reduce
paper (2004)
Cutting & Cafarella
create Hadoop (2005)
Google Dremel paper (2010)
Facebook creates Hive (2007)*
Cloudera announces Impala
(October 2012)
HortonWorks' Stinger
(February 2013)
Apache Drill proposal
(August 2012)
* Hive => "SQL on Hadoop"
Write SQL queries
Translate into Map/Reduce job(s)
Convenient & easy
High-latency (batch processing)
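To make the translation step concrete, here is a minimal Python sketch of how a GROUP BY count (like Query "E" in the benchmarks) can be expressed as a map phase, a shuffle, and a reduce phase. It only illustrates the Map/Reduce model and is not Hive's actual planner; the function names and sample rows are made up for the example.

```python
from collections import defaultdict

def map_phase(rows):
    """Emit (key, 1) for each row, keyed by the GROUP BY column."""
    for row in rows:
        yield row["cd_credit_rating"], 1

def shuffle(pairs):
    """Group emitted values by key (done by the framework between phases)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the values for each key, i.e. count(*) per group."""
    return {key: sum(values) for key, values in grouped.items()}

# Made-up sample rows standing in for customer_demographics
rows = [
    {"cd_credit_rating": "Good"},
    {"cd_credit_rating": "High Risk"},
    {"cd_credit_rating": "Good"},
]
counts = reduce_phase(shuffle(map_phase(rows)))
print(counts)
```

In a real cluster each phase is scheduled as part of a batch job, which is where much of the latency comes from.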
What is Impala?
In-memory, distributed SQL
query engine (no Map/Reduce)
Native code (C++)
Distributed
(on HDFS data nodes)
Why Impala?
Interactive data analysis
Low-latency response
(roughly 4-100x faster than Hive)
Deploy on existing Hadoop clusters
Why Impala? (cont'd)
Data stored in HDFS avoids...
...duplicate storage
...data transformation
...moving data
Why Impala? (cont'd)
SPEED!
Overview
impalad daemon runs on HDFS nodes
Queries run on "relevant" nodes
Supports common HDFS file formats
statestored (for cluster metadata)
& Hive metastore (for database metadata)
Overview (cont'd)
Does not use Map/Reduce
Not fault tolerant !
(a query fails if any part of it fails on any node)
Submit queries via Hue/Beeswax
Thrift API, CLI, ODBC, JDBC
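The "not fault tolerant" point can be sketched as a toy model: one fragment of the query runs per node, and the first failure aborts the whole query with no retry. The node names, the `run_query` function, and the `QueryFailed` exception are hypothetical and are not part of any Impala API.

```python
class QueryFailed(Exception):
    """Raised when any fragment of a query fails."""

def run_query(fragments):
    """Run one fragment per node; abort the whole query on the first failure.

    `fragments` maps a (hypothetical) node name to a zero-argument callable.
    """
    results = []
    for node, fragment in fragments.items():
        try:
            results.append(fragment())
        except Exception as err:
            # No retry and no re-scheduling on another node
            raise QueryFailed(f"fragment on {node} failed: {err}") from err
    return results

def ok():
    return 1

def broken():
    raise RuntimeError("disk error")

print(run_query({"node1": ok, "node2": ok}))
try:
    run_query({"node1": ok, "node2": broken})
except QueryFailed as err:
    print("query failed:", err)
```

A Map/Reduce job, by contrast, would re-run the failed task, which is part of the trade-off between latency and fault tolerance.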
SQL Support
SELECT
Projection
UNION
INSERT OVERWRITE
INSERT INTO
ORDER BY
(w/ LIMIT)
Aggregation
Subqueries
(uncorrelated)
JOIN (equi-join only,
subject to memory
limitations)
(subset of Hive QL)
HBase Queries
Maps HBase tables via Hive
metastore mapping
HBase scan translations:
Row key predicates => start/stop row
Non-row key predicates => SingleColumnValueFilter
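A rough sketch of the two scan translations, written as plain Python rather than against the HBase client. The `translate_predicates` function, the predicate tuples, and the `key` column name are assumptions for illustration; only range predicates on the row key are handled.

```python
def translate_predicates(predicates, row_key="key"):
    """Split (column, op, value) predicates into scan bounds and filters.

    Row-key range predicates become start/stop rows; predicates on any
    other column become a description of a SingleColumnValueFilter.
    """
    scan = {"start_row": None, "stop_row": None, "filters": []}
    for column, op, value in predicates:
        if column == row_key:
            if op == ">=":
                scan["start_row"] = value
            elif op == "<":
                scan["stop_row"] = value  # HBase stop rows are exclusive
            else:
                raise NotImplementedError(f"row key operator {op!r}")
        else:
            scan["filters"].append(
                ("SingleColumnValueFilter", column, op, value)
            )
    return scan

scan = translate_predicates([
    ("key", ">=", "a100"),
    ("key", "<", "a200"),
    ("city", "=", "Reston"),
])
print(scan)
```

Bounding the scan by row key lets HBase skip rows outside the range, while a column filter still has to examine every row the scan visits.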
(Very) Unscientific Benchmarks
9 queries, run in CDH Quickstart VM
Hardware
MacBook Pro Retina, mid 2012
16GB RAM,
4GB for VM (VMware 5),
Intel i7 2.6GHz quad-core processor
No other load on system during queries
Pseudo-cluster + Impala daemons
CDH 4.2, Impala 1.0
Benchmarks (cont'd)
Impala vs. Hive performance
(from simple projection queries to
multiple joins, aggregation, multiple
predicates, and order by)
"TPC-DS" sample dataset
(http://www.tpc.org/tpcds/)
Query "A"
select
c.c_first_name,
c.c_last_name
from customer c
limit 50;
Query "B"
select
   c.c_first_name,
   c.c_last_name,
   ca.ca_city,
   ca.ca_county,
   ca.ca_state
from customer c
   join customer_address ca
on c.c_current_addr_sk = ca.ca_address_sk
limit 50;
Query "C"
select
   c.c_first_name,
   c.c_last_name,
   ca.ca_city,
   ca.ca_county,
   ca.ca_state
from customer c
   join customer_address ca
on c.c_current_addr_sk = ca.ca_address_sk
where lower(c.c_last_name) like 'smi%'
limit 50;
Query "D"
select distinct cd_credit_rating
from customer_demographics;
Query "E"
select
   cd_credit_rating,
   count(*)
from customer_demographics
group by cd_credit_rating;
Query "F"
select
   c.c_first_name,
   c.c_last_name,
   ca.ca_city,
   ca.ca_county,
   ca.ca_state,
   cd.cd_marital_status,
   cd.cd_education_status
from customer c
   join customer_address ca
       on c.c_current_addr_sk = ca.ca_address_sk
   join customer_demographics cd
       on c.c_current_cdemo_sk = cd.cd_demo_sk
where
   lower(c.c_last_name) like 'smi%' and
   cd.cd_credit_rating in ('Unknown', 'High Risk')
limit 50;
Query "G"
select
   count(c.c_customer_sk)
from customer c
   join customer_address ca
       on c.c_current_addr_sk = ca.ca_address_sk
   join customer_demographics cd
       on c.c_current_cdemo_sk = cd.cd_demo_sk
where
   ca.ca_zip in ('20191', '20194') and
   cd.cd_credit_rating in ('Unknown', 'High Risk');
Query "H"
select
   c.c_first_name,
   c.c_last_name,
   ca.ca_city,
   ca.ca_county,
   ca.ca_state,
   cd.cd_marital_status,
   cd.cd_education_status
from customer c
   join customer_address ca
       on c.c_current_addr_sk = ca.ca_address_sk
   join customer_demographics cd
       on c.c_current_cdemo_sk = cd.cd_demo_sk
where
   ca.ca_zip in ('20191', '20194') and
   cd.cd_credit_rating in ('Unknown', 'High Risk')
limit 100;
Query "TPC-DS"
select
  i_item_id,
  s_state,
  avg(ss_quantity) agg1,
  avg(ss_list_price) agg2,
  avg(ss_coupon_amt) agg3,
  avg(ss_sales_price) agg4
from store_sales
join date_dim
   on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
join item
   on (store_sales.ss_item_sk = item.i_item_sk)
join customer_demographics
   on (store_sales.ss_cdemo_sk = customer_demographics.cd_demo_sk)
join store
   on (store_sales.ss_store_sk = store.s_store_sk)
where
  cd_gender = 'M' and
  cd_marital_status = 'S' and
  cd_education_status = 'College' and
  d_year = 2002 and
  s_state in ('TN','SD', 'SD', 'SD', 'SD', 'SD')
group by
  i_item_id,
  s_state
order by
  i_item_id,
  s_state
limit 100;
Query    Hive (sec)   # M/R jobs   Impala (sec)   x Hive perf.
A        13.8         1            0.25           54
B        30.0         1            0.41           73
C        33.3         1            0.42           79
D        23.2         1            0.64           36
E        21.6         1            0.62           35
F        59.1         2            1.96           30
G        78.5         3            1.56           50
H        59.6         2            1.89           32
TPC-DS   204.5        6            3.23           63
(remember, unscientific...)
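The "x Hive perf." column appears to be the Hive time divided by the Impala time. A small Python check of that arithmetic, using the timings as printed on the slide; the `speedup` helper is just for illustration. Recomputing from the rounded timings gives 55 for query A rather than the slide's 54, possibly because the slide was computed from unrounded measurements.

```python
# (query, Hive seconds, Impala seconds) as printed in the benchmark table
timings = [
    ("A", 13.8, 0.25),
    ("B", 30.0, 0.41),
    ("C", 33.3, 0.42),
    ("D", 23.2, 0.64),
    ("E", 21.6, 0.62),
    ("F", 59.1, 1.96),
    ("G", 78.5, 1.56),
    ("H", 59.6, 1.89),
    ("TPC-DS", 204.5, 3.23),
]

def speedup(hive_sec, impala_sec):
    """How many times faster Impala was than Hive for one query."""
    return hive_sec / impala_sec

ratios = {query: round(speedup(hive, impala)) for query, hive, impala in timings}
for query, ratio in ratios.items():
    print(f"{query}: {ratio}x")
```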
Cloudera Impala, updated for v1.0
Architecture
Two daemons
impalad
statestored
impalad on each HDFS data node
statestored - cluster metadata
Thrift APIs, ODBC, JDBC
impalad
Query execution
Query coordination
Query planning
impalad
Query Coordinator
Query Planner
Query Executor
HDFS DataNode
HBase RegionServer
Queries performed in-memory
Intermediate data never hits disk!
Data streamed to clients
Execution engine:
C++
runtime code generation
intrinsics for optimization
statestored
Cluster membership
Acts as a cluster
monitor
Not a SPOF
(single point of failure)
Metadata
Impala uses Hive metastore
Daemons cache metadata
REFRESH when table
definition/data change
Create tables in Hive or Impala
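Why a REFRESH is needed can be shown with a toy model of a daemon that caches metastore contents: changes made elsewhere stay invisible until the cache is reloaded. The `Metastore` and `Daemon` classes below are hypothetical stand-ins and do not reflect Impala's internals.

```python
class Metastore:
    """Stand-in for the Hive metastore: table name -> list of columns."""
    def __init__(self):
        self.tables = {}

class Daemon:
    """Stand-in for a daemon that caches metastore contents."""
    def __init__(self, metastore):
        self.metastore = metastore
        self.cache = {}
        self.refresh()

    def refresh(self):
        # Re-read everything; copy so later metastore changes are not visible
        self.cache = {
            name: list(cols) for name, cols in self.metastore.tables.items()
        }

    def columns(self, table):
        return self.cache[table]

metastore = Metastore()
metastore.tables["customer"] = ["c_first_name"]
daemon = Daemon(metastore)

# The table definition changes outside the daemon (e.g. via Hive)
metastore.tables["customer"].append("c_last_name")
stale = daemon.columns("customer")   # still the old definition

daemon.refresh()
fresh = daemon.columns("customer")   # now includes the new column
print(stale, fresh)
```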
Next up - how queries work...
[Diagram: a Client (SQL query), the Statestore (cluster monitoring), and the Hive Metastore (table/database metadata) connected to three impalad nodes; each impalad contains a Query Coordinator, Query Planner, and Query Executor, alongside an HDFS DataNode and HBase RegionServer]
Read directly from disk
Short-circuit reads
Bypass HDFS DataNode
(avoids overhead of HDFS API)
[Diagram: impalad (Query Coordinator, Query Planner, Query Executor) reads directly from disk on the Local Filesystem, shown alongside the HBase Region Server and HDFS DataNode]
Current Limitations
(as of version 1.0.1)
No join order optimization
No custom file formats, SerDes or UDFs
Limit required when using ORDER BY
Joins limited by aggregate memory of cluster
("put larger table on left")
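One way to see why join size is bounded by memory, and why the larger table goes on the left: a hash join holds one whole input in memory and streams the other. The sketch below assumes the hash table is built from the right-hand table, which is an assumption for illustration and not something the slides spell out; the function and the sample rows are made up.

```python
def hash_join(left_rows, right_rows, left_key, right_key):
    """Equi-join: build a hash table on the right input, stream the left."""
    table = {}
    for row in right_rows:            # the whole right side is held in memory
        table.setdefault(row[right_key], []).append(row)
    for row in left_rows:             # the left side is read one row at a time
        for match in table.get(row[left_key], []):
            yield {**row, **match}

customers = [
    {"c_last_name": "Smith", "c_current_addr_sk": 1},
    {"c_last_name": "Jones", "c_current_addr_sk": 2},
    {"c_last_name": "Lee", "c_current_addr_sk": 9},   # no matching address
]
addresses = [
    {"ca_address_sk": 1, "ca_city": "Reston"},
    {"ca_address_sk": 2, "ca_city": "Herndon"},
]
joined = list(hash_join(customers, addresses,
                        "c_current_addr_sk", "ca_address_sk"))
print(joined)
```

Memory use grows with the right-hand input only, so under this assumption the smaller table belongs on the right.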
Current Limitations
(as of version 1.0.1)
No advanced data structures
(arrays, maps, json, etc.)
Only basic DDL (otherwise do in Hive)
Limited file formats and compression
(though probably fine for most people)
Future...
Structure types (structs,
arrays, maps, json, etc.)
DDL support
Additional file formats &
compression support
"Performance"
Join optimization
(e.g. cost-based)
UDFs (???)
YARN integration
Fault-tolerance (???)
Comparing Impala to Dremel
"Dremel is a scalable, interactive ad-hoc
query system for analysis of read-only
nested data. By combining multi-level
execution trees and columnar data layout, it
is capable of running aggregation queries
over trillion-row tables in seconds. The
system scales to thousands of CPUs and
petabytes of data, and has thousands of
users at Google."
- http://research.google.com/pubs/pub36632.html
Comparing Impala to Dremel
Impala = Dremel features circa 2010 + join
support, assuming columnar data format
(but, Google doesn't stand still...)
Dremel is production, mature
Basis for Google's BigQuery
Comparing Impala to Hive
Hive uses Map/Reduce -> high latency
Impala is in-memory, low-
latency query engine
Impala sacrifices fault tolerance
for performance
Comparing Impala to Drill
Apache Drill
Based on Dremel
In early stages...
Comparing Impala to Drill
"Apache Drill is an open-source software framework that supports
data-intensive distributed applications for interactive analysis of large-
scale datasets. Drill is the open source version of Google's Dremel
system which is available as an IaaS service called Google BigQuery. One
explicitly stated design goal is that Drill is able to scale to 10,000 servers
or more and to be able to process petabytes of data and trillions of
records in seconds. Currently, Drill is incubating at Apache."
- http://incubator.apache.org/drill/drill_overview.html
Comparing Impala to Stinger
"The Stinger Initiative is a collection of
development threads in the Hive community
that will deliver 100X performance
improvements as well as SQL compatibility."
- http://hortonworks.com/stinger/
Comparing Impala to Stinger
Stinger
Improve Hive performance (e.g. optimize execution plan)
Support for analytics (e.g. OVER clause, window functions)
TEZ framework to optimize execution
Columnar file format
http://hortonworks.com/stinger/
Stinger Phase 1 performance...
(Stinger phase 1 is really just Hive 0.11)
remember, these numbers are
non-scientific micro-benchmarks!
Same 9 queries (as w/ Impala), run
in HortonWorks Sandbox VM
Hardware (same as w/ Impala)
MacBook Pro Retina, mid 2012
16GB RAM,
4GB for VM (VMware 5),
Intel i7 2.6GHz quad-core processor
No other load on system during queries
HortonWorks Data Platform (HDP) 1.3
Running pseudo-cluster
Query    Hive (sec)   # M/R jobs   Stinger Phase 1 (sec)   # M/R jobs   x Hive perf.
A        13.8         1            10.0                    1            1.4
B        30.0         1            15.8                    1            1.9
C        33.3         1            14.1                    1            2.4
D        23.2         1            18.7                    1            1.2
E        21.6         1            19.7                    1            1.1
F        59.1         2            34.3                    1            1.7
G        78.5         3            35.2                    1            2.2
H        59.6         2            31.5                    1            1.9
TPC-DS   204.5        6            37.2                    1            5.5
(remember, unscientific...)
Query    Stinger Phase 1 (sec)   Impala (sec)   x Stinger perf.
A        10.0                    0.25           39
B        15.8                    0.41           38
C        14.1                    0.42           33
D        18.7                    0.64           29
E        19.7                    0.62           32
F        34.3                    1.96           18
G        35.2                    1.56           23
H        31.5                    1.89           17
TPC-DS   37.2                    3.23           12
(remember, unscientific...)
Impala Review
In-memory, distributed
SQL query engine
Integrates into
existing HDFS
Not Map/Reduce
Focus on
performance
(native code)
Competition...
Interactive data
analysis
References
Google Dremel - http://research.google.com/pubs/pub36632.html
Apache Drill - http://incubator.apache.org/drill/
TPC-DS dataset - http://www.tpc.org/tpcds/
Stinger Initiative - http://hortonworks.com/blog/100x-faster-hive/
http://hortonworks.com/stinger/
Cloudera Impala resources
http://www.cloudera.com/content/support/en/documentation/cloudera-impala/cloudera-impala-documentation-v1-latest.html
Cloudera Impala: Real-Time Queries in Apache Hadoop, For Real
http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/
Photo Attributions
Impala - http://www.flickr.com/photos/gerardstolk/5897570970/
Measuring tape - http://www.morguefile.com/archive/display/24850
Bridge frame - http://www.morguefile.com/archive/display/9699
Balance - http://www.morguefile.com/archive/display/93433
* All others are iStockPhoto (I paid for them...)
My Info
twitter.com/sleberknight www.sleberknight.com/blog
scott dot leberknight at gmail dot com
