1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Row/Column-level
Security in SQL
for Apache Spark
Dongjoon Hyun – Software Engineer
Bikas Saha – Software Engineer
April 2017
Who am I
Dongjoon Hyun
• Software Engineer @ Hortonworks
• Apache REEF PMC member and committer
• Apache Spark project contributor
• https://github.com/dongjoon-hyun
Agenda
Security Issues
Goals
Components
How it works
Demo
Security
• One of the fundamental features for enterprise adoption
– Multi-tenancy: Billing team / Data science team / Marketing teams
• Row- and column-level access control for SQL users
– Row filtering
– Column masking
• Must enforce shared policies across various SQL engines simultaneously
– E.g. Apache Spark 2.1/1.6 and Apache Hive 2.1
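As a rough illustration of the two controls named above, the following sketch applies a row filter and a column mask to in-memory records in plain Python. The sample rows, the `region` filter, and the `mask_ssn` helper are hypothetical and only show the idea. In the approach described in this deck, the policies live in Apache Ranger and are enforced on the Hive/LLAP side, not in user code.

```python
# Hypothetical sample rows; not taken from the slides.
customers = [
    {"name": "Alice", "region": "US", "ssn": "123-45-6789"},
    {"name": "Bob", "region": "EU", "ssn": "987-65-4321"},
]


def mask_ssn(value):
    """Column masking: hide everything except the last four characters."""
    return "x" * (len(value) - 4) + value[-4:]


def apply_policy(rows, row_filter, column_masks):
    """Return only rows passing row_filter, with masked columns rewritten."""
    result = []
    for row in rows:
        if not row_filter(row):
            continue  # row filtering: the user never sees this row
        masked = dict(row)  # copy so the stored data is left untouched
        for column, mask in column_masks.items():
            masked[column] = mask(masked[column])
        result.append(masked)
    return result


visible = apply_policy(
    customers,
    row_filter=lambda row: row["region"] == "US",
    column_masks={"ssn": mask_ssn},
)
print(visible)
```

The point of enforcing this centrally is that every SQL engine sees the same `visible` result for the same user, without each application re-implementing the filter.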
Issue 1
Security starts from storage
• Spark reads all or nothing
– Directory/File-based permissions are insufficient
• Permission 777 on warehouse?
Issue 2
Overhead in setting up and maintaining security policies
• Spark apps must be rewritten
– Special data source tables
• Duplicated data
– Filtered rows
– Removed or masked columns
• SQL views
– Maintained manually
Goals
Goal 1: Spark SQL Apps
Support row/column-level security with batch apps
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .enableHiveSupport() \
    .getOrCreate()
spark.sql("select * from db_common.t_customer").show()
db_common
t_customer
…
Goal 2: Spark shells (1/2)
Support row/column-level security in all shells
spark-shell
pyspark
Goal 2: Spark shells (2/2)
Support row/column-level security in all shells
sparkR
spark-sql
Goal 3: Spark Thrift Server
Support row/column-level security with Spark Thrift Server
Login as `hive`
Login as `spark`
Components
What is required?
• Kerberos
• Apache Hadoop (HDFS/YARN)
• Apache Ranger
• Apache Hive (LLAP)
• Spark-LLAP: A library and patches to integrate the above (focus here)
Apache Ranger
Provides a standard authorization method across many Hadoop components
https://hortonworks.com/apache/ranger/#section_2
Apache Hive
• Hive Ranger Plugin & Policies
– Support row/column-level security
• LLAP Daemon (GA in HDP 2.6)
– Persistent query servers with intelligent in-memory caching
– Provide a secure relational datanode view of the data
Trusted Service
Spark-LLAP (Technical Preview)
Milestone
HDP 2.5: Spark-LLAP for Spark 1.6
• User should use LlapContext
• Support Scala/Java and spark-shell
var lc = new LlapContext(sc)
lc.sql("select * from t").show
HDP 2.6: Spark-LLAP for Spark 2.1
• No need to rewrite SQL-related code
• Support all languages and shells
Next: Spark-LLAP for Spark 2.1
• Support YARN cluster mode
Spark-LLAP GitHub (Apache License)
How it works
How it works – Overview
Case: spark-submit with YARN cluster mode
Components: User, Admin, Spark, Hive (HiveServer2), LLAP, Ranger
1. Manage policies
2. Launch
3. Get delegation token
4. Get metadata
5. Get data locations
6. Read filtered/masked data
7. Monitor audits
Unnumbered arrow in the diagram: Authorize
How it works – Overview (same diagram with legend)
Legend: Existing Infra / New for Spark / New for Hive (GA)
Hive
Enable LLAP
Admin – Manage
Hive Database: db_common
Table: *
Hive Column: *
Select User: spark
Permissions: SELECT
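To make the policy fields above concrete, here is a minimal sketch of how such a wildcard policy could be evaluated. It is an assumption-laden toy, not Ranger's implementation: the dict layout and the `is_allowed` function are hypothetical, and only the field values (`db_common`, `*`, `spark`, `SELECT`) come from the slide.

```python
from fnmatch import fnmatchcase

# Field values mirror the slide; the dict layout itself is hypothetical.
policy = {
    "database": "db_common",
    "table": "*",
    "column": "*",
    "users": {"spark"},
    "permissions": {"SELECT"},
}


def is_allowed(policy, user, permission, database, table, column):
    """Return True if the policy grants `permission` on the resource to `user`."""
    return (
        user in policy["users"]
        and permission in policy["permissions"]
        and fnmatchcase(database, policy["database"])  # "*" matches any name
        and fnmatchcase(table, policy["table"])
        and fnmatchcase(column, policy["column"])
    )


print(is_allowed(policy, "spark", "SELECT", "db_common", "t_customer", "name"))
print(is_allowed(policy, "spark", "UPDATE", "db_common", "t_customer", "name"))
print(is_allowed(policy, "hive", "SELECT", "db_common", "t_customer", "name"))
```

Only the first call is granted: the second asks for a permission the policy does not list, and the third comes from a user the policy does not name.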
Admin – Audit
User
Launch Spark jobs
spark-submit \
    --jars spark-llap.jar \
    --conf spark.sql.hive.llap=true \
    --conf spark.yarn.security.credentials.hiveserver2.enabled=true \
    --master yarn \
    --deploy-mode cluster \
    sql.py
– spark.sql.hive.llap=true: easy to turn on/off
– spark.yarn.security.credentials.hiveserver2.enabled=true: only used for YARN cluster mode
Note: There are more static configurations related to LLAP
The `--packages` option is supported, too
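The on/off nature of these options can be sketched as a small helper that assembles the command line. The function `build_spark_submit` and its parameters are hypothetical; the flags and configuration keys are the ones shown on the slide, and any additional static LLAP configuration the slide alludes to is deliberately left out.

```python
def build_spark_submit(app, llap_enabled=True, cluster_mode=True,
                       jar="spark-llap.jar"):
    """Assemble spark-submit arguments using the options shown on the slide."""
    args = ["spark-submit", "--master", "yarn"]
    if cluster_mode:
        args += ["--deploy-mode", "cluster"]
    if llap_enabled:
        args += ["--jars", jar, "--conf", "spark.sql.hive.llap=true"]
        if cluster_mode:
            # The slide marks this setting as only used for YARN cluster mode.
            args += ["--conf",
                     "spark.yarn.security.credentials.hiveserver2.enabled=true"]
    args.append(app)
    return args


with_llap = build_spark_submit("sql.py")
without_llap = build_spark_submit("sql.py", llap_enabled=False)
print(" ".join(with_llap))
print(" ".join(without_llap))
```

The application script `sql.py` is identical in both commands, which is the point of the slide: the feature is switched by configuration, not by code changes.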
Spark
Get delegation tokens
• HDFS Delegation Token (existing)
– HDFSCredentialProvider gets it from the NameNode
• Hive Metastore Delegation Token (existing)
– HiveCredentialProvider gets it from Hive Metastore
• HiveServer2 Delegation Token (Spark-LLAP)
– HiveServer2CredentialProvider gets it from HiveServer2
Note: Spark manages token renewal
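The provider pattern above can be sketched as follows. This is a simplified Python stand-in, not Spark's Scala code: the classes, `obtain_token`, and `collect_tokens` are hypothetical, the tokens are fake strings, and treating a missing configuration key as "disabled" is a simplification made for this sketch. Only the `spark.yarn.security.credentials.<service>.enabled` key pattern follows the configuration shown earlier in the deck.

```python
class CredentialProvider:
    """Hypothetical base class: one provider per secured service."""

    service_name = "unknown"

    def obtain_token(self, user):
        # A real provider would contact the service over Kerberos;
        # this stand-in just fabricates a labelled string.
        return f"{self.service_name}-token-for-{user}"


class HdfsProvider(CredentialProvider):
    service_name = "hdfs"


class HiveMetastoreProvider(CredentialProvider):
    service_name = "hive"


class HiveServer2Provider(CredentialProvider):
    service_name = "hiveserver2"


def collect_tokens(providers, conf, user):
    """Ask every enabled provider for a token and gather them by service."""
    tokens = {}
    for provider in providers:
        key = f"spark.yarn.security.credentials.{provider.service_name}.enabled"
        if conf.get(key) == "true":  # missing key counts as disabled here
            tokens[provider.service_name] = provider.obtain_token(user)
    return tokens


providers = [HdfsProvider(), HiveMetastoreProvider(), HiveServer2Provider()]
conf = {
    "spark.yarn.security.credentials.hdfs.enabled": "true",
    "spark.yarn.security.credentials.hive.enabled": "true",
    "spark.yarn.security.credentials.hiveserver2.enabled": "true",
}
tokens = collect_tokens(providers, conf, "spark")
print(sorted(tokens))
```

Adding a new secured service then means adding one provider class and one configuration key, which is how the HiveServer2 provider slots in next to the two existing ones.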
Spark
LlapMetastoreCatalog: Replaces MetastoreRelation with LlapRelation
SELECT gender, count(*)
FROM db_common.t_customer
WHERE name LIKE '%Obama'
GROUP BY gender
Parsed Logical Plan
  Aggregate: gender
    Filter: name like %Obama
      UnresolvedRelation
Analyzed Logical Plan
  Aggregate: gender
    Filter: name like %Obama
      SubqueryAlias
        LlapRelation
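The relation swap can be pictured as a rewrite over the plan tree. The sketch below is a toy model in Python, not Catalyst: `Node`, `replace_relation`, and `render` are hypothetical names, and the plan is reduced to labelled nodes taken from the slide.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A plan node reduced to a label and its children."""
    name: str
    children: List["Node"] = field(default_factory=list)


def replace_relation(node, old="MetastoreRelation", new="LlapRelation"):
    """Return a copy of the plan with every `old` node renamed to `new`."""
    name = new if node.name == old else node.name
    return Node(name, [replace_relation(c, old, new) for c in node.children])


def render(node, depth=0):
    """Flatten the tree into indented lines, root first."""
    lines = ["  " * depth + node.name]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


plan = Node("Aggregate: gender", [
    Node("Filter: name like %Obama", [
        Node("SubqueryAlias", [Node("MetastoreRelation")]),
    ]),
])
rewritten = replace_relation(plan)
print("\n".join(render(rewritten)))
```

Because only the leaf changes, everything above it (the filter and the aggregate) is untouched, which is why existing SQL does not need to be rewritten.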
Spark
LlapMetastoreCatalog: Replaces MetastoreRelation with LlapRelation
Comparison (screenshots): Without Spark-LLAP / With Spark-LLAP
Spark
LlapRelation supports predicate pushdown during optimization
Analyzed Logical Plan
  Aggregate: gender
    Filter: name like %Obama
      SubqueryAlias
        LlapRelation
Optimized Logical Plan
  Aggregate: gender
    Project: gender
      Filter: EndsWith(name,Obama)
        LlapRelation
LlapRelation supports predicate pushdown during optimization (continued)
Physical Plan
  …
    HashAggregate: gender
      Project: gender
        Filter: EndsWith(name, Obama)
          Scan LlapRelation
            PushedFilter: StringEndsWith(name, Obama)
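The two steps visible in these plans, rewriting `LIKE '%Obama'` into an `EndsWith` predicate and then handing supported predicates to the data source, can be sketched in plain Python. This is an illustrative approximation, not Spark's optimizer: `simplify_like`, `split_filters`, and the `SUPPORTED_BY_SOURCE` set are hypothetical, and patterns with `_`, escapes, or an inner `%` are simply left as `Like`.

```python
def simplify_like(column, pattern):
    """Translate simple LIKE patterns into cheaper string predicates."""
    body = pattern.strip("%")
    if not body or "%" in body or "_" in body or "\\" in body:
        return ("Like", column, pattern)  # too complex for this sketch
    leading, trailing = pattern.startswith("%"), pattern.endswith("%")
    if leading and trailing:
        return ("Contains", column, body)
    if leading:
        return ("EndsWith", column, body)
    if trailing:
        return ("StartsWith", column, body)
    return ("EqualTo", column, body)


# Predicates the (hypothetical) source is able to evaluate itself.
SUPPORTED_BY_SOURCE = {"EqualTo", "StartsWith", "EndsWith", "Contains"}


def split_filters(filters):
    """Separate filters the source can evaluate from those the engine keeps."""
    pushed = [f for f in filters if f[0] in SUPPORTED_BY_SOURCE]
    kept = [f for f in filters if f[0] not in SUPPORTED_BY_SOURCE]
    return pushed, kept


predicate = simplify_like("name", "%Obama")
pushed, kept = split_filters([predicate, simplify_like("name", "O_ama")])
print(predicate)
print(pushed, kept)
```

Note that the physical plan on the slide still lists `Filter: EndsWith(name, Obama)` above the scan in addition to the pushed filter, so pushing a predicate down does not by itself remove it from the engine side.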
Spark
Read filtered and masked data from LLAP
jobConf.set("hive.llap.zk.registry.user", "hive")
jobConf.set("llap.if.hs2.connection", parameters("url"))
jobConf.set("llap.if.query", queryString)
…
// Create Hadoop RDD and convert LLAP Row into Spark Row
sc.sparkContext
  .hadoopRDD(…)
  .mapPartitionsWithInputSplit(…)
Demo (Video)
Some related SPARK issues
• SPARK-14743 Add a configurable credential manager for Spark running on YARN
• SPARK-15777 Catalog federation (Open)
• SPARK-17767 Spark SQL ExternalCatalog API custom implementation support (Closed as Later)
• SPARK-17819 Support default database in connection URIs for Spark Thrift Server
• SPARK-18517 DROP TABLE IF EXISTS should not warn for non-exist
• SPARK-18840 Avoid throw exception when getting token renewal interval in non HDFS security env.
• SPARK-18857 Don't use `Iterator.duplicate` in STS
• SPARK-19021 Generalize HDFSCredentialProvider to support non HDFS security filesystems
• SPARK-19038 Avoid overwriting keytab configuration in yarn-client
• SPARK-19179 Change spark.yarn.access.namenodes config and update docs
• SPARK-19970 Table owner should be USER instead of PRINCIPAL
• SPARK-19995 Register tokens to current UGI to avoid re-issuing of tokens in yarn client mode
Summary
• Support row/column-level security with
– Spark apps
– Spark shells
– Spark Thrift Server
• You can use the existing Spark 2.x SQL apps and scripts
• Easy to turn on/off with configuration only
• Ranger enforces policies on Hive and Spark simultaneously and consistently
Spark-LLAP with HDP 2.6 is a Technical Preview (TP)
Acknowledgement
• Apache Hive / Apache Spark / Apache Ranger
• Bikas Saha, Saisai Shao, Jason Dere, Thejas Nair, Zhan Zhang, and many others
Thank you
