SGX-PySpark: Secure Distributed Data Analytics
Do Le Quoc, Franz Gregor, Jatinder Singh, Christof Fetzer
ABSTRACT
Data analytics is central to modern online services, particularly those that are data-driven. Often this entails processing large-scale datasets which may contain private, personal and sensitive information relating to individuals and organisations. Particular challenges arise where the cloud is used to store and process the sensitive data. In such settings, security and privacy concerns become paramount, as the cloud provider is trusted to guarantee the security of the services they offer, including data confidentiality. Therefore, the issue this work tackles is “How to securely perform data analytics in a public cloud?”

To help answer this question, we design and implement SGX-PySpark, a secure distributed data analytics system which relies on a trusted execution environment (TEE) such as Intel SGX to provide strong security guarantees. To build SGX-PySpark, we integrate PySpark, a framework widely used in industry for data analytics that supports a wide range of queries, with SCONE, a shielded execution framework using Intel SGX.

CCS CONCEPTS
• Information systems → Data analytics; • Security and privacy → Distributed systems security.

KEYWORDS
Confidential computing; data analytics; security; distributed system

ACM Reference Format:
Do Le Quoc, Franz Gregor, Jatinder Singh, and Christof Fetzer. 2019. SGX-PySpark: Secure Distributed Data Analytics. In Proceedings of the 2019 World Wide Web Conference (WWW ’19), May 13–17, 2019, San Francisco, CA, USA. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3308558.3314129

This paper is published under the Creative Commons Attribution 4.0 International (CC-BY 4.0) license. Authors reserve their rights to disseminate the work on their personal and corporate Web sites with the appropriate attribution.
WWW ’19, May 13–17, 2019, San Francisco, CA, USA
© 2019 IW3C2 (International World Wide Web Conference Committee), published under Creative Commons CC-BY 4.0 License.
ACM ISBN 978-1-4503-6674-8/19/05.
https://doi.org/10.1145/3308558.3314129

1 INTRODUCTION
Cloud-based services are used to collect, process and analyze large amounts of users’ personal data, some of it highly sensitive, such as that relating to personal finances, political views, health, and so forth. Indeed, we have seen increasing attention of regulators on issues regarding the way in which personal data is handled and processed, the EU’s General Data Protection Regulation being a case in point. Thus, confidentiality and integrity of the data processing in clouds are becoming more important, not least because of increased demands for accountability regarding service providers, and the potential serious legal consequences (and fines) for data mishandling, mismanagement and leakage, and more generally, for failing to implement the appropriate security measures [9]. Service providers must ensure that data is always protected, i.e., at rest, during transmission, and during computation. Many organisations make use of public cloud services for the processing to reduce time and computation cost. This setting is vulnerable to many security threats, e.g., data breaches [10]. Concerns are compounded when we consider attacks from inside the cloud provider, where attackers might have root privileges and/or physical access to machines deployed at the service providers’ premises. Therefore, to protect the sensitive data and the analytics computation over the data, service providers cannot rely solely on operating system access control or on their security policy-based mechanisms.

A promising approach to helping resolve these security challenges is to make use of Trusted Execution Environments (TEEs), such as Intel Software Guard Extensions (SGX). Intel SGX protects the confidentiality and integrity of application code and data even against privileged attackers with root access and physical access. In general, Intel SGX provides isolated secure memory areas called enclaves, where code and data can be executed safely. These security guarantees are solely provided by the CPU, thus even if system software is compromised, the attacker cannot access the enclave’s content. This approach supports data analytics at processor speeds while ensuring the security guarantee for both computation and sensitive data.

While promising at first glance, building a practical secure data analytics system using TEEs, e.g., Intel SGX, requires dealing with several challenges. (A) In the current version, Intel SGX supports only a limited memory space (∼94MB) for applications running inside enclaves. Meanwhile, most big data analytics systems (e.g., Hadoop and Apache Spark [1]) are extremely memory-intensive, since these systems are almost always based on the Java Virtual Machine (JVM). (B) Intel SGX still suffers from side-channel attacks [12]. These side-channel attacks happen both at the memory level [12, 16] and at the network level [16]. (C) Deployment and bootstrapping of a data analytics framework to run inside enclaves is challenging. Securely transferring configuration secrets such as certificates, encryption keys and passwords to start the framework inside enclaves is complicated because these secrets need to be protected on the network as well as securely moved into the enclaves. (D) Typically, Intel SGX requires users to heavily modify the source code of their application to run inside enclaves. Thus, transparently supporting an unmodified distributed data analytics framework to run inside enclaves is not a trivial task.

In the context of building secure data analytics systems using Intel SGX, VC3 [13] is one of the first works that applied SGX technology to the Hadoop MapReduce framework. VC3 handles challenges
(A) and (D) by using C/C++ to implement the framework and support unmodified Hadoop. However, challenge (B) is outside the scope of VC3. Recently, Opaque [16] overcame this issue by introducing an oblivious mechanism to hide access patterns at the network level. Opaque provides secure data analytics on the Apache Spark framework using Intel SGX. It deals with (A) by reimplementing SQL operators for the Spark SQL Catalyst engine [6] in C++. These operators run inside enclaves and communicate with the Scala code of Spark using a JNI interface. Opaque supports common operators including map, reduce, filter, sort, aggregation, and join, but not all operators of Apache Spark. This means that it does not handle challenge (D) completely. In addition, it does not support remote attestation to verify the integrity of the code running inside SGX enclaves. It also does not handle challenge (C) since it does not provide a mechanism for transferring secrets to code executing inside enclaves. Finally, Opaque requires the Spark master/driver to run at the client side or in a trusted domain. This might significantly affect the performance of the system.

In this work, we overcome these limitations by building a secure data analytics system called SGX-PySpark. We handle challenge (A) by using PySpark [3], a system built on top of Apache Spark to support data analytics using Python processes. Instead of running a whole JVM inside an enclave to secure Apache Spark or reimplementing operators in C/C++ as Opaque does, we run only the Python processes inside enclaves since these processes perform the analytics over encrypted data. Thus, our system supports the out-of-the-box operators of PySpark (challenge (D)), i.e., users do not need to modify their source code. To run Python processes inside Intel SGX enclaves, our system makes use of SCONE [4, 7], a shielded execution framework which enables unmodified applications to run inside Intel SGX enclaves.

In addition, SGX-PySpark, with the help of SCONE, supports a remote attestation mechanism to ensure the code and data running inside enclaves are correct and not modified by an attacker. SGX-PySpark also copes with challenge (C) by providing a mechanism to securely transfer secrets (keys and certificates) to Python processes running inside enclaves. To handle challenge (B), SGX-PySpark protects its execution against side-channel attacks at the memory level using a mechanism integrated with SCONE, called Varys [12]. Finally, the design of SGX-PySpark allows users to run the Spark driver/master in the same infrastructure as the workers.

2 BACKGROUND
2.1 Intel SGX
Intel SGX is an ISA extension, i.e., a set of special CPU instructions for Trusted Execution Environments (TEEs). These instructions enable applications to create enclaves, protected areas in the application’s address space that provide strong confidentiality and integrity guarantees against adversaries with privileged root access. Intel SGX enables trusted computing by isolating the environment of each enclave from untrusted applications outside the enclave. In addition, by offering the remote attestation mechanism, Intel SGX allows a remote party to attest the application executing inside an enclave [8].

The enclave memory is acquired from the Enclave Page Cache (EPC), a dedicated memory region protected by an on-chip Memory Encryption Engine (MEE). The MEE transparently encrypts cache lines on cache-line evictions and decrypts and verifies cache lines on cache-line loads. The EPC cannot be directly accessed by non-enclave applications, including operating systems. To support multiple enclaves on a system, the EPC is partitioned into 4KB pages which can be assigned to various enclaves. Currently, the size of the EPC is limited to 128MB, of which only ∼94MB can be used for user applications; the rest is used to store SGX metadata. Fortunately, SGX supports a secure paging mechanism to an unprotected memory region, even though the paging mechanism may introduce significant overheads.

The EPC is managed like the rest of the physical memory by an operating system (or a hypervisor in virtualized environments). The operating system makes use of SGX instructions to allocate and free EPC pages for enclaves. In addition, the operating system is supposed to expose the enclave services (creation and management) to applications. Since the operating system cannot be trusted, the SGX hardware verifies the correctness of EPC page allocations and denies any operations that would violate the security guarantees. For example, the SGX hardware will not allow the operating system to allocate the same EPC page to different enclaves.

2.2 SCONE
Our system builds on SCONE [7], a shielded execution framework that enables unmodified applications to run inside SGX enclaves. In the SCONE platform, the source code of an application is recompiled against a modified standard C library (SCONE libc) to facilitate the execution of system calls. The address space of the application stays within an enclave, and the application can only access the untrusted memory via the system call interface.

SCONE uses a compiler-based approach to prepare and build native applications for execution inside SGX enclaves. SCONE integrates its mechanism into the GNU Compiler Collection (GCC) toolchain to change the compilation process such that it builds position-independent, statically linked code, which is eventually linked with the starter program. Therefore, SCONE natively supports C/C++ applications. For Python applications, e.g., PySpark executors, we need to compile the CPython/PyPy interpreter with SCONE to run these Python processes inside SGX enclaves. Similarly, to run a Java application inside an enclave, we compile the JVM with SCONE.

2.3 PySpark
PySpark is built on top of Apache Spark [1] to provide a Python API for users. Thus, before explaining PySpark, it is useful to understand what Apache Spark is. Apache Spark [1] is an open-source large-scale data analytics framework. Today, it has become one of the most popular and widely used big data frameworks in both academia and industry. Compared to earlier frameworks such as Hadoop MapReduce, Spark is much faster in processing large-scale datasets since it enables in-memory computing, where intermediate data is cached in memory to reduce latency [11].

For the in-memory computation, Spark introduces its core abstraction, Resilient Distributed Datasets (RDDs) [15], for distributed data-parallel computing. An RDD is an immutable and fault-tolerant
Python process using Py4J [2]. The Python process creates a SparkContext object in the JVM and the SparkContext orchestrates the computation as the regular Spark framework (to reuse almost the

[Figure: diagram with components labelled Driver, Worker and Pipe; legend: Enclave, Java, Python, SCONE Lib]

the input data and upload the encrypted data into a distributed storage in an untrusted infrastructure (e.g., a public cloud). Thereafter, SGX-PySpark decrypts and processes the encrypted data inside enclaves in a distributed manner.
data, we protect them against malicious activities, i.e., ensure the integrity and confidentiality by running them inside enclaves with the help of the SCONE platform. Note that, in SGX-PySpark, both input data and computation (Python code) are encrypted before upload to the system. They are decrypted inside enclaves using keys transparently obtained from CAS (see §3.3).

4 DEMONSTRATIONS
In this section, we demonstrate how a user can securely perform data analytics using SGX-PySpark. For demonstration purposes, we consider a simple and classical workload, “wordcount”. Assume that the user wants to perform data analytics (e.g., wordcount) over sensitive input data.
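To make the "wordcount" workload concrete, the following is a minimal sketch in plain Python that mirrors the three stages of the classic RDD pipeline (flatMap, map, reduceByKey), so it runs without a Spark cluster. The function name `word_count`, the sample input and the file name `input.txt` are illustrative assumptions and are not taken from the SGX-PySpark demonstration; the encryption, attestation and enclave steps are not shown.

```python
from collections import defaultdict
from typing import Dict, Iterable


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Count word occurrences using the same stages as the RDD pipeline."""
    # flatMap stage: split every line into individual words.
    words = (word for line in lines for word in line.split())
    # map stage: emit one (word, 1) pair per occurrence.
    pairs = ((word, 1) for word in words)
    # reduceByKey stage: sum the counts of identical words.
    counts: Dict[str, int] = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)


# Equivalent PySpark RDD pipeline, assuming an existing SparkContext `sc`
# and a hypothetical input file "input.txt":
#   counts = (sc.textFile("input.txt")
#               .flatMap(lambda line: line.split())
#               .map(lambda word: (word, 1))
#               .reduceByKey(lambda a, b: a + b))

if __name__ == "__main__":
    sample = ["to be or not to be", "that is the question"]
    print(word_count(sample))
```

Because SGX-PySpark is described as supporting unmodified PySpark programs, a user's job would presumably look like the commented pipeline rather than the plain-Python stand-in; the stand-in only illustrates what each stage computes.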
grant agreements No. 777154 (ATOMSPHERE) and No. 780681 (LEGaTO).

REFERENCES
[1] Apache Spark. https://spark.apache.org. Accessed: Jan, 2019.
[2] Py4J. http://py4j.sourceforge.net. Accessed: Jan, 2019.
[3] PySpark. http://spark.apache.org/docs/2.2.0/api/python/pyspark.html. Accessed: Jan, 2019.
[4] Scontain Technology. https://sconedocs.github.io/. Accessed: Jan, 2019.
[5] TPC-H Benchmark. http://www.tpc.org/tpch/. Accessed: Jan, 2019.
[6] Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, and Matei Zaharia. 2015. Spark SQL: Relational Data Processing in Spark. In Proceedings of the International Conference on Management of Data (SIGMOD).
[7] Sergei Arnautov, Bohdan Trach, Franz Gregor, Thomas Knauth, Andre Martin, Christian Priebe, Joshua Lind, Divya Muthukumaran, Dan O’Keeffe, Mark L. Stillwell, David Goltzsche, Dave Eyers, Rüdiger Kapitza, Peter Pietzuch, and Christof Fetzer. 2016. SCONE: Secure Linux Containers with Intel SGX. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI).
[8] Victor Costan and Srinivas Devadas. 2016. Intel SGX Explained. Cryptology ePrint Archive, Report 2016/086.
[9] General Data Protection Regulation. 2016. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46. Official Journal of the European Union (OJ) (2016).
[10] Tim Greene. Biggest data breaches of 2015. https://www.networkworld.com/article/3011103/security/biggest-data-breaches-of-2015.html. Accessed: Jan, 2019.
[11] Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia. 2015. Learning Spark: Lightning-Fast Big Data Analysis. O’Reilly Media, Inc.
[12] Oleksii Oleksenko, Bohdan Trach, Robert Krahn, Mark Silberstein, and Christof Fetzer. 2018. Varys: Protecting SGX Enclaves from Practical Side-Channel Attacks. In 2018 USENIX Annual Technical Conference (USENIX ATC).
[13] Felix Schuster, Manuel Costa, Cédric Fournet, Christos Gkantsidis, Marcus Peinado, Gloria Mainar-Ruiz, and Mark Russinovich. 2015. VC3: Trustworthy Data Analytics in the Cloud Using SGX. In Proceedings of the Symposium on Security and Privacy (SP).
[14] Bohdan Trach, Alfred Krohmer, Franz Gregor, Sergei Arnautov, Pramod Bhatotia, and Christof Fetzer. 2018. ShieldBox: Secure Middleboxes using Shielded Execution. In Proceedings of the Symposium on SDN Research (SOSR).
[15] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient Distributed Datasets: A Fault Tolerant Abstraction for In-Memory Cluster Computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI).
[16] Wenting Zheng, Ankur Dave, Jethro G. Beekman, Raluca Ada Popa, Joseph E. Gonzalez, and Ion Stoica. 2017. Opaque: An Oblivious and Encrypted Distributed Analytics Platform. In Proceedings of the 14th USENIX Conference on Networked Systems Design and Implementation (NSDI).