Spark is a data processing system that can handle large data sets rapidly and spread
processing tasks across many machines, either on its own or in conjunction with other
distributed computing tools. These two characteristics are critical in the fields of big data
and machine learning, which require massive computing resources to process large data
sets. Spark also relieves developers of some of the programming burden associated with
these tasks by providing an easy-to-use API that abstracts away much of the grunt work of
distributed computing and big data processing.
Authentication confirms that a client is who they claim to be: the information they present
must exactly match the identity stored in our system or with a third-party data provider.
Authentication is typically not built into a framework, because the user is unknown until
they supply credentials, particularly in applications that let users communicate or enter
data. In Spark, authentication and authorization can be handled by the methods described
below. Because all of Spark's users have already registered, an authentication layer is
required to protect the system as a whole: before you can connect to the Spark API, you
must first pass Spark authentication, which comes in several flavors. Authorization, in turn,
ensures that an authenticated user can access only the services he or she has requested.
Authentication and authorization are perhaps the two most important players in keeping the
infrastructure secure, but a security strategy involves far more than that.
Authentication Handler
The authentication handler is the piece of software that manages the authentication process.
Its components are the Token, the Credential, the Adapter, and the Request Filter. First,
tokens sent by the requesting application are accepted and decoded for validation. When a
request arrives, the credentials are checked to confirm that the user's username and
password are correct. Tokens are then issued, either by reusing an existing token for the
user or by creating a new one for an existing user; if the credentials do not match, an
exception is thrown. If no authentication value is supplied with a request, the Request Filter
comes into play and authenticates the current user.
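The flow described above can be sketched in Python. This is a hypothetical illustration of the handler's components (the credential store, token issuance, and request filter are all invented for the example, not part of any Spark API):

```python
# Hypothetical sketch of the authentication-handler flow described above:
# credential check, token issuance (reuse or create), and a request filter.

USERS = {"alice": "s3cret"}   # illustrative credential store
SESSIONS = {}                 # token -> username

def issue_token(username: str, password: str) -> str:
    """Credential check: verify the username/password pair, then
    reuse the user's existing token or create a new one."""
    if USERS.get(username) != password:
        raise PermissionError("bad credentials")   # exception on mismatch
    for token, user in SESSIONS.items():
        if user == username:
            return token                           # existing token for this user
    token = f"tok-{len(SESSIONS) + 1}"             # new token for an existing user
    SESSIONS[token] = username
    return token

def request_filter(headers: dict) -> str:
    """Request filter: every incoming request must carry a valid token."""
    token = headers.get("Authorization")
    if token not in SESSIONS:
        raise PermissionError("unauthenticated")
    return SESSIONS[token]
```

A client would first call `issue_token` with its credentials, then attach the returned token to subsequent requests, which `request_filter` validates.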
Spark currently supports shared-secret authentication for its RPC channels. The
[Link] configuration parameter can be used to enable authentication. The exact
method for generating and distributing the shared secret is deployment-specific.
Unless stated otherwise below, the secret is defined by setting the [Link].secret
configuration option. In that case, all Spark applications and daemons share the same
secret, which limits the security of these deployments, especially on multi-tenant clusters.
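Assuming the shared-secret options from the Spark security documentation (`spark.authenticate` and `spark.authenticate.secret`), a minimal spark-defaults.conf fragment might look like this; the secret value is only a placeholder:

```properties
# Enable shared-secret authentication for Spark's internal RPC channels.
spark.authenticate         true
# The shared secret itself; on YARN and Kubernetes, Spark can instead
# generate a per-application secret automatically.
spark.authenticate.secret  <your-shared-secret>
```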
The REST Submission Server and MesosClusterDispatcher do not support this
authentication mechanism. All network access to the REST API and MesosClusterDispatcher
(ports 6066 and 7077 by default, respectively) should be restricted to hosts that are trusted
to submit jobs.
Authorization is required for each resource, and applications fall into this category as well.
As a result, almost every Spark application will have its own authorization configuration.
Authorization granularity extends to the database, table, and partition layers. However,
Spark's Hive integration does not support Grant and Revoke statements.
JWT (JSON Web Token)
A JSON Web Token consists of three parts: the header, the payload, and the signature.
The header has two fields: the hashing algorithm and the token type. The payload
carries all of the data we want to send, while the signature is the encoded header and
payload signed with a secret key. Combining these three elements yields the JWT.
When a user logs in with their credentials, the server verifies the request and returns a
token containing the user's identity; the client stores the token and uses it to access the
application. When the user later requests a resource, the token is attached to the
authorization header and sent to the server. If the token checks out, the server grants the
user access to the resource. A filter that implements the desired authentication method is
needed; Spark ships with no built-in authentication filters.
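The three-part structure described above can be sketched with Python's standard library alone. This is a minimal HS256 illustration of how header, payload, and signature combine, not a production JWT implementation (real deployments should use a vetted JWT library and validate claims such as expiry):

```python
import base64
import hashlib
import hmac
import json

def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWT requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def make_jwt(payload: dict, secret: bytes) -> str:
    """Build header.payload.signature with HMAC-SHA256 (HS256)."""
    header = {"alg": "HS256", "typ": "JWT"}  # hashing algorithm + token type
    signing_input = b64url(json.dumps(header).encode()) + "." + \
        b64url(json.dumps(payload).encode())
    sig = hmac.new(secret, signing_input.encode(), hashlib.sha256).digest()
    return signing_input + "." + b64url(sig)

def verify_jwt(token: str, secret: bytes) -> dict:
    """Recompute the signature; return the payload only if it matches."""
    signing_input, _, sig = token.rpartition(".")
    expected = hmac.new(secret, signing_input.encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(b64url(expected), sig):
        raise ValueError("signature mismatch")
    payload_b64 = signing_input.split(".")[1]
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)  # restore padding
    return json.loads(base64.urlsafe_b64decode(padded))

token = make_jwt({"sub": "alice"}, b"secret-key")
print(verify_jwt(token, b"secret-key"))  # -> {'sub': 'alice'}
```

Verifying with the wrong secret raises an exception, which mirrors the server rejecting a token that does not fit.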
Where an authentication filter is present, Spark also supports UI access control. ACLs can
be configured separately for each application. Spark distinguishes between "view"
permissions (who is allowed to see the application's UI) and "modify" permissions (who
can do things such as kill jobs in a running application). The JwtGenerator and JwtParser
interfaces are the two JWT interfaces in Spark that generate and parse tokens, respectively.
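As an illustration, a UI access-control setup along these lines might combine a servlet filter with view/modify ACLs. The filter class name below is hypothetical; the property names follow the Spark security documentation:

```properties
# Servlet filter implementing the chosen authentication method
# (com.example.MyAuthFilter is a hypothetical class).
spark.ui.filters    com.example.MyAuthFilter
# Enable ACL checks, then split "view" from "modify" permissions.
spark.acls.enable   true
spark.ui.view.acls  analyst1,analyst2
spark.modify.acls   admin1
```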
Kubernetes
Spark can also generate a unique authentication secret for each application running on
Kubernetes. The secret is propagated to executor pods through environment variables. This
means that any user with permission to list pods in the namespace where the Spark
application is running can also see its authentication secret.
YARN
Spark on YARN will generate and distribute shared secrets automatically, with each
application using a unique shared secret. In the YARN case, this feature relies on YARN
RPC encryption to keep the distribution of secrets secure.
Here are some additional security measures related to authentication and
authorization:
1. Local files can be encrypted, meaning they cannot be read even by someone with
   access to them. For example, shuffle files and shuffle spills are temporary files
   saved on local disks.
2. Spark offers SSL support throughout its configuration hierarchy, so the user can
   apply a common SSL configuration while retaining the option to customize each
   component separately.
3. Spark supports Kerberos if you wish to use it for identity authentication. In YARN
   and Mesos modes, the delegation token must be configured.
4. Applications or sessions that are never closed will run into problems when they
   exceed the maximum token lifetime. Spark renews the token automatically in this
   situation, but in YARN mode you must configure your connecting applications
   accordingly.
5. Spark supports AES-based encryption for messages exchanged between the client
   and the Spark server. RPC authentication must be enabled and configured correctly
   for encryption to work.
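Taken together, points 1, 2, and 5 correspond to configuration options like the following (property names as given in the Spark security documentation; values are illustrative):

```properties
# 5. AES-based encryption for RPC messages (requires spark.authenticate).
spark.network.crypto.enabled  true
# 1. Encrypt temporary local files such as shuffle files and spills.
spark.io.encryption.enabled   true
# 2. SSL across the configuration hierarchy.
spark.ssl.enabled             true
```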