How to Create SQLContext in Spark Using Scala?
Last Updated: 13 May, 2024
Scala stands for scalable language. It was developed in 2003 by Martin Odersky. It is an object-oriented language that also supports the functional programming approach. Everything in Scala is an object. It is a statically typed language, although unlike other statically typed languages such as C, C++, or Java, it does not require explicit type annotations while writing code; type verification is done at compile time. Static typing allows safe systems to be built by default. Smart built-in checks and actionable error messages, combined with thread-safe data structures and collections, prevent many tricky bugs before the program first runs.
This article focuses on discussing steps to create SQLContext in Spark using Scala.
What is SQLContext?
The official definition in the documentation of Spark is:
"The entry point for running relational queries using Spark. Allows the creation of SchemaRDD objects and the execution of SQL queries."
The purpose of SQLContext is to introduce processing of structured data in Spark. Before it, Spark only had RDDs to manipulate data. RDDs are simply collections of rows (notice the absence of columns) that can be manipulated using lambda functions and other operations. SQLContext introduced objects that add a schema (column names and data types) to the data, making it similar to relational databases. This additional information about the data also opens the gate to optimizations in data processing.
Looking further at the documentation shows that SQLContext is a class introduced in version 1.0.0 that provides a set of functions for creating and manipulating SchemaRDD objects. Here is the list of functions:
- cacheTable
- createParquetFile
- createSchemaRDD
- logicalPlanToSparkQuery
- parquetFile
- registerRDDAsTable
- sparkContext
- sql
- table
- uncacheTable
The APIs revolve around the inter-conversion of Parquet files and SchemaRDD objects. A SchemaRDD is an RDD of Row objects that has an associated schema. In addition to the standard RDD functions, SchemaRDDs can be used in relational queries, as shown below:
Scala
// Legacy Spark 1.x imports: SchemaRDD and this style of SQLContext usage are no longer available.
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

// One method for defining the schema of an RDD is to make a case class with the desired column
// names and types.
case class Record(key: Int, value: String)
val sc: SparkContext // An existing SparkContext.
val sqlContext = new SQLContext(sc)
// Importing the SQL context gives access to all the SQL functions and implicit conversions.
import sqlContext._
val rdd = sc.parallelize((1 to 100).map(i => Record(i, s"val_$i")))
// Any RDD containing case classes can be registered as a table. The schema of the table is
// automatically inferred using Scala reflection.
rdd.registerAsTable("records")
val results: SchemaRDD = sql("SELECT * FROM records")
The above code would not run on the latest versions of Spark because SchemaRDDs are now obsolete.
Currently, SQLContext itself is not used directly; instead, SparkSession provides a unified interface for the different contexts such as SQLContext, SparkContext, HiveContext, and others. Inside SparkSession, the SQLContext is still present. Also, instead of SchemaRDDs, Spark now uses Datasets and DataFrames to represent structured data.
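For reference, here is a minimal sketch of how the legacy example above translates to current Spark versions (the object name is arbitrary, and the Record case class and "records" view name are carried over from the example above). A DataFrame and a temporary view take the place of the SchemaRDD and the registered table:
Scala
import org.apache.spark.sql.SparkSession

// Same schema as in the legacy example, now inferred into a DataFrame.
case class Record(key: Int, value: String)

object modernRecords {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("modernRecords")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // enables .toDF() on local collections of case classes

    // A DataFrame replaces the old SchemaRDD.
    val df = (1 to 100).map(i => Record(i, s"val_$i")).toDF()

    // registerAsTable is replaced by createOrReplaceTempView.
    df.createOrReplaceTempView("records")
    spark.sql("SELECT * FROM records").show(5)

    spark.stop()
  }
}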
Creating SQLContext
1. Using SparkContext
We can create an SQLContext from a SparkContext. The constructor is as follows:
public SQLContext(SparkContext sparkContext)
We can create a simple SparkContext object with "master" (the cluster URL) set to "local[*]" (use the current machine) and "appName" set to "createSQLContext". We can then pass this SparkContext to the SQLContext constructor.
Scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object createSQLContext {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "createSQLContext")
    val sqlc = new SQLContext(sc)
    println(sqlc)
  }
}
Output:
[Screenshot: The SQLContext Object]
Explanation:
As you can see above, we have created a new SQLContext object. Although this works, the method is deprecated: SQLContext has been replaced by SparkSession and is kept in newer versions only for backward compatibility.
2. Using Existing SQLContext Object
We can also use an existing SQLContext object to create a new one. Every SQLContext provides a newSession API that creates a new SQLContext backed by the same SparkContext. The API is as follows:
def newSession(): SQLContext
// Returns a SQLContext as new session, with separated SQL configurations, temporary tables, registered functions, but sharing the same SparkContext, cached data and other things
Below is the Scala program to implement the approach:
Scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object createSQLContext {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "createSQLContext")
    val sqlc = new SQLContext(sc)
    val nsqlc = sqlc.newSession()
    println(nsqlc)
  }
}
Output:
[Screenshot: The SQLContext Object]
Explanation:
As you can see above, we have created a new SQLContext session from the existing one. Although this works, the approach is likewise deprecated: SQLContext has been replaced by SparkSession and is kept in newer versions only for backward compatibility.
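To see what the newSession documentation means by separated temporary tables but a shared SparkContext, here is a small illustrative sketch (the sample data and the "letters" view name are made up for this example): a temporary table registered in the original session is not visible from the new one, while both sessions report the same SparkContext.
Scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object newSessionIsolation {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "newSessionIsolation")
    val sqlc = new SQLContext(sc)
    val nsqlc = sqlc.newSession()

    // Both sessions share the same underlying SparkContext.
    println(sqlc.sparkContext == nsqlc.sparkContext) // true

    // A temporary table registered in one session is not visible in the other.
    import sqlc.implicits._
    val df = sc.parallelize(Seq((1, "a"), (2, "b"))).toDF("id", "letter")
    df.createOrReplaceTempView("letters")
    println(sqlc.tableNames().contains("letters"))  // true
    println(nsqlc.tableNames().contains("letters")) // false
  }
}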
3. Using SparkSession
The latest way (as of version 3.5.0) is to use the SparkSession object. SparkSession is the culmination of the various earlier contexts and provides a unified interface to all of them. We can create a SparkSession object using the builder API and then access the SQLContext from it as follows:
Scala
import org.apache.spark.sql.SparkSession

object createSQLContext {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("createSQLContext")
      .master("local[*]")
      .getOrCreate()
    println(spark.sqlContext)
  }
}
Output:
[Screenshot: The SQLContext Object]
Explanation:
As you can see, we accessed the SQLContext object from inside the SparkSession object.
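As a quick usage sketch (the sample data and the "tools" view name are only illustrative), the SQLContext obtained this way shares the SparkSession's catalog, so a query issued through it returns the same result as one issued through spark.sql:
Scala
import org.apache.spark.sql.SparkSession

object querySQLContext {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("querySQLContext")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "spark"), (2, "scala")).toDF("id", "name")
    df.createOrReplaceTempView("tools")

    // Both calls run against the same catalog and return the same rows.
    spark.sqlContext.sql("SELECT name FROM tools WHERE id = 1").show()
    spark.sql("SELECT name FROM tools WHERE id = 1").show()

    spark.stop()
  }
}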