Sitemap
Data Querying

Modernizing & Simplifying how to Query Data

A Spark SQL Editor via Hue and the Spark SQL Server

--

Write and Execute some Spark SQL quickly in your own Web Editor.

Apache Spark is popular for wrangling/preparing data, especially when embedding some SQL snippets to keep the data manipulation programs declarative and simpler.

One good news is that the SQL syntax is very similar to Apache Hive so the very powerful Hive autocomplete of Hue works very well.

Zoom image will be displayed
SparkSql Editor in action via both types of connection

Here we will describe how to integrate with the Spark SQL Thrift Server interface that might be already available in your stack.

The article comes with a One click demo setup. The scenario is pretty simple and about batch querying, we will see for more live data in a dedicated follow-up episode.

Two connectors possible for communicating with the SparkSql Thrift Server

For fetching the Docker Compose configuration and starting everything:

mkdir spark
cd spark
wget https://raw.githubusercontent.com/romainr/query-demo/master/spark/docker-compose.yml
docker-compose up -d
>
Creating network "spark_default" with the default driver
Creating hue-database ... done
Creating livy-spark ... done
Creating spark-master ... done
Creating spark-sql ... done
Creating hue ... done

Then those URLs will be up:

For stopping everything:

docker-compose down

Hello World

How to create a SQL table representing some cities and number of inhabitants:

CREATE TABLE IF NOT EXISTS us_population (
state CHAR(2),
city VARCHAR(20),
population BIGINT
)
;
INSERT INTO us_population
VALUES
('NY', 'New York', 8143197),
('CA', 'Los Angeles', 3844829),
('IL', 'Chicago', 2842518),
('TX', 'Houston', 2016582),
('PA', 'Philadelphia', 1463281),
('AZ', 'Phoenix', 1461575),
('TX', 'San Antonio', 1256509),
('CA', 'San Diego', 1255540),
('TX', 'Dallas', 1213825),
('CA', 'San Jose', 91233)
;
SELECT
state as State,
count(city) as CityCount,
sum(population) as PopulationSum
FROM
us_population
GROUP BY
state
ORDER BY
sum(population) DESC
;

Which interface for connecting with the Spark SQL Server?

We previously demoed how to leverage Apache Livy to submit some SparkSql via Hue. As detailed there, Livy was initially created within the Hue project and offers a lightweight submission of interactive or batch PySpark / Scala Spark /SparkSql statements.

However one main drawback is that it might appear less official than the Distributed SQL Engine (also known as “Thrift Server”) shipped within Spark.

Hue can connect to the Spark SQL Thrift Server via two interfaces:

Long story short: the main advantage of SqlAlchemy is to be have more SparkSql nits ironed out but queries are submitted synchronously (i.e. queries of more than a few seconds don’t have progress report yet and long ones will time out, unless the Hue Task Server is setup).

So we recommend to get started with SqlAlchemy but help report/contribute back small fixes on the HiveServer2 API which is more native/advanced.

Note: SqlAlchemy interface requires the Hive connector which does not work out of the box because of the issue #150. But Hue ships and show a slightly patched module that works: https://github.com/gethue/PyHive

Configurations

In the hue.ini configure the connectors and make sure you installed the PyHive connector as shown in the docs:

[notebook] 
[[interpreters]]
[[[sparksql-alchemy]]]
name=SparkSql (via SqlAlchemy)
interface=sqlalchemy
options='{"url": "hive://localhost:10000/default"}'
[[[sparksql]]]
# Must be named 'sparksql', hostname and more options are
# in the 'spark' section
name=SparkSql (via HiveServer2)
interface=hiveserver2
[spark]
sql_server_host=localhost
sql_server_port=10000

What’s next?

Et voila!

Next time we will describe the progress on the Hue SQL Scratchpad component that can be leveraged for easily editing and quick testing embedded SparkSql snippets as well as how to query live data.

--

--

Responses (1)