Schema – structure of data
A schema describes the structure of your data and can be either implicit or explicit. Because DataFrames are internally based on RDDs, there are two main ways to convert an existing RDD into a dataset:
- Using reflection to infer the schema of the RDD
- Through a programmatic interface that lets you define a schema and apply it to an existing RDD, converting the RDD into a dataset with that schema
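The two approaches listed above can be sketched as follows. This is a minimal illustration rather than a definitive implementation: the State case class, its field names, and the sample values are hypothetical, and a local SparkSession is assumed.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical record type, defined at the top level so that Spark can
// derive an encoder for it. The fields are made up for this illustration.
case class State(name: String, population: Int)

object RddToDatasetSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for this sketch.
    val spark = SparkSession.builder()
      .appName("rdd-to-dataset-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // brings toDS() and toDF() into scope

    // 1. Reflection: the schema is inferred from the fields of the case class.
    val stateRDD: RDD[State] =
      spark.sparkContext.parallelize(Seq(State("StateA", 100), State("StateB", 200)))
    val stateDS: Dataset[State] = stateRDD.toDS()
    stateDS.printSchema()

    // 2. Programmatic interface: build a StructType by hand and apply it
    //    to an RDD of generic Row objects.
    val rowRDD: RDD[Row] =
      spark.sparkContext.parallelize(Seq(Row("StateA", 100), Row("StateB", 200)))
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = false),
      StructField("population", IntegerType, nullable = false)
    ))
    val stateDF: DataFrame = spark.createDataFrame(rowRDD, schema)
    stateDF.printSchema()

    spark.stop()
  }
}
```

Reflection is the shorter route when the record type is known at compile time, while the programmatic route is useful when the columns are only known at runtime.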
Implicit schema
Let's look at an example of loading a comma-separated values (CSV) file into a DataFrame. When a text file contains a header, the read API can take the column names from the header line; column types are inferred from the data when the inferSchema option is enabled, and default to string otherwise. We can also specify the separator used to split the lines of the text file.
We read the CSV file, inferring the schema from the header line and using the comma (,) as the separator. We then use the schema and printSchema commands to verify the schema of the input file:
scala> val statesDF = spark.read.option...
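As a rough, self-contained sketch of this kind of read (not the exact command shown above), the following writes a small made-up CSV file to a temporary location and loads it with the header, inferSchema, and sep options. The file contents, column names, and the sampleDF name are hypothetical, and a local SparkSession is assumed.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object CsvImplicitSchemaSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for this sketch.
    val spark = SparkSession.builder()
      .appName("csv-implicit-schema-sketch")
      .master("local[*]")
      .getOrCreate()

    // Write a tiny, made-up CSV file so that the example needs no external data.
    val csvPath = Files.createTempFile("sample", ".csv")
    Files.write(csvPath, "name,year,value\nalpha,2010,100\nbeta,2011,200\n".getBytes("UTF-8"))

    val sampleDF = spark.read
      .option("header", "true")      // take column names from the first line
      .option("inferSchema", "true") // infer column types from the data
      .option("sep", ",")            // separator used to split each line
      .csv(csvPath.toString)

    // schema returns the StructType; printSchema prints it as a tree.
    println(sampleDF.schema)
    sampleDF.printSchema()

    spark.stop()
  }
}
```

Without the inferSchema option, the columns would still be named from the header but would all be read as strings.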