Spark and Hive

Spark gives us the ability to use SQL for data processing. With it we can connect over JDBC and ODBC to pretty much any database, or work with structured data formats like Avro, Parquet and ORC. We can also connect to Hive and use all the structures we have defined there. In Spark 2.0 the separate entry points for SQL (SQLContext) and Hive (HiveContext) were replaced with a single object, SparkSession. SparkSession allows you to read from and write to Hive, use the HiveQL language and call Hive UDFs. Not all Hive features are supported yet, but they probably will be at some point (here is a list).

If you have the Spark and Hive setup ready, accessing Hive from Spark is pretty straightforward.

 

HOW TO

I was using Spark 2.2 and Hive 1.1 with Kerberos.

If you’re using Maven, you need to add the following dependencies:
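A sketch for Spark 2.2 built against Scala 2.11; adjust the versions and the Scala suffix to match your cluster:

```xml
<!-- spark-sql for the DataFrame/SQL API, spark-hive for the Hive metastore integration.
     The provided scope assumes the cluster supplies Spark at runtime. -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.11</artifactId>
  <version>2.2.0</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-hive_2.11</artifactId>
  <version>2.2.0</version>
  <scope>provided</scope>
</dependency>
```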

 
 
First you need to initialise the SparkSession with Hive support. If you want to use Hive partitioning you also need to set the Hive partitioning properties.
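A minimal sketch; the application name is just illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark-hive-example")                            // illustrative name
  .enableHiveSupport()                                      // connect the session to the Hive metastore
  .config("hive.exec.dynamic.partition", "true")            // allow dynamic partition inserts
  .config("hive.exec.dynamic.partition.mode", "nonstrict")  // all partition columns may be dynamic
  .getOrCreate()
```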

Those properties can also be set after the creation of SparkSession.
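For example, with SQL SET commands on a running session, such as the one spark-shell provides:

```scala
// The same partitioning properties applied after the SparkSession already exists.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
```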

 
 
Hive tables can be created either in Hive itself or through Spark SQL commands.
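A sketch of the Spark SQL route; the database, table and column names (mydb.events with id, payload and a day partition) are made up for illustration:

```scala
// Create a database and a partitioned Parquet table through Spark SQL / HiveQL.
spark.sql("CREATE DATABASE IF NOT EXISTS mydb")

spark.sql("""
  CREATE TABLE IF NOT EXISTS mydb.events (
    id BIGINT,
    payload STRING
  )
  PARTITIONED BY (day STRING)
  STORED AS PARQUET
""")
```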

And now we can read and write:
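Sticking with the hypothetical mydb.events table, a sketch of reading it with spark.table, appending rows with insertInto, and writing a copy with saveAsTable and an explicit path (the path option is discussed in the notes below):

```scala
import spark.implicits._

// Read a Hive table into a DataFrame.
val events = spark.table("mydb.events")
events.show()

// Append new rows into the existing partitioned table; insertInto matches columns
// by position, with the partition column (day) last.
val newEvents = Seq((1L, "click", "2017-01-01")).toDF("id", "payload", "day")
newEvents.write.mode("append").insertInto("mydb.events")

// Write a copy as a new table with an explicit location (hypothetical path).
events.write
  .mode("overwrite")
  .option("path", "/user/me/warehouse/mydb/events_copy")
  .saveAsTable("mydb.events_copy")
```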

Instead of using the write API, you can use a SQL command with an insert:
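For example, registering the DataFrame as a temporary view and running a HiveQL insert; this relies on the dynamic partition settings configured on the session earlier:

```scala
// The same append expressed as a HiveQL insert into the hypothetical table.
newEvents.createOrReplaceTempView("events_staging")

spark.sql("""
  INSERT INTO TABLE mydb.events PARTITION (day)
  SELECT id, payload, day FROM events_staging
""")
```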

 
 
Note: Without specifying the path option, Spark tries to write into the default data warehouse location, even if the table definition specifies a different path.

Note: For the Cloudera distribution of Hive and Spark before CDH 5.14 there is a bug: you cannot read data from a table into which you wrote without specifying the path.

Note: By default Spark uses its own Parquet support instead of the Hive SerDe for better performance. This behaviour is controlled by the spark.sql.hive.convertMetastoreParquet parameter.
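If you need the Hive SerDe behaviour instead, the setting can be switched off at runtime; a sketch:

```scala
// Fall back to the Hive SerDe for metastore Parquet tables instead of Spark's built-in reader.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
```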
