
Tag: Spark

Spark and Hive

Spark gives us the ability to use SQL for data processing. With it we can connect over JDBC and ODBC to pretty much any database, or use structured data formats like Avro, Parquet, and ORC. We can also connect to Hive and use all the structures we have there. In Spark 2.0 the separate entry points for SQL (SQLContext) and Hive (HiveContext) were replaced by a single object: SparkSession. SparkSession allows you to read from and write to Hive, use the HiveQL language, and call Hive UDFs.
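For a quick taste, here's a minimal sketch of the Spark 2.0 entry point with Hive support enabled; the warehouse path and the table name are illustrative assumptions, not something from an actual cluster:

```scala
import org.apache.spark.sql.SparkSession

// Build a SparkSession with Hive support; this replaces the old
// SQLContext/HiveContext pair from Spark 1.x.
val spark = SparkSession.builder()
  .appName("HiveExample")
  .config("spark.sql.warehouse.dir", "/user/hive/warehouse") // assumed path
  .enableHiveSupport()
  .getOrCreate()

// Query an existing Hive table ("src" is a placeholder name) with HiveQL
spark.sql("SELECT key, value FROM src LIMIT 10").show()
```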

Read more

Hadoop and Spark shuffling

Shuffling is generally the most costly operation that we encounter in Hadoop and Spark processing. It has a huge impact on processing performance and can become a bottleneck when our big data requires a lot of grouping or joining. That's why I think it's worth spending a while understanding how shuffling is handled by both of those engines.
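As a quick illustration (the column names and data here are made up), a grouping operation on a DataFrame forces a shuffle, which shows up as an Exchange step in the physical plan:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ShuffleExample").getOrCreate()
import spark.implicits._

val sales = Seq(("US", 100), ("PL", 50), ("US", 70)).toDF("country", "amount")

// groupBy triggers a shuffle: all rows with the same key must be moved
// to the same partition before the aggregation can run
val totals = sales.groupBy("country").sum("amount")
totals.explain() // the printed plan contains an Exchange (shuffle) node
```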

Read more

Spark 2 APIs

Spark 2.0 brought some changes to the API: the Dataset and DataFrame abstractions were unified. Now DataFrame = Dataset[Row] (in Scala and Java), where Row is an untyped generic object representing a table-like record with a schema. That doesn't mean the DataFrame itself was dropped: it's still the main abstraction for MLlib, SparkR, and Python. The Dataset is currently available only in Scala and Java, as it's a strongly typed abstraction. It's super easy to switch between the untyped DataFrame and the typed Dataset.
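Here's a minimal sketch of that round trip; the Person case class and the sample data are assumptions for illustration:

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical record type for the typed view
case class Person(name: String, age: Int)

val spark = SparkSession.builder().appName("DatasetExample").getOrCreate()
import spark.implicits._

// A DataFrame is just Dataset[Row]: untyped, schema checked at runtime
val df: DataFrame = Seq(("Ann", 30), ("Bob", 25)).toDF("name", "age")

// as[Person] converts it to a strongly typed Dataset;
// column names and types must match the case class fields
val ds: Dataset[Person] = df.as[Person]

// toDF() goes back to the untyped DataFrame view
val back: DataFrame = ds.toDF()
```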

Read more