Spark 2 APIs

Spark 2.0 brought some changes to the API – it created the link between the Dataset and the DataFrame. Now a DataFrame = Dataset[Row] (in Scala and Java), where Row is an untyped generic object representing a table-like record with a schema. That doesn't mean the DataFrame itself was dropped: it's still the main abstraction for MLlib, for SparkR and for the Python API. The Dataset is currently available only in Scala and Java, as it's a strongly typed abstraction. It's easy to switch between the untyped DataFrame and the typed Dataset – to do that we need to define a case class in Scala or a JavaBean in Java.
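For example, a minimal Scala sketch – assuming a spark-shell style session where spark (the SparkSession) is already in scope, and using a hypothetical Person record – could look like this:

```scala
import org.apache.spark.sql.{DataFrame, Dataset}
import spark.implicits._   // brings in the Encoders needed by .toDF(), .toDS() and .as[T]

// Hypothetical record type used only for illustration
case class Person(name: String, age: Long)

// Untyped DataFrame (i.e. Dataset[Row]) built from a local collection
val df: DataFrame = Seq(("Alice", 29L), ("Bob", 31L)).toDF("name", "age")

// DataFrame -> Dataset[Person]: columns are matched to the case class fields by name
val ds: Dataset[Person] = df.as[Person]

// Dataset[Person] -> DataFrame: drops the static type but keeps the schema
val backToDf: DataFrame = ds.toDF()
```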
In addition to the typed Dataset transformations, we can use some untyped ones that were previously available only for DataFrames (see the Spark 2.2 docs).
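Continuing the sketch above, an untyped select or groupBy on the typed Dataset gives back a plain DataFrame – the value names below are just illustrative:

```scala
// Typed transformation – the lambda operates on Person objects and the result stays typed
val adults: Dataset[Person] = ds.filter(p => p.age >= 18)

// Untyped, DataFrame-style transformations also work on a Dataset;
// referring to columns by name gives back an untyped DataFrame (Dataset[Row])
val names: DataFrame = ds.select("name")
val countsByAge: DataFrame = ds.groupBy("age").count()
```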
 
And what about RDDs? After all, every Spark abstraction is an RDD underneath, so we can always switch to the RDD level to perform some low-level operations – although, given all the optimisations applied to Datasets, it's usually not worth it.
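Dropping down to RDDs from the hypothetical ds and df above is just a matter of calling .rdd – a rough sketch:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Still using the hypothetical ds and df from the first sketch
val personRdd: RDD[Person] = ds.rdd   // typed objects come back as plain JVM objects
val rowRdd: RDD[Row] = df.rdd         // a DataFrame yields an RDD of generic Rows

// Low-level, per-partition processing not exposed by the DataFrame/Dataset API
val partitionSizes: RDD[Int] = personRdd.mapPartitions(it => Iterator(it.size))
```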
 
Here’s a little cheat-sheet for using those 3 Spark abstractions:

RDD – resilient distributed dataset

  • it's the core Spark abstraction – all other abstractions are RDDs underneath
  • allows performing low-level transformations and actions on unstructured and structured data, but it cannot infer the data schema (see the sketch after this list)
  • can be created from a file or another RDD
  • queries on RDDs are only slightly optimised through the DAG (e.g. ordering a filter transformation before a map) – no Catalyst optimiser
  • an RDD holds in-memory JVM objects, which implies Java serialization and garbage collection overhead
  • it’s an immutable collection
  • provides compile-time type safety
  • RDDs form a lineage of operations to be executed
  • can be used with Scala, Java or Python
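A rough sketch of working at the RDD level, assuming the spark-shell session from the earlier sketches (the input file name is a placeholder):

```scala
import org.apache.spark.rdd.RDD

// sc is the SparkContext underlying the SparkSession
val sc = spark.sparkContext

// An RDD created from a local collection – it could equally come from a text file or another RDD
val numbers: RDD[Int] = sc.parallelize(1 to 100)
val lines: RDD[String] = sc.textFile("data.txt")   // hypothetical input file

// Low-level transformations build up the lineage; the action at the end triggers execution
val evenSquares: RDD[Int] = numbers.filter(_ % 2 == 0).map(n => n * n)
val total: Long = evenSquares.count()
```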
 

DataFrame

  • it's a distributed collection of tabular data organized into rows and named columns (the equivalent of a relational database table) – rows are instances of the Row class
  • DataFrames hold the schema of the data
  • it’s an abstraction for structured and semi-structured data
  • can be created from a file (e.g. CSV, Parquet, JSON), a database or Hive table, an RDD, or built from scratch (see the sketch after this list)
  • query plan is optimised by Catalyst optimiser
  • provides no compile-time type safety; errors are detected only at runtime
  • doesn't need Java serialization; data is stored off-heap in a binary format, and many transformations are performed directly on this off-heap memory
  • can be used with Scala, Java, Python, R and SQL
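A sketch of a few ways to create and query a DataFrame, assuming the spark-shell session from the earlier sketches (the file paths are placeholders):

```scala
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// From files
val fromCsv: DataFrame = spark.read.option("header", "true").csv("people.csv")
val fromParquet: DataFrame = spark.read.parquet("people.parquet")

// Built from scratch: an RDD of Rows plus an explicit schema
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", LongType, nullable = true)))
val rows = spark.sparkContext.parallelize(Seq(Row("Alice", 29L), Row("Bob", 31L)))
val fromScratch: DataFrame = spark.createDataFrame(rows, schema)

// Untyped, Catalyst-optimised query – a mistyped column name fails only at runtime
fromScratch.filter(fromScratch("age") > 30).show()
```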
 

Dataset

  • joins features of RDDs and DataFrames
  • can be used with structured and unstructured data
  • provides compile-time type-safety
  • the data schema is inferred from the data type (see the sketch after this list)
  • query plan is optimised by Catalyst optimiser
  • doesn't need Java serialization; so-called Encoders generate bytecode to interact with off-heap data and provide on-demand access to individual attributes without having to deserialize the whole object
  • can be used with Scala, Java and SQL
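A sketch of the typed Dataset API, assuming the spark-shell session and the hypothetical Person case class from the first sketch:

```scala
import org.apache.spark.sql.{Dataset, Encoder, Encoders}

// The schema (name: string, age: long) is inferred from the Person fields via an implicit Encoder
val people: Dataset[Person] = Seq(Person("Alice", 29L), Person("Bob", 31L)).toDS()
people.printSchema()

// Typed transformations – a typo such as _.agee would be caught at compile time
val grownUps: Dataset[Person] = people.filter(_.age >= 18)
val upperNames: Dataset[String] = people.map(_.name.toUpperCase)

// The Encoder working behind the scenes can also be obtained explicitly
val personEncoder: Encoder[Person] = Encoders.product[Person]
```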
 

So, which abstraction to use when?

Depending on the use case, choose the abstraction that suits your needs. The table below combines all the features together.

RDD | Dataset | DataFrame
for unstructured data | for structured data | for structured data
provides low-level transformations and actions | high-level abstraction: filters, maps, aggregations, averages, sums, SQL queries, columnar access | high-level abstraction: filters, maps, aggregations, averages, sums, SQL queries, columnar access
collection of typed objects | collection of typed objects | collection of untyped Row objects
lambda functions available | lambda functions available | no lambda functions available
creates in-memory JVM objects | uses off-heap memory | uses off-heap memory
DAG optimisations | Catalyst optimisations | Catalyst optimisations
compile-time type safety for Scala and Java | compile-time type safety for Scala and Java | analytical errors detected at runtime
Scala, Java, Python | Scala, Java, SQL | Scala, Java, R, Python, SQL
