Spark 2 APIs
Spark 2.0 brought some changes to the API: the Dataset and DataFrame APIs were unified. A DataFrame is now simply a Dataset[Row] (in Scala and Java), where Row is an untyped, generic object representing a table-like record with a schema. This doesn't mean that the DataFrame itself was dropped: it's still the main abstraction for MLlib, SparkR and Python. The Dataset is currently available only in Scala and Java, as it's a strongly typed abstraction. It's super easy to switch between the untyped DataFrame and the typed Dataset. To do that we need to define a case class in Scala:
```scala
// spark is an existing SparkSession; the implicit Encoder for Fruit
// comes from spark.implicits._
import spark.implicits._

case class Fruit(
  Fruit_id: Integer,
  Fruit_Name: String,
  Fruit_colour: String,
  Fruit_group: String)

val fruitsDS = fruitsDF.as[Fruit]
```
or a JavaBean in Java.
In addition to typed Dataset transformations, we can also use some untyped ones which were previously available only for DataFrames (Spark 2.2 docs).
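For example, a typed Dataset still accepts the untyped, column-based transformations. Here's a minimal sketch, assuming the Fruit case class and the fruitsDS Dataset defined above:

```scala
import org.apache.spark.sql.functions.count

// Typed transformation: the lambda operates on Fruit objects.
val yellowFruits = fruitsDS.filter(fruit => fruit.Fruit_colour == "yellow")

// Untyped transformations: groupBy/agg operate on named columns and
// return a DataFrame (Dataset[Row]), so the Fruit type is lost.
val fruitsPerGroup = fruitsDS
  .groupBy("Fruit_group")
  .agg(count("Fruit_id").as("fruits_in_group"))
```

To get back to a typed Dataset after such an untyped transformation, we'd have to call .as[...] again on the result.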
And what about the RDDs? After all, every Spark abstraction is an RDD underneath, so we can always switch to an RDD to perform some low-level operations. However, by dropping down to RDDs we lose the optimisations applied to Datasets, so it's rarely the most convenient option.
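A minimal sketch of dropping down to the RDD level, again assuming the fruitsDS Dataset from above:

```scala
// Every Dataset or DataFrame exposes its underlying RDD.
val fruitsRDD = fruitsDS.rdd // RDD[Fruit]

// Low-level, partition-wise processing that the high-level API
// doesn't expose directly.
val namesPerPartition = fruitsRDD.mapPartitions { fruits =>
  Iterator(fruits.map(_.Fruit_Name).toList)
}
```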
Here’s a little cheat-sheet for using those 3 Spark abstractions:
RDD – resilient distributed dataset
- it’s a core Spark abstraction – all other abstractions are the RDDs underneath
- allows to perform the low-level transformations and actions on unstructured and structured data (but it can not infer the data schema)
- can be created from file or other RDD
- queries on RDDs are just slightly optimised with the use of DAG (i.e. ordering filter transformation before the map) – no Catalyst optimiser
- RDD is a JVM object, which implies serialization and Garbage Collection overhead
- it’s an immutable collection
- provides compile-time type safety
- RDDs form a lineage of operations to be executed
- can be used with Scala, Java or Python
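A minimal sketch of working at the RDD level, assuming a hypothetical fruits.txt file with one comma-separated record per line (id, name, colour, group):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rdd-example").getOrCreate()
val sc = spark.sparkContext

// Schema-less, low-level processing: Spark only sees strings here.
val yellowFruitNames = sc.textFile("fruits.txt")   // RDD[String]
  .map(_.split(","))                               // RDD[Array[String]]
  .filter(fields => fields(2) == "yellow")         // columns accessed by index
  .map(fields => fields(1))                        // the fruit name field

yellowFruitNames.take(5).foreach(println)
```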
DataFrame
- it’s a distributed collection of tabular data organized into rows and named columns (equivalent of relation database table) – instances of Row class
- DataFrames hold the schema of the data
- it’s an abstraction for structured and semi-structured data
- can be created from a file (like CSV, Parquet or JSON), a database or Hive table, an RDD, or built from scratch
- query plan is optimised by Catalyst optimiser
- provides no compile-time type safety; errors are detected at runtime
- doesn’t need Java serialization; data is serialized into off-heap storage in a binary format to perform many transformations directly on this off-heap memory
- can be used with Scala, Java, Python, R and SQL
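A minimal sketch of the DataFrame API, assuming a hypothetical fruits.csv file with a header row matching the columns used earlier ('citrus' is just an example value):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("dataframe-example").getOrCreate()

// The schema (column names and types) is read from the CSV header
// and inferred from the data.
val fruitsDF = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("fruits.csv")

fruitsDF.printSchema()

// Column-based query optimised by Catalyst; column names are only
// checked at runtime.
fruitsDF.select("Fruit_Name", "Fruit_colour")
  .where("Fruit_group = 'citrus'")
  .show()
```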
Dataset
- joins features of RDDs and DataFrames
- can be used with structured and unstructured data
- provides compile-time type-safety
- the data schema is inferred from the Dataset's type (e.g. a case class)
- query plan is optimised by Catalyst optimiser
- doesn't need Java serialization; so-called Encoders generate bytecode to interact with off-heap data and provide on-demand access to individual attributes of an object without having to deserialize the whole object
- can be used with Scala, Java and SQL
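A minimal sketch of the Dataset API, reusing the Fruit case class and the fruitsDF DataFrame from the earlier examples (the implicit Encoder for Fruit comes from spark.implicits._):

```scala
import spark.implicits._

// A typed view over the same data; the schema comes from the Fruit case class.
val fruitsDS = fruitsDF.as[Fruit]

// Compile-time checked: a typo in a field name would not compile.
val citrusFruitNames = fruitsDS
  .filter(fruit => fruit.Fruit_group == "citrus")
  .map(fruit => fruit.Fruit_Name)

citrusFruitNames.show()
```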
So which abstraction should we use, and when?
Depending on the use case, it's best to choose the one that suits your needs. The table below brings all the features together.
| RDD | Dataset | DataFrame |
|---|---|---|
| for unstructured data | for structured data | for structured data |
| provides low-level transformations and actions | high-level abstraction: filters, maps, aggregations, averages, sums, SQL queries, columnar access | high-level abstraction: filters, maps, aggregations, averages, sums, SQL queries, columnar access |
| a collection of typed objects | a collection of typed objects | a collection of untyped objects; no lambda functions available |
| creates in-memory JVM objects | uses off-heap memory | uses off-heap memory |
| DAG optimisations only | Catalyst optimisations | Catalyst optimisations |
| compile-time type safety for Scala and Java | compile-time type safety for Scala and Java | runtime detection of analytical errors |
| Scala, Java, Python | Scala, Java, SQL | Scala, Java, R, Python, SQL |