Spark 2 APIs

Spark 2.0 brought some changes to the API – it created the link between the Dataset and the DataFrame. Now a DataFrame = Dataset[Row] (in Scala and Java), where Row is an untyped generic object representing a table-like record with a schema. That doesn't mean the DataFrame itself was dropped: it's still the main abstraction for MLlib, for SparkR and for the Python API. The Dataset is currently available only in Scala and Java, as it's a strongly typed abstraction. It's easy to switch between the untyped DataFrame and the typed Dataset – to do that we need to define a case class in Scala or a JavaBean in Java.
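For example, a minimal Scala sketch – assuming a spark-shell style session where spark (the SparkSession) is already in scope, and using a hypothetical Person record – could look like this:

```scala
import org.apache.spark.sql.{DataFrame, Dataset}
import spark.implicits._   // brings in the Encoders needed by .toDF(), .toDS() and .as[T]

// Hypothetical record type used only for illustration
case class Person(name: String, age: Long)

// Untyped DataFrame (i.e. Dataset[Row]) built from a local collection
val df: DataFrame = Seq(("Alice", 29L), ("Bob", 31L)).toDF("name", "age")

// DataFrame -> Dataset[Person]: columns are matched to the case class fields by name
val ds: Dataset[Person] = df.as[Person]

// Dataset[Person] -> DataFrame: drops the static type but keeps the schema
val backToDf: DataFrame = ds.toDF()
```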
In addition to the typed Dataset transformations, we can use some untyped ones that were previously available only for DataFrames (see the Spark 2.2 docs).
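Continuing the sketch above, an untyped select or groupBy on the typed Dataset gives back a plain DataFrame – the value names below are just illustrative:

```scala
// Typed transformation – the lambda operates on Person objects and the result stays typed
val adults: Dataset[Person] = ds.filter(p => p.age >= 18)

// Untyped, DataFrame-style transformations also work on a Dataset;
// referring to columns by name gives back an untyped DataFrame (Dataset[Row])
val names: DataFrame = ds.select("name")
val countsByAge: DataFrame = ds.groupBy("age").count()
```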
 
And what about RDDs? After all, every Spark abstraction is an RDD underneath, so we can always switch to the RDD level to perform some low-level operations – although, given all the optimisations applied to Datasets, it's usually not worth it.
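Dropping down to RDDs from the hypothetical ds and df above is just a matter of calling .rdd – a rough sketch:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Still using the hypothetical ds and df from the first sketch
val personRdd: RDD[Person] = ds.rdd   // typed objects come back as plain JVM objects
val rowRdd: RDD[Row] = df.rdd         // a DataFrame yields an RDD of generic Rows

// Low-level, per-partition processing not exposed by the DataFrame/Dataset API
val partitionSizes: RDD[Int] = personRdd.mapPartitions(it => Iterator(it.size))
```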
 
Here’s a little cheat-sheet for using those 3 Spark abstractions:

RDD – resilient distributed dataset

  • it's the core Spark abstraction – all other abstractions are RDDs underneath
  • allows performing low-level transformations and actions on unstructured and structured data, but it cannot infer the data schema (see the sketch after this list)
  • can be created from a file or another RDD
  • queries on RDDs are only slightly optimised through the DAG (e.g. ordering a filter transformation before a map) – no Catalyst optimiser
  • an RDD holds in-memory JVM objects, which implies Java serialization and garbage collection overhead
  • it’s an immutable collection
  • provides compile-time type safety
  • RDDs form a lineage of operations to be executed
  • can be used with Scala, Java or Python
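A rough sketch of working at the RDD level, assuming the spark-shell session from the earlier sketches (the input file name is a placeholder):

```scala
import org.apache.spark.rdd.RDD

// sc is the SparkContext underlying the SparkSession
val sc = spark.sparkContext

// An RDD created from a local collection – it could equally come from a text file or another RDD
val numbers: RDD[Int] = sc.parallelize(1 to 100)
val lines: RDD[String] = sc.textFile("data.txt")   // hypothetical input file

// Low-level transformations build up the lineage; the action at the end triggers execution
val evenSquares: RDD[Int] = numbers.filter(_ % 2 == 0).map(n => n * n)
val total: Long = evenSquares.count()
```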
 

DataFrame

  • it's a distributed collection of tabular data organized into rows and named columns (the equivalent of a relational database table) – rows are instances of the Row class
  • DataFrames hold the schema of the data
  • it’s an abstraction for structured and semi-structured data
  • can be created from a file (e.g. CSV, Parquet, JSON), a database or Hive table, an RDD, or built from scratch (see the sketch after this list)
  • query plan is optimised by Catalyst optimiser
  • provides no compile-time type safety; errors are detected only at runtime
  • doesn't need Java serialization; data is stored off-heap in a binary format, and many transformations are performed directly on this off-heap memory
  • can be used with Scala, Java, Python, R and SQL
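A sketch of a few ways to create and query a DataFrame, assuming the spark-shell session from the earlier sketches (the file paths are placeholders):

```scala
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// From files
val fromCsv: DataFrame = spark.read.option("header", "true").csv("people.csv")
val fromParquet: DataFrame = spark.read.parquet("people.parquet")

// Built from scratch: an RDD of Rows plus an explicit schema
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", LongType, nullable = true)))
val rows = spark.sparkContext.parallelize(Seq(Row("Alice", 29L), Row("Bob", 31L)))
val fromScratch: DataFrame = spark.createDataFrame(rows, schema)

// Untyped, Catalyst-optimised query – a mistyped column name fails only at runtime
fromScratch.filter(fromScratch("age") > 30).show()
```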
 

Dataset

  • joins features of RDDs and DataFrames
  • can be used with structured and unstructured data
  • provides compile-time type-safety
  • the data schema is inferred from the data type (see the sketch after this list)
  • query plan is optimised by Catalyst optimiser
  • doesn't need Java serialization; so-called Encoders generate bytecode to interact with off-heap data and provide on-demand access to individual attributes without having to deserialize the whole object
  • can be used with Scala, Java and SQL
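A sketch of the typed Dataset API, assuming the spark-shell session and the hypothetical Person case class from the first sketch:

```scala
import org.apache.spark.sql.{Dataset, Encoder, Encoders}

// The schema (name: string, age: long) is inferred from the Person fields via an implicit Encoder
val people: Dataset[Person] = Seq(Person("Alice", 29L), Person("Bob", 31L)).toDS()
people.printSchema()

// Typed transformations – a typo such as _.agee would be caught at compile time
val grownUps: Dataset[Person] = people.filter(_.age >= 18)
val upperNames: Dataset[String] = people.map(_.name.toUpperCase)

// The Encoder working behind the scenes can also be obtained explicitly
val personEncoder: Encoder[Person] = Encoders.product[Person]
```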
 

So, which abstraction to use when?

Depending on the use case, choose the abstraction that suits your needs. The table below combines all the features together.

RDD | Dataset | DataFrame
for unstructured data | for structured data | for structured data
provides low-level transformations and actions | high-level abstraction: filters, maps, aggregations, averages, sums, SQL queries, columnar access | high-level abstraction: filters, maps, aggregations, averages, sums, SQL queries, columnar access
collection of typed objects | collection of typed objects | collection of untyped Row objects
lambda functions available | lambda functions available | no lambda functions available
creates in-memory JVM objects | uses off-heap memory | uses off-heap memory
DAG optimisations | Catalyst optimisations | Catalyst optimisations
compile-time type safety for Scala and Java | compile-time type safety for Scala and Java | analytical errors detected at runtime
Scala, Java, Python | Scala, Java, SQL | Scala, Java, R, Python, SQL
