Spark & R – SparkR vs sparklyr

17/04/2021 12:24 PM
Alice
Tags: Spark, sparklyr, SparkR

R users can work with Spark through one of two libraries: SparkR or sparklyr. They differ in usage structure and slightly in available functionality. SparkR is an official Spark library, while sparklyr is created by the RStudio community. Because Python is currently the favourite language of Data Scientists using Spark, the Spark R libraries evolve at a slower pace and generally play catch-up with the functionality available in pyspark. Still, both support data processing and distributed Machine Learning, converting user code into Spark operations across a cluster of machines. With either one you can easily switch between local and distributed processing:

Spark with R	local R
Distributed processing	Single machine processing
SparkR/sparklyr functions	R functions
R UDFs	R libraries
Lazy execution	Immediate execution

Let’s look at the differences between these two Spark R packages and the functionality each of them offers.

SparkR

SparkR is part of the official Spark project and is supported as such. Its concept is built around the R data.frame API. We can create a Spark DataFrame with the read.df function, subset data with filter or select, aggregate it with groupBy and summarize, or reference a column with the $ sign. Most typical R data manipulation functions have a counterpart that can be applied in the Spark distributed environment.

Example:

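library(SparkR)
sparkR.session()
cars <- read.df("mtcars.csv", "csv", header = "true")  # the header row supplies column names such as cyl
summarize(groupBy(cars, cars$cyl), count = n(cars$cyl))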

Mind that not all data.frame constructs are supported, though. For instance, we cannot index a Spark DataFrame by row or change individual values as we can with an R data.frame. Additionally, the SparkR package, when loaded, masks many R functions, like stats::filter, stats::lag, base::sample or base::rbind. Check more details about this API here.

In contrast to pyspark, there is no RDD support: SparkR is based on DataFrames only.

SparkR natively supports reading json, csv, orc, avro or parquet files – in addition you can find connectors to other popular data formats.

The spark.lapply function runs multiple instances of any R function in Spark, each with a different parameter value. That can be very helpful when searching for the best parameters of a Machine Learning model. Mind that each instance runs on a separate node and needs to fit into that node’s memory.
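A minimal sketch of such a parameter sweep, assuming a local Spark installation. The list of families and the train helper are illustrative names, not something taken from a real project:

```r
library(SparkR)
sparkR.session(master = "local[2]")  # assumption: Spark is installed locally

# Illustrative parameter values, one model is fitted per value
families <- c("gaussian", "poisson")

# Plain R function executed on the workers; it uses ordinary R modelling code
train <- function(family) {
  model <- glm(Sepal.Length ~ Sepal.Width + Species, iris, family = family)
  stats::AIC(model)
}

# The result is a regular R list with one element per input value
aics <- spark.lapply(families, train)
names(aics) <- families
```

Only small results such as a score or a model summary should be returned, since everything is gathered back on the driver.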

There are also functions for running R UDFs on Spark, that is any custom R functionality not natively available in Spark. Those are dapply and dapplyCollect, plus gapply and gapplyCollect for grouped data. The dapplyCollect and gapplyCollect functions additionally collect the Spark output into an R data frame.
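The following sketch shows both flavours on the built-in faithful dataset, assuming a local Spark installation. The derived column names (waiting_secs, max_eruption) are chosen here only for illustration:

```r
library(SparkR)
sparkR.session(master = "local[2]")  # assumption: Spark is installed locally

df <- as.DataFrame(faithful)  # columns: eruptions, waiting

# dapply: the function receives each partition as an R data.frame;
# the schema of the returned data has to be declared up front
schema <- structType(
  structField("eruptions", "double"),
  structField("waiting", "double"),
  structField("waiting_secs", "double")
)
with_secs <- dapply(df, function(x) cbind(x, x$waiting * 60), schema)

# gapplyCollect: the function receives the grouping key and the rows of that group;
# the result is returned as a local R data.frame, so no schema is needed
max_eruption <- gapplyCollect(df, "waiting", function(key, x) {
  y <- data.frame(key, max(x$eruptions))
  colnames(y) <- c("waiting", "max_eruption")
  y
})
```

dapply and gapply keep the result in Spark, which is the safer choice when the output is large.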

Moreover, SparkR benefits from Arrow processing optimizations. This applies to the functions that translate data frames between R and Spark: collect, createDataFrame, dapply and gapply.
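As a hedged sketch, in Spark 3.x the Arrow path is switched on through a session configuration. This assumes the arrow R package is installed on the driver and the workers:

```r
library(SparkR)

# Assumption: Spark 3.x with the arrow R package available
sparkR.session(
  master = "local[2]",
  sparkConfig = list(spark.sql.execution.arrow.sparkr.enabled = "true")
)

df <- createDataFrame(mtcars)  # R to Spark
local_df <- collect(df)        # Spark back to R
```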

A decent number of Spark MLlib algorithms is available: 5 algorithms each for classification, regression and clustering, plus collaborative filtering, frequent pattern mining and the Kolmogorov-Smirnov test. Interestingly, SparkR does not expose the Pipelines API available in pyspark. MLlib functions resemble their R counterparts (e.g. spark.kmeans in SparkR vs kmeans in R) and apply several data transformations under the hood. Comparison between modeling in SparkR and pyspark can be found in my previous post.
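For illustration, a k-means model can be fitted with an R-style formula. This is a minimal sketch on the built-in faithful dataset, assuming a local Spark installation:

```r
library(SparkR)
sparkR.session(master = "local[2]")  # assumption: Spark is installed locally

df <- createDataFrame(faithful)

# Formula interface, similar in spirit to modelling functions in base R
model <- spark.kmeans(df, ~ eruptions + waiting, k = 2)

# predict appends a prediction column holding the cluster index
fitted <- predict(model, df)
head(select(fitted, "eruptions", "waiting", "prediction"))
```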

sparklyr

sparklyr is an R package developed by RStudio that provides a complete dplyr backend for Spark, using the same dplyr syntax. This means that switching between environments does not require changing function names. In contrast to SparkR, here we operate on tables (tibbles), which are mapped to Spark DataFrames. We have the copy_to function for moving data from R to Spark and the well-known dplyr functions for data manipulation, like group_by with summarize, filter or select.

Example:

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")  # further connection arguments omitted
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
delay <- flights_tbl %>%
    group_by(tailnum) %>%
    summarize(count = n(), dist = mean(distance), delay = mean(arr_delay)) %>%
    filter(count > 20, dist < 2000, !is.na(delay))

More functions can be found here.

Instead of the SparkSession object used in the official Spark libraries, sparklyr uses the connection returned by spark_connect, which serves as the same entry point to Spark functionality.

Like SparkR, sparklyr is only available for the DataFrame Spark API (no RDD support).

sparklyr natively supports the most popular data formats: csv, json, parquet, text, avro and orc. Extensions can be found for other formats.

In contrast to SparkR, there is just one function for UDFs executing custom R code: spark_apply, which works for both grouped and ungrouped data. It also benefits from Arrow optimizations, together with the collect and copy_to functions, which translate data between R and Spark frames.
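A short sketch of both modes on the built-in mtcars dataset, assuming a local Spark installation. The kpl column and the variable names are made up for this example:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")  # assumption: Spark is installed locally
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# Ungrouped: the function receives each partition as an R data frame
kpl_tbl <- spark_apply(mtcars_tbl, function(df) {
  data.frame(mpg = df$mpg, kpl = df$mpg * 0.425)  # miles per gallon to km per litre
})

# Grouped: the function is called once per value of the grouping column
rows_per_cyl <- spark_apply(mtcars_tbl, function(df) nrow(df), group_by = "cyl")
collect(rows_per_cyl)
```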

sparklyr provides a link to MLlib functionality through 3 families of functions:

ml_* – machine learning algorithms, e.g. ml_linear_regression
ft_* – feature transformers, e.g. ft_string_indexer
sdf_* – data frame manipulations, e.g. sdf_random_split

There are currently 13 modelling functions available. In addition, we can benefit from various data manipulation and feature transforming functions. These closely resemble the approach taken in the pyspark library. In the same manner we can create Pipelines and model evaluators, and run cross-validation. However, functionality still lags behind pyspark in some places.
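To illustrate how the three families fit together, here is a minimal pipeline sketch on the built-in mtcars dataset. The formula, split ratios and seed are arbitrary choices for this example, and a local Spark installation is assumed:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")  # assumption: Spark is installed locally
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# sdf_*: split the Spark DataFrame into training and test sets
splits <- sdf_random_split(mtcars_tbl, training = 0.8, test = 0.2, seed = 42)

# ft_* and ml_*: a feature transformer and an estimator chained in a Pipeline
pipeline <- ml_pipeline(sc) %>%
  ft_r_formula(mpg ~ wt + cyl) %>%
  ml_linear_regression()

model <- ml_fit(pipeline, splits$training)
predictions <- ml_transform(model, splits$test)

# Evaluator: called on a Spark table, it returns the metric value directly
rmse <- ml_regression_evaluator(predictions, label_col = "label", metric_name = "rmse")
```

The structure mirrors a pyspark Pipeline: stages are declared first and only executed when ml_fit is called.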

Summary

Here is a list of the differences between the packages. Ultimately, which package to use depends purely on preference, as both are capable of doing the job.

SparkR	sparklyr
has SparkSession concept	uses spark_connect
follows R syntax	benefits from dplyr syntax
masks some R functions	consistent naming convention
function syntax differs from dplyr	same functions as dplyr
R to Spark DF: as.DataFrame	R to Spark DF: copy_to
for UDFs: dapply, gapply, spark.lapply	for UDFs: spark_apply
ML models: spark.* functions, e.g. spark.kmeans	ML models: ml_* functions, e.g. ml_linear_regression
ML transformations: special functions, e.g. add_months	ML transformations: ft_* functions, e.g. ft_string_indexer
ML: no model validators nor evaluators	ML: model validators and evaluators
List of available functions	List of available functions
