Skip to content

SparkR MLlib

SparkR MLlib

When working with Spark MLlib library you may notice that there are different features available in Python and R APIs. In Python, in addition to models, we can benefit from Transformers, which represent feature transformations that can be done before the modelling. Transformers are also available in sparklyr, but are clearly missing in SparkR. Also the way models are constructed and called differs in SparkR. In this regard it follows the standard R modeling pattern, where we build a model by calling R formula and use predict function to obtain the prediction.


Let’s compare Random Forest modeling in SparkR and PySpark.


In order to build a model we need to provide the names of label and features columns in our train dataframe. Interestingly features column actually needs to be a vector combined of all the features we want to use for modeling, and a label needs to be a column of type double. As data rarely comes in such a form, we need to manually apply transformers to the data to come up with the required format.

For that purpose we may use:

  • VectorAssembler – which combines a given list of columns into a single vector column. Useful for combining feature variables.
original data
output – after using VectorAssembler on owns_car and age columns
  • StringIndexer – which converts categorical values into category indices, ordered by frequencies, starting with 0. Can be used to transform a label column.
output – data after applying StringIndexer on name column
  • OneHotEncoder – which maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index. Used to change string columns into numeric ones.
c_output – index_name column coded into binary vectors

Interestingly SparkR modeling is much easier:

Here we do not need to apply any particular data transformations, we just specify the dependency between label and feature columns, using the R formula. Why is that?

Actually even though we do not have ML transformers available in SparkR, they are automatically called when building a model. To be exact:

SparkR models, using R formula, automatically invoke StringIndexer and OneHotEncoder on all String columns, convert Numeric columns to Doubles and combine them into a single vector using VectorAssembler.

Different SparkR models may be calling some other transformers instead.

Such automatic data conversion can be observed on a simple example.


Let’s run kmeans models on R build-in dataset Titanic, both in R and SparkR. As this dataset comes as a table, it needs to be converted into a dataframe (i.e. with for modeling. Let’s divide this data into 3 clusters, using Class, Sex, Age and Freq columns.


Obtained clusters:

Now pure R:

Obtained clusters:

As you can see I had to transform string columns into numeric, as R kmeans cannot run for categorical features. If we try to run kmeans without any transformation it fails with error: NAs introduced by coercion.

Such modification was done automatically in SparkR.


Leave a Reply

Your email address will not be published. Required fields are marked *