SparkR MLlib
When working with the Spark MLlib library you may notice that different features are available in the Python and R APIs. In Python, in addition to models, we can benefit from Transformers, which represent feature transformations applied before modeling. Transformers are also available in sparklyr, but are clearly missing in SparkR. The way models are constructed and called also differs in SparkR: it follows the standard R modeling pattern, where we build a model by supplying an R formula and use the predict function to obtain predictions.
Let’s compare Random Forest modeling in SparkR and PySpark.
PySpark:
```python
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=10)
rfModel = rf.fit(trainData)
predictions = rfModel.transform(testData)
```
In order to build a model we need to provide the names of the label and features columns in our training dataframe. Interestingly, the features column actually needs to be a single vector column combining all the features we want to use for modeling, and the label needs to be a column of type double. As data rarely comes in such a form, we need to manually apply transformers to bring it into the required format.
For that purpose we may use:
- VectorAssembler – which combines a given list of columns into a single vector column. Useful for combining feature variables.
```python
assembler = VectorAssembler(inputCols=["owns_car", "age"], outputCol="features")
output = assembler.transform(data)
```


- StringIndexer – which converts categorical values into category indices, ordered by frequencies, starting with 0. Can be used to transform a label column.
```python
indexer = StringIndexer(inputCol="name", outputCol="index_name")
output = indexer.fit(data).transform(data)
```

- OneHotEncoder – which maps a column of category indices to a column of binary vectors, with at most a single one-value per row indicating the input category index. Typically applied after StringIndexer to turn categorical columns into numeric vectors.
```python
encoder = OneHotEncoder(inputCol="index_name", outputCol="coded_name")
c_output = encoder.transform(output)
```

Interestingly, SparkR modeling is much simpler:
```r
rfModel <- spark.randomForest(trainData, label ~ feature1 + feature2, "classification", numTrees = 10)
predictions <- predict(rfModel, testData)
```
Here we do not need to apply any particular data transformations; we just specify the dependency between the label and feature columns using an R formula. Why is that?
Actually even though we do not have ML transformers available in SparkR, they are automatically called when building a model. To be exact:
SparkR spark.xxx models, using R formula, automatically invoke StringIndexer and OneHotEncoder on all String columns, convert Numeric columns to Doubles and combine them into a single vector using VectorAssembler.
Other SparkR models may invoke different transformers instead.
Such automatic data conversion can be observed on a simple example.
Example
Let’s run k-means models on the R built-in dataset Titanic, both in R and SparkR. As this dataset comes as a table, it needs to be converted into a data frame (e.g. with as.data.frame) for modeling. Let’s divide the data into 3 clusters, using the Class, Sex, Age and Freq columns.

SparkR:
```r
# Create a Spark dataframe
data <- createDataFrame(as.data.frame(Titanic))

# Fit a k-means model with spark.kmeans
kmeansModel <- spark.kmeans(data, ~ Class + Sex + Age + Freq, k = 3)

# Model clusters
showDF(summary(kmeansModel)$cluster, 6)
```
Obtained clusters:

Now pure R:
```r
library(data.table)
library(mltools)

# Apply one-hot encoding to the factor columns
data <- one_hot(as.data.table(
  as.data.frame(Titanic)[, c("Class", "Sex", "Age", "Freq")]))

# Fit a k-means model with base R kmeans
kmeansModel <- kmeans(as.data.frame(data), centers = 3)

# Model clusters
kmeansModel$cluster
```
Obtained clusters:

As you can see, I had to transform the string columns into numeric ones, as R's kmeans cannot handle categorical features. If we try to run kmeans without any transformation, it fails with the error "NAs introduced by coercion". In SparkR this transformation was performed automatically.