Spark AI Summit, Amsterdam 2019
Spark AI Summit Europe, which took place in October, was full of interesting stuff. It focused mostly, but not exclusively, on features coming with Spark 3. A preview release of Spark 3 is already available and can be obtained here. There are a lot of cool features planned, especially when it comes to making data science easier on big data and with Spark in particular. Let’s take a look.
SPARK 3
Spark Graph
Spark Graph is a new module to be added to Spark for graph processing. It is being built on top of the already existing Spark SQL API, i.e. on DataFrames. This is a big difference from the existing GraphX module, whose abstractions are extensions of RDDs. Spark Graph aims to provide Cypher queries in a similar manner to how we can currently use SQL on DataFrames, which will be a nice add-on to GraphFrames. Additionally, it will benefit from Catalyst optimizations. The Spark Graph concept is based on the commonly known and used Property Graph Model (combining nodes and relationships). There’s a nice video with the preview which you can check out.
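To make the Property Graph Model idea more concrete, here is a minimal sketch of a graph expressed as plain DataFrames. The column names and the commented-out Cypher entry point are illustrative assumptions only, since the actual Spark Graph API was still being finalized at the time of writing:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("property-graph-sketch").getOrCreate()

# Nodes and relationships as two DataFrames - the core of the Property Graph Model.
# Column names are just an illustrative convention, not the final API.
nodes = spark.createDataFrame(
    [(0, "Person", "Alice"), (1, "Person", "Bob")],
    ["id", "label", "name"],
)
relationships = spark.createDataFrame(
    [(0, 1, "KNOWS")],
    ["src", "dst", "relType"],
)

# A Cypher query over such a graph might then look like this (hypothetical
# entry point, shown only to give a feel for the query language):
# graph = ...build a property graph from nodes and relationships...
# graph.cypher("MATCH (a:Person)-[:KNOWS]->(b:Person) RETURN a.name, b.name").df.show()
```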
DataSourceV2 API
Some improvements to the DataSource API are also planned. These include:
- a unified API for streaming and batch processing, so that we can easily combine both in our Spark jobs or switch from one to the other.
- pluggable catalog integration, which allows Spark not only to read but also to discover particular data sources (see the configuration sketch after this list).
- improved pushdown, which means faster queries because only the data necessary for processing is fetched.
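As a rough illustration of the pluggable catalog idea: Spark 3 lets you register extra catalogs via configuration and address their tables with three-part names. The catalog name and the implementation class below are hypothetical placeholders for whatever plugin you would actually use:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("catalog-demo")
    # Register a custom catalog under the name "mycat"; the class name is a
    # hypothetical placeholder for a real DataSourceV2 catalog implementation.
    .config("spark.sql.catalog.mycat", "com.example.MyCatalogPlugin")
    .getOrCreate()
)

# Tables exposed by that catalog can then be referenced with three-part names:
spark.sql("SELECT * FROM mycat.db.events").show()
```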
Improved optimizer
Databricks promised several optimizations in Spark processing itself. They announced:
- dynamic switching of the join strategy. At runtime Spark will be able to determine whether it is beneficial to switch from a sort-merge join to a broadcast join and, if so, change the execution plan on the fly.
- dynamic partition pruning. If a filter is applied to the smaller table before a join, it will, where possible, also be transferred to the bigger one in order to limit the data that has to be scanned. Both features are expected to sit behind configuration flags, as sketched below.
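A minimal sketch of how these could be toggled, assuming an existing SparkSession named spark; the property names come from the Spark 3 preview documentation, so treat them as indicative rather than final:

```python
# Adaptive query execution covers the runtime switch of the join strategy.
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Dynamic partition pruning sits behind its own optimizer flag.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
```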
GPU support
So far Spark has not offered much in terms of deep learning support. With the option to leverage GPUs we will be able to run many more algorithms, especially those that require all data partitions to be available at once for processing. That opens the door to extending the spark.ml package with DL algorithms.
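As a rough sketch of how GPU resources could be requested (property names as described in the Spark 3 preview material, discovery-script path purely illustrative):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("gpu-demo")
    # Ask for one GPU per executor and let four tasks share it.
    .config("spark.executor.resource.gpu.amount", "1")
    .config("spark.task.resource.gpu.amount", "0.25")
    # A discovery script tells Spark how to find GPUs on each node.
    .config("spark.executor.resource.gpu.discoveryScript", "/opt/spark/getGpus.sh")
    .getOrCreate()
)
```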
Better support for Kubernetes
We can already use Kubernetes as the Spark cluster manager in the current Spark version (2.4.4), but not all native Spark features are supported. We can expect this integration to improve with new releases, for example with a new shuffle service.
What is going to be removed
For the sake of maintainability, Python 2 will no longer be supported in Spark 3. So keep that in mind if you are still using it, and migrate your code to Python 3 as soon as possible.
Furthermore, the RDD-based MLlib API will no longer be maintained and supported. Switch to the newer ml package, which is based on the DataFrame API instead.
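A minimal sketch of the DataFrame-based API, assuming train_df is a DataFrame with feature columns f1 and f2 and a label column:

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

# Assemble raw feature columns into the single vector column that ml expects.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
assembled_df = assembler.transform(train_df)

# DataFrame-based replacement for the old RDD-based mllib classifiers.
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(assembled_df)
print(model.coefficients)
```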
And I am pretty sure that’s not it. But we’ll have to wait and see.
These were some highlights of what is coming in Spark 3. You can also check out the video with the official announcement of some of the features. In addition, there are three open-sourced Databricks products worth looking into.
Koalas
Anyone who has worked with PySpark has most certainly used the Pandas Python library. You can easily transform your Spark DataFrame into a Pandas DataFrame by simply calling the toPandas() method. By doing that we gain an entry point to other Python libraries for modeling or plotting the data. The drawback is that a Pandas DataFrame needs to fit into the memory of a single node of the cluster. For exploratory analysis that is fine, but then you need to maintain a separate piece of code for anything that has to scale. That is where Koalas comes to the rescue. Koalas bridges the gap between Spark and Pandas: a Koalas DataFrame can be treated as the Pandas DataFrame API on Spark. It looks like and provides the same capabilities as a Pandas DF. The same code you run on Pandas now works on Koalas, with the addition of scaling out to a distributed environment. There is no need to decide whether to use Pandas or PySpark for a given dataset. Simply call to_koalas on your Spark DF and you’re ready to go. Just keep in mind that Koalas does not target 100% compatibility with either Pandas or PySpark, so some workarounds may be needed. You can check out the docs for more info.
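A minimal sketch of the Koalas workflow; the dataset path and column names are made up for illustration:

```python
import databricks.koalas as ks  # pip install koalas

# Pandas-style API, but every operation is executed by Spark under the hood.
kdf = ks.read_csv("/data/orders.csv")
kdf["total"] = kdf["price"] * kdf["quantity"]
print(kdf.groupby("country")["total"].sum().head())

# An existing Spark DataFrame can be converted too - importing koalas
# adds a to_koalas() method to it:
# kdf = spark_df.to_koalas()
```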
Delta Lake
Delta Lake is an open-source storage layer which provides several features for managing big data in a similar manner to how we do it in databases. It sits on top of Spark and provides ACID transactions, data versioning and schema enforcement. It also allows for data updates and deletions. In practice this means writing new versions of the tables with the changes applied, so that in the meantime you can still access the previous ones.
Moreover, Delta works in the same manner for streaming and batch data and can be used as easily as any other data format. So instead of telling Spark to store the data as Parquet, you store it as delta (which is still Parquet underneath) and gain all of the mentioned capabilities. Delta keeps a transaction log to track all the changes applied to the data. It also versions the data so that you can access historical versions or, as the creators say, go back in time. You can keep as many versions as your storage system allows (by default the history is retained for 7 days).
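In code, the switch is mostly a matter of changing the format string. A minimal sketch, assuming an existing SparkSession (spark), a DataFrame (df) and an illustrative path:

```python
# Write the DataFrame as a Delta table instead of plain Parquet.
df.write.format("delta").mode("overwrite").save("/data/events_delta")

# Read the current version back.
current = spark.read.format("delta").load("/data/events_delta")

# "Go back in time" to an earlier version of the same table.
version_zero = (
    spark.read.format("delta")
    .option("versionAsOf", 0)
    .load("/data/events_delta")
)
```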
Delta only works on a table level; it does not support multi-table transactions or foreign keys. It is available through Scala, Java and Python APIs. More info can be found here.
MLflow
MLflow is an open-source platform for managing the machine learning lifecycle. It has been around for a while now, so I also took a look at it. Its aim is to allow for easy and efficient ML model preparation and monitoring, so that your primary focus can stay on the model itself. It provides support for commonly used ML libraries and tools: you can use it with R, Python, TensorFlow, Java, Docker, Conda and some others. It consists of 3 components:
- Tracking – an API for logging and tracking information about modeling experiments. With each run you can track your model parameters, timings, metrics and output files using simple functions (see the sketch after this list). These can be stored in files, in a database or on a tracking server. You can use it from R, Python, Java or the REST API, and later query the results from a web UI to compare your experiments.
- Projects – a standard format for packaging reusable data science code as a set of directories. A project is defined using a YAML file where you specify the environment (Conda, Docker, your system env), entry points for invoking commands and named parameters to run. MLflow then takes care of setting up the environment and running the project.
- Models – a standard format for packaging machine learning models for further use. When you save a particular model, it is saved along with so-called flavours – descriptions of how the model can be run by supported deployment tools. Models can then easily be deployed using MLflow. They can run as Python or R functions, or through tools like H2O, or with libraries like Keras, TensorFlow, PyTorch and others.
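To give a feel for the Tracking API, here is a minimal Python sketch; the parameter and metric names and the artifact file are made up for illustration:

```python
import mlflow

with mlflow.start_run():
    # Everything logged inside this block is attached to a single run.
    mlflow.log_param("max_depth", 5)
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("rmse", 0.73)
    mlflow.log_artifact("model_summary.txt")  # any output file you produced

# Runs can then be browsed and compared in the web UI started with `mlflow ui`.
```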