Spark + AI Summit 2018 in San Francisco
Last week I attended the Spark + AI Summit in San Francisco. I was really curious to see how conferences in the US differ from those in Europe. I must say the one in SF was definitely bigger: there were 11 different session tracks to attend and around 4000 people. In terms of organisation and content it was as good as the Spark Summit 2016 in Brussels (well, minus the chocolate ;), which I also attended.
One thing that really stands out is the new name of the conference. It used to be just Spark, now it's Spark and AI. And that relates to the new trends that are emerging. Big data processing is something we more or less have under control at this point. The same goes for the very trendy data science. But how to efficiently combine the two is still a question many people are trying to answer. The tendency towards AI was already clearly visible last year at the DataWorks Summit (check my previous post). Companies are investing in deep learning, for example in image processing or recommendation systems. The challenges faced at the moment are around making AI user friendly: finding a nice way of analysing data of different formats and qualities, choosing the best framework out of many, and productionizing the AI models. Data engineers and data scientists need to work closely together in order to respond to current market demands.
Cool things that have emerged recently in response to the current hype are:
- Project Hydrogen
- Databricks Delta
- BigDL library
- MLflow platform
Spark already provides solid machine learning functionality (MLlib), but it has been lacking a way to integrate deep learning frameworks like TensorFlow, Keras or Caffe2. Project Hydrogen came as an answer to that. It provides the so-called gang scheduler, an alternative to the current Spark scheduler. It either runs all of the tasks concurrently or doesn't run any of them (if one fails, they all fail). This reflects the nature of DL workloads better, where the dependencies between tasks are strong. Simply relaunching just the failed tasks is definitely not enough for many DL problems. The Project Hydrogen API is expected to be added to the core Apache Spark project soon.
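To give a feel for what this looks like in practice, here is a minimal sketch of the barrier execution mode proposed by Project Hydrogen for PySpark. The API was still being finalised at the time, so treat names like `barrier()` and `BarrierTaskContext` as indicative rather than final:

```python
from pyspark import BarrierTaskContext
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hydrogen-sketch").getOrCreate()

def train_partition(iterator):
    # In a barrier stage all tasks are scheduled together (gang scheduling);
    # if any task fails, Spark retries the whole stage, not just that task.
    ctx = BarrierTaskContext.get()
    ctx.barrier()  # block until every task in the stage reaches this point
    # ...launch one worker of a distributed DL training job here...
    yield ctx.partitionId()

# 4 partitions -> 4 tasks, started all together or not at all
rdd = spark.sparkContext.parallelize(range(4), 4)
print(rdd.barrier().mapPartitions(train_partition).collect())
```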
Delta is a unified data management system for large-scale data. Instead of running a big data lake, a data warehouse and a separate streaming system, you can have just Delta. As Databricks puts it, Delta combines the scale and cost-efficiency of a data lake, the reliability and performance of a data warehouse, and the low latency of streaming. It runs on top of Amazon S3.
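From the Spark side, a Delta table is just another DataFrame source and sink. A rough sketch, with made-up S3 paths:

```python
# Write a batch DataFrame out in Delta format (hypothetical paths)
events = spark.read.json("s3://my-bucket/raw/events")
events.write.format("delta").save("s3://my-bucket/delta/events")

# The same table can be read back in batch...
spark.read.format("delta").load("s3://my-bucket/delta/events").show()

# ...or consumed as a stream, which is where the low latency comes in
stream = spark.readStream.format("delta").load("s3://my-bucket/delta/events")
```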
BigDL is a distributed deep learning library for Spark created by Intel. With BigDL, you can write deep learning applications in Scala or Python and leverage the power of Spark.
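A rough sketch of a BigDL training job in Python, assuming a `train_rdd` of BigDL `Sample` objects has already been prepared from your data:

```python
from bigdl.util.common import init_engine
from bigdl.nn.layer import Sequential, Linear, ReLU, LogSoftMax
from bigdl.nn.criterion import ClassNLLCriterion
from bigdl.optim.optimizer import Optimizer, SGD, MaxEpoch

init_engine()  # initialise BigDL on top of the running Spark context

# A small feed-forward classifier, built by stacking layers
model = Sequential()
model.add(Linear(784, 128)).add(ReLU())
model.add(Linear(128, 10)).add(LogSoftMax())

# train_rdd: an RDD of bigdl.util.common.Sample (features + label)
optimizer = Optimizer(model=model,
                      training_rdd=train_rdd,
                      criterion=ClassNLLCriterion(),
                      optim_method=SGD(learningrate=0.01),
                      end_trigger=MaxEpoch(5),
                      batch_size=256)
trained_model = optimizer.optimize()  # distributed training on the executors
```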
Databricks also announced MLflow, an open machine learning platform. It is already available as an alpha release and is open source. It is designed to work with any ML library, algorithm, deployment tool or language. Currently it consists of 3 components:
- tracking – for logging parameters, code versions, metrics and output files (see the sketch after this list)
- projects – which allows packaging reusable data science code
- models – convention for packaging machine learning models in multiple formats
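To give a taste of the tracking component, here is a minimal sketch of logging a run with the MLflow Python API (the parameter and metric names are made up):

```python
import mlflow

with mlflow.start_run():
    # log the hyperparameters of this experiment run
    mlflow.log_param("alpha", 0.5)
    mlflow.log_param("max_depth", 8)
    # log the resulting quality metric
    mlflow.log_metric("rmse", 0.78)
    # attach an output file produced by the run
    mlflow.log_artifact("model_summary.txt")
```

Running `mlflow ui` afterwards starts a local web UI where the logged runs can be browsed and compared.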
For other updates, you can check the recorded sessions once they're made available online. I definitely recommend watching: