
Tag: Spark

Spark & R – SparkR vs sparklyr

R enthusiasts can benefit from Spark through one of two available libraries – SparkR or sparklyr. The two differ in usage structure and slightly in available functionality. SparkR is the official Spark library, while sparklyr was created by the RStudio community. Because Python is currently the favourite language of Data Scientists using […]

Read more

Spark 3 highlights

Recently Apache Spark 3.1.1 was released. Let’s take a look at some of the new features provided in Spark version 3.   HIGHLIGHTS Adaptive query execution This allows Spark to change the execution plan at runtime, as run statistics are updated. In other words, after some processing steps are already done and stats […]
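To illustrate, adaptive query execution is controlled by a handful of Spark 3 settings. A minimal `spark-defaults.conf` fragment might look like the sketch below (the setting names are real Spark configs; their default values vary between 3.x versions):

```properties
# Turn on adaptive query execution (AQE) and its main sub-features.
spark.sql.adaptive.enabled                      true
# Merge small shuffle partitions at runtime based on observed sizes.
spark.sql.adaptive.coalescePartitions.enabled   true
# Split skewed partitions in joins at runtime.
spark.sql.adaptive.skewJoin.enabled             true
```

The same keys can also be set per session via `spark.conf.set(...)`.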

Read more

SparkR MLlib

When working with the Spark MLlib library you may notice that different features are available in the Python and R APIs. In Python, in addition to models, we can benefit from Transformers, which represent feature transformations that can be applied before modelling. Transformers are also available in sparklyr, but are clearly missing in SparkR. Also […]

Read more

Spark performance tuning

Spark job performance depends heavily on the sort of task you aim to accomplish and the data you are dealing with. Because of that, there is no single magic recipe to follow when creating a job. However, there are several things that impact job execution. Those I consider are: file format selection small data […]

Read more

Spark AI Summit, Amsterdam 2019

Spark AI Summit Europe, which took place in October, was full of interesting content. It focused mostly on features coming with Spark 3, but not exclusively. A preview release of Spark 3 is already available and can be obtained here. A lot of cool features are planned, especially when it comes to making Data Science easier on big data, and with Spark in particular.

Read more

Apache Beam JDBC

With Apache Beam we can connect to different databases – HBase, Cassandra or MongoDB – using specific Beam APIs. There is also JdbcIO for JDBC connections. Here I show how to connect to an MSSQL database using Beam and do some data importing and exporting in a Kerberised environment.

Read more

Apache Beam and HBase

HBase is a NoSQL database which allows you to store data in many different formats (such as pictures, PDFs, text files and others) and provides fast and efficient data lookups. HBase offers two APIs to choose from – the Java API and the HBase Shell. We can also connect HBase to other tools, such as Hive or Phoenix, and use SQL. HBase also integrates with Apache Beam via the HBaseIO transform.

Read more

Apache Beam – getting started

Apache Beam is an open-source unified programming model for defining and executing data processing pipelines, including ETL, batch and stream processing. It is a processing tool which lets you create data pipelines in Java or Python without specifying which engine the code will run on. The same code can therefore run on MapReduce, Spark, Flink, Apex or another engine.

Read more

Spark AI Summit 2018 in San Francisco

Last week I attended the Spark AI Summit in San Francisco. I was really curious to see the difference between conferences in Europe and in the US. I must say the one in SF was definitely bigger: there were 11 different session tracks to attend and around 4,000 people. In terms of organisation and content it was as good as Spark Summit 2016 in Brussels (well, minus the chocolate ;), which I also attended.

Read more