

Targets R package for managing workflows

Workflows help us keep a clear structure of the flow we are building, make individual steps easier to trace, and simplify maintenance. They are especially useful in data science work, where heavy computations take time to run. In the R world, the first popular package for dealing with pipelines was drake. It allows not only […]


Memory leakage while plotting in a loop

Issue: memory leakage while generating Python matplotlib plots in a loop on macOS. I was using Python 3.9 and macOS Catalina. I was trying to generate lots of plots for my analysis. The idea was to create them in a loop: render the plot, save the output, iterate further. Simple example of the case:
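The excerpt stops before the original example, so here is a minimal sketch of the render, save, iterate loop it describes. It is not the post's own code: the function name `save_plots`, the plotted data, the file names and the `Agg` backend are all assumptions made for illustration. Closing each figure after saving is a general matplotlib habit and may or may not be the fix the post arrives at.

```python
import tempfile
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt


def save_plots(n_plots, out_dir):
    """Render, save and close n_plots figures; return the saved paths.

    Hypothetical helper for illustration, not taken from the post.
    """
    out_dir = Path(out_dir)
    paths = []
    for i in range(n_plots):
        # Render the plot (placeholder data).
        fig, ax = plt.subplots()
        xs = list(range(10))
        ax.plot(xs, [x * (i + 1) for x in xs])

        # Save the output.
        path = out_dir / f"plot_{i}.png"
        fig.savefig(path)
        paths.append(path)

        # pyplot keeps every figure registered until it is closed,
        # so release it before iterating further.
        plt.close(fig)
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        saved = save_plots(5, tmp)
        print(f"saved {len(saved)} plots, {len(plt.get_fignums())} figures still open")
```

Without the `plt.close(fig)` call, `plt.get_fignums()` would keep growing with every iteration, which is one common way for memory use to climb in a plotting loop.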



Spark & R – SparkR vs sparklyr

R enthusiasts can benefit from Spark using one of two available libraries – SparkR or sparklyr. They differ in usage structure and, slightly, in available functionality. SparkR is an official Spark library, while sparklyr is created by the RStudio community. Because Python is currently the favourite language for Data Scientists using […]


Spark 3 highlights

Apache Spark 3.1.1 was released recently. Let’s take a look at some of the new features provided in Spark version 3. Highlights: Adaptive query execution. This allows Spark to change the execution plan at runtime, as run statistics are updated. In other words, after some processing steps are already done and stats […]


SparkR MLlib

When working with the Spark MLlib library, you may notice that different features are available in the Python and R APIs. In Python, in addition to models, we can benefit from Transformers, which represent feature transformations that can be done before modelling. Transformers are also available in sparklyr, but are clearly missing in SparkR. Also […]


Spark performance tuning

Spark job performance depends heavily on the sort of task you aim to accomplish and the data you’re dealing with. Because of that, there is no single magic recipe to follow when creating a job. However, several things do impact job execution. Those I consider are: file format selection, small data […]


xgboost time series forecast in R

xgboost, or Extreme Gradient Boosting, is a very convenient algorithm that can be used to solve regression and classification problems. You can check my previous post to learn more about it. It turns out we can also benefit from xgboost when doing time series predictions.


Impala JDBC connection with Kerberos

The idea was to use Java locally (in my case with IntelliJ) to connect to the Hive metastore through Impala. The goal was to read some data and then make it available to other processes at later stages. The Hadoop cluster I was connecting to was Kerberised, which made the exercise trickier. Here’s how I managed to establish such a connection.


Spark AI Summit, Amsterdam 2019

Spark AI Summit Europe, which took place in October, was full of interesting material. It focused mostly, though not exclusively, on features coming with Spark 3. A preview release of Spark 3 is already available and can be obtained here. There are a lot of cool features planned, especially when it comes to making Data Science easier on big data, and with Spark in particular.
