Data Side of Life

Spark performance tuning

Alice
Tags: Spark
0

Spark job performance is heavily dependent on the sort of task you aim to accomplish and data you’re dealing with. Because of that there is no one magic recipe to follow when creating a job. However there are several things that impact a job execution. Those which I consider are: file format selection small data […]

Alice
Tags: Forecasting, R
45

It turns out that deep learning, with all its power, can also help with forecasting. Especially the LSTM (Long Short Term Memory) model, which proved to be useful while solving problems involving sequences with autocorrelation.

Alice
Tags: Forecasting, R, Xgb
24

xgboost, or Extreme Gradient Boosting is a very convenient algorithm that can be used to solve regression and classification problems. You can check may previous post to learn more about it. It turns out we can also benefit from xgboost while doing time series predictions.

Alice
Tags: Impala, JDBC
0

The idea was to use Java locally (in my case with InteliJ) to connect to Hive metastore through Impala. That was in order to read some data and then be able to use them by some other processes on later stages. Hadoop cluster that I was connecting to was Kerberised, which made the exercise more tricky. Here’s how I managed to establish such a connection.

Spark AI Summit, Amsterdam 2019

Spark AI Summit Europe, which happened in October, was full of interesting stuff. It was mostly focused on features coming with Spark 3, but not only. Preview release of Spark 3 is already available and can be obtained here. There are a lot of cool features planned, especially when it comes to making Data Science easier on big data and with Spark in particular.

Alice
Tags: AI
0

How to put AI projects in production – that’s a topic that since a while occupies heads of many data scientists and plays a key role on many international conferences. Unfortunately there is no one clear answer to that, as the variety of topics that we nowadays name as AI is huge.

Alice
Tags: Forecasting, R
0

Data forecasting is a process of estimating the future based on historical values. It is described by time series, which is simply a series of time dependent data points. We usually forecast different costs or sales over time. We can try to predict weather conditions or model stock changes. Basically look at any process that can be described as time dependent with certain time interval (hourly, daily, monthly…).

Alice
Tags: Forecasting, R
0

R has a nice package support when it comes to forecasting. There is the forecast package which allows you to build several types of models

Alice
Tags: R
0

Many times there is a demand from end users to deliver results in a form of Excel files. The bigger and more complex the files grow, the more difficult it is to properly interpret them. This is when nice formatting can help you out.

XGBoost

XGBoost or in long version Extreme Gradient Boosting got recently very popular, especially on Kaggle competitions. It proved to outperform many other algorithms on tasks such as classification and regression. I used it few times as well and that’s why I decided to take a closer look into XGBoost to see how it works.

« Previous

Data stories and processing

Spark performance tuning

lstm time series prediction in R

xgboost time series forecast in R

Impala JDBC connection with Kerberos

Spark AI Summit, Amsterdam 2019

AI in production

Data forecasting

Forecasting pipeline in R

Excel formatting with R

XGBoost