Data stories and processing

Apache Beam JDBC

With Apache Beam we can connect to different databases – HBase, Cassandra, MongoDB – using dedicated Beam connectors. There is also JdbcIO for JDBC connections. Here I show how to connect to an MSSQL database with Beam and do some data importing and exporting in a Kerberised environment.
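For a rough idea of what the JdbcIO connector looks like, here is a minimal read sketch – the host, database, table and credentials are placeholders, and the Kerberos setup described in the post is omitted:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.coders.KvCoder;
    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.io.jdbc.JdbcIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.values.KV;

    public class MssqlRead {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Read rows from an MSSQL table through the standard SQL Server JDBC driver.
        // Host, database, table and credentials below are placeholders.
        p.apply(JdbcIO.<KV<String, String>>read()
            .withDataSourceConfiguration(
                JdbcIO.DataSourceConfiguration.create(
                    "com.microsoft.sqlserver.jdbc.SQLServerDriver",
                    "jdbc:sqlserver://dbhost:1433;databaseName=mydb")
                  .withUsername("user")
                  .withPassword("secret"))
            .withQuery("SELECT id, name FROM dbo.customers")
            .withCoder(KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of()))
            .withRowMapper(rs -> KV.of(rs.getString("id"), rs.getString("name"))));

        p.run().waitUntilFinish();
      }
    }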

Read more

Apache Beam and HBase

HBase is a NoSQL database which allows you to store data in many different formats (like pictures, PDFs, text files and others) and gives you fast and efficient data lookups. HBase has two APIs to choose from – the Java API and the HBase Shell. We can also connect HBase to tools like Hive or Phoenix and use SQL. HBase also integrates with Apache Beam via the HBaseIO transform.
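As a quick illustration of that integration, a minimal HBaseIO read could look roughly like this – the ZooKeeper quorum and table name are placeholders:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.hbase.HBaseIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class HBaseRead {
      public static void main(String[] args) {
        // Point the client at the HBase cluster via ZooKeeper
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk-host");

        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // HBaseIO.read() produces a PCollection of HBase Result objects
        p.apply(HBaseIO.read()
            .withConfiguration(conf)
            .withTableId("my_table"));

        p.run().waitUntilFinish();
      }
    }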

Read more

Apache Beam – getting started

Apache Beam is an open-source, unified programming model for defining and executing data processing pipelines, including ETL, batch and stream processing. It lets you create data pipelines in Java or Python without specifying which engine the code will run on, so the same code can run on MapReduce, Spark, Flink, Apex or another engine.
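A minimal sketch of what such a pipeline looks like in Java – the file names are placeholders, and the runner is picked at launch time with the --runner option rather than in code:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class UppercasePipeline {
      public static void main(String[] args) {
        // The runner (Direct, Spark, Flink, ...) comes from the --runner option
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply(TextIO.read().from("input.txt"))
         .apply(MapElements.into(TypeDescriptors.strings())
             .via((String line) -> line.toUpperCase()))
         .apply(TextIO.write().to("output"));

        p.run().waitUntilFinish();
      }
    }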

Read more

Machine Learning with Java libraries

When I think about machine learning, R and Python come to mind. Both languages have a nice set of ML libraries and packages for analysing and visualising data. When it comes to Java, there aren't that many ML libraries. Of course there are nice Java frameworks, but they are mostly designed so that you don't actually do the coding yourself. So how can a Java programmer easily incorporate ML into their application? I used two libraries which allowed me to do exactly that.

Read more

Spark AI summit 2018 in San Francisco

Last week I attended the Spark AI summit in San Francisco. I was really curious to see how conferences in the US differ from those in Europe. I must say the one in SF was definitely bigger: there were 11 different session tracks and around 4000 attendees. In terms of organisation and content it was as good as the Spark summit 2016 in Brussels (well, minus the chocolate ;), which I also attended.

Read more

Spark and Hive

Spark gives us the ability to use SQL for data processing. With that we can connect over JDBC and ODBC to pretty much any database, or use structured data formats like Avro, Parquet and ORC. We can also connect to Hive and use all the structures we have there. In Spark 2.0 the entry points to SQL (SQLContext) and Hive (HiveContext) were replaced by a single object – SparkSession. SparkSession allows you to read from and write to Hive, use HiveQL and Hive UDFs.
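In Java that single entry point looks roughly like this – the table names are made up for the example:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class HiveExample {
      public static void main(String[] args) {
        // enableHiveSupport() gives the session access to the Hive metastore,
        // HiveQL and Hive UDFs - the role HiveContext used to play before 2.0
        SparkSession spark = SparkSession.builder()
            .appName("hive-example")
            .enableHiveSupport()
            .getOrCreate();

        Dataset<Row> salesByRegion =
            spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region");

        // Write the result back as a Hive table
        salesByRegion.write().saveAsTable("sales_by_region");
      }
    }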

Read more

HBase Java API with Oozie

One of the ways to access HBase is through its Java API. This can be done in multiple ways, depending on the use case and the tools involved. Here's how to achieve it with an Oozie Java action and with a Pig action whose UDF does lookups in HBase.
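For context, a plain HBase Java API lookup – the kind of code such an Oozie Java action would run – looks more or less like this; table, column family and qualifier are placeholders:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseLookupExample {
      public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        // The Connection is heavyweight; Table handles are cheap and short-lived
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("my_table"))) {
          Result result = table.get(new Get(Bytes.toBytes("row-key")));
          byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"));
          System.out.println(value == null ? "not found" : Bytes.toString(value));
        }
      }
    }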

Read more

Pig HBase lookups

Pig can nicely read data from and write it to HBase, as I described here. Additionally, we can use a Pig UDF to work with data in HBase – for example retrieving values for a given key. There is one difficulty though – ZooKeeper limits the number of concurrent connections to HBase, and if our application exceeds that limit, the whole job will simply fail.
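A sketch of such a lookup UDF, reusing one connection across calls so the job stays under the ZooKeeper connection limit – the table and column names are placeholders:

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    public class HBaseLookup extends EvalFunc<String> {
      // One connection per task instead of one per call keeps ZooKeeper happy
      private Connection connection;

      @Override
      public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
          return null;
        }
        if (connection == null) {
          connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
        }
        try (Table table = connection.getTable(TableName.valueOf("lookup_table"))) {
          Result result = table.get(new Get(Bytes.toBytes(input.get(0).toString())));
          byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"));
          return value == null ? null : Bytes.toString(value);
        }
      }
    }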

Read more

Hadoop and Spark shuffling

Shuffling is generally the most costly operation we encounter in Hadoop and Spark processing. It has a huge impact on performance and can become a bottleneck when our data requires a lot of grouping or joining. That's why I think it's worth spending a while to understand how shuffling is handled by each of these engines.
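As a small illustration of where a shuffle shows up in Spark, the example below counts events per key; reduceByKey pre-aggregates values on the map side, so much less data crosses the network than with groupByKey. The file name and key layout are made up:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class ShuffleDemo {
      public static void main(String[] args) {
        JavaSparkContext sc =
            new JavaSparkContext(new SparkConf().setAppName("shuffle-demo"));

        // Turn each line into (key, 1); the key is the first CSV column
        JavaPairRDD<String, Integer> pairs = sc.textFile("events.csv")
            .mapToPair(line -> new Tuple2<>(line.split(",")[0], 1));

        // reduceByKey combines values per partition before the shuffle,
        // unlike groupByKey which ships every single record over the network
        JavaPairRDD<String, Integer> counts = pairs.reduceByKey(Integer::sum);

        counts.saveAsTextFile("counts");
        sc.stop();
      }
    }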

Read more