

HBase Java API with Oozie

One of the ways to access HBase is through the Java API. This can be done in several ways, depending on the use case and the tools involved. Here’s how to do it with an Oozie Java action and with a Pig action whose UDF performs lookups in HBase.


Pig HBase lookups

Pig can read data from and write data to HBase, as I described here. Additionally, we can use a Pig UDF to work with data in HBase, for example to retrieve values for a given key. There is one difficulty though: ZooKeeper caps the number of concurrent connections made to HBase, and if our application exceeds that limit, the whole job will simply fail.
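
One common way to stay under such a limit is to open a single connection per JVM and reuse it for every UDF call, instead of opening one per record. Below is a minimal sketch of that idea in plain Java. It is an assumption on my side rather than a fixed recipe: `LookupConnection`, `ConnectionHolder` and `openFake` are hypothetical stand-ins, not the real HBase client API, and the connection counter is there only to show how many times a connection gets opened.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class SharedConnectionSketch {

    /** Hypothetical stand-in for an HBase connection; NOT the real client API. */
    public interface LookupConnection {
        String get(String rowKey);
    }

    /** Lazily opens one connection per JVM and hands the same instance to every caller. */
    public static final class ConnectionHolder {
        private static LookupConnection shared;

        public static synchronized LookupConnection getOrOpen(Supplier<LookupConnection> opener) {
            if (shared == null) {
                shared = opener.get(); // runs only on the first call
            }
            return shared;
        }
    }

    /** Counts how many connections were opened, for illustration only. */
    public static final AtomicInteger OPENED = new AtomicInteger();

    /** Hypothetical opener; a real UDF would create the HBase connection here. */
    public static LookupConnection openFake() {
        OPENED.incrementAndGet();
        return rowKey -> "value-for-" + rowKey;
    }

    /** Simulates the body of a UDF that is invoked once per input record. */
    public static String lookup(String rowKey) {
        return ConnectionHolder.getOrOpen(SharedConnectionSketch::openFake).get(rowKey);
    }

    public static void main(String[] args) {
        for (String key : new String[] {"a", "b", "c"}) {
            System.out.println(lookup(key));
        }
        System.out.println("connections opened: " + OPENED.get());
    }
}
```

With this pattern, the number of connections grows with the number of task JVMs rather than with the number of records processed.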


Hadoop and Spark shuffling

Shuffling is generally the most costly operation that we encounter in Hadoop and Spark processing. It has a huge impact on performance and can become a bottleneck when our data requires a lot of grouping or joining. That’s why I think it’s worth spending a while to understand how both of those engines handle the shuffle.
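
To make the cost a bit more concrete, here is a minimal sketch of the routing step behind a shuffle: every record is assigned to a partition based on its key, so all records with the same key end up in the same place. The `partitionFor` formula is the one used by Hadoop's default `HashPartitioner`; Spark's default hash partitioner follows the same idea. The class name, the `route` helper and the sample keys are made up for illustration, and the sketch leaves out sorting, spilling and the network transfer itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ShufflePartitionSketch {

    /** Non-negative hash of the key modulo the partition count, as in Hadoop's HashPartitioner. */
    public static int partitionFor(Object key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /** Groups keys by their target partition, i.e. by the task that will receive them. */
    public static Map<Integer, List<String>> route(List<String> keys, int numPartitions) {
        Map<Integer, List<String>> byPartition = new TreeMap<>();
        for (String key : keys) {
            int partition = partitionFor(key, numPartitions);
            byPartition.computeIfAbsent(partition, p -> new ArrayList<>()).add(key);
        }
        return byPartition;
    }

    public static void main(String[] args) {
        // Illustrative keys only; in a real job these would be the grouping or join keys.
        List<String> keys = Arrays.asList("user1", "user2", "user1", "user3", "user2");
        route(keys, 3).forEach((partition, ks) ->
                System.out.println("partition " + partition + " <- " + ks));
    }
}
```

Because the target partition depends only on the key, every grouping or join forces records to move between tasks, which is where most of the shuffle cost comes from.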


Spark 2 APIs

Spark 2.0 brought some changes to the API: the Dataset and the DataFrame were linked together. Now DataFrame = Dataset[Row] (in Scala and Java), where Row is an untyped generic object representing a table-like record with a schema. That doesn’t mean the DataFrame itself was dropped. It’s still the main abstraction for MLlib, SparkR and Python. The Dataset is currently available only in Scala and Java, as it’s a strongly typed abstraction. It’s easy to switch between the untyped DataFrame and the typed Dataset.


Sqoop with HCatalog and Oozie

Sqoop can use HCatalog to import data directly into Hive tables and export data from them. It uses HCatalog to read a table’s structure, data formats and partitions, and then imports or exports the data accordingly. It’s a very useful combination for moving data efficiently, but it requires matching column names on both sides. Here’s how to make Sqoop with HCatalog work through Oozie.


Pig Java UDF

We can define Pig UDFs in several languages: Java, Jython, JavaScript, Ruby, Groovy and Python. Java currently offers the most options, though, so I’ll stick to it in this post.
