
Hadoop troubleshooting & tricks


Over the last few years of working with Hadoop I have spent a lot of time dealing with issues and hunting for tricks and workarounds. That involved a lot of searching, reading and, mostly, a trial-and-error approach. That’s why I decided to share some of the solutions I found and tried. Maybe you’ll find them helpful at some point.


TROUBLESHOOTING

I was working with CDH-5.5.4 and CDH-5.8.4.

 

Pig

1. ERROR:

SITUATION:
While doing a JOIN followed by a FOREACH. Everything works fine up to the JOIN clause.

SOLUTION:
Pig is sensitive to case and to column names. When you pull data into HDFS (e.g. with Sqoop), the schema and column names are preserved in the metadata. You can read this data with a different schema and it will work for some operations, like filtering. But the combination of JOIN and FOREACH may fail if the column names differ from the ones in the metadata, and that includes differences between small and capital letters.
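
A minimal sketch of the fix, assuming two hypothetical tables read through HCatalog whose column names are stored in the metadata in lower case:

    -- Table and column names are made up for illustration.
    users  = LOAD 'mydb.users'  USING org.apache.hive.hcatalog.pig.HCatLoader();
    orders = LOAD 'mydb.orders' USING org.apache.hive.hcatalog.pig.HCatLoader();

    joined = JOIN users BY id, orders BY user_id;

    -- May fail: 'ID' does not match the 'id' stored in the metadata.
    -- bad = FOREACH joined GENERATE users::ID, orders::amount;

    -- Works: the names match the metadata exactly, including case.
    result = FOREACH joined GENERATE users::id, orders::amount;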
 
2. ERROR:

SITUATION:
While reading, through HCatalog, data which is partitioned on a column with a data type other than string.

SOLUTION:
Pig can read through HCatalog only those tables which are partitioned on columns with the string data type. There’s no escaping that.
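
So if a table is meant to be read from Pig, partition it on a string column, as in this hypothetical Hive DDL:

    -- Partitioning on e.g. a DATE or INT column would make the table
    -- unreadable for Pig through HCatalog.
    CREATE TABLE events (
      id BIGINT,
      payload STRING
    )
    PARTITIONED BY (load_date STRING);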
 
3. ERROR:

SITUATION:
When reading data with PigStorage and no schema specification, all fields are by default assumed to be bytearrays.

If we now do a SPLIT and some other operations which change the schema of just a few fields, and after that we do a UNION, we’ll get the error: bytearray and chararray columns can’t be mixed in a UNION operation.

SOLUTION:
Explicitly cast all fields to chararray or other data types before doing the UNION operation.
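
A sketch of the problem and the fix, with made-up paths and fields:

    -- No schema given, so both fields load as bytearray.
    raw = LOAD '/data/input' USING PigStorage(',');

    SPLIT raw INTO small IF (int)$1 < 10, big OTHERWISE;

    -- This branch now has (chararray, int) fields...
    small_cast = FOREACH small GENERATE (chararray)$0, (int)$1;

    -- ...so unioning it with the untouched bytearray branch would fail.
    -- Fix: cast the other branch to the same schema first.
    big_cast = FOREACH big GENERATE (chararray)$0, (int)$1;
    unioned  = UNION small_cast, big_cast;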
 

Sqoop

1. ERROR:

SITUATION:
While exporting data with Sqoop to SQL Server using the --staging-table option. At the moment Sqoop can’t work with staging tables when exporting to SQL Server.

SOLUTION:
Export into a temporary table and then copy the data manually into the destination table.
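
A sketch of the workaround, with made-up connection details and table names:

    # Export into a plain intermediate table instead of using --staging-table...
    sqoop export \
      --connect "jdbc:sqlserver://dbhost:1433;database=mydb" \
      --username etl --password-file /user/etl/.password \
      --table sales_tmp \
      --export-dir /user/etl/sales

    # ...then copy it over in one statement, e.g. with sqoop eval:
    sqoop eval \
      --connect "jdbc:sqlserver://dbhost:1433;database=mydb" \
      --username etl --password-file /user/etl/.password \
      --query "INSERT INTO sales SELECT * FROM sales_tmp"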
 
2. ERROR:

SITUATION:
While exporting into a SQL Server table which has a column named RowCount. RowCount is a keyword in SQL Server.

SOLUTION:
Change the column name.
 
3. ERROR:

SITUATION:
While exporting data into a table with an indexed column.

SOLUTION:
Drop the index before the export and recreate it afterwards. Or export into a staging table with no index and then copy the data into the indexed table.
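
The first variant, as a SQL sketch with made-up names:

    -- Drop the index for the duration of the export...
    DROP INDEX ix_sales_customer ON sales;

    -- ...run the Sqoop export here...

    -- ...and recreate the index afterwards.
    CREATE INDEX ix_sales_customer ON sales (customer_id);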
 

Parquet

1. ERROR:

SITUATION:
While Sqoop-ing data which contains a timestamp column into Parquet format and trying to create a Hive/Impala table on top of that.

SOLUTION:
Hive and Impala encode Parquet timestamps as INT96, but Sqoop encodes them as INT64. As a result, Parquet columns written by Sqoop as timestamps are not compatible with Hive or Impala. You need to change the datatype, do some formatting or consider not using Parquet.
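
One way to change the datatype is to let Sqoop import the column as a plain string; a sketch with made-up connection details and column name:

    # --map-column-java overrides the Java type Sqoop uses for a column.
    sqoop import \
      --connect jdbc:mysql://dbhost/mydb \
      --username etl --password-file /user/etl/.password \
      --table events \
      --map-column-java created_at=String \
      --as-parquetfile \
      --target-dir /user/etl/events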
 
2. ERROR:

SITUATION:
While storing data into a Parquet-format table using Pig and HCatalog.

SOLUTION:
It’s currently impossible to store data with Pig and HCatalog into a Parquet-format table. There’s an issue pending for that.
 

Oozie

1. ERROR:

SITUATION:
While writing data with Pig to HBase using an Oozie workflow.

SOLUTION:
This is a known issue, fixed in Pig version 0.12.1. To work around it in earlier versions, set the following parameter in the script:

   

TIPS

Pig

TIP #1
CSVExcelStorage with the SKIP_INPUT_HEADER option may not work as expected.
When loading files, Pig doesn’t put a whole file into the same mapper (one file may get split across a few mappers). Only the mappers that receive splits actually starting with the header will remove it. Those which receive the header stuck somewhere in the middle will not remove it. That made me drop the idea of using this storage and apply some custom checks instead.
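
One such custom check, as a sketch (path and column names are made up): load everything, then drop any row that still looks like the header, wherever it ended up.

    -- Requires the piggybank jar to be registered.
    raw = LOAD '/data/input.csv'
          USING org.apache.pig.piggybank.storage.CSVExcelStorage(',')
          AS (id:chararray, amount:chararray);

    -- Remove header rows regardless of which split they landed in.
    no_header = FILTER raw BY id != 'id';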
 

Parquet

TIP #1
Pig cannot read empty Parquet files.
This can be painful if your Pig processing depends on some other workflow step whose result you don’t fully control. As I used Oozie, I added decision nodes to check whether the file has some content and acted accordingly (see the sketch after TIP #2).

 
TIP #2
You won’t succeed in creating a table on top of an empty Parquet file.
Similarly to the previous situation, if you build a workflow, make sure you add a check that the file size is bigger than 0.
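
A sketch of such a decision node, covering both tips; the node names and path are made up, and fs:fileSize is Oozie’s EL function returning a file’s size in bytes:

    <decision name="check-parquet-input">
      <switch>
        <case to="process-data">
          ${fs:fileSize('/user/etl/out/part-m-00000.parquet') gt 0}
        </case>
        <default to="skip-processing"/>
      </switch>
    </decision>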
 

Oozie

TIP #1
Oozie Hive action ignores the last line in the script.
When running a Hive action, remember to include an empty line at the end of your Hive script. Oozie simply ignores the script’s last line.
 
TIP #2
Add the ${nameNode} parameter to the paths you specify for the Oozie fs action.
Only the fs action’s paths require ${nameNode} at the beginning. Without it you’ll get a message stating that the path cannot be located.
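
A sketch of an fs action with the prefix in place (node names and path are made up):

    <action name="clean-output">
      <fs>
        <delete path="${nameNode}/user/etl/output"/>
      </fs>
      <ok to="next-step"/>
      <error to="fail"/>
    </action>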
 
TIP #3
Watch out for empty strings in the Oozie expression language.
You can’t compare empty strings using the Oozie expression language. If you define a parameter as param='' or param= (blank space), then referring to it with an equals check will resolve to false!
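
A common workaround is to never let the parameter be empty and compare against a non-empty sentinel instead (the value 'NONE' here is an arbitrary choice):

    <!-- param defaults to the sentinel rather than to an empty string. -->
    <case to="skip-step">${param eq 'NONE'}</case>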

 

Hive

TIP #1
A table created with LIKE can only use defaults.
When creating a table using the LIKE clause you may specify the output table’s data format, but the format settings cannot be changed – the defaults are used.

Example (table names are made up):
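
    -- The new table copies the definition of logs_csv and is stored as text,
    -- but with the default format settings: e.g. any custom field delimiter
    -- of logs_csv is not carried over, and LIKE gives no way to set one.
    CREATE TABLE logs_copy LIKE logs_csv STORED AS TEXTFILE;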

 

Impala

TIP #1
There can be a problem with formatting language-specific characters.
Some language-specific characters (like letters with accents) may cause Impala some trouble. Although such characters can be displayed properly, applying some text-formatting functions (like UPPER/LOWER) may return symbols other than the expected ones.
 
TIP #2
The LOAD DATA INPATH statement will not work if the specified file is empty.
If you use Oozie, you may add a check on the file size with the expression language, as in the Parquet TIP #2 sketch above.
