Hadoop troubleshooting & tricks
Over the last few years of working with Hadoop I have spent a lot of time dealing with issues and looking for tricks and workarounds. That involved a lot of searching, reading and, mostly, trial and error. That's why I decided to share some of the solutions I found and tried. Maybe you'll find them helpful at some point.
TROUBLESHOOTING
I was working with CDH-5.5.4 and CDH-5.8.4.
Pig
1. ERROR:
ERROR 2000: Error processing rule ColumnMapKeyPrune
SITUATION:
This happens when doing a JOIN followed by a FOREACH; up to the JOIN everything works fine.
SOLUTION:
Pig is case sensitive and sensitive to column names. When you pull data into HDFS (e.g. with Sqoop), the schema and column names are preserved in the metadata. You can read this data with a different schema and it will work for some operations, like filtering, but the combination of JOIN and FOREACH may fail if the column names differ from the ones in the metadata. That also includes differences between lower and upper case letters.
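As a rough illustration, here is a minimal sketch of the kind of pattern described above; the table and field names are made up, and the point is that the FOREACH after the JOIN only works when the field names match the metadata exactly, including case:
-- field names must match the ones stored in the metadata exactly, including case
orders = LOAD 'db.orders' USING org.apache.hive.hcatalog.pig.HCatLoader();
customers = LOAD 'db.customers' USING org.apache.hive.hcatalog.pig.HCatLoader();
joined = JOIN orders BY customer_id, customers BY customer_id;
-- referring here to e.g. Customer_Id instead of customer_id is enough to trigger ERROR 2000
report = FOREACH joined GENERATE orders::order_id, customers::customer_name;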
2. ERROR:
Filtering is supported only on partition keys of type string
SITUATION:
While reading, via HCatalog, data that is partitioned on a column with a data type other than string.
SOLUTION:
With HCatalog, Pig can only read tables that are partitioned on columns of string data type. There's no escaping that.
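So the partition key has to stay a string on the Hive side. A minimal sketch with made-up table and column names, where the filter is allowed only because the partition column load_date is declared as STRING in the table definition:
events = LOAD 'db.events' USING org.apache.hive.hcatalog.pig.HCatLoader();
-- load_date is a STRING partition column, so filtering on it works
recent = FILTER events BY load_date == '2017-01-01';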
3. ERROR:
Error: org.apache.pig.tools.grunt.Grunt - ERROR 1052: Cannot cast bytearray to chararray
SITUATION:
While reading data with PigStorage with no schema specified, all fields are assumed to be bytearrays by default.
data = LOAD 'data' USING PigStorage('\t');
If we now do a SPLIT and some other operations that change the schema of just a few fields, and after that do a UNION, we'll get the error: bytearray and chararray columns can't be mixed in a UNION operation.
SOLUTION:
Explicitly cast all fields to chararray or other matching data types before doing the UNION operation.
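A minimal sketch of that workaround; the field names and the SPLIT condition are made up:
data = LOAD 'data' USING PigStorage('\t') AS (id, name, amount); -- all fields are bytearray here
SPLIT data INTO small IF (int)amount < 100, big OTHERWISE;
-- some processing changes the schema of one branch, e.g. name becomes chararray
small = FOREACH small GENERATE id, UPPER((chararray)name) AS name, amount;
-- cast both branches to the same explicit types right before the UNION
small = FOREACH small GENERATE (chararray)id AS id, (chararray)name AS name, (chararray)amount AS amount;
big = FOREACH big GENERATE (chararray)id AS id, (chararray)name AS name, (chararray)amount AS amount;
unioned = UNION small, big;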
Sqoop
1. ERROR:
Error during export: The active connection manager (org.apache.sqoop.manager.SQLServerManager) does not support staging of data for export. Please retry without specifying the --staging-table option.
SITUATION:
While exporting data with Sqoop to SQL Server using the --staging-table option. At the moment Sqoop can't use staging tables when exporting to SQL Server.
SOLUTION:
Export into a temporary table and then copy the data manually to the destination table.
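A hedged sketch of that workaround (the connection string, table and directory names are made up): run the export against a temporary table without the --staging-table option, then copy the rows over on the SQL Server side with a plain INSERT INTO ... SELECT:
sqoop export \
  --connect 'jdbc:sqlserver://dbhost:1433;database=dwh' \
  --username etl_user --password-file /user/etl/.sqlserver.password \
  --table DEST_TABLE_TMP \
  --export-dir /user/etl/export/dest_table \
  --input-fields-terminated-by '\t'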
2. ERROR:
Timeout
SITUATION:
While exporting to a SQL Server table which has a column named RowCount. RowCount is a keyword in SQL Server.
SOLUTION:
Change column name.
3. ERROR:
INSERT statement failed because data cannot be updated in a table with a columnstore index. Consider disabling the columnstore index before issuing the INSERT statement, then rebuilding the columnstore index after INSERT is complete.
SITUATION:
While exporting data into a table with a columnstore index.
SOLUTION:
Drop (or disable) the index before the export and recreate it afterwards, or export into a staging table with no index and then copy the data into the indexed table.
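The error message itself hints at the disable/rebuild variant. A minimal T-SQL sketch (index and table names are made up) to wrap around the Sqoop export:
-- before the export
ALTER INDEX cci_dest_table ON dbo.dest_table DISABLE;
-- ...run the Sqoop export...
-- after the export
ALTER INDEX cci_dest_table ON dbo.dest_table REBUILD;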
Parquet
1. ERROR:
Backend 0:File 'hdfs://nameservice1/data/dd754439-1d81-4553-8198-214ce3f3bb48.parquet' has an incompatible type with the table schema for column 'datetime'. Expected type: INT96. Actual type: INT64
SITUATION:
While Sqoop-ing data that contains a timestamp column into Parquet format and trying to create a Hive/Impala table on top of it.
SOLUTION:
Hive and Impala encode Parquet timestamps as INT96, but Sqoop writes them as INT64. As a result, Parquet columns written by Sqoop as timestamps are not compatible with Hive or Impala. You need to change the data type, do some formatting, or consider not using Parquet.
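One possible way of changing the data type, assuming the troublesome column really is called datetime as in the error above, is Sqoop's --map-column-java option, so that the value lands in Parquet as a plain string and can be parsed later in Hive/Impala; the connection string and table name are made up:
sqoop import \
  --connect 'jdbc:sqlserver://dbhost:1433;database=dwh' \
  --table source_table \
  --map-column-java datetime=String \
  --as-parquetfile \
  --target-dir /data/source_table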
2. ERROR:
java.lang.RuntimeException: Should never be used
SITUATION:
While storing data into a Parquet format table using Pig and HCatalog.
SOLUTION:
It's currently impossible to store data into a Parquet format table with Pig and HCatalog. There's an issue pending for it.
Oozie
1. ERROR:
java.io.IOException: No FileSystem for scheme: hbase
SITUATION:
While writing data with Pig to HBase using Oozie workflow.
SOLUTION:
This is a known issue, fixed in Pig version 0.12.1. To work around it, set the following parameter in the Pig script:
SET mapreduce.fileoutputcommitter.marksuccessfuljobs false;
TIPS
Pig
TIP #1
CSVExcelStorage with SKIP_INPUT_HEADER option may not work as expected.
When loading files, Pig doesn't keep a whole file in a single mapper (one file may get split across several mappers). Only the mappers whose input actually starts with the header will remove it; those that get the header stuck somewhere in the middle of their split will not. That made me drop the idea of using this storage option and apply some custom checks instead.
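One way to implement such a custom check (the file path and column names below are illustrative) is to load everything as text with plain CSVExcelStorage, without SKIP_INPUT_HEADER, and filter the header rows out explicitly:
REGISTER /usr/lib/pig/piggybank.jar; -- the piggybank jar location depends on your installation
raw = LOAD '/data/input.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',') AS (id:chararray, name:chararray);
-- drop the header row no matter which mapper its split ends up in
data = FILTER raw BY id != 'id';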
Parquet
TIP #1
Pig cannot read empty Parquet files.
This can be painful if your Pig processing depends on some other workflow step whose result you don't fully control. Since I used Oozie, I added decision nodes to check whether the file has any content and acted accordingly:
<decision name="decision">
    <switch>
        <case to="file_not_exists">
            ${fs:dirSize(concat(NAMENODE,PATH_TO_DIR)) eq 0}
        </case>
        <default to="file_exists"/>
    </switch>
</decision>
TIP #2
You won’t succeed while trying to create a table on an empty Parquet file.
Similarly to the previous situation, if you build a workflow, make sure to add a check that the file size is bigger than 0.
Oozie
TIP #1
Oozie Hive action ignores the last line in the script.
When running a Hive action, remember to include an empty line at the end of your Hive script; Oozie simply ignores the script's last line.
TIP #2
Add the ${nameNode} parameter to the paths you specify for the Oozie fs action.
The fs action is the only one whose paths require ${nameNode} at the beginning; without it you'll get a message stating that the path cannot be located.
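A minimal fs action sketch (action names and paths are made up) with ${nameNode} prepended to every path:
<action name="prepare_output">
    <fs>
        <delete path="${nameNode}/user/cloudera/output/tmp"/>
        <mkdir path="${nameNode}/user/cloudera/output/tmp"/>
    </fs>
    <ok to="next_step"/>
    <error to="kill"/>
</action>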
TIP #3
Watch out for empty strings in Oozie expression language.
You can't compare empty strings using the Oozie expression language. If you define a parameter as param='' or param= (blank space), then checking it with an equals comparison will resolve to false!
<property>
    <name>param</name>
    <value>''</value>
</property>
...
<case to="action_x">
    ${param eq ''} <!-- this will resolve to false -->
</case>
Hive
TIP #1
Table created with LIKE can only use defaults.
When creating a table using the LIKE clause you may specify the output table's data format, but the format settings cannot be changed; the defaults are used.
Example:
CREATE EXTERNAL TABLE IF NOT EXISTS example_table
LIKE the_other_table
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t' -- this option is ignored
STORED AS textfile
LOCATION '/user/cloudera/tables/table_example';
Impala
TIP #1
There can be a problem with the formatting of language-specific characters.
Some language-specific characters (like letters with accents) may cause Impala trouble. Although such characters are displayed properly, applying text formatting functions (like UPPER/LOWER) to them may return different symbols than expected.
TIP #2
LOAD INPATH function will not work if the specified file is empty.
If you use Oozie, you can add a check of the file size using the expression language:
<decision name="decision">
    <switch>
        <case to="file_not_exists">
            ${fs:dirSize(concat(NAMENODE,PATH_TO_DIR)) eq 0}
        </case>
        <default to="file_exists"/>
    </switch>
</decision>