Hadoop troubleshooting & tricks
Over the last few years of working with Hadoop I have spent a lot of time dealing with issues and looking for tricks and workarounds. That involved a lot of searching, reading and, mostly, trial and error. That's why I decided to share some of the solutions I found and tried. Maybe you'll find them helpful at some point.
TROUBLESHOOTING
I was working with CDH-5.5.4 and CDH-5.8.4.
Pig
1. ERROR:
ERROR 2000: Error processing rule ColumnMapKeyPrune
SITUATION:
This happens when doing a JOIN followed by a FOREACH; up to the JOIN everything works fine.
SOLUTION:
Pig is case sensitive and sensitive to column names. When you pull data into HDFS (e.g. with Sqoop), the schema and column names are preserved in the metadata. You can read this data with a different schema and it will work for some operations, like filtering, but the combination of JOIN and FOREACH may fail if the column names differ from the ones in the metadata. That also includes differences between lower and upper case letters.
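As a rough illustration, here is a minimal sketch of the kind of pattern described above; the table and field names are made up, and the point is that the FOREACH after the JOIN only works when the field names match the metadata exactly, including case:
-- field names must match the ones stored in the metadata exactly, including case
orders = LOAD 'db.orders' USING org.apache.hive.hcatalog.pig.HCatLoader();
customers = LOAD 'db.customers' USING org.apache.hive.hcatalog.pig.HCatLoader();
joined = JOIN orders BY customer_id, customers BY customer_id;
-- referring here to e.g. Customer_Id instead of customer_id is enough to trigger ERROR 2000
report = FOREACH joined GENERATE orders::order_id, customers::customer_name;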
2. ERROR:
Filtering is supported only on partition keys of type string
SITUATION:
While reading, via HCatalog, data that is partitioned on a column with a data type other than string.
SOLUTION:
With HCatalog, Pig can only read tables that are partitioned on columns of string data type. There's no escaping that.
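So the partition key has to stay a string on the Hive side. A minimal sketch with made-up table and column names, where the filter is allowed only because the partition column load_date is declared as STRING in the table definition:
events = LOAD 'db.events' USING org.apache.hive.hcatalog.pig.HCatLoader();
-- load_date is a STRING partition column, so filtering on it works
recent = FILTER events BY load_date == '2017-01-01';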
3. ERROR:
Error: org.apache.pig.tools.grunt.Grunt - ERROR 1052: Cannot cast bytearray to chararray
SITUATION:
While reading data with PigStorage with no schema specified, all fields are assumed to be bytearrays by default.
data = LOAD 'data' USING PigStorage('\t');
If we now do a SPLIT and some other operations that change the schema of just a few fields, and after that do a UNION, we'll get the error: bytearray and chararray columns can't be mixed in a UNION operation.
SOLUTION:
Explicitly cast all fields to chararray or other matching data types before doing the UNION operation.
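A minimal sketch of that workaround; the field names and the SPLIT condition are made up:
data = LOAD 'data' USING PigStorage('\t') AS (id, name, amount); -- all fields are bytearray here
SPLIT data INTO small IF (int)amount < 100, big OTHERWISE;
-- some processing changes the schema of one branch, e.g. name becomes chararray
small = FOREACH small GENERATE id, UPPER((chararray)name) AS name, amount;
-- cast both branches to the same explicit types right before the UNION
small = FOREACH small GENERATE (chararray)id AS id, (chararray)name AS name, (chararray)amount AS amount;
big = FOREACH big GENERATE (chararray)id AS id, (chararray)name AS name, (chararray)amount AS amount;
unioned = UNION small, big;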
Sqoop
1. ERROR:
Error during export: The active connection manager (org.apache.sqoop.manager.SQLServerManager) does not support staging of data for export. Please retry without specifying the --staging-table option.
SITUATION:
While exporting data with Sqoop to SQL Server using the --staging-table option. At the moment Sqoop can't use staging tables when exporting to SQL Server.
SOLUTION:
Export into a temporary table and then copy the data manually to the destination table.
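A hedged sketch of that workaround (the connection string, table and directory names are made up): run the export against a temporary table without the --staging-table option, then copy the rows over on the SQL Server side with a plain INSERT INTO ... SELECT:
sqoop export \
  --connect 'jdbc:sqlserver://dbhost:1433;database=dwh' \
  --username etl_user --password-file /user/etl/.sqlserver.password \
  --table DEST_TABLE_TMP \
  --export-dir /user/etl/export/dest_table \
  --input-fields-terminated-by '\t'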
2. ERROR:
Timeout
SITUATION:
While exporting to a SQL Server table which has a column named RowCount. RowCount is a keyword in SQL Server.
SOLUTION:
Change column name.
3. ERROR:
INSERT statement failed because data cannot be updated in a table with a columnstore index. Consider disabling the columnstore index before issuing the INSERT statement, then rebuilding the columnstore index after INSERT is complete.
SITUATION:
While exporting data into a table with a columnstore index.
SOLUTION:
Drop (or disable) the index before the export and recreate it afterwards, or export into a staging table with no index and then copy the data into the indexed table.
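The error message itself hints at the disable/rebuild variant. A minimal T-SQL sketch (index and table names are made up) to wrap around the Sqoop export:
-- before the export
ALTER INDEX cci_dest_table ON dbo.dest_table DISABLE;
-- ...run the Sqoop export...
-- after the export
ALTER INDEX cci_dest_table ON dbo.dest_table REBUILD;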
Parquet
1. ERROR:
Backend 0:File 'hdfs://nameservice1/data/dd754439-1d81-4553-8198-214ce3f3bb48.parquet' has an incompatible type with the table schema for column 'datetime'. Expected type: INT96. Actual type: INT64
SITUATION:
While Sqoop-ing data that contains a timestamp column into Parquet format and trying to create a Hive/Impala table on top of it.
SOLUTION:
Hive and Impala encode Parquet timestamps as INT96, but Sqoop writes them as INT64. As a result, Parquet columns written by Sqoop as timestamps are not compatible with Hive or Impala. You need to change the data type, do some formatting, or consider not using Parquet.
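One possible way of changing the data type, assuming the troublesome column really is called datetime as in the error above, is Sqoop's --map-column-java option, so that the value lands in Parquet as a plain string and can be parsed later in Hive/Impala; the connection string and table name are made up:
sqoop import \
  --connect 'jdbc:sqlserver://dbhost:1433;database=dwh' \
  --table source_table \
  --map-column-java datetime=String \
  --as-parquetfile \
  --target-dir /data/source_table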
2. ERROR:
java.lang.RuntimeException: Should never be used
SITUATION:
While storing data into a Parquet format table using Pig and HCatalog.
SOLUTION:
It's currently impossible to store data into a Parquet format table with Pig and HCatalog. There's an issue pending for it.
Oozie
1. ERROR:
java.io.IOException: No FileSystem for scheme: hbase
SITUATION:
While writing data with Pig to HBase using Oozie workflow.
SOLUTION:
This is a known issue, fixed in Pig version 0.12.1. To work around it, set the following parameter in the Pig script:
SET mapreduce.fileoutputcommitter.marksuccessfuljobs false;
TIPS
Pig
TIP #1
CSVExcelStorage with SKIP_INPUT_HEADER option may not work as expected.
When loading files, Pig doesn't keep a whole file in a single mapper (one file may get split across several mappers). Only the mappers whose input actually starts with the header will remove it; those that get the header stuck somewhere in the middle of their split will not. That made me drop the idea of using this storage option and apply some custom checks instead.
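One way to implement such a custom check (the file path and column names below are illustrative) is to load everything as text with plain CSVExcelStorage, without SKIP_INPUT_HEADER, and filter the header rows out explicitly:
REGISTER /usr/lib/pig/piggybank.jar; -- the piggybank jar location depends on your installation
raw = LOAD '/data/input.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',') AS (id:chararray, name:chararray);
-- drop the header row no matter which mapper its split ends up in
data = FILTER raw BY id != 'id';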
Parquet
TIP #1
Pig cannot read empty Parquet files.
This can be painful if your Pig processing depends on some other workflow step whose result you don't fully control. Since I used Oozie, I added decision nodes to check whether the file has any content and acted accordingly:
<decision name="decision">
    <switch>
        <case to="file_not_exists">
            ${fs:dirSize(concat(NAMENODE,PATH_TO_DIR)) eq 0}
        </case>
        <default to="file_exists"/>
    </switch>
</decision>
TIP #2
You won’t succeed while trying to create a table on an empty Parquet file.
Similarly to the previous situation, if you build a workflow, make sure to add a check that the file size is bigger than 0.
Oozie
TIP #1
Oozie Hive action ignores the last line in the script.
When running a Hive action, remember to include an empty line at the end of your Hive script; Oozie simply ignores the script's last line.
TIP #2
Add the ${nameNode} parameter to the paths you specify for the Oozie fs action.
The fs action is the only one whose paths require ${nameNode} at the beginning; without it you'll get a message stating that the path cannot be located.
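A minimal fs action sketch (action names and paths are made up) with ${nameNode} prepended to every path:
<action name="prepare_output">
    <fs>
        <delete path="${nameNode}/user/cloudera/output/tmp"/>
        <mkdir path="${nameNode}/user/cloudera/output/tmp"/>
    </fs>
    <ok to="next_step"/>
    <error to="kill"/>
</action>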
TIP #3
Watch out for empty strings in Oozie expression language.
You can't compare empty strings using the Oozie expression language. If you define a parameter as param='' or param= (blank space), then checking it with an equals comparison will resolve to false!
<property>
    <name>param</name>
    <value>''</value>
</property>
...
<case to="action_x">
    ${param eq ''} <!-- this will resolve to false -->
</case>
Hive
TIP #1
Table created with LIKE can only use defaults.
When creating a table using the LIKE clause you may specify the output table's data format, but the format settings cannot be changed; the defaults are used.
Example:
CREATE EXTERNAL TABLE IF NOT EXISTS example_table
LIKE the_other_table
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t' -- this option is ignored
STORED AS textfile
LOCATION '/user/cloudera/tables/table_example';
Impala
TIP #1
There can be a problem with the formatting of language-specific characters.
Some language-specific characters (like letters with accents) may cause Impala trouble. Although such characters are displayed properly, applying text formatting functions (like UPPER/LOWER) to them may return different symbols than expected.
TIP #2
LOAD INPATH function will not work if the specified file is empty.
If you use Oozie, you can add a check of the file size using the expression language:
<decision name="decision">
    <switch>
        <case to="file_not_exists">
            ${fs:dirSize(concat(NAMENODE,PATH_TO_DIR)) eq 0}
        </case>
        <default to="file_exists"/>
    </switch>
</decision>