Apache Beam and HCatalog

12/08/2018 1:11 PM
Alice
Tags: Beam, HCatalog, Spark
5

HCatalog gives a flexibility to read and write data to Hive metastore tables without specifying the tables schemas. Apache Beam provides a transform which allows querying the Hive data. It’s called HCatalogIO. Here I show how to use it in Kerberised environment.

Contents

HOW TO

I was using Beam 2.3 Java SDK and Spark 2.2 on CDH Hadoop cluster v. 5.12.1 with Kerberos.

For how to start working with Beam check my previous post.

Using HCatalogIO may look like this:

//HCatOptions may extend PipelineOptions or HadoopFileSystemOptions
HCatReadWrite.HCatOptions options = PipelineOptionsFactory.fromArgs(args)
                .withValidation()
                .as(HCatReadWrite.HCatOptions.class);
Pipeline p = Pipeline.create(options);
    
//We need to provide the hive.metastore.uris
Map<String, String> configProperties = new HashMap<String, String>();
configProperties.put("hive.metastore.uris","my_hive_metastore_uri");

p.apply(HCatalogIO.read()
                  .withConfigProperties(configProperties)
                  .withDatabase("my_database")
                  .withTable("my_input_table"))
 .apply(HCatalogIO.write()
                  .withConfigProperties(configProperties)
                  .withDatabase("my_database")
                  .withTable("my_output_table"));

//HCatOptions may extend PipelineOptions or HadoopFileSystemOptions

HCatReadWrite.HCatOptions options = PipelineOptionsFactory.fromArgs(args)

.withValidation()

.as(HCatReadWrite.HCatOptions.class);

Pipeline p = Pipeline.create(options);

//We need to provide the hive.metastore.uris

Map<String, String> configProperties = new HashMap<String, String>();

configProperties.put("hive.metastore.uris","my_hive_metastore_uri");

p.apply(HCatalogIO.read()

.withConfigProperties(configProperties)

.withDatabase("my_database")

.withTable("my_input_table"))

.apply(HCatalogIO.write()

.withConfigProperties(configProperties)

.withDatabase("my_database")

.withTable("my_output_table"));

Nothing fancy happens here, just simple reading from my_input_table and writing it back to my_output_table. Of course you may add some transforms in between.

Kerberos authentication

With HCatalogIO the trickiest part is to authenticate with Kerberos. In order to do that I had to:

set following Spark properties:
• spark.driver.extraJavaOptions=-Djavax.security.auth.useSubjectCredsOnly=false
• spark.executor.extraJavaOptions=-Djavax.security.auth.useSubjectCredsOnly=false
to prevent the error:

Exception in thread "main" org.apache.spark.SparkException: Keytab file: hdfs://.../mykeytab.keytab does not exist

1
2

Exception in thread "main" org.apache.spark.SparkException: Keytab file:
hdfs://.../mykeytab.keytab does not exist

pass the keytab file 2 times (each time with different name). Once with files option and secondly with keytab option. First one needs to available on HDFS with proper reading permissions set, while the second one needs to be stored locally due to pending issue.
Keytabs need to have different names, otherwise I was getting:

Exception in thread "main" java.lang.IllegalArgumentException: Attempt 
to add (file:/.../my_keytab.keytab#tmp.keytab) multiple times to the 
distributed cache.

Exception in thread "main" java.lang.IllegalArgumentException: Attempt

to add (file:/.../my_keytab.keytab#tmp.keytab) multiple times to the

distributed cache.

pass the principal parameter

specify hive.metastore.uris in order to connect to metastore

Old Hive version

Due to old Hive version (1.1) distributed in CDH, I had to modify the HCatalogIO class in Beam SDK to use HCatUtil.getHiveClient method instead of HCatUtil.getHiveMetastoreClient.

//modified part
public long getEstimatedSizeBytes(PipelineOptions pipelineOptions)
                                                     throws Exception {
            Configuration conf = new Configuration();
            for (Entry<String, String> entry : spec.getConfigProperties()
                                                   .entrySet()) {
                conf.set(entry.getKey(), entry.getValue());
            }
            HiveMetaStoreClient client = null;
            try {
                HiveConf hiveConf = HCatUtil.getHiveConf(conf);
                client = HCatUtil.getHiveClient(hiveConf);
                Table table = HCatUtil.getTable(client, spec.getDatabase(),
                                                spec.getTable());
                return StatsUtils.getFileSizeForTable(hiveConf, table);
            } finally {
                if (client != null) {
                    client.close();
                }
            }
        }

//modified part

public long getEstimatedSizeBytes(PipelineOptions pipelineOptions)

throws Exception {

Configuration conf = new Configuration();

for (Entry<String, String> entry : spec.getConfigProperties()

.entrySet()) {

conf.set(entry.getKey(), entry.getValue());

}

HiveMetaStoreClient client = null;

try {

HiveConf hiveConf = HCatUtil.getHiveConf(conf);

client = HCatUtil.getHiveClient(hiveConf);

Table table = HCatUtil.getTable(client, spec.getDatabase(),

spec.getTable());

return StatsUtils.getFileSizeForTable(hiveConf, table);

} finally {

if (client != null) {

client.close();

}

Without that I was getting errors like:

User class threw exception: java.lang.RuntimeException: java.lang.LinkageError: loader constraint violation: when resolving method "org.apache.hadoop.hive.ql.stats.StatsUtils.getFileSizeForTable(Lorg/apache/hadoop/hive/conf/HiveConf;Lorg/apache/hadoop/hive/ql/metadata/Table;)J" the class loader (instance of org/apache/spark/util/ChildFirstURLClassLoader) of the current class, org/apache/beam/sdk/io/hcatalog/HCatalogIO$BoundedHCatalogSource, and the class loader (instance of sun/misc/Launcher$AppClassLoader)for the method's defining class, org/apache/hadoop/hive/ql/stats/StatsUtils, have different Class objects for the type org/apache/hadoop/hive/conf/HiveConf used in the signature

User class threw exception: java.lang.RuntimeException: java.lang.LinkageError: loader constraint violation: when resolving method "org.apache.hadoop.hive.ql.stats.StatsUtils.getFileSizeForTable(Lorg/apache/hadoop/hive/conf/HiveConf;Lorg/apache/hadoop/hive/ql/metadata/Table;)J" the class loader (instance of org/apache/spark/util/ChildFirstURLClassLoader) of the current class, org/apache/beam/sdk/io/hcatalog/HCatalogIO$BoundedHCatalogSource, and the class loader (instance of sun/misc/Launcher$AppClassLoader)for the method's defining class, org/apache/hadoop/hive/ql/stats/StatsUtils, have different Class objects for the type org/apache/hadoop/hive/conf/HiveConf used in the signature

User class threw exception: java.lang.RuntimeException: com.google.common.util.concurrent.ExecutionError: java.lang.NoSuchMethodError: org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(Lorg/apache/hadoop/hive/conf/HiveConf;[Ljava/lang/Class;[Ljava/lang/Object;Ljava/lang/String;)Lorg/apache/hadoop/hive/metastore/IMetaStoreClient;

User class threw exception: java.lang.RuntimeException: com.google.common.util.concurrent.ExecutionError: java.lang.NoSuchMethodError: org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(Lorg/apache/hadoop/hive/conf/HiveConf;[Ljava/lang/Class;[Ljava/lang/Object;Ljava/lang/String;)Lorg/apache/hadoop/hive/metastore/IMetaStoreClient;

User class threw exception: java.lang.RuntimeException: com.google.common.util.concurrent.ExecutionError: java.lang.NoSuchFieldError: METASTORE_CLIENT_SOCKET_LIFETIME

1	User class threw exception: java.lang.RuntimeException: com.google.common.util.concurrent.ExecutionError: java.lang.NoSuchFieldError: METASTORE_CLIENT_SOCKET_LIFETIME

java.io.InvalidClassException: org.apache.hadoop.hive.metastore.api.Table; local class incompatible: stream classdesc serialVersionUID = 7046373721250106722, local class serialVersionUID = 398473631015277182

1	java.io.InvalidClassException: org.apache.hadoop.hive.metastore.api.Table; local class incompatible: stream classdesc serialVersionUID = 7046373721250106722, local class serialVersionUID = 398473631015277182

For Beam 2.4 there is a workaround described on official Beam webpage.

Libraries

Additionally I had to attach following libraries:

beam-sdks-java-core-2.3.0.jar
beam-runners-spark-2.3.0.jar
beam-runners-direct-java-2.3.0.jar
beam-runners-core-construction-java-2.3.0.jar
beam-runners-core-java-2.3.0.jar
beam-sdks-java-io-hadoop-file-system-2.3.0.jar
beam-sdks-java-io-hcatalog-2.3.0.jar
beam-sdks-java-io-hadoop-common-2.3.0.jar
hive-hcatalog-core-1.1.0-cdh5.12.1.jar because of

ERROR yarn.ApplicationMaster: User class threw exception: java.lang.NoClassDefFoundError: org/apache/hive/hcatalog/data/transfer/ReaderContext

1

ERROR yarn.ApplicationMaster: User class threw exception: java.lang.NoClassDefFoundError: org/apache/hive/hcatalog/data/transfer/ReaderContext

snappy-java-1.1.7.1.jar because of

Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.lang.UnsatisfiedLinkError: org.xerial.snappy.SnappyNative.maxCompressedLength(I)I)

1	Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.lang.UnsatisfiedLinkError: org.xerial.snappy.SnappyNative.maxCompressedLength(I)I)

Running job

I used spark2-submit command to run my job:

spark2-submit --jars beam-sdks-java-core-2.3.0.jar,
                     beam-runners-spark-2.3.0.jar,
                     beam-runners-direct-java-2.3.0.jar,
                     beam-runners-core-construction-java-2.3.0.jar,
                     beam-runners-core-java-2.3.0.jar,
                     beam-sdks-java-io-hadoop-file-system-2.3.0.jar,
                     beam-sdks-java-io-hcatalog-2.3.0.jar,
                     beam-sdks-java-io-hadoop-common-2.3.0.jar,
                     hive-hcatalog-core-1.1.0-cdh5.12.1.jar,
                     snappy-java-1.1.7.1.jar
              --conf spark.driver.userClassPathFirst=true
              --conf spark.driver.extraJavaOptions=
                        -Djavax.security.auth.useSubjectCredsOnly=false
              --conf spark.executor.extraJavaOptions=
                        -Djavax.security.auth.useSubjectCredsOnly=false
              --files hdfs://.../my_keytab1.keytab
              --principal my_name
              --keytab my_keytab2.keytab
              --class hcatalog.HCatReadWrite
              --master yarn
              --queue=my_queue
              --deploy-mode cluster
              my_jar.jar
              --runner=SparkRunner
              --sparkMaster=yarn

spark2-submit --jars beam-sdks-java-core-2.3.0.jar,

beam-runners-spark-2.3.0.jar,

beam-runners-direct-java-2.3.0.jar,

beam-runners-core-construction-java-2.3.0.jar,

beam-runners-core-java-2.3.0.jar,

beam-sdks-java-io-hadoop-file-system-2.3.0.jar,

beam-sdks-java-io-hcatalog-2.3.0.jar,

beam-sdks-java-io-hadoop-common-2.3.0.jar,

hive-hcatalog-core-1.1.0-cdh5.12.1.jar,

snappy-java-1.1.7.1.jar

--conf spark.driver.userClassPathFirst=true

--conf spark.driver.extraJavaOptions=

-Djavax.security.auth.useSubjectCredsOnly=false

--conf spark.executor.extraJavaOptions=

-Djavax.security.auth.useSubjectCredsOnly=false

--files hdfs://.../my_keytab1.keytab

--principal my_name

--keytab my_keytab2.keytab

--class hcatalog.HCatReadWrite

--master yarn

--queue=my_queue

--deploy-mode cluster

my_jar.jar

--runner=SparkRunner

--sparkMaster=yarn

My pom.xml contained following dependencies:

<dependency>
    <groupid>org.apache.beam</groupid>
    <artifactid>beam-sdks-java-core</artifactid>
    <version>2.3.0</version>
</dependency>
<dependency>
    <groupid>org.apache.beam</groupid>
    <artifactid>beam-runners-direct-java</artifactid>
     <version>2.3.0</version>
</dependency>
<dependency>
    <groupid>org.apache.beam</groupid>
    <artifactid>beam-runners-spark</artifactid>
    <version>2.3.0</version>
</dependency>
<dependency>
    <groupid>org.apache.beam</groupid>
    <artifactid>beam-sdks-java-io-hadoop-file-system</artifactid>
    <version>2.3.0</version>
</dependency>
<dependency>
    <groupid>org.apache.hadoop</groupid>
    <artifactid>hadoop-common</artifactid>
    <version>2.6.0</version>
</dependency>
<dependency>
    <groupid>org.apache.beam</groupid>
    <artifactid>beam-sdks-java-io-hcatalog</artifactid>
    <version>2.3.0</version>
</dependency>
<dependency>
    <groupid>org.apache.beam</groupid>
    <artifactid>beam-sdks-java-io-hadoop-common</artifactid>
    <version>2.3.0</version>
</dependency>
 <dependency>
    <groupid>org.apache.hive.hcatalog</groupid>
    <artifactid>hive-hcatalog-core</artifactid>
    <version>2.1.0</version>
    <exclusions>
       <exclusion>
       <groupid>org.apache.hive</groupid>
       <artifactid>hive-exec</artifactid>
       </exclusion>
    </exclusions>
 </dependency>
<dependency>
    <groupid>org.apache.hive</groupid>
    <artifactid>hive-exec</artifactid>
    <version>2.1.0</version>
</dependency>