Oozie Sqoop action
There are few ways to build Oozie Sqooping action. You can decide to use:
- arg properties
- command property
- or options file
Each of the options is equally good so you can choose which one suits you best.
Contents
HOW TO
1. arg properties
You can specify all Sqoop options using the arg properties. They go in pairs: one for option preceded by “‐‐” and one for its value, or alone for option which don’t require a value (also with “‐‐”). The only difference is the schema option, which requires 3 args: “‐‐”, “‐‐schema” and “schema_value”. schema option needs to be specified as the last one, otherwise some other parameters coming afterwards will not be visible.
Here’s how such Sqoop action may look like:
- export to database
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 |
<action name="sqoop export with arg" cred=""> <sqoop xmlns="uri:oozie:sqoop-action:0.2"> <job-tracker>${jobTracker}</job-tracker> <name-node>${nameNode}</name-node> <arg>export</arg> <arg>--connect</arg> <arg>${jdbc_uri};database=${database_name}</arg> <arg>--username</arg> <arg>${db_username}</arg> <arg>--password-file</arg> <arg>${dir_db_password_file}</arg> <arg>--table</arg> <arg>${db_table_name}</arg> <arg>--fields-terminated-by</arg> <arg>${fields_separator}</arg> <arg>--export-dir</arg> <arg>${export_dir}</arg> <arg>--batch</arg> <arg>--num-mappers</arg> <arg>${number_of_mappers}</arg> <arg>--</arg> <arg>--schema</arg> <arg>${db_schema}</arg> <file>${jdbc_connection.jar}</file> </sqoop> <ok to="end"/> <error to="kill"/> </action> |
- import from database
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 |
<action name="sqoop import with arg" cred=""> <sqoop xmlns="uri:oozie:sqoop-action:0.2"> <job-tracker>${jobTracker}</job-tracker> <name-node>${nameNode}</name-node> <arg>import</arg> <arg>--connect</arg> <arg>${jdbc_uri};database=${db_name}</arg> <arg>--username</arg> <arg>${db_username}</arg> <arg>--password-file</arg> <arg>${dir_db_password_file}</arg> <arg>--table</arg> <arg>${db_table_name}</arg> <arg>--target-dir</arg> <arg>${target_dir}</arg> <arg>--batch</arg> <arg>--num-mappers</arg> <arg>${number_of_mappers}</arg> <arg>--map-column-java </arg> <arg>${column_name1}=Integer,${column_name1}=String, ${column_name3}=Float</arg> <arg>--as-parquetfile</arg> <arg>--</arg> <arg>--schema</arg> <arg>${db_schema}</arg> <file>${jdbc_connection.jar}</file> </sqoop> <ok to="end"/> <error to="kill"/> </action> |
Note: Instead of “‐‐table” and “‐‐ ‐‐schema” options you may use “‐‐query” option like:
1 |
“select * from ${db_schema}.${db_table_name} where $CONDITIONS”. |
where $CONDITIONS clause in needed for Sqoop to run properly. You can add other filter clauses to where as well.
“query” is only available for import action.
- call stored procedure
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
<action name="stored procedure" cred=""> <sqoop xmlns="uri:oozie:sqoop-action:0.2"> <job-tracker>${jobTracker}</job-tracker> <name-node>${nameNode}</name-node> <arg>eval</arg> <arg>--connect</arg> <arg>${jdbc_uri};database=${db_name}</arg> <arg>--username</arg> <arg>${db_username}</arg> <arg>--password-file</arg> <arg>${dir_db_password_file}</arg> <arg>--query</arg> <arg>${procedure_name}</arg> <file>${jdbc_connection.jar}</file> </sqoop> <ok to="end"/> <error to="kill"/> </action> |
2. command property
The alternative to many arg properties is a command property, which allows you to define all Sqoop options in a way that you would run it using Sqoop shell.
Almost all used options require a preceding “‐‐”. The only difference is the schema option, which requires “‐‐ ‐‐”. schema option needs to be specified as the last one, otherwise some other parameters coming afterwards will not be visible.
Here’s how such Sqoop action may look like:
- export to database
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
<action name="sqoop export with command" cred=""> <sqoop xmlns="uri:oozie:sqoop-action:0.2"> <job-tracker>${jobTracker}</job-tracker> <name-node>${nameNode}</name-node> <command>export --connect ${jdbc_uri};database=${db_name} --username ${db_username} --password-file ${dir_db_password_file} --table ${db_table_name} --export-dir ${export_dir} --fields-terminated-by ${fields_separator} --batch --num-mappers ${number_of_mappers} -- --schema ${db_schema}</command> <file>${jdbc_connection.jar}</file> </sqoop> <ok to="end"/> <error to="kill"/> </action> |
- import from database
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
<action name="sqoop import with command" cred=""> <sqoop xmlns="uri:oozie:sqoop-action:0.2"> <job-tracker>${jobTracker}</job-tracker> <name-node>${nameNode}</name-node> <command>import --connect ${jdbc_uri};database=${db_name} --username ${db_username} --password-file ${dir_db_password_file} --table=${db_table_name} --target-dir ${target_dir} --num-mappers ${number_of_mappers} -- --schema ${db_schema}</command> <file>${jdbc_connection.jar}</file> </sqoop> <ok to="end"/> <error to="kill"/> </action> |
Note: Instead of “‐‐table” and “‐‐ ‐‐schema” options you may use “‐‐query” option like:
1 |
“select * from ${db_schema}.${db_table_name} where $CONDITIONS”. |
where $CONDITIONS clause in needed for Sqoop to run properly. You can add other filter clauses to where as well.
“query” is only available for import action.
- call stored procedure
1 2 3 4 5 6 7 8 9 10 11 12 13 |
<action name="stored procedure" cred=""> <sqoop xmlns="uri:oozie:sqoop-action:0.2"> <job-tracker>${jobTracker}</job-tracker> <name-node>${nameNode}</name-node> <command>eval --connect ${jdbc_uri};database=${database_name} --username ${db_username} --password-file ${dir_db_password_file} --query ${procedure_name}</command> <file>${jdbc_connection.jar}</file> </sqoop> <ok to="end"/> <error to="kill"/> </action> |
3. options file
Instead of defining all options in the workflow itself, you may define them in the file stored in HDFS. It gives a nice option of switching files without changing the workflow or creating a generic one. Created file needs to be pointed to with “‐‐options-file” option and attached with workflow file property.
Remember to use “‐‐ ‐‐schema” option at the very end.
Here’s how such Sqoop action may look like:
1 2 3 4 5 6 7 8 9 10 11 12 |
<action name="sqoop with properties file" cred=""> <sqoop xmlns="uri:oozie:sqoop-action:0.2"> <job-tracker>${jobTracker}</job-tracker> <name-node>${nameNode}</name-node> <arg>--options-file</arg> <arg>${config.sqoop}</arg> <file>${dir_to_config.sqoop}</file> <file>${jdbc_connection.jar}</file> </sqoop> <ok to="end" /> <error to="kill" /> </action> |
and options file:
- export to database
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
export --connect ${jdbc_uri};database=${database_name} --username ${db_username} --;password-file ${dir_db_password_file} --table ${db_table_name} --export-dir ${export_dir} --fields-terminated-by ${fields_separator} --batch --num-mappers ${number_of_mappers} -- --schema ${db_schema} |
- import from database
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
import --connect ${jdbc_uri};database=${database_name} --username ${db_username} --password-file ${dir_db_password_file} --table ${db_table_name} --target-dir ${target_dir} --num-mappers ${number_of_mappers} --fields-terminated-by ${fields_separator} -- --schema ${db_schema} |
- call stored procedure
1 2 3 4 5 6 7 8 9 |
eval --connect ${jdbc_uri};database=${database_name} --username ${db_username} --password-file ${dir_db_password_file} --query ${procedure_name} |