All,
Please can you help and advise on the following:
We are in the process of moving data from Teradata to Hadoop using TDCH (the Teradata Connector for Hadoop). We have successfully imported full tables from Teradata into Hadoop; however, where the Teradata tables are larger than 200 GB we want to import only the daily deltas.
We have changed our script to use SOURCEQUERY in place of SOURCETABLE and supplied SQL with a WHERE clause that selects only a subset of the data based on the processing date. We have also specified the method as split.by.hash; however, this is being overridden by split.by.partition when the SOURCEQUERY parameter is used.
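For context, this is roughly the shape of the full-table import that works for us (trimmed; the target directory is a placeholder, and the full delta script is shown further below):

hadoop jar /usr/lib/sqoop/lib/teradata-connector-1.4.1.jar \
com.teradata.hadoop.tool.TeradataImportTool \
-libjars terajdbc4.jar \
-url jdbc:teradata://tdprod/database=insight \
-username xxxxxxxx \
-password xxxxxxxxx \
-jobtype hdfs \
-fileformat textfile \
-method split.by.hash \
-splitbycolumn address_id \
-sourcetable EDWPRODT.ADDRESSES \
-targetpaths hdfs://dox/user/user1/td_prd_addresses/full/

With -sourcetable the split.by.hash method is honoured; the moment we swap in -sourcequery it falls back to split.by.partition.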
Using split.by.partition causes a staging table to be created in the database area at the full size of the existing table. This is causing us issues, since we do not have the spare space to replicate the table in the database area, and our job therefore abends with "2644 - No more room in database".
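To illustrate the shortfall, this is the sort of check we run for free perm space in the target database (a sketch assuming the staging table lands in the 'insight' database from our JDBC URL, and that the DBC.DiskSpaceV view is available on your Teradata release):

bteq <<'EOF'
.LOGON tdprod/xxxxxxxx,xxxxxxxxx
/* Free perm space = MaxPerm minus CurrentPerm, summed across AMPs */
SELECT DatabaseName,
       SUM(MaxPerm)                    AS MaxPermBytes,
       SUM(CurrentPerm)                AS CurrentPermBytes,
       SUM(MaxPerm) - SUM(CurrentPerm) AS FreePermBytes
FROM DBC.DiskSpaceV
WHERE DatabaseName = 'insight'
GROUP BY 1;
.LOGOFF
.QUIT
EOF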
Can anyone explain why the split.by.hash method cannot be used with SOURCEQUERY?
This is a sample script showing the parameters we have used to invoke the import process via TDCH:
# The Teradata JDBC driver is supplied via the generic -libjars option
# (a bare terajdbc4.jar argument after the connector jar would be treated
# as the main class); adjust the driver path for your environment.
hadoop jar /usr/lib/sqoop/lib/teradata-connector-1.4.1.jar \
com.teradata.hadoop.tool.TeradataImportTool \
-libjars terajdbc4.jar \
-D mapreduce.job.queuename=insights \
-url jdbc:teradata://tdprod/database=insight \
-username xxxxxxxx \
-password xxxxxxxxx \
-classname com.teradata.jdbc.TeraDriver \
-fileformat textfile \
-jobtype hdfs \
-method split.by.hash \
-splitbycolumn address_id \
-targetpaths hdfs://dox/user/user1/td_prd_addresses/mail_drop_date='2015-11-20'/ \
-nummappers 1 \
-sourcequery "select col1, col2, col3, col4 from EDWPRODT.ADDRESSES where mail_drop_date = '2015-11-20'"
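One avenue we are considering, in case it is relevant to any answer: redirecting the staging table to a database with more free perm space. We believe TDCH supports -stagedatabase and -stagetablename parameters for this, though we have not confirmed them against 1.4.1, and it would only relocate, not shrink, the full-size staging copy (SPARE_DB and ADDR_STG below are placeholders):

hadoop jar /usr/lib/sqoop/lib/teradata-connector-1.4.1.jar \
com.teradata.hadoop.tool.TeradataImportTool \
-libjars terajdbc4.jar \
-url jdbc:teradata://tdprod/database=insight \
-username xxxxxxxx \
-password xxxxxxxxx \
-jobtype hdfs \
-fileformat textfile \
-method split.by.partition \
-stagedatabase SPARE_DB \
-stagetablename ADDR_STG \
-targetpaths hdfs://dox/user/user1/td_prd_addresses/mail_drop_date='2015-11-20'/ \
-nummappers 1 \
-sourcequery "select col1, col2, col3, col4 from EDWPRODT.ADDRESSES where mail_drop_date = '2015-11-20'"

What we would really like, though, is for split.by.hash to be honoured so that no staging table is needed at all.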