Monday, May 20, 2013

Automatically Create DataStage Schema Files

Introduction

Loading a csv file to a database table is one of the simplest DataStage tasks. Using Runtime Column Propagation (RCP), a single parameterized job can handle multiple files. Here is a step-by-step tutorial to do just that: Guidelines to Develop a Generic Job using Schema Files. For each csv file, DataStage needs a schema file to load column information at runtime. In the tutorial, a sample schema file is created by hand. If there are dozens of csv files to be loaded, creating all those schema files manually is quite tedious. Since every csv file is mirrored in a database table, it is possible to let DataStage collect column information from the database automatically. I will show you 3 approaches in this post.

Semi-Automated Solution

The first solution is semi-automatic, using DataStage's Import Table Definition wizard. In DataStage designer, click menu item Import / Table Definitions / Orchestrate Schema Definitions, and then select Database table (via orchdbutil) option. Type in database connection info, and select the table to be imported. In the next step, DataStage reads column information from the database, and displays it in the schema file format, as shown below. 

Copy and paste the columns to a schema template. The template contains other format information, such as field and record delimiters. Save it to a new text file, and this table is done. Go back to the previous step, select another table, and repeat.
This approach works well on a handful of tables. If there are a lot more tables, or table structures change frequently, a fully automated approach is more desirable.

Fully Automated Solution 1

As revealed in the previous approach, the DataStage import wizard uses a command-line application (orchdbutil) to generate the schema under the hood. If you have access to the DataStage server and are comfortable with shell scripting or other scripting language, you can write a script to loop through table names in an input file, feed the table to orchdbutil, and parse its output. I am not going to get into the details. Here is a blog post that can help you get started: Importing Table Definitions With OrchDbUtil.

Fully Automated Solution 2

Another fully automated approach takes advantage of DataStage itself. Create a parallel job to read schema  through an Oracle connector using a custom SQL statement, and save it to a sequential file. Create a sequence job to loop through a list of tables. Below is a sample implementation of the parallel job, the SQL statement and the sequence job.


8 comments:

kumar said...
This comment has been removed by a blog administrator.
Anonymous said...

Thank you very much, nicely described and very helpful!

Unknown said...

could you please send a new link for shell scripting method

Jing said...

The link has been updated.

Anonymous said...

What is ALL_TAB_COLS in SQL statement ?

Anonymous said...

Hi Jing,

Thanks a lot for your post, it is very very helpful to create dynamic schema file without manual intervention.

In your "Fully Automated Solution 2", Can you please elaborate your SQL (which is between and 1st and 2nd Union All) to create schema file for all columns in a table ??

Based on your SQL, schema file will have just one column from a table

Thanks in Advance

Jing said...

The middle select statement gets column information from Oracle's built-in view ALL_TAB_COLS. It gives you ALL columns from a table.

Anonymous said...

Looks like you are assuming all CSV file are originated from a database tables as you wrote -- "Since every csv file is mirrored in a database table" -- what if the CSV file is made up or from an application log which has no associated database table? Is there a way to create schema file for such a CSV file using data state?

Thanks!