Thursday, May 30, 2013

Generate Insert/Update Script for DataStage Oracle Connector Automatically

Here is an updated post.

For transactional (non-bulk) data loads in DataStage with the Oracle connector, "update then insert" is usually the best write mode. As shown in this post, "update then insert" is as fast as "insert then update" when the data is 100% inserts, and outperforms the other modes as the percentage of updates grows. Even if you know the incoming data is 100% inserts, "update then insert" is still the better choice: the job may fail in the middle with the data partially loaded, and on restart the remaining data is no longer 100% inserts.

To use the "update then insert" mode, you need to specify the key column(s) in the Oracle connector stage, so that it knows which columns to match in the WHERE clause when it generates the update statement at run time. Alternatively, you can provide the insert and update statements yourself.

Writing insert and update statements by hand is tedious, especially if you have many tables. The following SQL scripts generate the statements automatically. You can paste the output directly into Oracle connector stages, or wrap the scripts in a simple parallel job, save the output to a text file, and use a sequence job to loop through hundreds of tables.

For update statement:
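The script itself did not survive in this copy of the post. A minimal Oracle sketch of the idea, assuming the table name is supplied in a bind variable :tbl, the key columns are taken from the table's primary-key constraint, and the connector's ORCHESTRATE.column_name placeholder syntax is used:

```sql
-- Illustrative sketch: non-key columns go into the SET list,
-- primary-key columns into the WHERE clause.
-- LISTAGG silently skips the NULLs produced by the CASE expressions.
select 'update ' || lower(:tbl) || ' set '
    || listagg(case when pk.column_name is null
                    then lower(c.column_name) || ' = ORCHESTRATE.' || lower(c.column_name)
               end, ', ') within group (order by c.column_id)
    || ' where '
    || listagg(case when pk.column_name is not null
                    then lower(c.column_name) || ' = ORCHESTRATE.' || lower(c.column_name)
               end, ' and ') within group (order by c.column_id) as update_sql
from user_tab_columns c
left join (select cc.column_name
           from user_constraints uc
           join user_cons_columns cc on cc.constraint_name = uc.constraint_name
           where uc.table_name = upper(:tbl)
             and uc.constraint_type = 'P') pk
  on pk.column_name = c.column_name
where c.table_name = upper(:tbl);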
For insert statement:
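Likewise, a sketch of the matching insert generator under the same assumptions (table name in bind variable :tbl, ORCHESTRATE placeholders):

```sql
-- Illustrative sketch: list every column once in the column list
-- and once as an ORCHESTRATE placeholder in the VALUES list.
select 'insert into ' || lower(:tbl) || ' ('
    || listagg(lower(column_name), ', ') within group (order by column_id)
    || ') values ('
    || listagg('ORCHESTRATE.' || lower(column_name), ', ')
         within group (order by column_id)
    || ')' as insert_sql
from user_tab_columns
where table_name = upper(:tbl);
```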

Saturday, May 25, 2013

ISD Code Customization Tips, Part III: Avoid Reading QueryString Directly


ISD makes it trivial to implement URL parameter encryption: all you need to do is tick a checkbox in the application generation options. This is extremely helpful when you need to turn the feature on or off in the middle of development. For example, your client may request URL encryption at the last minute, just before you are ready to release the application. Or you may have had encryption turned on from the beginning of the project, turned it off temporarily to track down a tricky bug, and turned it back on after the fix.

However, if you read URL parameters directly in your customization code, like this,
string companyId = Page.Request.QueryString["CompanyId"];
you lose all the flexibility and convenience ISD provides. That's because ISD adds/removes encryption-related code only in the auto-generated section II; it does not touch custom code in section I. It is therefore your responsibility to switch encryption on or off in section I.


To read URL parameters in section I custom code without losing the encryption on/off flexibility, use the URL( ) formula function, like this,
string companyId = EvaluateFormula("URL(\"CompanyId\")");
The URL( ) formula function automatically decrypts the parameter when URL encryption is turned on.

Thursday, May 23, 2013

Age Calculation: Oracle vs. SQL Server

Oracle:     months_between(sysdate, dob) / 12
SQL Server: year(getdate()) - year(dob) + (month(getdate()) - month(dob))/12.0 + (day(getdate()) - day(dob))/365.0

If you are only interested in the integer part of the age, here is an alternative method for SQL Server:
(cast(convert(varchar(8), getdate(), 112) as int) - cast(convert(varchar(8), dob, 112) as int))/10000
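The trick works because style 112 renders a date as yyyymmdd, so the integer difference carries the year count in the ten-thousands. A worked example with a hypothetical date of birth:

```sql
-- Hypothetical values: dob = 1990-05-30, today = 2013-05-23.
-- convert(varchar(8), @dob, 112)      -> '19900530' -> 19900530
-- convert(varchar(8), getdate(), 112) -> '20130523' -> 20130523
-- (20130523 - 19900530) / 10000 = 229993 / 10000 = 22  (integer division)
declare @dob date = '1990-05-30';
select (cast(convert(varchar(8), getdate(), 112) as int)
      - cast(convert(varchar(8), @dob, 112) as int)) / 10000 as age;
```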

Tuesday, May 21, 2013

DateDiff: Oracle vs. SQL Server


Oracle and SQL Server calculate the difference between two dates very differently, both syntactically and semantically. The following table summarizes the differences.
            Syntax                          Semantics
Oracle      date2 - date1                   Time elapsed from date1 to date2, measured in days.
SQL Server  datediff(unit, date1, date2)    The number of date/time (unit) boundaries crossed from date1 to date2.

SQL Server's semantics are best demonstrated with the following example.
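The example itself is missing from this copy of the post; the following reconstruction (with hypothetical timestamps) matches the discussion: two intervals of 10 seconds each, with different boundary counts.

```sql
-- Case 1: 10 seconds elapsed, and one minute boundary (10:01:00) is crossed.
select datediff(minute, '2013-05-21 10:00:55', '2013-05-21 10:01:05');  -- returns 1

-- Case 2: also 10 seconds elapsed, but no minute boundary is crossed.
select datediff(minute, '2013-05-21 10:00:05', '2013-05-21 10:00:15');  -- returns 0
```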

In both cases, the time elapsed is the same, 10 seconds. But the boundary counts are different. Be careful about the semantic difference. Make sure what you calculate is what you want.

This raises an obvious question: how do you calculate boundaries crossed in Oracle, or time elapsed in SQL Server?

Boundaries crossed in Oracle

First, use the trunc() function to align both dates to their boundaries, then calculate the difference. For example,
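The original example is not shown here; a sketch with hypothetical timestamps, counting minute boundaries (the subtraction yields days, hence the scaling; for day boundaries, trunc(d2) - trunc(d1) alone suffices):

```sql
-- Truncate both dates to the minute, subtract (difference is in days),
-- then scale to minutes to get the number of minute boundaries crossed.
select (trunc(d2, 'MI') - trunc(d1, 'MI')) * 24 * 60 as minute_boundaries
from (select to_date('2013-05-21 10:00:55', 'yyyy-mm-dd hh24:mi:ss') as d1,
             to_date('2013-05-21 10:01:05', 'yyyy-mm-dd hh24:mi:ss') as d2
      from dual);  -- returns 1
```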

Time elapsed in SQL Server

Use the number of second boundaries crossed as an approximation of the seconds elapsed, then convert it to the desired unit. For example,
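The original example is likewise missing; a sketch with the same hypothetical timestamps, computing minutes elapsed:

```sql
-- Seconds elapsed (approximated by second boundaries crossed),
-- converted to minutes; note the 60.0 to force decimal division.
select datediff(second, '2013-05-21 10:00:55', '2013-05-21 10:01:05') / 60.0;
-- returns 0.166666... (10 seconds), whereas datediff(minute, ...)
-- on the same pair of timestamps returns 1
```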

Monday, May 20, 2013

Automatically Create DataStage Schema Files


Loading a CSV file into a database table is one of the simplest DataStage tasks. Using Runtime Column Propagation (RCP), a single parameterized job can handle multiple files. Here is a step-by-step tutorial that does just that: Guidelines to Develop a Generic Job using Schema Files. For each CSV file, DataStage needs a schema file to load column information at runtime. In the tutorial, a sample schema file is created by hand. If there are dozens of CSV files to load, creating all those schema files manually is quite tedious. Since every CSV file is mirrored in a database table, it is possible to let DataStage collect the column information from the database automatically. I will show you three approaches in this post.

Semi-Automated Solution

The first solution is semi-automated, using DataStage's Import Table Definition wizard. In DataStage Designer, click menu item Import / Table Definitions / Orchestrate Schema Definitions, then select the Database table (via orchdbutil) option. Type in the database connection info and select the table to be imported. In the next step, DataStage reads the column information from the database and displays it in the schema file format.

Copy and paste the columns into a schema template. The template contains the other format information, such as field and record delimiters. Save it to a new text file, and this table is done. Go back to the previous step, select another table, and repeat.
This approach works well for a handful of tables. If there are many more tables, or the table structures change frequently, a fully automated approach is more desirable.

Fully Automated Solution 1

As revealed in the previous approach, the DataStage import wizard uses a command-line application (orchdbutil) under the hood to generate the schema. If you have access to the DataStage server and are comfortable with shell scripting or another scripting language, you can write a script that loops through the table names in an input file, feeds each table to orchdbutil, and parses its output. I am not going to get into the details; here is a blog post that can help you get started: Importing Table Definitions With OrchDbUtil.

Fully Automated Solution 2

Another fully automated approach takes advantage of DataStage itself. Create a parallel job that reads the schema through an Oracle connector using a custom SQL statement and saves it to a sequential file, then create a sequence job to loop through a list of tables. Below is a sample implementation of the parallel job, the SQL statement, and the sequence job.
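The original screenshots and SQL are not reproduced in this copy. A minimal sketch of such a SQL statement, assuming an Oracle source and a table name in bind variable :tbl; the type mapping below covers only a few common Oracle types and is purely illustrative:

```sql
-- Illustrative sketch: emit one schema-file field definition per column.
-- Wrap the output in "record {final_delim=end, delim=','} ( ... )"
-- when writing the sequential file.
select lower(column_name) || ': '
    || case when nullable = 'Y' then 'nullable ' else '' end
    || case data_type
         when 'NUMBER'   then 'decimal[' || data_precision || ',' || data_scale || ']'
         when 'DATE'     then 'timestamp'
         when 'VARCHAR2' then 'string[max=' || data_length || ']'
         else 'string[max=255]'            -- fallback for unmapped types
       end
    || ';' as field_def
from user_tab_columns
where table_name = upper(:tbl)
order by column_id;
```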