Sunday, December 22, 2013

Apply Balanced Optimization Principles in DataStage Job Design

DataStage v8.5 introduced an optional add-on: balanced optimizer. The optimizer can redesign parallel jobs to improve performance. It applies the following principles in the redesign.
Minimize I/O and data movement
Reduce the amount of source data read by the job by performing computations within the source database. Where possible, move processing of data to the database and avoid extracting data just to process it and write it back to the same database.
Maximize optimization within source or target databases
Make use of the highly developed optimizations that databases achieve by using local indexes, statistics, and other specialized features.
Maximize parallelism
Take advantage of default InfoSphere DataStage behavior when reading and writing databases: use parallel interfaces and pipe the data through the job, so that data flows from source to target without being written to intermediate destinations.
The key takeaway is to do as much as possible within databases. It seems to me that an experienced DataStage developer could do a much better job in applying these principles to optimize a job than a piece of software. So let me give it a try on a real-world production job.
The original design is shown in Fig. 1. All data processing and transformation are performed within DataStage server, including,
  1. two slowly changing dimension stages,
  2. five lookup stages, and
  3. three transformer stages for adding timestamp columns.
Eight Oracle connector stages serve as data sources/targets only, reading 5 source tables and writing 3 target tables within the same database. No other tasks are performed within the connectors. This type of job is an ideal candidate for balanced optimization.

Figure 1: Original job design

The optimized job is shown in Fig. 2. A single Oracle connector stage replaces all 20 processing and data stages. The main data stream is bulk-loaded into a temp table (Fig. 3), and processed in after-SQL.
Figure 2: Redesigned job applying the principles of Balanced Optimization

Figure 3: Oracle connector configuration in the optimized job

Here are the after-SQL statements.
-- Update OSURF_AWARD_PF_DIM --
MERGE INTO OSURF_AWARD_PF_DIM D USING (
SELECT DISTINCT
PRCS_FLAG
FROM OSURF_AWARD_FCT_TEMP
) M ON (D.PRCS_FLAG = M.PRCS_FLAG)
WHEN NOT MATCHED THEN INSERT (
D.AWARD_PF_KEY,
D.PRCS_FLAG,
D.IW_INSERT_DT_TM
) VALUES (
#IWGeneralPS.$SCHEMA_PREFIX##OsurfPS.Schema_Name#.OSURF_AWARD_PF_DIM_SEQ.NEXTVAL,
M.PRCS_FLAG,
SYSDATE
);
-- Update OSURF_EMP_DIM --
MERGE INTO OSURF_EMP_DIM D USING (
SELECT DISTINCT
PI_EMP_ID,
FULL_NM,
NM_PFX,
NM_SFX,
LAST_NM,
FIRST_NM,
MIDDLE_NM,
WRK_TITLE,
JOB_TITLE,
ADDR_1,
ADDR_2,
CITY,
STATE,
ZIP_CD,
HI_EDU_LVL_CD,
STD_HRS,
FTE
FROM OSURF_AWARD_FCT_TEMP
) M ON (
D.EMP_ID = M.PI_EMP_ID and
D.WRK_TITLE = M.WRK_TITLE and
D.JOB_TITLE = M.JOB_TITLE and
D.STD_HRS = M.STD_HRS and
D.FTE = M.FTE
)
WHEN MATCHED THEN UPDATE SET
D.FULL_NM = M.FULL_NM,
D.NM_PFX = M.NM_PFX,
D.NM_SFX = M.NM_SFX,
D.LAST_NM = M.LAST_NM,
D.FIRST_NM = M.FIRST_NM,
D.MIDDLE_NM = M.MIDDLE_NM,
D.ADDR_1 = M.ADDR_1,
D.ADDR_2 = M.ADDR_2,
D.CITY = M.CITY,
D.STATE = M.STATE,
D.ZIP_CD = M.ZIP_CD,
D.HI_EDU_LVL_CD = M.HI_EDU_LVL_CD,
D.IW_INSERT_DT_TM = SYSDATE
WHERE
D.FULL_NM <> M.FULL_NM OR
D.NM_PFX <> M.NM_PFX OR
D.NM_SFX <> M.NM_SFX OR
D.LAST_NM <> M.LAST_NM OR
D.FIRST_NM <> M.FIRST_NM OR
D.MIDDLE_NM <> M.MIDDLE_NM OR
D.ADDR_1 <> M.ADDR_1 OR
D.ADDR_2 <> M.ADDR_2 OR
D.CITY <> M.CITY OR
D.STATE <> M.STATE OR
D.ZIP_CD <> M.ZIP_CD OR
D.HI_EDU_LVL_CD <> M.HI_EDU_LVL_CD
WHEN NOT MATCHED THEN INSERT (
D.EMP_KEY,
D.EMP_ID,
D.FULL_NM,
D.NM_PFX,
D.NM_SFX,
D.LAST_NM,
D.FIRST_NM,
D.MIDDLE_NM,
D.WRK_TITLE,
D.JOB_TITLE,
D.ADDR_1,
D.ADDR_2,
D.CITY,
D.STATE,
D.ZIP_CD,
D.HI_EDU_LVL_CD,
D.STD_HRS,
D.FTE,
D.IW_INSERT_DT_TM
) VALUES (
#IWGeneralPS.$SCHEMA_PREFIX##OsurfPS.Schema_Name#.OSURF_EMP_DIM_SEQ.NEXTVAL,
M.PI_EMP_ID,
M.FULL_NM,
M.NM_PFX,
M.NM_SFX,
M.LAST_NM,
M.FIRST_NM,
M.MIDDLE_NM,
M.WRK_TITLE,
M.JOB_TITLE,
M.ADDR_1,
M.ADDR_2,
M.CITY,
M.STATE,
M.ZIP_CD,
M.HI_EDU_LVL_CD,
M.STD_HRS,
M.FTE,
SYSDATE
);
-- Load OSURF_AWARD_FCT --
MERGE INTO OSURF_AWARD_FCT D USING (
SELECT
g.GRANT_DTL_KEY,
pf.AWARD_PF_KEY,
to_number(to_char(t.award_dt, 'J')) as AWARD_DT_KEY,
e.emp_key as PI_EMP_KEY,
hm.org_key as DEPT_ORG_KEY,
tiu.org_key as TIU_ORG_KEY,
s.spnsr_key as SPNSR_KEY,
ps.spnsr_key as PRIM_SPNSR_KEY,
T.AMT,
T.AWARD_BSE,
T.FA_CHARGE_DIFF,
T.AWARD_NO
FROM
OSURF_AWARD_FCT_TEMP T join osurf_grant_dtl_dim g on
t.ctrct_no = g.ctrct_no
join osurf_emp_dim e on
t.pi_emp_id = e.emp_id and
t.wrk_title = e.wrk_title and
t.job_title = e.job_title and
t.std_hrs = e.std_hrs and
t.fte = e.fte
join osurf_award_pf_dim pf on
t.prcs_flag = pf.prcs_flag
join osurf_spnsr_dim ps on
t.prim_spnsr_id = ps.src_spnsr_id
join osurf_spnsr_dim s on
t.spnsr_id = s.src_spnsr_id
join osurf_org_dim hm on
t.home_org_id = hm.org_id
join osurf_org_dim tiu on
t.tiu_dept_id = tiu.org_id
) M
ON (
D.GRANT_DTL_KEY = M.GRANT_DTL_KEY and
D.AWARD_NO = M.AWARD_NO
)
WHEN MATCHED THEN UPDATE SET
D.AWARD_PF_KEY = M.AWARD_PF_KEY,
D.AWARD_DT_KEY = M.AWARD_DT_KEY,
D.PI_EMP_KEY = M.PI_EMP_KEY,
D.DEPT_ORG_KEY = M.DEPT_ORG_KEY,
D.TIU_ORG_KEY = M.TIU_ORG_KEY,
D.SPNSR_KEY = M.SPNSR_KEY,
D.PRIM_SPNSR_KEY = M.PRIM_SPNSR_KEY,
D.AMT = M.AMT,
D.AWARD_BSE = M.AWARD_BSE,
D.FA_CHARGE_DIFF = M.FA_CHARGE_DIFF,
D.IW_INSERT_DT_TM = SYSDATE
where
D.AWARD_PF_KEY <> M.AWARD_PF_KEY or
D.AWARD_DT_KEY <> M.AWARD_DT_KEY or
D.PI_EMP_KEY <> M.PI_EMP_KEY or
D.DEPT_ORG_KEY <> M.DEPT_ORG_KEY or
D.TIU_ORG_KEY <> M.TIU_ORG_KEY or
D.SPNSR_KEY <> M.SPNSR_KEY or
D.PRIM_SPNSR_KEY <> M.PRIM_SPNSR_KEY or
D.AMT <> M.AMT or
D.AWARD_BSE <> M.AWARD_BSE or
D.FA_CHARGE_DIFF <> M.FA_CHARGE_DIFF
WHEN NOT MATCHED THEN INSERT (
D.GRANT_DTL_KEY,
D.AWARD_PF_KEY,
D.AWARD_DT_KEY,
D.PI_EMP_KEY,
D.DEPT_ORG_KEY,
D.TIU_ORG_KEY,
D.SPNSR_KEY,
D.PRIM_SPNSR_KEY,
D.AMT,
D.AWARD_BSE,
D.FA_CHARGE_DIFF,
D.AWARD_NO,
D.IW_INSERT_DT_TM
) VALUES (
M.GRANT_DTL_KEY,
M.AWARD_PF_KEY,
M.AWARD_DT_KEY,
M.PI_EMP_KEY,
M.DEPT_ORG_KEY,
M.TIU_ORG_KEY,
M.SPNSR_KEY,
M.PRIM_SPNSR_KEY,
M.AMT,
M.AWARD_BSE,
M.FA_CHARGE_DIFF,
M.AWARD_NO,
SYSDATE
);
view raw bo-after.sql hosted with ❤ by GitHub


The performance gain is dramatic, dropping from ~30 seconds all the way down to ~3 seconds. In addition, I get two extra bonuses for free.

  1. The original job is not transactional. If it failed in the middle, the target tables might be in inconsistent states with partial results. In order to recover from the job failure, I need to do some cleanup. The optimized job, on the other hand, is transactional. The update to the target tables is carried out in the after-SQL statements. If they fail, the target tables are automatically rolled back.
  2. If the job fails because of data quality issues, such as lookup failure or unique constraint violation, it is not easy to find the offending rows in the original design. In the optimized design, all intermediate results are kept in the temp table within the same database as the target tables. It is fairly straightforward to pinpoint the offending rows by joining the temp and the target tables.
In conclusion, balanced optimization (BO) can greatly improve job performance, as well as simplify maintenance. However, it is not necessary to pay the additional license fee if the BO principles can been applied in job design by developers.

Saturday, December 21, 2013

Scrollable Table Control with Fixed Header

Introduction

ISD provides a scrollable table control for displaying a lot of rows in a window with fixed height. It lacks, however, the most desirable feature of a scrollable table: a fixed table header. Here is a screenshot and a demo page. The previous versions of ISD do have a fixed-header scrollable table control. But it only works in IE, not in Chrome or Firefox.
A scrollable table is, in my opinion, not a good UI design in the first place. It significantly slows down initial page load. The issue (or the header) can be easily fixed. If your client or manager asks for a scrollable table, try your best to steer them away from it. If you can't, then follow these steps to implement a fixed-header scrollable table.

Implementation

Step 1: Prepare a built-in scrollable table

Add a built-in scrollable table control onto your page. Split the merged header cell above row buttons. This is an important step. Any merged cell will screw up column width calculation while fixing the header row.

Step 2: Download JavaScript libraries

ISD has added out-of-box jQuery support in latest versions. If you still use old versions without jQuery, you can download it from jQuery website, and include it in master pages.
Another library is the jQuery Scrollable Table Plugin. Download it from its website, and put it under your project's root folder.

Step 3: Modify Styles.css

Add the following classes to the Styles.css file.
.tablescroll {
font: 12px normal Tahoma, Geneva, "Helvetica Neue", Helvetica, Arial, sans-serif;
}
.tablescroll td,
.tablescroll_wrapper,
.tablescroll_head,
.tablescroll_foot {
border:1px solid #ccc;
}
.tablescroll td {
padding:3px 5px;
border-bottom:0;
border-right:0;
}
.tablescroll_wrapper {
background-color:#fff;
border-left:0;
}
.tablescroll_head,
.tablescroll_foot {
background-color:#eee;
border-left:0;
border-top:0;
font-size:11px;
font-weight:bold;
}
.tablescroll_head {
margin-bottom:3px;
}
.tablescroll_foot {
margin-top:3px;
}
.tablescroll tbody tr.first td {
border-top:0;
}
view raw tablescroll.css hosted with ❤ by GitHub

Step 4: Add JavaScript code to page prologue

Add the following JavaScript code to the page prologue.
<script language="JavaScript" type="text/javascript" src="../jquery.tablescroll.js"></script>
<script language="JavaScript" type="text/javascript">
Sys.Application.add_load(function(){
var d = $("#tableContainer");
var s = d.children("table");
var t = d.closest("td");
s.detach().appendTo(t);
d.remove();
s.tableScroll({height:200});
});
</script>
Build and run.

Conclusion

Here is what you get on screen, and a demo page. Please note the long delay (~10 seconds) at initial page load, during which the table header appears not fixed. This endorses my previous suggestion: avoid scrollable tables as much as you can.