DataStage v8.5 introduced an optional add-on: balanced optimizer. The optimizer can redesign parallel jobs to improve performance. It applies the following principles in the redesign.
Minimize I/O and data movement
Reduce the amount of source data read by the job by performing computations within the source database. Where possible, move processing of data to the database and avoid extracting data just to process it and write it back to the same database.
Maximize optimization within source or target databases
Make use of the highly developed optimizations that databases achieve by using local indexes, statistics, and other specialized features.
Maximize parallelism
Take advantage of default InfoSphere DataStage behavior when reading and writing databases: use parallel interfaces and pipe the data through the job, so that data flows from source to target without being written to intermediate destinations.
The key takeaway is to do as much as possible within databases. It seems to me that an experienced DataStage developer could do a much better job in applying these principles to optimize a job than a piece of software. So let me give it a try on a real-world production job.
The original design is shown in Fig. 1. All data processing and transformation are performed within DataStage server, including,
- two slowly changing dimension stages,
- five lookup stages, and
- three transformer stages for adding timestamp columns.
Eight Oracle connector stages serve as data sources/targets only, reading 5 source tables and writing 3 target tables within the same database. No other tasks are performed within the connectors. This type of job is an ideal candidate for balanced optimization.
Figure 1: Original job design
The optimized job is shown in Fig. 2. A single Oracle connector stage replaces all 20 processing and data stages. The main data stream is bulk-loaded into a temp table (Fig. 3), and processed in after-SQL.
Figure 2: Redesigned job applying the principles of Balanced Optimization
Figure 3: Oracle connector configuration in the optimized job
Here are the after-SQL statements.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
-- Update OSURF_AWARD_PF_DIM -- | |
MERGE INTO OSURF_AWARD_PF_DIM D USING ( | |
SELECT DISTINCT | |
PRCS_FLAG | |
FROM OSURF_AWARD_FCT_TEMP | |
) M ON (D.PRCS_FLAG = M.PRCS_FLAG) | |
WHEN NOT MATCHED THEN INSERT ( | |
D.AWARD_PF_KEY, | |
D.PRCS_FLAG, | |
D.IW_INSERT_DT_TM | |
) VALUES ( | |
#IWGeneralPS.$SCHEMA_PREFIX##OsurfPS.Schema_Name#.OSURF_AWARD_PF_DIM_SEQ.NEXTVAL, | |
M.PRCS_FLAG, | |
SYSDATE | |
); | |
-- Update OSURF_EMP_DIM -- | |
MERGE INTO OSURF_EMP_DIM D USING ( | |
SELECT DISTINCT | |
PI_EMP_ID, | |
FULL_NM, | |
NM_PFX, | |
NM_SFX, | |
LAST_NM, | |
FIRST_NM, | |
MIDDLE_NM, | |
WRK_TITLE, | |
JOB_TITLE, | |
ADDR_1, | |
ADDR_2, | |
CITY, | |
STATE, | |
ZIP_CD, | |
HI_EDU_LVL_CD, | |
STD_HRS, | |
FTE | |
FROM OSURF_AWARD_FCT_TEMP | |
) M ON ( | |
D.EMP_ID = M.PI_EMP_ID and | |
D.WRK_TITLE = M.WRK_TITLE and | |
D.JOB_TITLE = M.JOB_TITLE and | |
D.STD_HRS = M.STD_HRS and | |
D.FTE = M.FTE | |
) | |
WHEN MATCHED THEN UPDATE SET | |
D.FULL_NM = M.FULL_NM, | |
D.NM_PFX = M.NM_PFX, | |
D.NM_SFX = M.NM_SFX, | |
D.LAST_NM = M.LAST_NM, | |
D.FIRST_NM = M.FIRST_NM, | |
D.MIDDLE_NM = M.MIDDLE_NM, | |
D.ADDR_1 = M.ADDR_1, | |
D.ADDR_2 = M.ADDR_2, | |
D.CITY = M.CITY, | |
D.STATE = M.STATE, | |
D.ZIP_CD = M.ZIP_CD, | |
D.HI_EDU_LVL_CD = M.HI_EDU_LVL_CD, | |
D.IW_INSERT_DT_TM = SYSDATE | |
WHERE | |
D.FULL_NM <> M.FULL_NM OR | |
D.NM_PFX <> M.NM_PFX OR | |
D.NM_SFX <> M.NM_SFX OR | |
D.LAST_NM <> M.LAST_NM OR | |
D.FIRST_NM <> M.FIRST_NM OR | |
D.MIDDLE_NM <> M.MIDDLE_NM OR | |
D.ADDR_1 <> M.ADDR_1 OR | |
D.ADDR_2 <> M.ADDR_2 OR | |
D.CITY <> M.CITY OR | |
D.STATE <> M.STATE OR | |
D.ZIP_CD <> M.ZIP_CD OR | |
D.HI_EDU_LVL_CD <> M.HI_EDU_LVL_CD | |
WHEN NOT MATCHED THEN INSERT ( | |
D.EMP_KEY, | |
D.EMP_ID, | |
D.FULL_NM, | |
D.NM_PFX, | |
D.NM_SFX, | |
D.LAST_NM, | |
D.FIRST_NM, | |
D.MIDDLE_NM, | |
D.WRK_TITLE, | |
D.JOB_TITLE, | |
D.ADDR_1, | |
D.ADDR_2, | |
D.CITY, | |
D.STATE, | |
D.ZIP_CD, | |
D.HI_EDU_LVL_CD, | |
D.STD_HRS, | |
D.FTE, | |
D.IW_INSERT_DT_TM | |
) VALUES ( | |
#IWGeneralPS.$SCHEMA_PREFIX##OsurfPS.Schema_Name#.OSURF_EMP_DIM_SEQ.NEXTVAL, | |
M.PI_EMP_ID, | |
M.FULL_NM, | |
M.NM_PFX, | |
M.NM_SFX, | |
M.LAST_NM, | |
M.FIRST_NM, | |
M.MIDDLE_NM, | |
M.WRK_TITLE, | |
M.JOB_TITLE, | |
M.ADDR_1, | |
M.ADDR_2, | |
M.CITY, | |
M.STATE, | |
M.ZIP_CD, | |
M.HI_EDU_LVL_CD, | |
M.STD_HRS, | |
M.FTE, | |
SYSDATE | |
); | |
-- Load OSURF_AWARD_FCT -- | |
MERGE INTO OSURF_AWARD_FCT D USING ( | |
SELECT | |
g.GRANT_DTL_KEY, | |
pf.AWARD_PF_KEY, | |
to_number(to_char(t.award_dt, 'J')) as AWARD_DT_KEY, | |
e.emp_key as PI_EMP_KEY, | |
hm.org_key as DEPT_ORG_KEY, | |
tiu.org_key as TIU_ORG_KEY, | |
s.spnsr_key as SPNSR_KEY, | |
ps.spnsr_key as PRIM_SPNSR_KEY, | |
T.AMT, | |
T.AWARD_BSE, | |
T.FA_CHARGE_DIFF, | |
T.AWARD_NO | |
FROM | |
OSURF_AWARD_FCT_TEMP T join osurf_grant_dtl_dim g on | |
t.ctrct_no = g.ctrct_no | |
join osurf_emp_dim e on | |
t.pi_emp_id = e.emp_id and | |
t.wrk_title = e.wrk_title and | |
t.job_title = e.job_title and | |
t.std_hrs = e.std_hrs and | |
t.fte = e.fte | |
join osurf_award_pf_dim pf on | |
t.prcs_flag = pf.prcs_flag | |
join osurf_spnsr_dim ps on | |
t.prim_spnsr_id = ps.src_spnsr_id | |
join osurf_spnsr_dim s on | |
t.spnsr_id = s.src_spnsr_id | |
join osurf_org_dim hm on | |
t.home_org_id = hm.org_id | |
join osurf_org_dim tiu on | |
t.tiu_dept_id = tiu.org_id | |
) M | |
ON ( | |
D.GRANT_DTL_KEY = M.GRANT_DTL_KEY and | |
D.AWARD_NO = M.AWARD_NO | |
) | |
WHEN MATCHED THEN UPDATE SET | |
D.AWARD_PF_KEY = M.AWARD_PF_KEY, | |
D.AWARD_DT_KEY = M.AWARD_DT_KEY, | |
D.PI_EMP_KEY = M.PI_EMP_KEY, | |
D.DEPT_ORG_KEY = M.DEPT_ORG_KEY, | |
D.TIU_ORG_KEY = M.TIU_ORG_KEY, | |
D.SPNSR_KEY = M.SPNSR_KEY, | |
D.PRIM_SPNSR_KEY = M.PRIM_SPNSR_KEY, | |
D.AMT = M.AMT, | |
D.AWARD_BSE = M.AWARD_BSE, | |
D.FA_CHARGE_DIFF = M.FA_CHARGE_DIFF, | |
D.IW_INSERT_DT_TM = SYSDATE | |
where | |
D.AWARD_PF_KEY <> M.AWARD_PF_KEY or | |
D.AWARD_DT_KEY <> M.AWARD_DT_KEY or | |
D.PI_EMP_KEY <> M.PI_EMP_KEY or | |
D.DEPT_ORG_KEY <> M.DEPT_ORG_KEY or | |
D.TIU_ORG_KEY <> M.TIU_ORG_KEY or | |
D.SPNSR_KEY <> M.SPNSR_KEY or | |
D.PRIM_SPNSR_KEY <> M.PRIM_SPNSR_KEY or | |
D.AMT <> M.AMT or | |
D.AWARD_BSE <> M.AWARD_BSE or | |
D.FA_CHARGE_DIFF <> M.FA_CHARGE_DIFF | |
WHEN NOT MATCHED THEN INSERT ( | |
D.GRANT_DTL_KEY, | |
D.AWARD_PF_KEY, | |
D.AWARD_DT_KEY, | |
D.PI_EMP_KEY, | |
D.DEPT_ORG_KEY, | |
D.TIU_ORG_KEY, | |
D.SPNSR_KEY, | |
D.PRIM_SPNSR_KEY, | |
D.AMT, | |
D.AWARD_BSE, | |
D.FA_CHARGE_DIFF, | |
D.AWARD_NO, | |
D.IW_INSERT_DT_TM | |
) VALUES ( | |
M.GRANT_DTL_KEY, | |
M.AWARD_PF_KEY, | |
M.AWARD_DT_KEY, | |
M.PI_EMP_KEY, | |
M.DEPT_ORG_KEY, | |
M.TIU_ORG_KEY, | |
M.SPNSR_KEY, | |
M.PRIM_SPNSR_KEY, | |
M.AMT, | |
M.AWARD_BSE, | |
M.FA_CHARGE_DIFF, | |
M.AWARD_NO, | |
SYSDATE | |
); |
The performance gain is dramatic, dropping from ~30 seconds all the way down to ~3 seconds. In addition, I get two extra bonuses for free.
- The original job is not transactional. If it failed in the middle, the target tables might be in inconsistent states with partial results. In order to recover from the job failure, I need to do some cleanup. The optimized job, on the other hand, is transactional. The update to the target tables is carried out in the after-SQL statements. If they fail, the target tables are automatically rolled back.
- If the job fails because of data quality issues, such as lookup failure or unique constraint violation, it is not easy to find the offending rows in the original design. In the optimized design, all intermediate results are kept in the temp table within the same database as the target tables. It is fairly straightforward to pinpoint the offending rows by joining the temp and the target tables.
In conclusion, balanced optimization (BO) can greatly improve job performance, as well as simplify maintenance. However, it is not necessary to pay the additional license fee if the BO principles can been applied in job design by developers.