
I've been asked a few times now what should be studied before taking a DataStage Certification exam. Here is a list of manual sections to study and some exercise ideas to go with them.
Keep an eye on my DataStage Certification Squidoo lens for links to my certification blog entries and IBM certification pages. A few months ago I had an idea to publish a book on this subject, but with blogging, article writing and work I simply don't have the time. Writing blogs is turning out to be more fun than writing a tech manual.
This is not a complete list. Go to Squidoo for the link to the IBM pages for the full test scope. These are just the topics I can remember from the exam.
It helps if you have a working DataStage installation that can run parallel jobs! Turn on APT_DUMP_SCORE, as it writes a message to the job log that shows the operators, processes and datasets in a parallel job run. This is important when running exercises on the effects of combining operators, sorting and partitioning.
1) Understanding the parallel engine.
Read the opening chapters of the Parallel Job Developers Guide on what the parallel engine and job partitioning are all about. Read up on each partitioning type.
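To get a feel for what to expect before running anything, here is a rough Python sketch of three of the partitioning types: round robin, modulus and hash. It is only an illustration of the general idea. The function names are my own, and crc32 is a stand-in, not the hash function the parallel engine actually uses, so the real engine will put hashed rows on different nodes than this does.

```python
import zlib
from collections import defaultdict


def round_robin(rows, n):
    """Row i goes to partition i mod n, whatever the row contains."""
    parts = defaultdict(list)
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return dict(parts)


def modulus(rows, n, key):
    """An integer key column decides the partition: key value mod n."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[key] % n].append(row)
    return dict(parts)


def hash_partition(rows, n, key):
    """Rows with the same key value always land in the same partition.

    crc32 is an assumption for illustration only, not the engine's hash.
    """
    parts = defaultdict(list)
    for row in rows:
        bucket = zlib.crc32(str(row[key]).encode()) % n
        parts[bucket].append(row)
    return dict(parts)


# Ten sample rows: ids 1 to 10, codes A x5 then B x5.
rows = [{"id": i, "code": "AB"[(i - 1) // 5]} for i in range(1, 11)]

if __name__ == "__main__":
    for name, parts in [
        ("round robin", round_robin(rows, 2)),
        ("modulus on id", modulus(rows, 2, "id")),
        ("hash on code", hash_partition(rows, 2, "code")),
    ]:
        print(name)
        for p in sorted(parts):
            print("  partition", p, [(r["id"], r["code"]) for r in parts[p]])
```

The point to take away is the last one: hashing on the code column keeps every A row together and every B row together, which is what stages that group or join on a key rely on.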
Exercise: I like to generate a set of rows into a sequential file for testing out various partitioning types. One column has unique ids 1 to 100 and a second column has repeating codes such as A, A, A, A, A, B, B, B, B, B etc. I write a job that reads from the input, sends it through a partitioning stage such as a transformer and writes it to a peek stage. The Director logs tell me which rows went where. You should also view the Director monitor and expand it to show the row counts on each instance of each stage in the job, to see how stages are split and run on each node and how many rows each instance gets.
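If you would rather not build the test file by hand, a few lines of Python will do it. This is a minimal sketch: the file name partition_test.txt, the comma delimiter and the function names are my own choices, so change them to match whatever your sequential file stage expects.

```python
import csv
import string


def make_rows(total=100, block=5):
    """Return (id, code) pairs: ids 1..total, codes A, A, A, A, A, B, B, ..."""
    rows = []
    for i in range(total):
        code = string.ascii_uppercase[(i // block) % 26]
        rows.append((i + 1, code))
    return rows


def write_rows(path, rows):
    """Write the rows as a comma-delimited text file, one row per line."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)


if __name__ == "__main__":
    # Hypothetical file name; point the sequential file stage at it.
    write_rows("partition_test.txt", make_rows())
```

With 100 rows in blocks of five you get twenty distinct codes, A through T, which is enough to see keyed partitioning spread rows unevenly across nodes.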
2) Understanding Parallel Environment Variables
There is a set of variables that show up in the Administrator tool and affect various parts of the engine and the behaviour of parallel jobs. The ones to focus on are those that appear in the Advanced Parallel Developers Guide relating to parallel jobs. Pay attention to anything about sorting, buffering and debugging.
Exercise: I ran a lot of test jobs around the two SORT_INSERTION variables, examining the dump score after each run. Look at stages that need sorted data, such as join and remove duplicates, and at how the different SORT_INSERTION settings affect the processes in the job. This was one of the areas where studying for certification made me a better developer, as an understanding of sorting options is important for high volume jobs. Create a job with a lookup leading to a transformer. Switch COMBINE_OPERATORS on and off and examine the job score message to see the effect.
3) Database stages
There are four enterprise database stages: DB2, Oracle, Teradata and ODBC. There are a lot of questions on these stages. You don't need to get all these databases running in order to learn how they work. Skim through the PDF that comes with each stage. It helps, if you can find it, to discover how each stage is configured to match the partitioning of the database table it reads or writes.
Exercise: Add each one to a parallel job and, if you don't have that database installed, do some pretend configuration of the stage to see the different settings for insert, upsert, load, import etc. The Test Objectives page states that non-Enterprise RDBMS stages such as Sybase, DRS and Informix are in scope, but I didn't see any questions on them in my exam.
4) Parallel Stages
It is worth memorising the differences between join, lookup and merge. Refresh your skills with the most common stages: transformer, change data capture, remove duplicates, sort, copy, filter. I doubt you will see any questions on the development stages (but don't hold me to that!)
Exercise: whatever you want. Have a look at the "optional options" on stages, those properties that do not appear in the default settings but can be added by clicking on the options folder on the stage properties tab. I do lots of testing with the row generator. I would skip the Modify stage - too painful to use even for training. Write out filesets, datasets and lookup filesets and go into the node folders and temp folders to find out how each one is written. File names, data formats - that type of thing.
5) Server Job Developers Guide
A lot of parallel developers are not aware of the sections in this guide that apply to all DataStage developers. The main one covers the command line interfaces dsjob, dssearch and dsadmin. Once again this is a section of learning that can make you a better developer. I think this is the only section from this guide you need to know for certification.
Exercise: run a bunch of different dsjob, dssearch and dsadmin commands.
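One way to work through the dsjob part of this exercise is to script a handful of calls so you can rerun them against different jobs. The sketch below is a small Python wrapper of my own. The project and job names are hypothetical, and I have only included dsjob options I am confident about, so check each one against the guide or the tool's own usage message on your version. It does not cover dssearch or dsadmin.

```python
import shutil
import subprocess


def dsjob_commands(project, job):
    """Build a few dsjob command lines for one project and job."""
    return [
        ["dsjob", "-lprojects"],                        # list projects
        ["dsjob", "-ljobs", project],                   # list jobs in a project
        ["dsjob", "-jobinfo", project, job],            # status of one job
        ["dsjob", "-run", "-jobstatus", project, job],  # run and wait for status
        ["dsjob", "-logsum", project, job],             # summary of the job log
    ]


def run_all(project, job):
    """Run each command, or just print them if dsjob is not on the PATH."""
    have_dsjob = shutil.which("dsjob") is not None
    for cmd in dsjob_commands(project, job):
        if not have_dsjob:
            print("would run:", " ".join(cmd))
            continue
        result = subprocess.run(cmd, capture_output=True, text=True)
        print(" ".join(cmd), "-> exit code", result.returncode)
        print(result.stdout)


if __name__ == "__main__":
    # Hypothetical names; substitute a project and job from your own server.
    run_all("dstage1", "PartitionTest")
```

Pay attention to the exit code from the run with -jobstatus, since how dsjob reports job status back to the caller is exactly the kind of detail worth knowing.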
6) Installation and Configuration
Skim through the installation guide. Read the section on parallel installation. Read the pre-requisites for each type of install such as users and groups, the compiler, project locations, kernel settings for each platform. Make sure you know what goes into the dsenv file.
7) Config Files
You can easily spend a lot of time developing parallel jobs without having anything to do with config files. Often they are set up at the start of the project and are never touched again. There may be a lot of questions on config files and pools.
Exercise: pretty simple. Run the same job against different config file settings. Try some sort pooling and watch the dump score for results.
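To make it easy to produce config files with different node counts, here is a small Python sketch that writes one out in the usual node / fastname / pools / resource layout. The host name and directory paths are made up, the function name is my own, and the optional "sort" scratchdisk pool reflects my understanding of how sort pooling is set up, so compare the output with the config file that shipped with your install before trusting it.

```python
def make_config(nodes, fastname, disk_root, scratch_root, sort_scratch_root=None):
    """Return the text of a parallel config file with the given node count.

    If sort_scratch_root is given, each node gets an extra scratchdisk
    resource in a pool named "sort" (an assumption to verify on your system).
    """
    lines = ["{"]
    for i in range(1, nodes + 1):
        name = f"node{i}"
        lines.append(f'\tnode "{name}"')
        lines.append("\t{")
        lines.append(f'\t\tfastname "{fastname}"')
        lines.append('\t\tpools ""')
        lines.append(f'\t\tresource disk "{disk_root}/{name}" {{pools ""}}')
        lines.append(
            f'\t\tresource scratchdisk "{scratch_root}/{name}" {{pools ""}}'
        )
        if sort_scratch_root is not None:
            lines.append(
                f'\t\tresource scratchdisk "{sort_scratch_root}/{name}" '
                '{pools "sort"}'
            )
        lines.append("\t}")
    lines.append("}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical host and paths; the directories must exist on the server.
    text = make_config(2, "etlhost", "/data/ds", "/scratch/ds", "/scratch/sort")
    with open("two_node_sort.apt", "w") as f:
        f.write(text)
    print(text)
```

Point APT_CONFIG_FILE at each generated file in turn, rerun the job and compare the dump scores between a one node, two node and four node run.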
8) Miscellaneous
I read through these sections but had no easy way to do any exercises or testing, and as it turned out they were barely mentioned in the exam: installation architecture on USS (Installation Guide and Parallel Job Developers Guide) and NLS (very quickly skim the NLS Guide PDF available through your client documentation).