Interview/44 Important Practical Interview Questions - HTML

This document provides answers to common interview questions about using Informatica to handle duplicate records in source data and load unique records into target systems. It discusses using the DISTINCT option in a source qualifier for relational database sources, and using a Sorter transformation with the DISTINCT option for flat file sources. It also covers using an Aggregator transformation to group on duplicate fields, and using dynamic lookup cache with insert/update to eliminate duplicates. Other questions cover loading data to multiple targets based on conditions, using a Normalizer to reshape data, and best practices for source qualifier configuration.

Uploaded by: abburu
Copyright: © Attribution Non-Commercial (BY-NC)
http://www.dwbiconcepts.com/tutorial/10-interview/44-important-practical-interview-questions.html

Best Informatica Interview Questions & Answers

Learn the answers to some critical questions commonly asked during
Informatica interviews.

Deleting duplicate row using Informatica

Q1. Suppose we have duplicate records in the source system and we
want to load only the unique records into the target system, eliminating
the duplicate rows. What will be the approach?

Ans.
Let us assume that the source system is a relational database and the
source table has duplicate rows. To eliminate the duplicate records,
we can check the Distinct option of the Source Qualifier of the source
table and load the target accordingly.
Image: Source Qualifier Transformation DISTINCT clause

But what if the source is a flat file? How can we remove the duplicates from flat file
source? Read On...

Deleting duplicate row for FLAT FILE sources

Now suppose the source system is a flat file. Here, in the Source
Qualifier, we will not be able to select the Distinct option because it
is disabled for flat file sources. Hence the next approach is to use a
Sorter Transformation and check its Distinct option.
When we select the Distinct option, all the columns will be selected as
keys, in ascending order by default.
Image: Sorter Transformation DISTINCT clause

Deleting Duplicate Record Using Informatica Aggregator

Another way to handle duplicate records in a source batch run is to
use an Aggregator Transformation and check the Group By checkbox on
the ports that carry the duplicated data. Here we have the flexibility
to keep either the last or the first of the duplicate records.
Apart from that, using a Dynamic Lookup Cache on the target table,
associating the input ports with the lookup ports and checking the
Insert Else Update option will also eliminate the duplicate records in
the source, so that only unique records are loaded into the target.
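
The group-by idea can be illustrated outside Informatica with a minimal Python sketch. The function name dedupe, the keep parameter and the sample field names are hypothetical and only mimic the behaviour described above: one output row per distinct key, keeping either the first or the last duplicate.

```python
def dedupe(rows, key_fields, keep="last"):
    """Return one row per distinct key, keeping the first or last duplicate.

    rows       -- list of dicts (one dict per source record)
    key_fields -- names of the fields that play the role of the Group By ports
    keep       -- "first" or "last" (hypothetical switch for this sketch)
    """
    chosen = {}
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        # For "last" every later duplicate overwrites the stored row;
        # for "first" only the first row seen for a key is stored.
        if keep == "last" or key not in chosen:
            chosen[key] = row
    return list(chosen.values())


if __name__ == "__main__":
    source = [
        {"id": 1, "city": "Pune"},
        {"id": 2, "city": "Delhi"},
        {"id": 1, "city": "Mumbai"},
    ]
    print(dedupe(source, ["id"], keep="first"))
    print(dedupe(source, ["id"], keep="last"))
```

The output order follows the first appearance of each key, which is a property of this sketch and not a statement about the order in which an Aggregator emits rows.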

For more details, see Dynamic Lookup Cache.

Loading Multiple Target Tables Based on Conditions

Q2. Suppose we have some serial numbers in a flat file source. We
want to load the serial numbers into two target files, one containing
the EVEN serial numbers and the other containing the ODD ones.

Ans.
After the Source Qualifier, place a Router Transformation. Create
two groups, namely EVEN and ODD, with filter conditions
MOD(SERIAL_NO,2)=0 and MOD(SERIAL_NO,2)=1 respectively.
Then output the two groups into two flat file targets.
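
As a rough illustration of how the two group conditions split the rows, here is a small Python sketch. The names GROUPS and route are hypothetical; each group condition is tested independently for every value, which mirrors the way Router groups are evaluated.

```python
# Group name -> filter condition, mirroring MOD(SERIAL_NO,2)=0 / =1.
GROUPS = {
    "EVEN": lambda serial_no: serial_no % 2 == 0,
    "ODD": lambda serial_no: serial_no % 2 == 1,
}


def route(values, groups):
    """Send each value to every group whose condition it satisfies."""
    outputs = {name: [] for name in groups}
    for value in values:
        for name, condition in groups.items():
            if condition(value):
                outputs[name].append(value)
    return outputs


if __name__ == "__main__":
    print(route([1, 2, 3, 4, 5], GROUPS))
```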

Image: Router Transformation Groups Tab

Normalizer Related Questions

Q3. Suppose in our Source Table we have data as given below:

Student Name | Maths | Life Science | Physical Science
Sam | 100 | 70 | 80
John | 75 | 100 | 85
Tom | 80 | 100 | 85

We want to load our Target Table as:

Student Name | Subject Name | Marks
Sam | Maths | 100
Sam | Life Science | 70
Sam | Physical Science | 80
John | Maths | 75
John | Life Science | 100
John | Physical Science | 85
Tom | Maths | 80
Tom | Life Science | 100
Tom | Physical Science | 85

Describe your approach.

Ans.
Here, to convert the columns to rows, we have to use the Normalizer Transformation
followed by an Expression Transformation to decode the column taken into
consideration. For more details on how the mapping is performed, please see Working
with Normalizer.
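
To make the column-to-row conversion concrete, the following Python sketch reproduces the target layout from the source layout of Q3. It is only an illustration of the reshaping, not of how the Normalizer is configured; the names SUBJECT_COLUMNS and normalize_marks are hypothetical.

```python
# Multiple-occurring columns that become one output row each.
SUBJECT_COLUMNS = ["Maths", "Life Science", "Physical Science"]


def normalize_marks(rows):
    """Turn one wide row per student into one narrow row per subject."""
    output = []
    for row in rows:
        for subject in SUBJECT_COLUMNS:
            output.append(
                {
                    "Student Name": row["Student Name"],
                    "Subject Name": subject,
                    "Marks": row[subject],
                }
            )
    return output


if __name__ == "__main__":
    source = [
        {"Student Name": "Sam", "Maths": 100, "Life Science": 70, "Physical Science": 80},
        {"Student Name": "John", "Maths": 75, "Life Science": 100, "Physical Science": 85},
    ]
    for record in normalize_marks(source):
        print(record)
```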

Q4. Name the transformations which convert one row to many rows, i.e.
increase the input-to-output row count. Also, what is the name of the
reverse transformation?

Ans.
Normalizer as well as Router transformations are the Active
transformations which can produce more output rows than input rows.

Aggregator Transformation is the active transformation that performs
the reverse action.

Q5. Suppose we have a source table and we want to load three target
tables based on source rows such that the first row moves to the first
target table, the second row to the second target table, the third row
to the third target table, the fourth row again to the first target
table, and so on. Describe your approach.

Ans.
We can clearly understand that we need a Router transformation to
route or filter source data to the three target tables. Now the question
is what the filter conditions will be. First of all we need an Expression
Transformation where we have all the source table columns and,
along with that, another i/o port, say SEQ_NUM, which gets a
sequence number for each source row from the NEXTVAL port of a
Sequence Generator with start value 0 and increment by 1. Now the
filter conditions for the three router groups will be:

MOD(SEQ_NUM,3)=1 connected to 1st target table,
MOD(SEQ_NUM,3)=2 connected to 2nd target table,
MOD(SEQ_NUM,3)=0 connected to 3rd target table.
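
The round-robin effect of these three conditions can be checked with a short Python sketch. It assumes the sequence numbers handed to the rows are 1, 2, 3, ... (so the first row has remainder 1); the function name round_robin is hypothetical.

```python
from itertools import count


def round_robin(rows, targets=3):
    """Distribute rows over `targets` lists using MOD(SEQ_NUM, targets)."""
    buckets = [[] for _ in range(targets)]
    seq = count(1)  # assumed sequence numbers: 1, 2, 3, ...
    for row in rows:
        seq_num = next(seq)
        remainder = seq_num % targets
        # remainder 1 -> first target, 2 -> second, 0 -> last target
        index = (remainder - 1) % targets
        buckets[index].append(row)
    return buckets


if __name__ == "__main__":
    first, second, third = round_robin(["r1", "r2", "r3", "r4", "r5"])
    print(first, second, third)
```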

Image: Router Transformation Groups Tab

Loading Multiple Flat Files using one mapping

Q6. Suppose we have ten source flat files of the same structure. How
can we load all the files into the target database in a single batch run
using a single mapping?

Ans.
After we create a mapping to load data into the target database from
flat files, we move on to the session properties of the Source Qualifier.
To load a set of source files we need to create a file, say final.txt,
containing the source flat file names (ten files in our case) and set the
Source filetype option to Indirect. Next, point to this flat file final.txt,
fully qualified, through Source file directory and Source filename.
Image: Session Property Flat File
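
Conceptually, an indirect source is just a list file whose lines name the real data files. The Python sketch below imitates that idea so the mechanism is easy to see. It is not how the Integration Service is implemented, and resolving relative names against the list file's own directory is an assumption made for this sketch; the function name read_indirect is hypothetical.

```python
from pathlib import Path


def read_indirect(list_file):
    """Read every data file named in list_file and return all their lines."""
    list_path = Path(list_file)
    base_dir = list_path.parent  # assumption: relative names live next to the list
    rows = []
    for line in list_path.read_text().splitlines():
        name = line.strip()
        if not name:
            continue  # ignore blank lines in the list file
        rows.extend((base_dir / name).read_text().splitlines())
    return rows


if __name__ == "__main__":
    import tempfile

    folder = Path(tempfile.mkdtemp())
    (folder / "file1.txt").write_text("1,Sam\n2,John\n")
    (folder / "file2.txt").write_text("3,Tom\n")
    (folder / "final.txt").write_text("file1.txt\nfile2.txt\n")
    print(read_indirect(folder / "final.txt"))
```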

Q7. How can we implement an aggregation operation without using an
Aggregator Transformation in Informatica?

Ans.
We will use a very basic property of the Expression Transformation:
at any time we can access the previous row's data as well as the
currently processed row's data in an expression transformation. What
we need is a simple Sorter, Expression and Filter transformation to
achieve aggregation at the Informatica level.
For a detailed understanding, see Aggregation without Aggregator.
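
The sort, compare-with-previous-row and filter steps can be sketched in Python as follows. This is only an analogy for the Sorter, Expression and Filter chain, using SUM as the aggregate; the names sum_by_key and _NO_KEY are hypothetical.

```python
_NO_KEY = object()  # marker meaning "no previous row yet"


def sum_by_key(rows):
    """Aggregate (key, amount) pairs without a group-by construct."""
    # Sorter: bring rows with the same key next to each other.
    ordered = sorted(rows, key=lambda r: r[0])
    results = []
    prev_key = _NO_KEY
    running_total = 0
    for key, amount in ordered:
        # Expression: compare the current key with the previous row's key.
        if key != prev_key and prev_key is not _NO_KEY:
            # Filter: only the completed total of a group is passed on.
            results.append((prev_key, running_total))
            running_total = 0
        running_total += amount
        prev_key = key
    if prev_key is not _NO_KEY:
        results.append((prev_key, running_total))
    return results


if __name__ == "__main__":
    print(sum_by_key([("b", 2), ("a", 1), ("a", 3), ("b", 5)]))
```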

Q8. Suppose in our Source Table we have data as given below:

Student Name | Subject Name | Marks
Sam | Maths | 100
Tom | Maths | 80
Sam | Physical Science | 80
John | Maths | 75
Sam | Life Science | 70
John | Life Science | 100
John | Physical Science | 85
Tom | Life Science | 100
Tom | Physical Science | 85

We want to load our Target Table as:

Student Name | Maths | Life Science | Physical Science
Sam | 100 | 70 | 80
John | 75 | 100 | 85
Tom | 80 | 100 | 85

Describe your approach.

Ans.
Here our scenario is to convert many rows to one row, and the transformation which will
help us to achieve this is the Aggregator. Our mapping will look like this:

Image: Mapping using Sorter and Aggregator

We will sort the source data based on STUDENT_NAME ascending followed by
SUBJECT ascending.

Image: Sorter Transformation

Now, with STUDENT_NAME in the GROUP BY clause, the output subject
columns are populated as:
MATHS: MAX(MARKS, SUBJECT='Maths')
LIFE_SC: MAX(MARKS, SUBJECT='Life Science')
PHY_SC: MAX(MARKS, SUBJECT='Physical Science')

Image: Aggregator Transformation
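
The three conditional MAX expressions can be read as "for each student, take the maximum MARKS among the rows whose SUBJECT matches". A minimal Python sketch of that logic follows; the names COLUMNS and pivot_marks are hypothetical, and a subject with no matching row is left as None.

```python
# Output column -> subject value it filters on.
COLUMNS = {
    "MATHS": "Maths",
    "LIFE_SC": "Life Science",
    "PHY_SC": "Physical Science",
}


def pivot_marks(rows):
    """rows: iterable of (student_name, subject, marks) tuples."""
    result = {}
    for name, subject, marks in rows:
        record = result.setdefault(name, {col: None for col in COLUMNS})
        for col, wanted_subject in COLUMNS.items():
            if subject == wanted_subject:
                current = record[col]
                # Conditional MAX: only rows of the wanted subject take part.
                record[col] = marks if current is None else max(current, marks)
    return result


if __name__ == "__main__":
    source = [
        ("Sam", "Maths", 100),
        ("Tom", "Maths", 80),
        ("Sam", "Physical Science", 80),
        ("Sam", "Life Science", 70),
    ]
    print(pivot_marks(source))
```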

Revisiting Source Qualifier Transformation

Q9. What is a Source Qualifier? What are the tasks we can perform
using an SQ, and why is it an ACTIVE transformation?

Ans.
A Source Qualifier is an Active and Connected Informatica
transformation that reads the rows from a relational database or flat
file source.

We can configure the SQ to join [both INNER as well as OUTER JOIN]
data originating from the same source database.
We can use a source filter to reduce the number of rows the
Integration Service queries.
We can specify a number for sorted ports and the Integration Service
adds an ORDER BY clause to the default SQL query.
We can choose the Select Distinct option for relational databases and
the Integration Service adds a SELECT DISTINCT clause to the default
SQL query.
Also, we can write a Custom/User Defined SQL query which will
override the default query in the SQ by changing the default settings of
the transformation properties.
Also, we have the option to write Pre as well as Post SQL statements
to be executed before and after the SQ query in the source database.

Since the transformation provides the Select Distinct property, the
Integration Service can add a SELECT DISTINCT clause to the default
SQL query, which in turn affects the number of rows returned by the
database to the Integration Service. Hence it is an Active
transformation.

Q10. What happens to a mapping if we alter the datatypes between
Source and its corresponding Source Qualifier?

Ans.
The Source Qualifier transformation displays the transformation
datatypes. The transformation datatypes determine how the source
database binds data when the Integration Service reads it.
Now if we alter the datatypes in the Source Qualifier transformation or
the datatypes in the source definition and Source Qualifier
transformation do not match, the Designer marks the mapping as
invalid when we save it.

Q11. Suppose we have used the Select Distinct and the Number Of
Sorted Ports property in the SQ and then we add Custom SQL Query.
Explain what will happen.

Ans.
Whenever we add Custom SQL or SQL override query it overrides the
User-Defined Join, Source Filter, Number of Sorted Ports, and Select
Distinct settings in the Source Qualifier transformation. Hence only the
user defined SQL Query will be fired in the database and all the other
options will be ignored.

Q12. Describe the situations where we will use the Source Filter,
Select Distinct and Number Of Sorted Ports properties of Source
Qualifier transformation.

Ans.
Source Filter option is used basically to reduce the number of rows
the Integration Service queries so as to improve performance.
Select Distinct option is used when we want the Integration Service
to select unique values from a source, filtering out unnecessary data
earlier in the data flow, which might improve performance.
Number Of Sorted Ports option is used when we want the source
data to be in a sorted fashion so as to use the same in some following
transformations like Aggregator or Joiner, those when configured for
sorted input will improve the performance.

Q13. What will happen if the SELECT list COLUMNS in the Custom
override SQL Query and the OUTPUT PORTS order in SQ transformation
do not match?

Ans.
A mismatch between the order of the selected columns and the order of
the connected transformation output ports may result in session
failure.

Q14. What happens if we include the keyword WHERE in the Source
Filter property of the SQ transformation, say WHERE
CUSTOMERS.CUSTOMER_ID > 1000?

Ans.
We use the source filter to reduce the number of source records. If we
include the string WHERE in the source filter, the Integration Service
fails the session.

Q15. Describe the scenarios where we go for a Joiner transformation
instead of a Source Qualifier transformation.

Ans.
We use the Joiner transformation to join source data from
heterogeneous sources as well as to join flat files.
Use the Joiner transformation when we need to join the following types
of sources:
Join data from different Relational Databases.
Join data from different Flat Files.
Join relational sources and flat files.

Q16. What is the maximum number we can use in Number Of Sorted
Ports for a Sybase source system?

Ans.
Sybase supports a maximum of 16 columns in an ORDER BY clause. So
if the source is Sybase, do not sort more than 16 columns.

Q17. Suppose we have two Source Qualifier transformations SQ1 and
SQ2 connected to Target tables TGT1 and TGT2 respectively. How do
you ensure TGT2 is loaded after TGT1?

Ans.
If we have multiple Source Qualifier transformations connected to
multiple targets, we can designate the order in which the Integration
Service loads data into the targets.
In the Mapping Designer, we need to configure the Target Load Plan
based on the Source Qualifier transformations in a mapping to specify
the required loading order.

Image: Target Load Plan

Image: Target Load Plan Ordering

Q18. Suppose we have a Source Qualifier transformation that
populates two target tables. How do you ensure TGT2 is loaded after
TGT1?

Ans.
In the Workflow Manager, we can configure Constraint based load
ordering for a session. The Integration Service orders the target load
on a row-by-row basis. For every row generated by an active source,
the Integration Service loads the corresponding transformed row first
to the primary key table, then to the foreign key table.
Hence if we have one Source Qualifier transformation that provides
data for multiple target tables having primary and foreign key
relationships, we will go for Constraint based load ordering.

Image: Constraint based loading


Revisiting Filter Transformation

Q19. What is a Filter Transformation and why is it an Active one?

Ans.
A Filter transformation is an Active and Connected transformation
that can filter rows in a mapping.
Only the rows that meet the Filter Condition pass through the Filter
transformation to the next transformation in the pipeline. TRUE and
FALSE are the implicit return values from any filter condition we set. If
the filter condition evaluates to NULL, the row is treated as FALSE.
The numeric equivalent of FALSE is zero (0) and any non-zero value is
the equivalent of TRUE.

As an ACTIVE transformation, the Filter transformation may change
the number of rows passed through it. A filter condition returns TRUE
or FALSE for each row that passes through the transformation,
depending on whether a row meets the specified condition. Only rows
that return TRUE pass through this transformation. Discarded rows do
not appear in the session log or reject files.
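
The truth rules stated above (NULL counts as FALSE, zero is FALSE, any non-zero value is TRUE) can be summarised in a few lines of Python. This is a sketch of the rules only; the names is_true and apply_filter are hypothetical and Python's None stands in for NULL.

```python
def is_true(condition_result):
    """Interpret a filter condition result using the rules described above."""
    if condition_result is None:  # NULL is treated as FALSE
        return False
    return condition_result != 0  # zero is FALSE, any non-zero value is TRUE


def apply_filter(rows, condition):
    """Keep only rows for which the condition evaluates to TRUE."""
    return [row for row in rows if is_true(condition(row))]


if __name__ == "__main__":
    rows = [{"SAL": 1200}, {"SAL": None}, {"SAL": 0}, {"SAL": -5}]
    # The condition here simply returns the SAL value itself.
    print(apply_filter(rows, lambda row: row["SAL"]))
```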

Q20. What is the difference between the Source Qualifier
transformation's Source Filter and the Filter transformation?

Ans.

SQ Source Filter | Filter Transformation
Source Qualifier transformation filters rows when read from a source. | Filter transformation filters rows from within a mapping.
Source Qualifier transformation can only filter rows from Relational Sources. | Filter transformation filters rows coming from any type of source system at the mapping level.
Source Qualifier limits the row set extracted from a source. | Filter transformation limits the row set sent to a target.
Source Qualifier reduces the number of rows used throughout the mapping and hence it provides better performance. | To maximize session performance, include the Filter transformation as close to the sources in the mapping as possible to filter out unwanted data early in the flow of data from sources to targets.
The filter condition in the Source Qualifier transformation only uses standard SQL as it runs in the database. | Filter Transformation can define a condition using any statement or transformation function that returns either a TRUE or FALSE value.

Revisiting Joiner Transformation

Q21. What is a Joiner Transformation and why is it an Active one?

Ans.
A Joiner is an Active and Connected transformation used to join
source data from the same source system or from two related
heterogeneous sources residing in different locations or file systems.
The Joiner transformation joins sources with at least one matching
column. The Joiner transformation uses a condition that matches one
or more pairs of columns between the two sources.
The two input pipelines include a master pipeline and a detail pipeline
or a master and a detail branch. The master pipeline ends at the Joiner
transformation, while the detail pipeline continues to the target.

In the Joiner transformation, we must configure the transformation
properties, namely Join Condition, Join Type and the Sorted Input
option, to improve Integration Service performance.
The join condition contains ports from both input sources that must
match for the Integration Service to join two rows. Depending on the
type of join selected, the Integration Service either adds the row to
the result set or discards the row.
The Joiner transformation produces result sets based on the join type,
condition, and input data sources. Hence it is an Active transformation.

Q22. State the limitations where we cannot use Joiner in the mapping
pipeline.

Ans.
The Joiner transformation accepts input from most transformations.
However, the following limitations apply:
Joiner transformation cannot be used when either input pipeline
contains an Update Strategy transformation.
Joiner transformation cannot be used if we connect a Sequence
Generator transformation directly before the Joiner transformation.

Q23. Out of the two input pipelines of a joiner, which one will you set
as the master pipeline?

Ans.
During a session run, the Integration Service compares each row of
the master source against the detail source.
The master and detail sources need to be configured for optimal
performance.

To improve performance for an Unsorted Joiner transformation, use
the source with fewer rows as the master source. The fewer unique
rows in the master, the fewer iterations of the join comparison occur,
which speeds the join process.
When the Integration Service processes an unsorted Joiner
transformation, it reads all master rows before it reads the detail rows.
The Integration Service blocks the detail source while it caches rows
from the master source. Once the Integration Service reads and
caches all master rows, it unblocks the detail source and reads the
detail rows.
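
Why the smaller source should be the master becomes clearer with a small Python sketch of the "cache all master rows, then stream the detail rows" behaviour described above. It models a normal join on a single key and is an analogy only, not the Integration Service's actual cache; the function name unsorted_join is hypothetical.

```python
from collections import defaultdict


def unsorted_join(master_rows, detail_rows, key):
    """Cache every master row by key, then stream the detail rows past the cache."""
    cache = defaultdict(list)
    for master in master_rows:  # all master rows are read and cached first
        cache[master[key]].append(master)
    for detail in detail_rows:  # detail rows are read only afterwards
        for master in cache.get(detail[key], []):
            yield {**master, **detail}


if __name__ == "__main__":
    departments = [{"DEPTNO": 10, "DNAME": "SALES"}]  # fewer rows: master
    employees = [
        {"DEPTNO": 10, "ENAME": "Sam"},
        {"DEPTNO": 20, "ENAME": "Tom"},
    ]
    print(list(unsorted_join(departments, employees, "DEPTNO")))
```

The cache holds one entry per master row, so the memory needed grows with the master source, which is why the source with fewer rows is the better choice for it.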

To improve performance for a Sorted Joiner transformation, use the
source with fewer duplicate key values as the master source.
When the Integration Service processes a sorted Joiner transformation,
it blocks data based on the mapping configuration and it stores fewer
rows in the cache, increasing performance. Blocking logic is possible if
master and detail input to the Joiner transformation originate from
different sources. Otherwise, it does not use blocking logic. Instead,
it stores more rows in the cache.

Q24. What are the different types of Joins available in Joiner
Transformation?

Ans.
In SQL, a join is a relational operator that combines data from multiple
tables into a single result set. The Joiner transformation is similar to an
SQL join except that data can originate from different types of sources.

The Joiner transformation supports the following types of joins:
Normal
Master Outer
Detail Outer
Full Outer

Image: Join Type property of Joiner Transformation

Note: A normal or master outer join performs faster than a full outer or detail outer
join.

Q25. Define the various Join Types of Joiner Transformation.

Ans.
In a normal join, the Integration Service discards all rows of data
from the master and detail source that do not match, based on the join
condition.
A master outer join keeps all rows of data from the detail source and
the matching rows from the master source. It discards the unmatched
rows from the master source.
A detail outer join keeps all rows of data from the master source and
the matching rows from the detail source. It discards the unmatched
rows from the detail source.
A full outer join keeps all rows of data from both the master and
detail sources.
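
The four definitions are easy to mix up, so here is a compact Python sketch that returns (master_row, detail_row) pairs for each join type, with None marking the missing side. It is a naive nested-loop illustration under the stated definitions; the function name join and the join-type strings are hypothetical. Keys that are None never match, in line with the Joiner's treatment of NULL values.

```python
def join(master, detail, key, join_type="normal"):
    """Return (master_row, detail_row) pairs; None marks a missing side."""
    pairs = []
    matched_master = set()
    for d in detail:
        hits = [
            i for i, m in enumerate(master)
            if d[key] is not None and m[key] == d[key]
        ]
        for i in hits:
            matched_master.add(i)
            pairs.append((master[i], d))
        # Master outer and full outer keep unmatched DETAIL rows.
        if not hits and join_type in ("master_outer", "full_outer"):
            pairs.append((None, d))
    # Detail outer and full outer keep unmatched MASTER rows.
    if join_type in ("detail_outer", "full_outer"):
        for i, m in enumerate(master):
            if i not in matched_master:
                pairs.append((m, None))
    return pairs


if __name__ == "__main__":
    master = [{"k": 1}, {"k": 2}]
    detail = [{"k": 1}, {"k": 3}]
    for kind in ("normal", "master_outer", "detail_outer", "full_outer"):
        print(kind, join(master, detail, "k", kind))
```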

Q26. Describe the impact of the number of join conditions and the join
order in a Joiner Transformation.

Ans.
We can define one or more conditions based on equality between
the specified master and detail sources.
Both ports in a condition must have the same datatype. If we need
to use two ports in the join condition with non-matching datatypes, we
must convert the datatypes so that they match. The Designer validates
datatypes in a join condition.
Additional ports in the join condition increase the time necessary
to join two sources.

The order of the ports in the join condition can impact the performance
of the Joiner transformation. If we use multiple ports in the join
condition, the Integration Service compares the ports in the order we
specified.

NOTE: Only the equality operator is available in the Joiner join condition.

Q27. How does the Joiner transformation treat NULL value matching?

Ans.
The Joiner transformation does not match null values.
For example, if both EMP_ID1 and EMP_ID2 contain a row with a null
value, the Integration Service does not consider them a match and
does not join the two rows.
To join rows with null values, replace null input with default values in
the Ports tab of the joiner, and then join on the default values.

Note: If a result set includes fields that do not contain data in either of
the sources, the Joiner transformation populates the empty fields with
null values. If we know that a field will return a NULL and we do not
want to insert NULLs in the target, set a default value on the Ports tab
for the corresponding port.

Q28. Suppose we configure Sorter transformations in the master and
detail pipelines with the following sorted ports in order: ITEM_NO,
ITEM_NAME, PRICE.
When we configure the join condition, what are the guidelines we need
to follow to maintain the sort order?

Ans.
If we have sorted both the master and detail pipelines in the order of
the ports ITEM_NO, ITEM_NAME and PRICE, we must ensure that:
We use ITEM_NO in the first join condition.
If we add a second join condition, we use ITEM_NAME.
If we want to use PRICE as a join condition apart from ITEM_NO, we
also use ITEM_NAME in the second join condition.
If we skip ITEM_NAME and join on ITEM_NO and PRICE, we will lose the
input sort order and the Integration Service fails the session.

Q29. Which transformations must not be placed between the sort
origin and the Joiner transformation, so that we do not lose the
input sort order?

Ans.
The best option is to place the Joiner transformation directly after the
sort origin to maintain sorted data.
However, do not place any of the following transformations between
the sort origin and the Joiner transformation:

Custom
Unsorted Aggregator
Normalizer
Rank
Union transformation
XML Parser transformation
XML Generator transformation
Mapplet [if it contains any one of the above mentioned
transformations]

Q30. Suppose we have the EMP table as our source. In the target we
want to view those employees whose salary is greater than or equal to
the average salary for their departments.
Describe your mapping approach.

Ans.
Our mapping will look like this:
Image: Mapping using Joiner

To start with the mapping we need the following transformations:

After the Source Qualifier of the EMP table, place a Sorter
Transformation. Sort based on the DEPTNO port.
Image: Sorter Ports Tab

Next we place a Sorted Aggregator Transformation. Here we will find the
AVERAGE SALARY for each (GROUP BY) DEPTNO.
When we perform this aggregation, we lose the data for individual employees. To
maintain employee data, we must pass a branch of the pipeline to the Aggregator
Transformation and pass a branch with the same sorted source data to the Joiner
transformation to maintain the original data. When we join both branches of the pipeline,
we join the aggregated data with the original data.
Image: Aggregator Ports Tab
Image: Aggregator Properties Tab

So next we need a Sorted Joiner Transformation to join the sorted aggregated data with
the original data, based on DEPTNO.
Here we will take the aggregated pipeline as the Master and the original data flow as
the Detail pipeline.
Image: Joiner Condition Tab
Image: Joiner Properties Tab

After that we need a Filter Transformation to filter out the employees having a salary less
than the average salary for their department.
Filter Condition: SAL>=AVG_SAL
Image: Filter Properties Tab

Lastly we have the Target table instance.
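
The same aggregate, join back and filter flow can be traced in a few lines of Python. This is an analogy for the mapping, not generated from it; the function name at_or_above_dept_average and the sample rows are hypothetical, while the field names DEPTNO, SAL and AVG_SAL follow the answer above.

```python
from collections import defaultdict


def at_or_above_dept_average(employees):
    """Keep employees whose SAL is >= the average SAL of their DEPTNO."""
    # Aggregator branch: average salary per department.
    totals = defaultdict(lambda: [0, 0])  # DEPTNO -> [sum of SAL, row count]
    for emp in employees:
        bucket = totals[emp["DEPTNO"]]
        bucket[0] += emp["SAL"]
        bucket[1] += 1
    avg_sal = {dept: total / n for dept, (total, n) in totals.items()}

    # Joiner: attach the department average to every original row.
    joined = [{**emp, "AVG_SAL": avg_sal[emp["DEPTNO"]]} for emp in employees]

    # Filter: SAL >= AVG_SAL
    return [row for row in joined if row["SAL"] >= row["AVG_SAL"]]


if __name__ == "__main__":
    emp = [
        {"ENAME": "A", "DEPTNO": 10, "SAL": 100},
        {"ENAME": "B", "DEPTNO": 10, "SAL": 300},
        {"ENAME": "C", "DEPTNO": 20, "SAL": 500},
    ]
    for row in at_or_above_dept_average(emp):
        print(row)
```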

Revisiting Sequence Generator Transformation

Q31. What is a Sequence Generator Transformation?

Ans.
A Sequence Generator transformation is a Passive and Connected
transformation that generates numeric values.
It is used to create unique primary key values, replace missing primary
keys, or cycle through a sequential range of numbers.
This transformation by default contains only two output ports,
namely CURRVAL and NEXTVAL. We can neither edit or delete these
ports nor add ports to this transformation.
We can create approximately two billion unique numeric values, with
the widest range being from 1 to 2147483647.

Q32. Define the properties available in the Sequence Generator
transformation in brief.

Ans.

Sequence Generator Properties | Description
Start Value | Start value of the generated sequence that we want the Integration Service to use if we use the Cycle option. If we select Cycle, the Integration Service cycles back to this value when it reaches the end value. Default is 0.
Increment By | Difference between two consecutive values from the NEXTVAL port. Default is 1.
End Value | Maximum value generated by SeqGen. After reaching this value the session will fail if the sequence generator is not configured to cycle. Default is 2147483647.
Current Value | Current value of the sequence. Enter the value we want the Integration Service to use as the first value in the sequence. Default is 1.
Cycle | If selected, when the Integration Service reaches the configured end value for the sequence, it wraps around and starts the cycle again, beginning with the configured Start Value.
Number of Cached Values | Number of sequential values the Integration Service caches at a time. Default value for a standard Sequence Generator is 0. Default value for a reusable Sequence Generator is 1,000.
Reset | Restarts the sequence at the current value each time a session runs. This option is disabled for reusable Sequence Generator transformations.

Q33. Suppose we have a source table populating two target tables.
We connect the NEXTVAL port of the Sequence Generator to the
surrogate keys of both the target tables.
Will the surrogate keys in both the target tables be the same? If not,
how can we flow the same sequence values into both of them?

Ans.
When we connect the NEXTVAL output port of the Sequence
Generator directly to the surrogate key columns of the target tables,
the sequence numbers will not be the same.
A block of sequence numbers is sent to one target table's surrogate key
column. The second target receives a block of sequence numbers
from the Sequence Generator transformation only after the first target
table receives its block of sequence numbers.
Suppose we have 5 rows coming from the source, so the targets will
have the sequence values as TGT1 (1,2,3,4,5) and TGT2 (6,7,8,9,10).
[Assuming Start Value 0, Current Value 1 and Increment By 1.]

Now suppose the requirement is that we need to have the same
surrogate keys in both the targets.
Then the easiest way to handle the situation is to put an Expression
Transformation in between the Sequence Generator and the target
tables. The SeqGen will pass unique values to the expression
transformation, and then the rows are routed from the expression
transformation to the targets.
Image: Sequence Generator
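
The difference between the two wirings can be modelled with a shared counter in Python. This sketch only reproduces the outcome described in the answer (blocks 1-5 and 6-10 versus identical keys); it does not claim to mirror the Integration Service internals, and the function names direct_connection and via_expression are hypothetical.

```python
from itertools import count


def direct_connection(n_rows):
    """NEXTVAL wired straight to both targets: each target draws its own block."""
    nextval = count(1)
    tgt1 = [next(nextval) for _ in range(n_rows)]
    tgt2 = [next(nextval) for _ in range(n_rows)]
    return tgt1, tgt2


def via_expression(n_rows):
    """NEXTVAL drawn once per row, then the same value goes to both targets."""
    nextval = count(1)
    tgt1, tgt2 = [], []
    for _ in range(n_rows):
        key = next(nextval)  # one value per source row
        tgt1.append(key)
        tgt2.append(key)
    return tgt1, tgt2


if __name__ == "__main__":
    print(direct_connection(5))
    print(via_expression(5))
```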

Q34. Suppose we have 100 records coming from the source. Now for a
target column population we used a Sequence Generator.
Suppose the Current Value is 0 and the End Value of the Sequence
Generator is set to 80. What will happen?

Ans.
End Value is the maximum value the Sequence Generator will
generate. After it reaches the End Value, the session fails with the
following error message:
TT_11009 Sequence Generator Transformation: Overflow error.

Session failure can be avoided if the Sequence Generator is
configured to Cycle through the sequence, i.e. whenever the
Integration Service reaches the configured end value for the sequence,
it wraps around and starts the cycle again, beginning with the
configured Start Value.
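
The interaction of Current Value, End Value and Cycle can be sketched as a Python generator. The parameter names follow the property names in Q32, but the generator itself and the SequenceOverflow exception are hypothetical stand-ins for the transformation and its TT_11009 error; the sketch also assumes that the first value handed out equals the Current Value.

```python
class SequenceOverflow(Exception):
    """Stand-in for the overflow error raised when End Value is exceeded."""


def sequence(current_value=1, start_value=0, increment_by=1,
             end_value=2147483647, cycle=False):
    """Yield sequence values, failing or wrapping once End Value is passed."""
    value = current_value
    while True:
        if value > end_value:
            if not cycle:
                raise SequenceOverflow("end value reached and Cycle is off")
            value = start_value  # wrap around to the configured Start Value
        yield value
        value += increment_by


if __name__ == "__main__":
    gen = sequence(current_value=0, end_value=80)
    produced = 0
    try:
        for _ in range(100):  # 100 source rows
            next(gen)
            produced += 1
    except SequenceOverflow as err:
        print("failed after", produced, "rows:", err)
```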

Q35. What changes do we observe when we promote a non-reusable
Sequence Generator to a reusable one?
And what happens if we set the Number of Cached Values to 0 for a
reusable transformation?

Ans.
When we convert a non-reusable Sequence Generator to a reusable one,
we observe that the Number of Cached Values is set to 1000 by
default, and the Reset property is disabled.
When we try to set the Number of Cached Values property of a
reusable Sequence Generator to 0 in the Transformation Developer, we
encounter the following error message:
The number of cached values must be greater than zero for
reusable sequence transformation.

Normalizer, a native transformation in Informatica, can ease many
complex data transformation requirements. Learn how to effectively use
Normalizer here.

Using Normalizer Transformation

A Normalizer is an Active transformation that returns multiple rows
from a source row. It returns duplicate data for single-occurring source
columns. The Normalizer transformation parses multiple-occurring
columns from COBOL sources, relational tables, or other sources.
Normalizer can be used to transpose the data in columns to rows.
Normalizer effectively does the opposite of Aggregator!

Example of Data Transpose using Normalizer

Think of a relational table that stores four quarters of sales by store,
where we need to create a row for each sales occurrence. We can
configure a Normalizer transformation to return a separate row for
each quarter, as shown below.
The following source rows contain four quarters of sales by store:

Source Table
Store | Quarter1 | Quarter2 | Quarter3 | Quarter4
Store1 | 100 | 300 | 500 | 700
Store2 | 250 | 450 | 650 | 850

The Normalizer returns a row for each store and sales combination. It
also returns an index (GCID) that identifies the quarter number:

Target Table
Store | Sales | Quarter
Store1 | 100 | 1
Store1 | 300 | 2
Store1 | 500 | 3
Store1 | 700 | 4
Store2 | 250 | 1
Store2 | 450 | 2
Store2 | 650 | 3
Store2 | 850 | 4

How Informatica Normalizer Works

Suppose we have the following data in source:

Name | Month | Transportation | House Rent | Food
Sam | Jan | 200 | 1500 | 500
John | Jan | 300 | 1200 | 300
Tom | Jan | 300 | 1350 | 350
Sam | Feb | 300 | 1550 | 450
John | Feb | 350 | 1200 | 290
Tom | Feb | 350 | 1400 | 350

and we need to transform the source data and populate this as below
in the target table:

Name | Month | Expense Type | Expense
Sam | Jan | Transport | 200
Sam | Jan | House rent | 1500
Sam | Jan | Food | 500
John | Jan | Transport | 300
John | Jan | House rent | 1200
John | Jan | Food | 300
Tom | Jan | Transport | 300
Tom | Jan | House rent | 1350
Tom | Jan | Food | 350

and so on.
Now below is the screen-shot of a complete mapping which shows how
to achieve this result using Informatica PowerCenter Designer.
Image: Normalization Mapping Example 1

I will explain the mapping further below.

Setting Up Normalizer Transformation Property

First we need to set the number of occurrences property of the Expense head to 3 in the
Normalizer tab of the Normalizer transformation, since we have Food, House Rent and
Transportation.
This in turn creates the corresponding 3 input ports in the Ports tab, along with the
fields Individual and Month.
In the Ports tab of the Normalizer, the ports are created
automatically as configured in the Normalizer tab. Interestingly, we will
observe two new columns, namely GK_EXPENSEHEAD and
GCID_EXPENSEHEAD.
The GK field generates a sequence number starting from the value
defined in the Sequence field, while GCID holds the value of the occurrence
field, i.e. the column number of the input Expense head.
Here 1 is for FOOD, 2 is for HOUSERENT and 3 is for TRANSPORTATION.
The GCID therefore tells us which expense corresponds to which field while
converting columns to rows.
Below is the screenshot of the expression that handles this GCID
efficiently:

Image: Expression to handle GCID


This is how we will accomplish our task!
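
The column-to-row transpose described in this section can be sketched in plain Python. This is not Informatica code, only an illustration of the idea: the function name normalize and the dictionary-based rows are assumptions made for the example, the GCID numbering follows the 1 = FOOD, 2 = HOUSERENT, 3 = TRANSPORTATION order stated above, and the GK column is not modelled.

```python
from typing import Dict, Iterator, List

# Multiple-occurring columns; the 1-based position plays the role of the GCID.
# Order follows the article: 1 = FOOD, 2 = HOUSERENT, 3 = TRANSPORTATION.
EXPENSE_COLUMNS: List[str] = ["Food", "House Rent", "Transportation"]


def normalize(rows: List[Dict[str, object]]) -> Iterator[Dict[str, object]]:
    """Emit one output row per (source row, expense column) pair."""
    for row in rows:
        for gcid, column in enumerate(EXPENSE_COLUMNS, start=1):
            yield {
                "Name": row["Name"],      # single-occurring columns are repeated
                "Month": row["Month"],
                "GCID": gcid,             # identifies which input column this row came from
                "Expense Type": column,   # what the downstream Expression derives from GCID
                "Expense": row[column],
            }


source = [
    {"Name": "Sam", "Month": "Jan", "Transportation": 200, "House Rent": 1500, "Food": 500},
    {"Name": "John", "Month": "Jan", "Transportation": 300, "House Rent": 1200, "Food": 300},
]
target = list(normalize(source))
```

Each source row yields three target rows, one per expense column, with Name and Month duplicated on each of them.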

A Lookup cache does not change once built. But what if the data in the underlying
lookup table changes after the lookup cache is created? Is
there a way to keep the cache up-to-date even if the
underlying table changes?

Dynamic Lookup Cache

Let's think about this scenario. You are loading your target table
through a mapping. Inside the mapping you have a Lookup, and in the
Lookup you are actually looking up the same target table you are
loading. You may ask me, "So? What's the big deal? We all do it quite
often...". And yes, you are right.
There is no "big deal" because Informatica (generally) caches the
lookup table at the very beginning of the mapping, so whatever records
get inserted into the target table through the mapping will have no
effect on the Lookup cache. The lookup will still hold the previously
cached data, even if the underlying target table is changing.
But what if you want your Lookup cache to get updated as and when
the target table changes? What if you want your lookup cache to
always show the exact snapshot of the data in your target table at that
point in time? Clearly this requirement will not be fulfilled if you
use a static cache. You will need a dynamic cache to handle this.

But why on earth would someone need a dynamic cache?

To understand this, let's next understand a static cache scenario.

Static Cache Scenario

Let's suppose you run a retail business and maintain all your customer
information in a customer master table (RDBMS table). Every night, all
the customers from your customer master table are loaded into a
Customer Dimension table in your data warehouse. Your source
customer table is a transaction system table, probably in 3rd normal
form, and does not store history. Meaning, if a customer changes his
address, the old address is overwritten with the new address. But your
data warehouse table stores the history (maybe in the form of SCD
Type-II). There is a map that loads your data warehouse table from the
source table. Typically you do a Lookup on the target (static cache) and
check every incoming customer record to determine whether the
customer already exists in the target. If the customer does not
already exist in the target, you conclude the customer is new and
INSERT the record, whereas if the customer already exists, you
may want to update the target record with this new record (if the
record has changed). This is illustrated below. You don't need a dynamic
Lookup cache for this.

Image: A static Lookup Cache to determine if a source record is new or updatable

Dynamic Lookup Cache Scenario

Notice in the previous example I mentioned that your source table is
an RDBMS table. A keyed RDBMS table will not hold
any duplicate records. What if you had a flat file as source with many
duplicate records? Would the scenario be the same? No, see the
illustration below.

Image: A Scenario illustrating the use of dynamic lookup cache


Here are some more examples of when you may consider using a dynamic
lookup:

• Updating a master customer table with both new and updated customer
information, as shown above
• Loading data into a slowly changing dimension table and a fact table at the same
time. Remember, you typically look up the dimension while loading the fact, so
you load the dimension table before loading the fact table. But using a dynamic lookup,
you can load both simultaneously.
• Loading data from a file with many duplicate records and eliminating duplicate
records in the target by updating a duplicate row, i.e. keeping the most recent row or
the initial row
• Loading the same data from multiple sources using a single mapping. Just
consider the previous retail business example. If you have more than one shop
and Linda has visited two of your shops for the first time, the customer record for Linda
will come twice during the same load.

So, How does dynamic lookup work?

When the Integration Service reads a row from the source, it updates
the lookup cache by performing one of the following actions:

• Inserts the row into the cache: If the incoming row is not in the
cache, the Integration Service inserts the row in the cache based on
input ports or a generated Sequence-ID. The Integration Service flags
the row as insert.
• Updates the row in the cache: If the row exists in the cache, the
Integration Service updates the row in the cache based on the input
ports. The Integration Service flags the row as update.
• Makes no change to the cache: This happens when the row exists
in the cache and the lookup is configured to insert new
rows only; or the row is not in the cache and the lookup is configured to
update existing rows only; or
the row is in the cache but, based on the lookup condition, nothing
changes. The Integration Service flags the row as unchanged.

Notice that the Integration Service actually flags the rows based on the
above three conditions. This is a great thing because, if you know the
flag, you can reroute the row to achieve different logic. This
flag port is called "NewLookupRow", and using it the rows can be
routed for insert, update or to do nothing. You just need to use a
Router or Filter transformation followed by an Update Strategy.
The actual values that you can expect in the NewLookupRow port are:
0 Integration Service does not update or insert the row in the cache.
1 Integration Service inserts the row into the cache.
2 Integration Service updates the row in the cache.
When the Integration Service reads a row, it changes the lookup cache
depending on the results of the lookup query and the Lookup
transformation properties you define. It assigns the value 0, 1, or 2 to
the NewLookupRow port to indicate if it inserts or updates the row in
the cache, or makes no change.
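
The flagging logic above can be imitated with a small Python sketch. It is only a toy model, not the Integration Service itself: it assumes an initially empty cache, a single hypothetical key column customer_id, and a lookup configured to both insert new rows and update existing ones. The function names dynamic_lookup and route are made up for the illustration.

```python
from typing import Dict, List, Tuple

NO_CHANGE, INSERT, UPDATE = 0, 1, 2  # the NewLookupRow values


def dynamic_lookup(cache: Dict[str, dict], row: dict, key: str = "customer_id") -> int:
    """Apply one source row to the cache and return its NewLookupRow flag."""
    k = row[key]
    if k not in cache:
        cache[k] = dict(row)      # row not in cache: insert it
        return INSERT
    if cache[k] != row:
        cache[k] = dict(row)      # row in cache but values differ: update it
        return UPDATE
    return NO_CHANGE              # row in cache and nothing changed


def route(rows: List[dict]) -> Tuple[List[dict], List[dict]]:
    """Mimic the Router: split rows into an INSERT group and an UPDATE group."""
    cache: Dict[str, dict] = {}
    inserts: List[dict] = []
    updates: List[dict] = []
    for row in rows:
        flag = dynamic_lookup(cache, row)
        if flag == INSERT:
            inserts.append(row)
        elif flag == UPDATE:
            updates.append(row)
    return inserts, updates


rows = [
    {"customer_id": "C1", "name": "Linda", "city": "Pune"},
    {"customer_id": "C1", "name": "Linda", "city": "Pune"},    # exact duplicate
    {"customer_id": "C1", "name": "Linda", "city": "Mumbai"},  # changed address
    {"customer_id": "C2", "name": "Sam", "city": "Delhi"},
]
inserts, updates = route(rows)
```

Because the cache is updated as each row is read, the duplicate of C1 is recognised within the same run and dropped, which a static cache built before the load could not do.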

Example of Dynamic Lookup Implementation

OK, here is a mapping I designed to show a dynamic lookup
implementation. I have given a full screenshot of the mapping. Since
the screenshot is slightly large, I link it below.
Image: Dynamic Lookup Mapping

And here are the screenshots of the lookup. The Lookup
Ports tab comes first:
Image: Dynamic Lookup Ports Tab

And here is the Dynamic Lookup Properties Tab

If you check the mapping screenshot, you will see that I have used a Router to
route the INSERT group and UPDATE group. The Router screenshot is
also given below. New records are routed to the INSERT group and
existing records are routed to the UPDATE group.
Router Transformation Groups Tab

About the Sequence-ID

While using a dynamic lookup cache, we must associate each
lookup/output port with an input/output port or a sequence ID. The
Integration Service uses the data in the associated port to insert or
update rows in the lookup cache. The Designer associates the
input/output ports with the lookup/output ports used in the lookup
condition.
When we select Sequence-ID in the Associated Port column, the
Integration Service generates a sequence ID for each row it inserts into
the lookup cache.
When the Integration Service creates the dynamic lookup cache, it
tracks the range of values in the cache associated with any port using
a sequence ID. When inserting a new row of data into the cache, it
generates a key for the port by incrementing the
greatest existing sequence ID value by one.
When the Integration Service reaches the maximum number for a
generated sequence ID, it starts over at one and increments each
sequence ID by one until it reaches the smallest existing value minus
one. If the Integration Service runs out of unique sequence ID
numbers, the session fails.
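
The wrap-around behaviour just described can be modelled with a short Python class. This is a hedged sketch under simplifying assumptions: the class name SequenceIdGenerator is hypothetical, max_id is kept tiny for readability (the real maximum is far larger), and "the session fails" is represented by raising RuntimeError.

```python
from typing import Iterable


class SequenceIdGenerator:
    """Toy model of sequence-ID generation for a dynamic lookup cache."""

    def __init__(self, existing_ids: Iterable[int], max_id: int) -> None:
        ids = list(existing_ids)
        self.max_id = max_id
        self.smallest = min(ids) if ids else 1   # smallest value already in the cache
        self.current = max(ids) if ids else 0    # greatest value already in the cache
        self.wrapped = False

    def next_id(self) -> int:
        if not self.wrapped:
            if self.current < self.max_id:
                self.current += 1                # greatest existing value plus one
                return self.current
            # Maximum reached: start over at one.
            self.wrapped = True
            self.current = 0
        if self.current + 1 < self.smallest:     # stop at smallest existing value minus one
            self.current += 1
            return self.current
        raise RuntimeError("no unique sequence IDs left (the session would fail)")


gen = SequenceIdGenerator(existing_ids=[8, 9], max_id=10)
ids = [gen.next_id() for _ in range(8)]
```

With existing IDs 8 and 9 and a maximum of 10, the generator hands out 10, then wraps to 1 through 7; one more call raises the error because 8 is already in use.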
About the Dynamic Lookup Output Port

The lookup/output port output value depends on whether we choose to
output old or new values when the Integration Service updates a row:
Output old values on update: The Integration Service outputs the
value that existed in the cache before it updated the row.
Output new values on update: The Integration Service outputs the
updated value that it writes in the cache. The lookup/output port value
matches the input/output port value.
Note: We can configure whether to output old or new values using the Output
Old Value On Update transformation property.

Handling NULL in dynamic LookUp

If the input value is NULL and we select the Ignore Null Inputs for
Update property for the associated input port, the input value does not
equal the lookup value or the value out of the input/output port. When
you select the Ignore Null property, the lookup cache and the target
table might become unsynchronized if you pass null values to the
target. You must verify that you do not pass null values to the target.
When you update a dynamic lookup cache and target table, the source
data might contain some null values. The Integration Service can
handle the null values in the following ways:
Insert null values: The Integration Service uses null values from the
source and updates the lookup cache and target table using all values
from the source.
Ignore Null Inputs for Update property: The Integration Service
ignores the null values in the source and updates the lookup cache and
target table using only the not null values from the source.
If we know the source data contains null values, and we do not want
the Integration Service to update the lookup cache or target with null
values, then we need to check the Ignore Null property for the
corresponding lookup/output port.
When we choose to ignore NULLs, we must verify that we output the
same values to the target that the Integration Service writes to the
lookup cache. We can configure the mapping based on the value we
want the Integration Service to output from the lookup/output ports
when it updates a row in the cache, so that the lookup cache and the
target table do not become unsynchronized:
New values. Connect only lookup/output ports from the Lookup
transformation to the target.
Old values. Add an Expression transformation after the Lookup
transformation and before the Filter or Router transformation. Add
output ports in the Expression transformation for each port in the
target table and create expressions to ensure that we do not output
null input values to the target.
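
A small Python sketch may make the synchronization risk concrete. It assumes hypothetical column names (customer_id, phone, city) and models "ignore null inputs for update" as keeping the cached value wherever the incoming value is None; update_ignoring_nulls is an illustrative helper, not an Informatica function.

```python
from typing import Dict, Optional

Row = Dict[str, Optional[str]]


def update_ignoring_nulls(cached: Row, incoming: Row) -> Row:
    """Return the cached row updated with only the non-null incoming values."""
    merged = dict(cached)
    for column, value in incoming.items():
        if value is not None:
            merged[column] = value
    return merged


cached = {"customer_id": "C1", "phone": "555-0100", "city": "Pune"}
incoming = {"customer_id": "C1", "phone": None, "city": "Mumbai"}

cache_row = update_ignoring_nulls(cached, incoming)

# Writing `incoming` to the target unchanged would store phone = NULL there while
# the cache still holds "555-0100". Writing the merged row keeps both in step.
target_row = cache_row
```

This corresponds to making sure the values sent to the target are the same ones written to the lookup cache.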

When we run a session that uses a dynamic lookup cache, the
Integration Service compares the values in all lookup ports with
the values in their associated input ports by default.
It compares the values to determine whether or not to update the row
in the lookup cache. When a value in an input port differs from the
value in the lookup port, the Integration Service updates the row in the
cache.

But what if we don't want to compare all ports? We can choose the
ports we want the Integration Service to ignore when it compares
ports. The Designer only enables this property for lookup/output ports
when the port is not used in the lookup condition. We can improve
performance by ignoring some ports during comparison.

We might want to do this when the source data includes a column that
indicates whether or not the row contains data we need to update.
Select the Ignore in Comparison property for all lookup ports
except the port that indicates whether or not to update the
row in the cache and target table.

Note: We must configure the Lookup transformation to compare at
least one port; otherwise the Integration Service fails the session when we
ignore all ports.

Here is an easy to understand primer on Oracle architecture. Read this
first to give yourself a head-start before you read more advanced
articles on Oracle Server Architecture.

We need to cover two major things here: first the server architecture,
where we will learn about the memory and process structure, and then
the Oracle storage structure.

Database and Instance

Let’s first understand the difference between an Oracle database and
an Oracle instance.
An Oracle database is a group of files that reside on disk and store the
data, whereas an Oracle instance is a piece of shared memory and a
number of processes that allow information in the database to be
accessed quickly and by multiple concurrent users.
The following shows the parts of the database and the instance.

Database            Instance
• Control File      • Shared Memory (SGA)
• Online Redo Log   • Processes
• Data File
• Temp File

Now let's learn some details of both the database and the Oracle instance.
The Database
The database is comprised of different files as follows:

Control File: Contains information that defines the rest of the database, like
names, locations and types of other files.
Redo Log File: Keeps track of the changes made to the database.
Data File: All user data and metadata are stored in data files.
Temp File: Stores the temporary information that is often generated when
sorts are performed.

Each file has a header block that contains metadata about the file, such as the
SCN or system change number, which says when data stored in the buffer
cache was flushed down to disk. This SCN information is important for
Oracle to determine if the database is consistent.

The Instance
This is comprised of a shared memory segment (SGA) and a few
processes. The following shows the Oracle structure.

Shared Memory Segment

Shared Pool / Shared SQL Area: Contains various structures for running SQL and
dependency tracking.
Database Buffer Cache: Contains various data blocks that are read from the database
for some transaction.
Redo Log Buffer: Stores the redo information until the information is flushed out to
disk.

Details of the processes are shown below:

PMON (Process Monitor)
- Cleans up abnormally terminated connections
- Rolls back uncommitted transactions
- Releases locks held by a terminated process
- Frees SGA resources allocated to the failed processes
- Database maintenance
SMON (System Monitor)
- Performs automatic instance recovery
- Reclaims space used by temporary segments no longer in use
- Merges contiguous areas of free space in the datafile
DBWR (Database Writer)
- Writes all dirty buffers to datafiles
- Uses an LRU algorithm to keep most recently used blocks in memory
- Defers writes for I/O optimization
LGWR (Log Writer)
- Writes redo log entries to disk
CKPT (Check Point)
- If enabled (by setting the parameter CHECKPOINT_PROCESS=TRUE), takes over
  LGWR’s task of updating files at a checkpoint
- Updates headers of datafiles and control files at the end of a checkpoint
- More frequent checkpoints reduce recovery time from instance failure
Other Processes
- LCKn (Lock), Dnnn (Dispatcher), Snnn (Server), RECO (Recover),
  Pnnn (Parallel), SNPn (Job Queue), QMNn (Queue Monitor) etc.

Storage Structure

Here we will learn about both the physical and the logical storage structure. Physical storage is
how Oracle stores the data physically in the system, whereas logical storage is
how an end user actually accesses that data.
Physically, Oracle stores everything in files, called data files, whereas an end user
accesses that data through RDBMS tables, which is the logical part.
Let's see the details of these structures.
Physical storage space is comprised of different datafiles, which
contain data segments. Each segment can contain multiple extents,
and each extent contains blocks, which are the most granular
storage structure. The relationship among segments, extents and blocks
is shown below.
Data Files
|
^
Segments (size: 96k)
|
^
Extents (Size: 24k)
|
^
Blocks (size: 2k)
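
Taking the sizes in the diagram at face value (they are illustrative figures, not Oracle defaults), the containment relationship works out as follows. The constant names are made up for this small calculation.

```python
# Sizes from the diagram above, in KB. Illustrative values only.
SEGMENT_KB = 96
EXTENT_KB = 24
BLOCK_KB = 2

extents_per_segment = SEGMENT_KB // EXTENT_KB       # 96 / 24
blocks_per_extent = EXTENT_KB // BLOCK_KB           # 24 / 2
blocks_per_segment = extents_per_segment * blocks_per_extent

print(extents_per_segment, blocks_per_extent, blocks_per_segment)  # prints: 4 12 48
```

So in this example one segment holds 4 extents, each extent holds 12 blocks, and the segment holds 48 blocks in total.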

All about Informatica Lookup


A Lookup is a Passive, Connected or Unconnected transformation
used to look up data in a relational table, view, synonym or flat file.
The Integration Service queries the lookup table to retrieve a value
based on the input source value and the lookup condition.

All about Informatica LookUp Transformation

A connected lookup receives source data, performs a lookup and returns data to the
pipeline, while an unconnected lookup is not connected to a source or target and is called
by a transformation in the pipeline using a :LKP expression, which in turn returns only one
column value to the calling transformation.

A Lookup can be cached or uncached. If we cache the lookup, we can further
choose a static, dynamic or persistent cache, and a named or unnamed cache. By default,
lookup transformations are cached and static.

Lookup Ports Tab


The Ports tab of Lookup Transformation contains
Input Ports: Create an input port for each lookup port we want to use in the lookup
condition. We must have at least one input or input/output port in a lookup
transformation.

Output Ports: Create an output port for each lookup port we want to link to another
transformation. For connected lookups, we must have at least one output port. For
unconnected lookups, we must select a lookup port as a return port (R) to pass a return
value.

Lookup Port: The Designer designates each column of the lookup source as a lookup
port.

Return Port: An unconnected Lookup transformation has one return port that returns
one column of data to the calling transformation through this port.

Notes: We can delete lookup ports from a relational lookup if the mapping does not use
them, which gives us a performance gain. But if the lookup source is a flat
file, deleting lookup ports fails the session.

Now let us have a look at the Properties tab of the Lookup transformation:

Lookup Sql Override: Override the default SQL statement to add a WHERE clause or
to join multiple tables.

Lookup table name: The base table on which the lookup is performed.

Lookup Source Filter: We can apply filter conditions on the lookup table so as to reduce
the number of records. For example, we may want to select the active records of the
lookup table hence we may use the condition CUSTOMER_DIM.ACTIVE_FLAG = 'Y'.

Lookup caching enabled: If this option is checked, the Integration Service caches the lookup table during the
session run. Otherwise it queries the relational database for every lookup (uncached). Remember to
create a database index on the columns used in the lookup condition to provide better
performance when the lookup is uncached.

Lookup policy on multiple match: If the Integration Service finds
multiple matches during the lookup, we can configure the lookup to return the First Value, Last Value, Any
Value or to Report Error.

Lookup condition: The condition to look up values from the lookup table based on
source input data. For example, IN_EmpNo=EmpNo.

Connection Information: Query the lookup table from the source or target connection.
In case of a flat file lookup, we can give the file path and name, whether direct or indirect.

Source Type: Determines whether the source is a relational database or a flat file.

Tracing Level: It provides the amount of detail in the session log for the transformation.
Options available are Normal, Terse, Verbose Initialization, Verbose Data.

Lookup cache directory name: Determines the directory name where the lookup cache
files will reside.

Lookup cache persistent: Indicates whether we are going for persistent cache or non-
persistent cache.

Dynamic Lookup Cache: When checked, a dynamic lookup cache is used; otherwise a
static lookup cache is used.

Output Old Value On Update: Defines whether the old value for output ports will be
used to update an existing row in dynamic cache.

Cache File Name Prefix: The lookup will use this named persistent cache file based on the
base lookup table.

Re-cache from lookup source: When checked, integration service rebuilds lookup cache
from lookup source when the lookup instance is called in the session.

Insert Else Update: Insert the record if not found in cache, else update it. Option is
available when using dynamic lookup cache.

Update Else Insert: Update the record if found in cache, else insert it. Option is
available when using dynamic lookup cache.

Datetime Format: Used when source type is file to determine the date and time format
of lookup columns.

Thousand Separator: By default it is None, used when source type is file to determine
the thousand separator.

Decimal Separator: By default it is "."; alternatively we can use ",". Used when source type is
file to determine the decimal separator.

Case Sensitive String Comparison: To be checked when we want
case-sensitive comparison of string values in the lookup. Used when source type is file.

Null ordering: Determines whether NULL is the highest or lowest value. Used when
source type is file.

Sorted Input: Checked whenever we expect the input data to be sorted and is used when
the source type is flat file.

Lookup source is static: When checked it assumes that the lookup source is not going to
change during the session run.

Pre-build lookup cache: Default option is Auto. If we want the Integration Service to
start building the cache as soon as the session begins, we can choose the option Always
allowed.

Aggregation without Informatica Aggregator

Since Informatica processes data row by row, it is generally possible to
handle data aggregation operations even without an Aggregator
transformation. In certain cases, you may get a huge performance gain
using this technique!

General Idea of Aggregation without Aggregator Transformation

Let us take an example: suppose we want to find the SUM of SALARY
for each department of the Employee table. The SQL query for this
would be:
SELECT DEPTNO, SUM(SALARY) FROM EMP_SRC GROUP BY DEPTNO;
If we need to implement this in Informatica, it would be very easy, as
we would obviously go for an Aggregator transformation. By taking the
DEPTNO port as GROUP BY and one output port as SUM(SALARY), the
problem can be solved easily.
Now the trick is to use only an Expression to achieve the functionality of
the Aggregator. We rely on a basic property of the Expression
transformation: it can hold the value of an attribute of the previous
row.
But wait... why would we do this? Aren't we
complicating the thing here?

Yes, we are. But as it turns out, in many cases it might have a
performance benefit (especially if the input is already sorted or when
you know the input data will not violate the order, for example when you are loading
daily data and want to sort it by day). Remember, Informatica holds all
the rows in the Aggregator cache for the aggregation operation. This needs
time and cache space, and it also defeats the normal row-by-row
processing in Informatica. By replacing the Aggregator with an
Expression, we reduce the cache space requirement and restore row-by-row
processing. The mapping below shows how to do this.

Sorter (SRT_SAL) Ports Tab

I am showing a Sorter here just to illustrate the concept. If you already have sorted data
from the source, you need not use it, which increases the performance benefit.
Expression (EXP_SAL) Ports Tab
Image: Expression Ports Tab Properties
Sorter (SRT_SAL1) Ports Tab

Expression (EXP_SAL2) Ports Tab


Filter (FIL_SAL) Properties Tab

This is how we can implement aggregation without using the Informatica
Aggregator transformation. Hope you liked it!
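
The previous-row idea can be sketched in Python. This is a simplification and not a reproduction of the exact Sorter, Expression and Filter chain in the screenshots: it keeps a running sum while the DEPTNO stays the same and emits the total when the group changes. The function name aggregate_sorted and the sample salaries are assumptions for the example.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def aggregate_sorted(rows: Iterable[Tuple[int, float]]) -> Iterator[Tuple[int, float]]:
    """Yield (DEPTNO, SUM(SALARY)) from rows already sorted by DEPTNO."""
    prev_deptno: Optional[int] = None   # value held over from the previous row
    running_sum = 0.0
    for deptno, salary in rows:
        if prev_deptno is not None and deptno != prev_deptno:
            yield prev_deptno, running_sum   # group finished: emit its total
            running_sum = 0.0
        running_sum += salary
        prev_deptno = deptno
    if prev_deptno is not None:
        yield prev_deptno, running_sum       # emit the last group


emp_src: List[Tuple[int, float]] = [(20, 3000.0), (10, 1500.0), (20, 1000.0), (10, 2500.0)]
sorted_rows = sorted(emp_src, key=lambda r: r[0])   # stands in for the Sorter
totals = list(aggregate_sorted(sorted_rows))
```

Only one previous key and one running total are held at any time, which is why this approach needs no aggregation cache when the input is already sorted.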

Informatica Reject File - How to Identify rejection reason

When we run a session, the Integration Service may create a reject file
for each target instance in the mapping to store the rejected target
records. With the help of the session log and reject file we can
identify the cause of data rejection in the session. Eliminating the
cause of rejection will lead to rejection-free loads in subsequent
session runs. If the Informatica Writer or the target database rejects
data for any valid reason, the Integration Service logs the rejected
records into the reject file. Every time we run the session, the
Integration Service appends the rejected records to the reject file.

Working with Informatica Bad Files or Reject Files

By default the Integration Service creates the reject files or bad files in
the $PMBadFileDir process variable directory. It writes the entire
rejected row in the bad file, although the problem may be in any
one of the columns. The reject files have a default naming convention
like [target_instance_name].bad. If we open the reject file in an
editor we will see comma-separated values consisting of indicator tags
and data values. We will see two types of indicators in the
reject file: the Row Indicator and the Column
Indicator.
The best way to read the bad file is to copy its contents
and save them as a CSV (Comma Separated Value) file.
Opening the CSV file gives an Excel-sheet look and feel. The
first column in the reject file is the Row Indicator, which
determines whether the row was destined for insert, update, delete or
reject. It is basically a flag that determines the Update Strategy for the
data row. When the Commit Type of the session is configured as
User-defined, the row indicator indicates whether the transaction was
rolled back due to a non-fatal error, or whether the committed transaction
was in a failed target connection group.

List of Values of Row Indicators:

Row Indicator  Indicator Significance  Rejected By
0              Insert                  Writer or target
1              Update                  Writer or target
2              Delete                  Writer or target
3              Reject                  Writer
4              Rolled-back insert      Writer
5              Rolled-back update      Writer
6              Rolled-back delete      Writer
7              Committed insert        Writer
8              Committed update        Writer
9              Committed delete        Writer


Next come the column data values, each followed by its Column Indicator, which
determines the data quality of the corresponding column.

List of Values of Column Indicators:

Column Indicator | Type of Data | Writer Treats As
D | Valid data or Good Data. | Writer passes it to the target database. The target accepts it unless a database error occurs, such as finding a duplicate key while inserting.
O | Overflowed Numeric Data. | Numeric data exceeded the specified precision or scale for the column. Bad data, if you configured the mapping target to reject overflow or truncated data.
N | Null Value. | The column contains a null value. Good data. Writer passes it to the target, which rejects it if the target database does not accept null values.
T | Truncated String Data. | String data exceeded a specified precision for the column, so the Integration Service truncated it. Bad data, if you configured the mapping target to reject overflow or truncated data.
Also note that the second column contains the column indicator flag
value 'D', which signifies that the Row Indicator is valid.
Now let us see what data in a bad file looks like:

0,D,7,D,John,D,5000.375,O,,N,BrickLand Road Singapore,T
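
As a hedged illustration of how such a line can be read, the Python sketch below splits the sample row into its row indicator and its (value, column indicator) pairs, using the two indicator tables above. The function name parse_reject_line is made up for the example, and the sketch assumes the layout described here: row indicator, its own column indicator, then alternating data values and column indicators.

```python
import csv
from typing import List, Tuple

ROW_INDICATORS = {
    "0": "Insert", "1": "Update", "2": "Delete", "3": "Reject",
    "4": "Rolled-back insert", "5": "Rolled-back update", "6": "Rolled-back delete",
    "7": "Committed insert", "8": "Committed update", "9": "Committed delete",
}
COLUMN_INDICATORS = {
    "D": "Valid data",
    "O": "Overflowed numeric data",
    "N": "Null value",
    "T": "Truncated string data",
}


def parse_reject_line(line: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Split one reject-file line into the row type and (value, quality) pairs."""
    fields = next(csv.reader([line]))
    row_type = ROW_INDICATORS[fields[0]]
    # fields[1] is the column indicator that belongs to the row indicator itself.
    pairs = fields[2:]
    columns = [
        (pairs[i], COLUMN_INDICATORS[pairs[i + 1]])
        for i in range(0, len(pairs), 2)
    ]
    return row_type, columns


row_type, columns = parse_reject_line(
    "0,D,7,D,John,D,5000.375,O,,N,BrickLand Road Singapore,T"
)
```

For the sample line this identifies an insert row with five data columns, of which 5000.375 is flagged as overflowed numeric data, the empty value as null, and the address as truncated string data.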


Database Performance Tuning
This article tries to comprehensively list the many things one needs
to know for Oracle database performance tuning. The ultimate goal of
this document is to provide a generic and comprehensive guideline for
tuning Oracle databases from both the programmer's and the administrator's
standpoint.

Oracle terms and Ideas you need to know before beginning

Just to refresh your Oracle skills, here is a short walk-through as a starter.

Oracle Parser

It performs syntax analysis as well as semantic analysis of SQL
statements for execution, expands views referenced in the query into
separate query blocks, optimizes the statement, and builds (or locates) an
executable form of it.

Hard Parse

A hard parse occurs when a SQL statement is executed and the SQL
statement is either not in the shared pool, or it is in the shared pool
but cannot be shared. A SQL statement is not shared if the metadata
for the two SQL statements is different, i.e. a SQL statement is textually
identical to a preexisting SQL statement but the tables referenced in
the two statements are different, or if the optimizer environment is
different.

Soft Parse

A soft parse occurs when a session attempts to execute a SQL
statement, and the statement is already in the shared pool, and it can
be used (that is, shared). For a statement to be shared, all data
(including metadata, such as the optimizer execution plan) of the
existing SQL statement must be equal to the current statement being
issued.
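
The difference between the two can be pictured with a toy Python model. This is only an analogy and makes no claim about how Oracle implements the shared pool: the pool is a plain dictionary, and a statement is treated as shareable only when its text, the schema its tables resolve in, and the optimizer environment all match. The names SharedPool and parse, and the sample schema and optimizer values, are invented for the sketch.

```python
from typing import Dict, Tuple

# Key: (statement text, schema the tables resolve in, optimizer environment).
SharedPool = Dict[Tuple[str, str, str], str]


def parse(pool: SharedPool, sql: str, schema: str, optimizer_env: str) -> str:
    """Return "soft" if a shareable entry exists, otherwise do a "hard" parse."""
    key = (sql, schema, optimizer_env)
    if key in pool:
        return "soft"
    pool[key] = f"plan for {sql!r}"   # placeholder for the executable form
    return "hard"


pool: SharedPool = {}
results = [
    parse(pool, "SELECT * FROM emp", "HR", "ALL_ROWS"),     # first execution
    parse(pool, "SELECT * FROM emp", "HR", "ALL_ROWS"),     # identical statement
    parse(pool, "SELECT * FROM emp", "SCOTT", "ALL_ROWS"),  # same text, different table
    parse(pool, "SELECT * FROM emp", "HR", "FIRST_ROWS"),   # different optimizer environment
]
```

Only the second call finds a shareable entry; the third and fourth are textually identical but differ in metadata or optimizer environment, so each triggers another hard parse.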

Cost Based Optimizer


It generates a set of potential execution plans for SQL statements,
estimates the cost of each plan, calls the plan generator to generate
the plan, compares the costs, and then chooses the plan with the
lowest cost.
This approach is used when the data dictionary has statistics for at
least one of the tables accessed by the SQL statements. The CBO is
made up of the query transformer, the estimator and the plan
generator.

EXPLAIN PLAN

A SQL statement that enables examination of the execution plan
chosen by the optimizer for DML statements. EXPLAIN PLAN causes the
optimizer to choose an execution plan and then put data describing
the plan into a database table. The combination of the steps Oracle
uses to execute a DML statement is called an execution plan. An
execution plan includes an access path for each table that the
statement accesses and an ordering of the tables, i.e. the join order
with the appropriate join method.

Oracle Trace

An Oracle utility used by Oracle Server to collect performance and
resource utilization data, such as SQL parse, execute and fetch
statistics, and wait statistics. Oracle Trace provides several SQL
scripts that can be used to access server event tables, collects server
event data and stores it in memory, and allows data to be formatted
while a collection is occurring.

SQL Trace

It is a basic performance diagnostic tool to monitor and tune
applications running against the Oracle server. SQL Trace helps to
understand the efficiency of the SQL statements an application runs
and generates statistics for each statement. The trace files produced
by this tool are used as input for TKPROF.

TKPROF

It is also a diagnostic tool to monitor and tune applications running
against the Oracle Server. TKPROF primarily processes SQL trace
output files and translates them into readable output files, providing a
summary of user-level statements and recursive SQL calls for the trace
files. It also shows the efficiency of SQL statements, generates
execution plans, and creates SQL scripts to store statistics in the
database.

The following are generally accepted “Best Practices” for Informatica
PowerCenter ETL development which, if implemented, can significantly
improve overall performance.

Category | Technique | Benefits

Source Extracts | Loading data from fixed-width files takes less time than from delimited files, since delimited files require extra parsing. In the case of fixed-width files, the Integration Service knows the start and end position of each column upfront, which reduces the processing time. | Performance Improvement

Source Extracts | Using flat files located on the server machine loads faster than a database located on the server machine. | Performance Improvement

Mapping Designer | There should be a placeholder transformation (Expression) immediately after the source and one before the target. Data type and data width changes are bound to happen during the development phase, and these placeholder transformations are used to preserve the port links between transformations. | Best Practices

Mapping Designer | Connect only the ports that are required in targets to subsequent transformations. Also, active transformations that reduce the number of records should be used as early as possible in the mapping. | Code Optimization

Mapping Designer | If a join must be used in the mapping, select the appropriate driving/master table. The table with the smaller number of rows should be the driving/master table. | Performance Improvement

Transformations | If there are multiple lookup conditions, place the condition with the “=” sign first in order to optimize the lookup performance. Also, indexes on the database table should include every column used in the lookup condition. | Code Optimization

Transformations | Persistent caches should be used if the lookup data is not expected to change often. The cache files are saved and can be reused for subsequent runs, eliminating the need to query the database. | Performance Improvement

Transformations | The Integration Service processes numeric operations faster than string operations. For example, if a lookup is done on a large amount of data on two columns, EMPLOYEE_NAME and EMPLOYEE_ID, configuring the lookup around EMPLOYEE_ID improves performance. | Code Optimization

Transformations | Replace a complex filter expression with a flag (Y/N). Complex logic should be moved to an Expression transformation and the result stored in a port. The Filter expression takes less time to evaluate this port than to execute the entire logic. | Best Practices

Transformations | PowerCenter Server automatically makes conversions between compatible data types, which slows down performance considerably. For example, if a mapping moves data from an Integer port to a Decimal port, then back to an Integer port, the conversion may be unnecessary. | Performance Improvement

Assigning default values to a Performance


port; Transformation errors Improvement
written to session log will always
slow down the session
performance. Try removing
default values and eliminate
transformation errors.

Complex joins in Source


Qualifiers should be replaced
with Database views. There
won’t be any performance gains,
but it improves the readability a
lot. Also, any new conditions
can be evaluated easily by just
changing the Database view
“WHERE” clause.
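
The join guidance above (use the table with fewer rows as the driving/master table) can be illustrated with a small hash-join sketch. This is not PowerCenter code; it is a minimal Python model that assumes, for illustration, that the master side of a join is cached in memory while the detail side is streamed against it. The function name `hash_join` and the sample rows are hypothetical.

```python
from collections import defaultdict

def hash_join(master_rows, detail_rows, key):
    """Join two lists of dicts on `key`, caching only the master side."""
    # Build the in-memory cache from the master rows.
    cache = defaultdict(list)
    for row in master_rows:
        cache[row[key]].append(row)

    # Stream the detail rows and probe the cache for matches.
    joined = []
    for detail in detail_rows:
        for master in cache.get(detail[key], []):
            joined.append({**master, **detail})
    cached_row_count = sum(len(rows) for rows in cache.values())
    return joined, cached_row_count

# Hypothetical data: 2 departments, 1000 employees.
departments = [{"dept_id": 1, "dept_name": "Sales"}, {"dept_id": 2, "dept_name": "HR"}]
employees = [{"emp_id": i, "dept_id": 1 + i % 2} for i in range(1000)]

rows_small_master, cached_small = hash_join(departments, employees, "dept_id")
rows_large_master, cached_large = hash_join(employees, departments, "dept_id")
print(len(rows_small_master), cached_small)
print(len(rows_large_master), cached_large)
```

Both calls produce the same number of joined rows, but the first one caches 2 rows and the second caches 1000, which is the reason for picking the smaller table as master in this model.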

Informatica Development Best Practice – Workflow
Workflow Manager default properties can be modified to improve
overall performance; a few of them are listed below. These properties
can impact the ETL runtime directly and need to be configured
based on:

i) Source Database
ii) Target Database
iii) Data Volume

Category | Technique
Session Properties | While loading staging tables for full loads, the “Truncate target table” option should be checked. Based on the target database and whether a primary key is defined, the Integration Service issues a TRUNCATE or DELETE statement, as listed here:

  Database | Primary Key Defined | No Primary Key
  DB2 | TRUNCATE | TRUNCATE
  INFORMIX | DELETE | DELETE
  ODBC | DELETE | DELETE
  ORACLE | DELETE UNRECOVERABLE | TRUNCATE
  MSSQL | DELETE | TRUNCATE
  SYBASE | TRUNCATE | TRUNCATE

Session Properties | The workflow property “Commit interval” (default value: 10,000) should be increased for volumes of more than 1 million records. The database rollback segment size should also be updated when increasing “Commit Interval”.
Session Properties | Insert/Update/Delete options should be set as determined by the target population method, as listed here:

  Target Option | Integration Service
  Insert | Uses the target update option (Update as Update, Update as Insert, Update else Insert)
  Update as Update | Updates all rows as Update
  Update as Insert | Inserts all rows
  Update else Insert | Updates existing rows, else inserts

Partition | The maximum number of partitions for a session should be 1.5 times the number of processors in the Informatica server, e.g. 1.5 x 4 processors = 6 partitions.
Partition | Key Value partitions should be used only when an even distribution of data can be obtained. In other cases, Pass Through partitions should be used.
Partition | A source filter should be added to evenly distribute the data between Pass Through partitions. The key value should have ONLY numeric values: MOD(NVL(<Numeric Key Value>,0), <No of Partitions defined>). Ex: MOD(NVL(product_sys_no,0),6)
Partition | If a session contains “N” partitions, increase the DTM Buffer Size to at least “N” times the value for the session with one partition.
 | If the source or target database is an MPP (Massively Parallel Processing) database, enable Pushdown Optimization. When it is enabled, the Integration Service pushes as much transformation logic as possible to the source database, the target database, or both (FULL), based on the settings. This property can be ignored for conventional databases.
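
The MOD(NVL(key, 0), N) expression in the partition guidance assigns each row to one of N buckets by its numeric key. The following is a minimal Python sketch of that arithmetic, not PowerCenter code. The function name `partition_for` and the sample key values are hypothetical, and it assumes that each Pass Through partition's source filter selects one remainder value (0 to N-1) and that keys are non-negative.

```python
from collections import Counter

def partition_for(key, num_partitions):
    """Mimic MOD(NVL(key, 0), num_partitions): NULL keys fall into bucket 0."""
    value = 0 if key is None else key
    # Assumes non-negative keys; Python's % and Oracle's MOD differ for negatives.
    return value % num_partitions

# Hypothetical product_sys_no values, including one NULL.
keys = list(range(1, 13)) + [None]
num_partitions = 6

distribution = Counter(partition_for(k, num_partitions) for k in keys)
for bucket in range(num_partitions):
    print(bucket, distribution[bucket])
```

With sequential keys the buckets come out nearly equal in size; a skewed key column would show up here as uneven counts, which is the case where the text recommends against Key Value partitioning.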

Informatica and Oracle hints in SQL overrides
Hints used in a SQL statement send instructions to the Oracle optimizer, which can shorten the
query processing time. Can we make use of these hints in SQL overrides within our Informatica
mappings to improve query performance?

As a general note, any Informatica help material would suggest that you can enter any valid SQL
statement supported by the source database in a SQL override of a Source Qualifier or a Lookup
transformation, or at the session properties level.

While using hints as part of a Source Qualifier has no complications, using them in a Lookup SQL
override gets a bit tricky. Use of a forward slash followed by an asterisk (“/*”) in a Lookup SQL
override (generally used for comments in SQL and at times for Oracle hints) would result in
session failure with an error like:

TE_7017 : Failed to Initialize Server Transformation lkp_transaction


2009-02-19 12:00:56 : DEBUG : (18785 | MAPPING) : (IS | Integration_Service_xxxx) :
node01_UAT-xxxx : DBG_21263 : Invalid lookup override
SELECT SALES. SALESSEQ as SalesId, SALES.OrderID as ORDERID, SALES.OrderDATE as
ORDERDATE FROM SALES, AC_SALES WHERE AC_SALES. OrderSeq >= (Select /*+
FULL(AC_Sales) PARALLEL(AC_Sales,12) */ min(OrderSeq) From AC_Sales)
This is because Informatica’s parser fails to recognize this special character when used in a Lookup
override. Starting with the PowerCenter 7.1.3 release, a parameter is available that enables the
use of the forward slash or hints.

• Infa 7.x
1. Using a text editor open the PowerCenter server configuration file (pmserver.cfg).
2. Add the following entry at the end of the file:
LookupOverrideParsingSetting=1
3. Re-start the PowerCenter server (pmserver).

• Infa 8.x
1. Connect to the Administration Console.
2. Stop the Integration Service.
3. Select the Integration Service.
4. Under the Properties tab, click Edit in the Custom Properties section.
5. Under Name enter LookupOverrideParsingSetting
6. Under Value enter 1.
7. Click OK.
8. Start the Integration Service.
• Starting with PowerCenter 8.5, this change can be made at the session task itself
as follows:

1. Edit the session.
2. Select the Config Object tab.
3. Under Custom Properties, add the attribute LookupOverrideParsingSetting and set the Value
to 1.
4. Save the session.

Informatica PowerCenter 8.x Key Concepts – 1
We shall look at the fundamental components of the Informatica
PowerCenter 8.x suite. The key components are:

1. PowerCenter Domain
2. PowerCenter Repository
3. Administration Console
4. PowerCenter Client
5. Repository Service
6. Integration Service

1. PowerCenter Domain

A domain is the primary unit for management and administration of
services in PowerCenter. Node, Service Manager and Application
Services are components of a domain.

Node

A node is the logical representation of a machine in a domain. The
machine on which PowerCenter is installed acts as the domain and
also as the primary node. We can add other machines as nodes in the
domain and configure the nodes to run application services such as the
Integration Service or Repository Service. All service requests from
other nodes in the domain go through the primary node, also called the
‘master gateway’.

The Service Manager

The Service Manager runs on each node within a domain and is
responsible for starting and running the application services. The
Service Manager performs the following functions:

• Alerts. Provides notifications of events like shutdowns and restarts
• Authentication. Authenticates user requests from the Administration Console,
PowerCenter Client, Metadata Manager, and Data Analyzer
• Domain configuration. Manages configuration details of the domain like machine
name and port
• Node configuration. Manages node configuration metadata like machine name
and port
• Licensing. When an application service connects to the domain for the first time,
the licensing registration is performed; for subsequent connections the
licensing information is verified
• Logging. Manages the event logs from each service; the message levels can be
‘Fatal’, ‘Error’, ‘Warning’ or ‘Info’
• User management. Manages users, groups, roles, and privileges

Application services
The services that essentially perform data movement, connect to
different data sources and manage data are called application
services. They are the Repository Service, Integration Service, Web
Services Hub, SAP BW Service, Reporting Service and Metadata
Manager Service. The application services run on each node based on
the way we configure the node and the application service.

Domain Configuration
Some of the configurations for a domain involve assigning host names
and port numbers to the nodes, setting up Resilience Timeout values,
and providing connection information for the metadata database, SMTP
details, etc. All the configuration information for a domain is stored in
a set of relational database tables within the repository. Some of the
global properties that are applicable to application services, like
‘Maximum Restart Attempts’ and ‘Dispatch Mode’ (‘Round Robin’,
‘Metric Based’ or ‘Adaptive’), are configured under Domain Configuration.

2. PowerCenter Repository

The PowerCenter Repository is one of the best metadata stores among
all ETL products. The repository is sufficiently normalized to store
metadata at a very detailed level, which in turn means that updates to
the repository are very quick and overall team-based development
is smooth. The repository data structure is also useful for users to
do analysis and reporting.
Accessibility to the repository through MX views and the SDK extends
the repository's capability from simple storage of technical data to a
database for analysis of the ETL metadata.

PowerCenter Repository is a collection of 355 tables which can be
created on any major relational database. The kinds of information that
are stored in the repository are:

1. Repository configuration details
2. Mappings
3. Workflows
4. User Security
5. Process Data of session runs

For a quick understanding:

• When a user creates a folder, corresponding entries are made in
table OPB_SUBJECT; attributes like folder name, owner id and folder
type (shared or not) are all stored.
• When we create or import sources and define field names, datatypes
etc. in the Source Analyzer, entries are made in OPB_SRC and
OPB_SRC_FLD.
• When targets and related fields are created or imported from any
database, entries are made in tables like OPB_TARG and
OPB_TARG_FLD.
• Table OPB_MAPPING stores mapping attributes like mapping name,
folder id, valid status and mapping comments.
• Table OPB_WIDGET stores attributes like widget type, widget name,
comments etc. Widgets are simply transformations; "widget" is the
name Informatica uses for them internally.
• Table OPB_SESSION stores configurations related to a session task and
table OPB_CNX_ATTR stores information related to connection objects.
• Table OPB_WFLOW_RUN stores process details like workflow name,
workflow start time, workflow completion time, the server node it ran
on etc.
• REP_ALL_SOURCES, REP_ALL_TARGETS and REP_ALL_MAPPINGS are
a few of the many views created over these tables.
PowerCenter applications access the PowerCenter repository through
the Repository Service. The Repository Service protects metadata in
the repository by managing repository connections and using object
locking to ensure object consistency.
We can create a repository as global or local. We can use a global
repository to store common objects that multiple developers can use
through shortcuts, and a local repository to develop mappings and
workflows. From a local repository, we can create shortcuts to objects
in shared folders in the global repository.
PowerCenter supports versioning. A versioned repository can store
multiple versions of an object.
3. Administration Console
The Administration Console is a web application that we use to administer the PowerCenter
domain and PowerCenter security. There are two pages in the console, Domain Page &
Security Page.
We can do the following in the Domain Page:
o Create & manage application services like Integration Service and Repository Service
o Create and manage nodes, licenses and folders
o Restart and shutdown nodes
o View log events
o Other domain management tasks like applying licenses and managing grids and
resources
We can do the following in Security Page:
o Create, edit and delete native users and groups
o Configure a connection to an LDAP directory service. Import users and groups from the
LDAP directory service
o Create, edit and delete Roles (Roles are collections of privileges)
o Assign roles and privileges to users and groups
o Create, edit, and delete operating system profiles. An operating system profile is a level
of security that the Integration Service uses to run workflows
4. PowerCenter Client
Designer, Workflow Manager, Workflow Monitor, Repository Manager and Data Stencil are the
five client tools that are used to design mappings and mapplets, create sessions to load data,
and manage the repository.
A mapping is ETL code that pictorially depicts the logical data flow from source to target,
involving transformations of the data. Designer is the tool used to create mappings.
Designer has five window panes: Source Analyzer, Warehouse Designer, Transformation
Developer, Mapping Designer and Mapplet Designer.
Source Analyzer:
Allows us to import source table metadata from relational databases, flat files, XML and
COBOL files. Note that only the source definition is imported into the Source Analyzer, not
the source data itself. Source Analyzer also allows us to define our own source data
definition.
Warehouse Designer:
Allows us to import target table definitions, which could be from relational databases, flat
files, XML and COBOL files. We can also create target definitions manually and group them
into folders. There is an option to create the tables physically in the database, which we do
not have in Source Analyzer. Warehouse Designer doesn’t allow creating two tables with the
same name, even if the column names under them vary or they are from different
databases/schemas.
Transformation Developer:
Transformations like Filters, Lookups, Expressions etc. that have scope to be reused are
developed in this pane. Alternatively, transformations developed in Mapping Designer can
also be reused by checking the ‘re-use’ option, after which they are displayed under the
Transformation Developer folders.
Mapping Designer:
This is the place where we actually depict our ETL process; we bring in source definitions,
target definitions, transformations like filter, lookup, aggregate and develop a logical ETL
program. In this place it is only a logical program because the actual data load can be done
only by creating a session and workflow.
Mapplet Designer:
We create a set of transformations to be used and re-used across mappings.
Workflow Manager : In the Workflow Manager, we define a set of instructions called a workflow
to execute mappings we build in the Designer. Generally, a workflow contains a session and any
other task we may want to perform when we run a session. Tasks can include a session, email
notification, or scheduling information.

A set of tasks grouped together becomes a worklet. After we create a workflow, we run the
workflow in the Workflow Manager and monitor it in the Workflow Monitor. Workflow Manager has
the following three window panes:

Task Developer – Create tasks we want to accomplish in the workflow.

Worklet Designer – Create a worklet in the Worklet Designer. A worklet is an object that
groups a set of tasks. A worklet is similar to a workflow, but without scheduling information.
We can nest worklets inside a workflow.

Workflow Designer – Create a workflow by connecting tasks with links in the Workflow
Designer. We can also create tasks in the Workflow Designer as we develop the workflow.

The ODBC connection details are defined in the Workflow Manager “Connections” menu.

Workflow Monitor : We can monitor workflows and tasks in the Workflow Monitor. We can view
details about a workflow or task in Gantt Chart view or Task view. We can run, stop, abort, and
resume workflows from the Workflow Monitor. We can view sessions and workflow log events in
the Workflow Monitor Log Viewer.

The Workflow Monitor displays workflows that have run at least once. The Workflow Monitor
continuously receives information from the Integration Service and Repository Service. It also
fetches information from the repository to display historic information.

The Workflow Monitor consists of the following windows:

Navigator window – Displays monitored repositories, servers, and repository objects.

Output window – Displays messages from the Integration Service and Repository
Service.

Time window – Displays progress of workflow runs.

Gantt chart view – Displays details about workflow runs in chronological format.

Task view – Displays details about workflow runs in a report format.

Repository Manager

We can navigate through multiple folders and repositories and perform basic repository tasks with
the Repository Manager. We use the Repository Manager to complete the following tasks:

2. Add and connect to a repository. We can add repositories to the Navigator window and
client registry and then connect to the repositories.

3. Work with PowerCenter domain and repository connections. We can edit or remove domain
connection information. We can connect to one repository or multiple repositories. We
can export repository connection information from the client registry to a file. We can
import the file on a different machine and add the repository connection information to the
client registry.

4. Change your password. We can change the password for our user account.

5. Search for repository objects or keywords. We can search for repository objects containing
specified text. If we add keywords to target definitions, we can use a keyword to search for a
target definition.

6. View object dependencies. Before we remove or change an object, we can view
dependencies to see the impact on other objects.

7. Compare repository objects. In the Repository Manager, we can compare two repository
objects of the same type to identify differences between the objects.

8. Truncate session and workflow log entries. We can truncate the list of session and workflow
logs that the Integration Service writes to the repository. We can truncate all logs, or
truncate all logs older than a specified date.

5. Repository Service

Having discussed the metadata repository, we now turn to the Repository Service: a
separate, multi-threaded process that retrieves, inserts and updates metadata in the
repository database tables.
The Repository Service manages connections to the PowerCenter repository from
PowerCenter client applications like Designer, Workflow Manager, Workflow Monitor,
Repository Manager, the Administration Console and the Integration Service. The
Repository Service is responsible for ensuring the consistency of metadata in the
repository.

Creation & Properties:

Use the PowerCenter Administration Console Navigator window to create a
Repository Service. The properties needed to create it are:

Service Name – name of the service, like rep_SalesPerformanceDev
Location – Domain and folder where the service is created
License – license service name
Node, Primary Node & Backup Nodes – Node on which the service process runs
CodePage – The Repository Service uses the character set encoded in the repository
code page when writing data to the repository
Database type & details – Type of database, username, password, connect string and
tablespace name
The above properties are sufficient to create a Repository Service; however, we can
take a look at the following features, which are important for better performance and
maintenance.
General Properties
> OperatingMode: Values are Normal and Exclusive. Use Exclusive mode to perform
administrative tasks like enabling version control or promoting a local repository to a
global repository
> EnableVersionControl: Creates a versioned repository

Node Assignments: The “High availability option” is a licensed feature which allows us to
choose Primary & Backup nodes for continuous running of the Repository Service.
Under a normal license, we would see only one node to select from
Database Properties
> DatabaseArrayOperationSize: Number of rows to fetch each time an array
database operation is issued, such as insert or fetch. Default is 100

> DatabasePoolSize: Maximum number of connections to the repository database that
the Repository Service can establish. If the Repository Service tries to establish more
connections than specified for DatabasePoolSize, it times out the connection attempt
after the number of seconds specified for DatabaseConnectionTimeout

Advanced Properties
> CommentsRequiredForCheckin: Requires users to add comments when checking
in repository objects.

> Error Severity Level: Level of error messages written to the Repository Service log.
Specify one of the following message levels: Fatal, Error, Warning, Info, Trace &
Debug

> EnableRepAgentCaching: Enables repository agent caching. Repository agent
caching provides optimal performance of the repository when you run workflows.
When you enable repository agent caching, the Repository Service process caches
metadata requested by the Integration Service. Default is Yes.

> RACacheCapacity: Number of objects that the cache can contain when repository
agent caching is enabled. You can increase the number of objects if there is available
memory on the machine running the Repository Service process. The value must be
between 100 and 10,000,000,000. Default is 10,000

> AllowWritesWithRACaching: Allows you to modify metadata in the repository when
repository agent caching is enabled. When you allow writes, the Repository Service
process flushes the cache each time you save metadata through the PowerCenter
Client tools. You might want to disable writes to improve performance in a production
environment where the Integration Service makes all changes to repository
metadata. Default is Yes.

Environment Variables

The database client code page on a node is usually controlled by an environment
variable. For example, Oracle uses NLS_LANG, and IBM DB2 uses DB2CODEPAGE. All
Integration Services and Repository Services that run on this node use the same
environment variable. You can configure a Repository Service process to use a
different value for the database client code page environment variable than the
value set for the node.

You might want to configure the code page environment variable for a Repository
Service process when the Repository Service process requires a different database
client code page than the Integration Service process running on the same node.

For example, the Integration Service reads from and writes to databases using the
UTF-8 code page. The Integration Service requires that the code page environment
variable be set to UTF-8. However, you have a Shift-JIS repository that requires that
the code page environment variable be set to Shift-JIS. Set the environment variable
on the node to UTF-8. Then add the environment variable to the Repository Service
process properties and set the value to Shift-JIS.

6. Integration Service (IS)

The key functions of the IS are:

• Interpreting the workflow and mapping metadata from the repository
• Executing the instructions in the metadata
• Managing the data from source system to target system within memory and on
disk

The three main components of the Integration Service which enable data
movement are:

• Integration Service Process
• Load Balancer
• Data Transformation Manager
6.1 Integration Service Process (ISP)

The Integration Service starts one or more Integration Service
processes to run and monitor workflows. When we run a workflow, the
ISP starts and locks the workflow, runs the workflow tasks, and starts
the process to run sessions. The functions of the Integration Service
Process are:

• Locks and reads the workflow
• Manages workflow scheduling, i.e. maintains session dependency
• Reads the workflow parameter file
• Creates the workflow log
• Runs workflow tasks and evaluates the conditional links
• Starts the DTM process to run the session
• Writes historical run information to the repository
• Sends post-session emails

6.2 Load Balancer

The Load Balancer dispatches tasks to achieve optimal performance. It
dispatches tasks to a single node or across the nodes in a grid after
performing a sequence of steps. Before understanding these steps, we
have to know about Resources, Resource Provision Thresholds,
Dispatch Modes and Service Levels.

• Resources – We can configure the Integration Service to check the resources
available on each node and match them with the resources required to run the
task. For example, if a session uses an SAP source, the Load Balancer dispatches
the session only to nodes where the SAP client is installed
• Resource Provision Thresholds – There are three. Maximum CPU Run Queue
Length: the maximum number of runnable threads waiting for CPU resources on
the node. Maximum Memory %: the maximum percentage of virtual memory
allocated on the node relative to the total physical memory size. Maximum
Processes: the maximum number of running Session and Command tasks allowed
for each Integration Service process running on the node
• Dispatch Modes – There are three. Round-Robin: The Load Balancer dispatches
tasks to available nodes in a round-robin fashion after checking the “Maximum
Processes” threshold. Metric-based: Checks all three resource provision thresholds
and dispatches tasks in round-robin fashion. Adaptive: Checks all three resource
provision thresholds and also ranks nodes according to current CPU availability
• Service Levels – Establish priority among tasks that are waiting to be dispatched.
The three components of a service level are Name, Dispatch Priority and Maximum
Dispatch Wait Time. “Maximum dispatch wait time” is the amount of time a task
can wait in the queue, which ensures no task waits forever

A. Dispatching Tasks on a node

1. The Load Balancer checks different resource provision thresholds on the node
depending on the Dispatch mode set. If dispatching the task causes any threshold
to be exceeded, the Load Balancer places the task in the dispatch queue, and it
dispatches the task later
2. The Load Balancer dispatches all tasks to the node that runs the master
Integration Service process

B. Dispatching Tasks on a grid

1. The Load Balancer verifies which nodes are currently running and enabled
2. The Load Balancer identifies nodes that have the PowerCenter resources required
by the tasks in the workflow
3. The Load Balancer verifies that the resource provision thresholds on each
candidate node are not exceeded. If dispatching the task causes a threshold to be
exceeded, the Load Balancer places the task in the dispatch queue, and it
dispatches the task later
4. The Load Balancer selects a node based on the dispatch mode
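
The round-robin dispatch mode and the dispatch queue described above can be modelled in a few lines. The following is a simplified Python sketch, not the Load Balancer's actual algorithm: the class name `RoundRobinDispatcher`, the node names and the threshold values are hypothetical, and only the “Maximum Processes” threshold is checked. Tasks that cannot be placed on any node are put in a dispatch queue for a later attempt.

```python
from collections import deque

class RoundRobinDispatcher:
    """Toy model: dispatch tasks to nodes in turn, honoring a per-node limit."""

    def __init__(self, max_processes):
        # max_processes: node name -> allowed number of running tasks (assumed values)
        self.max_processes = dict(max_processes)
        self.running = {node: 0 for node in max_processes}
        self.order = deque(max_processes)  # rotation order of the nodes
        self.queue = deque()               # tasks waiting to be dispatched

    def dispatch(self, task):
        """Return the chosen node, or None if the task had to be queued."""
        for _ in range(len(self.order)):
            node = self.order[0]
            self.order.rotate(-1)  # the next attempt starts with the following node
            if self.running[node] < self.max_processes[node]:
                self.running[node] += 1
                return node
        # Every node is at its threshold: hold the task in the dispatch queue.
        self.queue.append(task)
        return None

dispatcher = RoundRobinDispatcher({"node01": 2, "node02": 1})
placements = [dispatcher.dispatch(f"s_task_{i}") for i in range(4)]
print(placements)
print(list(dispatcher.queue))
```

A metric-based or adaptive variant would add the other two thresholds to the `if` check and, for adaptive mode, sort the candidate nodes by CPU availability instead of rotating them.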

6.3 Data Transformation Manager (DTM) Process

When the workflow reaches a session, the Integration Service Process
starts the DTM process. The DTM is the process associated with the
session task. The DTM process performs the following tasks:

• Retrieves and validates session information from the repository.


• Validates source and target code pages.
• Verifies connection object permissions.
• Performs pushdown optimization when the session is configured for pushdown
optimization.
• Adds partitions to the session when the session is configured for dynamic
partitioning.
• Expands the service process variables, session parameters, and mapping variables
and parameters.
• Creates the session log.
• Runs pre-session shell commands, stored procedures, and SQL.
• Sends a request to start worker DTM processes on other nodes when the session is
configured to run on a grid.
• Creates and runs mapping, reader, writer, and transformation threads to extract,
transform, and load data
• Runs post-session stored procedures, SQL, and shell commands and sends post-
session email
• After the session is complete, reports execution result to ISP

Pictorial representation of workflow execution:

1. A PowerCenter Client requests the IS to start a workflow
2. The IS starts the ISP
3. The ISP consults the LB to select a node
4. The ISP starts the DTM on the node selected by the LB

Change Data Capture in Informatica
Change data capture (CDC) is an approach or technique to identify
changes, and only changes, in the source. I have seen applications that
were built without CDC and were later mandated to implement CDC at a
higher cost. Building an ETL application without CDC is a costly miss
and usually leads to backtracking. In this article we discuss different
methods of implementing CDC.

Scenario #01: Change detection using timestamp on source rows
In this typical scenario the source rows have two extra columns, say
row_created_time and last_modified_time. row_created_time is the time
at which the record was first created; last_modified_time is the time at
which the record was last modified.

1. In the mapping, create a mapping variable $$LAST_ETL_RUN_TIME of datetime
data type
2. Evaluate the expression SetMaxVariable($$LAST_ETL_RUN_TIME,
SESSSTARTTIME); this step stores the time at which the session was started in
$$LAST_ETL_RUN_TIME
3. Use $$LAST_ETL_RUN_TIME in the ‘where’ clause of the source SQL. During
the first run or initial seed, the mapping variable has a default value and the
query pulls all the records from the source, like: select * from employee where
last_modified_time > '01/01/1900 00:00:000'
4. Now let us assume the session is run on '01/01/2010 00:00:000' for the initial seed
5. When the session is executed on '02/01/2010 00:00:000', the SQL would be:
select * from employee where last_modified_time > '01/01/2010 00:00:000',
thereby pulling only the records that changed between successive runs
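
The timestamp-based steps above can be sketched outside PowerCenter to show the control flow. The following Python example uses an in-memory SQLite table as a stand-in for the source. The sample `employee` rows, the `extract_changes` function and the ISO-format timestamps are assumptions for illustration, and a plain variable plays the role of the persisted mapping variable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (emp_id INTEGER, name TEXT, last_modified_time TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [(1, "Asha", "2009-12-30 10:00:00"), (2, "Ben", "2009-12-31 09:30:00")],
)

def extract_changes(conn, last_etl_run_time, session_start_time):
    """Pull rows modified after the previous run and return the new watermark."""
    rows = conn.execute(
        "SELECT emp_id, name FROM employee WHERE last_modified_time > ? ORDER BY emp_id",
        (last_etl_run_time,),
    ).fetchall()
    # Mirrors SetMaxVariable: keep the later of the stored value and this run's start time.
    return rows, max(last_etl_run_time, session_start_time)

# Initial seed: the default watermark pulls every row.
watermark = "1900-01-01 00:00:00"
seed_rows, watermark = extract_changes(conn, watermark, "2010-01-01 00:00:00")

# One row changes after the seed run; the next run picks up only that row.
conn.execute(
    "UPDATE employee SET name = 'Benjamin', last_modified_time = '2010-01-01 15:00:00' "
    "WHERE emp_id = 2"
)
delta_rows, watermark = extract_changes(conn, watermark, "2010-01-02 00:00:00")

print(seed_rows)
print(delta_rows)
print(watermark)
```

ISO-formatted strings are used here so that string comparison matches chronological order; a real source would compare native datetime columns.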

Scenario #02: Change detection using load_id or Run_id
Under this scenario the source rows have a column, say load_id, a
positive running number. The load_id is updated as and when the
record is updated.

1. In the mapping, create a mapping variable $$LAST_READ_LOAD_ID of integer
data type
2. Evaluate the expression SetMaxVariable($$LAST_READ_LOAD_ID, load_id); the
maximum load_id is stored in the mapping variable
3. Use $$LAST_READ_LOAD_ID in the ‘where’ clause of the source SQL. During
the first run or initial seed, the mapping variable has a default value and the
query pulls all the records from the source, like: select * from employee where
load_id > 0. Assuming all records during the initial seed have load_id = 1, the
mapping variable would store ‘1’ in the repository.
4. Now let us assume the session is run after five loads into the source; the SQL
would be: select * from employee where load_id > 1, thereby limiting the source
read to only the records that have been changed after the initial seed
5. Consecutive runs would take care of updating the load_id and pulling the delta in
sequence
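
The load_id steps above follow the same watermark pattern with an integer instead of a timestamp. Below is a minimal Python sketch under stated assumptions: the in-memory `source` list stands in for the employee table, `read_delta` is a hypothetical helper, and a plain variable plays the role of $$LAST_READ_LOAD_ID.

```python
def read_delta(source_rows, last_read_load_id):
    """Return rows with load_id above the stored value, plus the new maximum."""
    delta = [row for row in source_rows if row["load_id"] > last_read_load_id]
    # Mirrors SetMaxVariable: the stored value only ever moves forward.
    new_max = max([last_read_load_id] + [row["load_id"] for row in delta])
    return delta, new_max

source = [{"emp_id": 1, "load_id": 1}, {"emp_id": 2, "load_id": 1}]

# Initial seed with the default value 0 reads everything and stores 1.
seed, last_id = read_delta(source, 0)
seed_count = len(seed)

# Further loads touch the source before the next ETL run.
source[1]["load_id"] = 4
source.append({"emp_id": 3, "load_id": 6})
delta, last_id = read_delta(source, last_id)
delta_ids = [row["emp_id"] for row in delta]

print(seed_count, delta_ids, last_id)
```

The unchanged row (emp_id 1) is skipped on the second run because its load_id is not greater than the stored value.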

In the next blog we can see how to implement CDC when reading from
Salesforce.com
