ETL Tools: Basic Details About Informatica
ETL tools are meant to extract, transform and load the data into the data warehouse for
decision making.
Before ETL tools, the ETL process was done manually with SQL code written by programmers.
This task was cumbersome and tedious since it involved many resources, complex coding
and many work hours. Maintaining the code was a great challenge for the programmers.
ETL tools are very powerful and offer many advantages in all stages of the ETL process,
from execution, data cleansing (purification), data profiling and transformation to
debugging and loading the data into the data warehouse, when compared to the old method.
Why Informatica?
Informatica Versions

Version                          Release Date   Comments
Informatica PowerCenter 4.1
Informatica PowerCenter 5.1
Informatica PowerCenter 6.1.2
Informatica PowerCenter 7.1.2    Nov 2003
Informatica PowerCenter 8.1      Jun 2006       Service Oriented Architecture
Informatica PowerCenter 8.5      Nov 2007
Informatica PowerCenter 8.6      Jun 2008
Informatica PowerCenter 9.1      Jun 2011
Informatica PowerCenter 9.5      May 2012
Configure domains
Create the repository
Deploy the code
Change password
Source Analyzer: - Import or create source definitions for flat file, XML, COBOL,
Application, and relational sources.
Mapping Designer:- Create mappings
To create a workflow, you first create tasks such as a session, which contains the mapping
you build in the Designer.
You then connect tasks with conditional links to specify the order of execution for the tasks
you created.
The Workflow Manager consists of three tools to help you develop a workflow:
Task Developer. Use the Task Developer to create tasks you want to run in the workflow.
Workflow Designer. Use the Workflow Designer to create a workflow by connecting tasks
with links. You can also create tasks in the Workflow Designer as you develop the
workflow.
Worklet Designer. Use the Worklet Designer to create worklets.
Workflow Tasks
You can create the following types of tasks in the Workflow Manager:
Event-Wait: - Waits for an event to occur before executing the next task.
The Workflow Monitor is used to monitor the execution of workflows and sessions.
Basic Definitions:
Mapping represents data flow from sources to targets.
Mapplet is a set of transformations that can be used in one or more mappings.
Session is a set of instructions to move data from sources to targets.
Workflow is a set of instructions that tell the Informatica server how to execute the tasks.
Worklet is an object that represents a set of tasks.
Informatica Architecture
The Informatica ETL product, known as Informatica PowerCenter, consists of four main components.
Designer
Repository manager
Workflow manager
Workflow monitor
This architecture is visually explained in the diagram below.
[Architecture diagram: sources (legacy mainframes - DB2, VSAM, IMS, IDMS, Adabas; AS400 - DB2; flat files; remote sources) flow through PowerCenter into targets (legacy mainframes - DB2; AS400 - DB2; flat files; remote targets).]
How components work in the Informatica Architecture
Repository: - The repository is nothing but a relational database which stores all the
metadata created in PowerCenter. Whenever you develop any mapping, session, or
workflow, entries are made in the repository.
Integration Service: - It extracts data from sources, processes it as per the business logic
and loads the data to targets.
Repository Service: - It connects to the repository, fetches data from the repository
and sends it back to the requesting components (mostly client tools and the
Integration Service).
Power Center Client tools: - The PowerCenter Client consists of multiple tools. They
are used to manage users, define sources and targets, build mappings and mapplets
with the transformation logic, and create workflows to run the mapping logic. The
PowerCenter Client connects to the repository through the Repository Service to
fetch details. It connects to the Integration Service to start workflows. So essentially
client tools are used to code and give instructions to PowerCenter servers.
PowerCenter Administration Console: This is simply a web-based administration
tool you can use to administer the PowerCenter installation.
1. What are the functionalities we can do with source qualifier
transformation?
Source qualifier is an active and connected transformation. It is used to represent the rows that the
Integration Service reads when it runs a session.
Source qualifier transformation converts the source data types to the Informatica native
data types.
Joins: You can join two or more tables from the same source database.
Filter rows: You can filter the rows from the source database
Sorting input: You can sort the source data by specifying the number of sorted ports. The
Integration Service adds an ORDER BY clause to the default SQL query.
Distinct rows: You can get distinct rows from the source by choosing the "Select Distinct"
property. The Integration Service adds a SELECT DISTINCT statement to the default SQL
query.
Custom SQL Query: You can write your own SQL query to do calculations.
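As an illustration (a sketch, assuming a hypothetical EMPLOYEES relational source with "Select Distinct" enabled and Number of Sorted Ports set to 1), the default query the Integration Service generates would look roughly like:

SELECT DISTINCT EMPLOYEES.EMP_ID, EMPLOYEES.EMP_NAME, EMPLOYEES.DEPT_ID
FROM EMPLOYEES
ORDER BY EMPLOYEES.EMP_ID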
Join Type
The joiner transformation supports the following four types of joins.
Normal Join
Master Outer Join
Detail Outer Join
Full Outer Join
We will learn about each join type with an example. Let's say we have the following students and subjects
tables as the source.
Subject_Id subject_Name
-----------------------
1 Maths
2 Chemistry
3 Physics
Student_Id Subject_Id
---------------------
10 1
20 2
30 NULL
Assume that the subjects source is the master and the students source is the detail, and we will join these
sources on the subject_id port.
Normal Join:
The joiner transformation outputs only the records that match the join condition and discards all the rows
that do not match the join condition. The output of the normal join is
Master Ports | Detail Ports
---------------------------------------------
1 Maths 10 1
2 Chemistry 20 2
Master Outer Join:
In a master outer join, the joiner transformation keeps all the records from the detail source and only the
matching rows from the master source. It discards the unmatched rows from the master source. The output
of the master outer join is
Master Ports | Detail Ports
---------------------------------------------
1 Maths 10 1
2 Chemistry 20 2
NULL NULL 30 NULL
Detail Outer Join:
In a detail outer join, the joiner transformation keeps all the records from the master source and only the
matching rows from the detail source. It discards the unmatched rows from the detail source. The output of
the detail outer join is

Master Ports | Detail Ports
---------------------------------------------
1 Maths 10 1
2 Chemistry 20 2
3 Physics NULL NULL
Full Outer Join:
The full outer join first brings the matching rows from both the sources and then it also keeps the non-
matched records from both the master and detail sources. The output of the full outer join is

Master Ports | Detail Ports
---------------------------------------------
1 Maths 10 1
2 Chemistry 20 2
3 Physics NULL NULL
NULL NULL 30 NULL
If possible, perform joins in a database. Performing joins in a database is faster than performing
joins in a session.
You can improve the session performance by configuring the Sorted Input option in the joiner
transformation properties tab.
Specify the source with fewer rows and with fewer duplicate keys as the master and the other
source as detail.
You cannot use joiner transformation when the input pipeline contains an update strategy
transformation.
You cannot connect a sequence generator transformation directly to the joiner transformation.
Why do we need Lookup?
Lookup transformation is used to look up data in a flat file, relational table, view or
synonym.
Lookup is a passive/active transformation
It can be used in both connected/unconnected modes.
From Informatica version 9 onwards lookup is an active transformation. The lookup
transformation can return a single row or multiple rows.
Get a Related Value: You can get a value from the lookup table based on the source value. As
an example, we can get the related value like city name for the zip code value.
Get Multiple Values: You can get multiple rows from a lookup table. As an example, get all the
states in a country.
Perform Calculation. We can use the value from the lookup table and use it in calculations.
Update Slowly Changing Dimension tables: Lookup transformation can be used to
determine whether a row exists in the target or not.
We can configure a Lookup transformation to cache the underlying lookup table. In case of static or
read-only lookup cache the Integration Service caches the lookup table at the beginning of the session
and does not update the lookup cache while it processes the Lookup transformation.
In case of dynamic lookup cache the Integration Service dynamically inserts or updates data in the
lookup cache and passes the data to the target. The dynamic cache is synchronized with the target.
Difference Between Joiner Transformation And Lookup Transformation
Joiner | Lookup
We cannot override the query in Joiner. | We can override the query in Lookup to fetch the data from multiple tables.
Supports equi join only. | Supports equi join and non-equi join.
In Joiner we cannot configure to use persistent cache, shared cache, uncached and dynamic cache. | In Lookup we can configure to use persistent cache, shared cache, uncached and dynamic cache.
We can perform outer join in Joiner transformation. | We cannot perform outer join in Lookup transformation.
Joiner is used only on sources. | Lookup can be used on a source as well as a target.
A transformation is a repository object which reads the data, modifies the data and passes the
data.
Active Transformations:
Change the number of rows: For example, the filter transformation is active because it removes
the rows that do not meet the filter condition.
Change the transaction boundary: The transaction control transformation is active because it
defines a commit or roll back transaction.
Change the row type: Update strategy is active because it flags the rows for insert, delete,
update or reject.
Passive Transformations:
Transformations which do not change the number of rows passed through them, and which maintain the
transaction boundary and row type, are called passive transformations.
Active Transformation - An active transformation changes the number of rows that pass through
the mapping.
Passive Transformation - Passive transformations do not change the number of rows that pass
through the mapping.
1. Expression Transformation
2. Sequence Generator Transformation
3. Lookup Transformation
4. Stored Procedure Transformation
5. XML Source Qualifier Transformation
6. External Procedure Transformation
7. Input Transformation(Mapplet)
8. Output Transformation(Mapplet)
Is look up Active Transformation?
Before Informatica version 9.1, the Lookup transformation was passive. For each input row passed
to the lookup transformation, we could get only one output row even if multiple rows matched the lookup condition.
This property determines which rows to return when the Lookup transformation finds
multiple rows that match the lookup condition. Select one of the following values:
Report Error. The Integration Service reports an error and does not return a row.
Use First Value. Returns the first row that matches the lookup condition.
Use Last Value. Return the last row that matches the lookup condition.
Use Any Value. The Integration Service returns the first value that matches the lookup
condition. It creates an index based on the key ports instead of all Lookup
transformation ports.
From Informatica 9.1 onwards, the Lookup transformation can return multiple rows as output.
Update Strategy Transformation
You have to flag each row for insert, update, delete or reject. The constants and their
numeric equivalents for each database operation are listed below:
DD_INSERT: 0
DD_UPDATE: 1
DD_DELETE: 2
DD_REJECT: 3
Important Note:
Update strategy works only when we have a primary key on the target table. If there is no primary key
available on the target table, then you have to specify a primary key in the target definition in the mapping
for update strategy transformation to work
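As a sketch, assuming a port lkp_customer_id returned by a lookup on the target, a typical update strategy expression would be:

IIF(ISNULL(lkp_customer_id), DD_INSERT, DD_UPDATE)

Rows whose key is not found in the target are flagged for insert; the remaining rows are flagged for update.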
Stored Procedure Transformation
It is a passive transformation
It can work as a connected or unconnected transformation
It is used to run a stored procedure in the database
Check the status of the target database before loading data into it
Determine if enough space exists in data base or not
Perform a specialized calculation
Dropping and re-creating indexes
Input / Output Parameters: Used to send and receive data from the stored procedure.
Return Values: After running a stored procedure, most databases return a value. This value can
either be user-definable, which means that it can act similar to a single output parameter, or it may
only return an integer value. If a stored procedure returns a result set rather than a single return
value, the Stored Procedure transformation takes only the first value returned from the procedure.
Status Codes: Status codes provide error handling for the Integration Service during a workflow.
The stored procedure issues a status code that notifies whether or not the stored procedure completed
successfully. You cannot see this value.
Specifying when the Stored Procedure Runs:
The property, "Stored Procedure Type" is used to specify when the stored procedure runs. The different
values of this property are shown below:
Normal: The stored procedure transformation runs for each row passed in the mapping. This is
useful when running a calculation against an input port. Connected stored procedures run only in
normal mode.
Pre-load of the Source: Runs before the session reads data from the source. Useful for verifying
the existence of tables or performing joins of data in a temporary table.
Post-load of the Source: Runs after reading data from the source. Useful for removing temporary
tables.
Pre-load of the Target: Runs before the session sends data to the target. This is useful for
verifying target tables or disk space on the target system.
Post-load of the Target: Runs after loading data into the target. This is useful for re-creating
indexes on the database.
Connected Lookup | Unconnected Lookup
Receives input values directly from the pipeline. | Receives input values from the result of a :LKP expression in another transformation.
Cache includes all lookup columns used in the mapping. | Cache includes all lookup/output ports in the lookup condition and the lookup/return port.
If there is a match for the lookup condition, the PowerCenter Server returns the result of the lookup condition for all lookup/output ports. | If there is a match for the lookup condition, the PowerCenter Server returns the result of the lookup condition into the return port.
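A minimal sketch of calling an unconnected lookup, assuming a lookup transformation named lkp_get_city with a return port city_name; an output port in an Expression transformation would call it as:

:LKP.lkp_get_city(zip_code)

The lookup returns the value of city_name for the matching row into the calling port.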
Number of Transformations in Informatica?
Around 30
It is an active transformation
Can output multiple rows from each input row
Can transpose the data (transposing columns to rows)
Let's imagine we have a table like the one below that stores the sales figures for 4 quarters of a year in 4
different columns. As you can see, each row represents one shop and the columns represent the
corresponding sales. Next, imagine our task is to generate a result set with a separate
row for every quarter. We can configure a Normalizer transformation to return a separate row for
each quarter, like below.
Source Table

Shop Name  Quarter1  Quarter2  Quarter3  Quarter4
Shop 1     100       300       500       700
Shop 2     250       450       650       850

The Normalizer returns a row for each shop and sales combination. It also returns an index - called
GCID (we will know it later in detail) - that identifies the quarter number:

Target Table

Shop Name  Sales  GCID
Shop 1     100    1
Shop 1     300    2
Shop 1     500    3
Shop 1     700    4
Shop 2     250    1
Shop 2     450    2
Shop 2     650    3
Shop 2     850    4
Persistent cache:
Persistent cache is required when lookup table size is huge and the same lookup
table is being used in different mappings.
In persistent cache, the integration service saves the lookup cache files after a
successful session run.
If the persistent cache files get deleted by mistake, the Integration Service generates the
cache files again the next time the session runs.
In the first mapping we will create the Named Persistent Cache file by setting three properties in the
Properties tab of the Lookup transformation:
Lookup cache persistent: to be checked, i.e. the lookup cache will be saved after the session run.
Cache File Name Prefix: user_defined_cache_file_name, i.e. the Named Persistent Cache file name that will be used in all
the other mappings using the same lookup table. Enter the prefix name only; do not enter .idx
or .dat.
Re-cache from lookup source: to be checked, i.e. the Named Persistent Cache file will be rebuilt or refreshed with the
current data of the lookup table.
Next, in all the mappings where we want to use the same already built Named Persistent Cache, we
need to set two properties in the Properties tab of the Lookup transformation:
Lookup cache persistent: to be checked, i.e. the lookup will use a Named Persistent Cache that is already saved in
the Cache Directory. If the cache file is not there, the session will not fail; it will just create the
cache file instead.
Cache File Name Prefix: user_defined_cache_file_name, i.e. the Named Persistent Cache file name that was defined in the
mapping where the persistent cache file was created.
If there is any Lookup SQL Override, then the SQL statement in all the lookups should
match exactly; even an extra blank space will fail the session that is using the already built cache.
So if the incoming source data volume is high, the lookup table's data volume that needs to be cached
is also high, and the same lookup table is used in many mappings, then the best way to handle the
situation is to use a one-time-build, already created persistent named cache.
2. What will be your approach when a workflow is running for more hours?
Yes, we can do it in Informatica as well, but there are some situations where we have to go
with the Stored Procedure transformation. For example, if complex logic has to be built and we
try to build it through Informatica, we have to keep a large number of transformations in the
mapping, which makes it more complex to understand.
6. Workflow failed, out of 200, 100 records got loaded, now how do you re-run the
workflow?
If there is a lookup on the target to check whether we got a new record (insert it) or an existing one
(update it), then we can re-run the workflow from the start.
If the source system is an RDBMS, then to eliminate duplicate records we can check the
Select Distinct option of the Source Qualifier on the source table and load the data into the target.
If the source system is a flat file, then we can use a Sorter transformation to eliminate
duplicate records.
Yes, it is required.
Persistent cache does not delete the lookup cache even after the session has completed.
If the lookup table has huge data and is being used in many mappings, we should go for persistent cache.
Each time we run the session, it has to build the cache, and the time for building
the cache depends on the size of the lookup table.
If the size of the lookup table is small, then building the cache will require less time.
If the size of the lookup table is huge, then building the cache will require more time and
performance degrades.
To avoid this performance degradation, build the cache once and use the same cache whenever
required.
If there is any change in the data in the lookup table, then we need to refresh the persistent lookup
cache.
10. How do you update the target table without using an update strategy transformation?
Yes, we can update the target table without using an Update Strategy transformation.
For this, we need to define the key of the target table at the Informatica level, and
then we need to connect the key and the field we want to update in the mapping target.
At the session level, we should set "Treat source rows as" to Update and check the
"Update as Update" option in the target properties.
11. Which Informatica version you have worked?
Informatica 9.1
Informatica scheduler
Star Schema
Around 40 to 50
15. How do you load data into dimension table and fact table?
First we will load the data from the source to the stage tables, then from stage to the dimension
tables, and then from stage to the fact table.
Fact Table – 1
Dimensional Table – 6
Around 10
Solution: The sequence generator value should be updated with max(primary key) in that mapping. If the
data is completely loaded into the target, re-running the session is not required.
Issue: Severity Timestamp Node Thread Message Code Message
CMN_1022 [
Database Error: Failed to connect to database using user [csdwh_inf] and connection string
[EUERDWP1.AE.GE.COM].]”
Solution: Raise an SC with the DB team to check if there were any recent password changes for this user.
Issue: “ORA-01089: immediate shutdown in progress - no operations are permitted”
Cause: This error occurs when invalid data comes from ERP.
Cause: The DB was changed from GPSOR252 to etpsblp2.
Solution: Raise an SC with the ERP DB team; the DB team has to add the TNS entry on the
new server, and the connection string should be changed to etpsblp2.
Issue:
SQL Error [
Cause:
A work order should be raised with the ERP DB team for getting access to the tables for the GEPSCSIDW user.
Issue:
SQL Error [
Raise a service call with the ERP DB team for getting access to the DB link for the GEPSCSIDW user.
Issue:
CMN_1022 [
Database Error: Failed to connect to database using user [GEPSREADONLY] and connection string
[gpsesp76]. “
Cause: DB is down.
Solution: Check with DB team if the DB is fine and rerun the session.
Issue:
Cause: DB issue
Solution: Rerun the session
Issue: Transformation Evaluation Error [<<Expression Error>> [TO_DATE]: invalid string for
converting to Date
Solution: Sequence generator value must be updated with the value max (primary key) in that
mapping.
CMN_1022 [
Function Name: Connect
Database Error: Failed to connect to database using user [epsuser] and connection string [atlorp38].]
“
Database Error: Failed to connect to database using user [gepsreadonly] and connection string
[gpsescp1].].”
Cause: DB is down.
Solution: Check with DB team and if the instance is up then rerun the session.
Issue:
Ticket to other teams: Raise an SC with the GAMS team.
Issue:
ORA-12545: Connect failed because target host or object does not exist
Database Error: Failed to connect to database using user [epsuser] and connection string
[ATLORP38]."
Ticket to other teams: Raise an SC with the ERP DB team for login details.
Issue:
Project Architecture:
What kind of enhancements you have worked on?
I have worked on many enhancements as per business requirements.
In some enhancements, we had sources as flat files, like .dat and .csv files.
We needed to develop a mapping to load the data from the source systems to the staging environment
after performing all field-level validations. From stage to the base table, we call a PL/SQL procedure
to load the data.
All file-level validations are done through UNIX shell scripting.
In a few enhancements, our source was a database. We needed to fetch the data from the table,
generate files, and send them to the integration team.
SESSSTARTTIME
SESSSTARTTIME returns the current date and time value on the node that runs the session
when the Integration Service initializes the session.
Use SESSSTARTTIME in a mapping or a mapplet. You can reference SESSSTARTTIME only
within the expression language.
SYSDATE
SYSDATE returns the current date and time up to seconds on the node that runs the session, for each
row passing through the transformation.
To capture a static system date, use the SESSSTARTTIME variable instead of SYSDATE.
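As a sketch, an output port in an Expression transformation could capture the session start time once per session with:

TO_CHAR(SESSSTARTTIME, 'MM/DD/YYYY HH24:MI:SS')

The same expression written with SYSDATE would instead be evaluated for every row passing through the transformation.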
What is the complex mapping you have developed?
Source qualifier
Expression
Lookup
Stored procedure
Filter
Update Strategy
Router
Union
Pushdown Optimization
When you run a session configured for pushdown optimization, the Integration Service translates
the transformation logic into SQL queries and sends the SQL queries to the database.
The following figure shows a mapping containing transformation logic that can be pushed to the source database:
This mapping contains an Expression transformation that creates an item ID based on the store number 5419 and the
item ID from the source. To push the transformation logic to the database, the Integration Service generates the
following SQL statement:

INSERT INTO T_ITEMS(ITEM_ID, ITEM_NAME, ITEM_DESC) SELECT CAST((CASE WHEN 5419 IS NULL
THEN '' ELSE 5419 END) + '_' + (CASE WHEN ITEMS.ITEM_ID IS NULL THEN '' ELSE
ITEMS.ITEM_ID END) AS INTEGER), ITEMS.ITEM_NAME, ITEMS.ITEM_DESC FROM ITEMS2 ITEMS
The Integration Service generates an INSERT SELECT statement to retrieve the ID, name, and description values from
the source table, create new item IDs, and insert the values into the ITEM_ID, ITEM_NAME, and ITEM_DESC columns
in the target table. It concatenates the store number 5419, an underscore, and the original ITEM ID to get the new
item ID.
Source-side pushdown optimization. The Integration Service pushes as much transformation logic
as possible to the source database.
Target-side pushdown optimization. The Integration Service pushes as much transformation logic
as possible to the target database.
Full pushdown optimization. The Integration Service attempts to push all transformation logic to
the target database. If the Integration Service cannot push all transformation logic to the database,
it performs both source-side and target-side pushdown optimization.
When you run a session configured for source-side pushdown optimization, the Integration Service
analyzes the mapping from the source to the target.
The Integration Service generates and executes a SELECT statement based on the transformation
logic for each transformation it can push to the database. Then, it reads the results of this SQL query
and processes the remaining transformations.
When you run a session configured for target-side pushdown optimization, the Integration Service
analyzes the mapping from the target to the source. It generates an INSERT, DELETE, or UPDATE
statement based on the transformation logic for each transformation it can push to the target
database.
The Integration Service processes the transformation logic up to the point that it can push the
transformation logic to the database. Then, it executes the generated SQL on the target database.
To use full pushdown optimization, the source and target databases must be in the same
relational database management system.
When you configure a session for full optimization, the Integration Service analyzes the
mapping from the source to the target or until it reaches a downstream transformation it
cannot push to the target database.
If the Integration Service cannot push all transformation logic to the target database, it tries
to push all transformation logic to the source database.
If it cannot push all transformation logic to the source or target, the Integration Service
pushes as much transformation logic to the source database, processes intermediate
transformations that it cannot push to any database, and then pushes the remaining
transformation logic to the target database.
The Integration Service generates and executes an INSERT SELECT, DELETE, or UPDATE
statement for each database to which it pushes transformation logic.
The Rank transformation cannot be pushed to the source or target database. If you configure the
session for full pushdown optimization, the Integration Service pushes the Source Qualifier
transformation and the Aggregator transformation to the source, processes the Rank
transformation, and pushes the Expression transformation and target to the target database. The
Integration Service does not fail the session if it can push only part of the transformation logic to the
database.
Target Load Types
1. Normal
2. Bulk
In Normal load, records are loaded into the target table one by one and a database log is generated
for them, but it takes more time.
In Bulk load, records are loaded into the target table in batches and the database log is bypassed.
It takes less time to load the data into the target table.
In Full load (one-time load or history load), the complete data from the source table is loaded into
the target table at one time. It truncates all rows and loads from scratch. It takes more time.
In Incremental load, the difference between the target and source data is loaded at regular intervals.
The timestamp of the previous load has to be maintained. It takes less time.
By Full load or one-time load we mean that all the data in the source table(s) is processed. This
usually contains historical data. Once the historical data is loaded, we keep on doing incremental loads to
process the data that comes in after the one-time load.
PMCMD Command
pmcmd is a command line utility provided by Informatica to perform the following tasks:
Start workflows.
Start workflow from a specific task.
Stop and abort workflows and sessions.
Schedule the workflows.
1. Starting a workflow
2. Stopping a workflow
3. Starting a workflow from a specified task
4. Stopping a task
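Sketches of the start and stop commands, mirroring the abort syntax shown below (the service, domain, user, folder, workflow, and task names are placeholders):

pmcmd startworkflow -service informatica-integration-Service -d domain-name -u user-name -p password -f folder-name -w workflow-name

pmcmd stopworkflow -service informatica-integration-Service -d domain-name -u user-name -p password -f folder-name -w workflow-name

pmcmd starttask -service informatica-integration-Service -d domain-name -u user-name -p password -f folder-name -w workflow-name task-name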
The following pmcmd commands are used to abort workflow and task in a workflow:
pmcmd abortworkflow -service informatica-integration-Service -d domain-name -u
user-name -p password -f folder-name -w workflow-name
The pmcmd command syntax for scheduling the workflow is shown below:
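A sketch, mirroring the syntax of the commands above (names are placeholders):

pmcmd scheduleworkflow -service informatica-integration-Service -d domain-name -u user-name -p password -f folder-name -w workflow-name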
You cannot specify the scheduling options here. This command just schedules the workflow for the next
run.
Partitioning In Informatica
Is used to improve performance in Informatica
It is done at session Level
Adding partitions in the pipeline
Use more of the system hardware
Achieve performance through parallel data processing
A pipeline consists of a source qualifier and all the transformations and Targets that receive
data from that source qualifier.
When the Integration Service runs the session, it can achieve higher performance by
partitioning the pipeline and performing the extract, transformation, and load for each
partition in parallel.
PARTITIONING ATTRIBUTES
1. Partition points
2. Number of Partitions
We can define up to 64 partitions at any partition point in a pipeline.
When we increase or decrease the number of partitions at any partition point, the
Workflow Manager increases or decreases the number of partitions at all Partition points in
the pipeline.
Increasing the number of partitions or partition points increases the number of threads.
The number of partitions we create equals the number of connections to the source or
target. For one partition, one database connection will be used.
3. Partition types
The Integration Service creates a default partition type at each partition point.
If we have the Partitioning option, we can change the partition type. This option is
purchased separately.
The partition type controls how the Integration Service distributes data among partitions at
partition points.
Types of Partition
Database Partitioning
In database partitioning, the Integration Service queries the Oracle database system for table
partition information. It reads the partitioned data from the corresponding nodes in the
database.
Hash auto-keys. The Integration Service uses a hash function to group rows of data among
partitions. The Integration Service groups the data based on a partition key.
Hash user keys. The Integration Service uses a hash function to group rows of data among
partitions. You define the number of ports to generate the partition key.
Key range. With key range partitioning, the Integration Service distributes rows of data
based on a port or set of ports that you define as the partition key. For each port, you define
a range of values. The Integration Service uses the key and ranges to send rows to the
appropriate partition. Use key range partitioning when the sources or targets in the pipeline
are partitioned by key range.
Pass-through. In pass-through partitioning, the Integration Service processes data without
redistributing rows among partitions. All rows in a single partition stay in the partition after
crossing a pass-through partition point. Choose pass-through partitioning when you want to
create an additional pipeline stage to improve performance, but do not want to change the
distribution of data across partitions.
Round-robin. The Integration Service distributes data evenly among all partitions. Use
round-robin partitioning when you want each partition to process approximately the same
number of rows.
New Features in Informatica 9.1
Database deadlock resilience feature - this will ensure that your session does not immediately fail if it
encounters a database deadlock; it will now retry the operation. You can configure the number of
retry attempts.
Lookups can now be configured as an active transformation to return multiple rows. We can configure
the Lookup transformation to return all rows that match a lookup condition.
You can limit the size of session logs for real-time sessions.
Passive transformation
We can configure the SQL transformation to run in passive mode instead of active mode. When the
SQL transformation runs in passive mode, the SQL transformation returns one output row for each
input row.
When a database deadlock error occurs, the session does not fail. The Integration Service attempts
to re-execute the last statement for a specified retry period.
You can configure the number of deadlock retries and the deadlock sleep interval for an Integration
Service. You can override these values at the session level as custom properties.
Configure following Integration Service Properties:
NumOfDeadlockRetries. The number of times the PowerCenter Integration Service retries a target
write on a database deadlock. Minimum is 0. Default is 10. If you want the session to fail on
deadlock set NumOfDeadlockRetries to zero.
DeadlockSleep. Number of seconds before the PowerCenter Integration Service retries a target
write on database deadlock.
If a deadlock occurs, the Integration Service attempts to run the statement. The Integration Service
waits for a delay period between each retry attempt. If all attempts fail due to deadlock, the session
fails. The Integration Service logs a message in the session log whenever it retries a statement.
DTM PROCESS
DTM means Data Transformation Manager. In Informatica, this is the main background process which
runs after the completion of the Load Manager.
When the PowerCenter Server runs a workflow, it initializes the Load Manager, and the Load Manager
is responsible for tasks such as locking the workflow and reading its properties, reading the
parameter file and expanding variables, creating the workflow log, and starting the DTM process,
which creates and manages the session threads.
Expression Transformation
It is passive and connected transformation.
It is used to calculate values on a single row.
Examples of calculations are concatenating the first and last name, adjusting the employee
salaries, converting strings to dates, etc.
Expression transformation can also be used to test conditional statements before passing
the data to other transformations.
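A minimal sketch, assuming input ports FIRST_NAME and LAST_NAME; an output port FULL_NAME in the Expression transformation would carry:

FIRST_NAME || ' ' || LAST_NAME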
Sorter Transformation
It is active and connected transformation.
It is used to sort the data in ascending or descending order.
The sorter transformation is used to sort the data from relational or flat file sources.
The sorter transformation can also be used for case-sensitive sorting and can be used to
specify whether the output rows should be distinct or not.
Use the sorter transformation before the aggregator and joiner transformation and sort the data for better
performance.
Filter Transformation
It is active and connected transformation.
It is used to filter out rows in the mapping.
Use the filter transformation as close as possible to the sources in the mapping. This will reduce the
number of rows to be processed in the downstream transformations.
In case of relational sources, if possible use the source qualifier transformation to filter the rows. This will
reduce the number of rows to be read from the source.
Note: The input ports to the filter transformation must come from a single transformation. You cannot
connect ports from more than one transformation to the filter.
Aggregator Transformation
It is active and connected transformation.
It is used to perform calculations such as sums, averages, counts on groups of data.
Aggregate Cache: The integration service stores the group values in the index cache and row data in the
data cache.
Aggregate Expression: You can enter expressions in the output port or variable port.
Group by Port: This tells the integration service how to create groups. You can configure input,
input/output or variable ports for the group.
Sorted Input: This option can be used to improve the session performance. You can use this option only
when the input to the aggregator transformation is sorted on the group by ports.
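As a sketch of an aggregate expression, assuming DEPT_ID is the group by port and SALARY is an input port, an output port TOTAL_SALARY would carry:

SUM(SALARY)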
Incremental Aggregation:
After you create a session that includes an Aggregator transformation, you can enable the session option,
Incremental Aggregation. When the Integration Service performs incremental aggregation, it passes source
data through the mapping and uses historical cache data to perform aggregation calculations incrementally.
Sorted Input:
You can improve the performance of aggregator transformation by specifying the sorted input. The
Integration Service assumes all the data is sorted by group and it performs aggregate calculations as it
reads rows for a group. If you specify the sorted input option without actually sorting the data, then
integration service fails the session.
Rank Transformation
It is an Active and Connected Transformation.
It is used to select top or bottom rank of data.
When we create RANK Transformation, by default it creates RANKINDEX port. This port is
used to store the ranking position of each row in the group.
In the ports tab, check the Rank (R) option for the port which you want to do ranking. You
can check the Rank (R) option for only one port. Optionally you can create the groups for
ranked rows. Select the Group By option for the ports that define the groups.
Top/Bottom: Specify whether you want to select the top or bottom rank of data.
Number of Ranks: Specify the number of rows you want to rank.
Rank Data Cache Size: The data cache size default value is 2,000,000 bytes. You can set a
numeric value, or Auto for the data cache size. In case of Auto, the Integration Service
determines the cache size at runtime.
Rank Index Cache Size: The index cache size default value is 1,000,000 bytes. You can set a
numeric value, or Auto for the index cache size. In case of Auto, the Integration Service
determines the cache size at runtime.
Stored Procedure Transformation
It is passive and can act as a connected or unconnected transformation.
Stored Procedure Transformation is used to run the stored procedure in the database.
Check the status of a target database before loading data into it.
Determine if enough space exists in a database.
Perform a specialized calculation.
Dropping and recreating indexes
It is not directly connected to other transformations in the mapping. It runs either before or after
the session or is being called by an expression in other transformation in the mapping.
The property "Stored Procedure Type" specifies when the stored procedure runs; its values (Normal,
Pre-load of the Source, Post-load of the Source, Pre-load of the Target, Post-load of the Target) are
described earlier in this document.
Update Strategy Transformation
In Informatica, you can set the update strategy at two different levels:
Session Level: Configuring at session level instructs the integration service to either treat all rows
in the same way (Insert or update or delete).
Mapping Level: Use update strategy transformation to flag rows for insert, update, delete or
reject.
Lookup Transformation
It is a passive/active transformation.
It can be used in both connected/unconnected modes.
It is used to look up data in flat file or relational database.
From Informatica 9 onwards, Lookup is an active transformation. It can return single row or
multiple rows.
Its typical uses (getting a related value, getting multiple values, performing calculations, and
updating slowly changing dimension tables) are described earlier in this document.
Connected or Unconnected lookup: A connected lookup receives source data, performs a lookup
and returns data to the pipeline. An unconnected lookup is not connected to source or target or any
other transformation. A transformation in the pipeline calls the lookup transformation with the :LKP
expression. The unconnected lookup returns one column to the calling transformation.
Cached or Un-cached Lookup: You can improve the performance of the lookup by caching the
lookup source. If you cache the lookup source, you can use a dynamic or static cache. By default, the
lookup cache is static and the cache does not change during the session. If you use a dynamic cache,
the integration service inserts or updates row in the cache. You can lookup values in the cache to
determine if the values exist in the target, then you can mark the row for insert or update in the
target.
Union Transformation
Union is an active transformation because it combines two or more data streams into one. Though
the total number of rows passing into the Union is the same as the total number of rows passing out
of it, and the sequence of rows from any given input stream is preserved in the output, the positions
of the rows are not preserved, i.e. row number 1 from input stream 1 might not be row number 1 in
the output stream. Union does not even guarantee that the output is repeatable.
Joiner Transformation
Important Notes
Drag the ports from the first source into the joiner transformation. By default, the Designer creates
the input/output ports for the source fields in the joiner transformation as detail fields.
Now drag the ports from the second source into the joiner transformation. By default, the Designer
configures the second source ports as master fields.
Join Condition
The integration service joins both the input sources based on the join condition. The join condition
contains ports from both the input sources that must match. You can specify only the equal (=)
operator between the join columns. Other operators are not allowed in the join condition. As an
example, if you want to join the employees and departments table then you have to specify the join
condition as department_id1= department_id. Here department_id1 is the port of departments
source and department_id is the port of employees source.
Join Type
Normal Join
Master Outer Join
Detail Outer Join
Full Outer Join
Each join type behaves exactly as illustrated in the students and subjects example earlier in this
document.
Normalizer Transformation
It is active and connected Transformation.
It returns multiple rows for a source row.
The GK field generates a sequence number starting from the value defined in the sequence field.
The GCID field holds the column number of the occurrence field.
Router Transformation
Router Transformation is an Active and Connected Transformation.
It is used to filter the data based on some condition.
In a filter transformation, you can specify only one condition, and it drops the rows that do not
satisfy the condition.
Whereas in a router transformation, you can specify more than one condition, and it provides
the ability to route the data that meets the test conditions.
Use a router transformation to test multiple conditions on the same input data. If you use more than
one filter transformation, the integration service needs to process the input for each filter
transformation. In case of a router transformation, the integration service processes the input data
only once, thereby improving performance.
Transaction Control Transformation
You can define transaction control at the following levels:
Mapping Level: Use the transaction control transformation to define the transactions.
Session Level: Specify the Commit Type in the Session Properties tab.
Use the following built-in variables in the expression editor of the transaction control
transformation:
TC_CONTINUE_TRANSACTION: The Integration Service does not perform any transaction change
for this row. This is the default value of the expression.
TC_COMMIT_BEFORE: The Integration Service commits the transaction, begins a new transaction,
and writes the current row to the target. The current row is in the new transaction.
TC_COMMIT_AFTER: The Integration Service writes the current row to the target, commits the
transaction, and begins a new transaction. The current row is in the committed transaction.
TC_ROLLBACK_BEFORE: The Integration Service rolls back the current transaction, begins a new
transaction, and writes the current row to the target. The current row is in the new transaction.
TC_ROLLBACK_AFTER: The Integration Service writes the current row to the target, rolls back the
transaction, and begins a new transaction. The current row is in the rolled back transaction.
If the transaction control transformation evaluates to a value other than the commit, rollback or
continue, then the integration service fails the session.
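A sketch of a transaction control expression, assuming an input port NEW_FILE_FLAG that marks the first row of each logical group:

IIF(NEW_FILE_FLAG = 'Y', TC_COMMIT_BEFORE, TC_CONTINUE_TRANSACTION)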
SQL Transformation
The SQL transformation processes SQL queries midstream in a pipeline.
Mapplet
It contains a set of transformations and lets us reuse that transformation logic in multiple
mappings.
It is a reusable object that we create in the Mapplet Designer.
Mapplet Input: Mapplet input can originate from a source definition or from an Input transformation within the mapplet.
Mapplet Output: Each mapplet must contain one or more Output transformations to pass data from the mapplet into the mapping.
Mapping Parameters and Mapping Variables
Mapping Parameters
Mapping Variables
Variable functions
Variable functions determine how the Integration Service calculates the current value of a mapping
variable in a pipeline.
SetMaxVariable: Sets the variable to the maximum value of a group of values. It ignores rows
marked for update, delete, or reject. Aggregation type set to Max.
SetMinVariable: Sets the variable to the minimum value of a group of values. It ignores rows marked
for update, delete, or reject. Aggregation type set to Min.
SetCountVariable: Increments the variable value by one. It adds one to the variable value when a
row is marked for insertion, and subtracts one when the row is marked for deletion. It ignores rows
marked for update or reject. Aggregation type set to Count.
SetVariable: Sets the variable to the configured value. At the end of a session, it compares the final
current value of the variable to the start value of the variable. Based on the aggregate type of the
variable, it saves a final value to the repository.
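As a sketch of how these functions are used, assume a mapping variable $$LAST_RUN_DATE and a source column LAST_UPDATE_DATE. The source qualifier filter could be LAST_UPDATE_DATE > '$$LAST_RUN_DATE', and an Expression transformation port could advance the variable with:

SETMAXVARIABLE($$LAST_RUN_DATE, LAST_UPDATE_DATE)

At the end of a successful session, the maximum date seen is saved to the repository and becomes the start value for the next run.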
SESSION TASK
A session is a set of instructions that tells the PowerCenter Server how and when to move
data from sources to targets.
To run a session, we must first create a workflow to contain the Session task.
We can run as many sessions in a workflow as we need.
We can run the Session tasks sequentially or concurrently, depending on our needs.
EMAIL TASK
The Workflow Manager provides an Email task that allows us to send email during a
workflow.
COMMAND TASK
The Command task allows us to specify one or more shell commands in UNIX or DOS commands in
Windows to run during the workflow.
For example, we can specify shell commands in the Command task to delete reject files, copy a
file, or archive target files.
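For example, a sketch of a post-session shell command that deletes reject files, using the built-in $PMBadFileDir directory (the .bad extension is the default for reject files):

rm -f $PMBadFileDir/*.bad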
1. Standalone Command task: We can use a Command task anywhere in the workflow or
worklet to run shell commands.
2. Pre- and post-session shell command: We can call a Command task as the pre- or post-
session shell command for a Session task. This is done in COMPONENTS TAB of a session. We can
run it in Pre-Session Command or Post Session Success Command or Post Session Failure
Command. Select the Value and Type option as we did in Email task.
Worklet
A worklet is an object that represents a set of tasks. It is created in the Worklet Designer to reuse
a set of workflow logic in multiple workflows.
Session Parameters
Session parameters represent values you might want to change between sessions, such as
a DB connection or a source file.
We can use session parameter in a session property sheet, then define the parameters in a
session parameter file.
The user defined session parameter are:
(a) DB Connection
(b) Source File directory
(c) Target file directory
(d) Reject file directory
Description:
Use session parameters to make sessions more flexible. For example, you have the same type
of transactional data written to two different databases, and you use the database
connections TransDB1 and TransDB2 to connect to the databases. You want to use the same
mapping for both tables.
Instead of creating two sessions for the same mapping, you can create a database
connection parameter, like $DBConnectionSource, and use it as the source database connection
for the session.
When you create a parameter file for the session, you set $DBConnectionSource to TransDB1
and run the session. After it completes, set the value to TransDB2 and run the session again.
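A sketch of the corresponding parameter file section (the folder, workflow, and session names are hypothetical):

[MyFolder.WF:wf_load_trans.ST:s_m_load_trans]
$DBConnectionSource=TransDB1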
NOTE:
You can use several parameters together to make session management easier.
Session parameters do not have default values; when the server cannot find a value for a
session parameter, it fails to initialize the session.
Workflow Variables
Predefined Workflow Variables:
Each workflow contains a set of predefined variables that you use to evaluate workflow and task
conditions. Use the following types of predefined variables:
Task-specific variables. The Workflow Manager provides a set of task-specific variables for
each task in the workflow. Use task-specific variables in a link condition to control the path
the Integration Service takes when running the workflow. The Workflow Manager lists task-
specific variables under the task name in the Expression Editor.
Built-in variables. Use built-in variables in a workflow to return run-time or system
information such as folder name, Integration Service Name, system date, or workflow start
time. The Workflow Manager lists built-in variables under the Built-in node in the Expression
Editor.
Condition (Decision, Integer):
Sample syntax: $Dec_TaskStatus.Condition = <TRUE | FALSE | NULL | any integer>

EndTime (All tasks, Date/Time): Date and time the associated task ended. Precision is to the second.
Sample syntax: $s_item_summary.EndTime > TO_DATE('11/10/2004 08:13:25')

ErrorCode (All tasks, Integer): Last error code for the associated task. If there is no error, the Integration Service sets ErrorCode to 0 when the task completes.
Sample syntax: $s_item_summary.ErrorCode = 24013
Note: You might use this variable when a task consistently fails with this final error message.

ErrorMsg (All tasks, Nstring): Last error message for the associated task. If there is no error, the Integration Service sets ErrorMsg to an empty string when the task completes. Variables of type Nstring can have a maximum length of 600 characters.
Sample syntax: $s_item_summary.ErrorMsg = 'PETL_24013 Session run completed with failure'
Note: You might use this variable when a task consistently fails with this final error message.

FirstErrorCode (Session, Integer): Error code for the first error message in the session. If there is no error, the Integration Service sets FirstErrorCode to 0 when the session completes.
Sample syntax: $s_item_summary.FirstErrorCode = 7086

FirstErrorMsg (Session, Nstring): First error message in the session. If there is no error, the Integration Service sets FirstErrorMsg to an empty string when the task completes. Variables of type Nstring can have a maximum length of 600 characters.
Sample syntax: $s_item_summary.FirstErrorMsg = 'TE_7086 Tscrubber: Debug info… Failed to evalWrapUp'

PrevTaskStatus (All tasks, Integer): Status of the previous task in the workflow that the Integration Service ran. Statuses include: ABORTED, FAILED, STOPPED, SUCCEEDED. Use these keywords when writing expressions to evaluate the status of the previous task.
Sample syntax: $Dec_TaskStatus.PrevTaskStatus = FAILED

SrcFailedRows (Session, Integer): Total number of rows the Integration Service failed to read from the source.
Sample syntax: $s_dist_loc.SrcFailedRows = 0

SrcSuccessRows (Session, Integer): Total number of rows successfully read from the sources.
Sample syntax: $s_dist_loc.SrcSuccessRows > 2500

StartTime (All tasks, Date/Time): Date and time the associated task started. Precision is to the second.
Sample syntax: $s_item_summary.StartTime > TO_DATE('11/10/2004 08:13:25')

Status (All tasks, Integer): Status of the previous task in the workflow. Statuses include: ABORTED, DISABLED, FAILED, NOTSTARTED, STARTED, STOPPED, SUCCEEDED. Use these keywords when writing expressions to evaluate the status of the current task.
Sample syntax: $s_dist_loc.Status = SUCCEEDED

TgtFailedRows (Session, Integer): Total number of rows the Integration Service failed to write to the target.
Sample syntax: $s_dist_loc.TgtFailedRows = 0

TgtSuccessRows (Session, Integer): Total number of rows successfully written to the target.
Sample syntax: $s_dist_loc.TgtSuccessRows > 0

TotalTransErrors (Session, Integer): Total number of transformation errors.
Sample syntax: $s_dist_loc.TotalTransErrors = 5
We can define events in the workflow to specify the sequence of task execution.
Types of Events:
Pre-defined event: A pre-defined event is a file-watch event. This event waits for a specified file
to arrive at a given location.
User-defined event: A user-defined event is a sequence of tasks in the workflow.
To create user-defined events:
2. Click Workflow -> Edit -> Events tab.
3. Click the Add button to add events and give the names as per need.
4. Click Apply -> OK. Validate the workflow and save it.
EVENT RAISE: The Event-Raise task represents a user-defined event. We use this task to raise the
user-defined event at a specific point in the workflow.
EVENT WAIT: The Event-Wait task waits for a file-watch event or a user-defined event to occur, and
then the Integration Service continues with the rest of the workflow.
Example 1: Use an Event-Wait task to make sure that session s_filter_example runs when the file abc.txt
arrives at the specified location.
TIMER TASK
The Timer task allows us to specify the period of time to wait before the Power Center Server runs
the next task in the workflow. The Timer task has two types of settings:
Absolute time: We specify the exact date and time or we can choose a user-defined
workflow variable to specify the exact time. The next task in workflow will run as per the
date and time specified.
Relative time: We instruct the Power Center Server to wait for a specified period of time
after the Timer task, the parent workflow, or the top-level workflow starts.
Example: Run session s_m_filter_example 1 minute after the Timer task starts (relative time).
DECISION TASK
The Decision task allows us to enter a condition that determines the execution of the
workflow, similar to a link condition.
The Decision task has a pre-defined variable called $Decision_task_name.condition that
represents the result of the decision condition.
The Power Center Server evaluates the condition in the Decision task and sets the pre-
defined condition variable to True (1) or False (0).
We can specify one decision condition per Decision task.
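A minimal sketch (the task and session names here are hypothetical): suppose a Decision task named Dec_CheckRows should let downstream tasks run only when an upstream session loaded at least one row. The decision condition could be

$s_load_stage.TgtSuccessRows > 0

and the link leaving the Decision task would then test the pre-defined variable:

$Dec_CheckRows.condition = 1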
CONTROL TASK
We can use the Control task to stop, abort, or fail the top-level workflow or the parent
workflow based on an input link condition.
A parent workflow or worklet is the workflow or worklet that contains the Control task.
We give the condition to the link connected to Control Task.
ASSIGNMENT TASK
The Assignment task allows us to assign a value to a user-defined workflow variable.
Parameter File
Parameters file provides us with the flexibility to change parameter and variable values
every time we run a session or workflow.
A parameter file contains a list of parameters and variables with their assigned values.
$$LOAD_SRC=SAP
$$DOJ=01/01/2011 00:00:01
[email protected]
Each heading section identifies the Integration Service, Folder, Workflow, Worklet, or Session to
which the parameters or variables apply.
[Global]
[Folder_Name.WF:Workflow_Name.WT:Worklet_Name.ST:Session_Name]
[Session_Name]
To assign a null value, set the parameter or variable value to <null> or simply leave the value
blank.
$PMBadFileDir=<null>
$PMCacheDir=
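Putting these elements together, a complete parameter file might look like the following sketch (the folder, workflow, and session names, and the $PMSuccessEmailUser entry, are made-up examples):

[Global]
$PMSuccessEmailUser=[email protected]
[MyFolder.WF:wf_daily_load]
$$LOAD_SRC=SAP
[MyFolder.WF:wf_daily_load.ST:s_m_load_employees]
$$DOJ=01/01/2011 00:00:01
$PMBadFileDir=<null>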
A mapping parameter represents a constant value that we can define before running a session. A mapping parameter retains the same value throughout the entire session. If we want to change the value of a mapping parameter between session runs, we need to update the parameter file.
A mapping variable represents a value that can change through the session. The Integration Service saves the value of a mapping variable to the repository at the end of each successful session run and uses that value the next time we run the session. Variable functions such as SetVariable, SetMaxVariable, SetMinVariable, and SetCountVariable are used to change the value of the variable during the session. At the beginning of a session, the Integration Service evaluates references to a variable to determine the start value. At the end of a successful session, the Integration Service saves the final value of the variable to the repository. The next time we run the session, the Integration Service resolves references to the variable to the saved value. To override the saved value, define the start value of the variable in the parameter file.
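A common incremental-load pattern (a sketch; the variable and port names are hypothetical): declare a mapping variable $$MAX_LOAD_DATE and, in an Expression transformation, evaluate

SETMAXVARIABLE($$MAX_LOAD_DATE, LAST_UPDATED_DATE)

for every row. At the end of a successful run, the Integration Service saves the largest LAST_UPDATED_DATE it saw to the repository, and the next run can reference $$MAX_LOAD_DATE in the source filter to read only newer rows.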
Constraint based Loading
Constraint based load ordering is used to load the data first in to a parent table and then in
to the child tables.
You can specify the constraint based load ordering option in the Config Object tab of the
session.
For every row generated by the active source, the integration service first loads the row into
the primary key table and then to the foreign key tables.
Constraint based loading is helpful for loading normalized targets from denormalized source data.
The constraint based load ordering option applies for only insert operations.
You cannot update or delete the rows using the constraint base load ordering.
You have to define the primary key and foreign key relationships for the targets in the target
designer.
The target tables must be in the same Target connection group.
There is a workaround to do updates and deletes using constraint based load ordering. Informatica PowerCenter provides an option called complete constraint-based loading for inserts, updates, and deletes in the target tables. To enable complete constraint based loading, specify FullCBLOSupport=Yes in the Custom Properties attribute on the Config Object tab of the session.
If you don't check the constraint based load ordering option, then the workflow will succeed in two cases:
1. When there is no primary key constraint on the departments table.
2. When you have only unique values of department id in the source.
If you have primary key and foreign key relation ship between the tables, then always you have to
insert a record into the parent table (departments) first and then the child table (employees).
Constraint based load ordering takes care of this.
Target Load Order
Target load order is used to specify the order in which the integration service loads the targets.
If you have multiple source qualifier transformations connected to multiple targets, you can specify the order in which the integration service loads the data into the targets.
A target load order group is the collection of source qualifiers, transformations, and targets linked together in a mapping.
The following figure shows the two target load order groups in a single mapping:
Target load order will be useful when the data of one target depends on the data of another target. For
example, the employees table data depends on the departments data because of the primary-key and
foreign-key relationship. So, the departments table should be loaded first and then the employees table.
Target load order is useful when you want to maintain referential integrity when inserting, deleting or
updating tables that have the primary key and foreign key constraints.
Incremental Aggregation in Informatica
Incremental Aggregation is the process of capturing the changes in the source and calculating the aggregations in a session. It makes the integration service update the target incrementally and avoids recalculating the aggregations on the entire source. Consider the sales table below as an example to see how incremental aggregation works.
Source:
YEAR PRICE
----------
2010 100
2010 200
2010 300
2011 500
2011 600
2012 700
For simplicity, I have used only the year and price columns of sales table. We need to do aggregation and
find the total price in each year.
When you run the session for the first time using incremental aggregation, the integration service processes the entire source and stores the data in two cache files, an index file and a data file. The integration service creates the files in the cache directory specified in the aggregator transformation properties.
After the aggregation, the target table will have the below data.
Target:
YEAR PRICE
----------
2010 600
2011 1100
2012 700
Now assume that the next day few more rows are added into the source table.
Source:
YEAR PRICE
----------
2010 100
2010 200
2010 300
2011 500
2011 600
2012 700
2010 400
2011 100
2012 200
2013 800
Now, for the second run, you have to pass only the new data changes to the incremental aggregation, so the source will contain only the last four records. The incremental aggregation uses the data stored in the cache and calculates the aggregation. Once the aggregation is done, the integration service writes the changes to the target and the cache. The target table will contain the below data.
Target:
YEAR PRICE
----------
2010 1000
2011 1200
2012 900
2013 800
Points to remember
1. When you use incremental aggregation, the first time you have to run the session with the complete source data; in the subsequent runs you pass only the changes in the source data.
2. Use incremental aggregation only if the target is not going to change significantly. If the incremental aggregation process changes more than half of the data in the target, then the session performance may not benefit. In that case, go for normal aggregation.
Before enabling the incremental aggregation option, make sure that you capture the changes in the source data. You can use a lookup transformation or a stored procedure transformation to remove the data which is already processed. You can also create a trigger on the source database and read only the source changes in the mapping.
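One simple way to pass only the changed rows (a sketch; the last_updated_date column is hypothetical, and $$LAST_RUN_TIME would be a mapping variable or parameter maintained between runs) is a source qualifier SQL override that filters on a last-modified timestamp:

SELECT year, price
FROM sales
WHERE last_updated_date > TO_DATE('$$LAST_RUN_TIME', 'MM/DD/YYYY HH24:MI:SS')

Only rows added or changed since the previous run then reach the aggregator, which is exactly what incremental aggregation expects.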
Performance Tuning
The performance tuning process identifies the bottlenecks and eliminates them to get a better ETL load time.
Tuning starts with the identification of bottlenecks in the source, target, and mapping, and goes further to session tuning.
It might also need tuning of the system resources on which the Informatica services run.
When a PowerCenter session is triggered, the integration service starts the Data Transformation Manager (DTM), which is responsible for starting the reader thread, transformation thread, and writer thread.
The reader thread is responsible for reading data from the sources. Transformation threads process data according to the transformation logic in the mapping, and the writer thread connects to the target and loads the data. Any data processing delay in these threads leads to a performance issue.
http://www.disoln.org/2013/08/Informatica-PowerCenter-Performance-Turning-A-to-Z-Guide.html
Source Bottlenecks
Performance bottlenecks can occur when the Integration Service reads from a source database.
Target Bottlenecks
When a target bottleneck occurs, the writer thread will not be able to free up space for the reader and transformer threads until the data is written to the target. So the reader and transformer threads have to wait for free blocks.
Mapping Bottlenecks
Complex or poorly written mapping logic can lead to a mapping bottleneck. With a mapping bottleneck, the transformation thread runs slower, causing the reader thread to wait for free blocks and the writer thread to wait for data to write.
Session Bottlenecks
If you do not have a source, target, or mapping bottleneck, you may have a session bottleneck. A session bottleneck normally occurs when the session memory configuration is not tuned correctly. This in turn leads to a bottleneck on the reader, transformation, or writer thread. A small cache size, low buffer memory, and a small commit interval can all cause session bottlenecks.
System Bottlenecks
After you tune the source, target, mapping, and session, consider tuning the system to prevent system
bottlenecks. The Integration Service uses system resources to process transformations, run sessions, and
read and write data. The Integration Service also uses system memory to create cache files for
data. Sessions that use a large number of sources and targets might require additional memory blocks.
Not having enough buffer memory for the DTM process can slow down the reading, transforming, or writing process. Adding extra memory blocks can keep the threads busy and improve session performance. You can do this by adjusting the buffer block size and the DTM buffer size.
1. Optimizing the Buffer Block Size
Depending on the source and target data, you might need to increase or decrease the buffer block size.
2. Increasing the DTM Buffer Size
When you increase the DTM buffer memory, the Integration Service creates more buffer blocks, which improves performance.
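For a back-of-the-envelope feel (the 64 KB block size and 12 MB DTM buffer below are example values, not recommended settings), the number of buffer blocks available to the threads is often estimated as roughly 0.9 * (DTM Buffer Size / Buffer Block Size):

0.9 * (12 MB / 64 KB) = 0.9 * 192 ≈ 172 blocks

so doubling the DTM buffer size roughly doubles the blocks the reader, transformation, and writer threads can share.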
II. Caches Memory Optimization
Transformations such as Aggregator, Rank, and Lookup use cache memory to store transformed data, which includes an index cache and a data cache. If the allocated cache memory is not large enough to store the data, the Integration Service stores the data in a temporary cache file. Session performance slows each time the Integration Service pages to the temporary cache file.
You can increase the allocated cache sizes so that the transformation is processed in cache memory itself and the integration service does not have to read from the cache file.
You can update the cache size in the session properties of the transformation.
III. Optimizing the Target
You can use bulk loading to improve the performance of a session that inserts a large amount of data into an Oracle or Microsoft SQL Server database. When bulk loading, the Integration Service bypasses the database log, which speeds performance. Without writing to the database log, however, the target database cannot perform rollback. As a result, you may not be able to perform recovery.
When you define key constraints or indexes in target tables, you slow the loading of data to those tables. To improve performance, drop indexes and key constraints before you run the session. You can rebuild them after the session completes.
The Integration Service performance slows each time it waits for the database to perform a checkpoint. To increase performance, consider increasing the checkpoint interval in the database.
If a session joins multiple source tables in one Source Qualifier, you might be able to improve performance by optimizing the query, for example with optimizer hints or indexes on the join columns.
Optimizing Transformations
Each transformation is different, and the tuning required differs from one transformation to another. But generally, you reduce the number of transformations in the mapping and delete unnecessary links between transformations to optimize it.
Partitioning: partitioning the session improves session performance by creating multiple connections to sources and targets and loading data in parallel pipelines.
Lookup Transformations: If the session contains a Lookup transformation, you can improve session performance by enabling the lookup cache. The cache improves speed by saving the previously looked-up data, so it does not need to be fetched again.
Filter Transformations: If your session contains a Filter transformation, place it as close to the sources as possible, or use a filter condition in the Source Qualifier.
Group transformations: Aggregator, Rank, and Joiner transformations may decrease session performance, because they must group data before processing it. To improve session performance in this case, use the Sorted Input option, i.e. sort the data before it reaches the transformation.
SCD
Slowly changing dimensions are dimensions that have data that changes slowly.
How to record such changes is a common concern in data warehousing.
To deal with this issue, we have the following SCD types:
1. SCD Type 1
2. SCD Type 2
3. SCD Type 3
SCD Problem:
Rajkumar is a customer with ABC Inc. He first lived in Chennai, so the original entry in the customer lookup table has the following record (the key value here is illustrative):

Customer_Id Name     City
1           Rajkumar Chennai

At a later date, he moved to Vizag in Dec 2008. How should ABC Inc. now modify its customer table to reflect this change? This is the "Slowly Changing Dimension" problem.
SCD Type 1: SCD type 1 methodology is used when there is no need to store historical data in
the dimension table. This method overwrites the old data in the dimension table with the new data.
It is used to correct data errors in the dimension.
Customer_Key Customer_Id Name     Location
------------------------------------------------
1            1           Marspton Illions

Here the customer name is misspelt. It should be Marston instead of Marspton. If you use the type 1 method, it simply overwrites the data. The data in the updated table will be:

Customer_Key Customer_Id Name    Location
------------------------------------------------
1            1           Marston Illions

The advantage of type 1 is ease of maintenance and less space occupied. The disadvantage is that no historical data is kept in the data warehouse.
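In SQL terms (a sketch; the table and column names follow the illustration above), a type 1 change is a plain overwrite:

UPDATE customer_dim
SET name = 'Marston'
WHERE customer_id = 1;

No new row is added, so the old value is lost.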
SCD Type 2
In type 2, you can store the data in three different ways. They are
Versioning
Flagging
Effective Date
SCD Type 2 Versioning: In the versioning method, a sequence number is used to represent the change. The latest sequence number always represents the current row, and the previous sequence numbers represent the past data.
As an example, let's use the same customer who changes location. Initially the customer is in Illions, and the data in the dimension table will look as:

Customer_Key Customer_Id Name    Location Version
--------------------------------------------------------
1            1           Marston Illions  1

The customer moves from Illions to Seattle and the version number is incremented. The dimension table will look as:

Customer_Key Customer_Id Name    Location Version
--------------------------------------------------------
1            1           Marston Illions  1
2            1           Marston Seattle  2

Now again, if the customer moves to another location, a new record will be inserted into the dimension table with the next version number.
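In SQL terms (same hypothetical table as above), a type 2 versioning change is an insert rather than an update:

INSERT INTO customer_dim (customer_key, customer_id, name, location, version)
VALUES (2, 1, 'Marston', 'Seattle', 2);

The old row stays untouched, which is what preserves the history.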
SCD Type 2 Flagging: In the flagging method, a flag column is created in the dimension table. The current record has the flag value 1 and the previous records have the flag value 0.
For the first time, the customer dimension will look as:

Customer_Key Customer_Id Name    Location Flag
--------------------------------------------------------
1            1           Marston Illions  1

When the customer moves to a new location, the old record is updated with flag value 0 and the latest record gets flag value 1:

Customer_Key Customer_Id Name    Location Flag
--------------------------------------------------------
1            1           Marston Illions  0
2            1           Marston Seattle  1
SCD Type 2 Effective Date: In the effective date method, the period of the change is tracked using start_date and end_date columns in the dimension table, for example:

Customer_Key Customer_Id Name    Location Start_Date  End_Date
-------------------------------------------------------------------------
1            1           Marston Illions  01-Jan-2008 20-Dec-2008
2            1           Marston Seattle  20-Dec-2008 NULL

The NULL in the End_Date indicates the current version of the data, and the remaining records indicate the past data.
SCD Type 3: In the type 3 method, only the current status and previous status of the row are maintained in the table. To track these changes, two separate columns are created in the table. The customer dimension table in the type 3 method will look as:

Customer_Id Name    Current_Location Previous_Location
--------------------------------------------------------------------------
1           Marston Illions          NULL

Let's say the customer moves from Illions to Seattle; the updated table will look as:

Customer_Id Name    Current_Location Previous_Location
--------------------------------------------------------------------------
1           Marston Seattle          Illions

Now again, if the customer moves from Seattle to NewYork, the updated table will be:

Customer_Id Name    Current_Location Previous_Location
--------------------------------------------------------------------------
1           Marston NewYork          Seattle

The type 3 method keeps only limited history; how much depends on the number of columns you create.
Things to know
In an SCD Type 0 dimension table, we keep the data as it is and it never changes.
SCD Type 4 provides a solution to handle rapid changes in dimension tables.
When you issue a stop command on a session, the integration service first stops reading data from the sources. It continues processing and writing data to the targets and then commits the data.
The abort command is handled the same way as the stop command, except that the abort command has a timeout period of 60 seconds. If the Integration Service cannot finish processing and committing data within the timeout period, it kills the DTM process and terminates the session.
Load Manager And DTM Process - Informatica
While running a Workflow, the PowerCenter Server uses the Load Manager process and the Data Transformation
Manager Process (DTM) to run the workflow and carry out workflow tasks. When the PowerCenter Server runs a
workflow, the Load Manager performs the following tasks (the standard list from the PowerCenter documentation):
1. Locks the workflow and reads the workflow properties.
2. Reads the parameter file and expands the workflow variables.
3. Creates the workflow log file.
4. Runs the workflow tasks and starts the DTM process to run sessions.
5. Sends post-session email if the DTM terminates abnormally.
When the PowerCenter Server runs a session, the DTM (Data Transformation Manager) performs the tasks described below.
DTM
Data Transformation Manager
The PowerCenter Integration Service process starts the DTM process to run a session.
The DTM is the process associated with the session task.
The DTM retrieves the mapping and session metadata from the repository and validates it.
If the session is configured for pushdown optimization, the DTM runs an SQL statement to push transformation logic to the source or target database.
If the workflow uses a parameter file, the PowerCenter Integration Service process sends the parameter file to the DTM when it starts the DTM. The DTM creates and expands session-level, service-level, and mapping-level variables and parameters.
The DTM creates logs for the session. The session log contains a complete history of the session run, including initialization, transformation, status, and error messages.
The DTM verifies that the user who started or scheduled the workflow has execute permissions for connection objects associated with the session.
After verifying connection object permissions, the DTM runs pre-session shell commands. The DTM then runs pre-session stored procedures and SQL commands.
After initializing the session, the DTM uses reader, transformation, and writer threads to extract, transform, and load data.
After the DTM runs the processing threads, it runs post-session SQL commands and stored procedures. The DTM then runs post-session shell commands.
When the session finishes, the DTM composes and sends email that reports session completion or failure.
Test Load
With a test load, the Integration Service reads and transforms data without writing it to the targets.
The Integration Service generates all session files and performs pre- and post-session functions.
For relational targets, the Integration Service writes the data but rolls it back when the session completes.
For other targets, the Integration Service does not write data at all.
Enter the number of source rows you want to test in the Number of Rows to Test field.
Requirement Gathering
Testing
Raise a Change Request with the Informatica Design team (L3 Informatica Architecture) (SLA is 3 days) and attach the below documents to the ticket:
1. CRF (workflows, sessions, mappings)
2. Load statistics (provided in a separate document)
3. Volumetrics (list of tables, frequency of data)
4. Session logs and workflow logs
Informatica Design team approval is needed:
- They review the session logs and provide the approval.
They will then assign the ticket to the Informatica Admin team (L2 Informatica admin team) (SLA is 2 days).
The Admin team will work on it, and only the changes listed in the CRF document will be migrated to QA.
Note
The process for code migration from QA to Production is the same as code migration from Dev to QA, except that IM approval is required here.
The minimum number of records we can test is 100.
Unit Testing
Integration Testing
What are the transformations we can not use in Mapplet?
Mapplet - A mapplet is a reusable object that can contain as many transformations as you need; it is a reusable object by default. The following cannot be used in a mapplet:
Target definitions - A mapplet is just a set of reusable transformations; it is not used to load data into a target. That is why target definitions are not allowed.
Pre- and post-session stored procedures - Pre-session and post-session stored procedures are configured at the session level against the target, so we cannot use them in a mapplet either.
Normalizer transformations - A mapplet is reusable logic that you can use across different mappings. The Normalizer is a dynamic transformation which converts rows to columns or vice versa; because its behavior depends on its input, it is not fixed logic that can be reused in other mappings.
Non-reusable Sequence Generator transformations - The mapplet is a reusable object, so if we use a sequence generator in our mapplet it must be reusable.
Other mapplets - A mapplet cannot contain another mapplet; Informatica presumably never saw a business scenario or functionality requirement for nesting, so the option is not provided.
Why should we not use bulk load on targets having indexes/constraints?
Bulk load
When bulk loading, the Integration Service bypasses the database log, and the database generally cannot maintain indexes and check constraints during the load. So if the target table has indexes or key constraints, the bulk load fails; we have to drop the indexes and constraints before the load and rebuild them afterwards.
Note
If your mapping has an Update Strategy transformation, your session will be data driven. In this case, even if you use BULK mode, Informatica will treat the load as a Normal load.
Number of Cached Values (Sequence Generator)
Cached values increase performance by reducing the number of times the Integration Service contacts the repository for the next sequence value.
Caching affects the sequence. Example: if it caches values 1 to 10 and the session completes at sequence 7, the remaining cached values are discarded. In the next run, the sequence starts from 11.
Code page
The code page in Informatica is used to specify the character encoding.
It is selected based on the source data.
If the source data contains Japanese characters, then a code page is selected that supports Japanese text.
The correct code page must be chosen to avoid data loss.
The most commonly selected encodings are ASCII, UTF-8, and UTF-32.
An encoding is the assignment of a number to each character in the character set.
You use code pages to identify data that might be in different languages. For example, if you
create a mapping to process Japanese data, you must select a Japanese code page for the
source data.
Q1. If a session fails after loading 10000 records in the target how can we
start loading into the target from 10001 records?
We can run the session with recovery strategy mode.
Resume from the last checkpoint. The Integration Service saves the session state of
operation and maintains target recovery tables.
Restart. The Integration Service runs the session again when it recovers the workflow.
Fail session and continue the workflow. The Integration Service cannot recover the session, but it continues the workflow. This is the default session recovery strategy.
Can we copy a session to a different folder or repository?
Yes. By using the copy session wizard, you can copy a session into a different folder or repository.
What is the command used to run a batch?
The pmcmd command-line program is used to start workflows (batches).
What are the types of threads created by the DTM?
Master thread
Mapping thread
Reader thread
Transformation thread
Writer thread
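For example (a sketch; the service, domain, user, folder, and workflow names are placeholders), a workflow can be started from the command line with pmcmd:

pmcmd startworkflow -sv IntSvc_Dev -d Domain_Dev -u admin -p password -f MyFolder wf_daily_load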
An active transformation can change the number of rows that pass through it.
A passive transformation does not change the number of rows that pass through it.
Run only on demand: The Informatica server runs the session only when the user starts the session explicitly.
Run once: Informatica server runs the session only once at a specified date and time.
Run every: Informatica server runs the session at regular intervals as you configured.
Customized repeat: Informatica server runs the session at the date and time specified in the repeat
dialog box.
How can we prevent the session log from being overwritten?
Run the session in timestamp mode; the session log is then saved with a timestamp and will not overwrite the previous session log.
What is the difference between Mapping and Mapplet?
A mapping is the collection of source definitions, target definitions, transformations, and/or mapplets, describing the complete flow from sources to targets. A mapplet is a reusable set of transformations only, with no sources or targets of its own.
What are the types of tasks in the Workflow Manager?
Assignment Task
Command Task
Control Task
Decision Task
E-mail Task
Event-Raise Task
Event-Wait Task
Session Task
Timer Task
Which Designer tool is used to create target definitions?
Warehouse Designer (later renamed Target Designer).
No
Why is the lookup cache useful?
If there are x records with y columns in the source and we need to extract only z columns (much fewer), the cache stores just those columns for the respective records in the $PMCACHEDIR of the Informatica server, so we do not need to fetch each record from the database again. This increases the performance of the system.
What is the use of a shared object?
If an object undergoes a change, the change would have to be propagated to each and every user. Instead, if the object is made shared, the update is done once on the object and all its users get the update.
Dual is a table created by Oracle along with the data dictionary. It has exactly one column, named DUMMY, and one record with the value 'X':

DUMMY
-----
X
No, we cannot install older versions of Informatica PowerCenter on Windows 7, but we can install Informatica 9.x on Windows 7.
An equi join in Oracle is performed on Oracle (relational) sources, while Informatica equi joins can be performed on non-relational sources too (e.g. Oracle and flat files).
What is a degenerate dimension?
A dimension which has no dimension table of its own and is derived from the fact table.
Requirement gathering is carried out by a Business Analyst. It is nothing but interacting with the end users and getting to know their requirements. Based on those requirements, the remaining phases - Analysis, Design, Implementation, Testing, and finally Maintenance - are carried out.
What is Junk dimension?
The dimension that is formed by lumping smaller dimensions together is called a junk dimension.
What is a Staging Area?
A staging area is a database where data from different source systems is brought together; this database acts as the input to data cleansing.
Joins in Oracle:
equi join
self join
outer join
Joins in Informatica:
normal join
master outer join
detail outer join
full outer join
What is the file extension or format of files for the Informatica Objects like sessions, mappings etc.
in Repository?
The format of files for Informatica Objects in Repository is XML
Where can we find Versioning in Informatica? What happens if Versioning is turned off?
In Informatica, we can find Versioning in Repository Manager. If Versioning is turned off, we will not
be able to track the changes for the respective Sessions/Mappings/Workflows.
In the Joiner transformation, we take the table with the lesser number of rows as the master and the one with more rows as the detail. Why?
In the joiner, each row of the master is compared with every row of the detail, so the fewer the rows in the master, the fewer the iterations and the better the performance of the system.
What are all the databases the Informatica server on Windows can connect to?
The Informatica server on Windows can commonly connect to SQL Server, Oracle, Sybase, Informix, DB2, Teradata, and MS Access/MS Excel sources.
What are the databases the Informatica server on UNIX can connect to?
The Informatica server on UNIX can connect to Oracle, Sybase, Informix, DB2, and Teradata.
In how many ways can we update a source definition?
Two ways:
We can reimport the source definition.
We can edit the source definition.
What is mapping?
Mapping is nothing but data flow between source and target.
What is a session?
Session is a set of instructions that tells the Informatica server when to and how to move the data
from source to targets.
If a session fails after loading 10000 records into the target how can we start loading into the
target from the 10001th record?
We can run the session with the recovery strategy mode.
GE MSAT Internal
How do we connect to sources?
By using ODBC if they are relational, and FTP if they are flat files.
What are the constants or flags for each database operation and their numeric equivalents in the Update Strategy?
Insert: DD_INSERT (0)
Update: DD_UPDATE (1)
Delete: DD_DELETE (2)
Reject: DD_REJECT (3)
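As a typical use (a sketch; the lookup port name is hypothetical), an Update Strategy expression can route rows based on whether a lookup found a match:

IIF(ISNULL(LKP_CUSTOMER_KEY), DD_INSERT, DD_UPDATE)

Rows with no match in the lookup are flagged for insert; matched rows are flagged for update.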
Can we generate reports in Informatica?
Yes. Using the Informatica Data Analyzer tool, we can generate reports.
What is a batch?
Batches provide a way to group sessions for either sequential or parallel execution by the Informatica server.
What are the objects we can create using the Designer?
Source definitions
Target definitions
Mappings
Mapplets
Transformations
What is Dimension table?
A dimension table is one that describes the business entities of an enterprise.
What are the types of files created by Informatica server during the session running?
Types of files created are
Cache file
Session log file
Informatica server log file
Output file
Reject file
What are the two types of processes that run the session?
The two types of processes that run the session are
Load Manager
DTM processes (Data Transformation Manager)
What is the difference between Lookup and Union transformations?
A Lookup transformation can look up source or target tables, while a Union transformation works only on sources.
In how many ways can you add ports?
Two ways
From other transformation
Click on add port button
How many sessions can you group in a batch?
Any number of sessions, but the fewer the sessions in a batch, the easier the migration.
What is the difference between Filter and Router transformations?
A Filter transformation works on a single condition only, while a Router transformation works on multiple conditions as well.
A Filter transformation gives only one output. A Router transformation can give more than one output.
What is the difference between Source Qualifier transformation and Joiner transformation?
A Source Qualifier transformation is used to join data from homogeneous sources, while a Joiner transformation is used to join data from heterogeneous sources as well as homogeneous sources from different schemas.
We need matching keys to join two relational sources in a Source Qualifier transformation; this is not the case with the Joiner transformation.
Which transformation should we use to normalize the COBOL and relational sources?
We need to make use of the Normalizer transformation.
What is the difference between Joiner and Lookup transformations?
A Joiner works on source data only, while a Lookup works on source as well as target data.
A Joiner transformation supports equi joins only, while a Lookup supports equi joins as well as non-equi joins.
What is the difference between connected and unconnected Lookup?
A connected lookup can return more than one column in a row, whereas an unconnected lookup returns only one column in each row.
What is a Mapplet?
Mapplet is an object which consists of set of reusable transformations which can be used in different
mappings.
What is the transformation used in loading 4 flat files of similar structure to a single target?
We can make use of the Union transformation.
In direct loading we can perform the recovery process, while in indirect loading we cannot.
Date=1/1/1753
What happens if flat files (which come through FTP) have not arrived?
The session is going to fail because of a fatal error.
Shared cache
What are the differences between a unique key and a primary key?
A primary key cannot contain null values, whereas a unique key can contain one (and only one) null value.
In SQL Server, with default options, a primary key is created as a clustered index while a unique key is created as a non-clustered index.
A unique key is similar to a primary key, but we can have more than one unique key per table.
RowId is the physical address of a row. If we know the RowId, we can read the entire row.
What are pseudo columns? What are the various types of pseudo columns?
Pseudo columns are columns which are not in the table but can be used in SQL queries as if they were part of the table. The pseudo columns in Oracle include:
RowNum
RowId
Sysdate
User
Currval
Nextval
What are the DML statements in Oracle?
Select
Update
Delete
Insert
Merge
How did you implement performance tuning in Informatica?
To answer this question, we can say that we did not specifically work on performance tuning, but while implementing the mappings we took care of it. Some of the steps are below.
Shared Cache
You can share the lookup cache between multiple lookup transformations.
You can configure multiple Lookup transformations in a mapping to share a single lookup
cache.
The Integration Service builds the cache when it processes the first Lookup transformation. It uses the same cache to perform lookups for subsequent Lookup transformations that share the cache.
You can share an unnamed cache between multiple Lookup transformations in the same mapping.
You can share a named cache between multiple Lookup transformations in the same mapping or in different mappings.
In development role:
Question: What is 3rd normal form? Give me an example of a situation where the tables are not in 3rd NF.
Answer: No column is transitively dependent on the PK. For example, suppose column2 is dependent on column1 (the PK) and column3 is dependent on column2; column3 is then "transitively dependent" on column1, so the table is not in 3rd NF. To make it 3rd NF we need to split it into 2 tables: table1, which has column1 & column2, and table2, which has column2 & column3.
Question: Tell me how to design a data warehouse, i.e. what are the steps of doing dimensional modeling?
Answer: There are many ways, but it should not be too far from this order: 1. Understand the business process. 2. Declare the grain of the fact table. 3. Create the dimension tables, including attributes. 4. Add the measures to the fact tables (from Kimball's Toolkit book, chapter 2). Steps 3 and 4 could be reversed (add the facts first, then create the dims), but steps 1 & 2 must be done in that order. Understanding the business process must always be the first step.
Declaring the grain means saying exactly what a fact table record represents. Remember that a fact table record captures a measurement. An example declaration of the grain: one row per line item on a customer's sales ticket.
Data cleansing is an important aspect of ETL. There are many cases where we get non-printable and special characters in the input file or table. Informatica regular expressions and functions are very handy here.
So here is how we can avoid non-printable and special characters when loading the target tables: use the REG_REPLACE function for handling non-printable characters and the REPLACESTR function for replacing multiple special characters in a field.
Examples:
Non-printable characters:
Syntax:
REG_REPLACE( subject, pattern, replace, numReplacements )
Sometimes if we use REG_REPLACE(PRODUCT_DESC, '[^[:print:]]', NULL), Informatica doesn't replace non-printable characters with NULLs, so it is better to use '' instead of NULL.
Special characters:
There are some cases where we are asked to replace special characters in a field. We might think of the REPLACECHR() function here, but it replaces only one specific character. When we need to replace a whole set of special characters in the input field, we go for the REPLACESTR() function, which can handle multiple special characters.
Syntax:
REPLACESTR( CaseFlag, InputString, OldString1, [OldString2, ..., OldStringN,] NewString )
REPLACESTR can remove a whole list of characters from the input field. For instance, take the expression
REPLACESTR(1, PRODUCT_DESC, '"', '.', '?', '#', '+', '/', '!', '^', '~', '`', '$', '%', '')
Here every occurrence of the listed special characters (" . ? # + / ! ^ ~ ` $ %) will be replaced with '', i.e. removed.
For example, if PRODUCT_DESC is 'ABC~`DEF^%GH$%XYZ#!' the output of the expression will be 'ABCDEFGHXYZ'. This is how we can handle special characters in the input field.
By using REG_REPLACE and REPLACESTR together, we can take care of both non-printable and special characters in the input field, like below:
REG_REPLACE(REPLACESTR(1, PRODUCT_DESC, '"', '.', '?', '#', '+', '/', '!', '^', '~', '`', '$', ''), '[^[:print:]]', '')
Important Note: Use a relational connection with code page UTF-8 here.
Hope this will help us all in data cleansing.
Happy learning!!!!
Correction and suggestions are always welcomed :)