
INFORMATICA PERFORMANCE TUNING GUIDE AND TIPS

By Pravin Ingale
pravin.ingale@in.ibm.com

AGENDA
Below are the goals of this document:
1. To provide useful guidelines to follow while designing or coding Informatica mappings, so that better-performing mappings are developed from the beginning of the project lifecycle. If developers follow these guidelines, by the end of the project they will already have taken care of the general performance issues that can occur with Informatica mappings. These guidelines are drawn from Informatica help files, previous project guidelines and my own project learnings.
2. To provide some useful tips that I have used in my projects to performance-tune ETL mappings. These are scenario-based tips and may not be applicable to every Informatica ETL project.
Note: The optimization tips were used with Informatica version 8.6 and may or may not be valid for other versions.

GUIDELINES FOR MAPPING DEVELOPMENT
1. Use Flat File sources and Targets:
Flat file sources and targets are generally faster than relational tables because accessing relational tables through PowerCenter requires a database connection and the associated database overhead.
2. Use Fixed-width files:
Fixed-width files are faster to load than delimited files because delimited files require extra parsing: the Integration Service spends more time and resources validating and parsing each record of the source file before reading it into its environment.
3. Use Local Files:
A flat file located on the Informatica server loads faster than a file located on a local machine, because local files require extra time to transfer to the server during mapping execution.
4. Reduce the number of transformations:
The more transformations a mapping contains, the more processing threads the Integration Service needs to create and manage.
5. Use of active transformations:
5.1. Use active transformations as early as possible in the mapping if they decrease the number of records read into the Informatica environment (i.e., place Filters and Aggregators as close to the source as possible).
5.2. Use active transformations as late as possible in the mapping if they increase the number of records (i.e., place Routers as close to the target as possible).
5.3. Use Incremental Aggregation if possible.
6. Calculate once, use many times (Facilitate reuse)
6.1. Avoid calculating or testing the same value over and over. Calculate it once in an expression and set a True/False flag.
6.2. Within an expression, use variable ports to calculate a value that can be used multiple times within that transformation (see the sketch below).
6.3. Make maximum use of reusable transformations.
6.4. Use mapplets to encapsulate multiple reusable transformations.

6.5. Use mapplets to leverage the work of critical developers and minimize mistakes when performing similar functions.
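As an illustration of 6.2, here is a minimal Expression transformation sketch with hypothetical port names: the test is evaluated once in a variable port and reused by the output ports instead of being repeated.

-- Variable port: evaluate the test once per row (hypothetical ports)
v_IS_PREMIUM = IIF(ANNUAL_SPEND > 100000 AND ACTIVE_FLAG = 'Y', 1, 0)

-- Output ports reuse the flag instead of repeating the test
o_DISCOUNT_PCT = IIF(v_IS_PREMIUM = 1, 0.15, 0.05)
o_SEGMENT_CODE = IIF(v_IS_PREMIUM = 1, 'PREM', 'STD')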

7. Use Normalizer
Use a Normalizer transformation to pivot rows rather than using multiple instances of the same target.
8. Only connect what is used
8.1. Delete unnecessary ports and links between transformations to minimize the amount of data moved, particularly in the Source Qualifier.
8.2. In Lookup transformations, change unused ports to be neither input nor output. This keeps the generated lookup SQL as small as possible, which cuts down on the amount of cache necessary and thereby improves performance.
9. Watch the data types
Sometimes data conversion is excessive. Data types are
automatically converted when types are different between
connected ports.
Minimize data type changes between
transformations by planning data flow prior to developing the
mapping.
10. Usage of Filter transformation
10.1. Use a Router instead of multiple Filter transformations.
10.2. Use a Filter instead of an Update Strategy when you are only inserting records into the target and rejected records can be ignored.

11. Utilize single-pass reads
Redesign mappings to utilize one Source Qualifier to populate multiple targets. This way the server reads the source only once. If you have different Source Qualifiers for the same source (e.g., one for delete and one for update/insert), the server reads the source once for each Source Qualifier.

12. Reduce field-level stored procedures
If you use field-level stored procedures, the Integration Service has to make a call to that stored procedure for every row, which slows performance.

13. Numeric operations are faster than string operations
Optimize char-varchar comparisons (i.e., trim spaces before comparing).
14. Operators are faster than functions
(i.e., || vs. CONCAT)
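For example, a concatenation written with the || operator and its equivalent nested-function form (hypothetical port names):

-- Operator form (preferred)
o_FULL_NAME = FIRST_NAME || ' ' || LAST_NAME

-- Equivalent function form (nested calls, slower)
o_FULL_NAME = CONCAT(CONCAT(FIRST_NAME, ' '), LAST_NAME)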

15. Date comparisons
Avoid date comparisons in lookups; replace them with string comparisons.

16. Use a Router instead of multiple Filters
Use a Router transformation to separate data flows instead of multiple Filter transformations, because each Filter has to process the full set of incoming data, which requires more time.

17. Uncheck the Forward Rejected Rows option in the Update Strategy if these rows are not critical
(When you create an Update Strategy transformation, the Forward Rejected Rows option is checked by default.)
18. Optimize Joiner
18.1. Normal joins are faster than outer joins and the resulting
set of data is also smaller.
18.2. Join sorted data when possible.
18.3. When using a Joiner Transformation, be sure to make the
source with the smallest amount of data the driving/Master
source.
18.4. For an unsorted Joiner transformation, designate as the
master source the source with fewer rows.
18.5. For a sorted Joiner transformation, designate as the
master source the source with fewer duplicate key values.

19. Optimize Source Qualifier
19.1. Keep the SQL override as small as possible; this cuts down on the amount of data read and thereby improves performance.
19.2. Prefer the Source Qualifier for filtering out unnecessary data if the source is not the bottleneck.
19.3. Use the Source Qualifier to do the join when sourcing data from the same database schema.
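As an illustration of 19.2 and 19.3, here is a sketch of a Source Qualifier SQL override that pushes the join and the filter to the database, assuming hypothetical CUSTOMER and ACCOUNT tables in the same schema (in an override, the selected columns must match the order of the connected Source Qualifier ports):

-- Hypothetical SQL override: join and filter in the database,
-- returning only the columns the mapping actually needs
SELECT c.CUST_ID,
       c.CUST_NAME,
       a.ACCT_ID,
       a.BALANCE
FROM   CUSTOMER c
JOIN   ACCOUNT  a ON a.CUST_ID = c.CUST_ID
WHERE  a.STATUS = 'OPEN'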

20. Use the equality operator (=) first while giving conditions
When using a Lookup transformation, improve lookup performance by placing all conditions that use the equality operator (=) first in the list of conditions on the Condition tab.

21. Cache lookup tables
21.1. Cache lookup tables ONLY if the number of lookup calls is more than 10 to 20 percent of the lookup table rows.
21.2. For a smaller number of lookup calls, do not cache if the number of lookup table rows is large.
21.3. For small lookup tables (i.e., fewer than 5,000 rows), cache if there are more than 5 to 10 lookup calls.

22. Replace lookup with DECODE or IIF (for small sets of values)
Use operators instead of functions in expressions.
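A minimal sketch of replacing a lookup on a small, static code table with DECODE in an Expression transformation (hypothetical ports and values):

-- Hypothetical replacement for a lookup against a tiny country-code table
o_COUNTRY_NAME = DECODE(COUNTRY_CODE,
                        'IN', 'India',
                        'US', 'United States',
                        'GB', 'United Kingdom',
                        'UNKNOWN')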
23. Use Dynamic Lookup
For very large lookup tables, use dynamic caching along with a persistent cache. Cache the entire table to a persistent file on the first run, enable the Update Else Insert option on the dynamic cache, and the engine will never have to go back to the database to read data from this table. You can also partition this persistent cache at run time for further performance gains.

24. Optimize targets
You can identify target bottlenecks by configuring the session to write to a flat file target. If the session performance increases significantly when you write to a flat file, you have a target bottleneck.
Consider performing the following tasks to increase performance:
24.1. Drop indexes and key constraints.
24.2. Increase checkpoint intervals.
24.3. Use bulk loading.
24.4. Use external loading.
24.5. Increase database network packet size.
24.6. Optimize target databases.
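As a sketch of 24.1, assuming a hypothetical target table TGT_ACCOUNT with a non-unique index IDX_TGT_ACCT_CUST, the index can be dropped in the session pre-SQL and rebuilt in the post-SQL so the load does not maintain it row by row:

-- Pre-session SQL (hypothetical object names)
DROP INDEX IDX_TGT_ACCT_CUST;

-- Post-session SQL: rebuild the index after the load completes
CREATE INDEX IDX_TGT_ACCT_CUST ON TGT_ACCOUNT (CUST_ID);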

25. Optimize sources
If the session reads from a relational source, you can use a Filter transformation, a read test mapping, or a database query to identify source bottlenecks:
Filter Transformation - measure the time taken to process a given amount of data, then add an always-false Filter transformation in the mapping after each Source Qualifier so that no data is processed past the filter. You have a source bottleneck if the new session runs in about the same time.
Read Test Session - compare the time taken to process a given set of data using the original session with the time for a session based on a copy of the mapping in which all transformations after the Source Qualifier are removed and the Source Qualifiers are connected to flat file targets. You have a source bottleneck if the new session runs in about the same time.
Consider performing the following tasks to increase performance:
25.1. Optimize the query.
25.2. Use conditional filters.
25.3. Increase database network packet size.
25.4. Connect to Oracle databases using IPC protocol.
26. How to find the bottleneck in a mapping:
The session log always contains thread summary records and can be a starting point for mapping performance analysis. At the bottom of the session log you can find a section similar to the sample below:
MASTER> PETL_24018 Thread [READER_1_1_1] created for the read stage of partition point [SQ_test_all_text_data] has completed: Total Run Time = [11.703201] secs, Total Idle Time = [9.560945] secs, Busy Percentage = [18.304876].
MASTER> PETL_24019 Thread [TRANSF_1_1_1_1] created for the transformation stage of partition point [SQ_test_all_text_data] has completed: Total Run Time = [11.764368] secs, Total Idle Time = [0.000000] secs, Busy Percentage = [100.000000].
MASTER> PETL_24022 Thread [WRITER_1_1_1] created for the write stage of partition point(s) [test_all_text_data1] has completed: Total Run Time = [11.778229] secs, Total Idle Time = [8.889816] secs, Busy Percentage = [24.523321].

This section gives statistics about the READER, TRANSFORMATION and WRITER threads. The thread with the highest busy percentage is the bottleneck. In the example above, the transformation stage is the bottleneck and needs to be optimized first.

Performance Tuning Tips


I have worked on multiple Informatica projects. There were situations where, because of the huge volume of data, our ETLs were not completing within the available batch run window and we had to run a separate project to performance-tune our ETLs. Listed below are a few of the techniques that we devised and implemented to achieve the desired results. These techniques were designed for the given scenarios and may not be applicable to other Informatica projects in general.
Scenario 1:
The project was intended to create a central repository of certified data
in the form of a Data Hub. This involved reading the month-end/day-end snapshot of data from discrete source systems and inserting or updating records into the Hub. Each source system used to send files/datasets ranging from 1 to 20 million records. The datasets would contain the entire dump from the source system, i.e., new records, updated records and unchanged records. The percentage of unchanged records was in the range of 60-80%. Our ETLs were reading the entire datasets (multiple times in multiple passes), comparing them against the targets and deciding whether to insert, update or skip records. Unchanged records were obviously skipped, but reading and comparing them still consumed time.
Solution - There were 2 options used wherever applicable:
i. Worked with the source systems to get only the changed data (CDCed), i.e., new or changed records only.
ii. For those source systems that were not ready to provide CDCed data, our team developed database scripts to CDC the data for those sources in the landing/staging area. This CDCed data was then the source for the ETLs rather than the original datasets sent by the source system.
Using this approach the ETL batch time could be brought down by 30%.
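The CDC scripts themselves are not reproduced here; a minimal sketch of the idea, assuming hypothetical staging tables holding the current and previous snapshots, is:

-- Keep only new or changed rows: rows whose full column image does not
-- exist in the previous snapshot (hypothetical tables and columns)
INSERT INTO STG_CUSTOMER_CDC (CUST_ID, CUST_NAME, CUST_STATUS)
SELECT t.CUST_ID, t.CUST_NAME, t.CUST_STATUS
FROM   STG_CUSTOMER_TODAY t
WHERE  NOT EXISTS (
         SELECT 1
         FROM   STG_CUSTOMER_PREV p
         WHERE  p.CUST_ID     = t.CUST_ID
         AND    p.CUST_NAME   = t.CUST_NAME
         AND    p.CUST_STATUS = t.CUST_STATUS
       );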
Scenario 2:
Each of the ETL mappings was doing a lookup on the target table and comparing all the looked-up non-key columns with the corresponding source columns to determine whether source records had to be updated in the target table or not. The target tables had a huge volume of records and all this data (rows x non-key columns) was getting cached. This created a bottleneck for parallel execution of mappings: the more mappings ran in parallel, the more cache was required, and after a point mappings were failing due to insufficient memory on the Informatica nodes.
Solution - Used the MD5 function.
MD5 is a one-way hash function that returns the MD5 hash value of the input value as a 32-character string of hexadecimal digits. This function is available in the Expression transformation.
The Hub data model was modified and an additional char column named MD5_Value was added to the identified tables. All the identified ETL mappings were modified. In the first Expression transformation an output port was created that received the MD5 value of all the non-key columns, concatenated in a defined sequence, to be written to the target table. This MD5 value was inserted into the MD5_Value column of the target tables. Secondly, the lookups were modified to extract and compare only the MD5_Value column from the lookup with the calculated MD5_Value of the source columns. Since only the MD5_Value column, instead of n lookup columns, was getting cached during mapping execution, the memory requirement of the ETL mappings came down significantly. There was no longer a limit on the number of mappings that could be executed in parallel in production. The data comparison in the expression to identify updates was also faster.
This helped to bring down the ETL batch execution window by having
all possible ETL mappings execute in parallel.
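A minimal sketch of the MD5 port described above, using hypothetical non-key ports and a '|' separator so that column boundaries stay unambiguous before hashing:

-- Variable port: hash of all non-key columns in a defined sequence
v_MD5_VALUE  = MD5(CUST_NAME || '|' ||
                   TO_CHAR(BIRTH_DATE, 'YYYYMMDD') || '|' ||
                   CUST_STATUS || '|' ||
                   TO_CHAR(CREDIT_LIMIT))

-- Output ports: value written to the target, and the change flag compared
-- against the MD5_Value returned by the lookup (lkp_MD5_VALUE)
o_MD5_VALUE  = v_MD5_VALUE
o_IS_CHANGED = IIF(v_MD5_VALUE != lkp_MD5_VALUE, 1, 0)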
Scenario 3:
In one of my projects, the source system was providing day-end
changes to our application Hub. There were 45 source tables from
which ETLs were loading data into 2 super-type and 4 sub-type tables.
The target tables had a huge volume of data and the ETL had to do lookups on the target tables in multiple passes to generate surrogate IDs, identify insert/update, etc. The lookups on the target tables were caching this huge volume of data and the lookup threads used to run for a long time to complete this caching.
Solution - Disabled the Lookup caching enabled property.
We saw that the number of changed records was around 1-2% of the total volume of records in the target tables. So we created 2 copies of each workflow. The original workflows were named <workflowname>_bulk and the new ones were named <workflowname>_incr (incremental).

We created 2 sets of scheduled jobs that would run all these 45-odd workflows, as BULK and INCR. Needless to say, all the _bulk workflows were added to the BULK job and all the _incr workflows were added to the INCR job. The only other difference was that we disabled the Lookup caching enabled property on the major lookups in the _incr workflows.

Since the daily changes were very small, we scheduled the INCR job to run daily at night. The BULK job was kept on-demand and was hardly run; it was used just once during production deployment (for the history data load) and planned for later execution whenever a bulk sync-up requirement came up in future. The target tables had no history maintenance requirement.
The INCR job now completed faster since the heavy lookup caching was omitted. Since the source records numbered in the few tens, or occasionally a hundred, there was no need to download all the target data into the lookup cache. For those few tens of source records, only that many lookup queries would hit the target tables.
Scenario 4:
In one project there was a set of account and customer tables that was loaded every month. These were month-end snapshot tables and contained month-end data for each account or customer for every month. E.g., there were around 35 million odd accounts and the table data was expected to grow every month. With this data growth the ETL performance was degrading every month.

Solution - Used table partitioning. Table partitioning is a data organization scheme in which table data is divided across multiple storage objects called data partitions or ranges according to values in one or more table columns. Each data partition is stored separately. These storage objects can be in different table spaces, in the same table space, or a combination of both.
All the target tables were partitioned on the month_end_date column using a PARTITION BY RANGE scheme. A database procedure named add_partition_by_month was created by the DBA. This procedure was called by the ETL in pre-session SQL as add_partition_by_month(<table_name>, <no of partitions to maintain>, <no of partitions to create>).
E.g., the ETL to load the account table would have pre-SQL as add_partition_by_month(account,36,2). This procedure used the system date as a reference to create or drop partitions. The procedure would maintain 36 partitions older than the system date and create 2 future partitions relative to the system date. If any of the required future partitions already existed, nothing was done. Also, nothing was done if the table contained 36 or fewer old partitions.
After execution of the procedure the ETL would start loading data into the appropriate partition based on the month_end_date column value in each record.
The ETL loading time decreased tremendously. Query execution time on the tables was also reduced.
A similar procedure was created for tables that were loaded on a daily basis.
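A minimal sketch of the range-partitioning idea, using a hypothetical ACCOUNT_SNAPSHOT table and Oracle-style syntax purely for illustration; the actual DDL and the body of the add_partition_by_month procedure depended on the database in use:

-- Hypothetical monthly range-partitioned snapshot table
CREATE TABLE ACCOUNT_SNAPSHOT (
    ACCOUNT_ID      NUMBER        NOT NULL,
    MONTH_END_DATE  DATE          NOT NULL,
    BALANCE         NUMBER(18,2)
)
PARTITION BY RANGE (MONTH_END_DATE) (
    PARTITION P_2023_01 VALUES LESS THAN (DATE '2023-02-01'),
    PARTITION P_2023_02 VALUES LESS THAN (DATE '2023-03-01')
);

-- Pre-session SQL on the Informatica session, as described above
CALL add_partition_by_month('ACCOUNT_SNAPSHOT', 36, 2);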
Scenario 5:
The ETL mappings in one of the projects were required to load ETL metrics data into tables, e.g., source read count, target insert count, target update count, target t2 insert count, etc. In the first phase of the project this was achieved through a mapplet and multiple target instances of the metric table (one for each source and target table). In the mapplet the record count was calculated using an Aggregator transformation. So each mapping had one mapplet and one metric table instance per source and target table. A lot of mappings were created in the project. Just imagine how complex the mappings looked, and the performance overhead of computing the row counts using multiple Aggregators.

Solution - Used Informatica built-in session parameters to provide the row counts.
Informatica built-in session parameters can be used, in the correct format, to get the row count of every source and target instance applied/affected by the session. For a source, the built-in parameter gives the number of rows extracted by the Source Qualifier.
E.g. $PMSourceQualifierName@numAppliedRows - Returns the number of
rows the Integration Service successfully read from the named Source
Qualifier.
$PMSourceQualifierName@numRejectedRows - Returns the number of
rows the Integration Service dropped when reading from the named
Source Qualifier.
$PMTargetName@numAppliedRows - Returns the number of rows the
Integration Service successfully applied to the named target instance.
$PMTargetName@numRejectedRows - Returns the number of rows the
Integration Service rejected when writing to the named target
instance.
These parameters can be used in email tasks/commands and post-session commands.
We wrote post-session commands using these session parameters to insert row counts into the metric tables, as sketched below.
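A minimal sketch of the idea, with a hypothetical metric table ETL_ROW_METRICS: the INSERT below would be issued from the post-session success command through the project's database client, with the Integration Service expanding the built-in parameters before the command runs (SQ_ACCOUNT and ACCOUNT_T stand for the actual Source Qualifier and target instance names of the session, and the session name is hypothetical):

-- Post-session metric insert (hypothetical table, session and instance names)
INSERT INTO ETL_ROW_METRICS
    (SESSION_NAME, SRC_READ_CNT, TGT_INSERT_CNT, LOAD_DATE)
VALUES
    ('s_m_load_account',
     $PMSQ_ACCOUNT@numAppliedRows,
     $PMACCOUNT_T@numAppliedRows,
     SYSDATE);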
Benefits:
1. Mappings looked less complicated.
2. Development time was reduced, as a post-session command was enough instead of a mapplet and a target instance of the metric table.
3. Mapping performance improved as the Aggregators used to count rows were eliminated.
Scenario 6:
Most of the ETL mappings were running slow and there was increasing pressure from the client to bring the total batch run-time down.
Solution - Used Informatica session partitioning.
The Informatica mappings were loading customer and account related data into super-type, sub-type and relationship tables. The source data came from around 30+ source systems at that time, and the count of records per source system differed. The Informatica sessions were partitioned on source system code, with 4 partitions created in the identified sessions. Analysis was done on what percentage of the total records came from each source system. Based on that information, a decision was made on which source systems to put into which of the 4 partitions, to make sure that all 4 partitions processed an almost equal number of records (if not the same). The session run time was brought down to almost 30% after session partitioning based on source system code.
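For illustration, with pass-through partitioning at the Source Qualifier and hypothetical source system codes, the per-partition source filter conditions could look like the following; the actual grouping was driven by the record-count analysis described above:

-- Partition 1: the single largest source system
SRC_SYS_CD = 'SYS_A'
-- Partition 2: two mid-sized source systems
SRC_SYS_CD IN ('SYS_B', 'SYS_C')
-- Partition 3: another group of mid-sized source systems
SRC_SYS_CD IN ('SYS_D', 'SYS_E', 'SYS_F')
-- Partition 4: all remaining, smaller source systems
SRC_SYS_CD NOT IN ('SYS_A', 'SYS_B', 'SYS_C', 'SYS_D', 'SYS_E', 'SYS_F')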
Scenario 7:
Various optimization techniques had been used to fine-tune mappings in a project, and afterwards it was realized that nothing more could be done within Informatica to bring down the total batch execution time.
Solution - Used the Informatica Pushdown Optimization technique in a few of the mappings.
To increase session performance, push transformation logic to the
source or target database. Based on the mapping and session
configuration, the Integration Service executes SQL against the source
or target database instead of processing the transformation logic
within the Integration Service.
You can push transformation logic to the source or target database
using pushdown optimization. When you run a session configured for
pushdown optimization, the Integration Service translates the
transformation logic into SQL queries and sends the SQL queries to the
database. The source or target database executes the SQL queries to
process the transformations.
The amount of transformation logic you can push to the database
depends on the database, transformation logic, and mapping and
session configuration. The Integration Service processes all
transformation logic that it cannot push to a database.
You can configure the following types of pushdown optimization:
Source-side pushdown optimization. The Integration Service
pushes as much transformation logic as possible to the source
database.
Target-side pushdown optimization. The Integration Service
pushes as much transformation logic as possible to the target
database.

Full pushdown optimization. The Integration Service attempts to push all transformation logic to the target database. If the Integration Service cannot push all transformation logic to the database, it performs both source-side and target-side pushdown optimization.
Limitations of Pushdown Optimization:
1. A long transaction uses more database resources.
2. A long transaction locks the database for a longer period of time. This reduces database concurrency and increases the likelihood of deadlock.
3. A long transaction increases the likelihood of an unexpected event.

I intend to keep adding my learnings to this document in the future. Please write to me at pravin.ingale@in.ibm.com if you find this document useful in your project implementations. Also let me know if you want any more information on these guidelines and tips.
