Data Warehouse Testing

White Paper Authored for: 13th International Conference, QAI

Author 1: Prasuna Potteti


Date: 11-Sep-2011
Email: ppotteti@deloitte.com
Deloitte Consulting India Private Limited
Divyasree Technopolis, 124, Yemlur P.O
Off Old Airport Road, Bengaluru 560037

Abstract
Testing is an essential part of the design life cycle of any software product. Data warehouse testing is particularly important because users need to trust the quality of the information they access. Drivers for this include the increase in enterprise mergers and acquisitions, compliance regulations, greater management focus on data, and data-driven decision making. A data warehouse is a collection of large amounts of data that management uses to make strategic decisions.
This paper looks at the different strategies for testing a data warehouse application. It suggests various approaches that can be beneficial when testing the ETL process in a DW. A data warehouse is a critical business application, and defects in it result in business losses that are difficult to quantify. Here, we walk you through some of the basic phases and strategies to minimize defects.
The focus is on the different components of data warehouse architecture and design, and on aligning the test strategy accordingly. Data storage has become cheaper and easier, and data-driven decisions have proved to be more accurate. In this context, testing data warehouse implementations is of the utmost significance. Organizational decisions depend entirely on enterprise data, so that data has to be of the utmost quality. Successful data warehousing helps investors, business leaders and project managers alike.

Intended Audience:
Test Practitioners and Engineers, Software and Test Managers, QA Managers and
Development Managers as well as other professionals interested in building and
delivering better software.

Objectives:
1. Provide an overview of data warehouse testing
2. Describe the data warehouse life cycle and the testing activities for DW projects
3. Outline the test data needs of DW projects
4. Address challenges in DW testing, such as voluminous data and heterogeneous sources
5. Highlight business case studies of successful DW testing and implementations

Table of Contents
1. Introduction to Data Warehouse
2. Data Warehouse Testing Approach
3. Methodological Framework
4. A Timeline for Testing
5. Types of Data Warehouse Testing
6. ETL Testing Check Points
7. Key Hot Points in ETL Testing
8. Database Testing vs. DW Testing
9. Challenges in DW Testing
10. Tools: Data Warehouse Testing
11. Case Study
    The Client
    The Challenge
    The Solution
    The Benefit
12. Conclusion and Lessons Learnt
Author Bio
Appendix

1. Introduction to Data Warehouse

Extract, Transform, Load (ETL) is the process that enables a business to consolidate its data while moving it from place to place, i.e., moving data from source systems into the data warehouse. The data can arrive from any source.
Extract: the process of reading data from a source.
Transform: the process of converting the extracted data from its previous form into the form needed so that it can be stored in the desired database. Transformation occurs using rules, lookup tables, and so on.
Load: the process of writing data into the target database.
ETL testing mainly deals with how, when, where and what data we carry into our data warehouse, from which the final reports are generated. Thus, ETL testing spans each and every stage of the data flow in the warehouse, from the source databases to the final target data warehouse.
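As a minimal illustration of the three steps, consider the Python sketch below. The table names, columns and lookup rule are hypothetical, not drawn from any particular implementation; the point is only the extract-transform-load shape that all the tests in this paper exercise.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE src_orders (id INTEGER, country_code TEXT, amount REAL)")
    conn.execute("CREATE TABLE dw_orders (id INTEGER, country TEXT, amount REAL)")
    conn.executemany("INSERT INTO src_orders VALUES (?, ?, ?)",
                     [(1, "IN", 120.0), (2, "US", 75.5)])

    def extract(conn):
        # Extract: read raw rows from the source table
        return conn.execute("SELECT id, country_code, amount FROM src_orders").fetchall()

    def transform(rows, country_lookup):
        # Transform: resolve codes through a lookup table, per the business rule
        return [(rid, country_lookup.get(code, "UNKNOWN"), amount)
                for rid, code, amount in rows]

    def load(conn, rows):
        # Load: write the transformed rows into the target table
        conn.executemany("INSERT INTO dw_orders VALUES (?, ?, ?)", rows)
        conn.commit()

    load(conn, transform(extract(conn), {"IN": "India", "US": "United States"}))
    print(conn.execute("SELECT * FROM dw_orders").fetchall())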

[Figure: Data warehouse architecture. Data flows from the data sources through the staging area and the operational data source (ODS) into the dimensional data warehouse, from which reporting (Report 1, Report 2, Report 3) is served.]

2. Data Warehouse Testing Approach

There are at least two approaches that can be considered for data validation:
Approach 1: Validate data movement directly from source to data warehouse. This approach validates that data in the source is moved to the data warehouse according to business rules.
Approach 2: Validate data movement through each step of the ETL and then into the data warehouse. Data is validated at each of the following steps:
o Source data stores to staging tables
o If an ODS is used, staging tables to the ODS
o Staging tables or ODS to the data warehouse
Which approach to follow depends on the timeline. Approach 1 takes less time to script and execute, but it is not recommended: if issues are uncovered, it will be more difficult and time-consuming to determine their origin. If time and resources are available, it is always advisable to follow Approach 2, which helps determine where and when issues arise. By validating each step of the ETL, we confirm that data is being moved and transformed as expected at every stage (a sketch of such stepwise validation follows the list below).
Testing in this approach focuses on the following points:
Extraction of source data
Population of staging tables
Transformation of data
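A sketch of the stepwise validation under Approach 2 follows, assuming hypothetical stage tables named src_customers, stg_customers and dw_customers; it compares row counts at each hop so a discrepancy points directly at the step that introduced it.

    import sqlite3

    STAGES = ["src_customers", "stg_customers", "dw_customers"]  # hypothetical stage tables

    def validate_counts(conn, stages):
        # Validate each hop separately. Where business rules legitimately
        # filter rows, the expected counts must be adjusted accordingly.
        counts = {t: conn.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0]
                  for t in stages}
        for upstream, downstream in zip(stages, stages[1:]):
            assert counts[upstream] == counts[downstream], (
                f"count mismatch: {upstream}={counts[upstream]}, "
                f"{downstream}={counts[downstream]}")
        return counts

    conn = sqlite3.connect(":memory:")
    for t in STAGES:
        conn.execute(f"CREATE TABLE {t} (id INTEGER)")
        conn.executemany(f"INSERT INTO {t} VALUES (?)", [(i,) for i in range(100)])
    print(validate_counts(conn, STAGES))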

3. Methodological Framework
Below are the stages involved in designing a data warehouse and its ETL process:
Requirements elicitation: Requirements are elicited from users and represented either informally or formally.
Analysis and reconciliation: Data sources are inspected, normalized, and integrated to obtain a reconciled schema.
Conceptual design: A conceptual schema for the data mart is designed considering both user requirements and the data available in the reconciled schema.
Logical design: A logical schema for the data mart is obtained by properly translating the conceptual schema.
Data staging design: ETL procedures are designed considering the source schema, the reconciled schema and the data mart logical schema.
Physical design: This includes index selection, schema fragmentation, and all other issues related to physical allocation.
Implementation: This includes the implementation of ETL procedures and the creation of front-end reports.

4. A Timeline for Testing
From a methodological point of view, the three main phases of testing are:
Create a test plan. The test plan describes the tests that must be performed and their expected coverage of the system requirements.
Prepare test cases. Test cases enable the implementation of the test plan by detailing the testing steps together with their expected results. The reference databases for testing should be prepared during this phase, and a wide, comprehensive set of representative workloads should be defined.
Execute tests. A test execution log tracks each test along with its results.

5. Types of Data Warehouse Testing

Below are the types of testing that are suitable for data warehouse projects.
Unit testing: Traditionally this has been the task of the developer. This is white-box testing to ensure the module or component is coded as per the agreed-upon design specifications. The developer should focus on the following:
a) All inbound and outbound directory structures are created properly, with appropriate permissions and sufficient disk space, and all tables used during the ETL are present with the necessary privileges.
b) The ETL routines give expected results:
   i. All transformation logic works as designed from source through to target
   ii. Boundary conditions are satisfied, e.g. checks for date fields with leap-year dates
   iii. Surrogate keys have been generated properly
   iv. NULL values have been populated where expected
   v. Rejects have occurred where expected, and the reject log is created with sufficient detail
   vi. Error recovery methods behave as designed
   vii. Auditing is done properly
c) The data loaded into the target is complete:
   i. All source data that is expected to be loaded into the target actually gets loaded; compare counts between source and target and use data profiling tools
   ii. All fields are loaded with their full contents, i.e. no data field is truncated while transforming
   iii. No duplicates are loaded
   iv. Aggregations take place in the target properly
   v. Data integrity constraints are properly taken care of
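Several of these checks can be automated as developer unit tests. A pytest-style sketch follows; the dw_orders table and its columns are hypothetical stand-ins for the development target schema.

    import sqlite3
    import pytest

    @pytest.fixture
    def conn():
        # In real unit tests this would connect to the development target schema.
        c = sqlite3.connect(":memory:")
        c.execute("CREATE TABLE dw_orders (order_key INTEGER, customer_name TEXT)")
        c.executemany("INSERT INTO dw_orders VALUES (?, ?)",
                      [(1, "Acme Ltd"), (2, "Globex"), (3, "Initech")])
        return c

    def test_no_duplicate_surrogate_keys(conn):
        total = conn.execute("SELECT COUNT(*) FROM dw_orders").fetchone()[0]
        distinct = conn.execute(
            "SELECT COUNT(DISTINCT order_key) FROM dw_orders").fetchone()[0]
        assert total == distinct, "duplicate surrogate keys were loaded"

    def test_no_unexpected_nulls(conn):
        nulls = conn.execute(
            "SELECT COUNT(*) FROM dw_orders WHERE customer_name IS NULL").fetchone()[0]
        assert nulls == 0, "NULLs found in a mandatory field"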
System testing: Generally the QA team owns this responsibility. For them the design document is the bible, and the entire set of test cases is directly based upon it. Here we test the functionality of the application, mostly as a black box. The major challenge is the preparation of test data: wherever possible use production-like data, and you may also use data generation tools, or customized tools of your own, to create test data. We must test for all possible combinations of input and specifically check for errors and exceptions. An unbiased approach is required to ensure maximum efficiency. Knowledge of the business process is an added advantage, since we must be able to interpret the results functionally and not just code-wise.
The QA team must test for:

   i. Data completeness: match source-to-target counts
   ii. Data aggregations: match aggregated data against staging tables and/or the ODS
   iii. Granularity of data is as per specifications
   iv. Error logs and audit tables are generated and populated properly
   v. Notifications to IT and/or business are generated in the proper format

Regression testing: A DW application is not a one-time solution. It is possibly the best example of an incremental design, where requirements are enhanced and refined quite often based on business needs and feedback. In such a situation it is very critical to test that the existing functionality of a DW application is not broken whenever an enhancement is made to it. Generally this is done by running all functional tests for the existing code whenever a new piece of code is introduced. However, a better strategy is to preserve earlier test input data and result sets and run the same tests again, comparing the new results against the older ones to ensure proper functionality.
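One way to realize the preserve-and-compare strategy is sketched below, assuming result sets small enough to serialize to a JSON file; fetch_current_results is a hypothetical stand-in for re-running the preserved inputs through the enhanced ETL.

    import json
    from pathlib import Path

    BASELINE = Path("baseline_results.json")  # preserved output of the previous release

    def fetch_current_results():
        # Placeholder: re-run the preserved test inputs through the enhanced
        # ETL and return the result set in a canonical, sorted form.
        return [{"region": "APAC", "total": 1200}, {"region": "EMEA", "total": 950}]

    current = fetch_current_results()
    if BASELINE.exists():
        baseline = json.loads(BASELINE.read_text())
        # Any difference flags a regression introduced by the enhancement.
        assert current == baseline, "regression: results differ from the preserved baseline"
    else:
        # First run: record the output as the baseline for future regression runs.
        BASELINE.write_text(json.dumps(current, indent=2))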
Integration testing: This is done to ensure that the application works from an end-to-end perspective. Here we must consider the compatibility of the DW application with upstream and downstream flows, and we need to ensure data integrity across the flow. Our test strategy should include testing for:
   i. The sequence of jobs to be executed, with job dependencies and scheduling
   ii. Restartability of jobs in case of failures
   iii. Generation of error logs
   iv. Cleanup scripts for the environment, including the database
This activity is a combined responsibility, and the participation of experts from all related applications is a must in order to avoid misinterpretation of results.
Acceptance testing: This is the most critical part, because here the actual users validate your output datasets. They are the best judges of whether the application works as they expect. However, business users may not have deep ETL knowledge, so the development and test teams should be ready to answer their questions about how the ETL process populates the data. The test team must have sufficient business knowledge to translate the results into business terms. The load windows, the refresh period for the DW and the views created should also be signed off by the users.
Performance testing: In addition to the above tests, a DW must necessarily go through a performance testing phase. Any DW application is designed to be scalable and robust, so when it goes into the production environment it should not cause performance problems. Here, we must test the system with huge volumes of data and ensure that the load window is met even under such volumes. This phase should involve the DBA team, ETL experts and others who can review and validate the code for optimization.
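A simple sketch of a load-window check follows; the window length and the load trigger are hypothetical placeholders for the agreed window and the production-volume ETL batch.

    import time

    LOAD_WINDOW_SECONDS = 4 * 3600  # hypothetical agreed load window

    def run_full_volume_load():
        # Placeholder: trigger the ETL batch against production-like volumes.
        time.sleep(0.1)

    start = time.monotonic()
    run_full_volume_load()
    elapsed = time.monotonic() - start
    assert elapsed <= LOAD_WINDOW_SECONDS, f"load window exceeded: {elapsed:.0f}s"
    print(f"full-volume load completed in {elapsed:.1f}s")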

6. ETL Testing Check Points

The following is a checklist of testing standards that should be followed to ensure complete and exhaustive testing of the ETL process in a data warehousing project. The overall testing process is divided into testing for data completeness, data transformation, data quality and metadata; the points to take care of in each are as follows:
Data Completeness: One of the most basic tests of data completeness is to verify that all expected data loads into the data warehouse. This includes validating that all records, all fields and the full contents of each field are loaded. Strategies to consider include:
Comparing record counts between the source database, the staging tables and the target DW during full-load testing.
Comparing unique values of key fields between the source database, the staging database and the target DW.
Validating the full contents of each field to confirm that no truncation occurs.
Testing the boundaries of each field to find any database limitations.
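The count, key and truncation strategies above translate into simple SQL probes. A sketch follows; all table and column names are passed in as parameters since they are project-specific assumptions.

    import sqlite3

    def completeness_checks(conn, source, target, key, text_col, declared_len):
        # Record counts must match for a full load.
        src_n = conn.execute(f"SELECT COUNT(*) FROM {source}").fetchone()[0]
        tgt_n = conn.execute(f"SELECT COUNT(*) FROM {target}").fetchone()[0]
        assert src_n == tgt_n, f"row counts differ: {src_n} vs {tgt_n}"
        # Every unique key value in the source must survive the load.
        missing = conn.execute(
            f"SELECT COUNT(*) FROM {source} s LEFT JOIN {target} t "
            f"ON s.{key} = t.{key} WHERE t.{key} IS NULL").fetchone()[0]
        assert missing == 0, f"{missing} source keys absent from target"
        # Values sitting exactly at the declared column length may have been
        # truncated and deserve manual inspection.
        return conn.execute(
            f"SELECT COUNT(*) FROM {target} WHERE LENGTH({text_col}) = ?",
            (declared_len,)).fetchone()[0]

In practice the returned at-limit count is reviewed manually, since a value exactly at the declared length may or may not have been truncated.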
Data Transformation: Validating that data is transformed correctly based on business rules can be the most complex part of testing an ETL application with significant transformation logic. Useful methods include:
Picking some sample records and "staring and comparing" to validate data transformations manually.
Creating a spreadsheet of input data and expected results and validating these against the output of the test scripts (see the sketch below).
During incremental testing, creating test data that covers all scenarios that can occur in the source stage.
Setting up data scenarios that test referential integrity between tables.
Validating parent-to-child relationships in the data, and setting up data scenarios that test how orphaned child records are handled.
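The spreadsheet method lends itself to automation: export the spreadsheet to CSV and replay each row through the transformation under test. A sketch, with a hypothetical discount rule standing in for real business logic:

    import csv
    import io

    def apply_discount(amount, customer_tier):
        # Hypothetical transformation rule standing in for real business logic.
        return round(amount * 0.9, 2) if customer_tier == "GOLD" else amount

    # In practice this CSV would be exported from the analysts' spreadsheet.
    expected_cases = io.StringIO(
        "amount,customer_tier,expected\n"
        "100.00,GOLD,90.00\n"
        "100.00,SILVER,100.00\n")

    for row in csv.DictReader(expected_cases):
        actual = apply_discount(float(row["amount"]), row["customer_tier"])
        assert actual == float(row["expected"]), f"transformation mismatch for {row}"
    print("all transformation cases passed")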
Data Quality: Data quality testing deals with "how the ETL system handles staging-table data rejection, substitution, correction and notification without modifying the data." Typical rules to verify:
Reject the record if a certain decimal field has non-numeric data.
Substitute NULL if a certain decimal field has non-numeric data.
Compare loaded values to values in a lookup table.
Determine and test the exact points at which to reject data and where to send it for error processing.
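The reject and substitute rules above can be verified directly against the cleansing routine. A sketch, assuming a hypothetical cleanse_decimal function configurable by policy:

    def cleanse_decimal(raw, policy="reject"):
        # The ETL either rejects non-numeric input or substitutes NULL (None),
        # depending on the configured data quality policy.
        try:
            return float(raw), "ok"
        except (TypeError, ValueError):
            return (None, "substituted") if policy == "substitute" else (None, "rejected")

    # A non-numeric value in a decimal field must be rejected...
    assert cleanse_decimal("abc", policy="reject") == (None, "rejected")
    # ...or substituted with NULL, depending on the rule under test.
    assert cleanse_decimal("abc", policy="substitute") == (None, "substituted")
    assert cleanse_decimal("12.5") == (12.5, "ok")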

7. Key Hot Points in ETL Testing
There are several levels of testing that can be performed during data warehouse testing, and they should be defined as part of the testing strategy in the different phases of testing. Some examples are given below.
Constraint Testing: During constraint testing, the objective is to validate unique constraints, primary keys, foreign keys, indexes, and relationships. The test scripts should include these validation points. Some ETL processes can be developed to validate constraints during the loading of the warehouse; if the decision is made to add constraint validation to the ETL process, the ETL code must validate all business rules and relational data requirements.
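Orphaned foreign keys are among the constraint violations most worth scripting. A sketch with hypothetical fact and dimension tables; the planted key 99 is deliberately left unresolvable so the probe has something to flag:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY);
        CREATE TABLE fact_sales (sale_id INTEGER, customer_key INTEGER);
        INSERT INTO dim_customer VALUES (1), (2);
        INSERT INTO fact_sales VALUES (10, 1), (11, 2), (12, 99);
    """)

    # Every foreign key in the fact table must resolve to a dimension row;
    # the planted key 99 should be reported as an orphan.
    orphans = conn.execute("""
        SELECT f.sale_id, f.customer_key
        FROM fact_sales f
        LEFT JOIN dim_customer d ON f.customer_key = d.customer_key
        WHERE d.customer_key IS NULL
    """).fetchall()
    print("orphaned fact rows:", orphans)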
Source to Target Counts: The objective of the count test scripts is to determine if
the record counts in the source match the record counts in the target. Some ETL
processes are capable of capturing record count information such as records read,
records written, records in error, etc.
Source to Target Data Validation: No ETL process is smart enough to perform source-to-target field-to-field validation on its own. This piece of the testing cycle is the most labor-intensive and requires the most thorough analysis of the data. A variety of tests can be performed during source-to-target validation; the following are best practices:
Threshold testing
Field-to-field testing
Initialization
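Field-to-field testing can be scripted by joining source and target on the business key and comparing each mapped column. A sketch; the field map, table names and key are hypothetical assumptions:

    import sqlite3

    FIELD_MAP = {"cust_name": "customer_name", "cust_city": "city"}  # source -> target

    def field_to_field_diffs(conn, key="cust_id"):
        # Join source and target on the business key and flag every row
        # where a straight-moved field disagrees after the load.
        diffs = []
        for src_col, tgt_col in FIELD_MAP.items():
            rows = conn.execute(
                f"SELECT s.{key} FROM src_customers s "
                f"JOIN dw_customers t ON s.{key} = t.{key} "
                f"WHERE s.{src_col} <> t.{tgt_col}").fetchall()
            diffs.extend((src_col, r[0]) for r in rows)
        return diffs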
Transformation and Business Rules: Tests to verify all possible outcomes of the transformation rules, default values and straight moves, as specified in the business specification document. As a special mention, boundary conditions must be tested on the business rules.
Batch Sequence & Dependency Testing: ETLs in a DW are essentially a sequence of processes that execute in a particular order. Dependencies exist among the various processes, and these are critical to maintaining the integrity of the data: executing the sequences in the wrong order might result in inaccurate data in the warehouse. The testing process must include at least two iterations of the end-to-end execution of the whole batch sequence, and the data must be checked for integrity during this testing. The most common types of errors caused by an incorrect sequence are referential integrity failures, incorrect end-dating (if applicable) and rejected records.
Job Restart Testing: In a real production environment, ETL jobs/processes fail for a number of reasons (for example, database-related failures or connectivity failures), and jobs can fail when half or partly executed. A good design always allows for restartability of the jobs from the failure point. Although this is more of a design suggestion/approach, it is recommended that every ETL job be built and tested for restart capability.

Error Handling: Understanding why a script fails during data validation may confirm, through process validation, that the ETL process is in fact working. During process validation the testing team works to identify additional data cleansing needs, as well as consistent error patterns that could possibly be diverted by modifying the ETL code.
Whether to take the time to modify the ETL process is a decision for the project manager, the development lead and the business integrator; it is the responsibility of the validation team to identify any and all records that seem suspect.
Once a record has been both data- and process-validated and the script has passed, the ETL process is functioning correctly. Conversely, if suspect records identified and documented during data validation are not supported through process validation, the ETL process is not functioning correctly and the development team will need to become involved in finding the appropriate solution. For example, suppose that during the execution of the source-to-target count scripts, suspect counts are identified (there are fewer records in the target table than in the source table). The missing records should have been captured by the error process and should appear in the error log. If those records do not appear in the error log, the ETL process is not functioning correctly and the development team needs to become involved.
Negative Testing: Negative testing checks whether the application fails where it should fail, using invalid inputs and out-of-boundary scenarios, and verifies the application's behavior in those cases.

8. Database Testing vs. DW Testing
The difference between a database and a data warehouse is not just data volume. ETL is the building block of a data warehouse, so data warehouse testing should be aligned with the data modeling underlying the warehouse, and specific test strategies should be designed for the extraction, transformation and loading modules.

Database Testing                                    | Data Warehouse Testing
----------------------------------------------------|----------------------------------------------------
Smaller in scale                                    | Large scale; voluminous data
Usually tests data at the source rather than        | Includes several phases, extraction, transformation
through a GUI                                       | and loading being the major ones
Usually homogeneous data                            | Heterogeneous data involved
Normalized data                                     | Denormalized data
CRUD operations                                     | Usually read-only operations
Consistent data                                     | Temporal data inconsistency
9. Challenges in DW Testing
Voluminous data from heterogeneous sources.
Data quality is not assured at the source.
Estimation is difficult: often only the data volume is known, with no accurate picture of the quality of the underlying data.
Business knowledge: organization-wide enterprise data knowledge may not be feasible for a single team.
100% data verification is not feasible. In such cases, the extraction, transformation and loading components must be thoroughly tested to ensure all types of data behave as expected; sampling can complement this (see the sketch below).
Very high cost of quality: any defect that slips through translates into significantly high cost.
Heterogeneous data sources are updated asynchronously.
Transaction-level traceability is difficult to attain in a data warehouse.
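Where 100% verification is infeasible, a reproducible random sample can be drawn and audited exhaustively. A sketch; the sample rate and seed are illustrative only:

    import random

    def sample_keys(all_keys, rate=0.01, seed=42):
        # A fixed seed keeps the sample reproducible across test runs.
        rng = random.Random(seed)
        k = max(1, int(len(all_keys) * rate))
        return rng.sample(all_keys, k)

    keys = list(range(1_000_000))   # stand-in for the full key population
    audit_set = sample_keys(keys)   # rows to verify exhaustively, field by field
    print(f"auditing {len(audit_set)} of {len(keys)} rows")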

10. Tools: Data Warehouse Testing

There are no standard guidelines on the tools that can be used for data warehouse testing. The majority of teams go with the tool that was used for the data warehouse implementation. A drawback of this approach is redundancy: the same transformation logic has to be implemented once for the DWH implementation and again for its testing.
Tool selection also depends on the test strategy, whether exhaustive verification, sampling, aggregation, etc. The reusability and scalability of the test suite being developed are also very important factors to consider.

11. Case Study

The Client
Our client is an investment advisory firm managing assets for institutional and private clients worldwide, valued at $42.4 billion in assets, with $34.2 billion in institutional/private client assets.
The Challenge
Our client generated all financial information and reports via manual calculations, delivering the final reports in Excel format. This did not satisfy customer needs, and customers threatened to take their business to competitors where they could get reports in an industry-standard format. The challenge: implement a financial reporting solution within a year. To achieve this, our client needed to build an infrastructure and a data warehouse that would store the data needed to generate ad hoc and canned reports in a timely manner. But the biggest challenge was how to validate the reports and ensure that all the calculations were accurate. The solution was to find a testing partner with extensive experience validating financial reports. This is exactly what SQA Solution offered: experienced financial reporting QA engineers with a strong financial background and the technical skills to deliver a high-quality, bullet-proof reporting solution.

The Solution
The SQA Solution team began by assessing the work and offered a free rapid assessment to understand the scope of work, the schedule, and the resourcing needs. We assembled a team of eight: one QA lead and seven senior QA engineers. Our QA lead was responsible for the overall test strategy, test planning, daily status reporting, and day-to-day team management.
Our team assessed the reporting requirements, the data sources, and the data target. We also reviewed source-to-target maps and came up with 800+ test cases focused on ensuring:
Data completeness: ensuring all expected data is loaded.
Data transformation: ensuring all data is transformed correctly according to business rules and/or design specifications.
Data quality: ensuring the ETL application correctly rejects, substitutes default values for, corrects or ignores and reports invalid data.
Performance & scalability: ensuring data loads and queries perform within expected time frames and that the technical architecture is scalable.
Reporting UI testing: verifying the reports' user interface.
Data calculations
Integration testing
Compatibility testing
User-acceptance testing
Regression testing

The Benefit
The high-quality data warehouse and reporting solution helped the client retain existing customers and acquire new ones.
Achieved optimal performance for complex reports.
Delivered a great user experience.
Met project deadlines.

12. Conclusion and Lessons Learnt

In this paper we proposed a comprehensive approach that adapts and extends the testing methodologies proposed for general-purpose software to the peculiarities of data warehouse projects. Our proposal builds on a set of tips and suggestions drawn from our direct experience on real projects, as well as from interviews we conducted with data warehouse practitioners. As a result, a set of relevant testing activities was identified, classified, and framed within a reference design methodology.
In order to validate our approach on a case study, we are currently supporting a professional design team engaged in a large data warehouse project, which will help us better focus on relevant issues such as test coverage and test documentation. In particular, to better validate our approach and understand its impact, we will apply it to one of two data marts developed in parallel, so as to assess, on the one hand, the extra effort due to comprehensive testing and, on the other, the savings in post-deployment error correction and the gains in data and design quality.


Author Bio
Prasuna Potteti | System Integration | Deloitte Consulting
Prasuna Potteti is a senior consultant with core competencies in software testing, data warehouse testing, test data management and testability. Familiar with many test tools, she helps teams develop test plans and test cases and execute them. She has helped develop automated functional tests, as well as performance tests, for both package applications and custom development.
Prasuna has been actively involved in competency building in the testing COE (Center of Excellence) at Deloitte.

Appendix
DW - Data Warehouse
ETL - Extract, Transform & Load
ODS - Operational Data Source
QA - Quality Assurance
IT - Information Technology
