You are on page 1of 35

ISQS 6339, Business Intelligence

Extraction, Transformation, and


Loading

Outline
Introduction

to SSIS
Learn by doing Exercise 4
More about SSIS features - Package
development tools

Youtube Videos

Introduction to SQL Server Integration Services


Part 1 1019, Part 2 812

Create A Basic SSIS Package with SSIS 755


How to create a simple SSIS package
Part 1 319, Part 2 754, Part 3 641

More videos
An Overview of SSIS

Part 1 612, Part 2 613, Part 3 820

ETL Demon 1040


ETL Tools 456
SSIS 2008 Package Deployment

Part I 804, Part II 509

Introduction to SSIS 2008

Part I 957, Part II 957, Part III 955, Part IV 959, Part V 611

ETL Strategies with SSIS 2156

Introduction to SSIS

Structure and Components of


Business Intelligence
MS
MSSQL
SQLServer
Server2008
2008
SSMS
SSMS

SSIS
SSIS

SSAS
SSAS

BIDS

SSRS
SSRS

SAS
SAS
EG
EG

SAS
SAS
EM
EM
5

ETL Topics

Dimension Processing
Extract changed row from the operational database
Handling slowly changing dimensions
De-duplication and fuzzy transforms

Fact Processing

Extract fact data from the operational database


Extract fact updates and deletes
Cleaning fact data
Checking data quality and halting package execution
Transform fact data
Surrogate key pipeline
Loading fact data
Analysis services processing

Integrating all tasks

Automating your routine


information processing tasks

Your routine information processing tasks

Read online news at 8:00a and collect a few most important


pieces
Retrieve data from database to draft a short daily report at 10a
View and reply emails and take some notes that are saved in a
database
View 10 companies webpage to see the updates. Input the
summaries into a database
Browse three popular magazines twice a week. Input the
summaries into a database
Generate a few one-way frequency and two-way frequency tables
and put them on the web
Merge datasets collected by other people into a main database.
Prepare a weekly report using the database and at 4p every
Monday, and publish it to the internal portal site.
Prepare a monthly report at 11a on the first day of a month, which
must be converted into a pdf file and uploaded to the website.

Seems there are many things on going. How to handle them


properly in the right time?
Organizer yes
How about regular data processing tasks?

SQL Server Integration Services


(SSIS)
The data in data warehouses and data marts is
usually updated frequently, and the data loads are
typically very large.
Integration Services includes a task that bulk loads
data directly from a flat file into SQL Server tables
and views, and a destination component that bulk
loads data into a SQL Server database as the last
step in a data transformation process.
An SSIS package can be configured to be
restartable. This means you can rerun the package
from a predetermined checkpoint, either a task or
container in the package. The ability to restart a
package can save a lot of time, especially if the
package processes data from a large number of
sources.

What can you do with


SSIS?
To load the dimension and fact tables in the

database. If the source data for a dimension table is stored


in multiple data sources, the package can merge the data
into one dataset and load the dimension table in a single
process, instead of using a separate process for each data
source.
To update data in data warehouses and data marts.
The Slowly Changing Dimension Wizard automates support
for slowly changing dimensions by dynamically creating the
SQL statements that insert and update records, update
related records, and add new columns to tables.
To process Analysis Services cubes and dimensions.
When the package updates tables in the database that a
cube is built on, you can use Integration Services tasks and
transformations to automatically process the cube and to
process dimensions as well.
To compute functions before the data is loaded into
its destination. If your data warehouses and data marts
store aggregated information, the SSIS package can
compute functions such as SUM, AVERAGE, and COUNT. An
SSIS transformation can also pivot relational data and
transform it into a less-normalized format that is more
compatible with the table structure in the data warehouse.
9

SQL Server Integration


Services

The hierarchy of SSIS

Project -> Package ->


Control flow -> Data flow

Package structure

Control flow
Data flow
Event handler
Package explorer
Connection tray

Features
Event driven
Layered
Drag-and-drop
programming
Data I/O definitions are
done using Connection
Managers

Control
Flow
Data
Flow
Event
Handler

10

SSIS

Architectur
e

11

Control Flow
Bulk

Insert task: Perform a fast load of data


from flat files into a target table. Good for
loading clean data.
Execute SQL task: Perform database
operations, creating views, tables, or even
databases. Good for query data or metadata
File Transfer Protocol and File System tasks:
transfer files or sets of files.
Execute Package, Execute DTS2000 Package,
and Execute Process tasks: Break a complex
workflow into smaller ones, and define a
parent or master package to execute them.
Send Mail task: sends an email message.
12

Control Flow (contd)


Script

and ActiveX Script tasks: Perform an


endless array of operations that are beyond the
scope of the standard tasks.
Data Mining and Analysis Service Processing
tasks: Launch processing on SSAS dimensions
and databases. Use SSAS DDL task to create
new Analysis Services partitions, or perform any
data definition language operation.
XML and Web Services tasks
Message Queue, WMI Data Reader, and WMI
Event Watcher tasks: Help to build an automatic
ELT system.
ForEach Loop, For Loop, and Sequence
containers: Execute a set of tasks multiple times
Data Flow tasks
13

Data Flow Task


Data

Flow task is a pipeline in which data


is picked up, processed and written to a
destination.
Avoids I/O, which provided excellent
performance
Concepts

Data sources
Data destinations
Data transformations
Error flows
14

Frequently Used Data


Transformation Steps
Sort

and Aggregate transforms


Conditional Split and Multicast transforms
Union All, Merge Join, and Lookup transforms
Slowly Changing Dimension transform
OLE DB Command transform
Row Count and Audit transforms
Pivot and Unpivot transforms
Data mining Model Training and data Mining
Query transforms
Term extraction and Term Lookup transforms
File Extractor and File Injector transforms
15

Dynamic Packaging
Modifying

the actions that a package takes


when its executing.
SSIS implements a rich expression language
that is used in control flow and also in data flow
transform.
Concepts
Expressions. Uses an expression language,
simple.
Variables. Can be defined within a package.
Can be scoped to any object: package-wide,
within a container, a single task, etc.
Configurations. Can overwrite most of the
settings for SSIS objects by supplying a
configuration file at runtime.
16

Decision Issues in ETL System


Design
Source-to-target
Load

mapping

frequency
How much history is needed

1
7

ISQS 6339, Data Management & Business


Intelligence

Strategies for Extracting


Data
Extracting data from packaged source

systems self-contained data sources


May not be good to use their APIs
May not be good to use their add-on
analytic system
Extracting directly from the source databases
Strategies vary depending on the nature of
the source database
Extracting data from incremental loads
How the source database records the
changes of the rows
Extracting historical data

1
8

ISQS 6339, Data Management & Business


Intelligence

De-Duplication
Two

common situations: person, and


organization
SSIS provides two general-purpose
transforms helping address data
quality and de-duplication

Fuzzy Lookup
Fuzzy Grouping

19

LEARN BY DOING
EXERCISE 4

Exercise 4: Populate Maximum Miniatures


Manufacturing Data Mart Dimensions

Preparation: Data sources and destination definition


Source database: AccountingSystemDatabase

Loading dimensions

ProductType
ProductSubType
Product
Country
Plant (using SQL Command)
Material (using SQL Command, Aggregate item)
MachineType (copied from the Material loading task)
Machine (copied from the MachineType loading task)

Note: DimBatch and the fact table will be loaded in the next
exercise.
Debugging
Step by step
Understand the error messages
Watch database loading status

See more detailed Guidelines of this exercise


Submit the screenshots of green results of the ETL flow to
isqs6347@gmail.com by February 20 before 5p.
21

Snowflake Schema of the Data Mart

10

DimBatch

DimMachine

ManufacturingFact

DimProduct
7

DimMachineType

DimPlant

DimMaterial

DimCountry

DimProductSubType

DimProductType

Aggregate

SQL Coding

22

Codes for data flows

The following codes are used to selectively retrieve data


from the source for the destination database

Code for DimPlant loading


SELECT LocationCode, LocationName, CountryCode
From Locations
WHERE LocationType = 'Plant Site'

Code for DimMaterial loading


SELECT AssetCode, AssetName, AssetClass, LocationCode,
Manufacturer, DateOfPurchase, RawMaterial
FROM CapitalAssets
WHERE AssetType = 'Molding Machine'

23

Package Items

Data flow Task main task


Control Flow Items

Data Preparation Tasks

Analysis Services Processing Task, Analysis Services Execute DDL Task, Data
Mining Query Task

Transfer Tasks

ActiveX Script Task, Script Task

Analysis Services Tasks

Bulk Insert Task, Execute SQL task

Scripting Tasks

Execute Package Task, Execute DTS 2000 Package Task, Execute Process Task,
Message Queue Task, Send Mail Task, WMI Data Reader Task, WMI Event Watcher
Task

SQL Server Tasks

File System Task, FTP Task, Web Service Task, XML Task

Work Flow Tasks

For Loop Container, Foreach Loop Container, Sequence Container

Transfer Database Task, Transfer Error Messages Task, Transfer Logins Task
Transfer Objects Task, Transfer Stored Procedures Task

Maintenance Tasks
Custom Tasks
24

Exploring Features of SQL Server


ETL System
Data

sources and data destinations

SQL Server file (OLE DB file)


Flat file
Excel file
Data

flow transformation

Aggregate
Derived Column
Data Conversion
Sort

25

More About SSIS


Features

ETL System Debugging


Most

frequently encountered errors


Data format error: The database tables data
type does not match the input datas format
Reason 1: Flat Text file uses varchar(50), or
string[DT_STR] format; Excel file uses
nvarchar format
Reason 2: You defined the database using
different formats, which could be caused
by the imported data set.
Solution: A Data Conversion data
transformation node can be used for
changing the format
SQL Server system error: Even though you did
things correctly you could not get through.
Solution: the easiest way to solve this
problem is to redo the ETL flow.
27

ETL How-to Problems


How

to use Merge function of Data


Transformation to join datasets from two tables
into one.
How to split a dataset to two tables
How to remove duplicated rows in a table.
How to detect the changes of the rows in the
data sources and extract the updated rows into a
table in the data warehouse.
How to load multiple datasets with similar
structure into a table

Reference: SQL Server 2005 Integration Services,


McGraw Hill Osborne, 2007
28

Connection managers

Excel Connection Manger


File Connection Manger
Flat File Connection Manager
FTP Connection Manager
HTTP Connection Manager
ODBC Connection Manager
OLE DB Connection Manager ADO Connection Manager for
legacy applications using earlier versions of programming
languages, such as VB 6.0
ADO.NET Connection Manager Access to Microsoft SQL
Server and data sources exposed through OLE DB and XML
by using a .NET provider
Microsoft .NET Data Provider for mySQL Business Suite
access to SAP server and enables to execute RFC/NAPI
commands and select queries against SAP tables
Design-time data source objects can be created in SSIS,
SSAS and SSRS projects
29

Container Managers
Foreach

Loop Container
For Loop Container
Sequence Container

30

Different Types of ETL Control Flows


With

data flows, e.g.


Import data
Database updates
Loading SCD
Database cleansing
Aggregating data
Without data flows, e.g.
Downloading zipped files
Archiving downloaded files
Reading application log
Mailing opportunities
Consolidating workflow package
31

Data Flow for Updating


Database

32

Data Flow for Loading Slowly


Changing Dimension

33

Control Flow for Importing Expanded


Files

34

Exploring Features of SQL Server


ETL System (TBD)
Data

Set:

Source: Commrex_2011, D5.txt (in the shared


directory under \OtherDatasets)
Destination: Flat file, Excel file, OLE DB file
Data

flow transformation

Aggregate (Use D5.txt, and aggregate the data


with regard to UserID)
Derived Column (Use Commrex_2011, and create
a new column NewID)
Data Conversion (Use Commrex_2011, and
convert data type of some columns, such as
UserID, Prop_ID)
Sort (use D5.txt, sort ascending with ID, Date,
Time)
35

You might also like