
DATA MODELING

STAGING AREA - SOME CLARITY
- Staging area is optional; used to cleanse the source data
- Accepts data from different sources
- Data model is required at staging area
- Multiple data models may be required: for parking different sources, and for transformed data to be pushed out to the warehouse

ODS - SOME CLARITY
- Operational Data Store
- Optional
- Granular, detailed level data
- May feed the warehouse (e.g. when the warehouse is aggregated)
- Usually a relational model
- May keep data for a smaller time period than the warehouse

Types of Data Warehouse
- Enterprise Data Warehouse
- Data Mart

Enterprise Data Warehouse
- Contains data drawn from multiple operational systems
- Supports time-series and trend analysis across different business areas
- Can be used as a transient storage area to clean all data and ensure consistency
- Can be used to populate data marts
- Can be used for everyday and strategic decision making

Data Mart
- Logical subset of enterprise data warehouse
- Organized around a single business process
- Based on granular data
- May or may not contain aggregates
- Object of analytical processing by the end user
- Less expensive and much smaller than a full-blown corporate data warehouse

Distributed and Centralized Data Warehouses
- DW sitting on a monolithic machine - unrealistic
- Separate machines, different OS, different DB systems - reality
- Solution: share a uniform architecture to allow them to be fused coherently

Levels of modeling
- Business process
- Logical Model
- Physical model

Conceptual modeling
- Describe data requirements from a business point of view without technical details

Logical modeling
- Refine conceptual models
- Technology oriented but platform independent

Physical modeling
- Detailed specification of what is physically implemented using specific technology

Comparison of the three models

Conceptual Model
- Identify entities and define important associations between those entities.
- Is not a normalized model.

Logical Model
- Includes the entities, relationships and the full attributes.
- Is normalized to at least 3rd normal form.

Physical Model
- Includes tables, columns, keys, data types, validation rules.
- May be denormalized to meet performance requirements.

Conceptual Model
- A conceptual model shows data through business eyes.
- All entities which have business meaning.
- Important relationships.
- Few significant attributes in the entities.
- Few identifiers or candidate keys.

[Figure: Sample Conceptual Model. Entities: Products, Customer Invoices, Sales Reps, Customers, Customer Addresses, Geographic Boundaries]

Logical Model
- Replaces many-to-many relationships with associative entities.
- Defines a full population of entity attributes.
- May use non-physical entities for domains and sub-types.
- Establishes entity identifiers.
- Has no specifics for any RDBMS or configuration.

[Figure: Sample Logical Model]
- CUSTOMER INVOICE: #INVOICE ID, #LINE ITEM SEQ, INVOICE DATE
- PRODUCT: #PRODUCT CODE, PRODUCT DESCRIPTION
- CUSTOMER ADDRESS: #CUSTOMER ID, #ADDRESS ID
- CUSTOMER: #CUSTOMER ID, #SNAPSHOT DATE, CUSTOMER NAME
- SALES REP: #SALES REP ID
- GEOGRAPHIC BOUNDARY: #GEO CODE
- Relationship labels in the diagram: the bill for, the bill sent to, purchased by, purchased at, sold by, sold to, located within, managed by, the salesman for, the sales manager for, the general location of
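The first point of the logical model, replacing a many-to-many relationship with an associative entity, can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module; the table names (invoice, product, invoice_line) and the sample rows are hypothetical and only loosely echo the sample model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE invoice (invoice_id INTEGER PRIMARY KEY, invoice_date TEXT);
CREATE TABLE product (product_code TEXT PRIMARY KEY, product_description TEXT);
-- Associative entity: one row per (invoice, product) pairing.
-- It turns the many-to-many link into two one-to-many links.
CREATE TABLE invoice_line (
    invoice_id   INTEGER REFERENCES invoice(invoice_id),
    product_code TEXT    REFERENCES product(product_code),
    quantity     INTEGER NOT NULL,
    PRIMARY KEY (invoice_id, product_code)
);
INSERT INTO invoice VALUES (1, '2024-01-05'), (2, '2024-01-06');
INSERT INTO product VALUES ('P1', 'Widget'), ('P2', 'Gadget');
INSERT INTO invoice_line VALUES (1, 'P1', 3), (1, 'P2', 1), (2, 'P1', 5);
""")

def products_on_invoice(invoice_id):
    """Navigate invoice -> products through the associative entity."""
    rows = conn.execute(
        "SELECT p.product_description FROM invoice_line l "
        "JOIN product p ON p.product_code = l.product_code "
        "WHERE l.invoice_id = ? ORDER BY p.product_description",
        (invoice_id,),
    )
    return [r[0] for r in rows]

def invoices_for_product(product_code):
    """Navigate product -> invoices through the same entity."""
    rows = conn.execute(
        "SELECT invoice_id FROM invoice_line "
        "WHERE product_code = ? ORDER BY invoice_id",
        (product_code,),
    )
    return [r[0] for r in rows]
```

Both directions of the relationship are answered from the same associative table, which is also the natural home for attributes of the pairing itself (here, quantity).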


Physical Model
A physical data model may include:
- Referential integrity
- Indexes
- Views
- Alternate keys and other constraints
- Tablespaces and physical storage objects

[Figure: Sample Physical Model]
- PRODUCTS: #PRODUCT_CODE, PRODUCT_DESCRIPTION, CATEGORY_CODE, CATEGORY_DESCRIPTION
- SALES_REPS: #SALES_REP_ID, LAST_NAME, FIRST_NAME, oMANAGER_FIRST_NAME, oMANAGER_LAST_NAME
- CUSTOMER_INVOICES: #INVOICE_ID, #LINE_ITEM_SEQ, INVOICE_DATE, CUSTOMER_ID, BILL_TO_ADDRESS_ID, SALES_REP_ID, MANAGER_REP_ID, ORGANIZATION_ID, ORG_ADDRESS_ID, PRODUCT_CODE, QUANTITY, UNIT_PRICE, AMOUNT, oPRODUCT_COST, LOAD_DATE
- CUSTOMERS: #CUSTOMER_ID, #SNAPSHOT_DATE, CUSTOMER_NAME, oAGE, oMARITAL_STATUS, CREDIT_RATING
- GEOGRAPHIC_BOUNDARIES: #GEO_CODE, CITY_NAME, STATE_NAME, COUNTRY_NAME, oCITY_ABBRV, oSTATE_ABBRV, oCOUNTRY_ABBRV
- CUSTOMER_ADDRESSES: #CUSTOMER_ID, #ADDRESS_ID, ADDRESS_LINE1, oADDRESS_LINE2, oPOSTAL_CODE, SALES_REP_ID, GEO_CODE, LOAD_DATE
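Several of the physical-model features listed above (referential integrity, an index, a view, an alternate key) can be shown in one small sketch. It uses Python's sqlite3 module and a cut-down version of two of the sample tables; the EMAIL column, the index and view names, and the sample rows are assumptions added for illustration, and tablespaces are omitted because SQLite has no such concept.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE sales_reps (
    sales_rep_id INTEGER PRIMARY KEY,
    last_name    TEXT NOT NULL,
    first_name   TEXT NOT NULL,
    email        TEXT UNIQUE              -- hypothetical alternate key
);
CREATE TABLE customer_addresses (
    customer_id   INTEGER NOT NULL,
    address_id    INTEGER NOT NULL,
    address_line1 TEXT NOT NULL,
    postal_code   TEXT,                   -- optional column
    sales_rep_id  INTEGER NOT NULL
                  REFERENCES sales_reps(sales_rep_id),  -- referential integrity
    PRIMARY KEY (customer_id, address_id)
);
CREATE INDEX idx_addr_rep ON customer_addresses(sales_rep_id);
CREATE VIEW v_address_rep AS
    SELECT a.customer_id, a.address_id, r.last_name AS rep_last_name
    FROM customer_addresses a
    JOIN sales_reps r ON r.sales_rep_id = a.sales_rep_id;
INSERT INTO sales_reps VALUES (10, 'Rao', 'Anita', 'anita@example.com');
INSERT INTO customer_addresses VALUES (1, 1, '12 Main St', '560001', 10);
""")

def orphan_address_is_rejected():
    """Return True if the foreign key blocks an address with an unknown rep."""
    try:
        conn.execute(
            "INSERT INTO customer_addresses VALUES (2, 1, '9 Elm Rd', NULL, 999)"
        )
    except sqlite3.IntegrityError:
        return True
    return False
```

None of these objects appear in a logical model; they are decisions about one specific database engine, which is what distinguishes the physical level.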

DATA MODELING - FOR WHICH COMPONENT?

- STAGING AREA: YES! (maybe multiple data models are required)
- ODS: YES!
- DATA WAREHOUSE: YES!

Review of Data Modeling

Modeling techniques
- E-R Modeling
- Dimensional Modeling

Implementation and modeling styles

Modeling versus implementation
- Modeling: describe what should be built to non-technical folks
- Implementation: describe what is actually built to technical folks

Relational modeling
- Use for implementation
- Difficult to understand by non-technical folks

Dimensional modeling
- Use for modeling during analysis and design phases
- Can be implemented using other modeling styles, e.g. object-oriented, relational

E-R Modeling
- Produces a data model using two basic concepts: entities and the relationships between those entities.
- Detailed E-R models also contain attributes, which can be properties of either the entities or the relationships.

Conventions used in E-R modeling
- Entities
- Attributes
- Relationships or Associations

Entities
- Principal data objects about which information is to be collected.
- Usually recognizable concepts such as persons, things, or events.
- Examples: EMPLOYEES, PROJECTS, INVOICES.

Attributes & Relationships
- Attributes describe the entity with which they are associated.
- A relationship represents an association between two or more entities.
- Examples: Employees are assigned to projects. Departments manage one or more projects.

Types of Data Relationships
- One - One (1:1)
- One - Many (1:m)
- Many - Many (m:n)
- Recursive data relationship

Normalization
- Removes data redundancy
- 0 NF - contains repeating values
- 1 NF - no repeating values
- 2 NF - every non-key attribute is dependent on the whole key, not just part of it
- 3 NF - no non-key attribute is functionally dependent on another non-key attribute
- Denormalization - carefully introduced redundancy to improve query performance

Join Techniques
- Self Join
- Outer Join
- Equi Join
- Non-Equi Join
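The four join techniques named above can be compared side by side on a tiny data set. The sketch below uses Python's sqlite3 module; the tables (employees, projects, pay_bands), their columns and their rows are all hypothetical, chosen only so that each join type returns something different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, name TEXT,
                        manager_id INTEGER, salary INTEGER);
CREATE TABLE projects  (proj_id INTEGER PRIMARY KEY, title TEXT, lead_id INTEGER);
CREATE TABLE pay_bands (band TEXT, low INTEGER, high INTEGER);
INSERT INTO employees VALUES (1, 'Asha', NULL, 90), (2, 'Ben', 1, 60), (3, 'Chen', 1, 40);
INSERT INTO projects  VALUES (100, 'Billing', 2), (101, 'Audit', NULL);
INSERT INTO pay_bands VALUES ('A', 0, 50), ('B', 51, 100);
""")

def q(sql):
    """Run a query and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()

# Equi join: rows match on equality; projects without a lead drop out.
equi_join = q("""SELECT p.title, e.name FROM projects p
                 JOIN employees e ON p.lead_id = e.emp_id""")

# Outer join: every project is kept, with NULL where no lead exists.
outer_join = q("""SELECT p.title, e.name FROM projects p
                  LEFT OUTER JOIN employees e ON p.lead_id = e.emp_id
                  ORDER BY p.proj_id""")

# Self join: the table is joined to itself (a recursive relationship).
self_join = q("""SELECT e.name, m.name FROM employees e
                 JOIN employees m ON e.manager_id = m.emp_id
                 ORDER BY e.emp_id""")

# Non-equi join: the match condition is a range test, not equality.
non_equi_join = q("""SELECT e.name, b.band FROM employees e
                     JOIN pay_bands b ON e.salary BETWEEN b.low AND b.high
                     ORDER BY e.emp_id""")
```

The self join is also how the recursive data relationship listed earlier (an employee managed by another employee) is typically queried.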

Relational modeling
- Represents business entities, data items associated with each entity, and the relationships of business interest among the entities
- Entities are usually broken down into the smallest possible units and combined using relationships
- Diagram looks like a spiderweb

Limitations of E-R Modeling
- Poor performance
- Models tend to be very complex and difficult to navigate

Dimensional Modeling
- Dimensional modeling uses three basic concepts: measures, facts, dimensions.
- Is powerful in representing the requirements of the business user in the context of database tables.
- Focuses on numeric data, such as values, counts, weights, balances and occurrences.
- Must identify:
  - Business process to be supported
  - Grain (level of detail)
  - Dimensions
  - Facts
- Use for data marts

Conventions used in Dimensional modeling
- Facts
- Measures (Variables)
- Dimensions
- Dimension members
- Dimension hierarchies

Facts
- A fact is a collection of related data items, consisting of measures and context data.
- Each fact typically represents a business item, a business transaction, or an event that can be used in analyzing the business or business process.
- Facts are measured, continuously valued, rapidly changing information. They can be calculated and/or derived.
- A fact table is used to store business information (measures) that can be used in mathematical equations, e.g.:
  - Quantities
  - Percentages
  - Prices

Dimensions
- A dimension is a collection of members or units of the same type of views.
- Dimensions determine the contextual background for the facts.
- Dimensions represent the way business people talk about the data resulting from a business process, e.g. who, what, when, where, why, how.
- A dimension table is used to store qualitative data about fact records:
  - Who
  - What
  - When
  - Where
  - Why

Dimension data should be
- verbose, descriptive
- complete
- free of misspellings and impossible values
- indexed
- equally available
- documented (metadata to explain the origin and interpretation of each attribute)

Measures
- A measure is a numeric attribute of a fact, representing the performance or behaviour of the business relative to dimensions.
- The actual numbers are called variables.

Hierarchies
- Allow for the rollup of data to more summarized levels.
- Example (Time): day -> month -> quarter -> year

Granularity of Data
- Level of detail of the data stored in the warehouse
- Lowest possible grain of each dimension to be stored in the warehouse
- Grain determines the primary dimensionality of the fact table.
- Additional dimensions can be added as long as each takes a single value under each combination of the primary dimensions.

Advantages of Dimensional Modeling
- Allows a complex multi-dimensional data structure to be defined with a very simple data model.
- Reduces the number of physical joins the query has to process.
- Simplifies the view of the data model.
- Allows the DWH to expand and evolve with relatively low maintenance.
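The rollup along the time hierarchy (day, month, quarter, year) described above can be sketched in plain Python. The daily_sales rows and the function names below are hypothetical; the point is only that facts stored at the lowest grain can be summarized to any higher level, while the reverse is not possible.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sales facts stored at the lowest grain: one amount per day.
daily_sales = [
    (date(2024, 1, 15), 100.0),
    (date(2024, 2, 3), 50.0),
    (date(2024, 4, 20), 70.0),
    (date(2024, 11, 9), 30.0),
]

def level_key(day, level):
    """Map a date to its member at the requested level of the time hierarchy."""
    if level == "day":
        return day.isoformat()
    if level == "month":
        return f"{day.year}-{day.month:02d}"
    if level == "quarter":
        return f"{day.year}-Q{(day.month - 1) // 3 + 1}"
    if level == "year":
        return str(day.year)
    raise ValueError(f"unknown level: {level}")

def rollup(facts, level):
    """Sum the measure for every member of the chosen hierarchy level."""
    totals = defaultdict(float)
    for day, amount in facts:
        totals[level_key(day, level)] += amount
    return dict(totals)
```

For example, `rollup(daily_sales, "quarter")` groups the January and February rows under the same first-quarter member, and `rollup(daily_sales, "year")` collapses all four rows into one total.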

[Figure: Sample Logical Model for Dimensional Data Mart]
- CUSTOMER REP SALES (fact): Customer snapshot date, Invoice date, Gross sales, Quantity, Product cost
- PRODUCT: Product description, Category code, Category description
- TIME PERIOD: Invoice date, Fiscal year, Quarter, Month, Week
- SALES REP: Last name, First name
- ADDRESS: Address line 1, Address line 2, City name, State abbreviation, Postal code, Country name
- CUSTOMERS: Customer name
- CUSTOMER DEMOGRAPHICS: Snapshot date, Credit rating, Marital status, Age


[Figure: Sample Physical Model for Data Warehouse]
- PRODUCTS: #PRODUCT_CODE, PRODUCT_DESCRIPTION, CATEGORY_CODE, CATEGORY_DESCRIPTION
- PRODUCT_SNAPSHOTS: #PRODUCT_CODE, #SNAPSHOT_DATE, MSRP, UOM, PRIMARY_SUPPLIER_NAME, SUPPLIER_CITY_NAME, SUPPLIER_STATE_ABBRV, SUPPLIER_COUNTRY_NAME
- PURCHASE_INVOICES: #INVOICE_ID, #LINE_ITEM_SEQ, INVOICE_DATE, SUPPLIER_ID, ADDRESS_ID, BUDGET_ID, REVISION_SEQ, BUDGET_LINE_ITEM_SEQ, PRODUCT_CODE, QUANTITY, UNIT_PRICE, AMOUNT, LOAD_DATE
- BUDGET_DETAILS: #BUDGET_ID, #REVISION_SEQ, #LINE_ITEM_SEQ, BLI_TYPE_CODE, BLI_TYPE_DESCRIPTION, ORGANIZATION_ID, ADDRESS_ID, BUDGET_PERIOD, LOAD_DATE, BUDGET_AMOUNT, EXPENDITURES, oPRODUCT_CODE
- SALES_REPS: #SALES_REP_ID, LAST_NAME, FIRST_NAME, oMANAGER_FIRST_NAME, oMANAGER_LAST_NAME
- CUSTOMER_INVOICES: #INVOICE_ID, #LINE_ITEM_SEQ, INVOICE_DATE, CUSTOMER_DATE, BILL_TO_ADDRESS_ID, SALES_REP_ID, MANAGER_REP_ID, ORGANIZATION_ID, ORG_ADDRESS_ID, PRODUCT_CODE, QUANTITY, UNIT_PRICE, AMOUNT, oPRODUCT_COST, LOAD_DATE
- CUSTOMER_ADDRESSES: #CUSTOMER_ID, #ADDRESS_ID, ADDRESS_LINE1, oADDRESS_LINE2, oPOSTAL_CODE, SALES_REP_ID, GEO_CODE, LOAD_DATE
- CUSTOMERS: #CUSTOMER_ID, #SNAPSHOT_DATE, CUSTOMER_NAME, oAGE, oMARITAL_STATUS, CREDIT_RATING
- SUPPLIER_ADDRESSES: #SUPPLIER_ID, #ADDRESS_ID, SUPPLIER_NAME, oPOSTAL_CODE, GEO_CODE, LOAD_DATE
- INTERNAL_ORG_ADDRESSES: #ORGANIZATION_ID, #ADDRESS_ID, ORG_TYPE, ORGANIZATION_NAME, ADDRESS_LINE1, oADDRESS_LINE2, oPOSTAL_CODE, GEO_CODE, oPARENT_ORG_ID, LOAD_DATE
- GEOGRAPHIC_BOUNDARIES: #GEO_CODE, CITY_NAME, STATE_NAME, COUNTRY_NAME, oCITY_ABBRV, oSTATE_ABBRV, oCOUNTRY_ABBRV

Common structures for data marts: Denormalize!
- Flat table
- Star
  - Single fact table surrounded by denormalised dimension tables
  - Many star schemas in a data mart
  - Easily understood by end users, more disk storage required
- Snowflake
  - Single fact table surrounded by normalised dimension tables
  - Less intuitive, slower performance due to joins
- May want to use both approaches, especially if supporting multiple end-user tools
[Figure: Sample Snowflake Schema]
- BUDGET_DETAILS: #BUDGET_ID, #REVISION_SEQ, #LINE_ITEM_SEQ, BLI_TYPE_CODE, BLI_TYPE_DESCRIPTION, ORGANIZATION_ID, ADDRESS_ID, BUDGET_PERIOD, LOAD_DATE, BUDGET_AMOUNT, EXPENDITURES, oPRODUCT_CODE
- PRODUCTS: #PRODUCT_CODE, PRODUCT_DESCRIPTION, CATEGORY_CODE, CATEGORY_DESCRIPTION
- INTERNAL_ORG_ADDRESSES: #ORGANIZATION_ID, #ADDRESS_ID, LOAD_DATE, CENTER_ID, DIVISION_ID, DEPARTMENT_ID
- CENTER: ORG_TYPE, ORGANIZATION_NAME, ADDRESS_LINE1, oADDRESS_LINE2, oPOSTAL_CODE, GEO_CODE
- DIVISION: ORG_TYPE, ORGANIZATION_NAME, ADDRESS_LINE1, oADDRESS_LINE2, oPOSTAL_CODE, GEO_CODE
- DEPARTMENT: ORG_TYPE, ORGANIZATION_NAME, ADDRESS_LINE1, oADDRESS_LINE2, oPOSTAL_CODE, GEO_CODE
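For contrast with the snowflake layout, the star structure described above (one fact table surrounded by denormalised dimension tables) can be sketched with Python's sqlite3 module. The table names dim_product, dim_time and fact_sales, the surrogate key columns and all rows are hypothetical; they loosely follow the PRODUCT and TIME PERIOD dimensions of the sample data mart.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Denormalised dimension: the category lives on the product row itself
-- instead of in a separate category table (which would be a snowflake).
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY,
                          product_description TEXT,
                          category_description TEXT);
CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY, invoice_date TEXT,
                       fiscal_year INTEGER, quarter TEXT);
-- Fact table: foreign keys to each dimension plus the numeric measures.
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    time_key    INTEGER REFERENCES dim_time(time_key),
    quantity    INTEGER,
    gross_sales REAL
);
INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware'),
                               (2, 'Gadget', 'Hardware'),
                               (3, 'Manual', 'Books');
INSERT INTO dim_time VALUES (1, '2024-01-15', 2024, 'Q1'),
                            (2, '2024-04-20', 2024, 'Q2');
INSERT INTO fact_sales VALUES (1, 1, 2, 20.0), (2, 1, 1, 15.0),
                              (3, 1, 4, 40.0), (1, 2, 1, 10.0);
""")

def sales_by_category_and_quarter():
    """Typical star query: one join per dimension, then aggregate the measure."""
    return conn.execute("""
        SELECT p.category_description, t.quarter, SUM(f.gross_sales)
        FROM fact_sales f
        JOIN dim_product p ON p.product_key = f.product_key
        JOIN dim_time    t ON t.time_key    = f.time_key
        GROUP BY p.category_description, t.quarter
        ORDER BY p.category_description, t.quarter
    """).fetchall()
```

A snowflaked version of the same query would need one more join per normalised level (for example product to category), which is the performance cost the comparison above refers to.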

Basic Dimensional Modeling Techniques
- Slowly changing dimensions
- Rapidly changing small dimensions
- Large dimensions
- Rapidly changing large dimensions
- Degenerate dimensions
- Junk dimensions

Slowly Changing Dimensions
- A dimension is considered a Slowly Changing Dimension when its attributes remain almost constant over time, requiring relatively minor alterations to represent the evolved state.

Slowly Changing Dimensions - Options
E.g. the key does not change but the description changes (product description).
- TYPE 1: Overwrite the dimension record with the new values
  - used when the old value of the attribute has no significance
- TYPE 2: Create a new record using a new value of the surrogate key
  - used when history can be clearly partitioned
  - a query on the changed attribute returns only the new value or only the old value
  - a query on some other attribute returns all records
- TYPE 3: Create an "old" field in the dimension to store the immediately previous value
  - used when the change is a soft change
  - no perfect partition in history
  - may want to track for some time with both the old and the new value
  - do not use when there are too many such soft changes in succession

Degenerate Dimensions
- Occur in line-item-oriented fact tables
- Occur when a dimension table is left with only a single key and no other fields
- All other attributes have been moved into other dimension tables
- The key is moved to the fact table and is not joined to anything

Junk Dimensions
- A number of miscellaneous flags and text attributes left over after design
- What to do with them?
- DO NOT:
  - Leave them behind in the fact table
  - Make each flag and attribute into its own dimension
  - Strip off all such flags and attributes
- DO:
  - Group the random flags and attributes: take them away from the fact table and group them into a junk dimension
  - E.g. open-ended comment fields
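The three slowly changing dimension options (Type 1, 2 and 3) described earlier in this section differ only in what happens to the old value, which a short Python sketch makes visible. The ProductRow class and the function names are hypothetical, and the is_current flag on Type 2 rows is an added assumption (a common convenience, not something the options above require).

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ProductRow:
    surrogate_key: int
    product_code: str                           # natural key, never changes
    description: str
    previous_description: Optional[str] = None  # used by Type 3 only
    is_current: bool = True                     # assumption, used by Type 2 only

def scd_type1(rows: List[ProductRow], code: str, new_desc: str) -> None:
    """Type 1: overwrite in place; the old value is lost."""
    for row in rows:
        if row.product_code == code:
            row.description = new_desc

def scd_type2(rows: List[ProductRow], code: str, new_desc: str) -> None:
    """Type 2: keep the old row and add a new row under a new surrogate key."""
    next_key = max(r.surrogate_key for r in rows) + 1
    for row in rows:
        if row.product_code == code and row.is_current:
            row.is_current = False
    rows.append(ProductRow(next_key, code, new_desc))

def scd_type3(rows: List[ProductRow], code: str, new_desc: str) -> None:
    """Type 3: keep only the immediately previous value in an 'old' field."""
    for row in rows:
        if row.product_code == code:
            row.previous_description = row.description
            row.description = new_desc
```

Under Type 2, fact rows loaded before the change keep pointing at the old surrogate key, which is how history ends up cleanly partitioned. Under Type 3, a second change overwrites previous_description, which is why that option suits only occasional soft changes.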

Conformed Dimensions
- A conformed dimension means the same thing with every possible fact table to which it is joined.
- A conformed dimension is identically the same dimension in each data mart.
- A major responsibility of the central data warehouse design team is to establish, publish, maintain and enforce the conformed dimensions.
- Without strict adherence to conformed dimensions, the data warehouse cannot function as an integrated whole.

When you don't need Conformed Dimensions
- Several lines of business where the customers and products are disjoint.
- You don't manage these separate business lines together.
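What a conformed dimension buys in practice is the ability to line up measures from different fact tables under the same labels. The sketch below is a hedged illustration using Python's sqlite3 module: dim_product is shared by two hypothetical fact tables (fact_sales and fact_purchases), each is summarized separately, and the results are merged on the shared category. All names and rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One product dimension, shared unchanged by both fact tables.
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE fact_sales     (product_key INTEGER, sales_amount REAL);
CREATE TABLE fact_purchases (product_key INTEGER, purchase_amount REAL);
INSERT INTO dim_product VALUES (1, 'Hardware'), (2, 'Books');
INSERT INTO fact_sales VALUES (1, 100.0), (1, 50.0), (2, 20.0);
INSERT INTO fact_purchases VALUES (1, 80.0), (2, 5.0), (2, 5.0);
""")

def totals_by_category(fact_table, measure):
    """Summarize one fact table by the shared dimension's category.

    The table and column names are interpolated into the SQL, so this
    helper must only be called with fixed, trusted names as done below.
    """
    sql = (
        f"SELECT d.category, SUM(f.{measure}) FROM {fact_table} f "
        "JOIN dim_product d ON d.product_key = f.product_key "
        "GROUP BY d.category"
    )
    return dict(conn.execute(sql).fetchall())

def sales_vs_purchases():
    """Merge the two summaries on the conformed category labels."""
    sales = totals_by_category("fact_sales", "sales_amount")
    purchases = totals_by_category("fact_purchases", "purchase_amount")
    categories = sorted(set(sales) | set(purchases))
    return {c: (sales.get(c, 0.0), purchases.get(c, 0.0)) for c in categories}
```

If each data mart kept its own product dimension with different keys or category names, the final merge step would have nothing reliable to match on.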