Data sets are made up of data objects. A data object represents an entity—in a sales database, the objects may be
customers, store items, and sales; in a medical database, the objects may be patients; in a university database, the
objects may be students, professors, and courses. Data objects are typically described by attributes. Data objects can
also be referred to as samples, examples, instances, data points, or objects. If the data objects are stored in a
database, they are data tuples. That is, the rows of a database correspond to the data objects, and the columns
correspond to the attributes.
What Is an Attribute?
An attribute is a data field, representing a characteristic or feature of a data object. The term dimension is
commonly used in data warehousing. Attributes describing a customer object can include, for example, customer
ID, name, and address. Observed values for a given attribute are known as observations. A set of attributes used to
describe a given object is called an attribute vector (or feature vector). The distribution of data involving one
attribute (or variable) is called univariate. A bivariate distribution involves two attributes, and so on.
The type of an attribute is determined by the set of possible values—nominal, binary, ordinal, or numeric—the
attribute can have. In the following subsections, we introduce each type.
Nominal Attributes
Nominal means “relating to names.” The values of a nominal attribute are symbols or names of things. Each value
represents some kind of category, code, or state, and so nominal attributes are also referred to as categorical. The
values do not have any meaningful order. In computer science, the values are also known as enumerations.
Example: Suppose that hair color and marital status are two attributes describing person objects. In our application,
possible values for hair color are black, brown, blond, red, auburn, gray, and white. The attribute marital status can
take on the values single, married, divorced, and widowed. Both hair color and marital status are nominal attributes.
Another example of a nominal attribute is occupation, with the values teacher, dentist, programmer, farmer, and so
on.
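Because nominal values behave like enumerations, a minimal Python sketch might model hair color as an Enum. The class name HairColor and the sample observations are assumptions made for illustration. Only equality tests and frequency counts (and hence the mode) are meaningful here; ordering is not.

```python
from collections import Counter
from enum import Enum


class HairColor(Enum):
    """Nominal attribute: the values are names with no inherent order."""
    BLACK = "black"
    BROWN = "brown"
    BLOND = "blond"
    RED = "red"
    AUBURN = "auburn"
    GRAY = "gray"
    WHITE = "white"


# Hypothetical observations for five person objects.
observed = [HairColor.BROWN, HairColor.BLACK, HairColor.BROWN,
            HairColor.RED, HairColor.BROWN]

# Equality and counting are meaningful for nominal values.
counts = Counter(observed)
mode_value, mode_count = counts.most_common(1)[0]

# Ordering is not defined: plain Enum members do not support "<".
try:
    HairColor.BLACK < HairColor.BROWN
    ordering_supported = True
except TypeError:
    ordering_supported = False

print(mode_value.value, mode_count, ordering_supported)
```

The mode is the only measure of central tendency that makes sense for this attribute type.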
Binary Attributes
A binary attribute is a nominal attribute with only two categories or states: 0 or 1, where 0 typically means that
the attribute is absent, and 1 means that it is present. Binary attributes are referred to as Boolean if the two states
correspond to true and false.
Example - Binary attributes. Given the attribute smoker describing a patient object, 1 indicates that the patient
smokes, while 0 indicates that the patient does not. Similarly, suppose the patient undergoes a medical test that has
two possible outcomes. The attribute medical test is binary, where a value of 1 means the result of the test for the
patient is positive, while 0 means the result is negative.
A binary attribute is symmetric if both of its states are equally valuable and carry the same weight; that is, there is
no preference on which outcome should be coded as 0 or 1. One such example could be the attribute gender having
the states male and female.
A binary attribute is asymmetric if the outcomes of the states are not equally important, such as the positive and
negative outcomes of a medical test for HIV. By convention, we code the most important outcome, which is
usually the rarest one, by 1 (e.g., HIV positive) and the other by 0 (e.g., HIV negative).
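The coding convention for asymmetric binary attributes can be sketched as follows. The function name encode_asymmetric and the sample data are hypothetical, and the sketch assumes that the rarer state is the important one, which the text describes as the usual case rather than a rule.

```python
from collections import Counter


def encode_asymmetric(values):
    """Map a two-state attribute to 0/1, coding the rarer state as 1.

    Assumption for this sketch: the rarer state is the important one.
    """
    counts = Counter(values)
    if len(counts) != 2:
        raise ValueError("expected exactly two distinct states")
    # most_common() sorts by descending frequency, so the last entry is rarest.
    rare_state = counts.most_common()[-1][0]
    return [1 if v == rare_state else 0 for v in values]


# Hypothetical test results for six patients.
results = ["negative", "negative", "positive",
           "negative", "negative", "negative"]
encoded = encode_asymmetric(results)
print(encoded)
```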
Ordinal Attributes
An ordinal attribute is an attribute with possible values that have a meaningful order or ranking among them, but
the magnitude between successive values is not known.
Example - Ordinal attributes. Suppose that drink size corresponds to the size of drinks available at a fast-food
restaurant. This ordinal attribute has three possible values: small, medium, and large. The values have a
meaningful sequence (which corresponds to increasing drink size), but the values alone do not tell us how much
larger one size is than another.
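A small sketch of how an ordinal attribute such as drink size might be handled in Python. The rank mapping SIZE_RANK and the list of orders are assumptions for illustration; the rank numbers carry order only, so sorting and the median are meaningful while differences between ranks are not.

```python
from statistics import median_low

# Assumed rank mapping: only the order of the ranks carries meaning.
SIZE_RANK = {"small": 0, "medium": 1, "large": 2}

# Hypothetical observed values for five orders.
orders = ["large", "small", "medium", "medium", "large"]

# Sorting and comparing by rank is meaningful for ordinal values.
sorted_orders = sorted(orders, key=SIZE_RANK.get)

# The median is well defined because it relies only on order.
rank_to_size = {rank: size for size, rank in SIZE_RANK.items()}
median_size = rank_to_size[median_low(SIZE_RANK[o] for o in orders)]

print(sorted_orders)
print(median_size)
```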
Numeric Attributes
A numeric attribute is quantitative; that is, it is a measurable quantity, represented in integer or real values.
Numeric attributes can be interval-scaled or ratio-scaled.
Interval-Scaled Attributes
Interval-scaled attributes are measured on a scale of equal-size units. The values of interval-scaled attributes
have order and can be positive, 0, or negative. Thus, in addition to providing a ranking of values, such attributes
allow us to compare and quantify the difference between values.
Example - Interval-scaled attributes. A temperature attribute is interval-scaled. Suppose that we have the
outdoor temperature value for a number of different days, where each day is an object. By ordering the values, we
obtain a ranking of the objects with respect to temperature. In addition, we can quantify the difference between
values. For example, a temperature of 20 °C is five degrees higher than a temperature of 15 °C. Calendar dates are
another example. For instance, the years 2002 and 2010 are eight years apart.
Ratio-Scaled Attributes
A ratio-scaled attribute is a numeric attribute with an inherent zero-point. That is, if a measurement is ratio-
scaled, we can speak of a value as being a multiple (or ratio) of another value. In addition, the values are ordered,
and we can also compute the difference between values, as well as the mean, median, and mode.
Example - Ratio-scaled attributes. Unlike temperatures in Celsius and Fahrenheit, the Kelvin (K) temperature
scale has what is considered a true zero-point (0 K = −273.15 °C): It is the point at which the particles that comprise
matter have zero kinetic energy. Other examples of ratio-scaled attributes include count attributes such as years of
experience (e.g., the objects are employees) and number of words (e.g., the objects are documents). Additional
examples include attributes to measure weight, height, and latitude and longitude coordinates.
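The difference between interval-scaled and ratio-scaled values can be checked with a short calculation. The sketch below uses two assumed sample temperatures. Differences agree on both scales, but a ratio is meaningful only on the Kelvin scale, which has a true zero-point.

```python
def celsius_to_kelvin(celsius):
    """Convert Celsius (interval-scaled) to Kelvin (ratio-scaled)."""
    return celsius + 273.15


# Assumed sample temperatures in degrees Celsius.
t1_c, t2_c = 10.0, 20.0

# Differences are meaningful on both scales and agree with each other.
diff_c = t2_c - t1_c
diff_k = celsius_to_kelvin(t2_c) - celsius_to_kelvin(t1_c)

# Ratios are meaningful only on the scale with a true zero-point.
naive_ratio = t2_c / t1_c
true_ratio = celsius_to_kelvin(t2_c) / celsius_to_kelvin(t1_c)

print(diff_c, round(diff_k, 2))
print(naive_ratio, round(true_ratio, 3))
```

The Celsius ratio suggests that 20 °C is "twice as warm" as 10 °C, while the Kelvin ratio shows the increase is only a few percent.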
A discrete attribute has a finite or countably infinite set of values, which may or may not be represented as
integers. The attributes hair color, smoker, medical test, and drink size each have a finite number of values, and so
are discrete.
An attribute is countably infinite if the set of possible values is infinite but the values can be put in a one-to-one
correspondence with natural numbers. For example, the attribute customer ID is countably infinite. The number of
customers can grow to infinity, but in reality, the actual set of values is countable. Zip codes are another example.
If an attribute is not discrete, it is continuous. In practice, real values are represented using a finite number of
digits. Continuous attributes are typically represented as floating-point variables.
OLAP OPERATIONS
CUBE
A cube is a structure that allows fast analysis of data. It overcomes the limitations of arranging data in a
relational database.
Data cubes are used to represent data that is too complex to be described by a table of columns and rows.
Most decision support systems use data in the form of a data cube, which can be 2-D, 3-D, or
higher-dimensional. Each dimension represents an attribute in the database, so a data cube typically has
three or more dimensions when it is used to represent or interpret data.
Each dimension of a data cube represents a characteristic of the database, and the data inside the cube supports
analysis along every possible dimension, thereby revealing trends. These cubes are used by
analysis systems, reporting systems, and data mining systems. Multidimensional databases use
cubes to represent data.
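As a rough illustration of the idea, a data cube can be sketched in Python as a mapping from one value per dimension to a measure. The dimension names, fact rows, and helper functions below are all hypothetical; real OLAP servers use far more elaborate storage.

```python
from collections import defaultdict

# Hypothetical fact rows: (quarter, item, city, units_sold).
facts = [
    ("Q1", "phone", "Delhi", 10),
    ("Q1", "laptop", "Delhi", 4),
    ("Q1", "phone", "Mumbai", 7),
    ("Q2", "phone", "Delhi", 12),
    ("Q2", "laptop", "Mumbai", 5),
]

DIMENSIONS = ("quarter", "item", "city")


def build_cube(rows):
    """Store the measure in a cell addressed by one value per dimension."""
    cube = defaultdict(int)
    for quarter, item, city, units in rows:
        cube[(quarter, item, city)] += units
    return dict(cube)


def aggregate(cube, keep):
    """Sum the measure over every dimension not named in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    result = defaultdict(int)
    for cell, units in cube.items():
        result[tuple(cell[p] for p in positions)] += units
    return dict(result)


cube = build_cube(facts)
by_item = aggregate(cube, keep=("item",))
by_quarter_city = aggregate(cube, keep=("quarter", "city"))
print(by_item)
print(by_quarter_city)
```

Each call to aggregate yields a lower-dimensional view of the same cube, which is what makes analysis along any chosen dimension possible.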
OLAP
Relational OLAP
OLAP
OLAP stands for online analytical processing, a style of computer processing that enables users to easily
select, extract, and view data from different points of view. OLAP allows users to analyze
database information from multiple database systems at one time.
Data is stored in multidimensional databases, which are used for data warehouses.
OLAP cubes are created from existing databases, whereas in a relational database we use queries.
OLAP is used for data mining, and all OLAP products are designed for multiple-user
environments.
A few OLAP servers are:
Hyperion
Nigel Pendse has suggested that an alternative and perhaps more descriptive term to describe the
concept of OLAP is Fast Analysis of Shared Multidimensional Information (FASMI).
The first product that performed OLAP queries was Express, which was released in 1970 (and
acquired by Oracle in 1995 from Information Resources). However, the term did not appear until
1993 when it was coined by Ted Codd, who has been described as "the father of the relational
database".
The user‐initiated process of navigating by calling for page displays interactively, through the
specification of slices via rotations and drill down/up is sometimes called "slice and dice".
OLAP OPERATIONS
To perform operations on an OLAP cube, we need dimension tables and fact tables.
If the dimensions have a hierarchical structure, that structure is called a dimensional
schema.
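A minimal sketch of how a fact table and a dimension table with a hierarchy fit together. The table contents, the city to state to country hierarchy, and the function total_by_level are assumptions for illustration only.

```python
# Hypothetical dimension table: each row describes one location, with a
# hierarchy running from city up to country.
location_dim = {
    1: {"city": "Delhi", "state": "Delhi", "country": "India"},
    2: {"city": "Mumbai", "state": "Maharashtra", "country": "India"},
    3: {"city": "Toronto", "state": "Ontario", "country": "Canada"},
}

# Hypothetical fact table: foreign key into the dimension plus a measure.
sales_fact = [
    {"location_id": 1, "units": 10},
    {"location_id": 2, "units": 7},
    {"location_id": 3, "units": 5},
    {"location_id": 1, "units": 3},
]


def total_by_level(facts, dimension, level):
    """Join each fact to its dimension row and total the measure at `level`."""
    totals = {}
    for fact in facts:
        key = dimension[fact["location_id"]][level]
        totals[key] = totals.get(key, 0) + fact["units"]
    return totals


by_city = total_by_level(sales_fact, location_dim, "city")
by_country = total_by_level(sales_fact, location_dim, "country")
print(by_city)
print(by_country)
```

The fact table holds only keys and measures, while the descriptive attributes and their hierarchy live in the dimension table.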
ROLL UP/CONSOLIDATION
DRILL DOWN
SLICE
DICE
PIVOT
See also: https://www.tutorialride.com/data-mining/olap.htm
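The five operations named above (roll-up, drill-down, slice, dice, and pivot) can be sketched on a tiny in-memory data set. The records, field names, and the helpers group_sum and select are hypothetical, and the sketch uses plain Python rather than any particular OLAP product.

```python
from collections import defaultdict

# Hypothetical sales records: (year, quarter, item, city, units).
records = [
    (2023, "Q1", "phone", "Delhi", 10),
    (2023, "Q1", "laptop", "Mumbai", 4),
    (2023, "Q2", "phone", "Mumbai", 7),
    (2024, "Q1", "phone", "Delhi", 12),
    (2024, "Q1", "laptop", "Delhi", 5),
]
FIELDS = ("year", "quarter", "item", "city")


def group_sum(rows, by):
    """Total units for each combination of the fields named in `by`."""
    idx = [FIELDS.index(f) for f in by]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)


def select(rows, **allowed):
    """Keep rows whose fields fall in the allowed value sets."""
    return [r for r in rows
            if all(r[FIELDS.index(f)] in vals for f, vals in allowed.items())]


# Roll-up: climb the time hierarchy from quarter to year.
rolled_up = group_sum(records, by=("year",))
# Drill-down: return to the finer (year, quarter) level.
drilled_down = group_sum(records, by=("year", "quarter"))
# Slice: fix a single value on one dimension.
slice_2023 = select(records, year={2023})
# Dice: restrict two or more dimensions to chosen value sets.
diced = select(records, item={"phone"}, city={"Delhi", "Mumbai"}, year={2023})
# Pivot: swap the axes of a two-dimensional view.
item_by_city = group_sum(records, by=("item", "city"))
city_by_item = {(c, i): v for (i, c), v in item_by_city.items()}

print(rolled_up)
print(drilled_down)
print(len(slice_2023), len(diced))
print(city_by_item)
```

Roll-up and drill-down differ only in the level of the hierarchy used for grouping, while slice and dice differ only in how many dimensions are restricted.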
The most common architecture adopted by data warehouses is a three-tier architecture, whose tiers are as
follows:
Bottom
Middle
Top
BOTTOM TIER
The bottom tier consists of the data warehouse server, which is a relational database
system. The data warehouse server fetches only the relevant information based on the data mining
request.
Back-end tools are used to store or feed data into the bottom tier. The functions performed
by the back-end tools are extract, clean, transform, load, and refresh.
Extraction is the process of refining the data collected from different sources, such as the
internal database of the organization, external databases from various departments of the
institute, and other leading educational libraries in the city.
Two methods can be used for the extraction of the data from sources, viz.,
change-based extraction.
The entire process of extracting data from multiple sources, transforming it into a single
standard format, and finally loading it into the warehouse is referred to as the extraction,
transformation, and loading (ETL) process.
Operational Databases
Only relevant information is extracted, based on the data mining knowledge base. The
extracted information is subject oriented, integrated from multiple sources, time variant, and
nonvolatile.
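The extract, transform, and load sequence described above can be sketched in a few lines of Python. The two source layouts, the field names, and the assumed DD/MM/YYYY date format are hypothetical; the point is only that differently formatted sources end up in one standard format before loading.

```python
# Hypothetical source systems that store the same facts in different formats.
internal_db = [{"name": " Asha ", "joined": "2021-03-05"}]
external_db = [{"full_name": "RAVI", "join_date": "05/04/2022"}]


def extract():
    """Extract: pull raw rows from every source, tagged with their origin."""
    return ([("internal", r) for r in internal_db]
            + [("external", r) for r in external_db])


def transform(tagged_rows):
    """Clean and transform: map each source layout to one standard format."""
    standard = []
    for source, row in tagged_rows:
        if source == "internal":
            name, date = row["name"], row["joined"]
        else:
            name = row["full_name"]
            day, month, year = row["join_date"].split("/")
            date = f"{year}-{month}-{day}"  # assumed DD/MM/YYYY input
        standard.append({"name": name.strip().title(), "joined": date})
    return standard


def load(warehouse, rows):
    """Load: append to the warehouse; existing rows are never overwritten."""
    warehouse.extend(rows)
    return warehouse


warehouse = load([], transform(extract()))
print(warehouse)
```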
Data Marts
Data marts are subsets of the data warehouse, where the information in a data mart is confined to a specific
subject. Data marts can be categorized into two types.
Metadata Repository
The metadata repository helps us identify what is available in the data warehouse, such as the structure of the
warehouse, data set names, definitions, the algorithms used in cleaning, the source of the extracted data, and the
sequence of the extracted data.
Monitoring and Administration
Data Refreshment
Data source synchronization
Disaster recovery
Managing data growth and database performance
Controlling the number and range of queries
Limiting the size of the data warehouse
MIDDLE TIER
It presents users with multidimensional data from the data warehouse or data marts. It is typically
implemented using one of two models.
TOP TIER
The top tier is the presentation layer, which has reporting, analysis, and data mining tools. It acts as the
user interface between the data warehouse and the end user for querying, analysis, and
report generation.
OLTP Vs OLAP
OLAP | OLTP
Online analytical processing | Online transaction processing
Has a very low volume of transactions; queries are often complex and involve aggregation | Has a large volume of short online transactions such as insert, delete, and update
Response time is the key factor for an OLAP system; these systems are widely used in data mining techniques | Emphasis is on very fast query processing, maintaining data integrity, and ensuring a high number of transactions per second
Data comes from various OLTP systems and is called consolidation data | Data is the original source of data and is called operational data
Data reveals a multidimensional view of various kinds of business activities | Data reveals a snapshot of ongoing business processes
Processing speed depends on the volume of data involved; batch data refreshes and complex queries may take hours | Processing speed is very fast
Database design is denormalized with fewer tables, using star and snowflake schemas | Database design is highly normalized with many tables
The purpose is to help with planning, problem solving, and decision support | The purpose is to control and run the fundamental business tasks
Space requirement can be larger because of aggregation structures and historical data | Space requirement can be very small if historical data is archived
Data recovery or backup is done simply by reloading the OLTP data | Backup is taken at regular intervals because operational data is critical to running the business; if this data is lost, the result is monetary loss and legal liability
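The contrast between the two workloads can be illustrated with Python's built-in sqlite3 module. This is only a sketch: the sales table and its rows are hypothetical, and a single SQLite database stands in for what would normally be separate OLTP and OLAP systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)

# OLTP style: many short transactions that insert or update single rows.
for region, amount in [("north", 120.0), ("south", 80.0), ("north", 50.0)]:
    with conn:  # each block commits as its own transaction
        conn.execute("INSERT INTO sales (region, amount) VALUES (?, ?)",
                     (region, amount))
with conn:
    conn.execute("UPDATE sales SET amount = 90.0 WHERE id = 2")

# OLAP style: one read-only query that aggregates over many rows.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(totals)
conn.close()
```

The OLTP statements each touch one row and commit immediately, while the OLAP query reads the whole table and returns aggregated totals.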
A data warehouse has four defining characteristics:
1. Time variant.
2. Non volatile.
3. Integrated.
4. Subject oriented.
TIME VARIANT
A data warehouse is a time-variant database, which supports business management in
analysing the business and comparing it across different time periods such as year, quarter,
month, week, and date.
ATTRIBUTES OF TIME
DAY_NAME
DAY_NUMBER_IN_WEEK
DAY_NUMBER_IN_MONTH
DAY_NUMBER_IN_YEAR
WEEK_NUMBER_IN_MONTH
WEEK_NUMBER_IN_YEAR
MONTH_NUMBER
MONTH_YEAR
QUARTER_YEAR
QUARTER_NUMBER
YEAR
SESSION
WEEKEND_INDICATOR_FLAG
WEEKDAY_INDICATOR_FLAG
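Several of the time attributes listed above can be derived from a calendar date with Python's datetime module. The function time_attributes is a hypothetical helper covering only a subset of the list, and its conventions are assumptions: ISO weekday numbering (Monday = 1), ISO week of the year, and week of the month counted in blocks of seven days from day 1.

```python
import datetime


def time_attributes(day):
    """Derive time-dimension attributes for one calendar date.

    Conventions assumed here: ISO weekday numbering (Monday = 1), ISO week
    of year, and week-in-month counted in blocks of seven days from day 1.
    """
    weekday = day.isoweekday()
    quarter = (day.month - 1) // 3 + 1
    return {
        "DAY_NAME": day.strftime("%A"),
        "DAY_NUMBER_IN_WEEK": weekday,
        "DAY_NUMBER_IN_MONTH": day.day,
        "DAY_NUMBER_IN_YEAR": day.timetuple().tm_yday,
        "WEEK_NUMBER_IN_MONTH": (day.day - 1) // 7 + 1,
        "WEEK_NUMBER_IN_YEAR": day.isocalendar()[1],
        "MONTH_NUMBER": day.month,
        "QUARTER_NUMBER": quarter,
        "YEAR": day.year,
        "WEEKEND_INDICATOR_FLAG": weekday >= 6,
        "WEEKDAY_INDICATOR_FLAG": weekday < 6,
    }


row = time_attributes(datetime.date(2024, 3, 16))
print(row)
```

A warehouse would typically precompute one such row per date so that queries can group by any of these attributes without recalculating them.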
NON VOLATILE
A data warehouse is a non-volatile database: once data is entered into it, the data does not reflect
changes that later take place in the operational database. Hence the data in a data warehouse is static.
INTEGRATED DATABASE
A data warehouse is an integrated database, which allows you to collect data from multiple database
sources and integrate it.
SUBJECT ORIENTED
A data warehouse is a subject-oriented database, which supports the business needs of users in an individual
department.
Example: Sales, HR, Accounts, Marketing, etc.