
Practical :- 1
Aim: Overview of SQL Server 2005 Analysis Services.
Software Required: Analysis Services - SQL Server 2005.

Knowledge Required: Data Mining Concepts
Theory/Logic:
Data mining is the act of excavating data from which patterns can be extracted. An alternative name is knowledge discovery in databases (KDD). It draws on multiple disciplines: databases, statistics, and artificial intelligence. It is a rapidly maturing technology with nearly unlimited applicability.
[Figure: the data mining process, managed by the Data Mining Management System (DMMS) - define a model, train the model with training data, test the model with test data, and make predictions using the model with prediction input data.]
Figure 1: Data mining process

Data Mining Tasks - Summary: Classification, Regression, Segmentation, Association Analysis, Anomaly Detection, Sequence Analysis, Time-series Analysis, Text Categorization, Advanced Insights Discovery, and others.

The data mining tutorial is designed to walk you through the process of creating data mining models in Microsoft SQL Server 2005. The data mining algorithms and tools in SQL Server 2005 make it easy to build a comprehensive solution for a variety of projects, including market basket analysis, forecasting analysis, and targeted mailing analysis. The scenarios for these solutions are explained in greater detail later in the tutorial.

The most visible components in SQL Server 2005 are the workspaces that you use to create and work with data mining models. The online analytical processing (OLAP) and data mining tools are consolidated into two working environments: Business Intelligence Development Studio and SQL Server Management Studio. Using Business Intelligence Development Studio, you can develop an Analysis Services project disconnected from the server. When the project is ready, you can deploy it to the server. You can also work directly against the server. The main function of SQL Server Management Studio is to manage the server. Each environment is described in more detail later in this introduction.

All of the data mining tools exist in the data mining editor. Using the editor you can manage mining models, create new models, view models, compare models, and create predictions based on existing models. After you build a mining model, you will want to explore it, looking for interesting patterns and rules. Each mining model viewer in the editor is customized to explore models built with a specific algorithm. Often your project will contain several mining models, so before you can use a model to create predictions, you need to be able to determine which model is the most accurate. For this reason, the editor contains a model comparison tool called the Mining Accuracy Chart tab. Using this tool you can compare the predictive accuracy of your models and determine the best model.

To create predictions, you will use the Data Mining Extensions (DMX) language. DMX extends SQL and contains commands to create, modify, and predict against mining models. Because creating a prediction can be complicated, the data mining editor contains a tool called Prediction Query Builder, which allows you to build queries using a graphical interface. You can also view the DMX code that is generated by the query builder.

The key to creating a mining model is the data mining algorithm. The algorithm finds patterns in the data that you pass it and translates them into a mining model; it is the engine behind the process. SQL Server 2005 includes nine algorithms:
1. Microsoft Decision Trees
2. Microsoft Clustering
3. Microsoft Naive Bayes
4. Microsoft Sequence Clustering
5. Microsoft Time Series
6. Microsoft Association
7. Microsoft Neural Network
8. Microsoft Linear Regression
9. Microsoft Logistic Regression

Using a combination of these nine algorithms, you can create solutions to common business problems. Some of the most important steps in creating a data mining solution are consolidating, cleaning, and preparing the data to be used to create the mining models. SQL Server 2005 includes the Integration Services (SSIS) working environment, which contains tools that you can use to clean, validate, and prepare your data.

The audience for this tutorial is business analysts, developers, and database administrators who have used data mining tools before and are familiar with data mining concepts.
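To give a feel for what a DMX prediction query looks like, here is a minimal sketch. The model name ([Targeted Mailing Model]), the data source ([Adventure Works DW]), and the input columns are illustrative assumptions rather than objects created in this practical; Prediction Query Builder generates statements of this general shape for you.

// Score a list of prospects against a trained mining model (illustrative names).
SELECT
  t.[CustomerKey],
  Predict([Bike Buyer]) AS [Predicted Buyer],
  PredictProbability([Bike Buyer], 1) AS [Probability Of Purchase]
FROM
  [Targeted Mailing Model]
PREDICTION JOIN
  OPENQUERY([Adventure Works DW],
    'SELECT CustomerKey, Age, Gender, YearlyIncome FROM dbo.ProspectiveBuyer') AS t
ON
  [Targeted Mailing Model].[Age] = t.[Age] AND
  [Targeted Mailing Model].[Gender] = t.[Gender] AND
  [Targeted Mailing Model].[Yearly Income] = t.[YearlyIncome]

The PREDICTION JOIN maps the columns of new, unseen rows onto the columns the model was trained with, and the prediction functions return the model's output for each row.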

Business Intelligence Development Studio


Business Intelligence Development Studio is a set of tools designed for creating business intelligence projects. Because Business Intelligence Development Studio was created as an integrated development environment (IDE) in which you can create a complete solution, you work disconnected from the server. You can change your data mining objects as much as you want, but the changes are not reflected on the server until after you deploy the project.

Working in an IDE is beneficial for the following reasons: You have powerful customization tools available to configure Business Intelligence Development Studio to suit your needs. You can integrate your Analysis Services project with a variety of other business intelligence projects encapsulating your entire solution into a single view. Full source control integration enables your entire team to collaborate in creating a complete business intelligence solution. The Analysis Services project is the entry point for a business intelligence solution. An Analysis Services project encapsulates mining models and OLAP cubes, along with supplemental objects that make up the Analysis Services database. From Business Intelligence Development Studio, you can create and edit Analysis Services objects within a project and deploy the project to the appropriate Analysis Services server or servers. Working with Data Mining Data mining gives you access to the information that you need to make intelligent decisions about difficult business problems. Microsoft SQL Server 2005 Analysis Services (SSAS) provides tools for data mining with which you can identify rules and patterns in your data, so that you can determine why things happen and predict what will happen in the future. When you create a data mining solution in Analysis Services, you first create a model that describes your business problem, and then you run your data through an algorithm that generates a mathematical model of the data, a process that is known as training the model. You can then either visually explore the mining model or
create prediction queries against it. Analysis Services can use datasets from both relational and OLAP databases, and includes a variety of algorithms that you can use to investigate that data. SQL Server 2005 provides different environments and tools that you can use for data mining. The following sections outline a typical process for creating a data mining solution, and identify the resources to use for each step. Creating an Analysis Services Project To create a data mining solution, you must first create a new Analysis Services project, and then add and configure a data source and a data source view for the project. The data source defines the connection string and authentication information with which to connect to the data source on which to base the mining model. The data source view provides an abstraction of the data source, which you can use to modify the structure of the data to make it more relevant to your project. Adding Mining Structures to an Analysis Services Project After you have created an Analysis Services project, you can add mining structures, and one or more mining models that are based on each structure. A mining structure, including tables and columns, is derived from an existing data source view or OLAP cube in the project. Adding a new mining structure starts the Data Mining Wizard, which you use to define the structure and to specify an algorithm and training data for use in creating an initial model based on that structure. You can use the Mining Structure tab of Data Mining Designer to modify existing mining structures, including adding columns and nested tables. Working with Data Mining Models Before you can use the mining models you define, you must process them so that Analysis Services can pass the training data through the algorithms to fill the models. Analysis Services provides several options for processing mining model objects, including the ability to control which objects are processed and how they are processed. After you have processed the models, you can investigate the results and make decisions about which models perform the best. Analysis Services provides viewers for each mining model type, within the Mining Model Viewer tab in Data Mining Designer, which you can use to explore the mining models. Analysis Services also provides tools,
in the Mining Accuracy Chart tab of the designer, that you can use to directly compare mining models and to choose the mining model that works best for your purpose. These tools include a lift chart, a profit chart, and a classification matrix. Creating Predictions The main goal of most data mining projects is to use a mining model to create predictions. After you explore and compare mining models, you can use one of several tools to create predictions. Analysis Services provides a query language called Data Mining Extensions (DMX) that is the basis for creating predictions. To help you build DMX prediction queries, SQL Server provides a query builder, available in SQL Server Management Studio and Business Intelligence Development Studio, and DMX templates for the query editor in Management Studio. Within BI Development Studio, you access the query builder from the Mining Model Prediction tab of Data Mining Designer. SQL Server Management Studio After you have used BI Development Studio to build mining models for your data mining project, you can manage and work with the models and create predictions in Management Studio. SQL Server Reporting Services After you create a mining model, you may want to distribute the results to a wider audience. You can use Report Designer in Microsoft SQL Server 2005 Reporting Services (SSRS) to create reports, which you can use to present the information that a mining model contains. You can use the result of any DMX query as the basis of a report, and can take advantage of the parameterization and formatting features that are available in Reporting Services. Working Programmatically with Data Mining Analysis Services provides several tools that you can use to programmatically work with data mining. The Data Mining Extensions (DMX) language provides statements that you can use to create, train, and use data mining models. You can also perform these tasks by using a combination of XML for Analysis (XMLA) and Analysis Services Scripting Language (ASSL), or by using Analysis Management Objects (AMO).

You can access all the metadata that is associated with data mining by using data mining schema rowsets. For example, you can use schema rowsets to determine the data types that an algorithm supports, or the model names that exist in a database.

Data Mining Concepts

Data mining is frequently described as "the process of extracting valid, authentic, and actionable information from large databases." In other words, data mining derives patterns and trends that exist in data. These patterns and trends can be collected together and defined as a mining model. Mining models can be applied to specific business scenarios, such as:

Forecasting sales.
Targeting mailings toward specific customers.
Determining which products are likely to be sold together.
Finding sequences in the order that customers add products to a shopping cart.

An important concept is that building a mining model is part of a larger process that includes everything from defining the basic problem that the model will solve, to deploying the model into a working environment. This process can be defined by using the following six basic steps:
1. Defining the Problem
2. Preparing Data
3. Exploring Data
4. Building Models
5. Exploring and Validating Models
6. Deploying and Updating Models
The following diagram describes the relationships between each step in the process, and the technologies in Microsoft SQL Server 2005 that you can use to complete each step.

Although the process that is illustrated in the diagram is circular, each step does not necessarily lead directly to the next step. Creating a data mining model is a dynamic and iterative process. After you explore the data, you may find that the data is insufficient to create the appropriate mining models, and that you therefore have to look for more data. You may build several models and realize that they do not answer the problem posed when you defined the problem, and that you therefore must redefine the problem. You may have to update the models after they have been deployed because more data has become available. It is therefore important to understand that creating a data mining model is a process, and that each step in the process may be repeated as many times as needed to create a good model.

SQL Server 2005 provides an integrated environment for creating and working with data mining models, called Business Intelligence Development Studio. The environment includes data mining algorithms and tools that make it easy to build a comprehensive solution for a variety of projects.

Defining the Problem

The first step in the data mining process, as highlighted in the following diagram, is to clearly define the business problem.

This step includes analyzing business requirements, defining the scope of the problem, defining the metrics by which the model will be evaluated, and defining the final objective for the data mining project. These tasks translate into questions such as the following:

What are you looking for?
Which attribute of the dataset do you want to try to predict?
What types of relationships are you trying to find?
Do you want to make predictions from the data mining model or just look for interesting patterns and associations?
How is the data distributed?
How are the columns related, or if there are multiple tables, how are the tables related?

To answer these questions, you may have to conduct a data availability study, to investigate the needs of the business users with regard to the available data. If the data does not support the needs of the users, you may have to redefine the project.

Preparing Data

The second step in the data mining process, as highlighted in the following diagram, is to consolidate and clean the data that was identified in the Defining the Problem step.

Microsoft SQL Server 2005 Integration Services (SSIS) contains all the tools that you need to complete this step, including transforms to automate data cleaning and consolidation. Data can be scattered across a company and stored in different formats, or may contain inconsistencies such as flawed or missing entries. For example, the data might show that a customer bought a product before that customer was actually even born, or that the customer shops regularly at a store located 2,000 miles from her home. Before you start to build models, you must fix these problems. Typically, you are working with a very large dataset and cannot look through every transaction. Therefore, you have to use some form of automation, such as in Integration Services, to explore the data and find the inconsistencies.

Exploring Data

The third step in the data mining process, as highlighted in the following diagram, is to explore the prepared data.

You must understand the data in order to make appropriate decisions when you create the models. Exploration techniques include calculating the minimum and maximum values, calculating mean and standard deviations, and looking at the distribution of the data. After you explore the data, you can decide if the dataset contains flawed data, and then you can devise a strategy for fixing the problems. Data Source View Designer in BI Development Studio contains several tools that you can use to explore data.

Building Models

The fourth step in the data mining process, as highlighted in the following diagram, is to build the mining models.

Before you build a model, you must randomly separate the prepared data into separate training and testing datasets. You use the training dataset to build the model, and the testing dataset to test the accuracy of the model by creating prediction queries. You can use the Percentage Sampling Transformation in Integration Services to split the dataset. You will use the knowledge that you gain from the Exploring Data step to help define and create a mining model. A model typically contains input columns, an identifying column, and a predictable column. You can then define these columns in a new model by using the Data Mining Extensions (DMX) language, or the Data Mining Wizard in BI Development Studio. After you define the structure of the mining model, you process it, populating the empty structure with the patterns that describe the model. This is known as training the model. Patterns are found by passing the original data through a mathematical algorithm. SQL Server 2005 contains a different algorithm for each type of model that you can build. You can use parameters to adjust each algorithm. A mining model is defined by a data mining structure object, a data mining model object, and a data mining algorithm. Microsoft SQL Server 2005 Analysis Services (SSAS) includes the following algorithms:

Microsoft Decision Trees Algorithm
Microsoft Clustering Algorithm
Microsoft Naive Bayes Algorithm
Microsoft Association Algorithm
Microsoft Sequence Clustering Algorithm
Microsoft Time Series Algorithm
Microsoft Neural Network Algorithm (SSAS)
Microsoft Logistic Regression Algorithm
Microsoft Linear Regression Algorithm
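As a rough illustration of how defining the model columns and training the model map to DMX, the following sketch creates and trains a decision tree model. The model name, column names, and the dbo.vTargetMail source view are assumptions for illustration only; in the practicals the Data Mining Wizard performs the equivalent work for you. The two statements are run separately.

// Define the model: a key column, input columns, and a predictable column.
CREATE MINING MODEL [Bike Buyer Tree]
(
    [Customer Key]  LONG   KEY,
    [Age]           LONG   CONTINUOUS,
    [Gender]        TEXT   DISCRETE,
    [Yearly Income] DOUBLE CONTINUOUS,
    [Bike Buyer]    LONG   DISCRETE PREDICT
)
USING Microsoft_Decision_Trees

// Train (process) the model by passing the training data through the algorithm.
INSERT INTO [Bike Buyer Tree]
    ([Customer Key], [Age], [Gender], [Yearly Income], [Bike Buyer])
OPENQUERY([Adventure Works DW],
    'SELECT CustomerKey, Age, Gender, YearlyIncome, BikeBuyer FROM dbo.vTargetMail')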

Exploring and Validating Models

The fifth step in the data mining process, as highlighted in the following diagram, is to explore the models that you have built and test their effectiveness.

You do not want to deploy a model into a production environment without first testing how well the model performs. Also, you may have created several models and will have to decide which model will perform the best. If none of the models that you created in the Building Models step perform well, you may have to return to a previous step in the process, either by redefining the problem or by reinvestigating the data in the original dataset. You can explore the trends and patterns that the algorithms discover by using the viewers in Data Mining Designer in BI Development Studio. You can also test how well the models create predictions by using tools in the designer such as the lift chart and classification matrix. These tools require the testing data that you separated from the original dataset in the model-building step.

Deploying and Updating Models

The last step in the data mining process, as highlighted in the following diagram, is to deploy to a production environment the models that performed the best.

After the mining models exist in a production environment, you can perform many tasks, depending on your needs. Following are some of the tasks you can perform:

Use the models to create predictions, which you can then use to make business decisions. SQL Server provides the DMX language that you can use to create prediction queries, and Prediction Query Builder to help you build the queries.

Embed data mining functionality directly into an application. You can include Analysis Management Objects (AMO) or an assembly that contains a set of objects that your application can use to create, alter, process, and delete mining structures and mining models. Alternatively, you can send XML for Analysis (XMLA) messages directly to an instance of Analysis Services.

Use Integration Services to create a package in which a mining model is used to intelligently separate incoming data into multiple tables. For example, if a database is continually updated with potential customers, you could use a mining model together with Integration Services to split the incoming data into customers who are likely to purchase a product and customers who are likely to not purchase a product.

Create a report that lets users directly query against an existing mining model. Updating the model is part of the deployment strategy. As more data comes into the organization, you must reprocess the models, thereby improving their effectiveness.
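Because a Reporting Services dataset can be based on any DMX query, even a simple model-content query can drive such a report. The sketch below reuses the illustrative model name from earlier; the columns are standard mining model content columns.

// List the discovered nodes (splits and rules) of a mining model for reporting.
SELECT NODE_CAPTION, NODE_TYPE, NODE_SUPPORT, NODE_DESCRIPTION
FROM [Bike Buyer Tree].CONTENT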

Practical :- 2
Aim: Design and create a cube by identifying measures and dimensions for a star schema and a snowflake schema.
Software Required: Analysis Services - SQL Server 2005.
Knowledge Required: Data cube
Theory/Logic:

Creating a Data Cube


To build a new data cube using BIDS, you need to perform these steps:
1. Create a new Analysis Services project
2. Define a data source
3. Define a data source view
4. Invoke the Cube Wizard
We'll look at each of these steps in turn.

Creating a New Analysis Services Project


To create a new Analysis Services project, you use the New Project dialog box in BIDS. This is very similar to creating any other type of new project in Visual Studio. To create a new Analysis Services project, follow these steps:
1. Select Microsoft SQL Server 2005 > SQL Server Business Intelligence Development Studio from the Programs menu to launch Business Intelligence Development Studio.
2. Select File > New Project.
3. In the New Project dialog box, select the Business Intelligence Projects project type.
4. Select the Analysis Services Project template.
5. Name the new project AdventureWorksCube1 and select a convenient location to save it.
6. Click OK to create the new project.
Figure 2-1 shows the Solution Explorer window of the new project, ready to be populated with objects.

Figure 2-1: New Analysis Services project

Defining a Data Source


To define a data source, you'll use the Data Source Wizard. You can launch this wizard by right-clicking on the Data Sources folder in your new Analysis Services project. The wizard will walk you through the process of defining a data source for your cube, including choosing a connection and specifying security credentials to be used to connect to the data source. To define a data source for the new cube, follow these steps:
1. Right-click on the Data Sources folder in Solution Explorer and select New Data Source.
2. Read the first page of the Data Source Wizard and click Next.
3. You can base a data source on a new or an existing connection. Because you don't have any existing connections, click New.
4. In the Connection Manager dialog box, select the server containing your Analysis Services sample database from the Server Name combo box.
5. Fill in your authentication information.
6. Select the Native OLE DB\SQL Native Client provider (this is the default provider).
7. Select the AdventureWorksDW database. Figure 2-2 shows the filled-in Connection Manager dialog box.

8. Click OK to dismiss the Connection Manager dialog box. 9. Click Next.

Figure 2-2: Setting up a connection

10. Select Default impersonation information to use the credentials you just supplied for the connection and click Next.
11. Accept the default data source name and click Finish.

Defining a Data Source View


A data source view is a persistent set of tables from a data source that supply the data for a particular cube. BIDS also includes a wizard for creating data source views, which you can invoke by right-clicking on the Data Source Views folder in Solution Explorer.

To create a new data source view, follow these steps:
1. Right-click on the Data Source Views folder in Solution Explorer and select New Data Source View.
2. Read the first page of the Data Source View Wizard and click Next.
3. Select the Adventure Works DW data source and click Next. Note that you could also launch the Data Source Wizard from here by clicking New Data Source.
4. Select the dbo.FactFinance table in the Available Objects list and click the button to move it to the Included Objects list. This will be the fact table in the new cube.
5. Click the Add Related Tables button to automatically add all of the tables that are directly related to the dbo.FactFinance table. These will be the dimension tables for the new cube. Figure 2-3 shows the wizard with all of the tables selected.
6. Click Next.
7. Name the new view Finance and click Finish. BIDS will automatically display the schema of the new data source view, as shown in Figure 2-4.

Figure 2-3: Selecting tables for the data source view


Figure 2-4: The Finance data source view

Invoking the Cube Wizard


As you can probably guess at this point, you invoke the Cube Wizard by right-clicking on the Cubes folder in Solution Explorer. The Cube Wizard interactively explores the structure of your data source view to identify the dimensions, levels, and measures in your cube. To create the new cube, follow these steps:
1. Right-click on the Cubes folder in Solution Explorer and select New Cube.
2. Read the first page of the Cube Wizard and click Next.
3. Select the option to build the cube using a data source.
4. Check the Auto Build checkbox.
5. Select the option to create attributes and hierarchies.
6. Click Next.

7. Select the Finance data source view and click Next.
8. Wait for the Cube Wizard to analyze the data and then click Next.
9. The Wizard will get most of the analysis right, but you can fine-tune it a bit. Select DimTime in the Time Dimension combo box. Uncheck the Fact checkbox on the line for the dbo.DimTime table. This will allow you to analyze this dimension using standard time periods.
10. Click Next.
11. On the Select Time Periods page, use the combo boxes to match time property names to time columns according to Table 2-1.

Table 2-1: Time columns for Finance cube

12. Click Next.
13. Accept the default measures and click Next.
14. Wait for the Cube Wizard to detect hierarchies and then click Next.
15. Accept the default dimension structure and click Next.
16. Name the new cube FinanceCube and click Finish.

Deploying and Processing a Cube


At this point, you've defined the structure of the new cube - but there's still more work to be done. You still need to deploy this structure to an Analysis Services server and then process the cube to create the aggregates that make querying fast and easy. To deploy the cube you just created, select Build > Deploy AdventureWorksCube1. This will deploy the cube to your local Analysis Server, and also process the cube, building the aggregates for you. BIDS will open the Deployment Progress window, as shown in Figure 2-5, to keep you informed during deployment and processing.

Figure 2-5: Deploying a cube

Exploring a Data Cube


At last you're ready to see what all the work was for. BIDS includes a built-in Cube Browser that lets you interactively explore the data in any cube that has been deployed and processed. To open the Cube Browser, right-click on the cube in Solution Explorer and select Browse. Figure 2-6 shows the default state of the Cube Browser after it's just been opened.

The Cube Browser is a drag-and-drop environment. If you've worked with pivot tables in Microsoft Excel, you should have no trouble using the Cube Browser. The pane to the left includes all of the measures and dimensions in your cube, and the pane to the right gives you drop targets for these measures and dimensions. Among other operations, you can:

Figure 2-6: The cube browser in BIDS

Drop a measure in the Totals/Detail area to see the aggregated data for that measure.
Drop a dimension or level in the Row Fields area to summarize by that level or dimension on rows.
Drop a dimension or level in the Column Fields area to summarize by that level or dimension on columns.

Drop a dimension or level in the Filter Fields area to enable filtering by members of that dimension or level.
Use the controls at the top of the report area to select additional filtering expressions.

To see the data in the cube you just created, follow these steps:
1. Right-click on the cube in Solution Explorer and select Browse.
2. Expand the Measures node in the metadata panel (the area at the left of the user interface).
3. Expand the Fact Finance node.
4. Drag the Amount measure and drop it on the Totals/Detail area.
5. Expand the Dim Account node in the metadata panel.
6. Drag the Account Description property and drop it on the Row Fields area.
7. Expand the Dim Time node in the metadata panel.
8. Drag the Calendar Year-Calendar Quarter-Month Number of Year hierarchy and drop it on the Column Fields area.
9. Click the + sign next to year 2001 and then the + sign next to quarter 3.
10. Expand the Dim Scenario node in the metadata panel.
11. Drag the Scenario Name property and drop it on the Filter Fields area.
12. Click the dropdown arrow next to scenario name. Uncheck all of the checkboxes except for the one next to the Budget name.
Figure 2-7 shows the result. The Cube Browser displays month-by-month budgets by account for the third quarter of 2001. Although you could have written queries to extract this information from the original source data, it's much easier to let Analysis Services do the heavy lifting for you.

Figure 2-7: Exploring cube data in the cube browser
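For comparison, a query that retrieves roughly the same information with MDX might look like the sketch below. It assumes the dimension, hierarchy, and member names that the Cube Wizard generated (for example [Dim Account], [Dim Scenario], and a Budget member); check the metadata panel and adjust the names to match your cube.

-- Budget amounts by account; rows and columns mirror the browser layout.
SELECT
  NON EMPTY [Dim Time].[Month Number Of Year].Members ON COLUMNS,
  NON EMPTY [Dim Account].[Account Description].Members ON ROWS
FROM [FinanceCube]
WHERE ( [Measures].[Amount], [Dim Scenario].[Scenario Name].[Budget] )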

Questions
1. Create a data cube, based on the data in the AdventureWorksDW sample database, to answer the following question: What were the internet sales by country and product name for married customers only?

Practical :- 3
Aim: Design and create a cube by identifying measures and dimensions; design storage for the cube using the storage modes MOLAP, ROLAP and HOLAP.
Software Required: Analysis Services - SQL Server 2005.
Knowledge Required: MOLAP, ROLAP, HOLAP
Theory/Logic:

Partition Storage (SSAS)


Physical storage options affect the performance, storage requirements, and storage locations of partitions and their parent measure groups and cubes. A partition can have one of three basic storage modes:
Multidimensional OLAP (MOLAP)
Relational OLAP (ROLAP)
Hybrid OLAP (HOLAP)

Microsoft SQL Server 2005 Analysis Services (SSAS) supports all three basic storage modes. It also supports proactive caching, which enables you to combine the characteristics of ROLAP and MOLAP storage for both immediacy of data and query performance. You can configure the storage mode and proactive caching options in one of three ways.

Storage Configuration Method: Description
Storage Settings dialog: You can configure storage settings for a partition or configure default storage settings for a measure group.
Storage Design Wizard: You can configure storage settings for a partition at the same time that you design aggregations. You can also define a filter to restrict the source data that is read into the partition using any of the three storage modes.
Usage-Based Optimization Wizard: You can select a storage mode and optimize aggregation design based on queries that have been sent to the cube.

MOLAP The MOLAP storage mode causes the aggregations of the partition and a copy of its source data to be stored in a multidimensional structure in Analysis Services, which structure is highly optimized to maximize query performance. This can be storage on the computer where the partition is defined or on another Analysis Services computer. Storing data on the computer where the partition is defined creates a local partition. Storing data on another Analysis Services computer creates a remote partition. The multidimensional structure that stores the partition's data is located in a subfolder of the Data folder of the Analysis Services program files or another location specified during setup of Analysis Services. Because a copy of the source data resides in the Analysis Services data folder, queries can be resolved without accessing the partition's source data even when the results cannot be obtained from the partition's aggregations. The MOLAP storage mode provides the most rapid query response times, even without aggregations, but which can be improved substantially through the use of aggregations. As the source data changes, objects in MOLAP storage must be processed periodically to incorporate those changes. The time between one processing and the next creates a latency period during which data in OLAP objects may not match the current data. You can incrementally update objects in MOLAP storage without downtime. However, there may be some downtime required to process certain changes to OLAP objects, such as structural changes. You can minimize the downtime required to update MOLAP storage by updating and processing cubes on a staging server and using database synchronization to copy the processed objects to the production server. You can also use proactive caching to minimize latency and maximize availability while retaining much of the performance advantage of MOLAP storage. ROLAP The ROLAP storage mode causes the aggregations of the partition to be stored in tables in the relational database specified in the partition's data source. Unlike the MOLAP storage mode, ROLAP does not cause a copy of the source data to be stored in the Analysis Services data folders. When results cannot be derived from the aggregations or query cache, the fact table in the data source is accessed to answer queries. With the
ROLAP storage mode, query response is generally slower than that available with the MOLAP or HOLAP storage modes. Processing time is also typically slower. Real-time ROLAP is typically used when clients need to see changes immediately. No aggregations are stored with real-time ROLAP. ROLAP is also used to save storage space for large datasets that are infrequently queried, such as purely historical data.

Note: When using ROLAP, Analysis Services may return incorrect information related to the unknown member if a join is combined with a GROUP BY, which eliminates relational integrity errors rather than returning the unknown member value.

If a partition uses the ROLAP storage mode and its source data is stored in SQL Server 2005, Analysis Services attempts to create indexed views to contain the aggregations of the partition. If Analysis Services cannot create indexed views, it does not create aggregation tables. While Analysis Services handles the session requirements for creating indexed views on SQL Server 2005, the creation and use of indexed views for aggregations requires the following conditions to be met by the ROLAP partition and the tables in its schema:

The partition cannot contain measures that use the Min or Max aggregate functions.
Each table in the schema of the ROLAP partition must be used only once. For example, the schema cannot contain "dbo"."address" AS "Customer Address" and "dbo"."address" AS "SalesRep Address".
Each table must be a table, not a view.
All table names in the partition's schema must be qualified with the owner name, for example, "dbo"."customer".
All tables in the partition's schema must have the same owner; for example, you cannot have a FROM clause like "tk"."customer", "john"."store", or "dave"."sales_fact_2004".
The source columns of the partition's measures must not be nullable.
All tables used in the view must have been created with the following options set to ON:
o ANSI_NULLS
o QUOTED_IDENTIFIER
The total size of the index key, in SQL Server 2005, cannot exceed 900 bytes. SQL Server 2005 will assert this condition based on the fixed-length key columns when the CREATE INDEX statement is processed. However, if there are variable-length columns in the index key, SQL Server 2005 will also assert this condition for every update to the base tables. Because different aggregations have different view definitions, ROLAP processing using indexed views can succeed or fail depending on the aggregation design.
The session creating the indexed view must have the following options set to ON: ARITHABORT, CONCAT_NULL_YIELDS_NULL, QUOTED_IDENTIFIER, ANSI_NULLS, ANSI_PADDING, and ANSI_WARNINGS. This setting can be made in SQL Server Management Studio.
The session creating the indexed view must have the following option set to OFF: NUMERIC_ROUNDABORT. This setting can be made in SQL Server Management Studio.

HOLAP The HOLAP storage mode combines attributes of both MOLAP and ROLAP. Like MOLAP, HOLAP causes the aggregations of the partition to be stored in a multidimensional structure on an Analysis Services server computer. HOLAP does not cause a copy of the source data to be stored. For queries that access only summary data contained in the aggregations of a partition, HOLAP is the equivalent of MOLAP. Queries that access source data, such as a drilldown to an atomic cube cell for which there is no aggregation data, must retrieve data from the relational database and will not be as fast as if the source data were stored in the MOLAP structure. Partitions stored as HOLAP are smaller than equivalent MOLAP partitions and respond faster than ROLAP partitions for queries involving summary data. HOLAP storage mode is generally suitable for partitions in cubes that require rapid query response for summaries based on a large amount of source data. However, where users generate queries that must touch leaf level data, such as for calculating median values, MOLAP is generally a better choice.

Steps:
1. In the Analysis Services object explorer tree pane, expand the Cubes folder, right-click the created cube, and then click Properties.

2. In the property wizard, select Proactive Caching and then click the Options button.

3. Select MOLAP/HOLAP/ROLAP as your data storage type, and then click Next.

4. After setting the required parameters, click the OK button.
5. Then right-click the created cube and select Process.

Application: To analyze data for decision making.
Questions:
1. What is ROLAP?
2. What is MOLAP?
3. What is HOLAP?

Practical :- 4
Aim: Process cube and browse cube data:
a. By replacing a dimension in the grid, filtering and drilldown using cube browser.
b. Browse dimension data and view dimension members, member properties, member property values.
c. Create calculated member using arithmetic operators and member property of dimension member.
Software Required: Analysis Services - SQL Server 2005.
Knowledge Required: Data Mining Concepts
Theory/Logic:

Browsing Cube Data


Use the Browser tab of Cube Designer to browse cube data. You can use this view to examine the structure of a cube and to check data, calculation, formatting, and security of database objects. You can quickly examine a cube as end users see it in reporting tools or other client applications. When you browse cube data, you can view different dimensions, drill down into members, and slice through dimensions.

Before you browse a cube, you must process it. After you process it, open the Browser tab of Cube Designer. The Browser tab has three panes: the Metadata pane, the Filter pane, and the Data pane. Use the Metadata pane to examine the structure of the cube in tree format. Use the Filter pane at the top of the Browser tab to define any subcube you want to browse. Use the Data pane to examine the data and drill down through dimension hierarchies.

Setting up the Browser

To prepare to browse a cube, you can specify a perspective or translation that you want to use. You add measures and dimensions to the Data pane and specify any filters in the Filter pane.

Specifying a Perspective

Use the Perspective list to choose a perspective that is defined for the cube. Perspectives are defined in the Perspectives tab of Cube Designer. To switch to a different perspective, select any perspective in the list.

Specifying a Translation

Use the Language list to choose a translation that is defined for the cube. Translations are defined in the Translations tab of Cube Designer. The Browser tab initially shows captions for the default language, which is specified by the Language property for the cube. To switch to a different language, select any language in the list.

Formatting the Data Pane

You can format captions and cells in the Data pane. Right-click the selected cells or captions that you want to format, and then click Commands and Options. Depending on the selection, the Commands and Options dialog box displays settings for Format, Filter and Group, Report, and Behavior.

Setting up the Data

Adding or Removing Measures

Drag the measures you want to browse from the Metadata pane to the details area of the Data pane, which is labeled Drop Totals or Detail Fields Here. As you drag additional measures, they are added as columns in the details area. A vertical line indicates where each additional measure will drop. Dragging the Measures folder drops all the measures into the details area. To remove a measure from the details area, either drag it out of the Data pane, or right-click it and then click Remove Total on the shortcut menu.

Adding or Removing Dimensions

Drag the hierarchies or dimensions to the row or column field areas. To remove a dimension, either drag it out of the Data pane, or right-click it and then click Remove Field on the shortcut menu.

Adding or Removing Filters

You can use either of two methods to add a filter:

Expand a dimension in the Metadata pane, and then drag a hierarchy to the Filter pane. - or -

In the Dimension column of the Filter pane, click <Select dimension> and select a dimension from the list, then click <Select hierarchy> in the Hierarchy column and select a hierarchy from the list.

After you specify the hierarchy, specify the operator and filter expression. The following table describes the operators and filter expressions.

Operator: Equal
Filter Expression: One or more members
Description: Values must be equal to a specified member. (Provides multiple member selection for attribute hierarchies, other than parent-child hierarchies, and single member selection for other hierarchies.)

Operator: Not Equal
Filter Expression: One or more members
Description: Values must not equal a specified member. (Provides multiple member selection for attribute hierarchies, other than parent-child hierarchies, and single member selection for other hierarchies.)

Operator: In
Filter Expression: One or more named sets
Description: Values must be in a specified named set. (Supported for attribute hierarchies only.)

Operator: Not In
Filter Expression: One or more named sets
Description: Values must not be in a specified named set. (Supported for attribute hierarchies only.)

Operator: Range (Inclusive)
Filter Expression: One or two delimiting members of a range
Description: Values must be between or equal to the delimiting members. If the delimiting members are equal or only one member is specified, no range is imposed and all values are permitted. (Supported only for attribute hierarchies. The range must be on one level of a hierarchy. Unbounded ranges are not currently supported.)

Operator: Range (Exclusive)
Filter Expression: One or two delimiting members of a range
Description: Values must be between the delimiting members. If the delimiting members are equal or only one member is specified, values must be either greater than or less than the delimiting member. (Supported only for attribute hierarchies. The range must be on one level of a hierarchy. Unbounded ranges are not currently supported.)

Operator: MDX
Filter Expression: An MDX expression returning a member set
Description: Values must be in the calculated member set. Selecting this option displays the Calculated Member Builder dialog box for creating MDX expressions.
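For example, the MDX operator accepts any set expression built in the Calculated Member Builder, such as the following sketch, which keeps only the members of a hierarchy that exceed a sales threshold (the hierarchy and threshold are illustrative):

FILTER(
  [Customer].[Customer Geography].[Country].Members,
  [Measures].[Internet Sales-Sales Amount] > 1000000
)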

For user-defined hierarchies, in which multiple members may be specified in the filter expression, all the specified members must be at the same level and share the same parent. This restriction does not apply for parent-child hierarchies.

Working with Data

Drilling Down into a Member

To drill down into a particular member, click the plus sign (+) next to the member or double-click the member.

Slicing Through Cube Dimensions

To filter the cube data, click the drop-down list box on the top-level column heading. You can expand the hierarchy and select or clear a member on any level to show or hide it and all its descendants. Clear the check box next to (All) to clear all the members in the hierarchy. After you have set this filter on dimensions, you can toggle it on or off by right-clicking anywhere in the Data pane and clicking Auto Filter.

Filtering Data

You can use the filter area to define a subcube on which to browse. You can add a filter by either clicking a dimension in the Filter pane or by expanding a dimension in the Metadata pane and then dragging a hierarchy to the Filter pane. Then specify an Operator and Filter Expression.

Performing Actions

A triangle marks any heading or cell in the Data pane for which there is an action. This might apply for a dimension, level, member, or cube cell. Move the pointer over the heading object to see a list of available actions. Click the triangle in the cell to display a menu and start the associated process. For security, the Browser tab only supports the following actions:

URL
Rowset
Drillthrough

Command Line, Statement, and Proprietary actions are not supported. URL actions are only as safe as the Web page to which they link.

Viewing Member Properties and Cube Cell Information

To view information about a dimension object or cell value, move the pointer over the cell.

Showing or Hiding Empty Cells

You can hide empty cells in the data grid by right-clicking anywhere in the Data pane and clicking Show Empty Cells.

c. Create calculated member using arithmetic operators and member property of dimension member

Calculated members are members of a dimension or a measure group that are defined based on a combination of cube data, arithmetic operators, numbers, and functions. For example, you can create a calculated member that calculates the sum of two physical measures in the cube. Calculated member definitions are stored in cubes, but their values are calculated at query time. To create a calculated member, use the New Calculated Member command on the Calculations tab of Cube Designer. You can create a calculated member within any
dimension, including the measures dimension. You can also place a calculated member within a display folder in the Calculation Properties dialog box. In the tasks in this topic, you define calculated measures to let users view the gross profit margin percentage and sales ratios for Internet sales, reseller sales, and for all sales. To define calculations to aggregate physical measures 1. Open Cube Designer for the Analysis Services Tutorial cube, and then click the Calculations tab. Notice the default CALCULATE command in the Calculation Expressions pane and in the Script Organizer pane. This command specifies that the measures in the cube should be aggregated according to the value that is specified by their AggregateFunction properties. Measure values are generally summed, but may also be counted or aggregated in some other manner. The following image shows the Calculations tab of Cube Designer.

2. On the toolbar of the Calculations tab, click New Calculated Member.

A new form appears in the Calculation Expressions pane within which you define the properties of this new calculated member. The new member also appears in the Script Organizer pane. The following image shows the form that appears in the Calculation Expressions pane when you click New Calculated Member.

3. In the Name box, change the name of the calculated measure to [Total Sales Amount]. If the name of a calculated member contains a space, the calculated member name must be enclosed in square brackets. Notice in the Parent hierarchy list that, by default, a new calculated member is created in the Measures dimension. A calculated member in the Measures dimension is also frequently called a calculated measure. 4. On the Metadata tab in the Calculation Tools pane of the Calculations tab, expand Measures and then expand Internet Sales to view the metadata for the Internet Sales measure group.

You can drag metadata elements from the Calculation Tools pane into the Expression box and then add operators and other elements to create Multidimensional Expressions (MDX) expressions. Alternatively, you can type the MDX expression directly into the Expression box.

Note: If you cannot view any metadata in the Calculation Tools pane, click Reconnect on the toolbar. If this does not work, you may have to process the cube or start the instance of Analysis Services.

1. Drag Internet Sales-Sales Amount from the Metadata tab in the Calculation Tools pane into the Expression box in the Calculation Expressions pane.
2. In the Expression box, type a plus sign (+) after [Measures].[Internet Sales-Sales Amount].
3. On the Metadata tab in the Calculation Tools pane, expand Reseller Sales, and then drag Reseller Sales-Sales Amount into the Expression box in the Calculation Expressions pane after the plus sign (+).
4. In the Format string list, select "Currency".
5. In the Non-empty behavior list, select the check boxes for Internet Sales-Sales Amount and Reseller Sales-Sales Amount, and then click OK.
The measures you specify in the Non-empty behavior list are used to resolve NON EMPTY queries in MDX. When you specify one or more measures in the Non-empty behavior list, Analysis Services treats the calculated member as empty if all the specified measures are empty. If the Non-empty behavior property is blank, Analysis Services must evaluate the calculated member itself to determine whether the member is empty. The following image shows the Calculation Expressions pane populated with the settings that you specified in the previous steps.

6. On the toolbar of the Calculations tab, click Script View, and then review the calculation script in the Calculation Expressions pane. Notice that the new calculation is added to the initial CALCULATE expression; each individual calculation is separated by a semicolon. Notice also that a comment appears at the beginning of the calculation script. Adding comments within the calculation script for groups of calculations is a good practice, to help you and other developers understand complex calculation scripts. 7. Add a new line in the calculation script after the Calculate; command and before the newly added calculation script, and then add the following text to the script on its own line: /* Calculations to aggregate Internet Sales and Reseller Sales measures */ The following image shows the calculation scripts as they should appear in the Calculation Expressions pane at this point in the tutorial.

8. On the toolbar of the Calculations tab, click Form View, verify that [Total Sales Amount] is selected in the Script Organizer pane, and then click New Calculated Member. 9. Change the name of this new calculated member to [Total Product Cost], and then create the following expression in the Expression box: [Measures].[Internet Sales-Total Product Cost] + [Measures].[Reseller Sales-Total Product Cost] 10. In the Format string list, select "Currency". 11. In the Non-empty behavior list, select the check boxes for Internet Sales-Total Product Cost and Reseller Sales-Total Product Cost, and then click OK. You have now defined two calculated members, both of which are visible in the Script Organizer pane. These calculated members can be used by other calculations that you define later in the calculation script. You can view the definition of any calculated member by selecting the calculated member in the Script Organizer pane; the definition of the calculated member will appear in the Calculation Expressions pane in the Form view. Newly defined calculated members will not appear in the Calculation Tools pane until these objects have been deployed. Calculations do not require processing. Defining Gross Profit Margin Calculations To define gross profit margin calculations 1. Verify that [Total Product Cost] is selected in the Script Organizer pane, and then click New Calculated Member on the toolbar of the Calculations tab.

2. In the Name box, change the name of this new calculated measure to [Internet GPM].
3. In the Expression box, create the following MDX expression:
([Measures].[Internet Sales-Sales Amount] - [Measures].[Internet Sales-Total Product Cost]) / [Measures].[Internet Sales-Sales Amount]
4. In the Format string list, select "Percent".
5. In the Non-empty behavior list, select the check box for Internet Sales-Sales Amount, and then click OK.
6. On the toolbar of the Calculations tab, click New Calculated Member.
7. In the Name box, change the name of this new calculated measure to [Reseller GPM].
8. In the Expression box, create the following MDX expression:
([Measures].[Reseller Sales-Sales Amount] - [Measures].[Reseller Sales-Total Product Cost]) / [Measures].[Reseller Sales-Sales Amount]
9. In the Format string list, select "Percent".
10. In the Non-empty behavior list, select the check box for Reseller Sales-Sales Amount, and then click OK.
11. On the toolbar of the Calculations tab, click New Calculated Member.
12. In the Name box, change the name of this calculated measure to [Total GPM].
13. In the Expression box, create the following MDX expression:
([Measures].[Total Sales Amount] - [Measures].[Total Product Cost]) / [Measures].[Total Sales Amount]
Notice that this calculated member is referencing other calculated members. Because this calculated member will be calculated after the calculated members that it references, this is a valid calculated member.
14. In the Format string list, select "Percent".

15. In the Non-empty behavior list, select the check boxes for Internet Sales-Sales Amount and Reseller Sales-Sales Amount, and then click OK. 16. On the toolbar of the Calculations tab, click Script View and review the three calculations you just added to the calculation script. 17. Add a new line in the calculation script immediately before the [Internet GPM] calculation, and then add the following text to the script on its own line: /* Calculations to calculate gross profit margin */ The following image shows the Expressions pane with the three new calculations.
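In Script View, the calculations defined so far appear as MDX script statements. The sketch below shows approximately what the generated script looks like for two of the members; the exact text produced by the designer may differ slightly.

CALCULATE;

/* Calculations to aggregate Internet Sales and Reseller Sales measures */

CREATE MEMBER CURRENTCUBE.[Measures].[Total Sales Amount]
  AS [Measures].[Internet Sales-Sales Amount] + [Measures].[Reseller Sales-Sales Amount],
  FORMAT_STRING = "Currency",
  NON_EMPTY_BEHAVIOR = { [Internet Sales-Sales Amount], [Reseller Sales-Sales Amount] },
  VISIBLE = 1;

/* Calculations to calculate gross profit margin */

CREATE MEMBER CURRENTCUBE.[Measures].[Total GPM]
  AS ([Measures].[Total Sales Amount] - [Measures].[Total Product Cost]) / [Measures].[Total Sales Amount],
  FORMAT_STRING = "Percent",
  NON_EMPTY_BEHAVIOR = { [Internet Sales-Sales Amount], [Reseller Sales-Sales Amount] },
  VISIBLE = 1;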

Defining the Percent of Total Calculations To define the percent of total calculations 1. On the toolbar of the Calculations tab, click Form View. 2. In the Script Organizer pane, select [Total GPM], and then click New Calculated Member on the toolbar of the Calculations tab. Clicking the final calculated member in the Script Organizer pane before you click New Calculated Member guarantees that the new calculated member will be entered at the end of the script. Scripts execute in the order that they appear in the Script Organizer pane. 3. Change the name of this new calculated member to [Internet Sales Ratio to All Products].

4. Type the following expression in the Expression box: Case When IsEmpty( [Measures].[Internet Sales-Sales Amount] ) Then 0 Else ( [Product].[Product Categories].CurrentMember, [Measures].[Internet Sales-Sales Amount]) / ( [Product].[Product Categories].[(All)].[All], [Measures].[Internet Sales-Sales Amount] ) End This MDX expression calculates the contribution to total Internet sales of each product. The Case statement together with the IS EMPTY function ensures that a divide by zero error does not occur when a product has no sales. 5. In the Format string list, select "Percent". 6. In the Non-empty behavior list, select the check box for Internet Sales-Sales Amount, and then click OK. 7. On the toolbar of the Calculations tab, click New Calculated Member. 8. Change the name of this calculated member to [Reseller Sales Ratio to All Products]. 9. Type the following expression in the Expression box: Case When IsEmpty( [Measures].[Reseller Sales-Sales Amount] ) Then 0 Else ( [Product].[Product Categories].CurrentMember, [Measures].[Reseller Sales-Sales Amount]) / ( [Product].[Product Categories].[(All)].[All], [Measures].[Reseller Sales-Sales Amount] ) End 10. In the Format string list, select "Percent". 11. In the Non-empty behavior list, select the check box for Reseller Sales-Sales Amount, and then click OK. 12. On the toolbar of the Calculations tab, click New Calculated Member.

13. Change the name of this calculated member to [Total Sales Ratio to All Products]. 14. Type the following expression in the Expression box: Case When IsEmpty( [Measures].[Total Sales Amount] ) Then 0 Else ( [Product].[Product Categories].CurrentMember, [Measures].[Total Sales Amount]) / ( [Product].[Product Categories].[(All)].[All], [Measures].[Total Sales Amount] ) End 15. In the Format string list, select "Percent". 16. In the Non-empty behavior list, select the check boxes for Internet Sales-Sales Amount and Reseller Sales-Sales Amount, and then click OK. 17. On the toolbar of the Calculations tab, click Script View, and then review the three calculations that you just added to the calculation script. 18. Add a new line in the calculation script immediately before the [Internet Sales Ratio to All Products] calculation, and then add the following text to the script on its own line: /* Calculations to calculate percentage of product to total product sales */ You have now defined a total of eight calculated members, which are visible in the Script Organizer pane when you are in Form view.

Browsing the New Calculated Members To browse the new calculated members 1. On the Build menu of Business Intelligence Development Studio, click Deploy Analysis Services Tutorial. 2. When deployment has successfully completed, switch to the Browser tab, click Reconnect, and then remove all hierarchies and measures from the Data pane.

3. In the Metadata pane, expand Measures to view the new calculated members in the Measures dimension. 4. Add the Total Sales Amount, Internet Sales-Sales Amount, and Reseller Sales-Sales Amount measures to the data area, and then review the results. Notice that the Total Sales Amount measure is the sum of the Internet Sales-Sales Amount measure and the Reseller Sales-Sales Amount measure. 5. Add the Product Categories user-defined hierarchy to the filter area of the Data pane, and then filter the data by Mountain Bikes. Notice that the Total Sales Amount measure is calculated for the Mountain Bikes category of product sales based on the Internet Sales-Sales Amount and the Reseller Sales-Sales Amount measures for Mountain Bikes. 6. Add the Date.Calendar Time user-defined hierarchy to the row area, and then review the results. Notice that the Total Sales Amount measure for each calendar year is calculated for the Mountain Bikes category of product sales based on the Internet Sales-Sales Amount and the Reseller Sales-Sales Amount measures for Mountain Bikes. 7. Add the Total GPM, Internet GPM, and Reseller GPM measures to the data area, and then review the results. Notice that the gross profit margin for reseller sales is significantly lower than for sales over the Internet. Notice also that the gross profit margin on the sales of mountain bikes is increasing over time, as shown in the following image.

8. Add the Total Sales Ratio to All Products, Internet Sales Ratio to All Products, and Reseller Sales Ratio to All Products measures to the data area. Notice that the ratio of the sales of mountain bikes to all products has increased over time for Internet sales, but is decreasing over time for reseller sales. Notice also that the ratio of the sales of mountain bikes to all products is lower for sales through resellers than it is for sales over the Internet. 9. Change the filter from Mountain Bikes to Bikes, and review the results. Notice that the gross profit margin for all bikes sold through resellers is negative, because touring bikes and road bikes are being sold at a loss. 10. Change the filter to Accessories, and then review the results. Notice that the sale of accessories is increasing over time, but that these sales make up only a small fraction of total sales. Notice also that the gross profit margin for sales of accessories is higher than for bikes. 11. Expand CY 2004, expand H2 CY 2004, and then expand Q3 CY 2004. Notice that there are no Internet sales in this cube after July 2004, and no reseller sales after June 2004. These sales values have not yet been added from the source systems to the Adventure Works DW database.

Practical :- 5 Aim: Create and use Excel Pivot Table report based on data cube. Software Required: Analysis services- SQL Server-2005. Knowledge Required: Data Mining Concepts Theory/Logic: 1. Start Microsoft Excel. 2. When the blank spreadsheet appears, on the Data menu, click PivotTable and PivotChart Report. 3. The first step of the PivotTable and PivotChart Wizard opens. Click External data source, and then click Next.

4. In the second step of the wizard, click Get Data.

5. The Choose Data Source dialog box opens. Click the OLAP Cubes tab. Ensure that <New Data Source> is selected, and then click OK.

6. The Create a New Data Source dialog box opens. In the What name do you want to give to your data source? box, enter Remote internet connection to Sales. 7. In the Select an OLAP provider for the database you want to access box, click Microsoft OLE DB Provider for OLAP Services 8.0. Click Connect.

8. The Multidimensional Connection dialog box opens. Enter HTTP://Localhost (or HTTP:// your server TCP/IP address or name). This establishes an Internet connection to your Analysis server. Click Next.

9. The list of databases available in your Analysis server appears. Select the Tutorial database, and then click Finish. 10. In the Create New Data Source dialog box, in the Select the Cube that contains the data you want box, click Sales, and then click OK.

11. In the Choose Data Source dialog box, click OK. 12. In the second step of the wizard, click Next, and then click Finish.

13. You are returned to the Excel spreadsheet, where you can drag dimensions into columns and rows and analyze data through an Internet connection. Application: To access remote data. Advantage: Provides the facility to read and analyze remote data. Questions: 1. What is a pivot table? 2. Why do we need to access remote data?

Practical :- 6 Aim: Design and create data mining models using Analysis Service of SQL Server 2005. Software Required: Analysis services- SQL Server-2005. Knowledge Required: Data Mining Models Theory/Logic: The tutorial is broken up into three sections. 1. Preparing the SQL Server Database, 2. Preparing the Analysis Services Database, and 3. Building and Working with the Mining Models.

1. Preparing the SQL Server Database


The AdventureWorksDW database, which is the basis for this tutorial, is installed with SQL Server (not by default, but as an option at installation time) and already contains views that will be used to create the mining models. If it was not installed at installation time, you can add it by clicking the Change button in Control Panel > Add/Remove Programs > Microsoft SQL Server 2005.

2. Preparing the Analysis Services Database


Before you begin to create and work with mining models, you must perform the following tasks: 1. Create a new Analysis Services project. 2. Create a data source. 3. Create a data source view. a) Creating an Analysis Services Project Each Analysis Services project defines the schema for the objects in a single Analysis Services database. The Analysis Services database is defined by the mining models, OLAP cubes, and supplemental objects that it contains. To create an Analysis Services project 1. Open Business Intelligence Development Studio. 2. From the File menu, select New, and then Project.

3. Select Analysis Services Project as the type for the new project and name it AdventureWorks. 4. Click OK. The new project opens in Business Intelligence Development Studio. b) Creating a Data Source A data source is a data connection that is saved and managed within your project and deployed to your Analysis Services database. It contains the server name and database where your source data resides, as well as any other required connection properties. To create a data source 1. Right-click the Data Source project item in Solution Explorer and select New Data Source. 2. On the Welcome page, click Next. 3. Click New to add a connection to the AdventureWorksDW database. 4. The Connection Manager dialog box opens. In the Server name drop-down box, select the server where AdventureWorksDW is hosted (for example, localhost), enter your credentials, and then in the Select the database on the server drop-down box select the AdventureWorksDW database. 5. Click OK to close the Connection Manager dialog box. 6. Click Next. 7. By default the data source is named Adventure Works DW. Click Finish. The new data source, Adventure Works DW, appears in the Data Sources folder in Solution Explorer. c) Creating a Data Source View A data source view provides an abstraction of the data source, enabling you to modify the structure of the data to make it more relevant to your project. Using data source views, you can select only the tables that relate to your particular project, establish relationships between tables, and add calculated columns and named views without modifying the original data source. To create a data source view 1. In Solution Explorer, right-click Data Source View, and then click New Data Source View.

2. On the Welcome page, click Next. 3. The Adventure Works DW data source you created in the last step is selected by default in the Relational data sources window. Click Next. 4. If you want to create a new data source, click New Data Source to launch the Data Source Wizard. 5. Select the following tables and click the right arrow button to include them in the new data source view: vAssocSeqLineItems, vAssocSeqOrders, vTargetMail, and vTimeSeries. 6. Click Next. 7. By default the data source view is named Adventure Works DW. Click Finish. Data Source View Editor opens to display the Adventure Works DW data source view, as shown in Figure 2. Solution Explorer is also updated to include the new data source view.

Figure 2: Adventure Works DW data source view

3. Building and Working with the Mining Models


The data mining editor (shown in Figure 3) contains all of the tools and viewers that you will use to build and work with the mining models.

Figure 3: Data mining editor

Practical :- 7 Aim: Design and build targeted mailing data mining model using analysis service of SQL Server 2005 and compare their predictive capabilities using the Mining Accuracy Chart view, and create predictions using Prediction Query Builder. Software Required: Analysis Services - SQL Server 2005 Knowledge Required: Data Mining Models Theory/Logic: Targeted Mailing The marketing department of Adventure Works is interested in increasing sales by targeting specific customers for a mailing campaign. By investigating the attributes of known customers, they want to discover some kind of pattern that can be applied to potential customers, which can then be used to predict who is more likely to purchase a product from Adventure Works. Additionally, the marketing department wants to find any logical groupings of customers already in their database. For example, a grouping may contain customers with similar buying patterns and demographics. Adventure Works contains a list of past customers and a list of potential customers. Upon completion of this task, the marketing department will have the following: A set of mining models that will be able to suggest the most likely customers from a list of potential customers A clustering of their current customers

In order to complete the scenario, you will use the Microsoft Naïve Bayes, Microsoft Decision Trees, and Microsoft Clustering algorithms. The scenario consists of five tasks: Create the mining model structure. Create the mining models. Explore the mining models. Test the accuracy of the mining models. Create predictions from the mining models.

Create a Targeted Mailing Mining Model Structure Using the Wizard:


The first step is to use the Mining Model Wizard to create a new mining structure. The Mining Model Wizard also creates an initial mining model based on the Microsoft Decision Trees algorithm. To create the targeted mailing mining structure 1. In Solution Explorer, right-click Mining Structures, and then click New Mining Structure. The Mining Model Wizard opens. 2. On the Welcome page, click Next. 3. Click From existing relational database or data warehouse, and then click Next. 4. Under Which data mining technique do you want to use?, click Microsoft Decision Trees. You will create several models based on this initial structure, but the initial model is based on the Microsoft Decision Trees algorithm. 5. Click Next. By default the Adventure Works DW is selected in the Select Data Source View window. You may click Browse to view the tables in the data source view inside of the wizard. 6. Click Next. 7. Select the Case check box next to the vTargetMail table, and then click Next. 8. Select the Key check box next to the CustomerKey column. If the source table from the data source view indicates a key, the Mining Model Wizard automatically chooses that column as a key for the model. 9. Select the Input and Predictable check boxes next to the BikeBuyer column. This action enables the column for prediction in new datasets. When you indicate that a column is predictable, the Suggest button is enabled. Clicking Suggest opens the Suggest Related Column dialog box, which lists the columns that are most closely related to the predictable column.

The Suggest Related Columns dialog box orders the attributes by their correlation with the predictable attribute. Columns with a value higher than 0.05 are automatically selected to be included in the model. If you agree with the suggestion, click OK, which marks the selected columns as inputs in the wizard. If you don't agree, you can either modify the suggestion or click Cancel. 10. Select the Input check boxes next to the following columns: Age, CommuteDistance, EnglishEducation, EnglishOccupation, FirstName, Gender, YearlyIncome, HouseOwnerFlag, LastName, MaritalStatus, NumberCarsOwned, NumberChildrenAtHome, Region, and TotalChildren.

You can select multiple columns by using the SHIFT key. Selecting a check box within the selected area specifies the same selection for each column. 11. Click Next. 12. In Specify Columns' Content and Data Type, click Detect. An algorithm runs that samples numeric data and determines whether the numeric columns contain continuous or discrete values. For example, a column can contain salary information as the actual salary values, which is continuous, or it can contain integers representing encoded salary ranges (1 = < $25,000; 2 = from $25,000 to $50,000, and so on) which is discrete. 13. Click Next. 14. In both Mining Structure Name and Mining Model Name, type Targeted Mailing. 15. Click Finish.
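For reference, everything the wizard does here can also be expressed in DMX, the data mining query language used later in this tutorial. The following is a minimal, hedged sketch of a structure-plus-model definition created in one step with CREATE MINING MODEL; the model name and the reduced column list are illustrative assumptions, not the exact definition the wizard generates:

CREATE MINING MODEL [Targeted Mailing Sketch]    // illustrative name, not the wizard's output
(
    [Customer Key]   LONG   KEY,
    [Age]            LONG   CONTINUOUS,
    [Gender]         TEXT   DISCRETE,
    [Yearly Income]  DOUBLE CONTINUOUS,
    [Bike Buyer]     LONG   DISCRETE PREDICT     // the predictable column
)
USING Microsoft_Decision_Trees

The structure the wizard builds includes all of the input columns you selected above; only a handful are shown here.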

The data mining editor opens, displaying the mining structure named Targeted Mailing that you just created, as shown in Figure 4.

Figure 4 Targeted Mailing mining structure tab Edit the Mining Models The initial mining structure only contains a model based on the Microsoft Decision Trees algorithm. In this section, you will define two additional models using the Mining Models tab of the data mining editor: a Microsoft Naïve Bayes model and a Microsoft Clustering model. To create a Microsoft Clustering model 16. Click the Mining Models tab. 17. Right-click Targeted Mailing and then click New Mining Model. 18. In Model Name, type TM_Clustering. 19. In Algorithm Name, select Microsoft Clustering. 20. Click OK. A new model appears in the Mining Models tab. A Microsoft Clustering model can cluster and predict continuous and discrete attributes. You can modify the column usage and properties for the new model.
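Adding a second model to an existing structure can also be done in DMX with ALTER MINING STRUCTURE. This is only a hedged sketch of what the Mining Models tab does for you; the abbreviated column list and the model name are illustrative assumptions:

ALTER MINING STRUCTURE [Targeted Mailing]
ADD MINING MODEL [TM_Clustering_Sketch]    // illustrative name; the tab above uses TM_Clustering
(
    [Customer Key],
    [Age],
    [Yearly Income],
    [Bike Buyer] PREDICT
)
USING Microsoft_Clustering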

Setting a column to Predict has no effect on the model training; it allows you to select that column in a PREDICTION JOIN query. However, the algorithm ignores columns set to PredictOnly when it creates clusters. The statistics for PredictOnly columns in a clustering model are determined as a final pass after the clustering operation is complete. This is beneficial if you want to see how an attribute is distributed across clusters created from other attributes, and it can expose deeper correlations. To create a Microsoft Naïve Bayes model 21. Right-click Targeted Mailing, and then click New Mining Model. 22. In Model Name, type TM_NaiveBayes. 23. In Algorithm Name, click Microsoft Naïve Bayes. 24. Click OK. A dialog box appears, explaining that the Microsoft Naïve Bayes algorithm does not support the Age and Yearly Income columns, which are continuous; these columns will be ignored. 25. Click Yes. A new model appears in the Mining Models tab. Although you can modify the column usage and properties for all of the models in this tab, in this case you can leave them as they are. Process the Mining Models Now that the structure and parameters for the mining models are complete, you can deploy and process the models. To deploy the project and process the mining models, press F5.

Depending on which account Analysis Services is running under, you may need to change the impersonation information of the data source. To change it, open Adventure Works DW.ds from Solution Explorer and go to the Impersonation Information tab. The Analysis Services database is deployed to the server and the mining models are processed. If the database has already been deployed to the server, you can process the mining models using the following steps.

To process the mining models 26. On the Mining Model menu, select Process. The Process Mining Structure dialog box opens showing Targeted Mailing in the Object list. 27. Click Run. The Process Progress dialog box opens, displaying information about model processing. This may take some time, depending on your computer. 28. After processing is complete, click Close in both dialog boxes. Note that processing the mining models may take several minutes depending on your computer configuration. Exploring the Mining Models After the models are processed, you can view them using the Mining Model Viewer tab in the data mining editor. Using the Mining Model combo box at the top of the tab, you can examine the models in the mining structure. Microsoft Decision Trees Model The Mining Model Viewer tab defaults to opening the Targeted Mailing mining model, the first model in the structure. The Tree viewer contains two tabs, Decision Tree and Dependency Network. Decision Tree On the Decision Tree tab, you can examine all of the tree models that make up the Targeted Mailing model. There is one tree model for each predictable attribute in the model, unless feature selection is invoked. Because your model only contains a single predictable attribute, Bike Buyer, there is only one tree to view. If there were more trees, you could use the Tree box to choose another tree. The Tree viewer defaults to only showing the first three levels of the tree. If the tree contains fewer than three levels, the Tree viewer only shows the existing levels. You can view more levels using the Show Level slider or the Default Expansion box. To create the tree shown in Figure 5 29. Slide Show Level to 5.

30. In the Background list, click 1. By changing this setting, you can quickly tell the number of cases for bike buyer equal to 1 that exist in each node. The darker the node, the more cases that exist.

Figure 5 Decision Tree tab of the Targeted Mailing model Each node in the decision tree displays three pieces of information: The condition required to reach that node from the node preceding it. You can see the full node path in either the legend or a ToolTip. A histogram that describes the distribution of states for the predictable column in order of popularity. You can control how many states appear in the histogram using the Histogram control. The concentration of cases that have the state of the predictable attribute specified in the Background control. If drill through is enabled, you can see the training cases that each node supports by right-clicking the node and then clicking Drill through. Dependency Network

The Dependency Network tab displays the relationships between the attributes that contribute to the predictive ability of the mining model. The dependency network for the Targeted Mailing model is displayed in Figure 6.

Figure 6: Dependency Network tab of the Targeted Mailing model The center node in Figure 6, Bike Buyer, represents the predictable attribute in the mining model. Each surrounding node represents an attribute that affects the outcome of the predictable attribute. You can use the slider on the left side of the tab to control the strength of the links that are shown. Moving the slider down means that only the strongest links are shown. Using the color legend at the bottom of the chart, you can see the nodes that a selected node predicts, or the nodes that the selected node is predicted by. Testing the Accuracy of the Mining Models You have now processed and explored the mining models. But how well do they perform predictions? Does one of the targeted mailing models perform better than the others?

Using the Mining Accuracy Chart tab, you can calculate how well each of the models predicts and compare their results directly against each other. This method of comparison is also sometimes called a lift chart. The Mining Accuracy Chart tab uses test data, which is data separated from the original training dataset, to compare predictions against a known result. The results are then sorted and plotted on a graph, along with an ideal model to show how well the model performs at predictions. An ideal model represents a plot for a theoretical model that predicts the result correctly 100 percent of the time. The lift chart is important because it helps to distinguish between nearly identical models in a structure, determining which provides the best predictions. Similarly, it shows which algorithm types perform the best predictions for a given situation. Open the tab, which is shown in Figure 16, by clicking Mining Accuracy Chart.

Figure 16 Mining Accuracy Chart tab To create a new mining accuracy chart, you must perform the following tasks: 31. Map the columns of the model to the columns in the input dataset. 32. Filter the input data.

33. Select the models to compare and the predictable columns and values. Note Before you can use the mining accuracy chart, you must deploy and process the mining models. Mapping the Input Columns The first step is to map the columns in the model to the columns in the test data. If the column names map directly, the tool automatically creates relationships. To map the input columns to the mining structure 34. In the Select Input Table(s) box, click Select case table. The Select Table dialog box opens, where you choose an input table that contains the test data that you want to use in the prediction queries to determine the accuracy of the models. For example, if you held out some rows from vTargetMail, independent of the data that was used to process the models, you might select that table. However, in this tutorial we use the same data used to process the models as the input table. 35. In the Select Table dialog box, select Adventure Works DW from the data source list. 36. Select vTargetMail from the Table/View list and click OK. The columns in the mining structure are automatically mapped to the columns with the same name in the input table, as shown in Figure 17.

Figure 17 Mapped columns in the Mining Accuracy Chart tab

A prediction query is generated for each model in the structure based on the column mappings. You can delete a mapping by selecting the line that links the columns in Mining Structure and Select Input Table(s) and then pressing DELETE. You can also manually create mappings by clicking a column in Select Input Table(s) and dragging it onto the corresponding column in Mining Structure. Filtering Input Rows You can use the grid under Filter the input data used to generate the lift chart to filter the input data. You can drag columns from Select Input Table(s) to the grid, or you can select the values from combo boxes. For example, if you want to limit the input rows to those where the YearlyIncome column is greater than x, in the Field column, select YearlyIncome, and then in the Criteria/Argument column, type >x. Selecting the Models, Predictable Columns, and Values

The next step is to select the models that you want to include in the lift chart and the predictable column that they will be compared against. By default, all of the models in the mining structure are selected. You can choose not to include a model, but for this tutorial, leave them as they are. You can create two types of accuracy charts. If you select a predictable value, you will see a chart like the one in Figure 18, which shows how much lift the model provides. If you do not include a Predict Value, as shown in Figure 19, the chart will show how accurate the model is. To show the lift of the models 37. For each remaining model, in Predictable Column Name, click Bike Buyer. 38. For each remaining model, in the Predict Value column, click 1. To show the accuracy of the models In Predictable Column Name, click Bike Buyer. Leave the Predict Value column empty. If the Synchronize Prediction Columns and Values check box is selected, the predictable column is synchronized for each mining model in the mining structure. Note The mining model columns listed in the Predictable Column Name box are restricted to columns that have the usage type set to Predict or Predict Only, and where the mining structure column on which this mining column is based has a content type of Discrete or Discretized. In some advanced scenarios, you may want to generate a lift chart that includes a predictable column in two mining models that are not based on the same structure column but contain the same data. If you clear the Synchronize Prediction Columns and Values check box, you can select any valid predictable column and value. The results are plotted together, regardless of whether they make sense. Viewing the Lift Chart Click the Lift Chart tab to view the lift chart. When you click the tab, a prediction query runs against the server and database for the mining structure and input table. The predicted results are compared to the known actual values and sorted by

probability, and the results are plotted on the graph. For more information about using the chart, see "Lift Chart" in SQL Server Books Online. If you specified a predictable value, the lift chart is plotted as shown in Figure 18.

Figure 18 Lift provided by each model plotted against the ideal model

If you did not specify a predictable value, the lift chart shows the accuracy of the mining model predictions as shown in Figure 19.

Figure 19 Accuracy of each model plotted against the ideal model

Creating Predictions Now that you are satisfied with the mining models, you can begin to create DMX prediction queries using Prediction Query Builder. Prediction Query Builder is similar to Access Query Builder, in which you use drag-and-drop operations to build the queries. The tool contains three different views: Design Query Result

Figure 20 Default view of Prediction Query Builder

Using the Design and Query views, you can build and look at your query. You can then execute and view the results of the query in the Result view. Creating the Query The first step in creating the query is to select a mining model and input table. To select a model and input table 39. In Mining Model, click Select model. The Select Mining Model dialog box opens. By default, the first mining model in the mining structure is selected. 40. Navigate through the tree to Targeted Mailing, and then click Targeted Mailing. 41. In the Select Input Table(s) box, click Select case table. The Select Table dialog box opens. 42. Navigate through the tree and select the vTargetMail table in the AdventureWorksDW data source view.

Note that typically you would have a separate table that contains your prospect customers and you would want to predict whether each customer would buy a bike or not (i.e., Bike Buyer column) based on other known information (i.e., other columns). However, for the sake of simplicity of the tutorial, we are using the same training data, vTargetMail as the prospect customers. After you select the input table, Prediction Query Builder creates a default mapping between the mining model and input table based on the names of the columns, as shown in Figure 21.

Figure 21 Mapped columns in the Mining Model Prediction tab

To build the prediction query 43. In the Source column, click the cell in the first empty row, and then click vTargetMail table.

44. In the Field column, next to the entry you created in the previous step, click CustomerKey. This adds the unique identifier to the prediction query so that you can identify who is and who is not likely to buy a bicycle. 45. Click the next cell in the Source column, and then click the Targeted Mailing mining model. 46. In the Field cell, click Bike Buyer. This specifies that the Targeted Mailing mining model (the decision tree model in the Targeted Mailing structure) will be used to create the predictions. 47. Click the next cell under the Source column, and then click Prediction Function. 48. Next to Prediction Function, in the Field column, click PredictProbability. Prediction functions provide information about how the model predicts. The PredictProbability function provides information about the probability of the prediction being correct. You can specify parameters for the prediction function in the Criteria/Argument column. 49. In the Criteria/Argument column, type [Targeted Mailing].[Bike Buyer]. This specifies the target column for the PredictProbability function. For more information on functions, see "DMX Function Reference" in SQL Server Books Online. Your screen should now look like Figure 22.

Figure 22 Prediction Query Builder in the Mining Model Prediction tab
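When you switch to the Query view (described next), the DMX that Prediction Query Builder generates resembles the following sketch. Only a few of the input columns are mapped in the ON clause here for brevity; the builder maps every matching column, and the exact column names and OPENQUERY text depend on how your structure and data source are defined:

SELECT
    t.CustomerKey,
    [Targeted Mailing].[Bike Buyer],
    PredictProbability([Targeted Mailing].[Bike Buyer])
FROM
    [Targeted Mailing]
PREDICTION JOIN
    OPENQUERY([Adventure Works DW],
        'SELECT CustomerKey, Age, Gender, YearlyIncome FROM dbo.vTargetMail') AS t
ON
    [Targeted Mailing].[Age] = t.Age AND
    [Targeted Mailing].[Gender] = t.Gender AND
    [Targeted Mailing].[Yearly Income] = t.YearlyIncome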

By clicking the icon in the upper-left corner of the view, you can switch to the Query view and look at the DMX code that Prediction Query Builder created. You can also run the query, modify the query, and run the modified query, but the modified query is not persisted if you switch back to the Design view. Viewing the Results You can run the query by clicking the arrow next to the icon in the top left corner of the tab, and then clicking Result. Figure 23 displays the query results.

Figure 23 Mining Model Prediction Result tab

The CustomerKey, BikeBuyer, and Expression columns identify potential customers, whether they are bike buyers, and the probability of the prediction being correct. You can use these results to determine who should be sent an advertisement.

Practical :- 8 Aim: Perform various steps of preprocessing on the given relational database/warehouse. Software Required: Oracle Data Miner (ODM) Knowledge Required: Data Preprocessing Theory/Logic: The data used in the data mining process usually has to be collected from various locations, and also some transformation of the data is usually required to prepare the data for data mining operations. The Mining Activity Guides will assist you in joining data from disparate sources into one view or table, and will also carry out transformations that are required by a particular algorithm; those transforms will be discussed in the context of the Guides. However, there are transforms that typically will be completed on a standalone basis using one of the Data Transformation wizards. These include Recode, Filter, Derive field, and others. Moreover, utilities are available for importing a text file into a table in the database, for displaying summary statistics and histograms, for creating a view, for creating a table from a view, for copying a table, and for dropping a table or view. The examples below assume that the installation and configuration explained in Appendix A have been completed and that the sample views are available to the current user. These sample views include MINING_DATA_BUILD_V, MINING_DATA_TEST_V, MINING_DATA_APPLY_V, MARKET_BASKET_V, and others, including the tables of the SH schema.

These tables describe the purchasing habits of customers in a pilot marketing campaign. They will be used to illustrate the business problems of identifying the most valuable customers as well as defining the product affinity that will help determine product placement in the stores. Note on data format: Previous versions of Oracle Data Mining allowed two distinct data formats, Single Row per Record, in which all the information about an individual resides in a single row of the table/view, and Multiple row per Record (sometimes called Transactional format), in which information for a given individual may be found in several rows (for example if each row represents an item purchased). In ODM 10g Release 2, only Single Row per Record format is acceptable; however, some language relating to the former distinction remains in some wizards. The feature called Nested Column is used to accommodate the use case previously handled by Transactional format.

The Import Wizard


The text file demo_import_mag.txt is included in the Zip file containing this tutorial. It consists of comma-separated customer data from a magazine subscription service, with attribute names in the first row. The Import wizard accepts information about the text file from the user and configures the SQLLDR command to create a table. You must identify the location of the SQLLDR executable in the Preferences worksheet. To import the text file into a table, select Import in the Data pulldown menu.

Click Next on the Welcome page to proceed. Step 1: Click Browse to locate the text file to be imported

Step 2: Select the field (column) delimiter from the pulldown menu

Any string field values containing the delimiter must be enclosed in either single or double quotes; if this is the case, specify the enclosures from the pull-down menu. In addition, certain other characters are unacceptable in a string for some purposes; an alternative to quoting the string is replacing the illegal characters. SQLLDR parameters such as termination criteria can be selected by clicking Advanced Settings. If the first row of the file contains the field names, click the appropriate checkbox. To verify the format specifications, click Preview.

Step 3: Verify the attribute names and data types. If the first row of the text file does not contain field names, then dummy names are supplied and they may be modified in this step. The Data Type may also be modified.

Step 4: Specify the name of the new table or the existing table in which the imported data will be inserted.

Click Finish to initiate the import operation.

When completed, the Browser displays a sample from the table.

Data Viewer and Statistics


Left click on the name of a table or view to display the structure.

Click the Data tab to see a sample of the table/view contents.

The default number of records shown is 100; enter a different number in the Fetch Size window, then click Refresh to change the size of the display, or click Fetch Next to add more rows to the display. Right-click the table/view name to expose a menu with more options. Click Transform to expose another menu giving access to transformation wizards (most of which will be discussed in detail later).

The two menu choices Generate SQL and Show Lineage appear only for views; they are not on the menu for tables. Show Lineage displays the SQL code and identifies the underlying table(s) used to create the view, while Generate SQL allows you to save the SQL code into an executable script. To see a statistical summary, click one of the two selections depending on the data format type. The following example uses Show Summary Single-record.

For each numerical attribute, Maximum and Minimum values, as well as average and variance, are shown. These statistics are calculated on a sample (1500 in this screen shot); the size of the sample can be changed by adjusting ODM Preferences. For any highlighted attribute, click Histogram to see a distribution of values. The values are divided into ranges, or bins.

Numerical

Categorical

The default number of bins is 10; this number can be changed by clicking Preference. Numerical attributes are divided into the designated number of bins of equal width between the minimum and maximum. The bins are displayed in ascending order of attribute values. Categorical attributes are binned using the Top N method (N is the number of bins). The N values occurring most frequently have bins of their own; the remaining values are thrown into a bin labeled Other. The bins are displayed in descending order of bin size.

Transformations
You can right-click on the table/view name or pull down the Data menu to access the data transformation wizards. Many of the transforms are incorporated into the Mining Activity Guides; some have value as standalone operations. In each case the result is a view, unless the wizard allows a choice of table or view. Some examples follow: Filter Single-Record Suppose we want to concentrate on our customers between the ages of 21 and 35. We can filter the data to include only those people. Oracle Data Miner provides a filtering transformation to define a subset of the data based upon attribute values. Begin by highlighting Transformations on the Data pulldown menu and selecting Filter Single-Record (or right-click on the table/view name) to launch the wizard. Click Next on the Welcome page.

Identify the input data and click Next (if you accessed the wizard by right-clicking the table/view name, then the data is already known and this step is skipped).

Enter a name for the resultant view and click Next.

Enter the filter condition or (much easier!) click the icon to the right of the window to construct the condition in a dialog box.

The Expression Editor allows easy construction of the where clause that will be inserted into the query to create the new view. In this example, we want only those records representing individuals whose age is between 21 and 35 years. Double-click the attribute name AGE, click the >= button, and type 21 to construct the first part of the condition shown. Click AND to continue defining the full condition. Note that complex conditions can be constructed using the And, Or, and parentheses buttons. Click the Validate button to check that the condition is satisfied by a subset of the source data. When you dismiss the Expression Editor by clicking OK, the condition is displayed in the Filter window.
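Behind the scenes, the Filter wizard simply creates a view whose WHERE clause is the condition you just built. A rough SQL sketch, assuming MINING_DATA_BUILD_V as the input data and an illustrative name for the result view:

CREATE VIEW MINING_DATA_21_35_V AS      -- illustrative output name
SELECT *
  FROM MINING_DATA_BUILD_V              -- assumed input view
 WHERE AGE >= 21 AND AGE <= 35;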

You may preview the results and then choose to generate a stored procedure from the Finish page. Click Finish to complete the transformation.

When the transformation is complete, a sample of the new data is displayed.

Recode
The Recode transformation allows specified attribute values to be replaced by new values. For example, suppose the Summarization Viewer reveals that the attribute LENGTH_OF_RESIDENCE has a numerical range from 1 to 34 in the table DEMO_IMPORT_MAG, just created in the Import example. In order to make the model build operation more efficient, you decide to consider only two classes of residence: LOW for residences of less than or equal to 10 years, and HIGH for residences of more than 10 years. Begin by highlighting Transform on the Data pulldown menu and selecting Recode (or right-click on the table/view name) to launch the wizard.

Select the table or view to be transformed and specify the format by clicking the appropriate radio button (if you accessed the wizard by right-clicking the table/view name, then the data is already known and this step is skipped).

Enter a name for the resultant view.

Highlight the attribute to be recoded and click Define.

In the Recode dialog box, choose the condition on the attribute value and enter the new value in the With Value window; click Add to confirm. Repeat for each condition.

Warning: The wizard does not check the conditions for inconsistencies. In the same dialog box, a missing values treatment can be defined. In this example, all null values for this attribute are recoded to UNKNOWN.

Also, a treatment for any value not included in the conditions may be defined; in this example, all such values are recoded to OTHER.

Click OK; the recode definitions are now displayed with the attributes. You may recode more than one attribute by highlighting another attribute and repeating the steps.

When done, click Next. You may preview the results by clicking Preview Transform on the Finish page.

Note that the recoded attribute has assumed the defined data type; LENGTH_OF_RESIDENCE, previously numerical, is now of type VARCHAR2.

On this same page, you can click the SQL tab to see the query used to display the preview. To save executable code for future use, you can click Advanced SQL to see and save the complete code that creates the transformed dataset.
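The SQL that the Recode wizard generates is essentially a CASE expression wrapped in a view. A hedged sketch for the LENGTH_OF_RESIDENCE example (the result view name and the pass-through columns are illustrative; the wizard carries all other columns through unchanged):

CREATE VIEW DEMO_IMPORT_MAG_RECODED AS                   -- illustrative name
SELECT CUSTOMER_ID,                                      -- illustrative pass-through column
       AGE,                                              -- illustrative pass-through column
       CASE
         WHEN LENGTH_OF_RESIDENCE IS NULL THEN 'UNKNOWN' -- missing values treatment
         WHEN LENGTH_OF_RESIDENCE <= 10   THEN 'LOW'
         WHEN LENGTH_OF_RESIDENCE > 10    THEN 'HIGH'
         ELSE 'OTHER'                                    -- any value not covered by the conditions
       END AS LENGTH_OF_RESIDENCE
  FROM DEMO_IMPORT_MAG;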

Click Finish to complete the transformation; a sample of the transformed data is displayed.

Compute Field
It is often necessary when preparing data for data mining to derive a new column from existing columns. For example, specific dates are usually not interesting, but the elapsed time in days between dates may be very important (calculated easily in the wizard as Date1 - Date2; the difference between two date types gives the number of days between the two dates in numerical format). This example shows another viewpoint on Disposable Income as Fixed Expenses, calculated as (Income - Disposable Income). Begin by highlighting Transform on the Data pulldown menu and selecting Compute Field (or right-click on the table/view name) to launch the wizard.

Select the table or view to be transformed (if you accessed the wizard by right-clicking the table/view name, then the data is already known and this step is skipped).

Enter the name of the view to be created.

Click New to construct a definition of the new column.

In the Expression Editor, double-click on an attribute name to include it in the expression. Click on the appropriate buttons to include operators. Note that many SQL functions are available to be selected and included in the expression by clicking the Functions tab.

In this example, the new column FAMILY_EXPENSES is the difference of FAMILY_INCOME_INDICATOR and PURCHASING_POWER_INDICATOR. You can check that the calculation is valid by clicking the Validate button.
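In SQL terms, the Compute Field wizard produces a view that carries the source columns through and adds the derived expression. A hedged sketch, assuming the result is the DEMO_IMPORT_MAG2 view referenced in the next section:

CREATE VIEW DEMO_IMPORT_MAG2 AS
SELECT t.*,
       (FAMILY_INCOME_INDICATOR - PURCHASING_POWER_INDICATOR) AS FAMILY_EXPENSES
  FROM DEMO_IMPORT_MAG t;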

You may want to drop the columns FAMILY_INCOME_INDICATOR and PURCHASING_POWER_INDICATOR after the result is created. This can be done by using the result as source in the Create View wizard and deselecting those columns (illustrated below). The column definition is displayed in the Define New Columns window; you may repeat the process to define other new columns in the same window.

You may preview the results and then choose to generate a stored procedure from the Finish page. Click Finish to complete the transformation.

The view with the new column is displayed when the transformation is complete.

Create View Wizard


The Mining Activity Guides provide utilities for combining data from various sources, but there are times when the Create View wizard can be used independently of the Guides to adjust the data to be used as input to the data mining process. One example is the elimination of attributes (columns). Begin by selecting Create View from the Data pulldown menu. Click the plus sign + next to the database connection to expand the tree listing the available schemas. Expand the schemas to identify tables and views to be used in creating the new view. Double-click the name DEMO_IMPORT_MAG2 (created in the previous section) to bring it into the work area.

Click the checkbox to toggle inclusion of an attribute in the new view; click the top checkbox to toggle all checkboxes.

Then click the checkboxes next to FAMILY_INCOME_INDICATOR and PURCHASING_POWER_INDICATOR to deselect those attributes.

Select Create View from the File pulldown menu and enter the name of the resultant view in the dialog box; then click OK.

When the view has been created, a sample of the data is displayed. Dismiss the Create View wizard by selecting Exit from the wizard's File pulldown menu.
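Conceptually, the view the wizard just created is only a projection of the retained columns. A hedged sketch (the output name and the retained column names are illustrative):

CREATE VIEW DEMO_IMPORT_MAG_FINAL_V AS    -- illustrative name
SELECT CUSTOMER_ID,                       -- illustrative retained columns
       AGE,
       LENGTH_OF_RESIDENCE,
       FAMILY_EXPENSES
  FROM DEMO_IMPORT_MAG2;                  -- the two deselected columns are simply omitted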

Practical :- 9 Aim: To study Data Mining Extensions (DMX) language and MDX query language. Software Required: Analysis Services - SQL Server 2005 Knowledge Required: Query language Theory/Logic: Data Mining Extensions (DMX) is a language that you can use to create and work with data mining models in Microsoft SQL Server 2005 Analysis Services (SSAS). You can use DMX to create the structure of new data mining models, to train these models, and to browse, manage, and predict against them. DMX is composed of data definition language (DDL) statements, data manipulation language (DML) statements, and functions and operators. Microsoft OLE DB for Data Mining Specification The data mining features in Analysis Services are built to comply with the Microsoft OLE DB for Data Mining specification, which was first released to coincide with the release of Microsoft SQL Server 2000. The Microsoft OLE DB for Data Mining specification defines the following:
A structure to hold the information that defines a data mining model. A language for creating and working with data mining models.

The specification defines the basis of data mining as the data mining model virtual object. The data mining model object encapsulates all that is known about a particular mining model. The data mining model object is structured like an SQL table, with columns, data types, and meta information that describe the model. This structure lets you use the DMX language, which is an extension of SQL, to create and work with models. DMX Statements You can use DMX statements to create, process, delete, copy, browse, and predict against data mining models. There are two types of statements in DMX: data definition statements and data manipulation statements. You can use each type of statement to perform different kinds of tasks. The following sections provide more information about working with DMX statements: Data Definition Statements Data Manipulation Statements Query Fundamentals

Data Definition Statements Use data definition statements in DMX to create and define new mining structures and models, to import and export mining models and mining structures, and to drop existing models from a database. Data definition statements in DMX are part of the data definition language (DDL). You can perform the following tasks with the data definition statements in DMX: Create a mining structure by using the CREATE MINING STRUCTURE (DMX) statement, and add a mining model to the mining structure by using the ALTER MINING STRUCTURE (DMX) statement. Create a mining model and associated mining structure simultaneously by using the CREATE MINING MODEL (DMX) statement to build an empty data mining model object. Export a mining model and associated mining structure to a file by using the EXPORT (DMX) statement. Import a mining model and associated mining structure from a file that is created by the EXPORT statement by using the IMPORT (DMX) statement. Copy the structure of an existing mining model into a new model, and train it with the same data, by using the SELECT INTO (DMX) statement. Completely remove a mining model from a database by using the DROP MINING MODEL (DMX) statement. Completely remove a mining structure and all its associated mining models from the database by using the DROP MINING STRUCTURE (DMX) statement.
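As a hedged illustration of the data definition statements above (the structure name and column list are made up for the example, not taken from a shipped sample):

CREATE MINING STRUCTURE [New Mailing]     // illustrative structure
(
    [Customer Key] LONG KEY,
    [Age]          LONG CONTINUOUS,
    [Bike Buyer]   LONG DISCRETE
)

EXPORT MINING STRUCTURE [New Mailing] TO 'C:\NewMailing.abf'

DROP MINING STRUCTURE [New Mailing]       // removes the structure and all its models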

Data Manipulation Statements Use data manipulation statements in DMX to work with existing mining models, to browse the models and to create predictions against them. Data manipulation statements in DMX are part of the data manipulation language (DML). You can perform the following tasks with the data manipulation statements in DMX: Train a mining model by using the INSERT INTO (DMX) statement. This does not insert the actual source data into a data mining model object, but instead creates an abstraction that describes the mining model that the algorithm creates. The source query for an INSERT INTO statement is described in <source data query>. Extend the SELECT statement to browse the information that is calculated during model training and stored in the data mining model, such as statistics of the source data. Following are the clauses that you can include to extend the power of the SELECT statement:

o SELECT DISTINCT FROM <model> (DMX)
o SELECT FROM <model>.CONTENT (DMX)
o SELECT FROM <model>.CASES (DMX)
o SELECT FROM <model>.SAMPLE_CASES (DMX)
o SELECT FROM <model>.DIMENSION_CONTENT (DMX)
Create predictions that are based on an existing mining model by using the PREDICTION JOIN clause of the SELECT statement. The source query for a PREDICTION JOIN statement is described in <source data query>. Remove all the trained data from a model or a structure by using the DELETE (DMX) statement.
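A hedged sketch of the two most common data manipulation operations, training a model and browsing its content (the abbreviated column list and the OPENQUERY text are illustrative; they must match how your model and data source are actually defined):

INSERT INTO MINING MODEL [Targeted Mailing]
(
    [Customer Key], [Age], [Bike Buyer]
)
OPENQUERY([Adventure Works DW],
    'SELECT CustomerKey, Age, BikeBuyer FROM dbo.vTargetMail')

SELECT NODE_CAPTION, NODE_SUPPORT     // browse the learned content of the model
FROM [Targeted Mailing].CONTENT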

DMX Query Fundamentals The SELECT statement is the basis for most DMX queries. Depending on the clauses that you use with such statements, you can browse, copy, or predict against mining models. The prediction query uses a form of SELECT to create predictions based on existing mining models. Functions extend your ability to browse and query the mining models beyond the intrinsic capabilities of the data mining model. You can use DMX functions to obtain information that is discovered during the training of your models, and to calculate new information. You can use these functions for many purposes, including to return statistics that describe the underlying data or the accuracy of a prediction, or to return an expanded explanation of a prediction. Working with data mining models in Microsoft SQL Server 2005 Analysis Services (SSAS) involves the following primary tasks: Creating mining structures and mining models Processing mining structures and mining models Deleting or dropping mining structures or mining models Copying mining models Browsing mining models Predicting against mining models

You can use Data Mining Extensions (DMX) statements to perform each of these tasks programmatically.

Creating mining structures and mining models: Use the CREATE MINING STRUCTURE (DMX) statement to add a new mining structure to a database. You can then use the ALTER MINING STRUCTURE (DMX) statement to add mining models to the mining structure. Use the CREATE MINING MODEL (DMX) statement to build a new mining model and associated mining structure.
Processing mining structures and mining models: Use the INSERT INTO (DMX) statement to process a mining structure and mining model.
Deleting or dropping mining structures or mining models: Use the DELETE (DMX) statement to remove all the trained data from a mining model or mining structure. Use the DROP MINING STRUCTURE (DMX) or DROP MINING MODEL (DMX) statements to completely remove a mining structure or mining model from a database.
Copying mining models: Use the SELECT INTO (DMX) statement to copy the structure of an existing mining model into a new mining model and to train the new model with the same data.
Browsing mining models: Use the SELECT (DMX) statement to browse the information that the data mining algorithm calculates and stores in the data mining model during model training. Much like with Transact-SQL, you can use several clauses with the SELECT statement to extend its power. These clauses include DISTINCT FROM <model>, FROM <model>.CASES, FROM <model>.SAMPLE_CASES, FROM <model>.CONTENT and FROM <model>.DIMENSION_CONTENT.
Predicting against mining models: Use the PREDICTION JOIN clause of the SELECT statement to create predictions that are based on an existing mining model.
You can also import and export models by using the IMPORT (DMX) and EXPORT (DMX) statements. These tasks fall into two categories, data definition statements and data manipulation statements, which are described in the following table.
Topic: Data Mining Extensions (DMX) Data Definition Statements
Description: Part of the data definition language (DDL). Used to define a new mining model (including training) or to drop an existing mining model from a database.
Topic: Data Mining Extensions (DMX) Data Manipulation Statements
Description: Part of the data manipulation language (DML). Used to work with existing mining models, including browsing a model or creating predictions.

Creating DMX Prediction Queries The main goal of most data mining projects is to use mining models to create predictions for new data. For example, you may want to predict how many bicycles your company will sell next year during December, or whether a potential customer will purchase a bicycle in response to an advertising campaign. You can also use predictions to explore the information that the algorithms discover when they train the mining models. Prediction queries are based on the Data Mining Extensions (DMX) language. DMX extends the SQL language, to provide support for working with mining models. Prediction Query Tools SQL Server provides two tools that you can use to build prediction queries: Prediction Query Builder and the Query Editor. Prediction Query Builder is included in the Mining Model Prediction tab of Data Mining Designer. When you use the query builder, you can use graphical tools to design a query, use a text editor to manually modify the query, and use a simple results pane to view the results of the query. The Query Editor provides tools that you can use to build and run DMX queries. You can also include prediction queries as part of a SQL Server 2005 Integration Services (SSIS) package. Data Mining Extensions (DMX) Function Reference

Analysis Services supports several functions in the Data Mining Extensions (DMX) language. Functions expand the results of a prediction query to include information that further describes the prediction. Functions also provide more control over how the results of the prediction are returned. The following table lists the functions that DMX supports.
BottomCount: Returns a table that contains a specified number of bottommost rows, in increasing order of rank based on a rank expression.
BottomPercent: Returns a table that contains the smallest number of bottommost rows that meet a specified percent expression, in increasing order of rank based on a rank expression.
BottomSum: Returns a table that contains the smallest number of bottommost rows that meet a specified sum expression, in increasing order of rank based on a rank expression.
Cluster: Returns the cluster that is most likely to contain the input case.
ClusterProbability: Returns the probability that the input case belongs to the cluster.
IsDescendant: Indicates whether the current node descends from the specified node.
IsInNode: Indicates whether the specified node contains the case.
Lag: Returns the time slice between the date of the current case and the last date in the data.
Predict: Performs a prediction on a specified column.
PredictAdjustedProbability: Returns the adjusted probability of the specified predictable column.
PredictAssociation: Predicts associative membership in a column.
PredictCaseLikelihood: Returns the likelihood that an input case will fit within the existing model. This function can only be used with clustering models.
PredictHistogram: Returns a table that represents the histogram for a specified column.
PredictNodeId: Returns the NodeID for a selected case.
PredictProbability: Returns the probability of the specified column.
PredictSequence: Predicts the next values in a sequence.
PredictStdev: Retrieves the standard deviation value for a specified column.
PredictSupport: Returns the support value of the column.
PredictTimeSeries: Predicts the future values for a time series.
PredictVariance: Returns the variance value of the specified column.
RangeMax: Returns the upper value of the predicted bucket that is discovered for a specified discretized column.
RangeMid: Returns the midpoint value of the predicted bucket that is discovered for a specified discretized column.
RangeMin: Returns the lower value of the predicted bucket that is discovered for a specified discretized column.
TopCount: Returns a table that contains a specified number of topmost rows, in a decreasing order of rank based on a rank expression.
TopPercent: Returns a table that contains the smallest number of topmost rows that meet a specified percent expression, in a decreasing order of rank based on a rank expression.
TopSum: Returns a table that contains the smallest number of topmost rows that meet a specified sum expression, in a decreasing order of rank based on a rank expression.
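A hedged example that combines several of these functions in a singleton prediction query against the Targeted Mailing model from Practical 7 (the input values and the exact column names are illustrative assumptions):

SELECT
    Predict([Bike Buyer])                                      AS PredictedBuyer,
    PredictProbability([Bike Buyer])                           AS Probability,
    TopCount(PredictHistogram([Bike Buyer]), $Probability, 2)  AS TopTwoStates
FROM [Targeted Mailing]
NATURAL PREDICTION JOIN
(SELECT 35 AS [Age], 'M' AS [Gender], 80000 AS [Yearly Income]) AS t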

Practical :- 10 Aim: Case Study: To study the research papers on the given topic and prepare the report on it OR To implement any data mining functionality using any programming language.
