
A CASE STUDY: Excel 2010, A Powerful BI Tool for Analyzing Big Datasets

Kashan Jafri, Richard Mintz, Evan Ross, Marc Wright


Copyright
This document is provided as-is. Information and views expressed throughout, including URLs and other Internet Web site references, may change without notice. You bear the risk of using the content. Some examples depicted here are provided for illustration purposes only and bear no real association or connection to any known entities or organizations; no such inference is intended, nor should one be supposed. This document does not provide you with any legal rights to any intellectual property. You may copy and use this document for your internal, reference purposes. © 2012 Dimensional Strategies Inc. All rights reserved.


INTRODUCTION
This paper discusses our approach, findings, and recommendations for architecting a high-throughput, general-purpose Business Intelligence solution using an I/O-balanced approach to symmetric multiprocessing (SMP), with Microsoft SQL Server 2012 Enterprise at the core and Excel 2010 as the front-end business analytical tool. We describe our experiences and the design path we took with a relatively stable yet large dataset of about 76 billion rows. Our considerations focus on approaches to relatively high-volume data needs and do not specifically address high-velocity, high-variety, or highly complex data challenges. This content is relevant to audiences including CIOs, CTOs, IT planners, architects, DBAs, and business intelligence (BI) users with an interest in deploying SMP-based DW/BI capabilities that address big-dataset management with reporting and analysis leveraging the power and familiarity of features found in Microsoft Excel 2010.

WHAT IS BIG DATA, REALLY?


Defining Big Data
In our internet-enabled world, humans and machines are generating more than 2.5 exabytes (2,500,000,000,000,000,000 bytes!) of data every 24 hours. More data has been created in the last two years than existed throughout the prior history of the human race! The sources of this data are numerous: social media posts, digital pictures and videos, business and financial transaction records, and so on. This is what people mean when they talk about Big Data. There is nothing new about the data itself; however, the average speed at which we are now creating new and significant data holdings is astonishingly fast. Big data can be described and classified using four key aspects:

1. Volume: How plentiful?
2. Velocity: How fast?
3. Variety: How different/varied?
4. Complexity: How difficult to manage and/or analyze?


Volume (How Plentiful?): Many organizations are simply floundering in the current ocean of ever-growing data, in all its many forms and conceivable sizes. This data is also driving the creation of even more data: data derived from data! How can an organization turn the data from over 256 million daily financial trades into usable, legible, and actionable information? (The Toronto Stock Exchange, August 2012)

Velocity (How Fast?): Sometimes a minute, or possibly two, is the entire decision window. For time-sensitive processes such as fraud detection or identifying a known terrorist at a border crossing, big data must often be analyzed and queried as it comes pouring into a business via live transactions and interactions. How do you keep a country safe and open to travel and trade, but closed to crime, with data points like these: 24,513,463 travellers processed; 9,304,652 land vehicles processed (cars, trucks, buses); 90,685 aircraft processed. (The Canada Border Services Agency, April to June 2012)

Variety (How Different/Varied?): Big data is any type of data. Structured data is the domain of traditional relational databases such as Microsoft Access or SQL Server; unstructured data is everything else: plain text, audio, video, Microsoft Office documents, etc. Consider searching and indexing all global web content (all text, all audio, PDFs, all documents, all videos, and every picture), then serving this mosaic of content up to over 212 million Americans in the course of one month! (Google, May 2012)

Complexity (How Difficult to Manage and/or Analyze?): People, their relationships, and their interactions are among the most complex and actively morphing information domains to model, monitor, and analyze: Who is who? Who knows whom? And who does what? To answer these questions, organizations must resolve and relate all relevant sources of data (complex, large volumes), and then present that information to the decision maker at the point of decision (rapid delivery). One payment services company leverages corporate data for fraud detection; this process allows it to deter more than US$37.7 million in fraudulent transactions. (MoneyGram International, 2011)

Are 76 Billion Rows Of Data Big?


In this case study, we describe our experience with a relatively stable, yet large dataset of about 76 billion rows. Our considerations focus on approaches for relatively high-volume data and do not specifically address high-velocity, high-variety, or highly complex data challenges (as described above).

SOLUTION GOALS AND DRIVERS


Simply put, our client needed a cost-effective data warehouse back-end that could serve close to 40 users and handle over 20 terabytes of raw, uncompressed data (representing over 76 billion rows), providing a throughput of 2 GB/sec on reads and roughly 1 GB/sec on writes.
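A back-of-envelope calculation shows why raw throughput alone cannot meet the response-time targets described below, and why partition elimination and column-oriented storage (discussed later) matter:

    20 TB ≈ 20,480 GB
    20,480 GB ÷ 2 GB/sec ≈ 10,240 seconds ≈ 2.8 hours for a single full scan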

Business Requirements and Use Case


Acquisition cost and total cost of ownership were key drivers, with our client requesting a non-proprietary approach (not a specialized solution) offering the lowest cost to purchase and to operate, using standard, mainstream industry approaches where possible.

Performance: The solution required us to efficiently process complex queries on large historical datasets. Expected response times for most data queries are within minutes. The following table provides three success measures that the client required us to meet (an illustrative query sketch follows the table):

    Use Case                     Average Rows Returned    Expected Response Time
    Run a low-volume query       400,000 rows             A few minutes
    Run a medium-volume query    1.5 million rows         10 minutes
    Run a high-volume query      5-10 million rows        30 minutes
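For illustration only, a "medium volume" extract of roughly 1.5 million detail rows might look like the minimal T-SQL sketch below; the table and column names (dbo.FactTransaction, TradeDateKey, and so on) are our own hypothetical stand-ins, not the client's actual schema.

    -- Hypothetical "medium volume" detail extract (~1.5 million rows).
    -- Schema and column names are illustrative only.
    SELECT f.TradeDateKey,
           f.ClientKey,
           f.IndustryKey,
           f.TradeAmount
    FROM   dbo.FactTransaction AS f
    WHERE  f.TradeDateKey >= 20120601      -- integer date key, YYYYMMDD
      AND  f.TradeDateKey <  20120701
      AND  f.IndustryKey  =  42;           -- narrows the result to one industry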

General Requirements
- The data warehouse must be optimized to address report and analysis needs.
- In-house query tools and other off-the-shelf analytical tools must be able to integrate with the new data warehouse back-end.
- The data must only be accessible by named and managed departments within the organization (see the role-based security sketch below).
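One conventional way to meet the last requirement is standard SQL Server role-based security. The sketch below is an assumption on our part (the paper does not describe the client's security design), and the role, schema, and login names are hypothetical:

    -- Hypothetical department role restricting warehouse access (names illustrative only).
    CREATE ROLE TradeAnalysis_Dept;
    GRANT SELECT ON SCHEMA::dw TO TradeAnalysis_Dept;            -- read-only access
    ALTER ROLE TradeAnalysis_Dept ADD MEMBER [DOMAIN\analyst1];  -- SQL 2012 syntax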

Ease of Use
- Self-service analytics: end-users must be able to quickly conduct their own queries and unlock insights with interactive data exploration and graphing/charting.
- The solution must provide query tools with an intuitive user interface for creating ad hoc queries and continuing analysis in Microsoft Excel.
- Ability to export to Excel .XLS or .CSV formats.
- Ability to visually categorize and select data (e.g., by industry or by client).

TECHNICAL DESIGN STEPS AND PRINCIPLES


In the client's use cases, data is consumed in two ways. First, ad-hoc analysis is performed on aggregate-level data to identify trends and patterns. This type of analysis would ideally be performed in an easy-to-use interface requiring little or no knowledge of SQL or the underlying data structures. Second, when further investigation is required, end-users need the ability to extract millions of rows of data for analysis in other downstream processes. Several approaches were considered when architecting the SMP solution used at our client. Because of the large data volumes (170 million rows per day), not all approaches would be feasible on currently available hardware. To achieve the best possible performance at a reasonable cost, we decided to take a multi-tier approach to the hardware and software. The architecture diagrams in the sections that follow provide an overview of the reference architecture.
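The rolling daily partitions shown in the architecture diagrams that follow can be implemented with standard SQL Server table partitioning. A minimal sketch, assuming an integer YYYYMMDD date key and hypothetical object names:

    -- Daily partitioning on an integer date key (hypothetical names).
    CREATE PARTITION FUNCTION pfDaily (int)
        AS RANGE RIGHT FOR VALUES (20120101, 20120102, 20120103 /* one boundary per day */);

    CREATE PARTITION SCHEME psDaily
        AS PARTITION pfDaily ALL TO ([PRIMARY]);  -- production would spread filegroups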


APPROACHES
The following section describes the various approaches considered for the solution, each along with its respective pros and cons. This development methodology of quickly standing up candidate solutions and evaluating them against the client's business requirements allowed us to quickly arrive at the solution best suited to the client's needs.

Approach 1: SQL 2012 Tabular in In-Memory Mode


The first approach considered was to use the new Analysis Services Tabular In-Memory model released in SQL 2012. From an end-user standpoint, all front-end tools (PivotTables, Power View, PerformancePoint, and Reporting Services) would consume data from the same source, the tabular model, giving a consistent view across all tools. This approach would give the best end-user experience, but due to the data volume (76 billion rows over 2 years), the resulting tabular model was estimated to be roughly 2 terabytes in size. Even though fairly robust SMP hardware with 1 terabyte of RAM was being considered, a 2 terabyte model would still be forced to swap half of the model to disk, making the performance of the system unacceptable.
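The sizing argument can be sanity-checked with simple arithmetic (the bytes-per-row figure is just the implied quotient of the paper's own numbers, not a measured value):

    2 TB model ÷ 76 billion rows ≈ 26 bytes per compressed row
    2 TB model on a 1 TB RAM server ⇒ roughly half the model paged to disk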

[Figure: Approach 1: SQL 2012 Tabular in In-Memory Mode. Source data (flat files, ~170 million rows/day) is loaded via SSIS ETL into the SQL 2012 Data Warehouse (2 years of data, ~76 billion rows, rolling daily partitions), which feeds a SQL 2012 Analysis Services Tabular In-Memory model (~2 TB compressed). End-users consume the model through Reporting Services (list reports), Excel PivotTables (ad-hoc analysis), and SharePoint 2010 PerformancePoint & Power View (rich visualizations) via MDX and DAX queries.]

Approach 2: Column Store Index & SQL 2012 Tabular in DirectQuery Mode
The second approach was to still use an Analysis Services Tabular model, but in DirectQuery mode, which allows the resulting queries to be pushed down to the SQL Database Engine. When combined with a column store index on the fact tables, performance would be acceptable for the data volumes we were considering. The main drawback to this approach is that DirectQuery models support only the DAX query language, not MDX. This means that the only supported front-end tool at this time is Power View, which is an excellent tool for visualizing data but lacks analytic functionality when compared to Excel PivotTables or traditional OLAP front-end tools. Without a powerful analytic front-end tool, this approach would not work for the vast majority of business users.
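For reference, the column store index mentioned above uses SQL Server 2012's nonclustered columnstore syntax. A minimal sketch against our hypothetical fact table; note that in SQL 2012 a columnstore index makes the underlying table read-only, which pairs naturally with a switch-based daily load:

    -- SQL 2012 nonclustered columnstore index (hypothetical schema).
    CREATE NONCLUSTERED COLUMNSTORE INDEX ncci_FactTransaction
        ON dbo.FactTransaction (TradeDateKey, ClientKey, IndustryKey, TradeAmount);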


Approach 3: Column Store Index & SQL 2012 SSAS Multi-dimensional with ROLAP Fact Table
Once Analysis Services Tabular was ruled out as a solution for our client, we began considering traditional Analysis Services multidimensional models. An OLAP cube was created based on the star schema in the SQL Data Warehouse, at first using a ROLAP fact table in order to take advantage of the column store index already being created to service list reporting through SSRS. The end-user experience using an OLAP cube was acceptable: users would perform ad-hoc analysis through Excel PivotTables and drill through to SSRS reports to access detail-level data. Since Power View is not supported on multidimensional cubes, it is not available with this approach. For our client, Power View would have been nice to have as a visualization tool, but it was not considered a requirement for this solution.
[Figure: Approach 3: Column Store Index & SQL 2012 SSAS Multi-dimensional with ROLAP Fact Table. Source data (flat files, ~170 million rows/day) is loaded via SSIS ETL; the ROLAP cube pushes MDX requests down as SQL queries against the data warehouse. End-users consume through Reporting Services (list reports), Excel PivotTables & PowerPivot (ad-hoc analysis), and SharePoint 2010 PerformancePoint (rich visualizations).]
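When Excel issues MDX against the ROLAP fact table, Analysis Services translates it into relational SQL of roughly the aggregate shape sketched below, which is exactly the pattern the column store index accelerates. The actual generated SQL will differ, and the schema names remain our hypothetical ones:

    -- Illustrative shape of SSAS-generated ROLAP SQL (hypothetical schema).
    SELECT d.CalendarMonth,
           SUM(f.TradeAmount) AS TradeAmount,
           COUNT_BIG(*)       AS TradeCount
    FROM   dbo.FactTransaction AS f
    JOIN   dbo.DimDate         AS d ON d.DateKey = f.TradeDateKey
    GROUP BY d.CalendarMonth;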


Approach 4: Column Store Index & SQL 2012 SSAS Multi-dimensional with MOLAP Fact Table
To further improve performance for end-users, a traditional OLAP cube with MOLAP fact tables was also created, this time partitioned by day. This allows the nightly ETL process to load only the most recent day of data, greatly reducing the overall run-time of the nightly process. Both the end-user experience and the performance of the system were deemed acceptable, and this was selected as the most desirable approach.

[Figure: Approach 4: Column Store Index & SQL 2012 SSAS Multi-dimensional with MOLAP Fact Table. Source data (flat files, ~170 million rows/day) is loaded via SSIS ETL into the SQL 2012 Data Warehouse (column store index on the fact table; 2 years of data, ~76 billion rows, rolling daily partitions), which feeds a SQL 2012 Analysis Services OLAP cube with a MOLAP fact table and daily partitions. End-users consume through Reporting Services (list reports), Excel PivotTables & PowerPivot (ad-hoc analysis), and SharePoint 2010 PerformancePoint (rich visualizations).]
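On the relational side, a "load only the most recent day" pattern is commonly implemented as a split-and-switch against the rolling daily partitions. A minimal sketch under the same hypothetical names as earlier (the staging table must match the fact table's structure and carry a check constraint covering the new day):

    -- Nightly rolling-partition load (hypothetical names; error handling omitted).
    ALTER PARTITION SCHEME psDaily NEXT USED [PRIMARY];          -- stage next filegroup
    ALTER PARTITION FUNCTION pfDaily() SPLIT RANGE (20120701);   -- boundary for new day

    -- Bulk load the day into dbo.FactTransaction_Stage, build its columnstore
    -- index, then switch it in as the newest partition:
    ALTER TABLE dbo.FactTransaction_Stage
        SWITCH TO dbo.FactTransaction PARTITION $PARTITION.pfDaily(20120701);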

CONCLUSIONS
Designing a system for analysis versus reporting requires different techniques to ensure optimum performance. The choice of tools also makes a difference, with new features such as Power View requiring a tabular model. Multiple solutions may be required to address the business requirements, but by keeping calculations and hierarchies within the relational model we can ensure that all methods of analysis use the same data and will return the same result (a single version of the truth).
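As one illustration of keeping calculations in the relational model, a single view can carry a derived measure so that Excel, SSRS, and the cube all share one definition; the schema and column names below are again hypothetical, not the client's:

    -- One shared calculation for all front-end tools (hypothetical schema).
    CREATE VIEW dw.vFactTransaction
    AS
    SELECT f.TradeDateKey,
           f.ClientKey,
           f.TradeAmount,
           f.TradeAmount - f.CommissionAmount AS NetAmount  -- single version of the truth
    FROM   dbo.FactTransaction AS f;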
Approach #1: SQL 2012 Tabular in In-Memory Mode
BI tools supported: All Microsoft tools (PowerPivot, Power View, PerformancePoint, SSRS, etc.) can consume this model.
Conclusions/Observations: The tabular model was about 2 TB in size. In the end, the data volumes prohibited the use of this option.

Approach #2: Column Store Index & SQL 2012 Tabular in DirectQuery Mode
BI tools supported: DirectQuery only supports DAX-capable query tools. Power View supported.
Conclusions/Observations: Excel does not support DAX queries. We had to abandon this option.

Approach #3: Column Store Index & SQL 2012 SSAS Multi-dimensional with ROLAP Fact Table
BI tools supported: SSRS, Excel, and PerformancePoint fully supported. Power View not supported.
Conclusions/Observations: Uses extra disk space (compared to our MOLAP option). Power View does not consume (multidimensional) cubes.

Approach #4: Column Store Index & SQL 2012 SSAS Multi-dimensional with MOLAP Fact Table
BI tools supported: SSRS, Excel, and PerformancePoint fully supported. Power View not supported.
Conclusions/Observations: The MOLAP cube allowed for a very granular partitioning strategy (by day) while still delivering very good query responses. Power View does not consume (multidimensional) cubes.

