
A CASE STUDY:
Excel 2010 – A Powerful BI Tool for Analyzing Big Datasets

Kashan Jafri
Richard Mintz
Evan Ross
Marc Wright

Copyright
This document is provided “as-is”. Information and views expressed throughout, including
URLs and other Internet Web site references, may change without notice. You bear the risk
of using the content.

Some examples depicted here are provided for illustration purposes only and bear no real
association or connection to any known entities or organizations, and no such inference is
intended nor should be supposed.

This document does not provide you with any legal rights to any intellectual property. You
may copy and use this document for your internal, reference purposes.

© 2012 Dimensional Strategies Inc. All rights reserved.



INTRODUCTION
This paper discusses our approach, findings and recommendations for architecting a high-
throughput, general-purpose Business Intelligence solution using an I/O-balanced approach
to symmetric multiprocessing (SMP), with Microsoft SQL Server 2012 Enterprise at the core
and Excel 2010 as the front-end business analytical tool.

We describe our experiences and the design path we took with a relatively stable yet large
dataset of about 76 billion rows. Our considerations focus on approaches to relatively high-
volume data needs, and do not specifically address high-velocity, high-variety or highly
complex data challenges.

This content is relevant to CIOs, CTOs, IT planners, architects, DBAs, and business
intelligence (BI) users with an interest in deploying SMP-based DW/BI capabilities that
address big-dataset management, with reporting and analysis leveraging the power and
familiarity of Microsoft Excel 2010.

WHAT IS BIG DATA, REALLY?


Defining Big Data
In our internet-enabled world, humans and machines are generating more than 2.5
exabytes (2,500,000,000,000,000,000 bytes!) of data every 24 hours. More data has been
created in the last two years than in all of prior human history! This data comes from
numerous sources: social media posts, digital pictures and videos, business and financial
transaction records, and so on. This is what people mean when they talk about "Big Data".
There is nothing new about the data itself; however, the speed at which we are now creating
new and significant data holdings is astonishingly fast.

Big data can be described and classified using four key aspects:

1. Volume — How Plentiful?
2. Velocity — How Fast?
3. Variety — How Different/Varied?
4. Complexity — How Difficult to Manage and/or Analyze?

Volume — How Plentiful: Many organizations are simply floundering in the current ocean
of ever-growing data, in all its many forms and conceivable sizes. This data is also driving
the creation of even more data: data derived from data! How can an organization turn the
data from over 256 million daily financial trades into usable, legible and actionable
information? (The Toronto Stock Exchange, August 2012)

Velocity — How Fast: Sometimes a minute, or possibly two, is the decision window. For
time-sensitive processes such as fraud detection or identifying known terrorists at a border
crossing, big data must often be analyzed and queried as it comes pouring into a business
via live transactions and interactions. How do you keep a country safe and open to travel
and trade, yet closed to crime, with data points like these: 24,513,463 travelers processed,
9,304,652 land vehicles processed (cars, trucks, buses), and 90,685 aircraft processed?
(The Canada Border Services Agency – CBSA, April to June 2012)

Variety — How Different/Varied: Big data can be any type of data. Structured data is the
kind held in traditional relational databases such as Microsoft Access or SQL Server;
unstructured data is everything else: plain text, audio, video, Microsoft Office documents,
etc. Imagine searching and indexing all global web content (all text, all audio, PDFs, all
documents, all videos, and every picture) and then serving this mosaic of content up to over
212 million Americans in the course of one month! (Google, May 2012)

Complexity — How Difficult to Manage and/or Analyze: Relationships and interactions
between people are amongst the most complex and actively morphing information domains
to model, monitor and analyze: "Who is who?", "Who knows whom?" and "Who does what?"
To answer these questions, organizations must resolve and relate all relevant sources of
data (complex, large volumes), and then present that information to the decision maker at
the point of decision (rapid delivery). One payment services company leverages its
corporate data for fraud detection, a process that allows it to deter more than US$37.7
million in fraudulent transactions. (MoneyGram International, 2011)

Are 76 Billion Rows Of Data Big?


In this case study, we describe our experience with a relatively stable, yet large dataset of
about 76 billion rows. Our considerations focus on approaches for relatively high-volume
data, and do not specifically address high-velocity, high-variety or highly complex data
challenges (as described above).

SOLUTION GOALS AND DRIVERS


Simply put, our client needed a cost-effective data warehouse back-end that could serve
close to 40 users and handle over 20 terabytes of raw, uncompressed data (representing
over 76 billion rows), providing a throughput of 2 GB/sec on reads and roughly 1 GB/sec on
writes.
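
To put these throughput figures in perspective: at 2 GB/sec, a brute-force scan of all 20
terabytes of raw data would take roughly 20,000 GB ÷ 2 GB/sec ≈ 10,000 seconds, or close
to three hours. Meeting response times measured in minutes therefore depends on
compression, column elimination and pre-aggregation rather than raw scan speed alone.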

Business Requirements and Use Case


Acquisition cost and total cost of ownership were key drivers, with our client requesting a
non-proprietary approach (not a specialized solution) that delivered the lowest cost to
purchase and to operate, using standard, mainstream industry approaches where possible.

Performance
The solution required us to efficiently process complex queries on large historical datasets.
Expected response times for most data queries are within minutes. The following table
provides three success measures that the client required us to meet:

Use Case                     Average Number of Rows Returned   Expected Response Time
Run a low-volume query       400,000 rows                      A few minutes
Run a medium-volume query    1.5 million rows                  10 minutes
Run a high-volume query      5-10 million rows                 30 minutes

General Requirements
• The data warehouse must be optimized to address reporting and analysis needs
• In-house query tools and other off-the-shelf analytical tools must be able to
  integrate with the new data warehouse back-end
• The data must only be accessible by named and managed departments within the
  organization

Ease of Use
• Self-service analytics: end-users must be able to quickly conduct their own queries
  to unlock insights with interactive data exploration and graphing/charting
• The solution must provide query tools with an intuitive user interface for creating ad
  hoc queries and continuing analysis in Microsoft Excel
• Ability to export to Excel .XLS or .CSV formats
• Ability to visually categorize and select data (e.g. by industry or by client)

TECHNICAL DESIGN STEPS AND PRINCIPLES


In the client’s use cases, data is consumed in two ways. First, ad-hoc analysis is performed
on aggregate-level data to identify trends and patterns. This type of analysis would ideally
be performed through an easy-to-use interface, with little or no knowledge of SQL or the
underlying data structures. Second, if further investigation is required, end-users need the
ability to extract millions of rows of data for analysis in other downstream processes.

Several approaches were considered when architecting the SMP solution used at our client.
Because of the large data volumes (170 million rows per day), not all approaches would be
feasible on currently available hardware. To achieve the best possible performance at a
reasonable cost, we decided to take a multi-tier approach to the hardware and software.
The sections that follow describe the reference architecture and the approaches considered.

APPROACHES
The following section describes the various approaches considered for the solution, each
along with its respective pros and cons. This methodology of quickly standing up candidate
solutions and evaluating them against the client’s business requirements allowed us to
rapidly converge on the solution best suited to the client’s needs.

Approach 1 – SQL 2012 Tabular in In-Memory Mode


The first approach considered was the new Analysis Services Tabular in-memory model
released with SQL Server 2012. From an end-user standpoint, all front-end tools (PivotTables,
Power View, PerformancePoint, and Reporting Services) would consume data from the same
source, the tabular model, giving a consistent view across all tools. This approach would
give the best end-user experience, but due to the data volume (76 billion rows over 2
years), the resulting tabular model was estimated to be roughly 2 terabytes in size. Even
though fairly robust SMP hardware with 1 terabyte of RAM was being considered, a
2-terabyte model would still be forced to swap half of the model to disk, making the
performance of the system unacceptable.
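
As a rough consistency check on that estimate: 20 terabytes of raw data compressing to
approximately 2 terabytes implies roughly 10x columnar compression (about 26 bytes per
row across 76 billion rows), which is in line with commonly cited ratios for the xVelocity
in-memory engine; unfortunately, it still leaves the model at twice the available 1 terabyte
of RAM.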

[Figure – Approach 1: SQL 2012 Tabular in In-Memory Mode. Source data (flat files, ~170
million rows/day) is loaded by SSIS ETL into the SQL 2012 data warehouse (2 years of data,
~76 billion rows, rolling daily partitions), which feeds a SQL 2012 Analysis Services Tabular
In-Memory model (~2 TB compressed). End-users consume the model via MDX and DAX
through Reporting Services (list reports), Excel PivotTables (ad-hoc analysis), and
SharePoint 2010 PerformancePoint & Power View (rich visualizations).]

Approach 2 – Column Store Index & SQL 2012 tabular in


DirectQuery Mode
The second approach was to still use an Analysis Services Tabular model, but in DirectQuery
mode. This allows the resulting queries to be pushed down to the SQL database engine.
When combined with a column store index on the fact tables, performance would be
acceptable for the data volumes we were considering. The main drawback to this approach
is that DirectQuery models only support the DAX query language, not MDX. This means that
the only supported front-end tool at this time is Power View, which is an excellent tool for
visualizing data but lacks analytic functionality when compared to Excel PivotTables or
traditional OLAP front-end tools. Without a powerful analytic front-end tool, this approach
would not work for the vast majority of business users.
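
For reference, the column store index at the heart of this approach (and of Approaches 3
and 4 below) is created with ordinary T-SQL. The sketch below uses hypothetical table and
column names, not our client's actual schema; note that in SQL Server 2012 the index must
be nonclustered and renders the table read-only until it is dropped or disabled:

    -- Minimal sketch with illustrative names (dbo.FactTrade is hypothetical).
    -- A SQL 2012 columnstore index makes the table read-only, so new data is
    -- typically loaded into a staging table and switched in by partition.
    CREATE NONCLUSTERED COLUMNSTORE INDEX csx_FactTrade
    ON dbo.FactTrade (TradeDateKey, ClientKey, IndustryKey, Quantity, Amount);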

Approach 3 – Column Store Index & SQL 2012 SSAS


Multi-dimensional with ROLAP fact table
Once Analysis Services Tabular was ruled out as a solution for our client, we began
considering traditional Analysis Services multidimensional models. An OLAP cube was
created based on the star schema in the SQL data warehouse, at first using a ROLAP fact
table in order to take advantage of the column store index already being created to service
list reporting through SSRS. The end-user experience using an OLAP cube was acceptable:
users would perform ad-hoc analysis through Excel PivotTables, and drill through to SSRS
reports to access detail-level data. Since Power View is not supported on multidimensional
cubes, it is not available with this approach. For our client, Power View would have been
nice to have as a visualization tool, but it was not considered a requirement for this solution.
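
With a ROLAP fact table, the MDX generated by an Excel PivotTable is translated by
Analysis Services into SQL against the warehouse, where the column store index can satisfy
star-join aggregations in batch mode. The query below is an illustrative sketch of that
pushed-down workload, not the exact SQL our cube generated; all object names are
hypothetical:

    -- Illustrative star-join aggregate of the kind a ROLAP partition
    -- issues against the columnstore-indexed fact table.
    SELECT d.CalendarMonth,
           c.IndustryName,
           SUM(f.Amount) AS TotalAmount,
           COUNT_BIG(*)  AS TradeCount
    FROM dbo.FactTrade AS f
    JOIN dbo.DimDate   AS d ON f.TradeDateKey = d.DateKey
    JOIN dbo.DimClient AS c ON f.ClientKey    = c.ClientKey
    WHERE d.CalendarYear = 2012
    GROUP BY d.CalendarMonth, c.IndustryName;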

Approach 3: Column Store Index & SQL 2012 SSAS Multi-dimensional with ROLAP Fact Table

uery Reporting L
Re ist
SQL Q Services po
rts
y
er
Qu
SQL Query DX Ad-hoc
SSIS ETL M Analysis
ry
SQL
X Que Excel PivotTables
Qu MD s
ery & PowerPivot Rich ation
liz
ua
MDX Query Vis
End-users
Source Data (Flat Files)
~170 million rows/day
3
2
1

SharePoint 2010,
PerformancePoint
Excel 2010 – Powerful BI Tool for Analyzing Big Datasets

Approach 4 – Column Store Index & SQL 2012 SSAS


Multi-dimensional with MOLAP Fact table
To further improve performance for end-users, a traditional OLAP cube with a MOLAP fact
table was also created, this time partitioned by day. This allows the nightly ETL process to
load only the most recent day of data, greatly reducing the overall run-time of the nightly
process. Both the end-user experience and the performance of the system were deemed
acceptable, and this was selected as the most desirable approach.
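
The daily partitioning works on two levels: the cube's MOLAP measure group is partitioned
by day in Analysis Services (processed nightly for just the newest day), and the relational
fact table beneath it uses a matching sliding-window scheme. A minimal sketch of the
relational side, with hypothetical names and boundary dates, looks like this:

    -- One relational partition per day (names and dates are illustrative).
    CREATE PARTITION FUNCTION pfDaily (date)
    AS RANGE RIGHT FOR VALUES ('2012-08-01', '2012-08-02', '2012-08-03');

    CREATE PARTITION SCHEME psDaily
    AS PARTITION pfDaily ALL TO ([PRIMARY]);

    -- Each night, open a boundary for the new day before loading it:
    ALTER PARTITION SCHEME psDaily NEXT USED [PRIMARY];
    ALTER PARTITION FUNCTION pfDaily() SPLIT RANGE ('2012-08-04');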

Approach 4: Column Store Index & SQL 2012 SSAS Multi-dimensional with MOLAP Fact Table

uery Reporting L
Re ist
SQL Q Services po
rts
y
er
Qu
SQL Query DX Ad-hoc
SSIS ETL M Analysis
ry
SQL
X Que Excel PivotTables
Qu MD s
ery & PowerPivot Rich ation
SQL 2012 Data Warehouse liz
ua
Column Store Index on Fact MDX Query Vis
2 years of data, ~76 billion rows, End-users
Source Data (Flat Files)
~170 million rows/day Rolling daily partitions
3
SQL 2012 Analysis 1
2

Services OLAP,
MOLAP Fact Table, SharePoint 2010,
Daily Partitions PerformancePoint

CONCLUSIONS
Designing a system for analysis versus reporting requires different techniques to ensure
optimum performance. The choice of tools also makes a difference, with new features such
as Power View requiring a tabular model. Multiple solutions may be required to address the
business requirements, but by keeping calculations and hierarchies within the relational
model we can ensure that all methods of analysis use the same data and return the same
result (a single version of the truth).
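
As a concrete illustration of that principle, a derived measure can be defined once in a
relational view so that SSRS list reports and the cube both read identical logic; this is a
minimal sketch with hypothetical object names, not our client's actual schema:

    -- The calculation lives in one place in the relational layer
    -- (all object names are illustrative).
    CREATE VIEW dbo.vFactTradeMeasures
    AS
    SELECT TradeDateKey,
           ClientKey,
           Quantity,
           Amount,
           Amount / NULLIF(Quantity, 0) AS AvgPricePerUnit
    FROM dbo.FactTrade;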

Approach #1 – SQL 2012 Tabular in In-Memory Mode
  BI tools supported: All Microsoft tools (PowerPivot, Power View, PerformancePoint, SSRS,
  etc.) can consume this model.
  Conclusions/Observations: The tabular model was about 2 TB in size. In the end, the data
  volumes prohibited the use of this option.

Approach #2 – Column Store Index & SQL 2012 Tabular in DirectQuery Mode
  BI tools supported: DirectQuery only supports DAX-capable query tools. Power View
  supported.
  Conclusions/Observations: Excel does not support DAX queries. We had to abandon this
  option.

Approach #3 – Column Store Index & SQL 2012 SSAS Multi-dimensional with ROLAP Fact Table
  BI tools supported: SSRS, Excel and PerformancePoint fully supported. Power View not
  supported.
  Conclusions/Observations: Uses extra disk space (compared to our MOLAP option). Power
  View does not consume (multi-dimensional) cubes.

Approach #4 – Column Store Index & SQL 2012 SSAS Multi-dimensional with MOLAP Fact Table
  BI tools supported: SSRS, Excel and PerformancePoint fully supported. Power View not
  supported.
  Conclusions/Observations: The MOLAP cube allowed for a very granular partitioning
  strategy (by day) while still delivering very good query responses. Power View does not
  consume (multi-dimensional) cubes.

REFERENCES
Fast Track Data Warehouse on SQL Server Web site
http://www.microsoft.com/sqlserver/en/us/solutions-technologies/data-warehousing/fast-track.aspx

Fast Track Data Warehouse Reference Guide for SQL Server 2012
http://download.microsoft.com/download/D/2/0/D20E1C5F-72EA-4505-9F26-FEF9550EFD44/Fast%20Track%20DW%20Reference%20Guide%20for%20SQL%202012.docx

Choosing a Tabular or Multidimensional Modeling Experience in SQL Server 2012 Analysis Services
http://download.microsoft.com/download/D/2/0/D20E1C5F-72EA-4505-9F26-FEF9550EFD44/Fast%20Track%20DW%20Reference%20Guide%20for%20SQL%202012.docx

All about PowerPivot for Microsoft Excel
http://www.microsoft.com/en-us/bi/powerpivot.aspx

SQL Server Web site
http://www.microsoft.com/sqlserver/

How to Choose the Right Reporting and Analysis Tools to Suit Your Style
http://download.microsoft.com/download/D/2/0/D20E1C5F-72EA-4505-9F26-FEF9550EFD44/MicrosoftReportingToolChoices%2020120327%201643E3.docx

SQL Server TechCenter
http://technet.microsoft.com/en-us/sqlserver/

SQL Server DevCenter
http://msdn.microsoft.com/en-us/sqlserver/
