Market Basket Analysis With Networks

Troy Raeder, Nitesh V. Chawla

Interdisciplinary Center for Network Science and Applications
Department of Computer Science and Engineering
University of Notre Dame
Notre Dame, IN 46556 USA
{traeder, nchawla}@cse.nd.edu
Abstract
The field of market basket analysis, the search for meaningful associations in customer purchase data, is one of the oldest areas of data
mining. The typical solution involves the mining and analysis of association rules, which take the form of statements such as "people who buy
diapers are likely to buy beer." It is well known, however, that typical
transaction datasets can support hundreds or thousands of obvious association rules for each interesting rule, and filtering through the rules is a
non-trivial task [25]. One may use an interestingness measure to quantify
the usefulness of various rules, but there is no single agreed-upon measure
and different measures can result in very different rankings of association
rules. In this work, we take a different approach to mining transaction
data. By modeling the data as a product network, we discover expressive
communities (clusters) in the data, which can then be targeted for further
analysis. We demonstrate that our network-based approach can concisely
isolate influence among products, mitigating the need to search through
massive lists of association rules. We develop an interestingness measure
for communities of products and show that it isolates useful, actionable
communities. Finally, we build upon our experience with product networks to propose a comprehensive analysis strategy by combining both
traditional and network-based techniques. This framework is capable of
generating insights that are difficult to achieve with traditional analysis
methods.
Keywords: market basket analysis, community detection, product network, transaction data, association rules
Introduction
The collection and study of retail transaction data, known as market basket
analysis, has become increasingly prevalent in the past several years. Many
supermarkets, for example, issue loyalty cards [27]. While providing discounts to
the customer, these cards allow the retailer to develop a better understanding of
individuals' purchasing habits by associating customers with transactions. The
uses of this information vary, but may include informing product placement
decisions, designing personalized marketing campaigns, and determining the
timing and extent of product promotions [1, 2, 14] among others.
Formally, the task of market basket analysis is to discover actionable knowledge in transaction databases. The problem can be understood as follows: A
standard retail store sells a large set of products $P$. Define a transaction $p \subseteq P$
as the set of products an individual customer buys in a single trip to the store.
The store's transaction database $T = \{p\}$ is the set of all transactions the
store has processed within a given time period. Ultimately, an effective analysis method should enable the retailer to draw clear, comprehensive conclusions
from the data.
One popular tool for market basket analysis in practice is the mining of
association rules [2]. A set of association rules R(T, s, c) is defined by a transaction database T, a minimum support parameter s and a minimum confidence
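parameter c. Define A and B as arbitrary sets of products. Further, define $\mathcal{A}$
(analogously $\mathcal{B}$) as the set of transactions containing every product in A (B).
Formally, R is the set of all rules $A \rightarrow B$ such that:
1. $\frac{|\mathcal{A} \cap \mathcal{B}|}{|T|} \geq s$
2. $\frac{|\mathcal{A} \cap \mathcal{B}|}{|\mathcal{A}|} \geq c$.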
Association rules have found successful application in many diverse contexts
and a number of algorithms have been developed to discover them efficiently
[2, 10, 23, 43], but they are not without limitations. The most prominent of
these is sheer volume. Large transaction datasets tend to contain hundreds or
thousands of rules at reasonable levels of support and confidence, and many
of these may be redundant or obvious [25]. As a result, it is often difficult to
isolate interesting relationships.
Two distinct classes of methods have evolved to address this problem. One
class [20, 25, 40, 41] attempts to eliminate any rules that may be redundant,
while the other [18, 28, 35] aims to elevate rules that are especially interesting
(by sorting on an objective measure). Unfortunately, the concepts of both interestingness and redundancy are somewhat subjective. As a result, as we
show in Section 2, these methods are of limited use in practice.
Ultimately, existing literature on market basket analysis has failed to provide
conclusive answers to some of the field's most pressing questions. For example,
there is no widely-accepted means of isolating representative or useful relationships in market basket datasets and no existing work of which we are aware has
attempted to offer any manner of procedural guidance for analyzing such data.
In other words, no work has addressed the question: "Given a new market basket dataset, what method or methods should I apply in order to obtain effective
insights?"
This work attempts to address these concerns and improve the power and
clarity of market basket analysis by modeling transactional data as a network.
We show that by detecting communities of products in this network, we can
discover strong and expressive relationships among products, including relationships that are difficult to discover with traditional association rules. We then
build on our experience with product networks and with a number of different
market-basket and graph-theoretic algorithms to propose a novel procedure for
mining unseen market basket datasets. The network representation of transaction data allows for the use of a diverse array of algorithms previously unavailable to the association rule community. As a result, this procedure is the first
comprehensive market basket analysis framework ever proposed in the literature. All of our developments and conclusions are verified on real transaction
data, consisting of over 660,000 transactions across more than 2,200 items, from
an on-campus convenience store at the University of Notre Dame.
The remainder of the paper is organized as follows: Section 2 explores the
strengths and weaknesses of traditional association rules analysis on our transaction data. The results presented here motivate the rest of the paper and serve
as an introduction to the data itself. Section 3 introduces the concept of product networks and presents some properties of our network. Section 4 describes
our community detection approach to market basket analysis and presents the
first known interestingness measure for communities of products. Section 5
develops a comprehensive and novel framework for market basket analysis, incorporating both techniques introduced in this paper and previously-developed
network analysis methods. Finally, Section 6 acknowledges some related work
not mentioned elsewhere in the paper and Section 7 concludes.
Association Rules
A popular approach for analyzing market basket data is the discovery and interpretation of association rules. The association rules problem [2] is defined as
follows:
Given a threshold s, called the minimum support and a threshold c, the
minimum confidence, find all rules of the form $A \rightarrow B$, where A and B are sets
of products, such that:
1. A and B appear together in at least s% of transactions.
2. B occurs in at least c% of the transactions in which A occurs.
Sets of products are typically called itemsets, itemsets of size k are called
k-itemsets, and sets that meet the minimum support criterion are typically
called large or frequent itemsets. An association rule is said to be supported
in a transaction database if it meets both the minimum support and minimum
confidence criteria.
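As an illustration of the definition above, the following Python sketch enumerates rules with a single-item antecedent and a single-item consequent by brute force. It is a minimal sketch, not one of the efficient algorithms cited in this section; the function name `pairwise_rules`, the toy baskets, and the use of fractional rather than percentage thresholds are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def pairwise_rules(transactions, min_support, min_confidence):
    """Return (antecedent, consequent, support, confidence) for one-item rules."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), together in pair_counts.items():
        support = together / n
        if support < min_support:
            continue
        # Confidence is asymmetric, so both directions are checked.
        for antecedent, consequent in ((a, b), (b, a)):
            confidence = together / item_counts[antecedent]
            if confidence >= min_confidence:
                rules.append((antecedent, consequent, support, confidence))
    return sorted(rules)


# Hypothetical toy baskets, not the store data analyzed in this paper.
toy_transactions = [
    {"BAGEL", "CREAM CHEESE"},
    {"BAGEL", "CREAM CHEESE", "COFFEE"},
    {"BAGEL", "COFFEE"},
    {"COFFEE"},
    {"BAGEL"},
]
toy_rules = pairwise_rules(toy_transactions, min_support=0.4, min_confidence=0.6)
for antecedent, consequent, support, confidence in toy_rules:
    print(f"{antecedent} -> {consequent}: support={support:.2f}, confidence={confidence:.2f}")
```

In this toy example the rule from CREAM CHEESE to BAGEL is supported while its reverse is not, which illustrates why the two directions of a rule must be evaluated separately.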
Algorithms for efficiently enumerating association rules are well-known
[2, 21, 42] and are a popular tool for unsupervised data exploration. As they
came into widespread use, researchers noticed that understanding the rules
themselves was not a trivial matter. First, there is no obvious method for
choosing appropriate support and confidence thresholds. If the thresholds are
chosen too high, interesting associations may be missed. However, if they are
chosen too low, the user may be inundated with thousands of weak rules that
do not represent meaningful associations.

To illustrate the magnitude of this problem, and in particular the difficulty
of isolating appropriate thresholds, we discovered association rules in our own
data at varying levels of support and confidence. Figure 1(a) shows the number of association rules discovered at 10% confidence as support ranges from
0.005% to 1%. The number of rules is negligible above 0.1% support but increases very rapidly below 0.05%. Figure 1(b) shows a similar result, this time
holding support steady at 0.01% and varying confidence from 5% to 100%. The
increase appears substantially less drastic but this is largely due to a number
of redundant multi-item associations with exceptionally high confidence. Note
that from 10% to 5%, the number of rules more than doubles. Taken together,
Figures 1(a) and 1(b) show that association rules can be incredibly sensitive to
the choice of support and confidence parameters.
A second practical issue is that transaction databases often contain hundreds
or thousands of association rules at reasonable levels of support and confidence,
and many of those rules are either redundant or simply obvious [25].
A number of different techniques have been developed to address this issue.
The first is the mining of maximal [20] or closed [40, 41] itemsets. An itemset
I is closed if no superset of I has the same support as I and I is maximal at
s% support if no superset of I has at least s% support. The effectiveness of
these methods in practice depends on the composition of the data. If a dataset
supports several rules $A \rightarrow B$, $AC \rightarrow B$, $AD \rightarrow B$, ..., maximal itemset mining
will prune the first of these rules but leave the others. If the first rule arises as a
consequence of the others, then the pruning is useful. However, if the additional
products C, D, etc. co-occur incidentally with the popular products A and B,
then the remaining rules are the ones that are redundant. Furthermore, the
number of pruned rules may be very small compared to the number of rules
remaining.
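The definitions of closed and maximal itemsets given above can be made concrete with a small Python sketch. It enumerates frequent itemsets by brute force, which is exponential in the number of items and suitable only for toy inputs; the function names and the four toy baskets are assumptions for illustration and do not correspond to the algorithms of [20, 40, 41].

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Brute-force map from each frequent itemset to its support (a fraction)."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            cset = frozenset(candidate)
            support = sum(cset <= basket for basket in transactions) / n
            if support >= min_support:
                result[cset] = support
    return result


def closed_and_maximal(frequent):
    """Split frequent itemsets into the closed ones and the maximal ones."""
    closed, maximal = set(), set()
    for itemset, support in frequent.items():
        supersets = [s for s in frequent if itemset < s]
        # Maximal: no proper superset is frequent.
        if not supersets:
            maximal.add(itemset)
        # Closed: no proper superset has the same support. A superset with
        # equal support would itself be frequent, so checking `frequent` suffices.
        if all(frequent[s] < support for s in supersets):
            closed.add(itemset)
    return closed, maximal


# Hypothetical toy baskets over items A, B, C.
toy_baskets = [{"A", "B"}, {"A", "B", "C"}, {"A", "B", "C"}, {"A"}]
toy_frequent = frequent_itemsets(toy_baskets, min_support=0.5)
toy_closed, toy_maximal = closed_and_maximal(toy_frequent)
```

Here every maximal itemset is also closed, but not the reverse, so restricting attention to maximal itemsets prunes more aggressively than restricting to closed ones.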
As an example, our data supports 168 rules at 0.01% support and 10% confidence. Of these rules, 155 are maximal. Decreasing support to 0.005%, the
numbers increase to 385 and 340 respectively. In both cases, all the itemsets
are closed. Also, of the original 168 rules, 38 take the form {CREAM CHEESE, X} →
BAGEL or {BAGEL, X} → CREAM CHEESE. Within these rules, all are closed
and only two (BAGEL → CREAM CHEESE and CREAM CHEESE → BAGEL) are not
maximal. This result suggests that, in addition to pruning very few rules, maximal itemset mining, in our case, prunes incorrectly: the remaining 36 rules involving
bagel and cream cheese can be explained very effectively by the very strong
relationship between cream cheese and bagel.¹
These findings may seem to be in conflict with prior research on closed and
maximal itemsets. For example, in [40], the author claims that the mining of
closed itemsets can reduce the number of association rules found in a dataset
by as much as a factor of 3,000. Those experiments, however, were conducted
on generic machine learning datasets rather than market basket datasets. Furthermore, the results were obtained by mining association rules with multi-item
consequents, which is rarely done in practice because it is known to produce redundant rules. We believe, based on our results, that maximal and closed itemset
mining are of limited use for practical market basket analysis.
¹ A matter of notation: Throughout the paper, as we discuss insights from our data, it
will be necessary to mention a number of specific products sold in the store. Whenever we do
so, we will denote them in ALL CAPS to distinguish specific products from concepts or classes of
items. Classes of items are typed in normal text. Thus, throughout the paper, WATER DASANI
20 OZ refers to a specific type of water, whereas water refers to a general class of products.
Figure 1: Number of associations discovered at varying levels of confidence and
support. (a) 10% confidence, varying support. (b) 0.01% support, varying confidence.
A second approach to combat the explosion of uninteresting rules is to calculate additional interestingness measures [35] on the rules. These measures can
then be used to either rank the rules by importance (and present a sorted list
to the user) or as an additional pruning criterion. How exactly interestingness
is determined varies by measure, but many existing measures take the approach
that interestingness is deviation from independence. For example, one of the
simpler such measures, the lift of a rule $A \rightarrow B$, is defined as:
$$L(A \rightarrow B) = \frac{P(A \wedge B)}{P(A)\,P(B)} \qquad (1)$$
where $P(X)$ is the proportion of transactions in which $X$ occurs. Note that if
purchases of A and B are perfectly independent, the lift $L(A \rightarrow B) = 1$. If A
and B appear together more often than we would expect under independence,
the lift is greater than 1, and otherwise it is less than 1.
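The behavior of lift around the independence value of 1 can be checked with a short Python sketch. The function name `lift` and both toy transaction lists are hypothetical; the function assumes that A and B each occur in at least one transaction, since the ratio is otherwise undefined.

```python
def lift(transactions, a, b):
    """Lift of the rule a -> b, where a and b are collections of products."""
    n = len(transactions)
    a, b = set(a), set(b)
    p_a = sum(a <= t for t in transactions) / n
    p_b = sum(b <= t for t in transactions) / n
    p_ab = sum((a | b) <= t for t in transactions) / n
    return p_ab / (p_a * p_b)


# Hypothetical baskets in which X and Y are purchased independently.
independent = [{"X", "Y"}, {"X"}, {"Y"}, set()]
# Hypothetical baskets in which X and Y always appear together.
always_together = [{"X", "Y"}, {"X", "Y"}, {"Z"}, {"Z"}]

lift_independent = lift(independent, {"X"}, {"Y"})
lift_together = lift(always_together, {"X"}, {"Y"})
```

The first value equals 1 and the second exceeds 1, matching the interpretation given above.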
This notion of interestingness is intuitively reasonable, but there are dozens
of such measures defined in the literature [7, 18, 35], and it has been shown that
they tend to rank rules very differently [35]. Therefore, it is not obvious a priori
which measure, if any, will elevate the desired rules to the top, or at what level
of interestingness the useful rules will end.
To study this phenomenon in our own data, we found association rules at
0.01% support and 10% confidence and ranked them according to each measure
given by [35]. Table 1 shows information about the top ten rules by average
rank. Four of these rules are ranked best by at least one measure, and one ranks
as badly as 129. Even the relationship between BAGEL and CREAM CHEESE, which
is the strongest in the data (support almost 1%, confidence 93%) is ranked 128th
by one measure. This variability implies that interestingness measures are useful
mainly when experience or background knowledge is available to assist in the
selection of an appropriate measure.
An alternative approach to searching through large sets of rules is to impose
a pruning criterion that preserves only the strongest relationships in the data.
Hyperclique Patterns [39] discover tightly-knit groups of items, potentially at a
much lower level of support than is feasible with association rules. A hyperclique
pattern P at support s and h-confidence c is a set of items $P = \{P_1, P_2, \ldots, P_n\}$
such that for each association rule $P_i \rightarrow P_1, \ldots, P_{i-1}, P_{i+1}, \ldots, P_n$, the support of
the rule is at least s and the confidence of the rule is at least c. The advantage
of hyperclique patterns is that they are able to discover relevant patterns without an explosion of the rule space, as can occur with ordinary association
rules. However, the criteria that define a hyperclique pattern are very strong
in practice, and it is difficult to find hyperclique patterns of any substantial
size in market basket data. For our data, there are no hyperclique patterns of
size greater than two, even at support as low as 0.005%. Therefore hyperclique
patterns, while effective at discovering certain strong relationships, are hardly
a sufficient analysis technique on their own.
Table 1: High, Low, and Mean rank and Standard Deviation of Ranks for the
top 10 rules by average rank among the 21 interestingness measures in [35]
Rule                                                        High  Low    Mean   StDev
CREAM CHEESE → BAGEL                                        1     128    18.07  33.79
Cake Mix(a) → Frosting                                      3     65.5   21.85  19.61
VAULT SODA → VAULT ZERO                                     6     71     24.95  15.77
YORK MINT PATTIES, DIET COKE 20 OZ → NEWSPAPER CHICAGO TR   2     96     28.05  25.67
NEWSPAPER CHICAGO TR, DIET COKE 20 OZ → YORK MINT PATTIES   8     96     28.85  22.90
BAGEL → CREAM CHEESE                                        1     129    31.37  36.32
CREAM CHEESE, COFFEE 12 OZ → BAGEL                          1     133    32.02  42.90
NYQUIL → DAYQUIL                                            1     118.5  33.35  33.92
VAULT ZERO → VAULT SODA                                     16    70     33.40  13.41
Frosting → Cake Mix                                         3     69     34.37  43.42
(a) Product names are: DH YELLOW CAKE MX 18 and DH FROSTING DXCHOC
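The hyperclique criterion defined above can be checked directly for a candidate set of items, as in the following Python sketch. It follows the definition as stated in this section (every rule from one item to the rest of the pattern must meet both thresholds) and makes no attempt at the efficient search of [39]; the function name `is_hyperclique` and the toy baskets are assumptions for illustration.

```python
def is_hyperclique(transactions, pattern, min_support, min_hconf):
    """Check the hyperclique criterion for one candidate pattern."""
    n = len(transactions)
    pattern = set(pattern)
    together = sum(pattern <= t for t in transactions)
    # Every rule P_i -> (rest of pattern) has the support of the whole pattern.
    if together == 0 or together / n < min_support:
        return False
    for item in pattern:
        item_count = sum(item in t for t in transactions)
        # Confidence of the rule whose antecedent is this single item.
        if together / item_count < min_hconf:
            return False
    return True


# Hypothetical baskets: A is popular, and B is bought only together with A.
toy_baskets = [{"A", "B"}, {"A", "B"}, {"A", "C"}, {"A"}]
strict = is_hyperclique(toy_baskets, {"A", "B"}, min_support=0.25, min_hconf=0.6)
loose = is_hyperclique(toy_baskets, {"A", "B"}, min_support=0.25, min_hconf=0.5)
```

The pattern fails at the stricter threshold because the rule from the popular item A to B has confidence of only 0.5, which illustrates how strong the criterion is when one item in the pattern is much more popular than the others.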
Association Rules Networks [12, 13, 33] reduce the ruleset by focusing solely
on rules related to a single product. More specifically, given a set of association
rules R and a target product z, the association rules network ARN(R, z) is the
unique directed hypergraph G satisfying the following properties:
1. Any hyperedge in G corresponds to a rule in R with a one-item consequent.
2. There is a hyperedge corresponding to a rule whose consequent is the
target product z.
3. The target product z is reachable from every vertex v in G.
4. No vertex $v \neq z$ is reachable from z.
Generally speaking, an ARN shows the extent to which rules flow into the
target product. The resulting network can show both direct and indirect associations of the target product z. However, Association Rules Networks can be quite
sensitive to the choice of target product, and there is no obvious proper choice.
As a result, one must have some idea of the products he or she is interested in
before association rules networks are applicable. We explore the integration of
association rules networks into a broader strategy for market basket analysis in
Section 5.3.
The above discussion suggests that no technique currently available in the literature sufficiently addresses the problem of finding meaningful relationships in
large transaction databases. This deficiency motivates our discussion of network
methods for market basket analysis, which is the subject of the next section.
We do not claim to definitively solve the market basket problem. However, we
will show that as a first exploratory step, our techniques can discover expressive
relationships from which we can draw direct conclusions about the nature of
customer behavior in a store.
Figure 2: Degree distribution for (a) the entire network and (b) the neighbors
of a single product.
Constructing a Network of Products
We begin our discussion by examining the properties of product networks and
their similarities and differences with other types of social networks. To construct a network of products from a list of transactions, we follow an intuitive
approach similar to that of several other authors [22, 25, 32]: each node in the
network represents a product, and an edge appears between any two products
that have been bought together in a transaction.
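The construction just described can be sketched in a few lines of Python. The sketch records, for each pair of products, the number of transactions containing both, which serves as the edge weight discussed later in this section; the function name `build_product_network` and the toy baskets are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def build_product_network(transactions):
    """Map each unordered product pair to its co-purchase count (edge weight)."""
    weights = Counter()
    for basket in transactions:
        # Every pair of distinct products in a basket contributes one co-purchase.
        weights.update(combinations(sorted(set(basket)), 2))
    return weights


# Hypothetical toy baskets, not the store data analyzed in this paper.
toy_baskets = [
    ["CHIPS", "SALSA"],
    ["CHIPS", "SALSA", "SODA"],
    ["SODA"],
]
toy_network = build_product_network(toy_baskets)
```

Note that a basket of k distinct products contributes k(k-1)/2 edges, so a single large transaction forms a clique among its items.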
The networks discussed here and in the rest of the paper are based on transaction data collected from an on-campus convenience store at the University of
Notre Dame during the calendar year 2006. The data contain complete transaction information, including date and time, products purchased, and total cost,
for over 660,000 transactions involving 2,200 unique products. Due to privacy
concerns, there is no way to associate transactions with individual people.
It has been well established that real-world social networks often have heavy-tailed degree distributions, meaning that there are very few hubs connected to
many others, while the vast majority of nodes have very few neighbors [4]. In our
data, we find heavy-tailed behavior both locally and globally. Figure 2 shows
the degree distribution of the entire network and the distribution of edge weights
around a single product. Each plot also contains best-fit power-law distributions
calculated by the method of [16]. The KS-test p-values, given in the figures,
show that both are consistent with power laws at the 0.05 significance level, although the degree
distribution of the entire network is not nearly as strong a fit. In any case,
both distributions exhibit heavy-tailed behavior, in that the distributions are
very heavily skewed toward small numbers but span many orders of magnitude.
This result suggests that the average product is bought infrequently with the
majority of its neighbors, and frequently with only a few.
Figure 2 hints at the most difficult aspect of product networks in practice.
They differ from other types of interaction networks for one simple reason: the
presence of an edge does not necessarily imply a confirmed relationship between
products. Networks based on citations or phone calls, for example, do not suffer
this problem to nearly the same degree.
In citation networks, two nodes linked together by an edge are necessarily
related: if one paper cites another, there is a reason. A cell phone network
will have a small number of incidental links (wrong numbers, telemarketing,
or random personal business), but most of the time, when one person calls
another, it implies a connection between them. Product networks are different.
Simply because a person buys paper towels and spaghetti sauce in the same
transaction does not entail a common motivation for the two purchases. Worse,
a person who buys several unrelated items in a single transaction will form a
clique among them, despite the absence of any true relationship.
As a result, product networks are very dense, with a large number of connections per node, but many of these edges are meaningless, representing spurious
associations generated by chance. Our network contains 2,248 products and almost 250,000 edges between them. However, over 150,000 of these edges have a
weight of one, meaning the two products were bought together only once in the
entire year 2006, and over 235,000 have weight less than 10. These extremely
low-weight edges are common and are unlikely to represent strong relationships.
As a natural consequence of this density, many popular network statistics are
unusually skewed. For example, our product network has a 90% effective diameter of 4 and a full diameter of 5, much smaller than we would expect in a social
network of the same size, and the average clustering coefficient is relatively high
at 0.518.
In order to remove some of the noisy edges created by coincidental purchases
and improve the quality of our subsequent analysis, we establish a minimum
threshold $\sigma$, such that an edge exists between two products only if they have
been bought together at least $\sigma$ times. This is analogous to choosing a minimum
support threshold for association rules. Note that, in the pruned network, the
weight of any remaining edge is unchanged.
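The pruning step amounts to a simple filter on edge weights, sketched below in Python. The function name `prune_network` and the toy weights are hypothetical; the threshold of 65 is the value used for the results in Section 4.2.

```python
def prune_network(weights, threshold):
    """Keep only edges whose co-purchase count is at least `threshold`."""
    return {pair: w for pair, w in weights.items() if w >= threshold}


# Hypothetical edge weights (co-purchase counts) for a toy network.
toy_weights = {
    ("CHIPS", "SALSA"): 120,
    ("CEREAL", "MILK"): 80,
    ("PAPER TOWELS", "SPAGHETTI SAUCE"): 1,
    ("MILK", "SALSA"): 3,
}
pruned = prune_network(toy_weights, threshold=65)
```

The surviving edges keep their original weights, as noted above.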
Having described the construction of a product network and studied some of
its properties, we now turn our attention to the analysis of the product space.
Since the primary focus of market basket analysis is the discovery of relationships between products, we need to find groups of products whose structure or
position within the network reveals useful information about the store itself.
Many real-world interaction networks naturally contain communities:
groups of nodes that are more strongly connected to each other than they are to
the rest of the network. Often, these communities have an easily-interpretable
significance. In a cell phone network [34], for example, communities may represent families or circles of friends. Similarly, in a network of web pages [24]
they may represent sites devoted to a common interest or theme. Community
detection has been applied successfully in numerous fields of science, ranging
from social network analysis [34] to biology [3] and molecular physics [26]. It
seems logical to expect that communities of products, since they are mutually
strongly-connected, would be of particular interest. Therefore, the remainder
of the paper will focus on the problem of community detection in product networks, and show how communities of products can be used to gain insight into
the behavior of customers in a store.
Discovering Communities of Products
Community detection is the process of finding strong communities in a network.
The problem is usually addressed as follows: given a graph $G$, partition it into
a series of disjoint subgraphs $\mathcal{G} = \{G_1, \ldots, G_n\}$ maximizing an objective function
$f(\mathcal{G})$. The number of communities $n$ is generally not known beforehand, but
is determined by the algorithm. Many community detection algorithms [6, 15, 30]
attempt to optimize a quantity known as modularity [31]. The modularity Q of
a set of communities is defined as:
$$Q = \sum_i \left(e_{ii} - a_i^2\right) \qquad (2)$$
where $e_{ii}$ is the fraction of edges that join vertices in community $i$ to other
vertices in community $i$ and $a_i$ is the fraction of edge endpoints that lie in
community $i$. Modularity measures the difference between the number of in-community edges in a given set of communities and the expected number of
in-community edges in a random network with the same degree distribution.
This notion is very intuitive. If a set of communities has a large fraction of
its edges falling within communities, (and therefore a relatively small fraction
falling between communities), then that particular community decomposition
probably represents a strong community structure.
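Equation 2 can be evaluated directly for a given partition, as the following Python sketch shows. It treats the graph as unweighted, which is a simplifying assumption of this example, and it only scores a partition; it does not search for one as the algorithms of [6, 15, 30] do. The function name `modularity` and the toy graph of two triangles joined by a single edge are hypothetical.

```python
def modularity(edges, community_of):
    """Modularity Q of a partition of an unweighted, undirected graph."""
    m = len(edges)
    internal = {}   # community -> number of edges with both ends inside it
    endpoints = {}  # community -> number of edge endpoints inside it
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        endpoints[cu] = endpoints.get(cu, 0) + 1
        endpoints[cv] = endpoints.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    q = 0.0
    for community, ends in endpoints.items():
        e_ii = internal.get(community, 0) / m
        a_i = ends / (2 * m)
        q += e_ii - a_i ** 2
    return q


# Two triangles joined by one bridging edge (a hypothetical toy graph).
toy_edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
split = {1: "left", 2: "left", 3: "left", 4: "right", 5: "right", 6: "right"}
merged = {node: "all" for node in split}

q_split = modularity(toy_edges, split)
q_merged = modularity(toy_edges, merged)
```

Placing each triangle in its own community gives Q = 5/14, whereas placing all nodes in one community gives Q = 0, so the measure prefers the decomposition that matches the visible structure.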
The application to market basket analysis is clear: isolating tightly connected communities within the network of products will allow us to identify
strong relationships among the products and, therefore, meaningful correlations
in customer purchase behavior. Furthermore, because communities can be arbitrarily large, they should be able to represent these relationships much more
expressively and with less redundancy than ordinary association rules.
4.1 Measuring the Utility of Communities
Before we present our results, we quantify the utility of a community. Specifically, we wish to answer the question: given a set of communities in a product
network, which are most useful to a human analyst?
Intuitively, the utility of a community can be determined by two opposing
forces: information, and information density. A useful community will be large
enough to provide a substantial insight into customer behavior, but small enough
to be human-interpretable. To this end, we propose the following quantitative
definitions. Define the information present in a community to be the sum, over
all the edges in the community, of the confidence of the relationship indicated by
the edge. The confidence of the relationship $A \rightarrow B$ is the observed conditional
probability that B is purchased given that A is purchased.
$$I(G_i) = \sum_{(p_1, p_2) \in E_i} P(p_1 \mid p_2) \qquad (3)$$
where $E_i$ is the set of edges in community $G_i$.
We could have chosen, in lieu of confidence, a number of measures for the
strength of an edge. The choice of confidence is convenient for two reasons.
First, it is bounded. An unbounded measure, which can take values up to
infinity, may assign an unreasonably high value to a community containing a
single interesting relationship. Second, it is null invariant [35], meaning that
its measure of the relationship between A and B is unaffected by transactions
containing neither A nor B. To see why null invariance is important, consider
two seasonal products that are sold only one month of the year. Even if these
products are bought together 100% of the time, a measure that is not null-invariant (such as support) will likely see the relationship as weak because, for
most of the year, they are not bought at all.
Next, we define the information density $D(G_i)$ of community $i$ as the information per node in $G_i$:
$$D(G_i) = \frac{I(G_i)}{|V_i|} \qquad (4)$$
Finally, we define the overall utility of community $i$ as the harmonic mean of
the above-defined quantities:
$$U(G_i) = \frac{2\,I(G_i)\,D(G_i)}{I(G_i) + D(G_i)} \qquad (5)$$
Substituting $I(G_i) = |V_i|\,D(G_i)$ (from Equation 4) into Equation 5 yields:
$$U(G_i) = \frac{2\,|V_i|\,D(G_i)^2}{(|V_i| + 1)\,D(G_i)} = 2\,D(G_i)\,\frac{|V_i|}{|V_i| + 1}.$$
Thus, our measure prefers dense communities but, given two communities of roughly equal density, it favors the larger one. This matches the
intuition given earlier.
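Equations 3 through 5 can be combined into a single computation, sketched below in Python. The sketch assumes that each edge is supplied as an ordered pair (p1, p2) and scored by P(p1 | p2), exactly as Equation 3 is written; the function name `community_utility`, the argument layout, and all counts are hypothetical.

```python
def community_utility(nodes, edges, pair_count, item_count):
    """Return (information, density, utility) for one community.

    edges      -- ordered pairs (p1, p2); each contributes P(p1 | p2)
    pair_count -- frozenset({p1, p2}) -> transactions containing both
    item_count -- product -> transactions containing it
    """
    information = sum(
        pair_count[frozenset((p1, p2))] / item_count[p2] for p1, p2 in edges
    )
    density = information / len(nodes)
    if information == 0:
        return 0.0, 0.0, 0.0
    # Harmonic mean of information and information density (Equation 5).
    utility = 2 * information * density / (information + density)
    return information, density, utility


# Hypothetical counts for a three-product community.
toy_nodes = {"CHIPS", "SALSA", "QUESO"}
toy_edges = [("CHIPS", "SALSA"), ("CHIPS", "QUESO")]
toy_pair_count = {
    frozenset(("CHIPS", "SALSA")): 8,
    frozenset(("CHIPS", "QUESO")): 3,
}
toy_item_count = {"CHIPS": 20, "SALSA": 10, "QUESO": 4}

info, dens, util = community_utility(
    toy_nodes, toy_edges, toy_pair_count, toy_item_count
)
```

For these counts the utility agrees with the closed form obtained by substituting Equation 4 into Equation 5, namely twice the density times 3/4 for a community of three products.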
Because the computation in Equation 3 depends on the actual number of
edges present in the community, our utility measure depends somewhat on the
method of graph construction. In other words, if we allow an edge between any
two products that are bought together, the computation will be different than if
we restrict edges to products bought together at least 100 times. The end result
of this is that our utility measure is not comparable across different network
constructions. We do not consider this to be a significant issue because the measure is
designed to help a human analyst assess a single set of communities.
While our utility measure is designed for product networks, we believe that
the tradeoff between size and density is very general and that, in principle,
Equation 5 could be applied to other domains. In an email network, for example,
if one defines information as the frequency of email correspondence between
members of the community over some time period, an analog of Equation 5
follows naturally.
4.2 Results on Real-World Data
In order to demonstrate the effectiveness of our proposed methods, we present
results from our 2006 data. We built a product network in the manner described
above, setting the support parameter $\sigma = 65$ (0.01% of all transactions). We
present communities discovered with the algorithm of Blondel et al. [6], which is
one of the more scalable algorithms available, and rank them using the measure
defined in Equation 5. Though we use only one algorithm here, our studies have
shown that differences across algorithms are largely insignificant.
Figure 3: The first two communities in our data, ranked by the measure given
in Equation 5. (a) Chips and salsa. (b) Eggs and baking products.
Figure 4: The distribution of utility scores across all communities.
Overall, there were 17 communities discovered in the pruned network, ranging in size from two products to over 70. We evaluated each of these communities
using the utility measure defined in Equation 5 and the results appear in Figure 4. The calculated utilities range from very near zero to slightly over 1. We
see that a large number of communities have very low utility, with five communities falling in the first bin (below 0.14). At the other end of the spectrum,
two communities rate substantially higher than the others (1.01 and 0.92 respectively). Highly-rated communities are generally well-connected with a clear
purpose.
Figure 3(a) shows the highest-rated community, consisting of different
types of chips and salsa. The community is very densely connected and
carries a very clear message (people often buy chips and salsa together),
yet it is small enough for a human to interpret easily. The community is nearly
bipartite, with chips connecting only to salsa and salsa connecting only to chips.
The one exception is a single edge between salsa con queso (FL SALSA CON QUE)
and medium salsa (FL SALSA MED 16OZ). From this community, it becomes clear
that chips and salsa are complementary products, while the different types of
chips (and respectively salsa) are substitutes for one another. The salsa con
queso is an exception, because it is distinct from the other types available.
Figure 3(b) shows the second-ranked community, a collection of eggs and
baking products. The structure of the community, with eggs (EGGS CSPRING
8CT) as a hub in the center and the baking items on the periphery, seems to imply
that when people buy eggs in our store, they buy them for baking. Further
investigation supports this initial hypothesis.
There were 541 distinct products bought with EGGS CSPRING 8CT at our
store in the calendar year 2006, and in 18.5% of cases the eggs were bought
alone. However, at least one item among the six neighbors appears in over 39%
of all transactions containing EGGS CSPRING 8CT, which is especially significant
because most of the transactions in our store are small. As a case study, we
further quantify the impact of this particular community. Similar analysis can
be applied to other communities, but space limitations preclude such analysis
in this paper. Intuitively, cake mix is the most likely causal item in the group
(it is unlikely, for example, that people buy frosting because they have a craving
for eggs). Therefore, we calculate expected additional sales from each sale of
cake mix as:
$$E(\text{Sales}) = P(\text{Eggs} \mid \text{CakeMix}) \cdot \text{Price}(\text{Eggs}) + P(\text{Frosting} \mid \text{CakeMix}) \cdot \text{Price}(\text{Frosting})$$
and find that the store can expect to generate $2.30 in additional sales from
each cake mix sold. Therefore, the store stands to profit from any promotion
that increases the sales of cake mix at a cost of less than $2.30 per transaction.
Since cake mix itself costs $2.69, the expected additional revenue is 85.5% of the
item's purchase price. This analysis is admittedly simple, but it demonstrates
that communities can help identify profitable promotions in a store.
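
The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not our analysis code: the baskets, the prices, and the helper names `conditional_probability` and `expected_additional_sales` are all hypothetical, and the conditional probabilities are estimated as simple basket fractions.

```python
def conditional_probability(transactions, given, target):
    """Estimate P(target | given) as the share of baskets with `given`
    that also contain `target`."""
    with_given = [t for t in transactions if given in t]
    if not with_given:
        return 0.0
    return sum(1 for t in with_given if target in t) / len(with_given)


def expected_additional_sales(transactions, trigger, prices):
    """Sum of P(item | trigger) * Price(item) over the companion items."""
    return sum(
        conditional_probability(transactions, trigger, item) * price
        for item, price in prices.items()
    )


# Toy baskets and prices, purely illustrative (not the store's data).
baskets = [
    {"cake_mix", "eggs", "frosting"},
    {"cake_mix", "eggs"},
    {"cake_mix", "frosting"},
    {"cake_mix"},
    {"eggs"},
]
companion_prices = {"eggs": 2.00, "frosting": 3.00}

extra = expected_additional_sales(baskets, "cake_mix", companion_prices)
print(f"expected additional sales per cake mix sold: ${extra:.2f}")

# The ratio quoted in the text: $2.30 of additional sales on a $2.69 item.
ratio = 2.30 / 2.69
print(f"additional revenue as a share of the price: {ratio:.1%}")
```

With real data, `baskets` would be the 2006 transactions and `companion_prices` the shelf prices of eggs and frosting.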
The third- and fourth-ranked communities, shown in Figures 5(a) and 5(b),
are communities of cereal and milk. The first of these shows a small container
of milk as a hub surrounded by a series of cereals. In this case, the milk is small,
at one pint, and many of the cereals are smaller individual-serving cereals. The
second is composed of two nearly-disconnected subgraphs: a hub-and-spoke
arrangement of larger milks and cereals and a clique of sodas. The disparate
structures are each connected, by one edge, to a single product: plastic cups.
These communities support several conclusions in addition to the notion
that people buy cereal and milk together. First, there are separate relationships
between cereal and milk at two levels: smaller sizes of milk correlate with smaller
sizes of cereal, while larger milks relate to larger cereals. Second, the strong
mutual correlation among sodas suggests that they are often purchased several
at a time, while the disconnection among cereals indicates that people buy them
largely for personal use.
The final community of interest is shown in Figure 5(c): a community containing fruit, salad, and yogurt. It is much less dense than the others and therefore,
at number eight, is ranked much less favorably. However, it still contains useful
insights. Figure 5(c) shows the single fruit product (diamond) connected to nine
different yogurt products (triangles). The associations between fruit and any
of the individual yogurt products are not strong (none is ranked better than
78th, in a list of 168 rules, by any of the interestingness measures in [35]), but
in combination the association is quite powerful.

(a) A community of milk and cereal.
(b) A community of milk, cereal, and soda. The soda connects to the rest of the community with only one link.
(c) A community of fruit (diamond), salad (square), and yogurt (triangle).
Figure 5: Three more communities.

If all the different varieties of yogurt are combined, they become the most
popular product purchased with fruit, and we find that 10% of all fruit sales
(by dollar value) come in transactions that contain yogurt, and that 9.5% of all
yogurt transactions contain some form of fruit. By contrast, if all varieties of
coffee are combined, coffee (the runner-up) occurs in only 8% of fruit transactions, despite the fact that it is bought five times more frequently than yogurt
overall. The fruit and yogurt association, then, is a significant relationship
whose significance is hidden by the number of yogurt products available.
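
The effect of combining product varieties can be illustrated with a short sketch. The data, the category map, and the function name `cooccurrence_share` are hypothetical, and the sketch counts transactions rather than dollar value, so it only mirrors the shape of the argument: an association that is weak for each variety becomes the strongest one once varieties are merged.

```python
from collections import Counter


def cooccurrence_share(transactions, anchor, category_of=None):
    """For baskets containing `anchor`, return the share that also contain
    each other group. Items are mapped to groups through `category_of`;
    unmapped items form their own group."""
    category_of = category_of or {}
    anchor_baskets = 0
    counts = Counter()
    for basket in transactions:
        groups = {category_of.get(item, item) for item in basket}
        if anchor in groups:
            anchor_baskets += 1
            counts.update(groups - {anchor})
    if anchor_baskets == 0:
        return {}
    return {group: n / anchor_baskets for group, n in counts.items()}


# Hypothetical baskets: three yogurt varieties, one popular coffee.
baskets = [
    {"FRUIT CUP", "YOGURT A"},
    {"FRUIT CUP", "YOGURT B"},
    {"FRUIT CUP", "YOGURT C"},
    {"FRUIT CUP", "COFFEE SM"},
    {"FRUIT CUP", "COFFEE SM"},
    {"FRUIT CUP"},
    {"COFFEE SM"},
    {"COFFEE LG"},
]
categories = {
    "FRUIT CUP": "fruit",
    "YOGURT A": "yogurt", "YOGURT B": "yogurt", "YOGURT C": "yogurt",
    "COFFEE SM": "coffee", "COFFEE LG": "coffee",
}

by_product = cooccurrence_share(baskets, "FRUIT CUP")
by_category = cooccurrence_share(baskets, "fruit", categories)

top_product = max(by_product, key=by_product.get)
top_category = max(by_category, key=by_category.get)
print(top_product, top_category)
```

At the product level the single coffee item ranks first; at the category level the merged yogurt group overtakes it.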
The largest community, not shown, contains over 70 products. Composed
of many of the store's most popular items, it is too large and dense to be
easily interpreted. This fact, in conjunction with the communities mentioned
above, suggests that community detection can play a useful supplementary role
in market basket analysis. The highly-ranked communities discussed above
provide a good deal of insight into the purchases of items as diverse as fruit,
cereal, and frosting, but communities reveal very little with regard to the dense
core of the network: popular products such as coffee, bagels, and water.
Therefore, we propose that community detection be used as a first exploratory step in the analysis process, where it will illuminate the relationships
among important but more peripheral products. Then, the subsequent association rules analysis can focus more intently on products whose role is not clear
within the community decomposition. The next section describes in greater
detail our proposed framework for such an analysis.
Toward a Comprehensive Analysis Strategy
A great deal of literature has been published on the subject of market basket
analysis and survey papers about algorithms [23, 43], interestingness measures
[28, 35], and visualization techniques ([5], section 2) abound. In spite of all
this effort, however, the community has made no substantive attempt to answer
the following basic question: Given a fresh, unseen market basket dataset, what
method or set of methods should be employed to obtain quick, actionable results?
There are several possible reasons for this. The first is a dearth of widely available transaction data, which we alluded to in the introduction. The second
is a general lack of diversity in analysis techniques: maximal itemset mining,
for example, is not different enough from traditional association rules for
the techniques to be complementary, with one strong where the other is weak.
Finally, most studies that do consider real data are only conducted within a
single domain (e.g. supermarkets or online retailers), and so the ability to draw
overarching conclusions is limited.
Since we too are confined to a single dataset, we cannot address the third
concern, but this section addresses the first and the second. In doing so, we call
upon not only the techniques developed here in Section 3, but also a series of
methods developed by other authors. To our knowledge, these methods (Association Rules Networks [12, 13, 33] and Center-Piece Subgraphs [36]) have not
been generally applied to market basket data, but in the course of our work we
have found that they complement community detection nicely.
The rest of the section is organized as follows: Section 5.1 explores practical
concerns regarding the use of Association Rules Networks (introduced in Section 3), Section 5.2 introduces the Center-Piece Subgraph problem and studies
its application in the domain of product networks, Section 5.3 ties together the
discussion of this section and the prior one in order to propose a unified strategy
for mining market basket data, and Section 5.4 briefly discusses strategies for
parameter selection.
5.1 Association Rules Networks
Recall from Section 3 that an Association Rules Network ARN(R, z) is a directed hypergraph representation of the ruleset R that maps out the direct and
indirect associations of the target product z. The concerns we must address
when applying Association Rules Networks are 1) How do we choose an appropriate ruleset R? and 2) How do we choose an appropriate item z? The
first question essentially boils down to the appropriate choice of support and
confidence parameters, and we do not address it here. With regard to the second
question, it is natural first to ask: is the choice of z important? Figure 6
shows two different Association Rules Networks. In Figure 6(a), eggs are used
as the target product, and in Figure 6(b), we use cake mix. Even though the
two products chosen are related, we see that the resulting networks are quite
different. While Figure 6(a) shows a relationship between oil, eggs, cake mix
and frosting, similar to what was found with community detection, Figure 6(b)
contains only cake mix and frosting.

(a) Association Rules Network with z = eggs (EGGS CSPRING 8CT).
(b) Association Rules Network with z = cake mix (DH YELLOW CAKE MX 18).
Figure 6: Two Association Rules Networks from the community of eggs.

While Figure 6 makes it clear that the target product z cannot be chosen
arbitrarily, it does not shed any light on the process for making an appropriate
choice. Figure 7 shows a separate Association Rules Network flowing into BAGEL:
one of the most popular items in the store. This network is large and expressive,
and includes two of the relationships, fruit-yogurt and cereal-milk, that we found
with communities earlier (although not in the same detail). It provides an
effective visualization of the relationships between some of the more central
products in the store.
Many of the items that appear in the network, such as newspapers and
donuts, are items that we would intuitively expect to sell well in the mornings.
A cursory glance at the network suggests that coffee may drive food sales during
the morning hours and bagels may drive drink sales. Coffee does not connect
to any other drinks, whereas BAGEL connects to drinks almost exclusively. Additionally, the network provides insight into the key relationships among other core
products: milk (with cereal), salad (with soup), and fruit (with salad and yogurt).
To understand why this bagel network is so much more informative than the
cake mix network described above, we need to understand the ruleset on which
the network is built. Recall that there were three rules containing BAGEL in the
top-ten rules given in Table 1. As one would expect, the full ruleset contains
substantially more. In fact, 47 of the 168 rules discovered contain BAGEL as
either the antecedent or the consequent. This great diversity among BAGEL's
neighbors in the network allows its ARN to span different segments of the
product space.
Figure 7: Association Rules Network with z = BAGEL.

Thus, it appears that an effective choice for z, when constructing an Association Rules Network from transaction data, is to choose the item that appears
in the most rules in the underlying ruleset R. One might consider instead the
most popular product in the store, or the item which has been bought with the
greatest number of other products. In our data, however, these strategies are
less effective. BULK CANDY, which is both the most frequently sold product and
the one bought with the most other items, has only two products in its Association
Rules Network, and one popular type of water (WATER DASANI 20 OZ) has none.
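
The selection heuristic described above, choosing the item that appears in the most rules, amounts to a simple count. The sketch below assumes a hypothetical rule representation (a tuple of antecedent items paired with a single consequent) and a hypothetical toy ruleset; the function name `most_frequent_rule_item` is ours for illustration only.

```python
from collections import Counter


def most_frequent_rule_item(rules):
    """Return (item, count) for the item appearing in the most rules,
    counting an item once per rule whether it is in the antecedent
    or the consequent."""
    counts = Counter()
    for antecedent, consequent in rules:
        for item in set(antecedent) | {consequent}:
            counts[item] += 1
    return counts.most_common(1)[0]


# Hypothetical ruleset: (antecedent items, consequent item).
rules = [
    (("COFFEE",), "BAGEL"),
    (("CREAM CHEESE",), "BAGEL"),
    (("BAGEL",), "ORANGE JUICE"),
    (("CEREAL",), "MILK"),
]

target, rule_count = most_frequent_rule_item(rules)
print(target, rule_count)
```

In our ruleset the same count would select BAGEL, which appears in 47 of the 168 rules.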
The reason for this is that association rules involving BULK CANDY and WATER
DASANI 20 OZ, which are bought with a stunningly wide variety of items, do
not meet the minimum confidence criterion that we have used throughout the
paper. We contend, however, that relationships which do not meet the minimum
confidence criterion may still be interesting. There are several potential causes
of low confidence, but the most relevant in the case of water is substitution.
There are many different types of water available in the store, and this variety
erodes the confidence of certain relationships.
To illustrate the effect of substitution on rule confidence, assume n different
products $F_1, \ldots, F_n$ are all substitutes for each other, meaning that they serve
roughly the same function F. Furthermore, assume a product P correlates with
items of the function F, such that the confidence of the association rule $P \rightarrow F$
is c, or
\[
\frac{|F \wedge P|}{|P|} = c, \tag{6}
\]
where $|F \wedge P|$ counts the transactions that contain both P and an item of function F.
If the products $F_1, \ldots, F_n$ are all bought equally with P, then for any $F_i$,
the confidence of the rule $P \rightarrow F_i$ is given by
\[
\frac{|F \wedge P|}{n\,|P|} = \frac{c}{n}. \tag{7}
\]
Thus, the substitution erodes the confidence of the association $P \rightarrow F_i$ even
though the overarching association $P \rightarrow F$ may be sufficiently interesting. It is
also trivially true that substitution erodes the support of any relationship.
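
A small synthetic example makes the erosion concrete. The sketch below is illustrative only: the baskets are fabricated for the demonstration, each basket is assumed to contain at most one substitute, and confidence is measured as in Equations 6 and 7, i.e. as the share of baskets containing P that also contain the item (or any item of the function F).

```python
def confidence(transactions, antecedent, consequent_items):
    """Share of baskets containing `antecedent` that also contain at
    least one item from `consequent_items`."""
    with_antecedent = [t for t in transactions if antecedent in t]
    hits = sum(1 for t in with_antecedent if t & consequent_items)
    return hits / len(with_antecedent)


n = 4
substitutes = [f"F{i}" for i in range(1, n + 1)]

# 100 synthetic baskets containing P: 15 with each substitute, 40 with none.
baskets = []
for item in substitutes:
    baskets.extend({"P", item} for _ in range(15))
baskets.extend({"P"} for _ in range(40))

c = confidence(baskets, "P", set(substitutes))
per_item = {item: confidence(baskets, "P", {item}) for item in substitutes}

print(f"confidence for the function F as a whole: {c:.2f}")
for item, value in per_item.items():
    print(f"confidence for {item} alone: {value:.2f}")
```

With a minimum confidence anywhere between c/n and c, the function-level relationship would pass the threshold while every individual rule would be discarded.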
This parameter sensitivity is a problem inherent to every technique we have
covered thus far. Association rules, ARNs, and the community detection framework we have defined will all systematically fail to find relationships that fall
outside the specified support and confidence thresholds for any reason (substitution or otherwise). To address this issue, we turn to Center-Piece Subgraphs.
5.2 Center-Piece Subgraphs
Center-Piece Subgraphs (CePS) [36], like Association Rules Networks, describe
the neighborhood of a node or set of nodes, but they differ considerably in
how they define this neighborhood. The Center-Piece Subgraph Cp(G, b, Q, k)
is a subgraph H of the graph G that contains all query nodes in the set Q,
contains at most b other nodes, and maximizes an objective function g(H). The
parameter k is called a soft AND coefficient. In simple terms, k is the number of
query nodes to which a node must be strongly related in order to be considered
a candidate for the subgraph.
In other words, association rules networks define the neighborhood of the
target product z as the set of products that are either direct or indirect
causes of z within the ruleset R. A center-piece subgraph, by contrast, defines
the neighborhood of the query nodes Q as the set of b products that are most
closely related to the members of Q according to the objective function g().
The benefit of center-piece subgraphs in the context of market basket analysis is that they allow us to trade scope for granularity. While community detection can find relationships in the product network with virtually no guidance, it
requires a reasonable support threshold in order to isolate useful relationships.
Similarly, association rules networks require the specification of a ruleset which
is, by definition, constrained by a minimum support and confidence. Thus, in
both cases, the number of products about which one can learn useful information
is significantly constrained.
Center-Piece subgraphs provide the opportunity to consider all products
in an analysis, because the budget parameter b constrains the size of the sets
that can be discovered. The cost of this added power is a tremendous decrease
in scope. Whereas communities can discover relationships anywhere in the network, and an ARN may extend several levels out from the target product (recall
the BAGEL ARN of Figure 7), a center-piece subgraph is constrained to the set
Q of query nodes and at most b other related products. As a result, the set of
query nodes Q must be carefully defined in order for the resulting subgraph to
be meaningful.
The remainder of the section discusses the objective function g(H) maximized by CePS and outlines practical concerns regarding its application to
market basket data. We conclude that, for the reasons stated above, center-piece subgraphs are primarily useful for either verifying hypotheses suggested by other techniques or explaining unexpected results arrived at by
other methods. For both of these applications, the set of query nodes Q will be
very well-defined.
5.2.1 Objective Function Definition
Define a Random Walk with Restart (RWR) [37] on the graph G starting from
a node $n \in V(G)$ as follows: At time t, a randomly-walking particle existing
at node $n_t \in V(G)$ ($n_0 = n$) transmits itself to one of the neighbors of $n_t$ with
a probability proportional to the weight of its edge with $n_t$. At any time, the
particle has a fixed probability $1 - c$ of returning to node n.

Figure 8: Center-Piece Subgraphs with tortilla chips (TOSTITOS SUPER SIZE) as the query node. Edges are weighted by a) support and b) confidence.

From the normalized matrix of edge weights W, one can calculate the probability
$p^{(t)}_{i,j}$ that a randomly-walking particle starting at node i stands at j after
exactly t steps. The limit as $t \rightarrow \infty$ of the $p^{(t)}_{i,j}$ is known as the steady-state
probability that a particle starting at i will exist at node j. The vector of
steady-state probabilities originating from node i, $p_i$, can be calculated as [37]:
\[
p_i = c\,W p_i + (1 - c)\,e_i, \tag{8}
\]
where $e_i$ is an indicator vector that is 1 in the ith position and zero everywhere
else. The matrix W is normalized in the sense that it is a transition matrix:
i.e. $W_{i,j}$ represents the probability that the randomly-walking particle will
transition from i to j independent of the possibility of restart.
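
The steady-state vector of Equation 8 can be approximated by simple fixed-point iteration. The following is a minimal pure-Python sketch under stated assumptions, not the fast algorithm of [36] or [37]: the weight matrix is a hypothetical four-node graph, rows are normalized so that entry (i, j) is the probability of moving from i to j, the restart term carries weight 1 - c as in Equation 8, and c = 0.5 is an arbitrary choice.

```python
def normalize_rows(weights):
    """Convert a non-negative weight matrix into a row-stochastic
    transition matrix (each non-empty row sums to 1)."""
    normalized = []
    for row in weights:
        total = sum(row)
        normalized.append([w / total if total else 0.0 for w in row])
    return normalized


def rwr_steady_state(weights, start, c=0.5, tol=1e-12, max_iter=10_000):
    """Iterate p <- c * (one walk step applied to p) + (1 - c) * e_start
    until the largest change in any entry falls below `tol`."""
    W = normalize_rows(weights)
    n = len(W)
    p = [1.0 if j == start else 0.0 for j in range(n)]
    for _ in range(max_iter):
        nxt = [0.0] * n
        for i in range(n):
            for j in range(n):
                nxt[j] += c * p[i] * W[i][j]  # mass moving from i to j
        nxt[start] += 1.0 - c                 # restart mass
        if max(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


# Hypothetical symmetric co-purchase weights for four products.
weights = [
    [0, 3, 1, 0],
    [3, 0, 2, 0],
    [1, 2, 0, 1],
    [0, 0, 1, 0],
]
p0 = rwr_steady_state(weights, start=0)
print([round(x, 4) for x in p0])
```

Nodes that are strongly connected to the start node (node 1 here) receive more steady-state probability than nodes several hops away (node 3).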
The RWR problem is very general and has been applied in a number of
contexts. For example, PageRank [8] now incorporates the notion of restart in its
random-walk determination of page relevance to prevent assigning outrageous
scores to dense communities of web pages. The CePS problem incorporates
RWR into its goodness function as follows:
Define $r(i, j) = p_{i,j}$ to be the steady-state probability that a RWR starting
at i exists at j. Further, define r(Q, j, k) to be the steady-state probability that
at least k RWRs originating from nodes in the query set Q simultaneously meet
at node j. For hard AND queries, which are the type of query we will be
most interested in, we can define the probability r(Q, j) that random walkers
from all query nodes meet at j as:
\[
r(Q, j) = \prod_{i \in Q} r(i, j). \tag{9}
\]
The objective function g(H) for a subgraph H follows as:
\[
g(H) = \sum_{v \in V(H)} r(Q, v). \tag{10}
\]
Ref. [36] provides a fast algorithm for extracting subgraphs with high g(H),
and our experience shows that it scales to networks with thousands of nodes.
In the next section we explore practical concerns regarding the application of
CePS to market basket analysis and present results from our data.
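
Equations 9 and 10 translate directly into code once the steady-state probabilities are available. The sketch below assumes a hypothetical precomputed matrix `r`, where `r[i][j]` stands for r(i, j), and only evaluates the hard AND score and the goodness of a given node set; it does not reproduce the extraction algorithm of [36], which also has to select the paths that connect the chosen nodes to the query nodes.

```python
from math import prod


def hard_and_score(r, query, j):
    """r(Q, j): product over query nodes i of r(i, j) (Equation 9)."""
    return prod(r[i][j] for i in query)


def goodness(r, query, nodes):
    """g(H): sum of r(Q, v) over the nodes v of the subgraph (Equation 10)."""
    return sum(hard_and_score(r, query, v) for v in nodes)


# Hypothetical steady-state probabilities for four products; row i holds
# the probabilities for walks restarting at node i.
r = [
    [0.60, 0.20, 0.15, 0.05],
    [0.20, 0.55, 0.20, 0.05],
    [0.15, 0.20, 0.55, 0.10],
    [0.10, 0.10, 0.30, 0.50],
]
query = [0, 1]

# Score every non-query node, then rank candidates for a budget b = 1.
scores = {j: hard_and_score(r, query, j) for j in range(len(r)) if j not in query}
best_candidate = max(scores, key=scores.get)
g = goodness(r, query, query + [best_candidate])
print(best_candidate, round(g, 4))
```

Node 2 is reachable with moderate probability from both query nodes, so it outranks node 3, which is poorly reached from either.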
5.2.2 Center-Piece Subgraphs on Market Basket Data
Each technique we have discussed to this point has been limited by the need to
specify a minimum support (and possibly minimum confidence) with which to
discover relationships. As a result, strong relationships with low levels of support
and substitution relationships with artificially low confidence go undiscovered.
Because center-piece subgraphs are constrained in size by the budget parameter b, it is unnecessary to further constrain them with minimum support
and confidence parameters. As a result, they are the only technique we have
discussed which is capable of discovering relationships between any and all products that make up the product space. The remainder of the section will show
that this property makes center-piece subgraphs invaluable for the exploration of
results obtained through other means. Specifically, they are effective for either
verifying hypotheses suggested by other techniques or explaining relationships
that do not, on the surface, make sense.
Figure 8(a) shows a center-piece subgraph constructed from the full 2006
product network using a type of tortilla chips (TOSTITOS SUPER SIZE) as the
query node and a budget b = 10. The network contains other chips and salsa,
as our prior experience would lead us to expect, but also contains some items
(BULK CANDY and BAGEL) that are marginally related at best. We explored this
phenomenon by constructing subgraphs of gradually increasing size in order to
determine which items the algorithm considered more important with respect
to the tortilla chips. In doing so, we found that the BULK CANDY was added as
the 6th member of the subgraph, before other products to which the chips have
a stronger connection.
The reason for this is that BULK CANDY, as a popular product, is bought with
a tremendously large array of other products (recall the degree distribution of
Section 3). To see why this causes problems for CePS, imagine a seldom-sold
product pj , appearing in 5 transactions, with which BULK CANDY is bought once.
By standard normalization, the transition probability from pj to BULK CANDY is
at least 1/5, meaning that any random particle that reaches pj is highly likely to
reach BULK CANDY. Combining this effect over hundreds of less popular products
results in a very substantial steady-state probability for popular products.
To reduce the influence of such products, we weighted the edges by confidence
instead of by absolute support. That is, the edge between A and B is weighted with
min(P(A|B), P(B|A)). There are two distinct advantages to using confidence
in this instance. First, it forces all edge weights onto a uniform scale between
zero and one. Second, it lessens the impact of coincidental purchases with
popular products. In the example of the previous paragraph, the weight of the
edge between pj and BULK CANDY is now 1/60,000, and after normalization it is
likely that the transition probability from pj to BULK CANDY is much lower.
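
The difference between the two weighting schemes can be checked numerically. In the sketch below, the count of 60,000 transactions for BULK CANDY is the figure implied by the 1/60,000 weight above; the second neighbor X, its counts, and the function names are hypothetical and serve only to give pj another edge to normalize against.

```python
def confidence_weight(co_count, count_a, count_b):
    """min(P(A|B), P(B|A)) computed from raw transaction counts."""
    return min(co_count / count_a, co_count / count_b)


def transition_probabilities(edge_weights):
    """Normalize the weights of one node's edges into probabilities."""
    total = sum(edge_weights.values())
    return {neighbor: w / total for neighbor, w in edge_weights.items()}


# pj appears in 5 transactions: once with BULK CANDY, four times with X.
counts = {"pj": 5, "BULK CANDY": 60_000, "X": 8}
co_purchases = {"BULK CANDY": 1, "X": 4}

by_support = transition_probabilities(dict(co_purchases))
by_confidence = transition_probabilities({
    neighbor: confidence_weight(co, counts["pj"], counts[neighbor])
    for neighbor, co in co_purchases.items()
})

print(f"support weighting:    P(pj -> BULK CANDY) = {by_support['BULK CANDY']:.4f}")
print(f"confidence weighting: P(pj -> BULK CANDY) = {by_confidence['BULK CANDY']:.6f}")
```

Under support weighting a walker at pj moves to BULK CANDY one time in five; under confidence weighting that probability drops by several orders of magnitude.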
Figure 8(b) shows the impact of weighting edges by confidence. Now, instead
of extraneous products like BAGEL and BULK CANDY, we see sodas and other types
of chips, which much more closely matches our intuition and corroborates the
results found with other techniques.
Figure 9 shows a center-piece subgraph with eggs (EGGS CSPRING 8CT) as
the lone query node and a budget of 10. When we examined the community of
eggs and cake mix in Section 3 we concluded that when customers bought eggs
in our store, they bought them for baking. The subgraph in Figure 9 further
corroborates this notion: it includes four additional products (brownie mix,
butter, margarine, and chocolate chips) and all of them are baking products.
Figure 9: A center-piece subgraph with eggs (EGGS CSPRING 8CT) as the query node.

To this point, we have used CePS simply to explore the neighborhood of
individual items, similar to the way in which we might apply Association Rules
Networks. As we mentioned before, however, the CePS algorithm is actually
much more general, and can handle any number of query nodes. The following
discussion explores the ability of CePS to explain a single association rule.
Figure 10 shows a ten-node center-piece subgraph for one of the less intuitive (and more interesting) rules in the dataset: DIET COKE 20 OZ, YORK MINT
PATTIES → NEWSPAPER CHICAGO TR. Specifically, it is a center-piece subgraph
with those three items as query nodes and a budget of 10. The three items
in question seem to be entirely unrelated, and yet the rule is ranked highly by
a number of interestingness measures (Table 1). Ideally, the center-piece subgraph would illuminate the relationship between the products and explain the
association.
Looking at the network, we see something interesting. In addition to patties
and Kit Kat, which appear in the Association Rules Network of Figure 7, we
also see three more types of candy: Hershey's, Mounds and Chuckles. This
observation implies that there is some sort of relationship between Chicago
Tribune, Diet Coke, and candy. As it turns out, the newspapers in our store are
located at the front of the store, next to the rack where those candies are sold.
Figure 10 shows that, because center-piece subgraphs can consider the entire
product network without requiring excessive computation time or providing
overwhelming output, they are very effective for exploration or validation of
relationships provided by other methods. As such, they complement nicely the
other techniques outlined in this paper.
Figure 10: A center-piece subgraph with Diet Coke (DIET COKE 20 OZ), Newspaper (NEWSPAPER CHICAGO TR), and Peppermint Patties (YORK MINT PATTIES) as query nodes, to explain the association rule.

Center-Piece Subgraphs require substantially more parameters than any of
the other techniques we have discussed. All of our experiments were conducted
on small networks (b ≤ 10), with hard AND, meaning that k is equal to the
number of query nodes. Though we did not conduct any detailed studies of
the parameter selection process, informally we found that the choice of k and b
makes little difference in the quality of the subgraph discovered. By choosing b
to be large, we observed that popular products such as BULK CANDY and BAGEL
came to be included in the subgraph. Altering k had no discernible effect for
the types of queries we tried.
5.3 A Strategy for Market Basket Analysis
The research we present here has allowed us to make and corroborate a number
of significant observations about market basket analysis of real-world data. We
re-state the chief observations here, citing the work of others where appropriate.
1. Deriving interesting, actionable knowledge from association rules is difficult because rulesets are often muddied by a preponderance of obvious or
redundant rules [25].
2. One can choose to mine maximal or closed itemsets instead, but these
techniques fail to prune away many redundant rules.
3. Similarly, one may choose to rank rules by an interestingness measure,
but there are many such measures to choose from and they may rank rules
inconsistently [35]. As such, it may be difficult to choose an appropriate
measure in the absence of prior knowledge.
4. Detecting communities of products within the network formed by customer purchases can alleviate redundancy by discovering larger, more expressive relationships among groups of products. However, community
detection is less effective within the dense core of the network and requires
a minimum support threshold, which imparts parameter sensitivity.
5. Association Rules Networks are more effective at exploring the core of
the network, provided that the chosen target product appears in a large
number of association rules. Under other circumstances, they are highly
sensitive to the choice of target product and certain networks, even for
very popular products, are small and uninformative.
6. Center-Piece Subgraphs are useful for explaining or validating relationships discovered by other methods because they do not require a support
or confidence threshold to be effective. They are less useful for general
analysis because they are necessarily limited in scope.
This list of observations naturally suggests a unified strategy for the analysis
of unseen market basket data. First, select a minimum support threshold. On
the basis of this threshold, construct a product network and discover communities. The structure of the interesting communities in the network (as defined
by Equation 5) provides a quick overview of any especially strong relationships
within the data. The discovered relationships are generally more complex and
expressive than those discovered with association rules.
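
The first step of this strategy, building a thresholded product network, can be sketched as follows. This is a simplified illustration under our own assumptions: an edge joins two products that are co-purchased in at least `min_support` transactions and is weighted by that count; the baskets and the function name `product_network` are hypothetical. Community detection would then run on the resulting edge list.

```python
from collections import Counter
from itertools import combinations


def product_network(transactions, min_support):
    """Return {(product_a, product_b): co-purchase count} for every pair
    bought together in at least `min_support` transactions. Pairs are
    stored in sorted order so each edge appears once."""
    pair_counts = Counter()
    for basket in transactions:
        for a, b in combinations(sorted(set(basket)), 2):
            pair_counts[(a, b)] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}


# Hypothetical baskets.
baskets = [
    ["chips", "salsa"],
    ["chips", "salsa", "soda"],
    ["chips", "soda"],
    ["eggs", "cake_mix"],
]

edges = product_network(baskets, min_support=2)
print(edges)
```

Lowering `min_support` to 1 keeps every co-purchase, which corresponds to the unpruned network used for Center-Piece Subgraphs in the final step.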
Next, the analyst should decide on a minimum confidence threshold and discover association rules. Choosing a popular product, such as the product that
appears in the most rules, as the target product, construct an Association Rules
Network. This network will provide a roadmap of some of the important relationships within the core of the network and may illuminate some associations
that were not clear in the list of communities.
The set of communities and the Association Rules Network, along with the
actual list of association rules if desired, will provide a degree of insight into
customer behavior in the store. As a final step, one can apply Center-Piece
Subgraphs to analyze carefully selected subsections of the entire (unpruned)
network. These subgraphs can serve to corroborate or debunk hypotheses about
customer behavior or explain unexpected results in the data. Our experiments
have suggested that Center-Piece Subgraphs are most effective if the edges of
the network are weighted by confidence rather than support.
5.4 Choosing the Minimum Support Parameter
Since the first step in our proposed procedure requires the user to choose a
minimum support parameter, we attempt to provide some guidance on this
choice. We are aware of no prior work from which to draw, but one can imagine
several reasonable options. For example, one might select an arbitrarily high
threshold and iteratively reduce it until the number of rules becomes unmanageable. Alternatively, one may attempt to find a certain number (some hundreds
or thousands) of rules, or a certain number of rules that score highly based on
his or her favorite interestingness measure.
All of these are valid choices and to evaluate them critically is beyond the
scope of this work. However, if community detection is the target then existing
community detection research affords us another option. In Section 3, we briefly
alluded to the fact that community detection algorithms find poor communities
at low levels of minimum support. This fact can be used, in principle, to choose
a minimum support threshold.

Figure 11: Modularity of discovered communities as a function of minimum support.

Modularity (Equation 2) provides us with a measure of the quality of a
community structure. It follows, then, that discovering communities at a given
support threshold with a modularity-maximization algorithm (e.g. [6, 29, 30])
will provide an estimate of the quality of the communities available at that
threshold. This suggests the following procedure:
1. Beginning with a very low support threshold (possibly one transaction),
discover communities using a modularity-maximization algorithm.
2. Iteratively increase the threshold until the modularity of the discovered
community structure begins to plateau or decrease.
3. If there are several thresholds with very similar modularities, pick the lowest one, as it preserves information about the greatest number of products.
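
The three steps above can be sketched as a simple scan over a modularity curve. The sketch assumes the modularity at each threshold has already been obtained from a modularity-maximization algorithm; the curve values, the `tolerance` used to decide what counts as a plateau, and the function name `choose_support_threshold` are all hypothetical.

```python
def choose_support_threshold(modularity_by_threshold, tolerance=0.01):
    """Scan thresholds in increasing order and stop as soon as modularity
    no longer improves by more than `tolerance` (a plateau or a decrease).
    Because the scan keeps the earlier threshold on a near-tie, the lowest
    of several similar thresholds is returned."""
    thresholds = sorted(modularity_by_threshold)
    best = thresholds[0]
    for t in thresholds[1:]:
        gain = modularity_by_threshold[t] - modularity_by_threshold[best]
        if gain <= tolerance:
            break
        best = t
    return best


# Hypothetical modularity values, keyed by minimum support in transactions.
curve = {10: 0.30, 20: 0.42, 30: 0.51, 40: 0.58, 50: 0.63, 60: 0.61, 110: 0.66}

chosen = choose_support_threshold(curve)
print(chosen)
```

Note that this rule stops at the first local maximum, so it would not reach a later, higher peak; whether that is desirable depends on how much product coverage one is willing to give up.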
Figure 11 shows the modularity of the communities discovered by our implementation of Newman's eigenvector modularity algorithm [30] as a function
of the minimum support threshold. We chose this algorithm in particular because it is one of the more effective at finding high-modularity decompositions.
The graph shows a local maximum at a minimum support of 50 transactions
(0.008%) and a global maximum at 110 (0.017%). This suggests that a minimum support threshold of 50 transactions may have been superior to our fairly
arbitrary choice of 0.01%. Further evaluation of this method of support tuning
will make interesting future work.
Related Work
Before concluding, we wish to briefly acknowledge a small amount of related
work that did not fit cleanly into other parts of the paper. Several authors (e.g.
[25, 22]) use graphs to visualize co-purchases between products. We employ
similar techniques to present our results, but claim no originality in doing so.
Clauset et al. [15] apply community detection to Amazon.com transaction
data, but their treatment of the data is very basic. They do not explain any
of the communities found, or address any practical issues, but merely state
that the communities make sense. Hao et al. [22] develop an application
that uses networks to visualize association rules from e-commerce transaction
data. Specifically, the application does a force-directed layout of the products
in a network, and is capable of performing k-means clustering on the resulting
visualization. Our approach is more general, in that community detection algorithms do not require users to specify the number of communities to find. Also,
k-means can be sensitive to the initial locations of the cluster centers, which
imposes an additional parameter on the process.
Cavique [11] transforms a transaction database into a graph for the purpose
of discovering frequent itemsets. Specifically, the paper employs a heuristic to
find maximum-weighted cliques of size k, which are then returned as approximate k-itemsets. A similar maximum-weighted-clique approach could be applied
to discover communities in our product network (see [17]), but its asymptotic
complexity, on the order of $n^3$, is greater than that of the algorithms we have applied. Fonseca et al. [19] use a graph-based representation of association rules (similar,
but not identical, to association rules networks) in order to disambiguate and
expand user queries to search engines. For a query term Q, the authors build
a directed network of terms where the edge $Q_i \rightarrow Q_j$ exists if the association
rule $Q_j \rightarrow Q_i$ holds in the search engine session logs. The strongly-connected
components in this graph are used to define concepts that may be helpful in
disambiguating the user's query.
Conclusion
This work deals primarily with the application of network techniques to the
problem of market basket analysis: the location of meaningful associations in
customer purchase data. There is an overwhelming abundance of prior research
in the mining of market basket data in general, and the use of association rules in particular. The bulk of this research has focused on developing
algorithms for mining association rules [2, 10, 9, 42, 43], techniques for visualizing association rules [5, 22, 25, 38], techniques for eliminating redundant rules
[25, 20, 40, 41], objective measures of association interestingness [18, 28, 35],
or comparing the performance of association rule algorithms on either real or
synthetic datasets [23, 44]. However, there has not been much work from a
practitioner's viewpoint towards answering: Given an unseen market basket
dataset, what set of steps should I follow to conduct a thorough, complete analysis? Our work provides a comprehensive framework aimed at answering this
question.
First, we study the properties of networks of products and show that detecting communities within these networks can uncover expressive relationships
between products that may be difficult to find with association rules. We show
that, in addition to being more expressive than association rules (in that relationships can be expressed more compactly), the structural information available
in communities can assist with financial decisions such as the location of profitable promotions. Finally, we develop a novel measure of interestingness for
communities of products and show that it favors communities which intuitively
seem interesting.
Further, we study the application of two existing techniques, Association
Rules Networks [12, 13, 33] and Center-Piece Subgraphs [36], to the market basket problem. We find that these algorithms complement community detection
in the sense that they can be used effectively to find relationships that communities are unlikely to discover. On the basis of this observation, we propose a very
general framework for the mining of unseen market basket data in the absence of
background knowledge. The framework employs community detection as an initial exploratory step, using Association Rules Networks to uncover relationships
within the dense core of the network and Center-Piece Subgraphs to validate
hypotheses or explore individual relationships that require more explanation.
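The first, exploratory step of this framework can be sketched in Python as follows. The sketch assumes that the product network links two products whenever they co-occur in at least min_support transactions, with the co-occurrence count as the edge weight, and that a candidate partition is scored with weighted modularity [31]. The community detection algorithm itself is not implemented here, and the baskets, product names, and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def product_network(transactions, min_support=1):
    """Return {(product_a, product_b): weight} for co-purchased pairs.

    Assumption: the weight is the number of baskets containing both
    products, and pairs seen fewer than min_support times are dropped.
    """
    counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: w for pair, w in counts.items() if w >= min_support}


def modularity(edges, communities):
    """Weighted modularity Q = sum_c [ w_in(c)/m - (deg(c)/(2m))^2 ].

    Every product that appears in an edge must belong to exactly one
    community; m is the total edge weight of the network.
    """
    label = {p: i for i, members in enumerate(communities) for p in members}
    total = sum(edges.values())
    if total == 0:
        return 0.0
    internal, degree = Counter(), Counter()
    for (a, b), w in edges.items():
        degree[label[a]] += w
        degree[label[b]] += w
        if label[a] == label[b]:
            internal[label[a]] += w
    return sum(internal[c] / total - (degree[c] / (2 * total)) ** 2
               for c in degree)


# Hypothetical baskets and a hand-picked candidate partition.
baskets = [
    {"chips", "salsa"}, {"chips", "salsa"},
    {"eggs", "flour"}, {"eggs", "flour"},
    {"salsa", "eggs"},
]
partition = [{"chips", "salsa"}, {"eggs", "flour"}]

for support in (1, 2):
    network = product_network(baskets, min_support=support)
    print(support, len(network), round(modularity(network, partition), 3))
```

Raising min_support from 1 to 2 removes the single eggs and salsa edge that bridges the two groups, so the same partition scores a higher modularity on the pruned network. In practice the partition would come from a community detection algorithm rather than being specified by hand.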
Acknowledgments
This work was partially supported by the National Science Foundation under grant
NSF 0826958, the NET Institute, and the Arthur J. Schmitt Foundation.
References
[1] G. Adomavicius and A. Tuzhilin. User profiling in personalization applications through rule discovery and validation. In Proceedings of KDD, pages
377–381. ACM New York, NY, USA, 1999.
[2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in
very large databases. In Proceedings of the 20th International Conference
on VLDB, pages 487–499, Santiago, Chile, 1994.
[3] S. Asur, D. Ucar, and S. Parthasarathy. An ensemble framework for clustering protein-protein interaction networks. In ISMB/ECCB, pages 29–40,
2007.
[4] A. Barabasi and E. Bonabeau. Scale-free networks. Scientific American,
288(5):50–9, 2003.
[5] J. Blanchard, F. Guillet, and H. Briand. Exploratory visualization for
association rule rummaging. In KDD-03 Workshop on Multimedia Data
Mining (MDM-03), 2003.
[6] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre. Fast unfolding of communities in large networks, 2008.
[7] T. Brijs, K. Vanhoof, and G. Wets. Defining interestingness for association rules. International journal of information theories and applications,
10(4):370–376, 2003.
[8] S. Brin, R. Motwani, L. Page, and T. Winograd. What can you do with a
Web in your Pocket? Data Engineering Bulletin, 21(2):37–47, 1998.
[9] S. Brin, R. Motwani, and C. Silverstein. Beyond market baskets: generalizing association rules to correlations. Proceedings of the ACM SIGMOD,
pages 265–276, 1997.
[10] S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. ACM SIGMOD Record,
26(2):255–264, 1997.
[11] L. Cavique. A scalable algorithm for the market basket analysis. Journal
of Retailing and Consumer Services, 14(6):400–407, 2007.
[12] S. Chawla, B. Arunasalam, and J. Davis. Mining open source software (oss)
data using association rules network. PAKDD, pages 461–466, 2003.
[13] S. Chawla, J. Davis, and G. Pandey. On Local Pruning of Association
Rules Using Directed Hypergraphs. 20th International Conference on Data
Engineering, 2004.
[14] Y. Cho, J. Kim, and S. Kim. A personalized recommender system based
on web usage mining and decision tree induction. Expert Systems with
Applications, 23(3):329–342, 2002.
[15] A. Clauset, M. Newman, and C. Moore. Finding community structure in
very large networks. Phys. Rev. E, 70(066111), 2004.
[16] A. Clauset, C. Shalizi, and M. Newman. Power-law distributions in empirical data. arXiv, 706, 2007.
[17] N. Du, B. Wu, X. Pei, B. Wang, and L. Xu. Community detection in large-scale social networks. In Proceedings of WebKDD, pages 16–25. ACM,
2007.
[18] W. DuMouchel and D. Pregibon. Empirical bayes screening for multi-item
associations. Proceedings of KDD, pages 67–76, 2001.
[19] B. Fonseca, P. Golgher, B. Possas, B. Ribeiro-Neto, and N. Ziviani.
Concept-based interactive query expansion. In Proceedings of CIKM, page
703. ACM, 2005.
[20] K. Gouda and M. Zaki. Efficiently mining maximal frequent itemsets. In
Proceedings of ICDM, pages 163–170. IEEE Computer Society, 2001.
[21] J. Han and J. Pei. Mining frequent patterns by pattern-growth: methodology and implications. ACM SIGKDD Explorations Newsletter, 2(2):14–20,
2000.
[22] M. Hao, U. Dayal, M. Hsu, T. Sprenger, and M. Gross. Visualization
of directed associations in e-commerce transaction data. Proceedings of
VisSym, 1:185–192, 2001.
[23] J. Hipp, U. Güntzer, and G. Nakhaeizadeh. Algorithms for association
rule mining – a general survey and comparison. ACM SIGKDD Explorations
Newsletter, 2(1):58–64, 2000.
[24] J. Kleinberg and S. Lawrence. The structure of the web. Science, 294:1849–1850,
11 2001.
[25] M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. Verkamo.
Finding interesting rules from large sets of discovered association rules.
Proceedings of CIKM, pages 401–407, 1994.
[26] C. Massen and J. Doye. Identifying communities within energy landscapes.
Physical Review E, 71(4):46101, 2005.
[27] C. Mauri. Card loyalty. A new emerging issue in grocery retailing. Journal
of Retailing and Consumer Services, 10(1):13–25, 2003.
[28] K. McGarry. A survey of interestingness measures for knowledge discovery.
The knowledge engineering review, 20(01):39–61, 2005.
[29] M. Newman. Detecting community structure in networks. The European
Physical Journal B-Condensed Matter and Complex Systems, 38(2):321–330,
2004.
[30] M. Newman. Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74(3):36104, 2006.
[31] M. Newman and M. Girvan. Finding and evaluating community structure
in networks. Physical Review E, 69(2):26113, 2004.
[32] C. Palmer and C. Faloutsos. Electricity based external similarity of categorical attributes. Lecture notes in computer science, pages 486–500, 2003.
[33] G. Pandey, S. Chawla, S. Poon, B. Arunasalam, and J. Davis. Association
Rules Network: Definition and Applications. Statistical Analysis and Data
Mining, 1(4), 2009.
[34] K. Steinhaeuser and N. Chawla. Community detection in a large-scale real
world social network. In LNCS. Springer Verlag, 2008.
[35] P. Tan, V. Kumar, and J. Srivastava. Selecting the right objective measure
for association analysis. Information Systems, 29(4):293–313, 2004.
[36] H. Tong and C. Faloutsos. Center-piece subgraphs: problem definition and
fast solutions. In Proceedings of KDD, pages 404–413. ACM New York,
NY, USA, 2006.
[37] H. Tong, C. Faloutsos, and J. Pan. Fast random walk with restart and its
applications. In Proceedings of ICDM, pages 613–622, 2006.
[38] P. Wong, P. Whitney, and J. Thomas. Visualizing association rules for text
mining. In 1999 IEEE Symposium on Information Visualization (InfoVis '99)
Proceedings, pages 120–123, 1999.
[39] H. Xiong, P. Tan, and V. Kumar. Hyperclique pattern discovery. Data
Mining and Knowledge Discovery, 13(2):219–242, 2006.
[40] M. Zaki. Generating non-redundant association rules. In Proceedings of
KDD, pages 34–43. ACM New York, NY, USA, 2000.
[41] M. Zaki and C. Hsiao. CHARM: An efficient algorithm for closed itemset
mining. In 2nd SIAM International Conference on Data Mining, pages
457–473, 2002.
[42] M. Zaki, S. Parthasarathy, M. Ogihara, W. Li, et al. New algorithms for
fast discovery of association rules. In Proceedings of KDD, volume 20, 1997.
[43] M. J. Zaki. Parallel and distributed association mining: A survey. IEEE
Concurrency, 7(4):14–25, 1999.
[44] Z. Zheng, R. Kohavi, and L. Mason. Real world performance of association
rule algorithms. In Proceedings of KDD, pages 401–406. ACM New York,
NY, USA, 2001.