
Efficient Pattern Mining of Big Data
using Graphs
Dissertation
by:
Vandana Bhatia
( Roll No. : 901403001 )
Under the Guidance of

Dr. Rinkle Rani
(Associate Professor, CSED)
Outline
 Introduction:
 Big Data
 From Big Data to mining
 Graph Applications
 Properties of Real-World Graphs
 Approaches for Graph Mining:
 Graph Clustering
 Frequent Subgraph Mining
 Research Gaps
 Objectives
 Research Methodology
 Graph Processing Tool
 Proposed Clustering Algorithms
 Proposed Subgraph Mining Algorithms
 Conclusion and Future Scope
 References
Introduction
 Big Data:
 Massive amount of structured and unstructured data.
 Datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.
Sources of Big Data:
 Physical measurements: from science (physics, astronomy)
 Medical data: genetic sequences, detailed time series
 Activity data: GPS location, social network activity
 Business data: customer behavior tracking at fine detail
7 V’s of Big Data
Value
Visualization
From Big Data to Data Mining
 Data mining, also known as Knowledge Discovery in Databases (KDD), is the process of extracting useful hidden information from very large databases in a supervised or unsupervised manner.
 For certain kinds of data, the connections between entities are more important than the data itself.
 The relationships between entities can be structured as a graph.
 Mining graph data is one of the important research areas in the field of data mining.
 Involves analyzing graph structure.
 Mining patterns from graphs.
Graph: A Simple Model
 Entities
 Set of vertices
 Pairwise relations among vertices
 Set of edges
 Can add directions, weights, ...
 Graphs can be used to model many real datasets

 People who are friends
 Computers that are interconnected
 Web pages that point to each other
 Proteins that interact
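The vertex-and-edge model above can be sketched in a few lines of Python. This is only an illustration: the `Graph` class, its method names, and the example data are assumptions made for this sketch and are not part of the presented work.

```python
from collections import defaultdict


class Graph:
    """Minimal adjacency-list graph (illustrative names, not from the source)."""

    def __init__(self, directed=False):
        self.directed = directed
        self.adj = defaultdict(dict)  # vertex -> {neighbor: weight}

    def add_edge(self, u, v, weight=1.0):
        self.adj[u][v] = weight
        if self.directed:
            self.adj[v]  # touch v so it is registered as a vertex
        else:
            self.adj[v][u] = weight

    def vertices(self):
        return set(self.adj)

    def degree(self, v):
        # For a directed graph this is the out-degree.
        return len(self.adj[v])


# People who are friends: an undirected graph.
friends = Graph()
friends.add_edge("Ann", "Bob")
friends.add_edge("Bob", "Cid")

# Web pages that point to each other: a directed graph.
web = Graph(directed=True)
web.add_edge("a.html", "b.html")
```

Directions and weights are optional additions to the same structure, which is why one small class can cover both examples.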
Why Graphs?
Graphs encode rich relationships.
Diversity of Graphs
Properties of
Real-World Graphs
There are different patterns in diverse collections of graphs arising from
different phenomena:
 Static networks
 heavy tails
 clustering coefficients
 communities
 small diameters
 Time-evolving networks
 densification
 shrinking diameters
 Web graph
 bow-tie structure
 Bipartite cliques
Graph Applications
 Social networks:
 Friendship and collaboration networks
 Phone-call networks
 Technological networks:
 The internet
 Power grids
 Transportation networks
 Knowledge and Information networks:
 The World Wide Web
 Blog networks
 Software Graphs
 Bluetooth Networks
 Biological networks:
 Gene co-expression networks
 Gene regulation Networks
 Protein-Protein Interaction Networks
Graph Based data mining
Graph Mining is essentially the problem of discovering
repetitive subgraphs occurring in the input graphs.
Motivation
Identifying conceptually interesting patterns.
Identifying clusters by grouping the vertices of a
given input graph.
Classifying the unstructured graph data.
Finding subgraphs capable of compressing the data
by abstracting instances of the substructures.
Achieve parallelism by partitioning data to get an
efficient “Divide and Conquer” solution
Approaches for Mining Graphs
Focus of the Research:
- Graph Clustering
- Frequent Subgraph Mining (FSM)
Graph Clustering
A popular unsupervised learning technique.
Groups similar vertices of a graph to form clusters.
Vertices within a cluster are more densely connected to each other than to the vertices of other clusters.

Applications:
o In a social networking graph, the clusters could represent people with the same or similar hobbies.
o A set of graphs representing chemical compounds can be grouped into clusters based on their structural similarity.
o Building blocks for graph classification, compression, and correlation analysis

Comparative study of existing Clustering algorithms
Frequent Subgraph Mining (FSM)
Finding frequent subgraphs within a single graph or a set of graphs
Applications:
o Mining biochemical structures
o Program control flow analysis
o Research Network analysis
o Building blocks for graph classification, clustering, compression, comparison,
and correlation analysis
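As a minimal sketch of what "frequent" means in the graph-transaction setting, the example below counts the support of the smallest possible patterns, single labeled edges, across a set of graphs. The function name, the input layout, and the toy molecules are assumptions for illustration; full FSM algorithms extend such patterns edge by edge and need subgraph isomorphism checks, which are omitted here.

```python
from collections import Counter


def frequent_edge_patterns(graphs, min_support):
    """Return {pattern: support} for single-edge patterns.

    graphs: list of (labels, edges), where labels maps vertex -> label
    and edges is a list of undirected (u, v) pairs.
    A pattern is the sorted pair of endpoint labels.
    """
    support = Counter()
    for labels, edges in graphs:
        # A set, so each pattern is counted at most once per graph.
        patterns = {tuple(sorted((labels[u], labels[v]))) for u, v in edges}
        support.update(patterns)
    return {p: s for p, s in support.items() if s >= min_support}


# Three toy "molecules" with atom labels (hypothetical data).
g1 = ({1: "C", 2: "O", 3: "C"}, [(1, 2), (2, 3)])
g2 = ({1: "C", 2: "C", 3: "N"}, [(1, 2), (2, 3)])
g3 = ({1: "C", 2: "O"}, [(1, 2)])

result = frequent_edge_patterns([g1, g2, g3], min_support=2)
```

With a support threshold of 2, only the C-O edge survives, because it occurs in two of the three graphs.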
Comparative study of existing FSM algorithms
Research Gaps
Most existing algorithms do not scale, at least not directly, to graphs with millions or billions of nodes and edges.
Existing work relies on a shared-memory model, which limits its ability to handle very large, disk-resident distributed graphs.
Only a few parallel structural clustering algorithms can handle large, distributed databases in the form of graphs.
Most existing clustering algorithms for graphs are sensitive to the selection of initial centers and require the number of clusters to be specified at the beginning.
Research Gaps (Cont.)
 Most of the existing work focuses on graph transactions for performing FSM.
 Only a few works exist for finding frequent subgraphs in a single graph.
 Existing FSM algorithms incur high communication and I/O overhead while handling large graphs.
 There is no scalable solution for finding approximate patterns in large graphs.
OBJECTIVES
1. To better understand and explore various concepts, techniques
and tools available for mining of graphs.
2. To propose improved algorithm(s) of pattern mining for big
data using graphs.
3. To implement the proposed algorithm(s).
4. To verify and validate the proposed algorithm(s).
Research Methodology
Graph Processing Framework: Pregel
Scalable and fault-tolerant platform
API with the flexibility to express arbitrary algorithms
Vertex-centric computation ("think like a vertex")
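To make "think like a vertex" concrete, here is a single-machine sketch of the superstep model using the classic maximum-value propagation example. It is not the Pregel API: the function name, the dictionaries standing in for message queues, and the toy graph are all assumptions for illustration.

```python
def pregel_max(adj, values):
    """Propagate the maximum value to every vertex, superstep by superstep.

    adj: vertex -> list of neighbors (undirected edges stored both ways).
    values: vertex -> initial numeric value.
    """
    values = dict(values)
    # Superstep 0: every vertex sends its value to its neighbors.
    inbox = {v: [] for v in adj}
    for v in adj:
        for n in adj[v]:
            inbox[n].append(values[v])
    supersteps = 1
    # The computation ends when no messages are in flight.
    while any(inbox.values()):
        outbox = {v: [] for v in adj}
        for v, messages in inbox.items():
            if not messages:
                continue  # no incoming messages: the vertex stays halted
            best = max(messages)
            if best > values[v]:
                values[v] = best
                for n in adj[v]:
                    outbox[n].append(best)
            # Otherwise the vertex votes to halt and sends nothing.
        inbox = outbox
        supersteps += 1
    return values, supersteps


# A path a - b - c - d with one large value in the middle.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
final_values, steps = pregel_max(chain, {"a": 3, "b": 6, "c": 2, "d": 1})
```

Each vertex only sees its own value and its incoming messages, which is what lets a real system spread the vertices over many workers.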
Pregel: Master-Slave Architecture
MapReduce vs. Pregel
MapReduce
 No online query support
 The MapReduce data model is not a native graph model.
 Graph algorithms cannot be expressed intuitively.
 Graph processing is inefficient on MapReduce:
 Intermediate results of each iteration need to be materialized.
 The entire graph structure needs to be sent over the network in every iteration, which incurs a large amount of unnecessary data movement.
Pregel
 Exploits fine-grained parallelism at the node level
 Pregel does not move graph partitions over the network; only messages among nodes are passed at the end of each iteration.
 Not many graph algorithms can be implemented elegantly using the vertex-based computation model.
Proposed Clustering Algorithms
PGFC : Parallel Graph Fuzzy Clustering
PFCA : Influence based Fuzzy Clustering algorithm
DFuzzy : Deep Learning based Fuzzy Clustering Model
Evaluation Metrics
 F-Measure
o Harmonic mean of the Recall and Precision values:
$$F\text{-}Measure = \frac{2\,(Precision \times Recall)}{Precision + Recall}$$
$$Precision = \frac{|S \cap T|}{|S|}, \qquad Recall = \frac{|S \cap T|}{|T|}$$
S = set of vertex pairs (i, j) that belong to the same cluster as detected by the proposed algorithm
T = set of vertex pairs (i, j) that belong to the same cluster in the ground-truth data
 Modularity
o Measures the number of dense connections inside a cluster and the number of sparse connections with other clusters:
$$Q = \frac{1}{2m} \sum_{(i,j) \in V^2} \left[ M_{ij} - \frac{d(i)\, d(j)}{2m} \right] \delta(c_i, c_j)$$
where $\delta(c_i, c_j) = 1$ if $c_i = c_j$, and 0 otherwise.
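The two metrics can be checked on a toy graph with the sketch below. It assumes crisp (non-overlapping) cluster labels and an unweighted, undirected graph without self-loops, and it reads $M_{ij}$ as the adjacency entry, $d(i)$ as the degree of vertex i, and m as the number of edges; the function names and the example data are illustrative assumptions.

```python
from itertools import combinations


def same_cluster_pairs(labels):
    """Set of unordered vertex pairs that share a cluster (labels: vertex -> cluster)."""
    return {frozenset(p) for p in combinations(labels, 2)
            if labels[p[0]] == labels[p[1]]}


def f_measure(predicted, truth):
    S, T = same_cluster_pairs(predicted), same_cluster_pairs(truth)
    common = len(S & T)
    if common == 0:
        return 0.0
    precision, recall = common / len(S), common / len(T)
    return 2 * precision * recall / (precision + recall)


def modularity(edges, labels):
    """edges: undirected (u, v) pairs without self-loops; labels: vertex -> cluster."""
    m = len(edges)
    degree = {v: 0 for v in labels}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    adjacent = {frozenset(e) for e in edges}
    q = 0.0
    for i in labels:          # sum over all ordered pairs (i, j) in V x V
        for j in labels:
            if labels[i] == labels[j]:   # delta(c_i, c_j) = 1
                m_ij = 1.0 if frozenset((i, j)) in adjacent else 0.0
                q += m_ij - degree[i] * degree[j] / (2 * m)
    return q / (2 * m)


# Two triangles joined by one bridge edge (hypothetical data).
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
clusters = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
q = modularity(edges, clusters)
f = f_measure({1: "A", 2: "A", 3: "B", 4: "B"}, {1: "X", 2: "X", 3: "X", 4: "Y"})
```

For the two-triangle graph the modularity works out to 5/14, and the small F-Measure example has precision 1/2 and recall 1/3.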
Evaluation Metrics
Partition Coefficient:
o It measures the amount of overlap between clusters and is given as:
$$PC = \frac{1}{N} \sum_{i=1}^{k} \sum_{j=1}^{N} \mu_{ij}^{2}$$
Conductance:
o It can be defined as the ratio of the number of inter-cluster edges to the minimum number of edges incident on either cluster $C_k$ or its complement $\bar{C}_k$:
$$Cond(C_k) = \frac{\sum_{i \in C_k,\, j \in \bar{C}_k} A_{ij}}{\min\big(A(C_k),\, A(\bar{C}_k)\big)}$$
where $A_{ij}$ is the adjacency matrix of graph G, $C_k \subset V$, and $A(C_k)$ is the number of edges incident on $C_k$.
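A short sketch of both metrics follows. It assumes the membership matrix is stored as k rows of N membership degrees, and it interprets "the number of edges incident on a cluster" as the sum of adjacency entries over the cluster's vertices, i.e. the total degree of the cluster; that reading, the function names, and the example data are assumptions of this sketch.

```python
def partition_coefficient(membership):
    """membership[i][j]: membership degree of vertex j in cluster i (k rows, N columns)."""
    n = len(membership[0])
    return sum(mu * mu for row in membership for mu in row) / n


def conductance(edges, cluster):
    """edges: undirected (u, v) pairs; cluster: set of vertices forming C_k."""
    cut = inside = outside = 0
    for u, v in edges:
        u_in, v_in = u in cluster, v in cluster
        if u_in != v_in:
            cut += 1  # edge crossing from the cluster to its complement
        # Count edge endpoints on each side (sum of degrees per side).
        inside += u_in + v_in
        outside += (not u_in) + (not v_in)
    return cut / min(inside, outside)


# Vertex 4 is shared equally between two clusters; the others are crisp.
pc = partition_coefficient([[1.0, 1.0, 0.0, 0.5],
                            [0.0, 0.0, 1.0, 0.5]])

# Two triangles joined by one bridge edge (hypothetical data).
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
cond = conductance(edges, {1, 2, 3})
```

A partition coefficient of 1 means fully crisp clusters, and a lower conductance means a better-separated cluster.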
Proposed PGFC
o Finds fuzzy clusters by amending the structure of the classical Fuzzy C-Means algorithm for large graph data in a distributed environment.
o A degree measure is used to initialize the cluster centers.
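The sketch below illustrates the two ingredients named above on a single machine; it is not the PGFC algorithm itself. It assumes that the k highest-degree vertices are taken as initial centers, that distance is measured in hops, and that the graph is connected. It then applies one classical Fuzzy C-Means membership update, $u_i = 1 / \sum_j (d_i / d_j)^{2/(m-1)}$. All names are hypothetical.

```python
from collections import deque


def hop_distances(adj, source):
    """Breadth-first search distances (in hops) from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def fuzzy_memberships(adj, k, m=2.0):
    # Assumption: the k highest-degree vertices act as initial centers.
    centers = sorted(adj, key=lambda v: (-len(adj[v]), str(v)))[:k]
    dist = {c: hop_distances(adj, c) for c in centers}
    memberships = {}
    for v in adj:
        d = [dist[c].get(v, float("inf")) for c in centers]
        if 0 in d:  # v is itself a center: full membership in its own cluster
            memberships[v] = [1.0 if x == 0 else 0.0 for x in d]
            continue
        # Classical FCM update, written as normalized inverse distances.
        inv = [x ** (-2.0 / (m - 1.0)) for x in d]
        total = sum(inv)
        memberships[v] = [w / total for w in inv]
    return centers, memberships


# Two triangles joined by the bridge edge 3 - 4 (hypothetical data).
adj = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3, 5, 6], 5: [4, 6], 6: [4, 5]}
centers, memberships = fuzzy_memberships(adj, k=2)
```

In the example, vertices 3 and 4 have the highest degree and become centers; vertex 1 is one hop from center 3 and two hops from center 4, so with m = 2 it gets memberships 0.8 and 0.2.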
Pseudo-code: PGFC
Performance Evaluation: PGFC
Scalability: PGFC
Limitations of PGFC
 Requires the number of clusters to be pre-defined.
 Does not analyze the structure of the graph.
 Performs full inverse-distance weighting calculations.
Flow Chart of proposed PFCA
o It selects the candidate cluster heads based on their influence in the network.
o It determines the number of clusters by analyzing the graph structure using the Personalized PageRank algorithm and modularity.
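For readers unfamiliar with Personalized PageRank, the following is a minimal power-iteration sketch of the standard algorithm with a single seed vertex. How PFCA combines the scores with modularity is not shown; the function name, the damping value 0.85, and the toy graph are assumptions.

```python
def personalized_pagerank(adj, seed, alpha=0.85, iterations=50):
    """adj: vertex -> list of out-neighbors (each vertex needs at least one)."""
    rank = {v: 0.0 for v in adj}
    rank[seed] = 1.0
    for _ in range(iterations):
        nxt = {v: 0.0 for v in adj}
        for v, r in rank.items():
            share = alpha * r / len(adj[v])  # spread rank evenly over out-links
            for w in adj[v]:
                nxt[w] += share
        nxt[seed] += 1.0 - alpha  # the random walk restarts at the seed only
        rank = nxt
    return rank


# Two triangles joined by the bridge edge 3 - 4 (hypothetical data).
adj = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3, 5, 6], 5: [4, 6], 6: [4, 5]}
scores = personalized_pagerank(adj, seed=1)
```

Because the walk always restarts at the seed, scores decay with distance from it: vertices in the seed's own triangle score higher than those across the bridge.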
Proposed Algorithm: PFCA
Cluster Formation & Optimization
Proposed Algorithm: PFCA
Filtering: Identifying fuzzy and crisp vertices
Performance Evaluation: PFCA
Dataset Characteristics
Performance
Run Time vs No. of Supersteps
Accuracy in terms of F1 and F2 measures
Scalability: PFCA
Limitations of PFCA
 Information Loss:
The vertices that are not covered by Personalized PageRank are not included in the final result.
Deep Learning based Autoencoders
An autoencoder is a special neural network whose input and output are the same.
It first encodes the input through hidden layers and then decodes it as the output.
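The encode-then-decode idea can be sketched with a tiny linear autoencoder trained by stochastic gradient descent in plain Python. This is a deliberately simplified stand-in: it has a single hidden layer and no activation function, and the function name, learning rate, and toy data are assumptions rather than anything taken from the presented work.

```python
import random


def train_linear_autoencoder(data, hidden=1, lr=0.1, epochs=300, seed=0):
    """Train weights so that decode(encode(x)) is close to x for every x in data."""
    rng = random.Random(seed)
    n = len(data[0])
    enc = [[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(hidden)]
    dec = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(n)]
    losses = []
    for _ in range(epochs):
        total = 0.0
        for x in data:
            code = [sum(w * xi for w, xi in zip(row, x)) for row in enc]  # encode
            out = [sum(w * c for w, c in zip(row, code)) for row in dec]  # decode
            err = [o - xi for o, xi in zip(out, x)]
            total += sum(e * e for e in err)
            # Gradients of 0.5 * ||out - x||^2, taken before any weight changes.
            grad_code = [sum(err[i] * dec[i][j] for i in range(n))
                         for j in range(hidden)]
            for i in range(n):
                for j in range(hidden):
                    dec[i][j] -= lr * err[i] * code[j]
            for j in range(hidden):
                for i in range(n):
                    enc[j][i] -= lr * grad_code[j] * x[i]
        losses.append(total / len(data))  # mean squared reconstruction error
    return enc, dec, losses


# 2-D points that all lie on one line, so a 1-unit code can represent them.
points = [(1.0, 0.5), (0.8, 0.4), (-0.6, -0.3), (0.2, 0.1), (-1.0, -0.5)]
enc, dec, losses = train_linear_autoencoder(points)
```

The hidden layer is narrower than the input, so the network has to learn a compressed representation; the reconstruction error recorded per epoch should fall as training proceeds.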
Proposed DFuzzy
Layered autoencoders are the building blocks of the proposed DFuzzy.
The autoencoder is trained to minimize the reconstruction loss between the original data and the data reconstructed from the new representations.
DFuzzy is a novel autoencoder-based model for learning graph representations; it generates a latent representation for each vertex by capturing the graph's structural information.
Proposed Algorithm: DFuzzy
Performance Evaluation: DFuzzy
Dataset and Effect of multiple layers
Run Time vs No. of Supersteps
Scalability: DFuzzy
Performance Comparison: Coverage
 The graph coverage indicates how many vertices are assigned to clusters
Graph dataset Metric           FCM        LPA        BigClam    Semi-clustering  Proposed PGFC  Proposed PFCA  Proposed DFuzzy
arXiv hep-th  Coverage (%)     100        71.2       63.1       97.1             100            88.1           100
              No. of clusters  100        100        100        100              100            109            109
Amazon        Coverage (%)     100        90.1       99.2       91.3             100            99.9           100
              No. of clusters  75,149     75,149     75,149     75,149           75,149         73,254         73,254
DBLP          Coverage (%)     100        90.3       94.6       84.7             100            99.9           100
              No. of clusters  13,477     13,477     13,477     13,477           13,477         12,186         12,186
YouTube       Coverage (%)     100        91.1       79.5       90.2             100            91.4           99.8
              No. of clusters  8,385      8,385      8,385      8,385            8,385          8,683          8,683
LiveJournal   Coverage (%)     100        82.2       43.9       91.2             100            88.9           99.7
              No. of clusters  287,512    287,512    287,512    287,512          287,512        287,512        287,512
Orkut         Coverage (%)     100        80.5       52.4       90.1             100            90.2           99.5
              No. of clusters  5,000,000  5,000,000  5,000,000  5,000,000        5,000,000      6,128,183      6,128,183
Proposed Frequent Subgraph Mining
Algorithms
 Exact Subgraph Mining: PaGro
o Exact subgraph mining algorithm is proposed by leveraging the
operative communication primitives for better scalability.
o A two-step hybrid approach is developed for optimization of
subgraph pruning task at both local and global levels to avoid the
excess communication overhead.
 Approximate Subgraph Mining: Ap-FSM
o An approximate frequent subgraph mining algorithm which
exploits sampling for faster processing.
o A novel sampling approach named G-Samp is proposed for the
selection of an approximate subgraph while capturing the original
graph properties for convenient and relatively easy analysis.
Proposed PaGro
Proposed Algorithm: PaGro
Datasets Used in PaGro
 LiveJournal: Nodes in this dataset are users of LiveJournal, and directed edges characterize friendship among members.
 Twitter: A node represents a user, and a directed edge indicates that a user is followed by another user. This graph does not contain vertex labels, so each vertex is randomly assigned a label, drawn according to a Gaussian distribution, from a pool of 25 labels.
 US Patents: Each node is a patent labeled with the patent class, and an edge represents a citation.
 DBLP: The vertices represent authors, and an undirected edge connects two vertices if the corresponding authors have co-authored at least one paper.
Performance Evaluation: PaGro
Performance Evaluation: PaGro Optimizations
~Baseline Algorithm ~Isomorphism Discovery ~Communication Cost
Scalability: PaGro
Working of Ap-FSM
o Ap-FSM performs sampling on the original single massive graph to select a representative graph, making frequent subgraph mining computationally less expensive.
o A novel sampling algorithm named ‘G-Samp’ is proposed.
Proposed Ap-FSM:
Phase 1: G-Samp
o A pre-defined sampling factor δ is used to select the vertices from the graph for sampling.
o The selected vertices have the same degree structure as the original graph G = (V, E), so that the selected representative graph G’ = (V’, E’) also follows the same power-law degree distribution.
o The selected vertex set V’, together with its edges, forms the sampled graph G’.
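The slides do not spell out the G-Samp procedure, so the sketch below is not G-Samp. It only illustrates one simple way to respect the degree structure while sampling: group vertices by degree, draw a fraction δ from every group, and keep the edges between the selected vertices. The function name, the rounding rule, and the ring-shaped test graph are assumptions.

```python
import random
from collections import defaultdict


def degree_stratified_sample(adj, delta, seed=0):
    """Sample about delta * |V| vertices, drawing from every degree class.

    adj: vertex -> list of neighbors. Returns the induced subgraph G'.
    """
    rng = random.Random(seed)
    by_degree = defaultdict(list)
    for v in adj:
        by_degree[len(adj[v])].append(v)
    chosen = set()
    for degree in sorted(by_degree):
        group = sorted(by_degree[degree])
        take = max(1, round(delta * len(group)))  # at least one vertex per class
        chosen.update(rng.sample(group, take))
    # Induced subgraph: keep only edges whose endpoints were both selected.
    return {v: [w for w in adj[v] if w in chosen] for v in chosen}


# A ring of 10 vertices, sampled with delta = 0.5 (hypothetical data).
ring = {i: [(i - 1) % 10, (i + 1) % 10] for i in range(10)}
sample = degree_stratified_sample(ring, delta=0.5)
```

Note that stratifying by original degree keeps the proportion of vertices per degree class, but degrees inside the induced subgraph can be lower than in the original graph, so a real sampler needs further care to preserve the power-law shape.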
Performance Metrics for G-Samp
 Degree Distribution
o In-degree is the number of links directed to a vertex, and out-degree is the number of links that the vertex directs to others.
 Clustering Coefficient
o It measures the degree to which the vertices of a graph tend to form clusters.
o There are two versions of this measure: local (LCC) and global (GCC). The LCC of a vertex v measures how close v and its neighbors are to forming a clique, while the GCC gives an overall indication of clustering in the graph.
 Edge Betweenness
o Edge betweenness can be defined as the number of shortest paths σij in the graph G that pass through a given edge (u, v).
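The degree distribution and both clustering coefficients can be computed for a small undirected graph as sketched below. The GCC here uses the transitivity definition (closed triplets over all triplets), which is one common choice and an assumption of this sketch; edge betweenness is left out for brevity. Function names and data are illustrative.

```python
from collections import Counter
from itertools import combinations


def local_clustering(adj, v):
    """LCC of v: fraction of neighbor pairs that are themselves linked."""
    neighbors = adj[v]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


def global_clustering(adj):
    """GCC as transitivity: closed triplets divided by all triplets."""
    closed = triplets = 0
    for v in adj:
        for a, b in combinations(adj[v], 2):
            triplets += 1
            closed += b in adj[a]
    return closed / triplets if triplets else 0.0


def degree_distribution(adj):
    """Map each degree value to the number of vertices having it."""
    return Counter(len(neighbors) for neighbors in adj.values())


# Two triangles joined by the bridge edge 3 - 4 (hypothetical data).
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
lcc_1 = local_clustering(adj, 1)
lcc_3 = local_clustering(adj, 3)
gcc = global_clustering(adj)
degrees = degree_distribution(adj)
```

Comparing these values between an original graph and its sample is one way to judge how well the sample preserves the graph's properties.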
Performance Evaluation: G-Samp
Performance Evaluation: Ap-FSM
Accuracy Evaluation of Ap-FSM
[Figure: F-Measure versus support threshold τ (2k to 10k) for the US Patents, LiveJournal, and Twitter datasets]
Conclusion
A parallel fuzzy clustering algorithm, PGFC, is proposed, which finds fuzzy clusters by amending the structure of the classical Fuzzy C-Means algorithm for large graph data. A degree measure is used to initialize the cluster centers. The results show that PGFC performs better than state-of-the-art clustering algorithms in terms of partition coefficient and conductance.
A parallel fuzzy clustering algorithm named “PFCA” is proposed for large graphs where the number of clusters is not pre-defined. PFCA outperforms the state-of-the-art clustering algorithms in terms of run time, modularity, and conductance.
A deep learning based fuzzy clustering algorithm named DFuzzy is proposed, which performs clustering by leveraging ideas from deep learning pipelines. The results show that the use of autoencoders for fuzzy clustering helps in obtaining good-quality clusters.
An exact subgraph mining algorithm named PaGro is proposed by leveraging the operative communication primitives for better scalability. A two-step hybrid approach is developed to optimize the subgraph pruning task at both local and global levels and avoid excess communication overhead. PaGro performs better than state-of-the-art FSM algorithms in terms of processing time and memory overhead.
An approximate frequent subgraph mining algorithm named Ap-FSM is proposed, which exploits sampling for faster processing. A novel sampling approach named G-Samp is proposed for selecting an approximate subgraph that captures the original graph properties, for convenient and relatively easy analysis. The results show that it outperforms the competing algorithms and is time efficient.
Future Scope
 A suitable partitioning technique could be applied to the proposed approaches for better load balancing.
 The proposed approaches can be tested on graphs with billions of vertices using a larger number of higher-configuration machines.
 The proposed approaches can be used in real-life applications, such as credit card fraud detection and recommendation systems, for obtaining interesting patterns.
List of Publications
Vandana Bhatia and Rinkle Rani, “A Parallel Fuzzy Clustering algorithm for Large Graphs using Pregel”,
Expert Systems with Applications, Vol-78, pp. 135-144, 2017. [SCIE Indexed, Impact Factor-3.928] doi-
10.1016/j.eswa.2017.02.005.
Vandana Bhatia and Rinkle Rani, “DFuzzy: A Deep learning based Fuzzy Clustering Model for Large
Graphs”, Knowledge and Information Systems [ SCIE Indexed, Impact Factor-2.004] doi- 10.1007/s10115-
018-1156-3.
Vandana Bhatia and Rinkle Rani, “Ap-FSM: A Parallel algorithm for Approximate Frequent Subgraph
Mining using Pregel”, Expert Systems with Applications,Vol-106, pp. 217-232, 2018. [SCIE Indexed,
Impact Factor-3.928] doi- 10.1016/j.eswa.2018.04.010.
Vandana Bhatia and Rinkle Rani, “PFCA: An Influence based Parallel Fuzzy Clustering algorithm for Large
Complex Networks”, Expert Systems-SCIE Indexed, Impact factor: 1.18. [In Press]
Vandana Bhatia and Rinkle Rani, “PaGro: A Distributed Pattern Growth based Frequent Subgraph Mining
algorithm for Large Graphs”, IEEE Transactions on Parallel and Distributed Computing-SCIE Indexed,
Impact factor: 4.181. [Under Review]
Vandana Bhatia and Rinkle Rani, “PSGC: Parallel Structural Graph Clustering algorithm based on
Subgraph Similarity”, The Journal of Supercomputing -SCIE Indexed, Impact Factor-1.326
[Communicated].
Vandana Bhatia and Rinkle Rani, “An Efficient Influence based Label Propagation algorithm for Clustering
large graphs”, in the proceedings of IEEE International Conference on Infocom Technologies and
Unmanned Systems (ICTUS'2017), pp. 480-486, 18-20 December 2017.
Vandana Bhatia and Rinkle Rani, “An Efficient algorithm for Sampling of Single Large Graph”, in the
proceedings of IEEE 10th International Conference on Contemporary Computing (IC3’2017), 10-12
August 2017.
Vandana Bhatia, Bharti Saneja and Rinkle Rani, “INGC: Graph Clustering & Outlier Detection algorithm using Label Propagation”, in the proceedings of IEEE International Conference on Machine Learning and Data Science, 13-15 December 2017.
Selected References
o U. Kang and C. Faloutsos, “Big Graph Mining: Algorithms and Discoveries,” ACM SIGKDD Explorations Newsletter, vol. 14, no. 2, pp. 29–36, 2013.
o D. J. Cook and L. B. Holder, Mining Graph Data. Wiley, 2007.
o H. Meyerhenke, P. Sanders, and C. Schulz, “Parallel Graph Partitioning for Complex Networks -Balanced Graph Partitioning,” IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 9, pp. 2625–2638, 2017.
o W. X. Lu, C. Zhou, and J. Wu, “Big social network influence maximization via recursively estimating influence spread,” Knowledge-Based Systems, vol. 113, pp. 143–154, 2016.
o M. Wang, C. Wang, J. X. Yu, and J. Zhang, “Community Detection in Social Networks: An In-depth Benchmarking Study with a Procedure-Oriented Framework,” in the proceedings of the VLDB Endowment, vol. 8, no. 10, pp. 998–1009, 2015.
o N. S. Ketkar, L. B. Holder, and D. J. Cook, “Subdue: compression-based frequent pattern discovery in graph data,” in the proceedings of 1st International Workshop on Open source data Mining:
Frequent Pattern Mining Implementations, pp. 71–76, 2015.
o C. Borgelt and M. R. Berthold, “Mining molecular fragments: finding relevant substructures of molecules,” in the proceedings of IEEE International Conference on Data Mining, pp. 51–58, 2002.
o J. Baumes, M. Goldberg, M. Krishnamoorthy, M. Magdon-Ismail, and N. Preston, “Finding communities by clustering a graph into overlapping subgraphs,” in the proceedings of International Conference of Applied Computing (IADIS 2005), pp. 97–104, 2005.
o J. Huan, W. Wang, J. Prins, and J. Yang, “SPIN: Mining Maximal Frequent Subgraphs from Graph Databases,” in the proceedings of the 10th ACM SIGKDD International Conference on Knowledge
discovery and data mining, no. 1, pp. 581–586, 2004.
o F. Schreiber and H. Schw, “Frequency Concepts and Pattern Detection for the Analysis of Motifs in Networks,” in the proceedings of Transactions on Computational Systems Biology III, pp. 89–104,
2005.
o S. E. Schaeffer, “Graph Clustering,” Computer Science Review., vol. 1, pp. 27–64, 2007.
o L. Wang and J. Hopcroft, “Community Structure in Large Complex Networks,” in the proceedings of International Conference on Theory and Applications of Models of Computation, pp. 455–466,
2010.
o S. Malek, M. Golsefid, M. Hossien, and F. Zarandi, “Fuzzy Community Detection Model in Social Networks,” International Journal of Intelligent System, vol. 30, pp. 1227–1244, 2015.
o H. Zhou, J. Li, J. Li, F. Zhang, and Y. Cui, “A graph clustering method for community detection in complex networks,” Physica A: Statistical Mechanics and Its Applications, vol. 469, pp. 551–562,
2017.
o A. Ghosh, N. S. Mishra, and S. Ghosh, “Fuzzy clustering algorithms for unsupervised change detection in remote sensing images,” Information Sciences., vol. 181, no. 4, pp. 699–715, 2011.
o Y. Zhou, H. Cheng, and J. X. Yu, “Graph Clustering Based on Structural / Attribute Similarities,” in the proceedings of the VLDB Endowment, vol. 2, no. 1, pp. 718–729, 2009.
o D. Arthur, “k-means ++ : The Advantages of Careful Seeding,” in the proceedings of the 18th annual ACM-SIAM symposium on Discrete algorithms, pp. 1027–1035, 2007.
o J. C. Bezdek, R. Ehrlich, and W. Full, “FCM: The fuzzy c-means clustering algorithm,” Computers & Geosciences, vol. 10, no. 2, pp. 191–203, 1984.
o H. Wang, Z. Xu, and W. Pedrycz, “An overview on the roles of fuzzy set techniques in big data processing: Trends, challenges and opportunities,” Knowledge-Based System, vol. 118, pp. 15–30,
2016.
o G. Palla and I. Dere, “Uncovering the Overlapping Community Structure of Complex Networks in Nature and Society,” Nature, vol. 435, no. 7043, pp. 814–818, 2005.
o S. Fortunato, “Community detection in Graphs,” Physics reports, vol. 486, pp. 75–174, 2010.
o C. Laclau and M. Nadif, “Hard and fuzzy diagonal co-clustering for document-term partitioning,” Neurocomputing, vol. 193, pp. 133–147, 2016.
o G. Bello-Orgaz, J. J. Jung, and D. Camacho, “Social Big Data: Recent Achievements and New Challenges,” Information Fusion, vol. 28, pp. 45–59, 2016.
o Y. Özbay, R. Ceylan, and B. Karlik, “A fuzzy clustering neural network architecture for classification of ECG arrhythmias,” Computers in Biology and Medicine, vol. 36, no. 4, pp. 376–388, 2006.
o O. Kesemen, Ö. Tezel, and E. Özkul, “Fuzzy c-means clustering algorithm for directional data ( FCM4DD ),” Expert System and Applications, vol. 58, pp. 76–82, 2016.
o A. Király, Á. Vathy-fogarassy, and J. Abonyi, “Geodesic distance based fuzzy c-medoid clustering – searching for central points in graphs and high dimensional data,” Fuzzy Sets Systems, vol. 1, pp.
1–16, 2015.
o P. G. Sun, L. Gao, and S. S. Han, “Identification of overlapping and non-overlapping community structure by fuzzy clustering in complex networks,” Information Sciences, vol. 181, pp. 1060–1071,
2011.
o T. Nepusz, A. Petrczi, L. Ngyessy, and F. Bazs, “Fuzzy Communities and the concept of Bridgeness in Complex Networks,” Physical Review E , vol. 77, no. 1, pp. 1–13, 2008.
o A. Fahad et al., “A Survey of Clustering Algorithms for Big Data : Taxonomy and Empirical Analysis,” IEEE Transactions on Emerging Topics in Computing, vol. 2, no. 3, 2014.
o A. Inokuchi, T. Washio, and H. Motoda, “An Apriori-Based Algorithm for Mining Frequent Substructures from Graph Data,” Principles of Data Mining and Knowledge Discovery , vol. 1910, pp. 13–23,
2000.
o H. P. Hsieh and C. Te Li, “Mining temporal subgraph patterns in heterogeneous information networks,” in the proceedings of 2nd IEEE International Conference on Privacy, Security, Risk and Trust, pp.
282–287, 2010.
o X. Yan and J. Han, “gSpan: Graph-based substructure pattern mining,” in the proceedings of IEEE International Conference on Data Mining, pp. 721–724, 2002.
o A. Dhiman and S. K. Jain, “Optimizing Frequent Subgraph Mining for Single Large Graph,” Procedia Computer Science, vol. 89, pp. 378–385, 2016.
o J. Li, Z. Zhong, J. Z. Huang, and S. Feng, “Balanced Parallel FP-Growth with MapReduce,” in the proceedings of IEEE International Conference on Granular Computing (GrC), pp. 875–878, 2011.
o K. Wang, X. X. B, H. Jin, P. Yuan, F. Lu, and X. Ke, “Frequent Subgraph Mining in Graph Databases Based on MapReduce,” in the proceedings of 10th Asia-Pacific Services Computing Conference on
Advances in Services Computing, APSCC, pp. 464–476, 2016.
o W. Lin, X. Xiao, and G. Ghinita, “Large-Scale Frequent Subgraph Mining in MapReduce,” in the proceedings of IEEE International Conference on Data Engineering, pp. 844–855, 2014.
o J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” in the proceedings of the Symposium of Operating Systems Design and Implementation, pp. 137–149, 2004.
o G. Malewicz et al., “Pregel : A System for Large-Scale Graph Processing,” in the proceedings of the ACM SIGMOD International Conference on Management of data, pp. 135–145, 2010.
o F. Tian, B. Gao, Q. Cui, E. Chen, and T. Liu, “Learning Deep Representations for Graph Clustering,” in the proceedings of the 28th AAAI Conference on Artificial Intelligence, pp. 1293–1299, 2015.
o J. Yang and J. Leskovec, “Overlapping community detection at scale: A Nonnegative Matrix Factorization Approach,” in the proceedings of the 6th ACM international conference on Web search and data
mining, pp. 587-596 , 2013.
o S. Gregory, “Finding overlapping communities in networks by label propagation,” New Journal of Physics, vol. 12, no. 10, pp. 1–21, 2010.
o T. Schäfer and P. Mutzel, “StruClus: Structural Clustering of Large-Scale Graph Databases,” arXiv Prepr. arXiv1609.09000 (2016)., 2016.
o I. Timón, J. Soto, H. Pérez-sánchez, and J. M. Cecilia, “Parallel implementation of fuzzy minimals clustering algorithm in R,” Expert System with Applications., vol. 48, pp. 35–41, 2016.
o S. A. Ludwig, “MapReduce-based fuzzy c-means clustering algorithm : implementation and scalability,” International Journal of Machine Learning and Cybernetics, vol. 6, no. 6, pp. 923–934, 2015.
o Z. Wu, G. Gao, Z. Bu, and J. Cao, “SIMPLE: a simplifying-ensembling framework for parallel community detection from large networks,” Cluster Computing, vol. 19, no. 1, pp. 211–221, 2016.
o X. Pan, D. Papailiopoulos, S. Oymak, B. Recht, K. Ramchandran, and M. I. Jordan, “Parallel Correlation Clustering on Big Graphs,” In Advances in Neural Information Processing Systems , pp. 1–22,
2015.
o H. Chun et al., “A graph clustering method for community detection in complex networks,” Knowledge and Information System., vol. 469, no. 1, pp. 718–729, 2017.
o S. Nijssen and J. Kok, “Faster Association Rules for Multiple Relations,” in the proceedings of International Joint Conference on Artificial Intelligence, pp. 891–896 , 2001.
o M. Kuramochi and G. Karypis, “An Efficient Algorithm for Discovering Frequent Subgraphs,” IEEE Transaction on Knowledge and Data Engineering, vol. 16, no. 9, pp. 1038–1051, 2004.
o T. Ramraj and R. Prabhakar, “Frequent subgraph mining algorithms - A survey,” Procedia Computer Science, vol. 47, no. C, pp. 197–204, 2014.
Thank You
