each question, we first identify several critical research challenges, and then discuss different research efforts and various techniques used for addressing these challenges. Table 1 summarizes different patent mining tasks, including patent retrieval, patent classification, patent visualization, patent exploration, and cross-language patent mining. Up-to-date references/lists related to patent mining can be found at http://users.cis.fiu.edu/lzhan015/patmining.html. In the following sections, we will briefly introduce the existing solutions to each task based on the techniques being utilized.

The rest of the paper is organized as follows. In Section 2, we provide an introduction to patent documents by describing patent document structures, patent classification systems, and various patent mining tasks. Section 3 presents a summary of research efforts for addressing patent retrieval, especially patent search. In Section 4, we investigate how patent documents can be automatically classified into different predefined categories. In Section 5, we explore how patent documents can be represented to analysts in a way that the core ideas of patents can be clearly illustrated and the correlations of different documents can be easily identified. In Section 6, we show that the quality of a patent document can be automatically evaluated based on predefined measurements that help companies decide which patents are more important and should be maintained for effective property protection. In Section 7, we present different techniques for cross-language patent mining, including approaches to machine translation and semantic correspondence. Section 8 discusses existing free and commercial patent mining systems that provide various functionalities allowing patent analysts to perform different patent mining tasks. Finally, Section 9 concludes our survey and discusses emerging research- and application-wise challenges in the domain of patent mining.

2. BACKGROUND
In this section, we first provide a brief overview of patent documents and their structure, then describe the current patent classification systems, followed by an introduction to the tasks in the entire process of patent application.

2.1 The Structure of Patent Documents
According to the World Intellectual Property Organization³, the definition of a patent is: patents are legal documents issued by a government that grant a set of rights of exclusivity and protection to the owner of an invention. The right of exclusivity allows the patent owner to exclude others from making, using, selling, offering for sale, or importing the patented invention during the patent term, typically a period from the earliest filing date, and in the country or countries where patent protection exists. Based upon the understanding of this definition, patent documents are one of the key components that serve to protect the intellectual property of patent owners. Note that patents and inventions are two different yet interleaved concepts: patents are legal documents, whereas inventions are the content of patents. Different countries or regions may have their own patent laws and regulations, but in general there are two common types of patent documents: utility patents and design patents. Utility patents describe technical solutions related to a product, a process, or a useful improvement, whereas design patents often represent original designs related to the specifications of a product. In practice, due to the distinct properties of these two types of patents, the structure of a patent document may vary slightly; however, a typical patent document often contains several requisite sections, including a front page, detailed specifications, claims, a declaration, and/or a list of drawings to illustrate the idea of the solution.

Figure 1 shows an example of the front page of a patent document. In general, a front page contains four parts, described as follows:

1. Announcement, which includes the Authority Name (e.g., United States Patent), Patent No., and Date of Patent (i.e., the patent publication date);

2. Bibliography, which often includes the Title, Inventors, Assignee, Application No., and Date of Filing;

3. Classification and Reference, which include the International Patent Classification code, region-based classification codes (e.g., the United States Classification code), and/or other patent classification categories, along with references assigned by the examiner;

4. Abstract, which may contain a short description of the invention and sometimes the drawing that is most representative in terms of illustrating the general idea of the invention.

Besides the front page, a patent document contains a detailed description of the solution, claims, and/or a list of drawings.

³ http://www.wipo.int.
generating query keywords; and in Section 3.4 we describe the techniques to expand the query keyword set.

[Figure 4: A summary of patent retrieval techniques, covering query generation (document structural complexity reduction, patent query extraction, patent query partition) and query expansion (external methods, internal methods, appending-based methods, feedback-based methods, pseudo-relevance feedback, citation analysis).]

3.1 Patent Search and a Typical Scenario
In practice, there are five representative patent search tasks, listed as follows:

Prior-Art Search, which aims at understanding the state of the art of a general topic or a targeted technology. It is often referred to as patent landscaping or technology survey. The scope of this task mainly covers all the available publications⁸ worldwide.

Patentability Search, which tries to retrieve relevant documents worldwide that have been published prior to the application date and may disclose the core concept of the invention. This task is often performed before/after patent application.

Invalidity Search, which searches the available publications that invalidate a published patent document. This task is usually performed after a patent is granted.

Infringement Search, which retrieves valid patent publications that are infringed by a given product or patent document. In general, the search operates on the claims section of the available patent documents.

Legal Status Search, which determines whether an invention has the freedom to be made, used, and sold; that is, whether the granted patent has lapsed or not.

In Figure 5, we provide an overview of the procedure to perform patent search tasks. As depicted, it contains four major steps:

[Figure 5: A typical procedure of patent search. The loop identifies an initial query, checks whether the returned results are satisfactory, refines the retrieval query if not, and finally analyzes the returned results.]

Step 1 Construct the retrieval query:
An initial action is to determine the type of patent search task (as aforementioned) based on the purpose of the patent retrieval. Then, the search scope can be identified accordingly. For example, patentability search retrieves relevant documents that are published prior to the filing/application date, and therefore the scope of patentability search covers all the available documents worldwide. Finally, we need to construct the initial retrieval query based on the user's information need, as well as the type of the task. For example, in the task of invalidity search, both the core invention and the classification code of the patent document need to be identified.

Step 2 Perform the query and review the results:
Queries are executed in the scope of the task identified in Step 1, and relevant documents are returned to the user. The user then reviews the returned results to determine whether the documents are the desired ones. If so, go to Step 4; otherwise, go to Step 3.

Step 3 Refine the retrieval query:
If the returned results in Step 2 are not satisfactory (e.g., too many documents, too few results, or many irrelevant results), we need to refine the search queries in order to improve the search results. For example, we can put more constraints (hyponyms) in the query if we want to reduce the number of returned documents, remove several constraints (hyponyms) if we get too few results, or replace the query with new keywords if the results are irrelevant.

Step 4 Analyze the returned results:
After a user reviews each returned document, he/she will write a search report based on the search task in accordance with patent law and regulations. The search report, in general, consists of: (1) a summary of the invention; (2) classification codes; (3) databases or retrieval tools used for the search; (4) relevant documents; (5) query logs; and (6) retrieval conclusions.

⁸ Here the publications are public literature, including patent documents and scientific papers.

We take patentability search as an illustrative example to further explain the search procedure. Suppose a patent ex-
create a collection of related patents, called a patent portfolio [113], to form a super-patent in order to increase the coverage of protection. In this case, a critical question is how to explore and evaluate the potential benefit of patent documents so as to select the most important ones. To tackle this issue, researchers often resort to two types of approaches: unsupervised exploration and supervised evaluation. In the following, we discuss existing research publications related to patent valuation from these two perspectives.

6.1 Unsupervised Exploration
Unsupervised exploration of the importance of patent documents is often oriented towards two aspects: influence power and technical strength. The former relies on the linkage between patent documents, e.g., citations, whereas the latter mainly focuses on content analysis.

Influence power: The first work using citations to evaluate the influence power of patent documents is [20]. In this work, a citation graph is constructed, where each node indicates a patent document, and nodes link to others based on their citation relations. A case study on semi-synthetic penicillin demonstrates the effectiveness of using citation counts in assessing the influence power of patents. In [3], Albert et al. further extend the idea of using citation counts, and demonstrate the validity of citation analysis for evaluating patent documents. In addition, two related techniques are proposed: bibliographic coupling, which indicates that two patent documents share one or more citations, and co-citation analysis, which indicates that two patent documents have been cited by one or more common patent documents. Based on these two techniques, Huang et al. [42] integrate bibliographic coupling analysis and multidimensional scaling to assess the importance of patent documents. Further, ranking-based approaches can also be applied to the process of patent valuation. For example, Fujii [25] proposes the use of PageRank [12] to calculate citation-based scores for patent documents.

Technical strength: Unlike approaches that rely on the analysis of the influence power of patent documents, some research publications focus on the analysis of the technical strength of inventions, which is reflected in the content of patents. For instance, Hasan et al. [36] define the technical strength as claim originality, and exploit text mining approaches to analyze the novelty of patent documents. They use NLP techniques to extract the key phrases from the claims section of patent documents, and then calculate an originality score based on the extracted key phrases. This valuation method has been adopted by IBM, and is applied to various patent valuation scenarios; however, term-based approaches suffer from the problem of term ambiguity, which may deteriorate the rationality of the scores in some cases. To alleviate this issue, Hu et al. [41] exploit a topic model to represent the concepts of the patents instead of using words or phrases. In addition, they state that traditional patent valuation approaches cannot handle the case where the novelty of a patent evolves over time, i.e., the novelty may decrease over time. Therefore, they exploit a time decay factor to capture the evolution of patent novelty. Their experiments indicate that the proposed approach achieves improvements compared with the baselines.

6.2 Supervised Evaluation
The aforementioned approaches define the importance of patent documents from either content or citation links. In essence, they are unsupervised methods, as the goal is to extract meaningful patterns to assess the value of patents purely based on the patents themselves. In practice, besides these two types of resources, other information may also be available to exploit. Some researchers introduce other types of patent-related records, such as patent examination results [37], patent maintenance decisions [46], and court judgments [65], to build predictive models to evaluate patent documents. For example, Hido et al. [37] create a learning model to estimate the patentability of patent applications from historical Japanese patent examination data, and then use the model to predict the examination decision for new patent applications. They define the patentability prediction problem as a binary classification problem (reject or approve). In order to obtain an accurate classifier, they exploit four types of features, including patent document structure, term frequency, syntactic complexity, and word age [36]. Their experiments demonstrate the superiority of the proposed method in estimating the examination decision. Jin et al. [46] construct a heterogeneous information network from a patent document corpus, in which nodes could be inventors, classification codes, or
integrate other types of resources. For example, Thomson Reuters includes science and business articles, Questel combines news and blogs, and LexisNexis considers law cases. These resources are complementary to patent documents and are able to enhance the analysis power of the systems.

Cutting-edge analysis. Commercial systems often provide patent analysis functionalities, by which more meaningful and understandable results can be obtained. For example, Thomson Innovation provides a function called Themescape that identifies common themes within the search results by analyzing the concept clusters, and then vividly presents them to users.

Export functionality. Compared with free patent retrieval systems, which do not allow people to export the search results, most commercial systems provide customized export functions that enable users to select and save different types of information.

Recently, several patent mining systems have been proposed in academia, most of which are constructed by utilizing the available online resources. For example, PatentSearcher [40] leverages domain semantics to improve the quality of discovery and ranking. The system uses more patent fields, such as abstract, claims, descriptions, and images, to retrieve and rank patents. PatentLight [14] is an extension of PatentSearcher, which categorizes the search results by virtue of the tags of the XML structure, and ranks the results by considering flexible constraints on both structure and content. Another representative system is called PatentMiner [97], which studies the problem of dynamic topic modeling of patent documents and provides topic-level competition analysis. Such analysis can help patent analysts identify the existing or potential competitors on the same topic. Further, there are some mining systems focusing on patent image search. For instance, PATExpert [115] presents a semantic multimedia content representation for patent documents based on semantic web technologies. PatMedia [112] provides patent image retrieval functionalities in a content-based manner. The visual similarity is realized by comparing visual descriptors extracted from patent images.

9. CONCLUDING REMARKS
In this survey, we comprehensively investigated several technical issues in the field of patent mining, including patent search, patent categorization, patent visualization, and patent evaluation. For each issue, we summarized the corresponding technical challenges exposed in real-world applications, and explored different solutions to them from existing publications. We also introduced various patent mining systems, and discussed how the techniques are applied to these systems for efficient and effective patent mining. In summary, this survey provides an overview of existing patent mining techniques, and also sheds light on specific application tasks related to patent mining.

With the increasing volume of patent documents, many application-oriented issues are emerging in the domain of patent mining. In the following, we identify a list of challenges in this domain with respect to several mining tasks.

Figure-Based Patent Search introduces patent drawings as additional information to facilitate traditional patent search tasks, as technical figures are able to vividly demonstrate the core idea of an invention in some domains, especially electronics and mechanics. The similarity between technical figures may help improve the accuracy of patent search.

Product-Based Patent Search: In general, a product may be associated with multiple patents. For example,
[76] P. Mahdabi, M. Keikha, S. Gerani, M. Landoni, and F. Crestani. Building queries for prior-art search. Multidisciplinary Information Retrieval, pages 3–15, 2011.

[89] G. Salton. The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall, Inc., 1971.

[93] C. Sternitzke, A. Bartkowski, and R. Schramm. Visualizing patent statistics by means of social network analysis tools. World Patent Information, 30(2):115–131, 2008.

[94] J. H. Suh and S. C. Park. Service-oriented technology roadmap (SoTRM) using patent map for R&D strategy of service industry. Expert Systems with Applications, 36(3):6754–6772, 2009.

[95] T. Takaki, A. Fujii, and T. Ishikawa. Associative document retrieval by query subtopic analysis and its application to invalidity patent search. In Proceedings of the thirteenth ACM international conference on Information and knowledge management, pages 399–405. ACM, 2004.

[96] H. Takeuchi, N. Uramoto, and K. Takeda. Experiments on patent retrieval at NTCIR-5 workshop. In NTCIR-5, 2005.

[97] J. Tang, B. Wang, Y. Yang, P. Hu, Y. Zhao, X. Yan, B. Gao, M. Huang, P. Xu, W. Li, et al. PatentMiner: topic-driven patent analysis and mining. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1366–1374. ACM, 2012.

[98] W. Tannebaum and A. Rauber. Acquiring lexical knowledge from query logs for query expansion in patent searching. In Semantic Computing (ICSC), 2012 IEEE Sixth International Conference on, pages 336–338. IEEE, 2012.

[99] W. Tannebaum and A. Rauber. Analyzing query logs of USPTO examiners to identify useful query terms in patent documents for query expansion in patent searching: a preliminary study. Multidisciplinary Information Retrieval, pages 127–136, 2012.

[100] W. Tannebaum and A. Rauber. Mining query logs of USPTO patent examiners. Information Access Evaluation. Multilinguality, Multimodality, and Visualization, pages 136–142, 2013.

[101] D. Teodoro, J. Gobeill, E. Pasche, D. Vishnyakova, P. Ruch, and C. Lovis. Automatic prior art searching and patent encoding at CLEF-IP '10. In CLEF (Notebook Papers/LABs/Workshops), 2010.

[105] Y. Tseng et al. Text mining for patent map analysis. In Proceedings of IACIS Pacific 2005 Conference, pages 1109–1116, 2005.

[106] Y. Tseng, C. Lin, and Y. Lin. Text mining techniques for patent analysis. Information Processing & Management, 43(5):1216–1247, 2007.

[107] Y. Tseng, C. Tsai, and D. Juang. Invalidity search for USPTO patent documents using different patent surrogates. In Proceedings of NTCIR-6 Workshop, 2007.

[108] Y. Tseng and Y. Wu. A study of search tactics for patentability search: a case study on patent engineers. In Proceedings of the 1st ACM workshop on Patent information retrieval, pages 33–36. ACM, 2008.

[109] N. Van Zeebroeck. The puzzle of patent value indicators. Economics of Innovation and New Technology, 20(1):33–62, 2011.

[110] A. Vinokourov, J. Shawe-Taylor, and N. Cristianini. Inferring a semantic representation of text via cross-language correlation analysis. Advances in Neural Information Processing Systems, 15:1473–1480, 2003.

[111] I. Von Wartburg, T. Teichert, and K. Rost. Inventive progress measured by multi-stage patent citation analysis. Research Policy, 34(10):1591–1607, 2005.

[112] S. Vrochidis, A. Moumtzidou, G. Ypma, and I. Kompatsiaris. PatMedia: augmenting patent search with content-based image retrieval. Multidisciplinary Information Retrieval, pages 109–112, 2012.

[113] R. P. Wagner and G. Parchomovsky. Patent portfolios. U of Penn. Law School, Public Law Working Paper, 56:0416, 2005.

[114] J. Wang and D. W. Oard. Combining bidirectional translation and synonymy for cross-language information retrieval. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 202–209. ACM, 2006.

[115] L. Wanner, S. Brügmann, B. Diallo, M. Giereth, Y. Kompatsiaris, E. Pianta, G. Rao, P. Schoester, and V. Zervaki. PATExpert: Semantic processing of patent documentation. In SAMT (Posters and Demos), 2006.
Reference [61] designs a CF system based on support vector machines (SVM) by iteratively estimating all missing ratings using a heuristic;

Reference [29] develops a neural-network-based collaborative method: a user-based and an item-based method;

Reference [28] describes various methods used for the Netflix prize¹ and in particular the matrix factorization methods, which provided the best results.

One of the most efficient and most used model-based methods is matrix factorization [42; 59], in which users and items are represented in a low-dimensional latent factor space. The new representations of users (U) and items (I) are commonly computed by minimizing the regularized squared error [59]:

    \min_{U,I} \sum_{u,i} \Big[ \big(r_{u,i} - v_u^T v_i\big)^2 + \lambda_1 \|v_u\|^2 + \lambda_2 \|v_i\|^2 \Big]   (1)

The similarity between users a and u can be defined through many similarity measures, for example cosine, Pearson correlation coefficient (PCC) [1], or asymmetric cosine [3] similarities (equations (3), (4) & (5) below, respectively):

    Sim(a,u) = \cos\big[l(a), l(u)\big] = \frac{\sum_{i=1}^{C} r_{ai}\, r_{ui}}{\sqrt{\sum_{i=1}^{C} r_{ai}^2}\;\sqrt{\sum_{i=1}^{C} r_{ui}^2}}   (3)

    Sim(a,u) = \mathrm{PCC}\big[l(a), l(u)\big] = \frac{\sum_{i \in I(a) \cap I(u)} \big(r_{ai} - \bar{l}(a)\big)\big(r_{ui} - \bar{l}(u)\big)}{\sqrt{\sum_{i \in I(a) \cap I(u)} \big(r_{ai} - \bar{l}(a)\big)^2}\;\sqrt{\sum_{i \in I(a) \cap I(u)} \big(r_{ui} - \bar{l}(u)\big)^2}}   (4)
It should be noted that both the user and item similarity measures above take into account the actions on all items (resp. all users): this is why these measures are called collaborative.

3.2.2.2 Collaborative Filtering scores.
CF techniques produce, for an active user a, a list of recommended items ranked through a scoring function (or aggregation function), which takes into account either the users most similar to a (user-based CF) or the items most similar to those consumed by a (item-based CF).

Let us thus denote K(a) the neighborhood of a and V(i) the neighborhood of item i. These neighborhoods can be defined in many ways (for example, N-nearest neighbors, for some given N, or neighbors with similarity larger than a given threshold, using the user/item similarity).

The score functions are then defined for users and items as:

    Score(a,i) = \sum_{u \in K(a)} r_{ui}\, f\big(Sim(a,u)\big)
    Score(a,i) = \sum_{j \in V(i)} r_{aj}\, g\big(Sim(i,j)\big)   (9)

where various functions f and g can be used [1]:

For user-based CF: average rating (or popularity) of item i by the neighbors of a in K(a), weighted average rating, and normalized average rating of the nearest users weighted by similarity to a (from top to bottom in equation (10) below):

    Score(a,i) = \frac{1}{|K(a)|} \sum_{u \in K(a)} r_{ui}

    Score(a,i) = \frac{\sum_{u \in K(a)} r_{ui}\, Sim(a,u)}{\sum_{u \in K(a) \cap U(i)} |Sim(a,u)|}

    Score(a,i) = \bar{l}(a) + \frac{\sum_{u \in K(a) \cap U(i)} \big(r_{ui} - \bar{l}(u)\big)\, Sim(a,u)}{\sum_{u \in K(a) \cap U(i)} |Sim(a,u)|}   (10)

For item-based CF: average rating by a of the items neighbors of i in V(i), weighted average rating, normalized

Another mechanism has been developed [3] to produce locality instead of explicitly defining neighborhoods. Functions f and g are defined so as to put more emphasis on high similarities (with high q, q'):

    Score(a,i) = \sum_{u \in K(a)} r_{ui}\, Sim(a,u)^{q}
    Score(a,i) = \sum_{j \in V(i)} r_{aj}\, Sim(i,j)^{q'}   (12)

For q = 0, this is equivalent to the average rating, and for q = 1, this is similar to the weighted average rating.

We then rank items i by decreasing scores and retain the top k items (i_1^a, i_2^a, ..., i_k^a), which are recommended to a, such that:

    Score(a, i_1^a) \ge Score(a, i_2^a) \ge ... \ge Score(a, i_k^a)   (13)

3.2.2.3 Conclusion on CF techniques.
Notice that while memory-based techniques produce ranked lists of items, model-based techniques predict ratings, through a score which can also be used to rank recommendations. In practice, all CF systems suffer from several drawbacks:

New user/item: collaborative systems cannot make accurate recommendations to new users, since they have not rated a sufficient number of items to determine their preferences. The same problem arises for new items, which have not obtained enough ratings from users. This problem is known as the cold-start recommendation problem;

Scalability: memory-based systems generally have a scalability issue, because they need to calculate the similarity between all pairs of users (resp. items) to make recommendations;

Sparsity: the number of available ratings is usually extremely small compared to the total number of user-item pairs; as a result, the computed similarities between users and items are not stable (adding a few new ratings can dramatically change similarities), and so the predicted ratings are not stable either;

Information: memory-based and model-based techniques use very limited information, namely ratings/purchases only. They cannot use content about users or items, nor social relationships, even if these were available.
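As an illustration of the memory-based pipeline above (score unseen items by the similarity-weighted average of equation (10), then rank and truncate as in equation (13)), here is a minimal Python sketch; the data layout, the neighborhood rule, and all names are our own assumptions, not the survey's:

```python
def recommend(active, ratings, sim, k=3, n_neighbors=5):
    """ratings: dict user -> dict item -> rating; sim(a, u) -> similarity.
    Scores each unseen item by the similarity-weighted average of
    neighbor ratings (middle line of eq. (10)), then returns the
    top-k items by decreasing score (eq. (13))."""
    # K(a): the n_neighbors most similar other users
    neighbors = sorted((u for u in ratings if u != active),
                       key=lambda u: sim(active, u), reverse=True)[:n_neighbors]
    seen = set(ratings[active])
    candidates = {i for u in neighbors for i in ratings[u]} - seen
    scores = {}
    for i in candidates:
        raters = [u for u in neighbors if i in ratings[u]]  # K(a) ∩ U(i)
        norm = sum(abs(sim(active, u)) for u in raters)
        if norm > 0:
            scores[i] = sum(ratings[u][i] * sim(active, u) for u in raters) / norm
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

With any similarity function plugged in (for instance, a count of co-rated items as a crude stand-in), `recommend('a', ratings, sim, k=2)` returns the two highest-scoring items the active user has not yet consumed.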
3.3 Social Recommender Systems
Traditional RSs, and in particular model-based systems, rely on the (often implicit) assumption that users are independent and identically distributed (i.i.d.). The same holds for items. However, this is not the case on social networks, where users enjoy rich relationships with other members of the network. It has long been observed in sociology [40] that friends on such networks have similar tastes (homophily). It is thus natural that new techniques [65] extended previous RSs by making use of social network structures. However, it was realized that the type of interaction taken into account could have a dramatic impact on the quality of the obtained social recommender [65]. In this section, we review three families of social recommenders: one based on explicit social links, one based on trust, and an emerging family based on implicit links.

3.3.1 Social Recommender Systems based on explicit social links
In this section, we assume that users are connected through explicit relationships such as friend, follower, etc. Unsurprisingly, with the recent thriving of online social networks, it has been found that users prefer recommendations made by their friends to those provided by online RSs, which use anonymous people similar to them [53]. Most social RSs are based on CF methods: social collaborative recommenders, like traditional CF systems, can be divided into two families: memory-based and model-based systems.

3.3.1.1 Memory-based Social Recommenders.
Memory-based methods in social recommendation are similar to those in CF (presented in section 3.2.2), the only difference being the use of explicit social relationships for computing similarities.

In [64; 65] the authors present their social-network-based CF system (SNCF), a modified version of traditional user-based CF, and test it on Essembly.com², which provides two sorts of links: friends and allies.

In [64], they use a graph-theoretic approach to compute user similarity as the minimal distance between two nodes (using Dijkstra's algorithm, for instance), instead of using the rating patterns as in traditional CF; it is assumed that the influence decays exponentially as distance increases. They show that this method produces results worse than traditional CF;

In [65], the user's neighborhood is simply its set of friends in the network (first circle). This approach provides results slightly worse than the best CF, but the computation load is much reduced: from computing the similarity of all pairs

In [24], the authors observe on a dataset from Yelp³ that friends tend to give restaurant ratings (slightly) more similar than non-friends. However, immediate friends tend to differ in ratings by 0.88 (out of 5), which is rather similar to the results in [65]. Their experimental setup compares their model-based (probabilistic) algorithm, a Friends Average approach (which only averages the ratings of the immediate friends), a Weighted Friends approach (more weight is given to friends which are more similar according to cosine similarity), a Naive Bayes approach, and a traditional CF method. All methods which use the influences from friends achieve better results than CF in terms of prediction accuracy.

Reference [12] presents SaND (Social Network and Discovery), a social recommendation system; SaND is an aggregation tool for information discovery and analysis over the social data gathered from IBM Lotus Connections applications. For a given query, the proposed system combines the active user's score, scores from his connections, and scores between terms and the query;

Reference [51] proposes two social recommendation models: the first one is based on social contagion while the second is based on social influence. The authors define the social contagion model as a model to simulate how an opinion on certain items spreads through the social network;

Reference [19] proposes a group recommendation system in which recommendations are made based on the strength of the social relationships in the active group. This strength is computed using the strengths of the social relationship between pairwise social links (scaled from 1 to 5 and based on daily contact frequency).

3.3.1.2 Model-based Social Recommenders.
Model-based methods in social recommenders represent users and items as vectors in a latent space (as described in section 3.2.1), making sure that users' latent vectors are close to those of their friends.

Reference [4] combined matrix factorization and friendship links to make recommendations: the recommendation score for the active user is the sum of the scores of his friends.

Reference [38] proposes algorithms which yield better results than non-negative matrix factorization [31], probabilistic matrix factorization [42], and a trust-aware recommendation method [37]. It presents two social RSs:

A matrix factorization making sure that the latent vector of a given user is close to the weighted average of the latent vectors of his friends;

² http://www.essembly.com
³ http://www.yelp.com
3.3.4 Conclusion
Social RSs are still relatively new. There is a lot of active
research in this area and it should be expected that new re-
sults will extend the field of traditional systems to incorpo-
rate social information of all sorts. In particular, the field of
social recommenders built on implicit social networks seems
particularly promising and we will now dig deeper in this
direction to produce our Social Filtering formalism.
4. SOCIAL FILTERING
Our Social Filtering formalism (SF) is based upon a
bipartite graph and its projections (see [22; 66] for a discus-
sion of bipartite graphs). A bipartite graph is defined over
a set of nodes separated into two non-overlapping subsets:
for example, users and items, items and their features, etc.
A link can only be established between nodes in different
sets: a link connects a user to the items she has consumed.
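A toy sketch of such a user-item structure, stored as adjacency sets, together with the user-side co-consumption projection discussed in this section (the function names and the K threshold default are our own illustrative choices):

```python
from itertools import combinations

def project_users(consumed, K=1):
    """consumed: dict user -> set of consumed items (the bipartite graph).
    Links two users iff they share at least K consumed items; the link
    weight is the number of shared items (i.e., common bipartite neighbors)."""
    edges = {}
    for a, u in combinations(sorted(consumed), 2):
        shared = len(consumed[a] & consumed[u])
        if shared >= K:
            edges[(a, u)] = shared
    return edges
```

The item-side projection is symmetric: invert the dictionary to map each item to the set of users who consumed it, and apply the same function with a threshold K'.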
The bipartite network is then projected into two (unipartite) networks, one for each set of nodes: a Users and an Items network. In the projection (see Figure 1), two nodes are connected if they had common neighbors in the bipartite graph. The link weight can be used to indicate the number of shared neighbors. For example, two users are linked if they have consumed at least one item in common (we usually impose a more stringent condition: at least K items). The projected networks can thus be viewed as the network of users consuming at least K same items (users having the same preferences) and the network of items consumed by at least K′ same users (items liked by the same people).

[Figure 1: Bipartite graph and projections]

Projected networks can then be used to define neighborhoods [55] or recommendation algorithms which perform better than conventional CF on the MovieLens dataset [66]. This generic formalism extends these early contributions: we are able to reproduce results from various classical approaches, and we also provide new approaches, allowing more flexibility and potential for improved performances, depending on the dataset.

In the SF formalism, as in traditional CF, we build recommendations by defining neighborhoods and scoring functions.

4.1 Similarity

4.1.1 Support-based similarity

In the case of implicit feedback (binary interaction matrix R), in essence, the link between two users a and u (resp. items i and j) represents an association rule a → u (resp. i → j), with the link weight proportional to the rule support, where the support of rule a → u (resp. i → j) is defined as:

Supp(a → u) = (# Items cons. by a and u) / (# Items) = (1/C) Σ_{i=1}^{C} r_{ai} r_{ui}

Supp(i → j) = (# Users who cons. i and j) / (# Users) = (1/L) Σ_{u=1}^{L} r_{ui} r_{uj}

In the case of a non-binary matrix R (ratings), support is similarly defined. Hence, support is defined in general as:

Supp(a → u) = (1/C) Σ_{i=1}^{C} r_{ai} r_{ui}
Supp(i → j) = (1/L) Σ_{u=1}^{L} r_{ui} r_{uj}    (14)

Support is similar to cosine similarity (equations (3) and (6)), so that we can use support as a similarity measure. We will define the support-based similarity of users (resp. items) as:

Sim(a, u) = Supp(a → u)
Sim(i, j) = Supp(i → j)    (15)

4.1.2 Confidence-based similarity

In the case of implicit feedback, the confidence of link a → u
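The bipartite projections and the support-based similarity of equation (14) can be sketched as follows (a minimal numpy sketch; the toy matrix and variable names are ours):

```python
import numpy as np

# Toy binary interaction matrix R: L = 4 users x C = 5 items.
# R[u, i] = 1 iff user u consumed item i.
R = np.array([
    [1, 1, 0, 0, 1],
    [1, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [1, 0, 0, 0, 1],
])
L, C = R.shape

# Projections of the bipartite graph: entry (a, u) of R @ R.T counts
# the items consumed by both a and u; entry (i, j) of R.T @ R counts
# the users who consumed both i and j.
users_proj = R @ R.T     # L x L co-consumption counts
items_proj = R.T @ R     # C x C co-consumption counts

# Support-based similarity, equation (14):
#   Supp(a -> u) = (1/C) * sum_i r_ai * r_ui
#   Supp(i -> j) = (1/L) * sum_u r_ui * r_uj
supp_users = users_proj / C
supp_items = items_proj / L

# Two users are linked in the Users graph if they share at least K items.
K = 2
users_graph = (users_proj >= K).astype(int)
np.fill_diagonal(users_graph, 0)   # no self-links

print(supp_users[0, 1])   # users 0 and 1 share items 0 and 1 -> 2/5 = 0.4
print(users_graph[0, 1])  # 1: linked, since 2 shared items >= K
```

The link weight (the co-consumption count) and the support differ only by the 1/C or 1/L normalization, which is why support can serve directly as a similarity.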
4.1.4 Jaccard Index-based similarity

The Jaccard Index [36] measures the similarity of lists by counting how many elements they have in common. The Jaccard Index of users a and u (resp. items i and j) is defined (in the binary case) as:

Jaccard(a, u) = Card[l(a) ∩ l(u)] / Card[l(a) ∪ l(u)]

Jaccard(i, j) = Card[c(i) ∩ c(j)] / Card[c(i) ∪ c(j)]

For users, this expands to:

Jaccard(a, u) = (# Items cons. by a and u) / ((# Items cons. by a) + (# Items cons. by u) − (# Items cons. by a and u))

As above, the Jaccard index will be similarly defined if the matrix R is not binary. Hence, the Jaccard index for users / items is

These last two cases are completely novel ways to define neighborhoods: they exploit the homophily expected in social networks (even implicit ones, as here), where users with the same behavior tend to connect and be part of the same community. In CF, users in K(a) (with cosine similarity) are such that they have consumed at least one item in common (otherwise their cosine would be 0); in the SF setting such users would be linked in the Users graph (K = 1). In the above community-based definitions of K(a), users in K(a) might not be directly connected to the active user a. These definitions thus embody some notion of paths linking users, through common usage patterns.

For user-based SF: average rating (or popularity) of item i by neighbors of a in K(a), weighted average rating, normalized average rating of nearest users weighted by similarity to a, as in equation (10) above.
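The Jaccard index defined in Section 4.1.4 can be sketched on a toy binary matrix as follows (names and data are ours):

```python
import numpy as np

# Toy binary interaction matrix: rows are users, columns are items.
R = np.array([
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
])

def jaccard_users(R, a, u):
    """Jaccard(a, u) = Card[l(a) & l(u)] / Card[l(a) | l(u)],
    where l(a) is the list of items consumed by user a."""
    inter = int(np.sum(R[a] * R[u]))        # items consumed by both
    union = int(np.sum((R[a] + R[u]) > 0))  # items consumed by either
    return inter / union if union else 0.0

# Equivalent closed form from the expanded expression:
# inter / (#items of a + #items of u - inter)
a, u = 0, 1
inter = int(np.sum(R[a] * R[u]))
closed = inter / (int(R[a].sum()) + int(R[u].sum()) - inter)

print(jaccard_users(R, 0, 1))  # 2 common items out of 4 in the union -> 0.5
print(closed)                  # same value: 0.5
```

The item-based version is identical with columns in place of rows.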
Association rules [9] of length 2 can be used for recommendation [47]. They are obtained from the item-based SF formalism with K = K′ = 1, asymmetric confidence-based similarity with α = 0, N = 1 (V(i) is reduced to the first nearest neighbor) and the local scoring function (equation (12)) with q′ = 1.

CF is obtained with K = K′ = 1, asymmetric confidence with α = 0.5 (cosine similarity) and the usual CF score functions (identical to those used by SF).

CF with locality [3] is obtained with K = K′ = 1, asymmetric confidence and the score function as in equation (12).

Social RS: if an explicit social network is available, such as a friendship network for example, then one can use that network as the Users graph and proceed as in the SF framework. In [65], the authors use the first circle as neighborhood and show that their results are slightly worse than conventional CF, but at a much reduced computational cost.

Content-based RS: instead of building a bipartite Users x Items graph, one could use the exact same methodology and build a bipartite Users x Users-attributes or Items x Items-attributes graph and recommend items liked by similar users or items similar to those consumed by the user, with a similarity measure based on the projected graphs.

As can be seen above, the SF formalism generalizes various well-established recommendation techniques. However, it offers new possibilities as well: the Social Filtering formalism thus extends content-based, association-rule and CF approaches, with new similarity measures and new ways to define neighborhoods.

By only building once the bipartite graph and the projected unipartite graphs, we have at our disposal, in a unique framework, a full set of similarity measures, neighborhood

AUC: Area Under the Curve [52].

RMSE (Root Mean Square Error) and MAE (Mean Absolute Error), defined as:

MAE = (1/I) Σ_{i,j} | r̂_{ij} − r_{ij} |

RMSE = sqrt( (1/I) Σ_{i,j} (r̂_{ij} − r_{ij})² )    (21)

where I is the total number of tested ratings.

Ranking case: in that case, we want to evaluate whether recommended items were adequate for the user; for example, whether recommended items were later consumed. We thus have a Target set for each user, which represents the set of items he consumed after being recommended. This can be implemented by splitting the available dataset into Training / Testing subsets (taking into account time stamps if available). In this case, metrics are those classically used in information retrieval:

Recall@k and Precision@k are defined as:

Recall@k = (1/L) Σ_a Card(R_a ∩ T_a) / Card(T_a)

Precision@k = (1/L) Σ_a Card(R_a ∩ T_a) / k    (22)

where R_a = (i_{a1}, i_{a2}, ..., i_{ak}) is the set of k items recommended to a, and T_a is the target set for a. We can also plot Recall@k as a function of the number of recommended items and compute the AUC for that curve.

F-measure: F_β is designed to take into account both recall and precision; F_1 is the most commonly used. F_β is defined as:

F_β@k = ((1 + β²) · prec@k · recall@k) / (β² · prec@k + recall@k)    (23)
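The evaluation metrics of equations (21)-(23) can be sketched as follows (a minimal illustration with our own helper names and toy data; `pred` stands for the predicted ratings r̂):

```python
import math

def mae(pred, true):
    """Mean Absolute Error over tested ratings (equation (21))."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

def rmse(pred, true):
    """Root Mean Square Error over tested ratings (equation (21))."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))

def precision_recall_at_k(recommended, target, k):
    """Precision@k and Recall@k for one user (equation (22));
    averaging over the L users gives the global metrics."""
    hits = len(set(recommended[:k]) & set(target))
    return hits / k, hits / len(target)

def f_beta(prec, rec, beta=1.0):
    """F_beta combining precision and recall (equation (23))."""
    if prec + rec == 0:
        return 0.0
    return (1 + beta**2) * prec * rec / (beta**2 * prec + rec)

pred, true = [4.0, 3.5, 2.0], [5.0, 3.0, 2.0]
print(mae(pred, true))                       # (1 + 0.5 + 0) / 3 = 0.5
prec, rec = precision_recall_at_k([7, 3, 9], [3, 9, 11, 20], k=3)
print(prec, rec)                             # 2/3 of the list hit, 2/4 of the targets found
print(f_beta(prec, rec))                     # F1: harmonic mean of the two
```

For the rating case one reports MAE/RMSE; for the ranking case one averages Precision@k, Recall@k and F_β@k over the test users.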
Our experiments will be presented in two sets:

First, we show that our SF formalism produces results identical to those found by conventional techniques, demonstrating that this formalism can be particularized into some of the existing techniques. Since results reported in the literature use various evaluation metrics and settings (training/test splits for example), we have defined homogeneous settings and reproduced classical methods to compare them to our formalism;

Then, we show cases where our SF formalism generates new RSs.

In the experiments reported below, we have used various settings which are not always explicit in the literature:

Binarization of inputs

Counts: we split the interval of counts into 10 bins with the same number of elements. We then replace each count in matrix R by the bin index;

Ratings: we transform ratings into values from 1 to 10 (for example, a rating of 1 to 5 stars is multiplied by 2);

Then we transform values r ∈ [0, 10] into a binary value. We have used the same setting as [3]: 10 is coded as 1, all others as 0. Then all users, resp. items, with their entire line, resp. column, at 0 are eliminated, which is a very drastic reduction of the number of users and items.

Data split: to evaluate performances, we split the data into two sets, one used for training, the other for testing. We implemented the same technique as in [3] for the MSD challenge: take all users, randomize and split: 90% of the users for training and 10% kept for testing.

For training, we use all transactions of the 90% users.

We test on the remaining transactions of the test users: we input 50% of the transactions of each test user, and compare the obtained recommendation list to the remaining 50%.

Choices of similarity measure, neighborhood and scoring function. These will be varied, producing the various RSs we want to implement.

For reference, to compare performances obtained in our experiments, we have implemented three classical techniques:

NMF (non-negative matrix factorization): we used the code associated with [28] (https://github.com/kimjingu/nonnegfac-python), with maximal rank 100 and maximum number of iterations 10,000, and did not try to further optimize the settings.

Performances shown in Table 3 provide a baseline: figures in bold indicate the best performance of the corresponding category. We ran our simulations on an Intel Xeon E7-4850 2.00 GHz (10 cores, 512 GB RAM), shared with members of the team (so concurrent usage might have happened in some of the experiments, with impact on reported times). Computing time in hours is thus indicative only (0:01:00 is 1 min, 4:20:00 is 4 hours 20 min). Our formalism was implemented using state-of-the-art libraries, such as Eigen (http://Eigen.tuxfamily.org) for computing similarities.

As can be seen, bigrams are very efficient in terms of performance on all four datasets (see also [47]) and they require no parameter tuning (except the support and confidence thresholds). But bigrams have low Users and Items coverage, with all recommended items in the Head.

In contrast, NMF is very sensitive to parameters (maximal rank and maximum number of iterations) and since we did not try to optimize these parameters, we obtain low performances here. In addition, NMF does not scale well with increasing dataset size. On the other hand, NMF has the best Users and Items coverage (note that after 4 days of computing time, we stopped NMF on MSD). These two techniques illustrate the trade-off one has to make in practice: fine-tuned parameters vs. default parameters to obtain optimal performances, and performances vs. coverage. Finally, scalability is indeed a critical feature.

7.1 SFs reproduce classical RSs

We have implemented CF with the code provided by the author of [3] (http://www.math.uni.pd.it/~aiolli/CODE/MSD), in 2 versions:

Classical CF [1] with cosine similarity and average score. Results are shown in Table 4, lines CF IB (item-based) and CF UB (user-based);

CF with locality [3]: we exactly reproduced the results shown in [3] for k = 500 recommendations as in the MSD challenge (we do not show these results here). We show results in Table 4 for k = 10 recommendations (columns CF IB Aiolli) for values q = 5 and α = 0. These results are coherent with those presented in [3].

8 https://bitbucket.org/danielbernardes/socialfiltering
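The binarization and data-split settings described in this section can be sketched as follows (our own toy count matrix; reading "10 bins with the same number of elements" as equal-frequency quantile bins, and coding only the top bin (value 10) as 1, following the setting quoted from [3]):

```python
import numpy as np

# Toy play-count matrix: 6 users x 4 items (0 = no interaction).
counts = np.array([
    [20,  0, 18,  2],
    [ 0, 19,  1,  0],
    [17,  3,  0, 21],
    [ 2,  0,  4,  0],
    [ 0, 22, 16,  5],
    [ 6,  1,  0,  0],
])

# Counts: split the observed (nonzero) counts into 10 equal-frequency
# bins and replace each count by its bin index (1..10).
nonzero = np.sort(counts[counts > 0])
edges = np.quantile(nonzero, np.linspace(0, 1, 11)[1:-1])
binned = np.where(counts > 0, np.digitize(counts, edges) + 1, 0)

# Binarize as in [3]: 10 is coded as 1, all others as 0.
R = (binned == 10).astype(int)

# Eliminate users (rows) and items (columns) whose entire line / column
# is 0 -- a very drastic reduction of the number of users and items.
R = R[R.sum(axis=1) > 0]
R = R[:, R.sum(axis=0) > 0]

# Data split as for the MSD challenge: randomize users, 90% / 10%.
rng = np.random.default_rng(42)
perm = rng.permutation(R.shape[0])
cut = max(1, int(0.9 * R.shape[0]))
train_users, test_users = perm[:cut], perm[cut:]
print(R.shape)  # (2, 2) for this toy matrix: a drastic reduction indeed
```

On this toy matrix only the two largest counts survive binarization, which illustrates how aggressive the "top bin only" coding is.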
Note that in Table 4, the neighborhood chosen simply consists of all users / items with non-null similarity (the case with q = 5 restores some locality). We found that limiting to the Top N neighbors (we just tested N = 100) usually resulted in decreased performances and coverage. Note that, with the dataset sizes we use, a relative difference of 1% is significant.

Comparing Table 3 and Table 4 shows that bigrams seem to perform best (except for MovieLens), but their Users coverage is poor (because the filter on support and confidence limits the number of rules: the threshold was not optimized). These results show that our formalism covers these various classical techniques: association rules, CF (item- or user-based) and CF with locality as in [3].

Table 4: Performances of Collaborative Filtering. CF and ACF are respectively the abbreviations of Collaborative Filtering and Asymmetric Cosine Collaborative Filtering.
cases results in a lower MAP, and generally degrades all performance metrics.

User-based Social Filtering:

Results for SF UB in Table 5 vary depending on the dataset: sometimes the best SF UB (usually with asymmetric confidence) is better than CF UB or bigrams; sometimes it is worse. For MSD, SF UB has poorer MAP but better precision and recall than the best CF IB or bigram.

Social RS: we implemented our SF formalism using the explicit networks (Flixster and Lastfm) for the Users graph; the neighborhood is the 1st Circle of friends, Communities (computed with Louvain's algorithm [7]) or Local Communities (computed with our algorithm in [43]), and the score function is the average. Results in Table 7 show, as usual, various situations: on Lastfm the 1st Circle has the best performances, but Community has better Users coverage and Local Community better Items coverage. For Flixster, Community is best, but the 1st Circle has better Items coverage and Local Community better Tail coverage.

7.3 Ensemble methods

As was demonstrated in many cases, and very notably in the Netflix challenge [6; 30], ensembles of RSs often get improved performances. We have thus implemented a few ensembles using a basic combination method, originating from [13] and used in [3]. To combine N recommenders, we assemble, for each user a, the N lists of k recommended items produced by the N RSs.

Let us denote (i_1^n, ..., i_k^n) the list of items recommended to a (omitted in the notation) by RS n (with n = 1, ..., N). Some of the lists might have fewer than k elements. For each list, we assign points to the items as follows: the 1st item gets k points, the 2nd item k − 1, etc. The lists are then fused and each item gets the sum of its points in the various lists, with ties resolved through a priority on the RSs (the winner comes from the highest-priority list).

In order to explore the potential of ensemble techniques, we tried combinations of pairs of RSs: we tested combinations of some of the best SF-based recommenders mentioned above (Tables 4, 5 and 6). Results shown in Table 8 are encouraging:

For Lastfm, the best two-system ensemble is CF-IB with cosine similarity and weighted average score, combined with SF-IB with asymmetric confidence similarity and local score (q = 1, α = 0), which gets MAP@10 = 0.16. This performance is not better than the one obtained with a unique recommender.

For Flixster, we obtain an improvement: combining SF-UB with asymmetric confidence similarity and local score (q = 1, α = 0) and SF-UB with asymmetric confidence similarity and local score (q = 5, α = 0) leads to a MAP@10 of 0.175, while the best single performance was SF-UB with asymmetric confidence similarity and local score (q = 5, α = 0) at MAP@10 = 0.157.

These preliminary results indeed confirm the potential of ensemble methods. They could certainly be enhanced by testing combinations of more systems, optimizing the parameters and implementing more adequate aggregation methods (Borda's aggregation method [13] seems rather popular in the literature, but can certainly be improved upon to merge recommendation results).

8. CONCLUSION

We have presented a simple and generic formalism based upon social network analysis techniques: by building once the projected Users and Items networks, the formalism allows reproducing a wide range of RSs found in the literature while also producing new RSs. This unique formalism thus provides very efficient ways to test the performances of many different RSs on the dataset at hand to select the most adequate in that case.
Table 5: Performances of Social Filtering (implicit) with 1st circle; Asym. Cos is the abbreviation for Asymmetric Cosine.
As can be seen from our experiments, there is no unique silver bullet (so far!): for each dataset, one has to test and try a full repertoire of candidate RSs, fine-tuning hyper-parameters (a topic we did not address in this paper) and selecting the best RS for the performance indicator he/she cares for. The richer the repertoire, the more chances for the final RS to get the best performance. This theoretical formalism thus provides a very powerful way to formalize and compare many RSs.

In addition, this integrated formalism enables the production of modular code, uncoupling the similarity and scoring-function computation steps. It also allows for an elegant implementation of the recommendation engines, as demonstrated in our published open source code (https://bitbucket.org/danielbernardes/socialfiltering).

Computing the bipartite network and its projections requires significant computing resources. It can be considered as a set-up step for our recommender framework. This overhead is only worth it if one wants to produce an ensemble of RSs originating from various choices of parameters (similarity measures, neighborhoods and scoring functions) to make comparisons and select the best choice for the dataset at hand. In addition, the computation of similarities is the bottleneck (since obviously it involves all pairs of users or items). However, some similarity measures (asymmetric confidence or Jaccard) are more costly than others (support, confidence and cosine), as the figures about computing time show in the various tables.

This work opens ways for future research in several directions:

We have introduced various choices of similarity measures, neighborhoods and scoring functions. Obviously, more choices can be designed and evaluated;

Since it is easy to produce many RSs within the same framework, we could produce collections or ensembles of RSs working together to complement each other's weaknesses. More ways of combining recommended lists will be needed to improve upon the Borda mechanism discussed here. Inspiration from the literature on ensembles of rating-based RSs could certainly be useful;

Since implicit and explicit social networks can be set into the same framework, further investigation is required on how to integrate implicit and explicit networks, thus producing hybrid social recommenders;

Recent work (a personal communication of one of the authors of [58]) allowing to merge user-based and item-based CF seems promising, and could certainly be framed into our formalism;

The first phase in our formalism consists in projecting the Users and Items networks. This step is computationally heavy. Hence, at the present time we cannot process extremely large datasets, such as for example in the MSD challenge. We thus intend to optimize our implementation of the social network projection. More generally, our code could be ported to distributed Hadoop-based environments to allow processing of larger datasets and parallel testing of the various hyper-parameter choices.

9. ACKNOWLEDGEMENTS
[4] J. Aranda, I. Givoni, J. Handcock, and D. Tarlow. An online social network-based recommendation system. Toronto, Ontario, Canada, 2007.

[5] P. Avesani, P. Massa, and R. Tiella. A trust-enhanced recommender system application: Moleskiing. In Proceedings of the 2005 ACM Symposium on Applied Computing, pages 1589-1593. ACM, 2005.

[6] J. Bennett and S. Lanning. The Netflix prize. In Proceedings of KDD Cup and Workshop, volume 2007, page 35, 2007.

[7] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.

[8] J. Bobadilla, F. Ortega, A. Hernando, and A. Gutierrez. Recommender systems survey. Knowledge-Based Systems, 46:109-132, 2013.

[9] C. Borgelt. Frequent item set mining. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2(6):437-456, 2012.

[10] D. M. Boyd and N. B. Ellison. Social network sites: Definition, history, and scholarship. Journal of Computer-Mediated Communication, 13(1):210-230, 2008.

[11] J. S. Breese, D. Heckerman, and C. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, pages 43-52. Morgan Kaufmann Publishers Inc., 1998.

[12] D. Carmel, N. Zwerdling, I. Guy, S. Ofek-Koifman, N. Har'El, I. Ronen, E. Uziel, S. Yogev, and S. Chernov. Personalized social search based on the user's social network. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, pages 1227-1236. ACM, 2009.

[13] J.-C. de Borda. On elections by ballot. Classics of Social Choice, eds. I. McLean, A. B. Urken, and F. Hewitt, pages 83-89, 1995.

[18] M. D. Ekstrand, J. T. Riedl, and J. A. Konstan. Collaborative filtering recommender systems. Foundations and Trends in Human-Computer Interaction, 4(2):81-173, 2011.

[19] M. Gartrell, X. Xing, Q. Lv, A. Beach, R. Han, S. Mishra, and K. Seada. Enhancing group recommendation by incorporating social relationship interactions. In Proceedings of the 16th ACM International Conference on Supporting Group Work, pages 97-106. ACM, 2010.

[20] J. A. Golbeck. Computing and applying trust in web-based social networks. 2005.

[21] R. Guha, R. Kumar, P. Raghavan, and A. Tomkins. Propagation of trust and distrust. In Proceedings of the 13th International Conference on World Wide Web, pages 403-412. ACM, 2004.

[22] R. Guimerà, M. Sales-Pardo, and L. A. N. Amaral. Module identification in bipartite and directed networks. Physical Review E, 76(3):036102, 2007.

[23] M. Gupte and T. Eliassi-Rad. Measuring tie strength in implicit social networks. In Proceedings of the 3rd Annual ACM Web Science Conference, pages 109-118. ACM, 2012.

[24] J. He and W. W. Chu. A social network-based recommender system (SNRS). Springer, 2010.

[25] Y. Hu, Y. Koren, and C. Volinsky. Collaborative filtering for implicit feedback datasets. In Data Mining, 2008. ICDM '08. Eighth IEEE International Conference on, pages 263-272. IEEE, 2008.

[26] M. Jamali and M. Ester. A matrix factorization technique with trust propagation for recommendation in social networks. In Proceedings of the Fourth ACM Conference on Recommender Systems, pages 135-142. ACM, 2010.

[27] D. Jannach, M. Zanker, A. Felfernig, and G. Friedrich. Recommender Systems: An Introduction. Cambridge University Press, 2010.

[61] Z. Xia, Y. Dong, and G. Xing. Support vector machines for collaborative filtering. In Proceedings of the 44th

[67] C.-N. Ziegler and G. Lausen. Propagation models for trust and distrust in social networks. Information Systems Frontiers, 7(4-5):337-358, 2005.
ABSTRACT

The realm of knowledge discovery extends across several allied spheres today. It encompasses database management areas such as data warehousing and schema versioning; information retrieval areas such as Web semantics and topic detection; and core data mining areas, e.g., knowledge-based systems, uncertainty management, and time-series mining. This becomes particularly evident in the topics that Ph.D. students choose for their dissertations. As the grass roots of research, Ph.D. dissertations point out new avenues of research, and provide fresh viewpoints on combinations of known fields. In this article we overview some recently proposed developments in the domain of knowledge discovery and its related spheres. Our article is based on the topics presented at the doctoral workshop of the ACM Conference on Information and Knowledge Management, CIKM 2011.

Keywords

Ranking, Text Mining, Extreme Web, ETL, Pattern Recognition, Resource Monitoring, Version Control, KNN, Semantic Web, Main Memory Database, Data Warehousing, Database Analytics

1. INTRODUCTION

Knowledge discovery is an interdisciplinary field of research, which encompasses diverse areas such as data mining, database management, information retrieval, and information extraction. This inspires doctoral candidates to pursue research in and across these related disciplines, with the core contributions of their dissertations being in one or more of these areas. In this article, we review some of the directions that the researchers of tomorrow pursue. We provide a report of the research challenges addressed by students in the Ph.D. workshop PIKM 2011. This workshop was held at the ACM Conference on Information and Knowledge Management, CIKM 2011. The CIKM conference and the attached workshop encompass the tracks of data mining, databases and information retrieval, thus providing an excellent venue for dissertation proposals and early doctoral work in and across different spheres of knowledge discovery. This workshop was the fourth of its kind after three successful PIKM workshops in 2007 [15, 16], 2008 [11, 14] and 2010 [9, 10]. PIKM 2011 [8] attracted submissions from several countries around the globe. After a review by a PC comprising 19 experts from academia and industry worldwide, 9 full papers were selected for oral presentation and 4 short papers for poster presentation. The program was divided into 4 sessions: data mining and knowledge management; databases; information retrieval; and a poster session with short papers in all tracks.

The first highlight of PIKM 2011 was a keynote talk on Extreme Web Data Integration by Prof. Dr. Felix Naumann from the Hasso Plattner Institute in Potsdam, Germany [8]. This talk addressed the integration and querying of data from the Semantic Web at large scale, i.e., from vast sources such as DBpedia, Freebase, public-domain government data, scientific data, and media data such as books and albums. It discussed the challenges related to the heterogeneity of Web data (even inside the Semantic Web), common ontology development, and multiple record linkage. It also highlighted the problems of Web data integration in general, such as identification of good-quality sources, structured data creation, standardization-related cleaning, entity matching, and data fusion.

We now furnish a review comprising a summary and critique of the dissertation proposals presented at this workshop, discussing new directions of research in data mining and related areas. We follow the thematic structure of the workshop with 3 topic areas: knowledge discovery, database research, and information retrieval. The knowledge discovery issues surveyed in this article include areas as diverse as pattern recognition in time-series, resource monitoring with knowledge-based models, version control under uncertainty, and random walk k-nearest-neighbors (k-NN) for classification. The database research problems presented here entail aggregation for in-memory databases, evolving extract-transform-load (E-ETL) frameworks, schema and data versioning, and automatic regulatory compliance support. The information retrieval themes involve paradigms such as user interaction with polyrepresentation, ranking with entity relationship (ER) graphs, online conversation mining, sub-topical document structure, and cost optimization in test collections.

The workshop also issues a best paper award to the most exciting dissertation proposal, as determined by the PC of the workshop. This year's award went to the proposal "Ranking Objects by Following Paths in Entity-Relationship Graphs" [6], in the Information Retrieval track.

The rest of this article is organized as follows. Sections 2, 3, and 4 discuss the different tracks of PIKM, i.e., knowledge discovery, database research, and information retrieval, respectively. In Section 5, we summarize the hot topics of current research, and compare them with the topics of the previous PIKM workshops.

2. KNOWLEDGE DISCOVERY

The topics surveyed here are those with main contributions in data mining and knowledge discovery, although some of them overlap with the other two thematic tracks, namely, databases and information retrieval.

2.1 Pattern Recognition in Evolving Data

Many devices today, such as mobile phones, modern vehicular equipment and smart home monitors, contain integrated sensors.