
CSE 5522 Project Proposal
Supervised Learning Approaches for Author Name Disambiguation

Aishwarya Varadarajan, Cynthea Godwin, Mitesh Kanjariya

Motivation

We observe two types of name ambiguity in bibliographic databases: an author may publish under multiple names, or the same publication name may be shared by several authors. Such ambiguity degrades the correctness and performance of information retrieval and academic search engines, and may lead to improper attribution of credit. Although simple to state, name disambiguation is a fundamental problem in bibliometric analysis.

Problem Statement

In this work we will explore supervised learning methods to disambiguate authors in a bibliographic database. We will use the naive Bayes probability model, Support Vector Machines (SVM), and a decision tree / Bayes net classifier to address the author name disambiguation problem. The naive Bayes model and the SVM will use four types of citation attributes: co-author names, the title of the paper, the title of the journal or conference in which the paper was published, and the keywords. We will evaluate this approach on a dataset obtained from Scopus / Research in View.

Related Work

Name disambiguation methods proposed in the literature span a wide range of solutions, including manual methods, unsupervised techniques, and supervised techniques. Han et al. [1] proposed the use of two supervised learning algorithms, naive Bayes and SVM, to address name disambiguation in citations. We will use this as our base paper and follow a similar approach. Several unsupervised algorithms have also been proposed [2] [3], and a density-based clustering method was adopted in [4]. We will restrict our work to supervised algorithms and leave the implementation of unsupervised approaches as a future study.

Overview of Algorithms Used

Naive Bayes: We assume that each author's citation data is generated by a naive Bayes model. The past publications of each author will be used to estimate the parameters of the model. Based on the estimated parameters, we will use Bayes' rule to calculate the probability that a particular candidate author is the actual author of a given publication.
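As a rough illustration of the naive Bayes step described above, the following Python sketch trains on past publications and picks the most probable author for a new one. It is a simplified sketch, not the model of [1]: it pools all four attribute types into one bag of prefixed tokens and uses add-one (Laplace) smoothing, both of which are assumptions made here. The function names, the token prefixes (`co:`, `title:`, `venue:`), and the toy records are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(records):
    """Estimate model parameters from (author_id, tokens) pairs."""
    doc_counts = Counter()               # number of publications per author
    token_counts = defaultdict(Counter)  # token frequencies per author
    vocab = set()
    for author, tokens in records:
        doc_counts[author] += 1
        token_counts[author].update(tokens)
        vocab.update(tokens)
    return doc_counts, token_counts, vocab


def predict(model, tokens):
    """Return the author with the highest posterior log-probability."""
    doc_counts, token_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_author, best_score = None, float("-inf")
    for author in doc_counts:
        # Log prior: share of training publications written by this author.
        score = math.log(doc_counts[author] / total_docs)
        # Add-one smoothing (an assumption of this sketch).
        denom = sum(token_counts[author].values()) + len(vocab)
        for tok in tokens:
            if tok not in vocab:
                continue  # ignore tokens never seen in training
            score += math.log((token_counts[author][tok] + 1) / denom)
        if score > best_score:
            best_author, best_score = author, score
    return best_author


# Hypothetical toy data: two distinct authors who both publish as "Smith J".
records = [
    ("smith_j_1", ["co:lee_k", "title:protein", "title:folding", "venue:bioinformatics"]),
    ("smith_j_1", ["co:lee_k", "title:genome", "venue:bioinformatics"]),
    ("smith_j_2", ["co:patel_r", "title:routing", "title:wireless", "venue:infocom"]),
    ("smith_j_2", ["co:patel_r", "title:sensor", "venue:infocom"]),
]
model = train(records)
print(predict(model, ["co:lee_k", "title:protein", "venue:bioinformatics"]))  # smith_j_1
```

Working in log space avoids numerical underflow when a publication has many tokens. Keeping separate probability tables per attribute type would be a natural refinement of this sketch.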
Support Vector Machines: Each author will be treated as a class, and we will train a classifier for each author. Given a publication, the goal is to find the closest class. Each publication will be represented by a feature vector built from its co-authors, keywords, paper title, and journal title. The weight of each feature will be determined by a standard scheme such as TF-IDF, or simply by the frequency of the feature in that publication.

Decision tree / Bayes net: The third approach will be one of those discussed in class (decision tree or Bayes net).

Data Set

Data for the proposed project will be obtained from Scopus (a bibliographic database) and/or Research in View (OSU:PRO). The following have been identified as the chief attributes for classification:

Co-authors: pre-processing of the data will be required to retrieve the co-authorship information. We plan to use existing work to extract this information.
Paper title: readily available in both data sets.
Journal title: readily available in both data sets.
Keywords: pre-processing will be required to extract keywords from the abstract. In addition, Scopus provides author keywords and index keywords.

Assumptions

The probability that a researcher significantly changes focus area over a short period of time can be neglected for practical purposes.
A researcher is more likely to collaborate with authors he or she has collaborated with in the past.
Each author will be represented in the canonical form proposed in [1].
Authors will be represented as last name followed by first initial. This is a reasonable assumption, as it is the standard form used in many publications.

Implementation/Evaluation

Using the canonical form will help us uniquely identify the authors. The naive Bayes model will be coded by one of the team members. We will use available toolkits such as WEKA for the SVM. For experimental purposes we will break the dataset into variable-sized training and test sets using the WEKA framework, by providing an appropriate sampling percentage. We will evaluate the performance of every algorithm we implement, and we will use the WEKA framework to compare the results of the three algorithms. Each name dataset will be split randomly: half of the publications will be used for training and the other half for testing. With each approach we will conduct n experiments. We will explore multiple schemes based on different combinations of attributes, and we will try to capture the merits and demerits of each approach mentioned above.
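The evaluation protocol above (a random half/half split, repeated n times) can be sketched as follows. The proposal plans to run this inside WEKA; the Python below is only a stand-in that illustrates the protocol. The function names, the `(features, label)` record layout, the majority-class placeholder classifier, and the toy records are all hypothetical, and the sketch assumes each name dataset has at least two publications.

```python
import random
from collections import Counter


def half_split(records, rng):
    """Shuffle a copy of the records and cut it into two halves."""
    shuffled = list(records)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


def mean_accuracy(records, fit, predict, n_runs=10, seed=0):
    """Average test accuracy over n_runs random half/half splits."""
    rng = random.Random(seed)  # fixed seed so the runs are repeatable
    accuracies = []
    for _ in range(n_runs):
        train_set, test_set = half_split(records, rng)
        model = fit(train_set)
        correct = sum(predict(model, x) == y for x, y in test_set)
        accuracies.append(correct / len(test_set))
    return sum(accuracies) / n_runs


# Placeholder classifier (hypothetical): always predicts the most frequent
# training label. Any fit/predict pair with the same shape can be swapped in.
def fit_majority(train_set):
    return Counter(label for _, label in train_set).most_common(1)[0][0]


def predict_majority(model, features):
    return model


# Hypothetical toy records: (feature tokens, true author id).
toy_records = [
    (("co:lee_k",), "smith_j_1"),
    (("co:lee_k",), "smith_j_1"),
    (("co:lee_k",), "smith_j_1"),
    (("co:patel_r",), "smith_j_2"),
    (("co:patel_r",), "smith_j_2"),
    (("co:patel_r",), "smith_j_2"),
]
print(mean_accuracy(toy_records, fit_majority, predict_majority, n_runs=5))
```

Passing `fit` and `predict` as arguments keeps the protocol identical across the three classifiers, so differences in the averaged accuracy reflect the algorithm or attribute scheme rather than the split.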
Role + Responsibility Matrix

Member     Naive Bayes  SVM        Bayes Net/Decision Tree  Presentation/Reports  Evaluations  Pre-processing
Aishwarya  tertiary     primary    secondary                Co-primary            TBD          Co-primary
Cynthea    secondary    tertiary   primary                  Co-primary            TBD          Co-primary
Mitesh     primary      secondary  tertiary                 Co-primary            TBD          Co-primary

References

[1] H. Han, C. L. Giles, H. Zha, C. Li, and K. Tsioutsiouliklis. Two Supervised Learning Approaches for Name Disambiguation in Author Citations.
[2] H. Han, H. Zha, and C. L. Giles. Name Disambiguation in Author Citations Using a K-way Spectral Clustering Method.
[3] H. Han, H. Zha, and C. L. Giles. A Model-Based K-means Algorithm for Name Disambiguation.
[4] J. Huang, S. Ertekin, and C. L. Giles. Fast Author Name Disambiguation in CiteSeer.
