Similarity indexes for scientometric research: A comparative analysis
Abstract
A significant number of papers in the field of scientometrics have addressed the comparison of various similarity indexes. However, the appropriateness of one index relative to the others is still debated, because the assessments reported in the literature differ. The objective of this paper is to provide a comparative analysis of the five most used similarity indexes for three types of scientometric analysis: co-word, co-citation and co-authorship. A total of 388 papers addressing similarity indexes in scientometric analysis over three decades were retrieved from the Web of Science and examined, of which 49 were retained as the most relevant according to selection criteria. The approach consisted of building cross matrices for the five indexes (Jaccard, Dice-Sørensen, Salton, Pearson, and Association Strength) for each of the three types of scientometric analysis. For each analysis type, papers are distinguished according to whether their results are theoretical or empirical. Furthermore, papers are classified according to the mathematical formulation of the similarity index being used (vector vs. non-vector). Across the 49 selected papers, the comparative analysis showed that there is still no consensus on the appropriateness of an index for co-word and co-authorship analyses, while for co-citation analysis Salton is the most widely preferred. Association Strength is the least covered index, and the least often compared with the other indexes, across the three analysis types. An open-source computer program was developed as a tool to facilitate empirical comparative studies of indexes. It generates the normalized matrix of any chosen index in either of the two mathematical variants.
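To make the five indexes concrete, the following is a minimal sketch in Python, not the program described in the abstract. It assumes a binary document-by-item occurrence matrix in which every item occurs at least once, and it uses the commonly cited set-theoretic (non-vector) forms of Jaccard, Dice-Sørensen, Salton's cosine and Association Strength, written in terms of the co-occurrence count c_ij and the occurrence counts s_i and s_j. Pearson is shown in a vector form, as the correlation between the items' occurrence columns. The function name `similarity_matrices` and the toy data are hypothetical.

```python
import numpy as np


def similarity_matrices(occ):
    """Return normalized item-by-item similarity matrices.

    occ: binary matrix, rows = documents, columns = items (assumption:
    every item occurs at least once and no column is constant).
    """
    occ = np.asarray(occ, dtype=float)
    c = occ.T @ occ                  # c[i, j]: documents containing both i and j
    s = np.diag(c).copy()            # s[i]: documents containing item i
    si, sj = s[:, None], s[None, :]  # broadcast to item-by-item shape
    return {
        "jaccard": c / (si + sj - c),
        "dice": 2 * c / (si + sj),
        "salton": c / np.sqrt(si * sj),
        # Association Strength up to a constant factor (assumed form)
        "association_strength": c / (si * sj),
        # Vector variant: correlation between occurrence columns
        "pearson": np.corrcoef(occ, rowvar=False),
    }


# Toy data: 4 documents, 3 items (A, B, C)
occ = [
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
]
m = similarity_matrices(occ)
print(m["jaccard"][0, 1])  # similarity between items A and B
```

Each matrix is symmetric, so a single pair of items can be compared across indexes by reading the same cell in each matrix. Pearson can be negative where the other four cannot, which is one reason the indexes may rank the same pairs differently.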