TY - JOUR
T1 - Towards a general theory of similarity and association measures
T2 - Similarity, dissimilarity and correlation functions
AU - Batyrshin, Ildar
N1 - Publisher Copyright: © 2019 - IOS Press and the authors. All rights reserved.
PY - 2019
Y1 - 2019
N2 - Similarity, correlation and association measures play an important role in statistics, information retrieval, data mining and data science, classification and machine learning, recommender systems and decision-making. They have numerous applications in ecology, social and behavioral sciences, biology and bioinformatics, social network and time series analysis, and image and natural language processing. Measures with the same name introduced on different domains often have different properties, and measures with the same properties often have different names. To unify the analysis of measures defined on different domains, this paper treats them as functions defined on a universal domain and satisfying certain sets of properties. The general properties of similarity functions (SF) and dissimilarity functions (DF), jointly called resemblance functions (RF), are studied on a universal domain and illustrated by examples on specific domains. Both known and new methods of constructing similarity measures are considered. The paper discusses the following aspects of RF: their relationship with fuzzy (valued) relations, T-transitivity and the triangle inequality, Minkowski distance and data transformation, the cosine SF, RF on domains with involution (negation), aggregation and transformations of RF, and visualization of RF. The paper also considers the lattice of RF, composition and min-transitive transformations of SF (fuzzy proximity relations), and applications to hierarchical clustering and non-probabilistic entropy of RF. In addition, the paper proposes a method of constructing correlation functions (association measures) using SF. The Pearson correlation and Yule's Q association coefficients are obtained as particular cases of the general method.
One can use the paper as a survey of works on similarity and dissimilarity measures on specific domains, as a guide for constructing new similarity and correlation measures, as a basis for studying the mathematical properties of resemblance functions on universal and specific domains, and as part of a course on Data Science.
KW - Association
KW - Correlation
KW - Data mining
KW - Data science
KW - Dissimilarity
KW - Distance
KW - Negation
KW - Similarity
KW - Transitivity
UR - http://www.scopus.com/inward/record.url?scp=85064614042&partnerID=8YFLogxK
U2 - 10.3233/JIFS-181503
DO - 10.3233/JIFS-181503
M3 - Article
AN - SCOPUS:85064614042
SN - 1064-1246
VL - 36
SP - 2977
EP - 3004
JO - Journal of Intelligent and Fuzzy Systems
JF - Journal of Intelligent and Fuzzy Systems
IS - 4
ER -