Towards a general theory of similarity and association measures: Similarity, dissimilarity and correlation functions

Ildar Batyrshin

doi:10.3233/JIFS-181503

Centro de Investigación en Computación (CIC)

Research output: Contribution to journal › Article › peer-review

22 Scopus citations

Abstract

Similarity, correlation and association measures play an important role in statistics, information retrieval, data mining and data science, classification and machine learning, recommender systems and decision-making. They have numerous applications in ecology, social and behavioral sciences, biology and bioinformatics, social network and time series analysis, and image and natural language processing. Measures with the same name introduced on different domains often have different properties, and measures with the same properties often have different names. To unify the analysis of measures defined on different domains, this paper treats them as functions defined on a universal domain and satisfying certain sets of properties. The general properties of similarity functions (SF) and dissimilarity functions (DF), jointly called resemblance functions (RF), are studied on the universal domain and illustrated by examples on specific domains. Both known and new methods of constructing similarity measures are considered. The paper discusses the following aspects of RF: their relationship with fuzzy (valued) relations, T-transitivity and the triangle inequality, the Minkowski distance and data transformation, the cosine SF, RF on domains with involution (negation), aggregation and transformations of RF, and visualization of RF. It also considers the lattice of RF, composition and min-transitive transformations of SF (fuzzy proximity relations), applications to hierarchical clustering and non-probabilistic entropy of RF. In addition, the paper proposes a method for constructing correlation functions (association measures) from SF. The Pearson correlation and Yule's Q association coefficients are obtained as particular cases of the general method.
The paper can be used as a survey of work on similarity and dissimilarity measures on specific domains, as a guide for constructing new similarity and correlation measures, as a basis for studying the mathematical properties of resemblance functions on universal and specific domains, and as part of a course on Data Science.
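
For readers unfamiliar with the coefficients named in the abstract, the following is a minimal Python sketch of the cosine similarity, the Pearson correlation and Yule's Q in their standard textbook forms. The abstract does not spell out the paper's construction, so the function `correlation_from_similarity`, which builds a correlation as S(x, y) - S(x, -y) from a distance-based similarity on standardized vectors, is an illustrative assumption rather than the author's stated method. All function names here are hypothetical.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def center(x):
    """Subtract the mean from every component."""
    mean = sum(x) / len(x)
    return [a - mean for a in x]


def pearson_correlation(x, y):
    """Pearson's r equals the cosine similarity of the centered vectors.

    Undefined (division by zero) for constant vectors.
    """
    return cosine_similarity(center(x), center(y))


def yule_q(a, b, c, d):
    """Yule's Q for the 2x2 contingency table [[a, b], [c, d]]."""
    return (a * d - b * c) / (a * d + b * c)


def standardize(x):
    """Center a vector and scale it to unit Euclidean length."""
    centered = center(x)
    norm = math.sqrt(sum(a * a for a in centered))
    return [a / norm for a in centered]


def similarity(u, v):
    """Assumed similarity on unit vectors: 1 - ||u - v||^2 / 4, in [0, 1]."""
    return 1.0 - sum((a - b) ** 2 for a, b in zip(u, v)) / 4.0


def correlation_from_similarity(x, y):
    """Assumed construction: S(x, y) - S(x, N(y)) with negation N(y) = -y."""
    u, v = standardize(x), standardize(y)
    return similarity(u, v) - similarity(u, [-b for b in v])


if __name__ == "__main__":
    x, y = [1, 2, 3, 4], [2, 1, 4, 3]
    print(pearson_correlation(x, y))
    print(correlation_from_similarity(x, y))
    print(yule_q(10, 2, 3, 12))
```

Under these assumptions the two correlation values agree, because for unit vectors the difference of the two similarities reduces algebraically to the dot product of the standardized vectors.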

Original language	English
Pages (from-to)	2977-3004
Number of pages	28
Journal	Journal of Intelligent and Fuzzy Systems
Volume	36
Issue number	4
DOIs	https://doi.org/10.3233/JIFS-181503
State	Published - 2019

Keywords

Association
Correlation
Data mining
Data science
Dissimilarity
Distance
Negation
Similarity
Transitivity

Cite this

@article{9db124e288eb4340b15b9fab82d8ae7a,

title = "Towards a general theory of similarity and association measures: Similarity, dissimilarity and correlation functions",

abstract = "Similarity, correlation and association measures play an important role in statistics, information retrieval, data mining and data science, classification and machine learning, recommender systems and decision-making. They have numerous applications in ecology, social and behavioral sciences, biology and bioinformatics, social network and time series analysis, image and natural language processing. Often the measures with the same name introduced on different domains have different properties, and the measures with the same properties have different names. To unify analysis of measures defined on different domains, this paper considers these measures as functions defined on universal domain and satisfying some sets of properties. The general properties of similarity functions (SF) and dissimilarity functions (DF) under the joint name of resemblance functions (RF) studied on universal domain and illustrated by examples on specific domains. The known and the new methods of construction of similarity measures are considered. This paper discusses the following aspects of RF: relationship with fuzzy (valued) relations, T-transitivity and triangle inequality, Minkowski distance and data transformation, cosine SF, RF on domains with involution (negation), aggregation and transformations of RF, visualization of RF. The paper considers also the lattice of RF, composition and min-transitive transformations of SF (fuzzy proximity relations), applications to hierarchical clustering and non-probabilistic entropy of RF. In addition, the paper proposes the method of construction of correlation functions (association measures) using SF. Pearson correlation and Yule's Q association coefficients obtained as particular cases of the general method. 
One can use the paper as a survey of works on similarity and dissimilarity measures on specific domains, as a guide for constructing new similarity and correlation measures, as a base for the study of mathematical properties of resemblance functions on universal and specific domains, and also as a part of the course on Data Science.",

keywords = "Association, Correlation, Data mining, Data science, Dissimilarity, Distance, Negation, Similarity, Transitivity",

author = "Ildar Batyrshin",

year = "2019",

doi = "10.3233/JIFS-181503",

language = "English",

volume = "36",

pages = "2977--3004",

journal = "Journal of Intelligent and Fuzzy Systems",

issn = "1064-1246",

number = "4",

}

TY - JOUR

T1 - Towards a general theory of similarity and association measures

T2 - Similarity, dissimilarity and correlation functions

AU - Batyrshin, Ildar

PY - 2019

Y1 - 2019

N2 - Similarity, correlation and association measures play an important role in statistics, information retrieval, data mining and data science, classification and machine learning, recommender systems and decision-making. They have numerous applications in ecology, social and behavioral sciences, biology and bioinformatics, social network and time series analysis, image and natural language processing. Often the measures with the same name introduced on different domains have different properties, and the measures with the same properties have different names. To unify analysis of measures defined on different domains, this paper considers these measures as functions defined on universal domain and satisfying some sets of properties. The general properties of similarity functions (SF) and dissimilarity functions (DF) under the joint name of resemblance functions (RF) studied on universal domain and illustrated by examples on specific domains. The known and the new methods of construction of similarity measures are considered. This paper discusses the following aspects of RF: relationship with fuzzy (valued) relations, T-transitivity and triangle inequality, Minkowski distance and data transformation, cosine SF, RF on domains with involution (negation), aggregation and transformations of RF, visualization of RF. The paper considers also the lattice of RF, composition and min-transitive transformations of SF (fuzzy proximity relations), applications to hierarchical clustering and non-probabilistic entropy of RF. In addition, the paper proposes the method of construction of correlation functions (association measures) using SF. Pearson correlation and Yule's Q association coefficients obtained as particular cases of the general method. 
One can use the paper as a survey of works on similarity and dissimilarity measures on specific domains, as a guide for constructing new similarity and correlation measures, as a base for the study of mathematical properties of resemblance functions on universal and specific domains, and also as a part of the course on Data Science.

AB - Similarity, correlation and association measures play an important role in statistics, information retrieval, data mining and data science, classification and machine learning, recommender systems and decision-making. They have numerous applications in ecology, social and behavioral sciences, biology and bioinformatics, social network and time series analysis, image and natural language processing. Often the measures with the same name introduced on different domains have different properties, and the measures with the same properties have different names. To unify analysis of measures defined on different domains, this paper considers these measures as functions defined on universal domain and satisfying some sets of properties. The general properties of similarity functions (SF) and dissimilarity functions (DF) under the joint name of resemblance functions (RF) studied on universal domain and illustrated by examples on specific domains. The known and the new methods of construction of similarity measures are considered. This paper discusses the following aspects of RF: relationship with fuzzy (valued) relations, T-transitivity and triangle inequality, Minkowski distance and data transformation, cosine SF, RF on domains with involution (negation), aggregation and transformations of RF, visualization of RF. The paper considers also the lattice of RF, composition and min-transitive transformations of SF (fuzzy proximity relations), applications to hierarchical clustering and non-probabilistic entropy of RF. In addition, the paper proposes the method of construction of correlation functions (association measures) using SF. Pearson correlation and Yule's Q association coefficients obtained as particular cases of the general method. 
One can use the paper as a survey of works on similarity and dissimilarity measures on specific domains, as a guide for constructing new similarity and correlation measures, as a base for the study of mathematical properties of resemblance functions on universal and specific domains, and also as a part of the course on Data Science.

KW - Association

KW - Correlation

KW - Data mining

KW - Data science

KW - Dissimilarity

KW - Distance

KW - Negation

KW - Similarity

KW - Transitivity

UR - http://www.scopus.com/inward/record.url?scp=85064614042&partnerID=8YFLogxK

U2 - 10.3233/JIFS-181503

DO - 10.3233/JIFS-181503

M3 - Article

AN - SCOPUS:85064614042

SN - 1064-1246

VL - 36

SP - 2977

EP - 3004

JO - Journal of Intelligent and Fuzzy Systems

JF - Journal of Intelligent and Fuzzy Systems

IS - 4

ER -
