Analogical Relevance Index. (arXiv:2301.04134v1 [cs.LG])
Focusing on the most significant features of a dataset is useful both in
machine learning (ML) and data mining. In ML, it can lead to higher accuracy,
a faster learning process, and ultimately a simpler and more understandable
model. In data mining, identifying significant features is essential not only
for gaining a better understanding of the data but also for visualization. In
this paper, we demonstrate a new way of identifying significant features
inspired by analogical proportions. Such a proportion has the form “a is
to b as c is to d”, comparing two pairs of items (a, b) and (c, d) in terms of
similarities and dissimilarities. In a classification context, if the
similarities/dissimilarities between a and b correlate with the fact that a and
b have different labels, this knowledge can be transferred to c and d to
infer that c and d also have different labels. From a feature selection
perspective, observing a huge number of such pairs (a, b) where a and b have
different labels provides a hint about the importance of the features where a
and b differ. Following this idea, we introduce the Analogical Relevance Index
(ARI), a new statistical test of the significance of a given feature with
respect to the label. ARI is a filter-based method; such methods are
ML-agnostic but generally unable to handle feature redundancy. ARI, however,
can detect redundant features. Our experiments show that ARI is effective and
outperforms well-known methods on a variety of artificial datasets and on some
real-world datasets.
Source: https://arxiv.org/abs/2301.04134
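To make the feature-selection intuition concrete, here is a minimal sketch in Python. It is not the ARI statistic defined in the paper (which is a proper statistical test); it only scores each binary feature by how much more often it differs within pairs that have different labels than within pairs that share a label. The function name pairwise_difference_scores and the toy data are illustrative assumptions.

import numpy as np
from itertools import combinations

def pairwise_difference_scores(X, y):
    # Illustrative sketch only (not the paper's ARI test): for every pair of
    # samples, record which binary features differ, separately for pairs with
    # different labels and pairs with the same label. A feature that changes
    # mostly when the label changes is a candidate relevant feature.
    n_samples, n_features = X.shape
    diff_cross = np.zeros(n_features)  # feature differs and labels differ
    diff_same = np.zeros(n_features)   # feature differs but labels agree
    n_cross = n_same = 0
    for i, j in combinations(range(n_samples), 2):
        differs = (X[i] != X[j]).astype(float)
        if y[i] != y[j]:
            diff_cross += differs
            n_cross += 1
        else:
            diff_same += differs
            n_same += 1
    rate_cross = diff_cross / max(n_cross, 1)
    rate_same = diff_same / max(n_same, 1)
    return rate_cross - rate_same

# Toy usage: feature 0 copies the label, feature 1 is pure noise, so
# feature 0 should receive a score near 1 and feature 1 a score near 0.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=100)
X = np.column_stack([y, rng.integers(0, 2, size=100)])
print(pairwise_difference_scores(X, y))

The score contrasts cross-label and same-label pairs so that features that vary a lot for reasons unrelated to the label are not rewarded; the paper's ARI formalizes this kind of contrast as a statistical significance test and additionally addresses feature redundancy.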