Scalable Data Point Valuation in Decentralized Learning. (arXiv:2305.01657v1 [cs.LG])
Existing research on data valuation in federated and swarm learning focuses
on valuing client contributions and works best when data across clients is
independent and identically distributed (IID). In practice, data is rarely
distributed IID. We develop an approach called DDVal for decentralized data
valuation, capable of valuing individual data points in federated and swarm
learning. DDVal is based on sharing deep features and approximating Shapley
values through a k-nearest neighbor approximation method. This allows for novel
applications, for example, to simultaneously reward institutions and
individuals for providing data to a decentralized machine learning task. The
valuation of data points through DDVal also makes it possible to draw
hierarchical conclusions on the contribution of institutions, and we empirically
show that the accuracy of DDVal in estimating institutional contributions is
higher than that of existing Shapley value approximation methods for federated learning.
Specifically, it reaches a cosine similarity of 99.969% in approximating Shapley
values for both IID and non-IID data distributions across institutions,
compared with 99.301% and 97.250% for the best state-of-the-art methods.
DDVal scales with the number of data points instead of the number of clients,
and has log-linear complexity, which is more favorable than the exponential
complexity of existing approaches. We show that DDVal is especially
efficient in data distribution scenarios with many clients that have few data
points – for example, more than 16 clients with 8,000 data points each. By
integrating DDVal into a decentralized system, we show that it is not only
suitable for centralized federated learning, but also for decentralized swarm
learning, which aligns well with the research on emerging internet technologies
such as web3 to reward users for providing data to algorithms.
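
The k-nearest neighbor approximation of Shapley values mentioned above can be illustrated with a minimal sketch. It assumes the well-known kNN-Shapley recursion for an unweighted kNN classifier (Jia et al., 2019) applied to already-extracted deep features; the abstract does not state that DDVal uses exactly this recursion. The function names `knn_shapley` and `aggregate_by_client`, the Euclidean distance, and the default `k=5` are illustrative assumptions, not part of the paper. Sorting the training points once per test point is what gives the log-linear cost in the number of data points, and summing point values per client gives the hierarchical (institution-level) view.

```python
import numpy as np

def knn_shapley(train_feats, train_labels, test_feats, test_labels, k=5):
    """Per-point Shapley values for an unweighted kNN classifier.

    Hypothetical helper: uses the closed-form kNN-Shapley recursion,
    averaged over all test points. Inputs are NumPy arrays of features
    (n x d) and labels (n,).
    """
    n = len(train_labels)
    values = np.zeros(n)
    for x, y in zip(test_feats, test_labels):
        # Rank training points by Euclidean distance to the test point.
        dists = np.linalg.norm(train_feats - x, axis=1)
        order = np.argsort(dists, kind="stable")
        match = (train_labels[order] == y).astype(float)

        s = np.zeros(n)
        # The farthest point starts the recursion.
        s[n - 1] = match[n - 1] / n
        # Walk from the second-farthest point to the nearest one.
        for i in range(n - 2, -1, -1):
            rank = i + 1  # 1-based rank of the point at sorted position i
            s[i] = s[i + 1] + (match[i] - match[i + 1]) / k * min(k, rank) / rank
        values[order] += s
    return values / len(test_labels)

def aggregate_by_client(values, client_ids):
    """Sum data point values per client (assumed institution-level view)."""
    totals = {}
    for v, c in zip(values, client_ids):
        totals[c] = totals.get(c, 0.0) + float(v)
    return totals
```

By the efficiency property of Shapley values, the point values returned by `knn_shapley` sum to the average kNN utility on the test points, so the per-client totals from `aggregate_by_client` partition that utility among clients.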
Source: https://arxiv.org/abs/2305.01657