Censorship of Online Encyclopedias: Implications for NLP Models. (arXiv:2101.09294v1 [cs.CL])

While artificial intelligence provides the backbone for many tools people use
around the world, recent work has drawn attention to the fact that the
algorithms powering AI are not free of politics, stereotypes, and bias. While most work in
this area has focused on the ways in which AI can exacerbate existing
inequalities and discrimination, very little work has studied how governments
actively shape training data. We describe how censorship has affected the
development of Wikipedia corpora, text data regularly used as pre-training
inputs for NLP algorithms. We show that word embeddings trained on Baidu Baike,
an online Chinese encyclopedia, encode very different associations between
adjectives and a range of concepts about democracy, freedom, collective action,
equality, and people and historical events in China than embeddings trained on
its regularly blocked but uncensored counterpart, Chinese-language Wikipedia. We examine the
implications of these discrepancies by studying their use in downstream AI
applications. Our paper shows how government repression, censorship, and
self-censorship may impact training data and the applications that draw from
them.
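The core measurement the abstract describes, comparing adjective–concept associations across embeddings trained on two different corpora, can be sketched with cosine similarity. The toy two-dimensional vectors and word keys below are hypothetical stand-ins for illustration, not the paper's actual embeddings or results.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity: the standard association measure between word vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy embeddings standing in for vectors trained on each corpus.
emb_wikipedia = {"democracy": np.array([0.9, 0.1]),
                 "positive":  np.array([0.8, 0.2])}
emb_baike     = {"democracy": np.array([0.1, 0.9]),
                 "positive":  np.array([0.8, 0.2])}

# Association of a target concept with an adjective direction, per corpus.
assoc_wikipedia = cosine(emb_wikipedia["democracy"], emb_wikipedia["positive"])
assoc_baike     = cosine(emb_baike["democracy"], emb_baike["positive"])

# A gap between the two corpora indicates divergent associations for
# the same concept, the kind of discrepancy the paper investigates.
gap = assoc_wikipedia - assoc_baike
```

In practice the paper's comparison would use full pretrained embeddings (e.g. loaded via a library such as gensim) and many adjective/concept pairs; this sketch only shows the shape of the computation.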

Source: https://arxiv.org/abs/2101.09294
