Problems with Cosine as a Measure of Embedding Similarity for High Frequency Words. (arXiv:2205.05092v1 [cs.CL])

Cosine similarity of contextual embeddings is used in many NLP tasks (e.g.,
QA, IR, MT) and metrics (e.g., BERTScore). Here, we uncover systematic ways in
which word similarities estimated by cosine over BERT embeddings are
understated and trace this effect to training data frequency. We find that,
relative to human judgements, cosine similarity underestimates the similarity
of frequent words to other instances of the same word and to other words across
contexts, even after controlling for polysemy and other factors. We conjecture
that this underestimation of similarity for high frequency words is due to
differences in the representational geometry of high and low frequency words
and provide a formal argument for the two-dimensional case.
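
For reference, the quantity under discussion can be sketched in a few lines. This is a minimal illustration of cosine similarity between two vectors, not the paper's experimental code: the function name `cosine_similarity` and the toy 2-D vectors are assumptions made for the example, and real BERT embeddings would have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy 2-D vectors standing in for contextual embeddings (made-up values).
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])      # perpendicular vectors
```

Because the measure depends only on the angle between the vectors and not on their lengths, any systematic difference in how embeddings of frequent and rare words are arranged in the space can shift the score, which is the kind of geometric effect the abstract conjectures.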


