Learning Efficient Representations for Keyword Spotting with Triplet Loss. (arXiv:2101.04792v1 [eess.AS])

In the past few years, triplet loss-based metric embeddings have become a
de-facto standard for several important computer vision problems, most
notably person re-identification. In speech recognition, by contrast, the
metric embeddings produced by the triplet loss are rarely used, even for
classification problems. We fill this gap by showing that a combination of
two representation learning techniques (a triplet loss-based embedding and a
variant of kNN for classification in place of cross-entropy loss)
significantly improves classification accuracy, by 26% to 38%, for
convolutional networks on the LibriSpeech-derived LibriWords datasets. To do
so, we propose a novel triplet mining approach based on phonetic similarity.
We also match the current best published SOTA for the Google Speech Commands
dataset V2 10+2-class classification task with an architecture that is about
6 times more compact, and improve the current best published SOTA for
35-class classification on Google Speech Commands dataset V2 by over 40%.
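The pipeline the abstract describes has two parts: a margin-based triplet loss that shapes the embedding space, and a kNN vote over embeddings that replaces the usual cross-entropy classifier head. A minimal NumPy sketch of both components, assuming Euclidean distance and a margin of 0.2 (the function names, margin value, and distance choice are illustrative, not taken from the paper):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Margin-based triplet loss: pull the anchor toward the positive
    # and push it away from the negative by at least `margin`.
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()

def knn_classify(query, embeddings, labels, k=3):
    # Classify a query embedding by majority vote among its
    # k nearest neighbors in the learned embedding space.
    dists = np.linalg.norm(embeddings - query, axis=1)
    nearest = labels[np.argsort(dists)[:k]]
    values, counts = np.unique(nearest, return_counts=True)
    return values[np.argmax(counts)]
```

In training, triplets would be mined from keyword examples (the paper mines them by phonetic similarity); at inference, each utterance is embedded once and classified by `knn_classify` against a labeled support set.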

Source: https://arxiv.org/abs/2101.04792

