The recent emergence of deepfakes, computerized realistic multimedia fakes,
has brought the detection of manipulated and generated content to the forefront.
While many machine learning models for deepfake detection have been proposed,
human detection capabilities have remained far less explored. This is of
special importance as human perception differs from machine perception and
deepfakes are generally designed to fool humans. So far, this issue has only
been addressed in the area of images and video.
To compare the ability of humans and machines in detecting audio deepfakes,
we conducted an online gamified experiment in which we asked users to discern
bona fide audio samples from spoofed audio, generated with a variety of
algorithms. In total, 200 users competed over 8976 game rounds against an artificial
intelligence (AI) algorithm trained for audio deepfake detection. With the
collected data we found that the machine generally outperforms the humans in
detecting audio deepfakes, but that the converse holds for a certain attack
type, for which humans are still more accurate. Furthermore, we found that
younger participants are on average better at detecting audio deepfakes than
older participants, while IT professionals hold no advantage over laymen. We
conclude that it is important to combine human and machine knowledge in order
to improve audio deepfake detection.