On using distributed representations of source code for the detection of C security vulnerabilities. (arXiv:2106.01367v1 [cs.CR])

This paper presents an evaluation of the code representation model Code2vec
when trained on the task of detecting security vulnerabilities in C source
code. We leverage the open-source library astminer to extract path-contexts
from the abstract syntax trees of a corpus of labeled C functions. Code2vec is
trained on the resulting path-contexts with the task of classifying a function
as vulnerable or non-vulnerable. Using the CodeXGLUE benchmark, we show that
the accuracy of Code2vec for this task is comparable to simple
transformer-based methods such as pre-trained RoBERTa, and outperforms more
naive NLP-based methods. We achieved an accuracy of 61.43% while maintaining
low computational requirements relative to larger models.

Source: https://arxiv.org/abs/2106.01367


Related post