A Unified Active Learning Framework for Annotating Graph Data with Application to Software Source Code Performance Prediction. (arXiv:2304.13032v1 [cs.SE])
Most machine learning and data analytics applications, including performance
engineering in software systems, require large amounts of labelled data, which
might not be available in advance. Acquiring annotations is challenging because
it often requires significant time, effort, and computational resources. We
develop a unified active learning framework, specialized for software
performance prediction, to address this challenge. We begin by parsing the
source code to an Abstract Syntax Tree (AST) and augmenting it with data and
control flow edges. Then, we convert the tree representation of the source code
to a Flow Augmented-AST graph (FA-AST) representation. Based on the graph
representation, we construct various graph embeddings (unsupervised and
supervised) into a latent space. Given such an embedding, the framework becomes
task agnostic since active learning can be performed using any regression
method and query strategy suited for regression. Within this framework, we
investigate the impact of using different levels of information for active and
passive learning, e.g., partially available labels and unlabelled test data. Our
approach aims to improve the return on investment in AI models that predict
software performance (here, execution time) from the structure of the source
code. Our real-world experiments reveal that respectable performance can be
achieved by querying labels for only a small subset of all the data.
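
To illustrate the task-agnostic step described above, here is a minimal sketch of pool-based active learning for regression on fixed embeddings. It is not the authors' implementation. The embeddings are random stand-ins for FA-AST graph embeddings, the "execution time" labels are synthetic, and the regressor (closed-form ridge) and query strategy (farthest-point sampling in embedding space) are assumptions chosen only because they need nothing beyond NumPy. All function names (`fit_ridge`, `query_farthest`, `active_learning`, `oracle`) are hypothetical.

```python
import numpy as np


def fit_ridge(X, y, alpha=1e-3):
    """Closed-form ridge regression; a bias column is appended to X."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    A = Xb.T @ Xb + alpha * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ y)


def predict(w, X):
    """Apply the ridge weights (last entry is the bias) to embeddings X."""
    return np.hstack([X, np.ones((len(X), 1))]) @ w


def query_farthest(X, labeled, pool):
    """Return the pool index whose embedding is farthest from its nearest labelled one."""
    d = np.linalg.norm(X[pool][:, None, :] - X[labeled][None, :, :], axis=2)
    return pool[int(np.argmax(d.min(axis=1)))]


def active_learning(X, oracle, n_init=5, budget=20, seed=0):
    """Label n_init random points, then query `budget` more, one at a time."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    labeled = [int(i) for i in order[:n_init]]
    pool = [int(i) for i in order[n_init:]]
    y = {i: oracle(i) for i in labeled}
    for _ in range(budget):
        i = query_farthest(X, labeled, pool)  # this strategy is model-free
        pool.remove(i)
        labeled.append(i)
        y[i] = oracle(i)  # the expensive step, e.g. measuring a runtime
    # Fit the regressor once on everything labelled so far.
    w = fit_ridge(X[labeled], np.array([y[i] for i in labeled]))
    return w, labeled


# Synthetic stand-in data (assumption): 200 programs, 8-dimensional embeddings.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 8))
runtime = X @ rng.normal(size=8) + 0.1 * rng.normal(size=200)


def oracle(i):
    """Hypothetical labelling function returning the 'measured' runtime."""
    return float(runtime[i])


w, labeled = active_learning(X, oracle)
unlabeled = [i for i in range(len(X)) if i not in set(labeled)]
rmse = float(np.sqrt(np.mean((predict(w, X[unlabeled]) - runtime[unlabeled]) ** 2)))
```

Because the loop touches only an embedding matrix and a labelling function, either the regressor or the query strategy can be swapped (for example, for a model-based uncertainty criterion that refits inside the loop) without changing the rest of the sketch.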
Source: https://arxiv.org/abs/2304.13032