SapientML: Synthesizing Machine Learning Pipelines by Learning from Human-Written Solutions. (arXiv:2202.10451v1 [cs.LG])

Automatic machine learning, or AutoML, holds the promise of truly
democratizing the use of machine learning (ML), by substantially automating the
work of data scientists. However, the huge combinatorial search space of
candidate pipelines means that current AutoML techniques, generate sub-optimal
pipelines, or none at all, especially on large, complex datasets. In this work
we propose an AutoML technique SapientML, that can learn from a corpus of
existing datasets and their human-written pipelines, and efficiently generate a
high-quality pipeline for a predictive task on a new dataset. To combat the
search space explosion of AutoML, SapientML employs a novel divide-and-conquer
strategy realized as a three-stage program synthesis approach, that reasons on
successively smaller search spaces. The first stage uses a machine-learned
model to predict a set of plausible ML components to constitute a pipeline. In
the second stage, this is then refined into a small pool of viable concrete
pipelines using syntactic constraints derived from the corpus and the
machine-learned model. Dynamically evaluating these few pipelines, in the third
stage, provides the best solution. We instantiate SapientML as part of a fully
automated tool-chain that creates a cleaned, labeled learning corpus by mining
Kaggle, learns from it, and uses the learned models to then synthesize
pipelines for new predictive tasks. We have created a training corpus of 1094
pipelines spanning 170 datasets, and evaluated SapientML on a set of 41
benchmark datasets, including 10 new, large, real-world datasets from Kaggle,
and against 3 state-of-the-art AutoML tools and 2 baselines. Our evaluation
shows that SapientML produces the best or comparable accuracy on 27 of the
benchmarks while the second best tool fails to even produce a pipeline on 9 of
the instances.



Related post