A Case for Dataset Specific Profiling
Data-driven science is an emerging paradigm in which scientific discoveries
depend on executing computational AI models against rich, discipline-specific
datasets. With modern machine learning frameworks, anyone can develop and run
computational models that reveal concepts hidden in the data and could enable
new scientific applications. For important and widely used
datasets, computing the performance of every computational model that can run
against a dataset is cost-prohibitive in terms of cloud resources. In
practice, benchmarking approaches instead use representative datasets to infer
performance without actually executing the models. While practical, these
approaches limit
extensive dataset profiling to a few datasets and introduce bias that favors
models suited to the representative datasets. As a result, each dataset's
unique characteristics are left unexplored, and subpar models are selected
based on
inference from generalized datasets. This necessitates a new paradigm that
introduces dataset profiling into the model selection process. To demonstrate
the need for dataset-specific profiling, we answer two questions: (1) Can
scientific datasets significantly permute the rank order of computational
models compared to widely used representative datasets? (2) If so, could
lightweight model execution improve benchmarking accuracy? Taken together, the
answers to these questions lay the foundation for a new dataset-aware
benchmarking paradigm.
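
As an illustration of question (1), rank-order permutation between datasets
can be quantified by scoring the same candidate models on each dataset and
comparing the induced rankings, for instance with Kendall's tau. The sketch
below is a minimal illustration of that comparison; the model names and
accuracy values are hypothetical placeholders, not results from the paper.

    # Minimal sketch: does a scientific dataset permute the model ranking
    # implied by a representative benchmark? All model names and scores
    # below are hypothetical placeholders.
    from scipy.stats import kendalltau

    # Accuracy of the same candidate models on a representative benchmark
    # dataset versus a discipline-specific scientific dataset.
    representative = {"model_a": 0.91, "model_b": 0.88,
                      "model_c": 0.84, "model_d": 0.79}
    scientific = {"model_a": 0.73, "model_b": 0.81,
                  "model_c": 0.85, "model_d": 0.70}

    models = sorted(representative)  # fixed order so the score lists pair up

    # Kendall's tau compares the two rank orders: tau = 1 means the benchmark
    # predicts the scientific ranking exactly; lower values mean the
    # scientific dataset permutes the rank order.
    tau, p_value = kendalltau([representative[m] for m in models],
                              [scientific[m] for m in models])
    print(f"Kendall tau = {tau:.2f} (p = {p_value:.2f})")

    # Rankings induced by each dataset, best model first.
    def ranking(scores):
        return sorted(scores, key=scores.get, reverse=True)

    print("representative ranking:", ranking(representative))
    print("scientific ranking:    ", ranking(scientific))

A corresponding check for question (2) would score the models on a small
subsample of the scientific dataset (lightweight model execution) and test
whether the subsample-induced ranking tracks the full ranking more closely
than the representative benchmark does.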
Source: https://arxiv.org/abs/2208.03315