Large language models in biomedical natural language processing: benchmarks, baselines, and recommendations. (arXiv:2305.16326v1 [cs.CL])
Biomedical literature is growing rapidly, making it challenging to curate and
extract knowledge manually. Biomedical natural language processing (BioNLP)
techniques that can automatically extract information from biomedical
literature help alleviate this burden. Recently, large language models (LLMs),
such as GPT-3 and GPT-4, have gained significant attention for their impressive
performance. However, their effectiveness on BioNLP tasks and their impact on
method development and downstream users remain understudied. This pilot study
(1) establishes the baseline performance of GPT-3 and GPT-4 in both zero-shot
and one-shot settings on eight BioNLP datasets across four applications: named
entity recognition, relation extraction, multi-label document classification,
and semantic similarity and reasoning; (2) examines the errors produced by the
LLMs and categorizes them into three types: missingness, inconsistencies, and
unwanted artificial content; and (3) provides suggestions for using LLMs in
BioNLP applications. We make the datasets, baselines, and results publicly
available to the community via
https://github.com/qingyu-qc/gpt_bionlp_benchmark.
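As a rough illustration of the zero-shot and one-shot settings evaluated in the
study, the Python sketch below prompts a GPT model for disease named entity
recognition. The prompt wording, model name, and example sentences are our own
illustrative assumptions, not the paper's exact protocol.

# Minimal sketch of zero-shot vs. one-shot prompting for biomedical NER.
# The prompt text, model name, and example sentences are illustrative
# assumptions; see the paper and repository for the actual benchmark setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ZERO_SHOT = (
    "Extract all disease mentions from the sentence below. "
    "Return them as a comma-separated list.\n\n"
    "Sentence: {sentence}\n"
    "Diseases:"
)

# One-shot adds a single worked example before the test sentence.
ONE_SHOT = (
    "Extract all disease mentions from the sentence below. "
    "Return them as a comma-separated list.\n\n"
    "Sentence: Mutations in BRCA1 are associated with breast cancer.\n"
    "Diseases: breast cancer\n\n"
    "Sentence: {sentence}\n"
    "Diseases:"
)

def extract_diseases(sentence: str, template: str) -> str:
    """Send one NER prompt to the model and return its raw text answer."""
    response = client.chat.completions.create(
        model="gpt-4",  # the study evaluates GPT-3 and GPT-4
        temperature=0,  # deterministic output for benchmarking
        messages=[{"role": "user", "content": template.format(sentence=sentence)}],
    )
    return response.choices[0].message.content.strip()

print(extract_diseases("The patient was diagnosed with type 2 diabetes.", ZERO_SHOT))
print(extract_diseases("The patient was diagnosed with type 2 diabetes.", ONE_SHOT))

In a setup like this, the model's free-text answers would then be aligned
against gold annotations, which is where error types such as missingness and
unwanted artificial content become visible.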
Source: https://arxiv.org/abs/2305.16326