Matching Table Metadata with Business Glossaries Using Large Language Models. (arXiv:2309.11506v1 [cs.IR])
Enterprises often own large collections of structured data in the form of
large databases or an enterprise data lake. Such data collections come with
limited metadata and strict access policies that could limit access to the data
contents and, therefore, limit the application of classic retrieval and
analysis solutions. As a result, there is a need for solutions that can
effectively utilize the available metadata. In this paper, we study the problem
of matching table metadata to a business glossary containing data labels and
descriptions. The resulting matching enables the use of an available or curated
business glossary for retrieval and analysis before, or even without,
requesting access to the data contents. One solution to this problem is to use
manually-defined rules or similarity measures on column names and glossary
descriptions (or their vector embeddings) to find the closest match. However,
such approaches need to be tuned through manual labeling and cannot handle the
many business glossaries that mix simple descriptions with long, complex
ones. In this work, we leverage the power of large language models
(LLMs) to design generic matching methods that do not require manual tuning and
can identify complex relations between column names and glossaries. We propose
methods that utilize LLMs in two ways: (a) by generating additional context for
column names that can aid matching, and (b) by directly inferring whether
there is a relation between column names and glossary descriptions. Our
preliminary experimental results show the effectiveness of our proposed
methods.
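
As a rough illustration of the two ideas contrasted above, the following sketch shows a token-overlap similarity baseline over column names and glossary text, plus a prompt builder of the kind that could be used to ask an LLM directly whether a column relates to a glossary entry. It is a minimal sketch under stated assumptions, not the paper's method: the function names (tokenize, jaccard, best_match, build_prompt), the Jaccard measure, the prompt wording, and the example glossary are all hypothetical, and no LLM is actually called.

```python
import re


def tokenize(text):
    """Lowercased word tokens; splits snake_case and camelCase names."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)
    return {t.lower() for t in re.split(r"[^A-Za-z0-9]+", spaced) if t}


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 if either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_match(column_name, glossary):
    """Return the glossary label whose label+description overlaps most
    with the column name. `glossary` maps label -> description."""
    col_tokens = tokenize(column_name)
    return max(
        glossary,
        key=lambda label: jaccard(
            col_tokens, tokenize(label + " " + glossary[label])
        ),
    )


def build_prompt(column_name, label, description):
    """Hypothetical yes/no prompt for direct LLM-based matching."""
    return (
        "Does the table column below refer to the glossary entry?\n"
        f"Column name: {column_name}\n"
        f"Glossary label: {label}\n"
        f"Glossary description: {description}\n"
        "Answer with 'yes' or 'no'."
    )


# Hypothetical glossary used only for illustration.
glossary = {
    "Customer Identifier": "Unique identifier assigned to a customer",
    "Order Date": "Date on which the order was placed",
}
match = best_match("customer_id", glossary)
prompt = build_prompt("customer_id", match, glossary[match])
```

In this sketch the abbreviation "id" never matches the token "identifier", so the baseline succeeds on "customer_id" only through the shared token "customer". That is the kind of gap that generating additional context for column names, or asking an LLM directly, is meant to close.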
Source: https://arxiv.org/abs/2309.11506