Large Language Models for Synthetic Tabular Health Data: A Benchmark Study
Version
Published
Date Issued
2024-08-22
Author(s)
Miletic, M.
Sariyar, M.
Type
Book Chapter
Language
English
Abstract
Synthetic tabular health data play a crucial role in healthcare research because they help address privacy regulations and the scarcity of publicly available datasets. Addressing these constraints is essential for advances in diagnostics and treatment. Among the most promising generative models are transformer-based Large Language Models (LLMs) and Generative Adversarial Networks (GANs). In this paper, we compare models from the Pythia LLM Scaling Suite, with sizes ranging from 14M to 1B parameters, against a reference GAN model (CTGAN). The generated synthetic data are used to train random forest classifiers, which are then used to make predictions on the real-world data. Our findings indicate that, as the number of parameters increases, the LLMs outperform the reference GAN model, and even the smallest 14M-parameter models perform comparably to it. Moreover, we observe a positive correlation between the size of the training dataset and model performance. We discuss implications, challenges, and considerations for the real-world use of LLMs for synthetic tabular data generation.
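The evaluation protocol described in the abstract (train a classifier on synthetic data, then score it on real data) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function name `tstr_accuracy`, the stand-in `OneNearestNeighbour` estimator, the toy data, and the choice of accuracy as the metric are all assumptions made here so that the example runs with the Python standard library alone. The study itself used random forest estimators.

```python
"""Train on synthetic, test on real (TSTR): a minimal stdlib-only sketch."""


class OneNearestNeighbour:
    """Hypothetical stand-in estimator with a fit/predict interface.

    The chapter uses random forests; this tiny classifier only keeps the
    example self-contained.
    """

    def fit(self, X, y):
        # Memorise the (synthetic) training rows and their labels.
        self._X = [tuple(row) for row in X]
        self._y = list(y)
        return self

    def predict(self, X):
        def sqdist(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))

        preds = []
        for row in X:
            # Index of the closest memorised row (first one wins on ties).
            nearest = min(range(len(self._X)),
                          key=lambda i: sqdist(self._X[i], row))
            preds.append(self._y[nearest])
        return preds


def tstr_accuracy(model, X_syn, y_syn, X_real, y_real):
    """Fit `model` on synthetic data only, then return accuracy on real data."""
    model.fit(X_syn, y_syn)
    preds = model.predict(X_real)
    correct = sum(p == t for p, t in zip(preds, y_real))
    return correct / len(y_real)


# Toy data (invented for illustration): two features, binary label.
X_syn = [[0.1, 0.2], [0.0, 0.4], [0.9, 1.0], [1.1, 0.8]]
y_syn = [0, 0, 1, 1]
X_real = [[0.2, 0.1], [1.0, 0.9], [0.8, 1.2]]
y_real = [0, 1, 1]

score = tstr_accuracy(OneNearestNeighbour(), X_syn, y_syn, X_real, y_real)
print(f"TSTR accuracy: {score:.2f}")
```

Because `tstr_accuracy` only relies on `fit` and `predict`, any estimator exposing that interface can be passed in; with scikit-learn installed, `RandomForestClassifier()` would be the closer match to the setup named in the abstract.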
Publisher DOI
Journal or Series
Studies in Health Technology and Informatics
ISSN
1879-8365
Publisher URL
Volume
316
Publisher
IOS Press
Submitter
Sariyar, Murat
Citation (APA)
Miletic, M., & Sariyar, M. (2024). Large Language Models for Synthetic Tabular Health Data: A Benchmark Study. In Studies in Health Technology and Informatics (Vol. 316, pp. 963–967). IOS Press. https://doi.org/10.24451/dspace/11495
File(s)
open access
Name
LLM_MIE_2024.pdf
License
Attribution-NonCommercial 4.0 International
Version
published
Size
169.7 KB
Format
Adobe PDF
Checksum (MD5)
c383d3e251acdb40ee2700ec82f56a0c
