Large Language Models for Synthetic Tabular Health Data: A Benchmark Study

URI
https://arbor.bfh.ch/handle/arbor/44700
Version
Published
Date Issued
2024-08-22
Author(s)
Miletic, Marko  
Sariyar, Murat  
Type
Book Chapter
Language
English
Subjects
GAN
Synthetic data genera...
large language models...
tabular data
Abstract
Synthetic tabular health data play a crucial role in healthcare research, addressing privacy regulations and the scarcity of publicly available datasets. Such data are essential for diagnostic and treatment advancements. Among the most promising generative models are transformer-based Large Language Models (LLMs) and Generative Adversarial Networks (GANs). In this paper, we compare LLMs from the Pythia LLM Scaling Suite, with model sizes ranging from 14M to 1B parameters, against a reference GAN model (CTGAN). The generated synthetic data are used to train random forest estimators for classification tasks, which then make predictions on the real-world data. Our findings indicate that as the number of parameters increases, the LLMs outperform the reference GAN model. Even the smallest 14M-parameter models perform comparably to GANs. Moreover, we observe a positive correlation between the size of the training dataset and model performance. We discuss implications, challenges, and considerations for the real-world usage of LLMs for synthetic tabular data generation.
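The evaluation described in the abstract (train a classifier on synthetic rows, then score it on real rows) can be illustrated with a minimal sketch. This is not the authors' code: the helper name `tstr_accuracy`, the toy dataset from scikit-learn's `make_classification`, and the noise-based stand-in for a generator are all assumptions made for illustration. In the study itself, the synthetic rows would come from CTGAN or a Pythia model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def tstr_accuracy(X_synth, y_synth, X_real, y_real, seed=0):
    """Train a random forest on synthetic rows and score it on real rows."""
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    clf.fit(X_synth, y_synth)
    return accuracy_score(y_real, clf.predict(X_real))


# Toy stand-in for a real tabular dataset (NOT data from the study).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Hypothetical "generator": jitter the real training rows with Gaussian noise.
# A real pipeline would sample rows from a fitted CTGAN or LLM instead.
rng = np.random.default_rng(0)
X_synth = X_train + rng.normal(scale=0.1, size=X_train.shape)
y_synth = y_train.copy()

synthetic_score = tstr_accuracy(X_synth, y_synth, X_test, y_test)
real_score = tstr_accuracy(X_train, y_train, X_test, y_test)
print(f"train-on-synthetic accuracy: {synthetic_score:.3f}")
print(f"train-on-real accuracy:      {real_score:.3f}")
```

Comparing the two scores gives a rough sense of how much predictive utility is lost by training on generated data instead of the real training set. The choice of accuracy as the metric is also an assumption of this sketch.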
DOI
https://doi.org/10.24451/dspace/11495
Publisher DOI
10.3233/SHTI240571
Journal or Series
Studies in Health Technology and Informatics
ISSN
1879-8365
Publisher URL
https://ebooks.iospress.nl/doi/10.3233/SHTI240571
Organization
Technik und Informatik  
Institut für Optimierung und Datenanalyse IODA  
Volume
316
Publisher
IOS Press
Submitter
Sariyar, Murat
Citation (APA)
Miletic, M., & Sariyar, M. (2024). Large Language Models for Synthetic Tabular Health Data: A Benchmark Study. In Studies in Health Technology and Informatics (Vol. 316, pp. 963–967). IOS Press. https://doi.org/10.24451/dspace/11495
File(s)
open access
Name
LLM_MIE_2024.pdf
License
Attribution-NonCommercial 4.0 International
Version
published
Size
169.7 KB
Format
Adobe PDF
Checksum (MD5)
c383d3e251acdb40ee2700ec82f56a0c