Synthetic data for pharmacogenetics: enabling scalable and secure research

URI
https://arbor.bfh.ch/handle/arbor/46322
Version
Published
Identifiers
10.1093/jamiaopen/ooaf107
Date Issued
2025-10
Author(s)
Miletic, Marko  
Bollinger, Anna
Allemann, Samuel S.
Sariyar, Murat  
Type
Article
Language
English
Subjects

artificial intelligen...

data privacy

genomic data

pharmacogenetics

synthetic data

Abstract
Objective: This study evaluates the performance of 7 synthetic data generation (SDG) methods (synthpop, avatar, copula, copulagan, ctgan, tvae, and the large language model-based tabula) for supporting pharmacogenetics (PGx) research.

Materials and methods: We used PGx profiles from 142 patients with adverse drug reactions or therapeutic failures, considering 2 scenarios: (1) a high-dimensional genotype dataset (104 variables) and (2) a phenotype dataset (24 variables). Models were assessed for (1) broad utility using propensity score mean squared error (pMSE), (2) specific utility via weighted F1 score in a Train-Synthetic-Test-Real framework, and (3) privacy risk as ε-identifiability.
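
As an illustration of the broad-utility metric named above, the following is a minimal sketch of pMSE under its common definition: a classifier is trained to distinguish real from synthetic records, and pMSE is the mean squared deviation of its predicted propensity scores from the synthetic share of the pooled data. The abstract does not state which propensity model the authors used, so the function name `pmse` and its inputs are assumptions for illustration only; the scores are taken as given rather than fitted here.

```python
def pmse(propensity_scores, n_synthetic):
    """Propensity score mean squared error (sketch, hypothetical helper).

    propensity_scores: predicted probability that each pooled record
        (real and synthetic together) is synthetic, from any classifier.
    n_synthetic: number of synthetic records in the pooled data.
    """
    n_total = len(propensity_scores)
    if n_total == 0:
        raise ValueError("propensity_scores must not be empty")
    # Expected score if real and synthetic records are indistinguishable.
    c = n_synthetic / n_total
    return sum((p - c) ** 2 for p in propensity_scores) / n_total
```

A value of 0 means the classifier cannot tell the two sources apart; with equal numbers of real and synthetic records, the maximum is 0.25, reached when every record is classified with full certainty.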

Results: Copula and synthpop consistently achieved strong performance across both datasets, combining low ε-identifiability (0.25-0.35) with competitive utility. Deep learning models like tabula and tvae trained for 10 000 epochs achieved lower pMSE but had higher ε-identifiability (>0.4) and limited gains in predictive performance. Specific utility was only weakly correlated with broad utility, indicating that distributional fidelity does not ensure predictive relevance. Copula and synthpop often outperformed original data in weighted F1 scores, especially under noise or data imbalance.

Discussion: While deep learning models can achieve high distributional fidelity (pMSE), they often incur elevated ε-identifiability, raising privacy concerns. Traditional methods like copula and synthpop consistently offer robust utility and lower re-identification risk, particularly for high-dimensional data. Importantly, general utility does not predict specific utility (F1 score), emphasizing the need for multimetric evaluation.
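
To make the specific-utility metric concrete, here is a minimal sketch of a support-weighted F1 score as it would be applied in a Train-Synthetic-Test-Real setting: a model is fitted on synthetic records, and its predictions on held-out real records are scored against the real labels. The helper name `weighted_f1` and the label values are hypothetical; the abstract does not describe the authors' classifier or implementation.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's share of y_true.

    y_true: labels of the held-out real records.
    y_pred: predictions from a model trained on synthetic records.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        # Weight each class by its frequency among the real labels.
        score += (n / total) * f1
    return score
```

Because each class contributes in proportion to its frequency, the score is dominated by common classes, which is worth keeping in mind when the labels are imbalanced.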

Conclusion: No single SDG method dominated across all criteria. For privacy-sensitive PGx applications, classical methods such as copula and synthpop offer a reliable trade-off between utility and privacy, making them preferable for high-dimensional, limited-sample settings.
Publisher DOI
10.1093/jamiaopen/ooaf107
Journal
JAMIA open
ISSN
2574-2531
Publisher URL
https://academic.oup.com/jamiaopen/article/8/5/ooaf107/8271916
Organization
Technik und Informatik  
Institut für Optimierung und Datenanalyse IODA  
Volume
8
Issue
5
Publisher
Oxford University Press
Submitter
Sariyar, Murat
Citation apa
Miletic, M., Bollinger, A., Allemann, S. S., & Sariyar, M. (2025). Synthetic data for pharmacogenetics: enabling scalable and secure research. In JAMIA open (Vol. 8, Issue 5). Oxford University Press. https://arbor.bfh.ch/handle/arbor/46322
File(s)
Name

ooaf107.pdf

License
Attribution 4.0 International
Version
published
Size

1.06 MB

Format

Adobe PDF

Checksum (MD5)

213574bf46cb3e3daf58341749ab9cc3
