Synthetic data for pharmacogenetics: enabling scalable and secure research
Version
Published
Identifiers
10.1093/jamiaopen/ooaf107
Date Issued
2025-10
Author(s)
Type
Article
Language
English
Abstract
Objective: This study evaluates the performance of 7 synthetic data generation (SDG) methods (synthpop, avatar, copula, copulagan, ctgan, tvae, and the large language model-based tabula) for supporting pharmacogenetics (PGx) research.
Materials and methods: We used PGx profiles from 142 patients with adverse drug reactions or therapeutic failures, considering 2 scenarios: (1) a high-dimensional genotype dataset (104 variables) and (2) a phenotype dataset (24 variables). Models were assessed for (1) broad utility using propensity score mean squared error (pMSE), (2) specific utility via weighted F1 score in a Train-Synthetic-Test-Real framework, and (3) privacy risk as ε-identifiability.
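As a reading aid for the broad utility metric named above, the following is a minimal sketch of how pMSE is commonly defined: the mean squared deviation of each row's estimated propensity (the probability of being synthetic) from the share of synthetic rows in the combined data. The function name, the toy propensity scores, and the 50/50 mix of real and synthetic rows are assumptions made for illustration; the record does not state which propensity model the study used.

```python
def pmse(propensity_scores, synthetic_share):
    """Mean squared deviation of propensity scores from the synthetic share.

    propensity_scores: estimated probability, per row of the combined
    real + synthetic data, that the row is synthetic (hypothetical input;
    any classifier could produce these).
    synthetic_share: fraction of rows in the combined data that are synthetic.
    """
    n = len(propensity_scores)
    return sum((p - synthetic_share) ** 2 for p in propensity_scores) / n


# Toy case: equal numbers of real and synthetic rows, so the share is 0.5.
# A classifier that cannot tell the two apart scores every row at 0.5.
indistinguishable = [0.5] * 8
# A classifier that separates them perfectly scores rows at 0.0 or 1.0.
separable = [0.0] * 4 + [1.0] * 4

pmse_indistinguishable = pmse(indistinguishable, 0.5)
pmse_separable = pmse(separable, 0.5)
print(pmse_indistinguishable, pmse_separable)
```

Lower values mean the synthetic rows are harder to distinguish from the real ones. With a balanced mix, the value ranges from 0 (indistinguishable) to 0.25 (perfectly separable).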
Results: Copula and synthpop consistently achieved strong performance across both datasets, combining low ε-identifiability (0.25-0.35) with competitive utility. Deep learning models like tabula and tvae trained for 10 000 epochs achieved lower pMSE but had higher ε-identifiability (>0.4) and limited gains in predictive performance. Specific utility was only weakly correlated with broad utility, indicating that distributional fidelity does not ensure predictive relevance. Copula and synthpop often outperformed original data in weighted F1 scores, especially under noise or data imbalance.
Discussion: While deep learning models can achieve high distributional fidelity (pMSE), they often incur elevated ε-identifiability, raising privacy concerns. Traditional methods like copula and synthpop consistently offer robust utility and lower re-identification risk, particularly for high-dimensional data. Importantly, general utility does not predict specific utility (F1 score), emphasizing the need for multimetric evaluation.
Conclusion: No single SDG method dominated across all criteria. For privacy-sensitive PGx applications, classical methods such as copula and synthpop offer a reliable trade-off between utility and privacy, making them preferable for high-dimensional, limited-sample settings.
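The specific utility measure mentioned in the abstract, weighted F1 in a Train-Synthetic-Test-Real setup, can be illustrated with a small sketch: a model is fitted on synthetic rows only and scored on real rows, with per-class F1 averaged by class frequency in the real labels. Everything concrete here is an assumption for illustration: the 1-nearest-neighbour classifier, the Hamming distance, the two-variable toy genotypes, and the "normal"/"poor" labels are hypothetical and are not taken from the study.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's share of y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_label in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n_label - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n_label / total) * f1
    return score


def predict_1nn(train_x, train_y, row):
    """Label of the closest training row under Hamming distance (first match on ties)."""
    distances = [sum(a != b for a, b in zip(x, row)) for x in train_x]
    return train_y[distances.index(min(distances))]


# Hypothetical toy data: two categorical variables per patient.
synthetic_x = [("A", "A"), ("A", "B"), ("B", "B"), ("B", "A")]
synthetic_y = ["normal", "normal", "poor", "poor"]
real_x = [("A", "A"), ("A", "B"), ("B", "B"), ("B", "A"), ("A", "A")]
real_y = ["normal", "normal", "poor", "normal", "normal"]

# Train on synthetic rows, test on real rows.
real_pred = [predict_1nn(synthetic_x, synthetic_y, row) for row in real_x]
tstr_score = weighted_f1(real_y, real_pred)
print(round(tstr_score, 3))
```

Because the weights come from the real label distribution, a frequent class dominates the score. This is one reason a method can score well on a distributional measure such as pMSE and still differ in predictive usefulness.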
Publisher DOI
Journal
JAMIA open
ISSN
2574-2531
Volume
8
Issue
5
Publisher
Oxford University Press
Submitter
Sariyar, Murat
Citation apa
Miletic, M., Bollinger, A., Allemann, S. S., & Sariyar, M. (2025). Synthetic data for pharmacogenetics: enabling scalable and secure research. In JAMIA open (Vol. 8, Issue 5). Oxford University Press. https://arbor.bfh.ch/handle/arbor/46322
File(s)
Name
ooaf107.pdf
License
Attribution 4.0 International
Version
published
Size
1.06 MB
Format
Adobe PDF
Checksum (MD5)
213574bf46cb3e3daf58341749ab9cc3
