Benchmarking speech-to-text robustness in noisy emergency medical dialogues: an evaluation of models under realistic acoustic conditions

URI
https://arbor.bfh.ch/handle/arbor/46330
Version
Published
Identifiers
10.1093/jamiaopen/ooaf147
Date Issued
2025-12
Author(s)
Moser, Denis Sumin  
Stanic, Nikola
Sariyar, Murat  
Type
Article
Language
English
Subjects

clinical documentatio...

emergency medical ser...

speech recognition

speech-to-text

word error rate

Abstract
Objectives: To evaluate the transcription accuracy of 6 German-capable speech-to-text (STT) systems in simulated emergency medical services (EMS) environments, focusing on clinically relevant performance under noisy and multilingual field conditions.

Materials and methods: We generated a corpus of 99 synthetic emergency dialogues and overlaid them with 4 ecologically valid noise types (crowd chatter, traffic, public spaces, and ambulance interiors) at 5 signal-to-noise ratios (SNRs), producing 1980 noisy audio samples (99 dialogues × 4 noise types × 5 SNRs). Each sample was transcribed by 6 STT systems (recapp, Vosk, Whisper v3 variants, and RescueSpeech). We assessed performance using 5 metrics: Word Error Rate (WER), Medical Word Error Rate (mWER), TF-IDF Cosine Similarity, BLEU, and semantic embedding similarity. Statistical models quantified the effects of system, noise, and SNR on transcription fidelity.
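As a rough illustration of the first two metrics, the sketch below computes WER as the word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words. The abstract does not define mWER, so `medical_wer` is only one plausible reading: it assumes the metric is a WER restricted to tokens from a medical vocabulary. The function names, the `medical_terms` set, and the lowercase whitespace tokenization are all assumptions made for this example, not the authors' implementation.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: edit distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def medical_wer(reference, hypothesis, medical_terms):
    """Hypothetical mWER: WER computed only over tokens in medical_terms.

    This is an assumed definition for illustration; the paper may
    define mWER differently.
    """
    ref = [w for w in reference.lower().split() if w in medical_terms]
    hyp = [w for w in hypothesis.lower().split() if w in medical_terms]
    if not ref:
        raise ValueError("reference contains no medical terms")
    return edit_distance(ref, hyp) / len(ref)


# Invented example sentences, not taken from the study corpus.
reference = "der patient hat starke brustschmerzen"
hypothesis = "der patient hat starke kopfschmerzen"
terms = {"brustschmerzen", "kopfschmerzen"}

overall = wer(reference, hypothesis)
medical = medical_wer(reference, hypothesis, terms)
```

In this invented example a single substituted word gives a low overall WER, while the restricted variant counts the only medical term as wrong. That gap is the kind of difference a domain-specific metric is meant to expose.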

Results: recapp consistently outperformed all other systems across metrics. Among open-source models, Whisper v3 Turbo achieved the lowest mWER and strongest phrase-level accuracy (BLEU), while Whisper v3 Large preserved semantic content best. RescueSpeech and Vosk underperformed. "Inside crowded" noise degraded performance the most, while "talking" noise had minimal effect. Performance degradation was most pronounced at the lowest SNR (-2 dB).
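To put the -2 dB condition in concrete terms: under the standard definition SNR = 10·log10(P_speech / P_noise), a negative SNR means the noise carries more power than the speech. The sketch below scales a noise signal to hit a target SNR and reports the implied noise-to-speech power ratio. The abstract does not describe the authors' mixing procedure, so `mix_at_snr` and the toy sample values are assumptions for illustration only.

```python
import math


def mean_power(samples):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so speech/noise power matches snr_db."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    p_speech = mean_power(speech)
    p_noise = mean_power(noise)
    if p_noise == 0:
        raise ValueError("noise signal has zero power")
    # Noise power required for the requested SNR.
    target_p_noise = p_speech / (10 ** (snr_db / 10))
    gain = math.sqrt(target_p_noise / p_noise)
    return [s + gain * n for s, n in zip(speech, noise)]


def noise_to_speech_power_ratio(snr_db):
    """How many times stronger the noise power is than the speech power."""
    return 10 ** (-snr_db / 10)


# Toy signals (invented values), mixed at the lowest SNR named in the abstract.
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
mixed = mix_at_snr(speech, noise, snr_db=-2)

# Recover the scaled noise and check the achieved SNR.
scaled_noise = [m - s for m, s in zip(mixed, speech)]
achieved_snr_db = 10 * math.log10(mean_power(speech) / mean_power(scaled_noise))
ratio = noise_to_speech_power_ratio(-2)
```

At -2 dB the noise power is roughly 1.6 times the speech power, which is consistent with the strongest degradation being reported at that level.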

Discussion: STT model accuracy is highly sensitive to acoustic conditions. Clinically salient transcription errors (mWER) were most frequent under dense environmental noise. Whisper v3 Turbo balances accuracy and efficiency, suggesting strong potential for EMS applications.

Conclusion: This study introduces a clinically grounded, noise-robust benchmark for STT evaluation in EMS settings. It highlights the importance of domain-specific metrics and acoustic realism for deploying STT systems where transcription errors carry safety-critical consequences.
DOI
https://doi.org/10.24451/arbor.12696
Publisher DOI
10.1093/jamiaopen/ooaf147
Journal or Series
JAMIA Open
ISSN
2574-2531
Publisher URL
https://academic.oup.com/jamiaopen/article/8/6/ooaf147/8327118
Organization
Technik und Informatik  
Institut für Optimierung und Datenanalyse IODA  
Volume
8
Issue
6
Publisher
Oxford University Press
Submitter
Sariyar, Murat
Citation (APA)
Moser, D. S., Stanic, N., & Sariyar, M. (2025). Benchmarking speech-to-text robustness in noisy emergency medical dialogues: an evaluation of models under realistic acoustic conditions. In JAMIA Open (Vol. 8, Issue 6). Oxford University Press. https://doi.org/10.24451/arbor.12696
File(s)
Name

ooaf147.pdf

License
Attribution 4.0 International
Version
published
Size

1.04 MB

Format

Adobe PDF

Checksum (MD5)

0f0b040806bf6fb29100736002b3e80d
