Evaluation Metrics for Health Chatbots: A Delphi Study

URI
https://arbor.bfh.ch/handle/arbor/42891
Version
Published
Date Issued
2021
Author(s)
Denecke, Kerstin  
Abd-Alrazaq, Alaa
Househ, Mowafa
Warren, Jim
Type
Article
Language
English
Abstract
Background In recent years, an increasing number of health chatbots has been published in app stores and described in research literature. Given the sensitive data they are processing and the care settings for which they are developed, evaluation is essential to avoid harm to users. However, evaluations of those systems are reported inconsistently and without using a standardized set of evaluation metrics. Missing standards in health chatbot evaluation prevent comparisons of systems, and this may hamper acceptability since their reliability is unclear.
Objectives The objective of this paper is to make an important step toward developing a health-specific chatbot evaluation framework by finding consensus on relevant metrics.
Methods We used an adapted Delphi study design to verify and select potential metrics that we retrieved initially from a scoping review. We invited researchers, health professionals, and health informaticians to score each metric for inclusion in the final evaluation framework, over three survey rounds. We distinguished metrics scored relevant with high, moderate, and low consensus. The initial set of metrics comprised 26 metrics (categorized as global metrics, metrics related to response generation, response understanding and aesthetics).
Results Twenty-eight experts joined the first round and 22 (75%) persisted to the third round. Twenty-four metrics achieved high consensus and three metrics achieved moderate consensus. The core set for our framework comprises mainly global metrics (e.g., ease of use, security, content accuracy), metrics related to response generation (e.g., appropriateness of responses), and metrics related to response understanding. Metrics on aesthetics (font type and size, color) are less well agreed upon; only moderate or low consensus was achieved for those metrics.
Conclusion The results indicate that experts largely agree on metrics and that the consensus set is broad. This implies that health chatbot evaluation must be multifaceted to ensure acceptability.
Subjects
Q Science (General)
T Technology (General)
DOI
10.24451/arbor.15682
https://doi.org/10.24451/arbor.15682
Publisher DOI
10.1055/s-0041-1736664
Journal or Series
Methods of Information in Medicine
ISSN
0026-1270
Organization
Institute for Patient-centered Digital Health  
Technik und Informatik  
Volume
60
Issue
05/06
Publisher
Thieme
Submitter
Denecke, Kerstin
Citation apa
Denecke, K., Abd-Alrazaq, A., Househ, M., & Warren, J. (2021). Evaluation Metrics for Health Chatbots: A Delphi Study. In Methods of Information in Medicine (Vol. 60, Issue 05/06, pp. 171–179). Thieme. https://doi.org/10.24451/arbor.15682
File(s)
restricted
Name
MIM_21010075.pdf
License
Publisher
Version
published
Size
1.09 MB
Format
Adobe PDF
Checksum (MD5)
6b1cb0d77d871c73e54c63fe6c35c2af
