SCALE: Scaling up the Complexity for Advanced Language Model Evaluation

URI
https://arbor.bfh.ch/handle/arbor/36354
Version
Published
Date Issued
2023-06-15
Author(s)
Rasiah, Vishvaksenan
Stern, Ronja
Matoshi, Veton  
Stürmer, Matthias  
Chalkidis, Ilias
Ho, Daniel E
Niklaus, Joël  
Type
Conference Paper
Language
English
Abstract
Recent strides in Large Language Models (LLMs) have saturated many NLP benchmarks (even professional domain-specific ones), emphasizing the need for novel, more challenging ones to properly assess LLM capabilities. In this paper, we introduce a novel NLP benchmark that poses challenges to current LLMs across four key dimensions: processing long documents (up to 50K tokens), utilizing domain-specific knowledge (embodied in legal texts), multilingual understanding (covering five languages), and multitasking (comprising legal document-to-document Information Retrieval, Court View Generation, Leading Decision Summarization, Citation Extraction, and eight challenging Text Classification tasks). Our benchmark comprises diverse legal NLP datasets from the Swiss legal system, allowing for a comprehensive study of the underlying non-English, inherently multilingual, federal legal system. Despite recent advances, efficiently processing long documents for intense review/analysis tasks remains an open challenge for language models. Also, comprehensive, domain-specific benchmarks requiring high expertise to develop are rare, as are multilingual benchmarks. This scarcity underscores our contribution’s value, considering most public models are trained predominantly on English corpora, while other languages remain understudied, particularly for practical domain-specific NLP tasks. Our benchmark allows for testing and advancing state-of-the-art LLMs. As part of our study, we evaluate several pre-trained multilingual language models on our benchmark to establish strong baselines as a point of reference. Despite the large size of our datasets (tens to hundreds of thousands of examples), existing publicly available models struggle with most tasks, even after in-domain pretraining. We publish all resources (benchmark suite, pre-trained models, code) under a fully permissive open CC BY-SA license.
DOI
10.24451/arbor.19713
https://doi.org/10.24451/arbor.19713
Publisher DOI
10.48550/arXiv.2306.09237
Publisher URL
https://arxiv.org/abs/2306.09237
Organization
Institut Public Sector Transformation (IPST)  
Data and Infrastructure  
Wirtschaft  
Conference
Data-centric Machine Learning Workshop (DMLR) @ International Conference on Learning Representations (ICLR)
Submitter
VerbruggenS
Citation (APA)
Rasiah, V., Stern, R., Matoshi, V., Stürmer, M., Chalkidis, I., Ho, D. E., & Niklaus, J. (2023). SCALE: Scaling up the Complexity for Advanced Language Model Evaluation (pp. 1–40). https://doi.org/10.24451/arbor.19713
File(s)
open access
Name
2306.09237.pdf
License
Attribution 4.0 International
Size
4.72 MB
Format
Adobe PDF
Checksum (MD5)
395ca2c0c9d3533207ec00c33226ad2e
