MultiLegalPile: A 689GB Multilingual Legal Corpus

URI
https://arbor.bfh.ch/handle/arbor/37247
Version
Published
Date Issued
2024-06-03
Author(s)
Niklaus, Joël  
Matoshi, Veton  
Stürmer, Matthias  
Chalkidis, Ilias
Ho, Daniel E
Type
Conference Paper
Language
English
Abstract
Large, high-quality datasets are crucial for training Large Language Models (LLMs). However, so far, there are few datasets available for specialized critical domains such as law and the available ones are often only for the English language. We curate and release MULTILEGALPILE, a 689GB corpus in 24 languages from 17 jurisdictions. The MULTILEGALPILE corpus, which includes diverse legal data sources with varying licenses, allows for pretraining NLP models under fair use, with more permissive licenses for the Eurlex Resources and Legal mC4 subsets. We pretrain two RoBERTa models and one Longformer multilingually, and 24 monolingual models on each of the language-specific subsets and evaluate them on LEXTREME. Additionally, we evaluate the English and multilingual models on LexGLUE. Our multilingual models set a new SotA on LEXTREME and our English models on LexGLUE. We release the dataset, the trained models, and all of the code under the most open possible licenses.
DOI
10.24451/arbor.19714
https://doi.org/10.24451/arbor.19714
Publisher URL
https://2024.aclweb.org/program/main_conference_papers/
Related URL
https://arxiv.org/abs/2306.02069
Organization
Institut Public Sector Transformation (IPST)  
Data and Infrastructure  
Wirtschaft  
Conference
Annual Meeting of the Association for Computational Linguistics (ACL)
Submitter
VerbruggenS
Citation (APA)
Niklaus, J., Matoshi, V., Stürmer, M., Chalkidis, I., & Ho, D. E. (2024). MultiLegalPile: A 689GB Multilingual Legal Corpus. Annual Meeting of the Association for Computational Linguistics (ACL). https://doi.org/10.24451/arbor.19714
File(s)
open access
Name
2306.02069.pdf
License
Attribution 4.0 International
Size
1.58 MB
Format
Adobe PDF
Checksum (MD5)
219d8ef3e34434b4e23b3ca5ab1e18fb
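
The MD5 checksum listed above can be used to confirm that a downloaded copy of the PDF matches the file in this record. The following is a minimal sketch that uses only the Python standard library. The helper names `md5_of_file` and `matches_record` and the local file path are assumptions made for illustration, not part of the repository. MD5 is suitable here only as an integrity check against accidental corruption, not as a security guarantee.

```python
import hashlib

# Checksum copied from the "Checksum (MD5)" field of this record.
EXPECTED_MD5 = "219d8ef3e34434b4e23b3ca5ab1e18fb"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, expected=EXPECTED_MD5):
    """Hypothetical helper: compare a local file against the listed checksum."""
    return md5_of_file(path) == expected.lower()
```

Assuming the file was saved under its listed name in the current directory, calling `matches_record("2306.02069.pdf")` returns `True` only if the local copy is byte-identical to the deposited file.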
