Brugger, Tobias; Stürmer, Matthias; Niklaus, Joël (2 May 2023). MultiLegalSBD: A Multilingual Legal Sentence Boundary Detection Dataset. In: 19th International Conference on Artificial Intelligence and Law (ICAIL). Braga, Portugal. 19th-23rd June 2023. DOI: 10.48550/arXiv.2305.01211
Text: 2305.01211.pdf (723kB), available under License Creative Commons: Attribution (CC-BY).
Sentence Boundary Detection (SBD) is one of the foundational building blocks of Natural Language Processing (NLP), with incorrectly split sentences heavily influencing the output quality of downstream tasks. It is a challenging task for algorithms, especially in the legal domain, considering the complex and different sentence structures used. In this work, we curated a diverse multilingual legal dataset consisting of over 130,000 annotated sentences in 6 languages. Our experimental results indicate that the performance of existing SBD models is subpar on multilingual legal data. We trained and tested monolingual and multilingual models based on CRF, BiLSTM-CRF, and transformers, demonstrating state-of-the-art performance. We also show that our multilingual models outperform all baselines in the zero-shot setting on a Portuguese test set. To encourage further research and development by the community, we have made our dataset, models, and code publicly available.
Item Type: | Conference or Workshop Item (Paper) |
---|---|
Division/Institute: | Business School > Institute for Public Sector Transformation > Data and Infrastructure; Business School |
Name: | Brugger, Tobias; Stürmer, Matthias (ORCID: 0000-0001-9038-4041) and Niklaus, Joël (ORCID: 0000-0002-2779-1653) |
Language: | English |
Submitter: | Safiya Verbruggen |
Date Deposited: | 25 Aug 2023 11:56 |
Last Modified: | 25 Aug 2023 11:56 |
Publisher DOI: | 10.48550/arXiv.2305.01211 |
ARBOR DOI: | 10.24451/arbor.19715 |
URI: | https://arbor.bfh.ch/id/eprint/19715 |