MultiLegalSBD: A Multilingual Legal Sentence Boundary Detection Dataset

Brugger, Tobias; Stürmer, Matthias; Niklaus, Joël (2 May 2023). MultiLegalSBD: A Multilingual Legal Sentence Boundary Detection Dataset. In: 19th International Conference on Artificial Intelligence and Law (ICAIL). Braga, Portugal, 19th-23rd June 2023. DOI: 10.48550/arXiv.2305.01211

2305.01211.pdf
Available under License Creative Commons: Attribution (CC-BY).

Sentence Boundary Detection (SBD) is one of the foundational building blocks of Natural Language Processing (NLP), with incorrectly split sentences heavily influencing the output quality of downstream tasks. It is a challenging task for algorithms, especially in the legal domain, given the complex and varied sentence structures used. In this work, we curated a diverse multilingual legal dataset consisting of over 130,000 annotated sentences in 6 languages. Our experimental results indicate that the performance of existing SBD models is subpar on multilingual legal data. We trained and tested monolingual and multilingual models based on CRF, BiLSTM-CRF, and transformers, demonstrating state-of-the-art performance. We also show that our multilingual models outperform all baselines in the zero-shot setting on a Portuguese test set. To encourage further research and development by the community, we have made our dataset, models, and code publicly available.
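
As a rough illustration of why legal text is hard for simple sentence splitters, the sketch below applies a naive punctuation rule to a made-up citation-style string. It is not the authors' method, models, or data: the function name `naive_split` and the example sentence are assumptions introduced here for illustration only.

```python
import re


def naive_split(text: str) -> list[str]:
    """Hypothetical baseline: split after '.', '!' or '?' followed by whitespace.

    This is NOT one of the models described in the abstract; it only shows
    the kind of rule that breaks on legal abbreviations and citations.
    """
    return re.split(r"(?<=[.!?])\s+", text.strip())


# Made-up example with two real sentences; the citation is invented.
example = "See Doe v. Roe, 12 F. Supp. 345. The court agreed."

for piece in naive_split(example):
    print(repr(piece))
```

The two sentences are broken into five fragments, because every abbreviation period ("v.", "F.", "Supp.") is treated as a sentence boundary. Errors of this kind are what annotated legal data and trained models aim to avoid.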

Item Type:

Conference or Workshop Item (Paper)

Division/Institute:

Business School > Institute for Public Sector Transformation > Data and Infrastructure
Business School

Name:

Brugger, Tobias;
Stürmer, Matthias (ORCID: 0000-0001-9038-4041) and
Niklaus, Joël (ORCID: 0000-0002-2779-1653)

Language:

English

Submitter:

Safiya Verbruggen

Date Deposited:

25 Aug 2023 11:56

Last Modified:

25 Aug 2023 11:56

Publisher DOI:

10.48550/arXiv.2305.01211

ARBOR DOI:

10.24451/arbor.19715

URI:

https://arbor.bfh.ch/id/eprint/19715
