Authors: Rolshoven, Luca Sven; Matoshi, Veton; Ellendorff, Tilia; Hostettler, Sarah; Meili, Rahel; Binder, Judith; Stürmer, Matthias
Contributors: Cantini, Riccardo; Ferragina, Luca; Longo, Davide Mario; Nikiforva, Anastasija; Nisticò, Simona; Scarcello, Francesco; Shahbazian, Reza; Thakur, Dipanwita; Trubitsyna, Irina; Varricchio, Giovanna
Dates: 2026-03-11; 2026-03-11; 2026-02-07
ISSN: 1613-0073
Published: 2026-02-07
DOI: https://doi.org/10.24451/arbor.13428
Handle: https://arbor.bfh.ch/handle/arbor/47184
Abstract: Public procurement serves as a significant lever for promoting sustainability, yet effectively assessing the integration of sustainability criteria within diverse and heterogeneous tender documents remains a challenge. This paper presents a Natural Language Processing (NLP) pipeline for automatically identifying sustainability criteria in Swiss public procurement documents written in German. To assess sustainability, we compiled four catalogs of official Sustainable Procurement Criteria (SPCs): three domain-specific (transport, food, furniture) and one domain-independent. Each call for tenders (CFT) document was segmented into sentences and encoded using a pre-trained sentence transformer. We then computed cosine similarity scores between each sentence and all SPCs, storing the top match from both the general and the domain-specific catalog, if applicable. While similarity scores were generally high for a majority of sentences, a preliminary manual inspection suggested that only matches with a score of 0.98 or higher tended to reflect meaningful alignment. To validate this threshold, two human experts independently reviewed 100 randomly sampled sentence-criterion pairs above this threshold. To explore whether this expert validation process could be scaled, we also prompted three different Large Language Models (LLMs) to assess the same samples, classifying each pair as a correct or incorrect match based on a majority vote.
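The matching step described above (cosine similarity between each sentence and all criteria, keeping the top match and filtering at 0.98) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `top_matches` is hypothetical, and the embeddings are assumed to come from some pre-trained sentence encoder that the record does not name, so plain NumPy arrays stand in for them here.

```python
import numpy as np

def top_matches(sentence_vecs, criterion_vecs, threshold=0.98):
    """Return (sentence_idx, criterion_idx, score) for every sentence whose
    best-matching criterion reaches the cosine-similarity threshold.

    Both inputs are 2-D arrays of embeddings (one row per text). How the
    embeddings are produced is an assumption left outside this sketch.
    """
    s = np.asarray(sentence_vecs, dtype=float)
    c = np.asarray(criterion_vecs, dtype=float)
    # Normalise rows so that the dot product equals cosine similarity.
    s = s / np.linalg.norm(s, axis=1, keepdims=True)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    sims = s @ c.T                       # shape: (n_sentences, n_criteria)
    best = sims.argmax(axis=1)           # index of the top criterion per sentence
    scores = sims[np.arange(len(s)), best]
    return [
        (i, int(j), float(v))
        for i, (j, v) in enumerate(zip(best, scores))
        if v >= threshold
    ]

# Toy 2-D vectors standing in for real sentence and criterion embeddings.
toy_sentences = [[1.0, 0.0], [1.0, 1.0], [0.1, 1.0]]
toy_criteria = [[1.0, 0.05], [0.0, 1.0]]
matches = top_matches(toy_sentences, toy_criteria)
```

In this sketch the same function would be called once per catalog (general and, where applicable, domain-specific), which mirrors the "top match from both catalogs" wording of the abstract.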
Our evaluation suggests that a similarity threshold of 0.98 is useful for reducing noise and identifying relevant sustainability criteria. LLM-based validation shows potential as a scalable alternative to human annotation, although performance varies between models. While Gemini 2.0 achieved substantial agreement with the expert judgments in terms of Fleiss’ Kappa (𝜅 = 0.754), other models demonstrated weaker alignment.
Language: en
Keywords: Natural Language Processing; Sentence Similarity; LLM-as-a-Judge; Green AI; Sustainability; Public Procurement
Title: Identifying Sustainability in Public Tendering
Type: conference_item
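The abstract mentions two evaluation ingredients, a majority vote over judgments and Fleiss' Kappa as the agreement statistic. A minimal sketch of both is given below using the standard textbook definition of Fleiss' Kappa. The function names, the label strings, and the toy ratings are illustrative assumptions; the record does not state how the raters were grouped or which tooling was used.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label among the votes for one item."""
    return Counter(labels).most_common(1)[0][0]

def fleiss_kappa(ratings, categories):
    """Fleiss' Kappa for a list of items, each rated by the same number of raters.

    ratings: list of per-item label lists, e.g. [["correct", "correct", "incorrect"], ...]
    categories: all possible labels.
    Undefined (division by zero) if every rating falls into a single category.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][k] = number of raters who assigned category k to item i
    counts = [[row.count(cat) for cat in categories] for row in ratings]
    # Observed agreement per item, averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Expected agreement from the overall category proportions.
    p_cats = [
        sum(row[k] for row in counts) / (n_items * n_raters)
        for k in range(len(categories))
    ]
    p_exp = sum(p * p for p in p_cats)
    return (p_bar - p_exp) / (1 - p_exp)

# Toy data: four sentence-criterion pairs, three raters each (hypothetical labels).
toy_ratings = [
    ["correct", "correct", "correct"],
    ["correct", "correct", "incorrect"],
    ["incorrect", "incorrect", "incorrect"],
    ["incorrect", "incorrect", "correct"],
]
labels = ["correct", "incorrect"]
toy_kappa = fleiss_kappa(toy_ratings, labels)
toy_votes = [majority_vote(row) for row in toy_ratings]
```

With an odd number of raters and two labels, as in the toy data, the majority vote cannot tie; other setups would need an explicit tie-breaking rule.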