The Knesset Corpus Project

The Knesset Corpus is a significant linguistic resource for Modern Hebrew, containing over 30 million sentences from plenary and committee protocols spanning the years 1992–2024 (Goldin, Howell, Ordan, Rabinovich & Wintner, 2025). However, the corpus currently faces an accessibility gap: the original Kibana-based interface is no longer operational. While the corpus is available on Hugging Face, it is primarily distributed in raw or partially processed formats that require advanced programming skills to navigate. This project proposes to bridge this gap by processing the Committees section – a more linguistically diverse and spontaneous subset of the corpus – and migrating it to Sketch Engine.

The corpus will be enriched with two layers of data: linguistic and socio-political. The linguistic layer will be produced with HebPipe, a state-of-the-art Hebrew NLP pipeline that provides improved morphological and syntactic analysis compared to earlier models. The two layers are as follows:

• Morphological and syntactic analysis: tokenization, lemmatization, POS tagging, and morphological features, along with UD dependency relations (e.g., nsubj, obj, root) to capture "who did what to whom."

• Socio-political metadata: each utterance will be linked to person and faction objects in the Knesset database, preserving attributes such as gender, faction affiliation, Coalition/Opposition status, and political orientation (e.g., Left, Right, Arab, Ultra-Orthodox).

The annotated data will then be compiled into the Sketch Engine platform, which is accessible to researchers without specialized technical skills. This infrastructure supports searches for complex patterns, such as finding all instances where female opposition members use the hedging expression נראה לי ("it seems to me") in a committee meeting.
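To make the two layers concrete, the following is a minimal sketch of how one annotated utterance might be rendered in the vertical (one token per line) text format that Sketch Engine compiles. It assumes the HebPipe analysis is available as CoNLL-U. The `Utterance` class, the `utt` structure name, the attribute names (`speaker`, `gender`, `faction`, `status`), and the column order are hypothetical choices made for illustration, not the project's actual schema. The sample analysis is hand-written and simplified, so real HebPipe output may segment and label the tokens differently.

```python
from dataclasses import dataclass
from xml.sax.saxutils import quoteattr


@dataclass
class Utterance:
    """One speaker turn with hypothetical metadata fields (assumed names)."""
    speaker: str
    gender: str
    faction: str
    status: str   # e.g. "coalition" or "opposition"
    conllu: str   # CoNLL-U analysis of the turn (10 tab-separated columns)


def conllu_to_vertical(utt: Utterance) -> str:
    """Render an utterance as vertical text: metadata on the <utt> tag,
    one token per line with tab-separated positional attributes."""
    attrs = {
        "speaker": utt.speaker,
        "gender": utt.gender,
        "faction": utt.faction,
        "status": utt.status,
    }
    attr_str = " ".join(f"{key}={quoteattr(value)}" for key, value in attrs.items())
    lines = [f"<utt {attr_str}>"]
    in_sentence = False
    for raw in utt.conllu.splitlines():
        if not raw.strip():
            # A blank line ends the current sentence in CoNLL-U.
            if in_sentence:
                lines.append("</s>")
                in_sentence = False
            continue
        if raw.startswith("#"):
            continue  # skip sentence-level comment lines
        cols = raw.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 CoNLL-U columns, got {len(cols)}")
        if "-" in cols[0] or "." in cols[0]:
            # Skip multiword-token ranges (e.g. "2-3") and empty nodes;
            # only the syntactic words are kept. This is a design assumption.
            continue
        if not in_sentence:
            lines.append("<s>")
            in_sentence = True
        form, lemma, upos, feats, head, deprel = (
            cols[1], cols[2], cols[3], cols[5], cols[6], cols[7]
        )
        lines.append("\t".join([form, lemma, upos, feats, deprel, head]))
    if in_sentence:
        lines.append("</s>")
    lines.append("</utt>")
    return "\n".join(lines)


# Hand-written toy analysis of "נראה לי" (illustrative only).
rows = [
    ["1", "נראה", "נראה", "VERB", "_", "_", "0", "root", "_", "_"],
    ["2-3", "לי", "_", "_", "_", "_", "_", "_", "_", "_"],
    ["2", "ל", "ל", "ADP", "_", "_", "3", "case", "_", "_"],
    ["3", "י", "הוא", "PRON", "_", "_", "1", "obl", "_", "_"],
]
sample = "\n".join("\t".join(row) for row in rows) + "\n"

utt = Utterance(
    speaker="MK_A",          # placeholder speaker identifier
    gender="female",
    faction="Faction_X",     # placeholder faction name
    status="opposition",
    conllu=sample,
)
vertical = conllu_to_vertical(utt)
```

Here `vertical` holds an opening `<utt …>` tag carrying the metadata, one `<s>` block with three token lines, and the closing tags. Keeping the socio-political attributes on a structural tag rather than on each token is what would let a query for נראה לי be restricted to utterances by female opposition members. The attribute and structure names would also have to be declared in the corpus configuration before compilation.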

The project will result in a publicly available, linguistically annotated committee corpus tailored for researchers in various disciplines, including:

• Linguistics: The corpus supports the study of morphosyntactic variation, lexical change, register, and interactional features. Because the committee protocols are comparatively unscripted, they also offer a rare window into speech-like Hebrew in an institutional setting, making it possible to investigate discourse markers, turn-taking phenomena, repairs, and stance constructions.

• Social Sciences: Parliamentary corpora are key infrastructure for political science and communication research, enabling large-scale, time-resolved study of institutions through authentic debate records. Preparing this subcorpus as a clean, queryable resource supports central social-science workflows: diachronic analyses across parliamentary terms, comparisons between politically meaningful groups (e.g., coalition vs. opposition), and committee-level contrasts grounded in protocol metadata. Alignment with Parla-CLARIN/ParlaMint further supports interoperability and cross-national comparison with established parliamentary datasets.
