GerParlDia-MM: Multimodal Diachronic Bundestag Corpus

This page summarizes a multimodal corpus of German parliamentary speeches (1949-2025), designed for longitudinal research on voice, language, and rhetorical change across decades.

Long-serving speakers: 75
Speeches with media linkage: 2,136
Temporal coverage: 1949-2025
Multimodal setup: Audio + Video + Text

Dataset In Brief

The corpus is based on official Bundestag Open Data XML metadata and persistent PoliticianID references, linked where possible to OpenDiscourse speech IDs and official plenary records. Speaker selection targets long careers using asymmetric thresholds to account for historical gender representation: at least eight legislative terms for men and six for women.
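The asymmetric selection rule described above can be sketched as a simple filter. This is a minimal illustration, not the corpus's actual selection code: the `Member` class, its field names, and the gender encoding are assumptions; only the two thresholds (eight terms for men, six for women) come from the text.

```python
from dataclasses import dataclass

# Thresholds as stated in the text: at least eight legislative terms
# for men and six for women.
MIN_TERMS = {"male": 8, "female": 6}


@dataclass
class Member:
    politician_id: str  # hypothetical field name
    gender: str         # "male" or "female" (hypothetical encoding)
    terms: int          # number of legislative terms served


def is_long_serving(member: Member) -> bool:
    """Return True if the member meets the gender-specific term threshold."""
    threshold = MIN_TERMS.get(member.gender)
    if threshold is None:
        # Unknown encoding: exclude rather than guess a threshold.
        return False
    return member.terms >= threshold


def select_long_serving(members):
    """Keep only members who satisfy the asymmetric threshold."""
    return [m for m in members if is_long_serving(m)]
```

Keeping the thresholds in one dictionary makes the asymmetry explicit and easy to audit or change.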

The resulting set includes 75 members (43 men, 32 women), with career spans from roughly 18 to 53 years. Text and media are aligned at the sentence level using a WhisperX-based pipeline. Speech boundaries are refined with regex patterns plus LLM-assisted segmentation, which removes non-speech context such as announcements.
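The regex part of the boundary refinement might look roughly like the sketch below. The phrases in the pattern are hypothetical examples of a chair's hand-over announcement; the corpus's actual patterns are not given here, and the LLM-assisted step is not shown.

```python
import re

# Hypothetical announcement openers; the real pattern set is an assumption.
ANNOUNCEMENT = re.compile(
    r"^\s*(?:Das Wort hat|Ich erteile das Wort|"
    r"Nächster Redner ist|Nächste Rednerin ist)\b"
    r"[^.!?]*[.!?]\s*",
    re.IGNORECASE,
)


def strip_leading_announcements(text: str) -> str:
    """Remove announcement sentences from the start of a transcript."""
    previous = None
    # Repeat until no further leading announcement is found.
    while previous != text:
        previous = text
        text = ANNOUNCEMENT.sub("", text, count=1)
    return text.strip()
```

A purely sentence-based pattern like this one cuts at the first period, so abbreviations such as "Dr." inside an announcement would be split incorrectly. Cases like that are one reason a second segmentation step is useful.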

Main research use-cases: voice aging, intra-speaker variation, diachronic political language, ASR benchmarking, and speaker recognition over long timespans.

Dataset Overview

Category                   Count   Share (%)
Total speeches             2,136   100.0
From archive               1,443   67.4
From media library           698   32.6
With video                 1,141   53.3
Linked to OpenDiscourse    1,951   91.2

Legal And Ethical Considerations (Short)

The corpus uses parliamentary materials intended for the public, together with official metadata. Textual plenary records are openly accessible under Bundestag archival rules, while audiovisual usage follows Mediathek licensing terms (educational/cultural/parliamentary use, source attribution, no misleading modification).

Archival materials requiring additional permissions are referenced via IDs/links but not redistributed as raw audiovisual files. The release focuses on transparency, traceability, and legal compliance, and contains no private personal data.

Data Availability Statement

The dataset release can provide structured JSON metadata including speaker-level information, speech-level records, complete transcripts, and persistent links or identifiers to OpenDiscourse entries, Bundestag Mediathek media, archival signatures, and PDF plenary protocols.
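As a rough illustration of working with such speech-level JSON records, the sketch below tallies media source, video availability, and OpenDiscourse linkage. Every field name and value in it is a hypothetical placeholder, since the released schema is not specified here.

```python
import json

# Hypothetical record layout: every field name below is an assumption
# made for illustration, not the released schema.
EXAMPLE_RECORDS = json.loads("""
[
  {"speech_id": "example-0001", "source": "archive",
   "has_video": false, "opendiscourse_id": "od-example-1"},
  {"speech_id": "example-0002", "source": "media_library",
   "has_video": true, "opendiscourse_id": null}
]
""")


def linkage_summary(records):
    """Count records by source, video availability, and OpenDiscourse linkage."""
    summary = {"total": len(records), "with_video": 0, "linked": 0, "by_source": {}}
    for record in records:
        if record.get("has_video"):
            summary["with_video"] += 1
        if record.get("opendiscourse_id"):
            summary["linked"] += 1
        source = record.get("source", "unknown")
        summary["by_source"][source] = summary["by_source"].get(source, 0) + 1
    return summary
```

Calling `linkage_summary(EXAMPLE_RECORDS)` on the two placeholder records yields counts of the same kind as those in the Dataset Overview table.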

Audiovisual media files are not redistributed in the core release. Access to original recordings follows the official Bundestag Mediathek and Parliamentary Archive usage terms.

Citation Statement (Sample)

Final bibliographic details will be updated after presentation/publication at LREC.

Siegert, I. (2026). Voices across Decades: A Multimodal Diachronic Corpus of German Bundestag Debates (GerParlDia-MM). In Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026) (pp. 6289–6297). European Language Resources Association (ELRA). https://doi.org/10.63317/3vgihkgnkg75.
