About Queer Waves
Queer Waves is a German speech dataset designed to support the development of equitable and inclusive speech technologies. It features approximately 335 hours of spontaneous speech from over 400 self-identified LGBTQIA+ speakers, collected from podcasts and YouTube content. The dataset spans a wide range of gender identities, sexual orientations, and ages (18–86), with speakers from all over Germany and Austria.
The data collection placed a strong emphasis on ethical and legal safeguards, especially in handling sensitive personal data. By expanding the diversity of voices in speech technology, Queer Waves contributes to building fairer and more representative AI systems.
The accompanying paper will be presented at Interspeech 2025.
Dataset Statistics
- Total Audio: 335 hours
- Speakers: 400+
- Episodes: 486
- Languages: German (main)
The figures below show the distribution of gender identities, sexual orientations, speaker age, and episode durations in the Queer Waves dataset.
Gender Identity Distribution
Sexual Orientation Distribution
Age Range of Speakers by Gender Identity
Age Range of Speakers by Sexual Orientation
Distribution of the Amount of Material per Speaker
Podcast and YouTube Sources
The following podcasts were included in the creation of the Queer Waves dataset. Where noted, only episodes up to a specific date were used. All content was carefully selected for linguistic and identity diversity and features self-identified LGBTQIA+ speakers.
| Podcast Title | Selection Status |
|---|---|
| Auf eine Tüte | Complete |
| BBQ – Der Black Brown Queere Podcast | Complete |
| 030 Bootycall | Until 2024-08-28 |
| Böttinger Wohnung 17 | Complete |
| Hotel Matze | Until 2024-11-20 |
| Out and About | Complete |
| Queer as Berlin | Complete |
| Queerkram | Complete |
| Reden ist Gold | Complete |
| Somewhere Over The Hay Bale | Complete |
| SPUTNIK Pride | Until 2024-11-08 |
| Willkommen im Club | Until 2024-11-20 |
Dates reflect the latest episode included from each feed at the time of dataset extraction.
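The "Until" dates above act as inclusive cutoffs per feed. A minimal sketch of how such a cutoff could be applied, assuming each episode record carries a `published` date; the `CUTOFFS` mapping, the `select_episodes` helper, and the record layout are hypothetical and not taken from the project's scripts:

```python
from datetime import date

# Hypothetical cutoff table: feed title -> date of the latest included episode.
# None marks a feed that was used completely. Dates are copied from the table.
CUTOFFS = {
    "030 Bootycall": date(2024, 8, 28),
    "Hotel Matze": date(2024, 11, 20),
    "Queerkram": None,
}

def select_episodes(feed_title, episodes):
    """Return the episodes published on or before the feed's cutoff date."""
    cutoff = CUTOFFS[feed_title]
    if cutoff is None:
        return list(episodes)
    return [ep for ep in episodes if ep["published"] <= cutoff]

# Invented episode records, for illustration only.
episodes = [
    {"title": "A", "published": date(2024, 8, 1)},
    {"title": "B", "published": date(2024, 8, 28)},
    {"title": "C", "published": date(2024, 9, 4)},
]

selected = select_episodes("030 Bootycall", episodes)
print([ep["title"] for ep in selected])
```

The comparison uses `<=` because the listed date is that of the latest episode included.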
View Detailed Podcast Overview: comprehensive information about all included podcasts, including covers, descriptions, episode counts, and more.
Processing Pipeline
- Collection: Curated podcast selection using podcast-dl
- Preprocessing: Format conversion and segmentation
- Transcription: OpenAI Whisper
- Diarization: Speaker segmentation (1–4 speakers)
- Annotation: Sexual Orientation, Gender Identity, Age and Region based on self-identification
- Validation: Manual spot checks and metadata curation
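The pipeline lists transcription and diarization as separate steps, so their outputs have to be aligned at some point. How Queer Waves does this is not described here; the sketch below shows one common approach, assigning each transcript segment to the speaker turn it overlaps most. The dictionary layouts and the names `overlap` and `assign_speakers` are assumptions made for illustration, with times in seconds.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(segments, turns):
    """Label each transcript segment with the speaker whose turn overlaps it most.

    Segments that overlap no turn at all get the speaker None.
    """
    labeled = []
    for seg in segments:
        best_speaker, best_overlap = None, 0.0
        for turn in turns:
            ov = overlap(seg["start"], seg["end"], turn["start"], turn["end"])
            if ov > best_overlap:
                best_speaker, best_overlap = turn["speaker"], ov
        labeled.append({**seg, "speaker": best_speaker})
    return labeled

# Invented transcript segments (ASR output) and speaker turns (diarization output).
segments = [
    {"start": 0.0, "end": 4.0, "text": "Hallo und willkommen."},
    {"start": 4.0, "end": 9.5, "text": "Danke für die Einladung."},
]
turns = [
    {"speaker": "spk_01", "start": 0.0, "end": 4.2},
    {"speaker": "spk_02", "start": 4.2, "end": 10.0},
]

labeled = assign_speakers(segments, turns)
for seg in labeled:
    print(seg["speaker"], seg["text"])
```

Maximum overlap is a simple heuristic: a segment that straddles a speaker change is attributed entirely to the dominant speaker, which is one reason manual spot checks remain useful.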
Available Data
We provide:
- Podcast and episode metadata (JSON)
- Annotation scheme (gender identity, sexual orientation, age, speaker region)
- Automatic processing pipeline description and scripts (if available)
- Example transcripts (JSON)
- Metadata overview
- Documentation
⚠️ The full dataset is only available on request for non-commercial academic use.
Sample Entry
    {
      "episode_id": "ep_045",
      "title": "Queer Voices in Berlin",
      "speakers": [
        {
          "id": "spk_01",
          "gender": "non-binary",
          "segments": [
            {
              "start": "00:00:05",
              "end": "00:02:15",
              "transcript": "Welcome to our discussion on queer identities..."
            }
          ]
        }
      ]
    }
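A short sketch of how an entry with this layout could be read and summarized, assuming the released JSON follows the sample above and that timestamps always use the `HH:MM:SS` form. The helper names `to_seconds` and `speaking_time` are illustrative and not part of the dataset tooling.

```python
import json

# The sample entry from above, embedded as a string so the example is self-contained.
SAMPLE = """
{
  "episode_id": "ep_045",
  "title": "Queer Voices in Berlin",
  "speakers": [
    {
      "id": "spk_01",
      "gender": "non-binary",
      "segments": [
        {
          "start": "00:00:05",
          "end": "00:02:15",
          "transcript": "Welcome to our discussion on queer identities..."
        }
      ]
    }
  ]
}
"""

def to_seconds(timestamp):
    """Convert an HH:MM:SS timestamp to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in timestamp.split(":"))
    return hours * 3600 + minutes * 60 + seconds

def speaking_time(entry):
    """Return the total segment duration in seconds for each speaker id."""
    totals = {}
    for speaker in entry["speakers"]:
        totals[speaker["id"]] = sum(
            to_seconds(seg["end"]) - to_seconds(seg["start"])
            for seg in speaker["segments"]
        )
    return totals

entry = json.loads(SAMPLE)
print(speaking_time(entry))  # {'spk_01': 130}
```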
How to Get the Data
- Contact us: ingo.siegert@ovgu.de
- Describe your academic use case
- Sign the Data Usage Agreement
- Receive access credentials
Suggest a Podcast
Do you know of a German-language podcast featuring LGBTQIA+ voices that could enrich the Queer Waves dataset? We welcome suggestions for new podcast sources! Please send us your recommendations with the following information:
Required Information:
- Title: The name of the podcast
- Brief Description: Short description of the content and speakers
- Webpage: Official website or platform link
- RSS Feed Link: Direct link to the RSS feed (essential for processing)
⚠️ Important: Without an RSS feed link, podcasts cannot be processed and included in the dataset.
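The RSS feed matters because it is the machine-readable list of episodes and their audio files. As a rough illustration, assuming a standard RSS 2.0 feed in which each `<item>` has an `<enclosure>` pointing at the audio, the standard library is enough to list what a feed offers. The feed content, URL, and `list_episodes` helper below are invented for the example; the project itself reports using podcast-dl for collection.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Invented minimal RSS 2.0 feed, for illustration only.
FEED_XML = """<rss version="2.0">
  <channel>
    <title>Example Podcast</title>
    <item>
      <title>Episode 1</title>
      <pubDate>Wed, 28 Aug 2024 06:00:00 +0000</pubDate>
      <enclosure url="https://example.org/ep1.mp3" type="audio/mpeg" length="123"/>
    </item>
    <item>
      <title>Episode 2 (no audio attached)</title>
      <pubDate>Wed, 04 Sep 2024 06:00:00 +0000</pubDate>
    </item>
  </channel>
</rss>"""

def list_episodes(feed_xml):
    """Return title, publication date, and audio URL for every item with an enclosure."""
    root = ET.fromstring(feed_xml)
    episodes = []
    for item in root.iter("item"):
        enclosure = item.find("enclosure")
        if enclosure is None:
            continue  # nothing to download for this item
        pub_date = item.findtext("pubDate")
        episodes.append({
            "title": item.findtext("title", default=""),
            "published": parsedate_to_datetime(pub_date).date() if pub_date else None,
            "audio_url": enclosure.get("url"),
        })
    return episodes

for episode in list_episodes(FEED_XML):
    print(episode["published"], episode["title"], episode["audio_url"])
```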
License
- Website & Documentation: CC BY-NC 4.0
- Dataset: Available only under a research-only usage agreement
How to Cite
- Siegert, I., Marquenie, J., & Grawunder, S. (2025). Queer Waves: A German Speech Dataset Capturing Gender and Sexual Diversity from Podcasts and YouTube. Interspeech 2025.
- Siegert, I., Marquenie, J., & Grawunder, S. (2025). Queer Waves: A German Speech Dataset Capturing Gender and Sexual Diversity from Podcasts and YouTube, Dataset. 10.5281/zenodo.15561004
Authors
- Ingo Siegert
- Jan Marquenie
- Sven Grawunder