Free, High-Quality, Rare Datasets

Multilingual, multimodal, and updated frequently.

Free Datasets

Dataset diversity you can expect

We don't release just one kind of data.

The public list spans speech, text, vision/video, and even medical imaging.

Call-center audio + transcripts

Multilingual short utterances

Accented English (incl. AAVE)

PII detection pack

MRI DICOM (1.5T)

Video: damaged cars + human activities

FAQs

1. Are these datasets actually free?
Yes. This page lists free public dataset releases.
2. How do I know when you release new datasets?
We update this list frequently. Follow us on Hugging Face or LinkedIn — we announce each release when it goes live.
3. What types of datasets do you release?
A mix of speech/call-center, multilingual text utterances, video, PII-focused datasets, and medical imaging (varies by release).
4. Do speech datasets include transcripts or scripts?
Some releases include human transcription, scripts, or both. Each dataset release specifies exactly what's included.
5. Can I use these datasets commercially?
No. These datasets are provided strictly for research and AI model development. Commercial use, resale, or redistribution is prohibited.
6. Who are these releases for?
Anyone who needs high-quality data for building or testing AI systems, including:
  • AI engineers training/evaluating speech, LLM, or multimodal models
  • Researchers & students running benchmarks or experiments
  • Startups prototyping and validating model behavior quickly
  • Data teams testing pipelines (label formats, ingestion, QA workflows)
  • Educators using real datasets for teaching and coursework
  • Builders exploring accents, languages, or real-world conversation patterns
7. Can I share the datasets with others?
Please don't redistribute the datasets. Share the official release link instead, so everyone accesses the same source and terms.