High-quality, real-world speech datasets
seSotho Speech Datasets
This speech data collection was planned, collected, annotated, and curated with natural language processing best practice in mind. Machine learning and speech recognition rely on unbiased, representative datasets, which is why Way With Words collects speech data from as wide a range of demographics as possible, with as even a gender split as possible.
Applications of NLP in AI demand that speech recognition training data be qualified, structured, and represented to suit machine learning in speech processing. We believe that we’ve collected useful baseline datasets for benchmarking improvements in the accuracy of existing speech-to-text models. Speech recognition training data can also be commissioned on a bespoke basis to suit whatever conventions, needs, domains, and languages are required.
Data Set Details
Hours available: 50 hours
Age range: 18 – 49
Download size: 38GB
Number of speakers: 49
Audio format: WAV
Dataset Demographics
Age Range Distribution
Recorders per age group
[18 – 29]: 24 Recorders
[30 – 49]: 25 Recorders
Gender Split Across Recorded Hours
Men: 17 Recorders
Women: 32 Recorders
Hours Collected Across Domains
Runtime per domain
Retail: 12:46:52
Debt Collection: 12:51:12
Insurance: 12:08:22
Travel: 12:16:38
Total: 50:03:04
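The stated total can be checked by summing the per-domain runtimes. The following is a minimal Python sketch, assuming the figures above are in HH:MM:SS format; the helper names `parse_hms` and `format_hms` are illustrative and not part of the dataset or any Way With Words tooling.

```python
from datetime import timedelta

# Per-domain runtimes copied from the list above (assumed HH:MM:SS).
RUNTIMES = {
    "Retail": "12:46:52",
    "Debt Collection": "12:51:12",
    "Insurance": "12:08:22",
    "Travel": "12:16:38",
}


def parse_hms(text):
    """Convert an 'HH:MM:SS' string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def format_hms(duration):
    """Format a timedelta as 'H:MM:SS', allowing hours above 24."""
    total_seconds = int(duration.total_seconds())
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


# Sum all domains, starting from a zero-length timedelta.
total = sum((parse_hms(value) for value in RUNTIMES.values()), timedelta())
print(format_hms(total))  # 50:03:04
```

The same helpers can be reused to compute each domain's share of the total, for example `parse_hms(RUNTIMES["Retail"]) / total`.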
Additional Information
Gender Split of seSotho Call Recorders Across Domains
Gender Split of seSotho Call Recorders
Education Level Distribution of seSotho Call Recorders
Geographical Distribution of seSotho Call Recorders
Frequently Asked Questions about our Speech Collection Services
How are your dataset recordings structured?
Our off-the-shelf dataset collections consist of unscripted, natural conversations conducted by call recorders who are recruited, trained, and approved to simulate real-world conversations in common domains. This means recordings and transcripts include routine security verifications such as ID, email, and phone number validation.
How do you recruit for Speech Collection datasets?
Our priority is to create datasets that are unbiased and cover as wide a range of demographics as possible. This is the first consideration when we begin the planning and recruitment process of any Speech Collection dataset project.
What kind of agreement is in place for the purchase of this Speech Collection dataset?
A Licence Agreement governs the sale and usage of this Speech Collection dataset. Our off-the-shelf options let clients test and benchmark before considering larger, custom commitments that are better suited to their requirements and conventions.
Why consider Way With Words for Speech Collection datasets?
Way With Words has produced thousands of hours of bespoke Speech Collection datasets, which are unfortunately not available under Licence Agreement. This off-the-shelf dataset was created to demonstrate our capabilities, as we believe we can offer tremendous value on custom collections delivered exactly to client specification.