AISHELL-2 Chinese Speech Database

Original post · last modified 2022-03-11 10:51:02

License: CC 4.0 BY-SA

Tags: #speech-recognition #artificial-intelligence

First published 2022-03-09 16:41:03

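AISHELL-2 is a large open-source speech corpus for Mandarin ASR, containing 1000 hours of clean read speech that is free for academic research. The data covers 12 domains and was recorded by 1991 speakers from different accent regions of China, with a transcription accuracy above 96%. An improved recipe with components for industrial applications is also provided, supporting several state-of-the-art ASR techniques. AISHELL-2 is intended to support research on transfer learning and robust ASR, and to help industry build practical systems and products.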

Abstract

AISHELL-1 is by far the largest open-source speech corpus available for Mandarin speech recognition research. It was released with a baseline system containing solid training and testing pipelines for Mandarin ASR. In AISHELL-2, 1000 hours of clean read-speech data recorded on iOS devices are published, free for academic use. On top of the AISHELL-2 corpus, an improved recipe is developed and released, containing key components for industrial applications, such as Chinese word segmentation, flexible vocabulary expansion, and phone set transformation. The pipelines support various state-of-the-art techniques, such as time-delay neural networks and the Lattice-Free MMI objective function. In addition, we release dev and test data from other channels (Android and Mic). For the research community, we hope that the AISHELL-2 corpus can be a solid resource for topics like transfer learning and robust ASR. For industry, we hope the AISHELL-2 recipe can be a helpful reference for building meaningful industrial systems and products.

Index Terms: Speech recognition, Mandarin ASR, Industrial Speech Recognition

Introduction

Automatic Speech Recognition (ASR) is a major application domain in the current boom of Artificial Intelligence (AI). Both the research community and industry have put great effort into improving ASR system performance. Among the solutions proposed, deep learning approaches have dominated for the last half decade. Given enough data, neural network (NN) models generally achieve better recognition accuracy and turn out to be more robust. From an industrial perspective, accessing and collecting large amounts of speech data has become easier than ever, thanks to the emerging market of smartphones and other smart devices. The research community, however, still has limited access to real-world application data. As a result, improvements made in research do not always scale well to industrial scenarios. In computer vision, there are many high-quality free data sets that help turn research efforts into industrial applications, such as ImageNet [1] and COCO [2]. In Mandarin ASR, although there are corpora such as thchs30 [3] and hkust [4], a large-scale, high-quality free corpus is still needed.

AISHELL-2 is a 1000-hour Mandarin Chinese speech corpus: 718 hours come from AISHELL-ASR0009-[ZH-CN] and 282 hours from AISHELL-ASR0010-[ZH-CN]. The utterances cover 12 domains, including keywords, voice commands, smart home, autonomous driving, and industrial production. Recording took place in a quiet indoor environment, using three devices in parallel: a high-fidelity microphone (44.1 kHz, 16-bit), an Android mobile phone (16 kHz, 16-bit), and an iOS mobile phone (16 kHz, 16-bit). AISHELL-2 uses the audio recorded on the iOS device. A total of 1991 speakers from different accent areas in China participated in the recording. After professional speech annotation and strict quality inspection, the manual transcription accuracy is above 96%. (This database is free for academic research; commercial use without permission is not allowed.)
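As a rough illustration of the audio format described above, the following sketch checks whether a WAV file matches the stated 16 kHz, 16-bit format and reports its duration. It uses only Python's standard `wave` module. The function names `check_wav_format` and `make_silent_wav` are hypothetical helpers introduced here, and the mono channel count of the synthetic file is an assumption, since the description does not state how many channels the recordings have. This is not part of the released AISHELL-2 recipe.

```python
import io
import wave

# Expected format taken from the corpus description: 16 kHz, 16-bit.
EXPECTED_RATE_HZ = 16000
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bits = 2 bytes per sample


def check_wav_format(source):
    """Return (ok, duration_seconds) for a WAV path or binary file-like object."""
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        frames = wav.getnframes()
    ok = rate == EXPECTED_RATE_HZ and width == EXPECTED_SAMPLE_WIDTH_BYTES
    return ok, frames / rate


def make_silent_wav(rate_hz, seconds):
    """Build an in-memory 16-bit WAV of silence as a stand-in for a real file.

    Mono is an assumption made for this example only.
    """
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00\x00" * (rate_hz * seconds))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    ok, duration = check_wav_format(make_silent_wav(16000, 2))
    print("matches 16 kHz / 16-bit:", ok, "| duration (s):", duration)
```

To check a real recording, a file path can be passed in place of the in-memory buffer, for example `check_wav_format("utterance.wav")`, where the file name is a placeholder. A 44.1 kHz file, such as one from the high-fidelity microphone channel, would fail the check.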