Revised: 2025-03-07
Abstract: Competition in the global artificial intelligence (AI) large model industry is intensifying, and corpus resources are emerging as a critical determinant of the technical performance and practical efficacy of AI large models. China's corpora, however, fall short in both quantity and quality and struggle to meet the training demands of the rapidly developing AI large model sector. Internationally, countries are accelerating corpus development, in particular the construction and application of high-quality corpora. Against this background, and drawing on a comparison with international benchmarks and an analysis of domestic conditions, this article proposes building a national corpus operation platform, with recommendations along four dimensions: platform positioning, overall architecture, operating entity, and core content.
Funding: Major Program of the National Social Science Fund of China (24&ZD072); China Postdoctoral Science Foundation (2023M731171)
Cite this article:
李兴腾,冯锋,黄鹂强.突破人工智能大模型的“数据瓶颈”——构建国家级语料库运营平台的思考[J].中国科学院院刊,2025,40(3):522-529.
LI Xingteng, FENG Feng, HUANG Liqiang. Breaking through "data bottleneck" of AI large models: Reflections on building a national corpus operation platform[J]. Bulletin of Chinese Academy of Sciences, 2025, 40(3): 522-529.