Accelerating researchers and developers building multilingual AI with a new open dataset

Blizine Admin

A new repository-level dataset, published on GitHub under CC0-1.0, helps researchers and developers discover multilingual developer content across READMEs, issues, and pull requests.

Kevin Xu · @khxu
June 15, 2026 | 5 minutes

Software may be written in programming languages, but human language is at the heart of developer collaboration. Developers explain how projects work in READMEs. They ask for help in issues. They review, debate, and improve code in pull requests. That collaboration often happens in English, but not always. As AI becomes a bigger part of how developers build software, multilingual developer content matters more than ever.

Today, GitHub is publishing the GitHub Multilingual Repositories Dataset, a repository-level metadata dataset designed to help researchers and developers discover public GitHub repositories with evidence of non-English natural-language content.

When building the dataset, we found that language distribution differs across READMEs, issues, and pull requests: Korean is the most common non-English language in issue text, but only the fifth-most common in READMEs. Portuguese tops the non-English README list with more than 3 million repositories.

The dataset is now available on GitHub under CC0-1.0. It follows through on a commitment we made in 2025, as part of Microsoft’s European Digital Commitments, to make multilingual data more accessible, including to open source AI developers.

What’s in the dataset

The GitHub Multilingual Repositories Dataset is intentionally not a dump of repository content. Instead, it is a metadata dataset that helps developers and researchers find repositories where multilingual collaboration may be happening. The dataset covers over 80 million classification rows across more than 40 million repositories. For each public repository, we provide:

Language classifications
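
To illustrate how per-surface rankings like the README versus issue comparison above could be derived from repository-level classification rows, here is a minimal Python sketch. It makes several assumptions: the dataset's actual schema is not shown in this excerpt, so the row shape (repository, surface, language code), the surface names, and the `rank_non_english` helper are all hypothetical. The sample rows are invented for illustration and do not come from the dataset.

```python
from collections import Counter, defaultdict

# Hypothetical rows: (repository, surface, detected language code).
# These values are invented for illustration; they are NOT real dataset rows,
# and the real column names and layout may differ.
SAMPLE_ROWS = [
    ("org/a", "readme", "pt"),
    ("org/a", "issues", "ko"),
    ("org/b", "readme", "pt"),
    ("org/b", "issues", "ko"),
    ("org/c", "readme", "ko"),
    ("org/c", "issues", "en"),
    ("org/d", "pull_requests", "ja"),
    ("org/d", "readme", "en"),
]


def rank_non_english(rows, exclude=("en",)):
    """Return {surface: [(language, repo_count), ...]}, most common first."""
    # Collect distinct repositories per (surface, language) pair so a
    # repository is counted at most once for each pair.
    repos_seen = defaultdict(set)
    for repo, surface, language in rows:
        if language not in exclude:
            repos_seen[(surface, language)].add(repo)

    per_surface = defaultdict(Counter)
    for (surface, language), repos in repos_seen.items():
        per_surface[surface][language] = len(repos)

    # Sort by count descending, then by language code to break ties stably.
    return {
        surface: sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        for surface, counts in per_surface.items()
    }


rankings = rank_non_english(SAMPLE_ROWS)
for surface, ranked in sorted(rankings.items()):
    print(surface, ranked)
```

Counting distinct repositories rather than raw rows is a design choice in this sketch: it keeps a single repository with many classified items from dominating a surface's ranking. Whether that matches how the published figures were computed is not stated in the source.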
