Accelerating researchers and developers building multilingual AI with a new open dataset

A new repository-level dataset, published on GitHub under CC0-1.0, helps researchers and developers discover multilingual developer content across READMEs, issues, and pull requests.

Kevin Xu · @khxu
June 15, 2026 | 5 minutes

Software may be written in programming languages, but human language is at the heart of developer collaboration. Developers explain how projects work in READMEs. They ask for help in issues. They review, debate, and improve code in pull requests. That collaboration often happens in English, but not always. As AI becomes a bigger part of how developers build software, multilingual developer content matters more than ever.

Today, GitHub is publishing the GitHub Multilingual Repositories Dataset, a repository-level metadata dataset designed to help researchers and developers discover public GitHub repositories with evidence of non-English natural-language content.

When building the dataset, we found that language distribution differs across READMEs, issues, and pull requests: Korean is the most common non-English language in issue text, but only the fifth-most common in READMEs. Portuguese tops the non-English README list with more than 3 million repositories.

The dataset is now available on GitHub under CC0-1.0. It follows through on a commitment we made in 2025, as part of Microsoft’s European Digital Commitments, to make multilingual data more accessible, including to open source AI developers.

What’s in the dataset

The GitHub Multilingual Repositories Dataset is intentionally not a dump of repository content. Instead, it is a metadata dataset that helps developers and researchers find repositories where multilingual collaboration may be happening.

The dataset covers over 80 million classification rows across more than 40 million repositories. For each public repository, we provide: Language classifications