GitHub Releases Multilingual Dataset of 40 Million Repositories for AI Research

GitHub has launched a new multilingual dataset for AI research, providing metadata from 40 million repositories. The dataset is designed to help researchers build and train models that understand code across different human languages, not just English.

What the dataset contains

The dataset includes metadata from 40 million public repositories, such as repository names, descriptions, README files, programming languages, and topics. It covers a wide range of natural languages, reflecting the global nature of software development on the platform. The metadata is structured for tasks like code search, documentation generation, and multilingual code understanding.
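
To make the shape of such a record concrete, here is a minimal Python sketch. The article lists the kinds of metadata included but does not give a schema, so the class name, the field names, and the natural_language tag below are all hypothetical, as are the three sample records.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class RepoMetadata:
    # Hypothetical field names: the article names these kinds of
    # metadata but does not publish the actual schema.
    name: str
    description: str
    readme: str
    programming_language: str
    topics: list[str] = field(default_factory=list)
    # Assumed tag for the human language of the text (not confirmed
    # by the article); "und" stands for "undetermined".
    natural_language: str = "und"


def count_by_natural_language(records):
    """Count how many records fall under each natural-language tag."""
    return Counter(r.natural_language for r in records)


# Invented sample records, for illustration only.
records = [
    RepoMetadata("calc", "Una calculadora sencilla", "# calc", "Python", ["math"], "es"),
    RepoMetadata("notes", "A small note-taking tool", "# notes", "Go", ["cli"], "en"),
    RepoMetadata("tareas", "Gestor de tareas", "# tareas", "Rust", ["todo"], "es"),
]

counts = count_by_natural_language(records)
```

A per-language tally like this is the sort of first check a researcher might run to see how far a sample departs from English-only data.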

Most existing code-focused AI models are trained primarily on English-language data. That limits their usefulness for developers who write comments, documentation, or commit messages in other languages. By providing a large, diverse set of multilingual metadata, GitHub aims to help researchers create models that work better for a global audience. The dataset could also improve tools for translating code-related text or for cross-lingual code retrieval.

What researchers can do with it

The dataset is intended for academic and non-commercial research. It can be used to train models for tasks like predicting programming languages from descriptions, generating code comments, or understanding the relationship between natural language and code. GitHub has not specified a license for the dataset itself, but the metadata comes from public repositories, so usage must respect the original repository licenses.

The release comes as interest in large language models for code continues to grow. GitHub has its own AI coding assistant, Copilot, but this dataset is separate from that product and focused on research. The company has not yet announced any specific partnerships or projects using the data.

Researchers can access the dataset through GitHub's website. No registration or fee is required, but the company asks users to follow its terms of service. The dataset is static: it is a snapshot of metadata taken at a fixed point in time, not a live feed.
