
EleutherAI

Content sourced from Wikipedia, licensed under CC BY-SA 3.0.

EleutherAI is a grassroots, non-profit AI research group that builds open-source alternatives to proprietary large language models. It started in a Discord server in July 2020 with the goal of creating an open version of GPT-3. In 2023 it formally became the EleutherAI Institute, a non-profit research organization. As of 2025, it maintains training datasets, conducts research, and works on public policy.

The project began on July 7, 2020, originally under the name LibreAI before being renamed EleutherAI. Its founders were Connor Leahy, Leo Gao, and Sid Black. They built open-source AI tooling and set out to replicate GPT-3. On December 31, 2020, they released The Pile, a large collection of text data for training language models. The first GPT-Neo models appeared in March 2021, and on June 9, 2021, they released GPT-J-6B, the largest open-source GPT-3-style model at the time. These models were released under the Apache 2.0 license and helped inspire many new AI startups.
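Because these checkpoints are openly licensed, they can be loaded with standard tooling. As a minimal sketch, assuming the Hugging Face transformers library and its published "EleutherAI/gpt-neo-1.3B" checkpoint name:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load one of EleutherAI's openly released checkpoints; GPT-J-6B and the
    # other GPT-Neo sizes are published under the same "EleutherAI/" namespace.
    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")
    model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")

    # Generate a short sampled continuation of a prompt.
    inputs = tokenizer("EleutherAI is", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))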

At first EleutherAI declined outside funding, relying on Google's TPU Research Cloud for compute. By early 2021 they accepted help from CoreWeave and SpellML for access to powerful GPU clusters. On February 10, 2022, they released GPT-NeoX-20B, a larger model supported by those resources.

In early 2023 EleutherAI became a nonprofit institute led by Stella Biderman, Curtis Huebner, and Shivanshu Purohit. They announced a shift toward interpretability, alignment, and scientific research, while continuing to train and release models.

In July 2024 an investigation reported that The Pile included subtitles from many YouTube videos, which drew criticism over copyright. In 2025 EleutherAI released the Common Pile, a dataset built from openly licensed and public-domain text, and trained two models on it. They also worked with the UK's AI Security Institute to show that filtering training data can reduce harmful outputs without hurting model performance.

EleutherAI works with hundreds of volunteer researchers. The Pile itself is about 886 GB and has been widely used to train large models, including Microsoft and Nvidia's Megatron-Turing NLG. It is notable for the careful documentation provided by the researchers who built it and for its focus on data curation. The group's GPT-Neo family spans models from 125 million to 20 billion parameters and helped spark a wave of open-source language models. They also branched into text-to-image work after OpenAI announced DALL-E in 2021, combining OpenAI's CLIP model with a VQGAN image generator to create VQGAN-CLIP, a method anyone could run from public notebooks; a sketch of the underlying idea follows.
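VQGAN-CLIP works by repeatedly adjusting a VQGAN's latent codes so that the decoded image scores higher under CLIP's text-image similarity. The sketch below illustrates that optimization loop under a deliberate simplification: it optimizes raw pixels rather than VQGAN latents, so no VQGAN weights are needed, and it assumes the Hugging Face transformers library with the public "openai/clip-vit-base-patch32" checkpoint.

    import torch
    from transformers import CLIPModel, CLIPTokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
    tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
    for p in model.parameters():
        p.requires_grad_(False)  # CLIP stays frozen; only the image is optimized

    # Encode the text prompt once; its embedding is the optimization target.
    text = tokenizer(["a watercolor painting of a lighthouse"], return_tensors="pt").to(device)
    text_emb = model.get_text_features(**text)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    # Optimize an unconstrained tensor; sigmoid keeps pixels in [0, 1].
    # (The real method would decode VQGAN latent codes here instead.)
    latent = torch.randn(1, 3, 224, 224, device=device, requires_grad=True)
    opt = torch.optim.Adam([latent], lr=0.05)

    # CLIP's channel normalization, applied differentiably in torch.
    mean = torch.tensor([0.4815, 0.4578, 0.4082], device=device).view(1, 3, 1, 1)
    std = torch.tensor([0.2686, 0.2613, 0.2758], device=device).view(1, 3, 1, 1)

    for step in range(200):
        pixels = torch.sigmoid(latent)
        img_emb = model.get_image_features(pixel_values=(pixels - mean) / std)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        loss = -(img_emb * text_emb).sum()  # maximize cosine similarity
        opt.zero_grad()
        loss.backward()
        opt.step()

After the loop, torch.sigmoid(latent) holds the optimized image. In the full method the same gradient signal updates the VQGAN's latent codes rather than raw pixels, which is what keeps the outputs looking like natural images.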

