Announcing our first dataset, TxT360.

LLM360 enables community-owned AI through open-source large model research and development.

Datasets

TxT360

A top-quality LLM pre-training dataset requires the perfect blend. TxT360 is the first dataset to globally deduplicate 99 CommonCrawl snapshots and 14 high-quality data sources from diverse domains (e.g., FreeLaw and PG-19). The large-scale deduplication process and the rich metadata stored alongside each document enable precise control over the data distribution.
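
As a rough sketch of how stored metadata can steer a training mix (this is not the TxT360 pipeline itself; the field names, source shares, and weighting rule below are hypothetical), deduplicated documents could be upsampled or downsampled by source and duplicate count:

    import random

    # Hypothetical per-document metadata kept after global deduplication:
    # one record per unique document, recording its source and how many
    # duplicates of it were found across the crawled snapshots.
    docs = [
        {"id": 0, "source": "common_crawl", "dup_count": 37},
        {"id": 1, "source": "freelaw", "dup_count": 1},
        {"id": 2, "source": "pg19", "dup_count": 2},
    ]

    # Target share of the training mix per source (illustrative numbers only).
    target_mix = {"common_crawl": 0.90, "freelaw": 0.05, "pg19": 0.05}

    def sampling_weight(doc):
        # Weight a document by its source's target share, and let heavily
        # duplicated web documents count somewhat more, using the duplicate
        # count preserved in the metadata.
        return target_mix[doc["source"]] * (1 + min(doc["dup_count"], 100) ** 0.5)

    weights = [sampling_weight(d) for d in docs]
    batch = random.choices(docs, weights=weights, k=2)  # draw documents for training

The point of the sketch is that once duplicates are removed globally and counts are retained as metadata, the final distribution becomes a tunable choice rather than an accident of the raw crawl.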

Models

K2-65B

A 65B parameter language model trained on 1.4T tokens. It outperforms Llama 2 70B while using approximately 35% less compute to train.

Crystal-7B

A 7B parameter language model trained on the SlimPajama and StarCoder datasets. It surpasses the Llama 2 frontier and balances language and coding ability. Its instruction-following variant, CrystalChat, is a top-scoring 7B chat model, trained on a carefully selected mix of publicly available language and code datasets.

Amber-7B

A 7B parameter English language model based on the LLaMA architecture, released with two fine-tuned instruction-following variants, AmberChat and AmberSafe.

Projects

Analysis360: Open Implementations of LLM Analyses

Analysis360 provides open reference implementations for a variety of downstream analyses that can be done with and for LLM360 models, covering a range of topics including mechanistic interpretability, visualization, machine unlearning, data memorization, AI safety, toxicity and bias assessment, and a large set of evaluation metrics.
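
A typical starting point for such analyses is loading an LLM360 checkpoint with the standard Hugging Face transformers API and inspecting its hidden states. The repo id "LLM360/Amber" and the per-layer norm measurement below are illustrative assumptions, not part of Analysis360 itself:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load an LLM360 checkpoint for downstream analysis. The project also
    # releases intermediate checkpoints, which is what makes studies of
    # training dynamics possible.
    tok = AutoTokenizer.from_pretrained("LLM360/Amber")
    model = AutoModelForCausalLM.from_pretrained("LLM360/Amber")

    # A common first step in interpretability work: collect hidden states
    # for a prompt and look at their average norm per layer.
    inputs = tok("Open-source models enable open science.", return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)

    for layer, hidden in enumerate(out.hidden_states):
        print(layer, hidden.norm(dim=-1).mean().item())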

Papers

LLM360 K2-65B: Scaling Up Fully Transparent Open-Source LLMs

In this paper, we present LLM360 K2-65B, the most powerful fully transparent open-source large language model (LLM) released to date. K2 is a 65 billion parameter LLM, which follows best practices for reproducibility from the LLM360 project. Despite numerous efforts to develop and release open-source LLMs, full transparency around the training process remains limited...


LLM360: Towards Fully Transparent Open-Source LLMs

The recent surge in open-source Large Language Models (LLMs), such as LLaMA, Falcon, and Mistral, provides diverse options for AI practitioners and researchers. However, most LLMs have only released partial artifacts, such as the final model weights or inference code, and technical reports increasingly limit their scope to high-level design choices and surface statistics...


Blogs

TxT360: A Top-Quality LLM Pre-training Dataset Requires the Perfect Blend

We introduce TxT360 (Trillion eXtracted Text), the first dataset to globally deduplicate 99 CommonCrawl snapshots and 14 high-quality data sources from diverse domains (e.g., FreeLaw and PG-19). The large-scale deduplication process and the rich metadata stored alongside each document enable precise control over the data distribution.


Decentralized Arena via Collective LLM Intelligence

LLM360 and Maitrix.org proudly release Decentralized Arena, which automates and scales "Chatbot Arena" for LLM evaluation across various fine-grained dimensions (e.g., math: algebra, geometry, probability; logical reasoning; social reasoning; biology; chemistry; ...). The evaluation is decentralized and democratic, with all LLMs participating in evaluating the others.
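
As a minimal sketch of the collective-judging idea (not the released Decentralized Arena algorithm; the model names, the placeholder judge call, and the simple win-count aggregation are assumptions for illustration), every model could judge head-to-head match-ups between the others, with the votes aggregated into a ranking:

    from collections import defaultdict
    from itertools import combinations

    models = ["model_a", "model_b", "model_c"]

    def judge(judge_model, answer_a, answer_b):
        # Placeholder for an actual LLM call that compares two answers
        # and returns "a", "b", or "tie".
        return "a"

    wins = defaultdict(float)
    for a, b in combinations(models, 2):
        for j in models:
            if j in (a, b):  # a model never judges its own match-ups
                continue
            verdict = judge(j, f"answer from {a}", f"answer from {b}")
            if verdict == "a":
                wins[a] += 1.0
            elif verdict == "b":
                wins[b] += 1.0
            else:
                wins[a] += 0.5
                wins[b] += 0.5

    ranking = sorted(models, key=wins.__getitem__, reverse=True)
    print(ranking)

The appeal of this setup is that no single proprietary judge controls the leaderboard; the ranking emerges from many models evaluating one another.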


Introducing K2-65B: Charting the Blueprint Towards Open-Source Artificial General Intelligence

LLM360 is excited to announce several new releases to further our mission of enabling community-owned AGI by creating standards and tools that advance the bleeding edge of LLM capability and empower knowledge transfer, research, and development.


Introducing LLM360: Fully Transparent Open-Source LLMs

In recent months, the open-source large language model (LLM) community has seen tremendous model contributions. However, model weight releases and high-level technical reports do not contain enough information to cover the complexity of LLM training, which hinders the openness and transparency that have underpinned trustworthy and innovative research and science for decades.