NousResearch

Hi there!

Here’s a quick overview of NousResearch, why it’s buzzing, and why it might shake up the world of LLMs and AI models.

Everything started at the end of 2022 on a Discord server, with technical conversations and experiments. In 2023 they structured themselves as a startup, and in 2025 they raised a $50M Series A led by Paradigm. Small team (a few dozen people), big community (~15k on Discord), lots of open source, and some wild ideas for training and data.

The Story#

No marketing plan here—just people tinkering on Discord, threads, and tests on their side. And then it took off. Jeffrey, Karan, Ryan, Shivani, and the crew turned conversations into code, then into a company. The promise? Make AI accessible and transparent—publish the weights, share the methods, and invite the community to participate.

What’s beautiful here is the mix: a small agile team + a very active community = fast experiments and instant feedback. They keep their “decentralized” spirit even after raising funds—and that changes everything.

https://github.com/NousResearch

The Models and Ideas Making Waves#

What gets people talking is the Hermes series and the whole ecosystem around it.

  • Hermes: a small family of LLMs focused on practicality. Hermes 3 (August 2024) is a fine-tune of Llama 3.1 405B on a heavily synthetic dataset (lots of generated answers). Result: capable and quick to ship.
  • Hermes 4 (August 2025) introduces the game-changer: hybrid reasoning. Two modes: fast response or long reflection (visible <think>...</think> tags). Basically, you can see when the model is “thinking.” And yes, it works: benchmarks show big gains on certain tasks.
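Those visible reasoning tags are easy to work with programmatically. A minimal Python sketch, assuming the model emits a single <think> block before its answer (the usual shape of Hermes-style traces):

```python
import re

def split_reasoning(output: str) -> tuple[str, str]:
    """Separate a visible <think>...</think> trace from the final answer."""
    match = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    if match is None:
        return "", output.strip()          # fast mode: no reasoning trace
    reasoning = match.group(1).strip()
    answer = output[match.end():].strip()  # everything after the trace
    return reasoning, answer

# Made-up completion, purely for illustration:
trace, answer = split_reasoning("<think>6 times 7 is 42.</think>The answer is 42.")
print(trace)   # 6 times 7 is 42.
print(answer)  # The answer is 42.
```

The same split is what makes the trace auditable: you can log, grade, or discard the reasoning independently of the answer.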

Hermes 4 vs Frontier Models

And when you enable reflection mode, boom:

Hermes 4 hybrid reasoning: performance boost across benchmarks

You can test the models at https://chat.nousresearch.com

Or on Hugging Face, where they share all their models: https://huggingface.co/NousResearch

DisTrO#

https://github.com/NousResearch/DisTrO

DisTrO (Distributed Training Over-The-Internet): the thing that could make large-model training accessible outside of data centers. In simple terms: DisTrO reduces inter-GPU communication needs by 4–5 orders of magnitude (≈ 10,000 to 100,000×) by smartly compressing gradients, without sacrificing convergence, according to published results.

Source: https://nousresearch.com/nous-psyche/

How does it work?

  • Each worker computes its gradients locally as usual.
  • Before sending anything, it transforms and compresses: the Discrete Cosine Transform (DCT) is used to move into a “frequency” domain where useful information is concentrated (a bit like JPEG, but for gradients).
  • The most important coefficients are pruned/quantized/encoded, so much less data is sent. Thanks to DeMo (Decoupled Momentum Optimization), the system allows controlled divergence between optimizer states and maintains good overall convergence.
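The steps above can be sketched in a few lines of numpy. This is a toy illustration of the DCT-then-top-k idea only, not Nous's actual implementation (the real pipeline also quantizes, encodes, and handles momentum decoupling via DeMo):

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis as an n x n matrix."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    m[0] *= 1 / np.sqrt(2)
    return m * np.sqrt(2 / n)

def compress(grad: np.ndarray, keep_ratio: float = 0.01):
    """Move the gradient into the frequency domain, keep the top-k coefficients."""
    coeffs = dct_matrix(grad.size) @ grad
    k = max(1, int(keep_ratio * grad.size))
    idx = np.argsort(np.abs(coeffs))[-k:]   # largest-magnitude components
    return idx, coeffs[idx]                 # only this gets "sent"

def decompress(idx: np.ndarray, vals: np.ndarray, size: int) -> np.ndarray:
    """Rebuild an approximate gradient from the kept coefficients."""
    coeffs = np.zeros(size)
    coeffs[idx] = vals
    return dct_matrix(size).T @ coeffs      # orthonormal: inverse = transpose

grad = np.sin(np.linspace(0, 3, 64))        # stand-in for a smooth gradient
idx, vals = compress(grad, keep_ratio=0.1)  # keep 6 of 64 coefficients
approx = decompress(idx, vals, grad.size)
```

For smooth, low-frequency-dominated signals, the handful of kept coefficients reconstructs most of the gradient, which is the intuition behind the JPEG analogy.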

DisTrO

Results & Proof

  • In their preliminary report, Nous shows empirical evidence: DisTrO can match the convergence speed of classic training while massively reducing bandwidth (demonstrated on pre-training a 15B LLM).
  • DeMo, which gives DisTrO its theoretical backing, decouples momentum updates and allows controlled divergence. In their experiments, they observe Spearman rank correlations ≈ 0.99 and mean squared errors far below classic baselines (orders of magnitude better in their measurements).

Why it’s a game changer

  • Enables large-scale training on heterogeneous hardware and slow Internet connections: a gaming PC can contribute.
  • Makes decentralization (Psyche) realistic: less data to transmit = broader participation and lower costs.
  • Breaks the dependency on ultra-fast interconnects (InfiniBand), opening the door to more open distributed training.
NOTE
  • Compression = potential information loss: calibration is needed to avoid degrading training.
  • In decentralized infrastructures, risks of data poisoning and malicious behavior—verification and traceability (e.g., verifiers, on-chain audit on Psyche) are crucial.
  • Implementation complexity: orchestration, quantization, aggregation, and verification require careful infrastructure and software.

In short, DisTrO + DeMo = a pragmatic approach to training massive models over the Internet without waiting for datacenter-grade networks. If it scales, it changes the game for anyone wanting to distribute compute outside the big clouds.

Forge#

Forge Reasoning API

forge

Why Forge is a game changer

  • Inference boost: Hermes 70B + Forge outperforms larger models on some reasoning benchmarks (e.g., AIME).
  • Flexibility: you choose the model or combination of models you want (see below).
  • Transparency: the reasoning trace is accessible via the API, useful for audits, improvements, and analysis.
NOTE

Hermes 70B x Forge shows very strong results on AIME (a demanding math competition) in their internal tests.

Forge Architecture Summary

  • Model Layer — Freedom of Choice: supports Hermes 3, Claude 3.5 Sonnet, Gemini, and GPT-4. You can use a single model or combine several.
  • Reasoning Layer — three main components (use cases):
    • MCTS (Monte Carlo Tree Search): useful for planning problems. Phases: Selection → Expansion → Simulation → Backpropagation.
    • CoC (Chain of Code): chain of reasoning linked to a code interpreter—great for math & code (e.g., evaluate, execute, verify).
    • MoA (Mixture of Agents): several models/agents produce answers, debate, and aggregate a consensus solution.
  • Execution / Orchestration: manages model calls, code execution, result aggregation, and returns the full trace.
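To make the MoA component concrete, here is a toy sketch of one aggregation round. The agents are plain callables standing in for real model calls, and a majority vote stands in for Forge's consensus step (whose exact mechanics aren't public):

```python
from collections import Counter

def mixture_of_agents(prompt: str, agents: list):
    """One toy MoA round: query every agent, then take a majority vote."""
    answers = [agent(prompt) for agent in agents]
    winner, votes = Counter(answers).most_common(1)[0]
    trace = {"prompt": prompt, "answers": answers, "votes": votes}
    return winner, trace        # the trace mirrors Forge's auditable output

# Hypothetical agents (each would wrap a real model API call):
agents = [lambda p: "42", lambda p: "42", lambda p: "41"]
best, trace = mixture_of_agents("What is 6 * 7?", agents)
print(best)  # 42
```

Returning the full trace alongside the winner is the key design choice: it's what makes the pipeline auditable rather than a black box.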

The goal is to make inference more robust, traceable, and modular—a truly pragmatic approach to giving models reasoning without retraining everything.

Forge Blog Post

It’s in private beta for now for Lambda compute partners, but hopefully will be accessible to everyone soon.

Psyche#

The decentralized infrastructure for training models

Psyche is Nous Research’s ambition to make massive model training truly democratic. Instead of stacking thousands of accelerators in a single datacenter, Psyche orchestrates training on heterogeneous and underused hardware all over the world—relying on DisTrO/DeMo to limit the amount of data to transfer, and on the Solana blockchain for coordination, traceability, and fault tolerance.

psyche

  • Goal: allow anyone (for example, with a gaming PC) to contribute to training and be rewarded.
  • Technical pillar: DisTrO + DeMo (compressing learning information to reduce bandwidth).
  • Coordination: smart contract on Solana that stores the state of a run, manages transitions, and provides randomness & assignments.
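The role of the on-chain randomness can be illustrated with a toy sketch: if every node derives assignments from the same published seed, they all agree on who trains which batch without exchanging extra messages. Names and structure here are illustrative, not Psyche's actual on-chain layout:

```python
import hashlib
import random

def assign_batches(seed: bytes, clients: list[str], n_batches: int) -> dict[int, str]:
    """Derive a batch -> client mapping from a shared seed.

    Every node runs this locally with the seed published by the
    coordinator and gets the identical mapping, so assignments need
    no extra coordination traffic.
    """
    rng = random.Random(hashlib.sha256(seed).digest())
    return {b: rng.choice(clients) for b in range(n_batches)}

# Two nodes with the same seed agree on the mapping:
a = assign_batches(b"run-42-step-7", ["node0", "node1", "node2"], 8)
b = assign_batches(b"run-42-step-7", ["node0", "node1", "node2"], 8)
print(a == b)  # True
```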
See live runs

psyche

Psyche isn’t just a wrapper: they’ve added concrete improvements on top of DeMo/DisTrO to boost practical efficiency:

  • Overlapped Training: a node no longer has to wait to receive all previous updates before starting the next step. Updates are generated in parallel while applying others. Result: much better GPU utilization and network latency stops being the bottleneck at scale.
  • quantize_1bit: they found that sending only the sign (±1) of the DCT components, plus their indices, retained almost all useful information—and compressed results by another >3× when this option is enabled.
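The sign-only idea is simple enough to sketch: keep ±1 per component plus one shared scale factor. This illustrates the principle only, not Psyche's actual wire format:

```python
import numpy as np

def quantize_1bit(coeffs: np.ndarray):
    """Keep only the sign of each kept DCT component plus one shared scale."""
    signs = np.sign(coeffs).astype(np.int8)   # +-1 per component, 1 bit each
    scale = float(np.mean(np.abs(coeffs)))    # single scalar sent alongside
    return signs, scale

def dequantize_1bit(signs: np.ndarray, scale: float) -> np.ndarray:
    return signs.astype(np.float32) * scale

vals = np.array([0.9, -1.1, 0.2, -0.8])
signs, scale = quantize_1bit(vals)
print(signs.tolist())  # [1, -1, 1, -1]
approx = dequantize_1bit(signs, scale)
```

Each kept component shrinks from a multi-byte float to a single bit (plus its index), which is where the extra >3× compression comes from.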

It’s implemented in Rust and relies on Iroh, a robust P2P networking stack.

https://github.com/PsycheFoundation/psyche

Main actors:

  • Coordinator (Solana smart contract): run metadata, participant list, state transitions, randomness source for assignments, synchronization point.
  • Clients: GPU nodes that perform training, can act as witnesses and verifiers, upload checkpoints.
  • Data providers: supply batches (local, HTTP, or TCP).

It’s ambitious, and some parts are still “researchy” (practical verification, large-scale robustness), but the good ideas are there. Psyche could really change the rules of the game—and we hope it does.

There’s more to say, but for today, that’s plenty.


Funding?#

They’re the anti-model of the opaque lab: transparent, flexible, sometimes a bit too permissive. Some people love it (control, auditability), others worry (risk of abuse). Their strength? Agility, community, and fast-deploying innovations. Their weakness? Scale, and a business model that’s not fully proven yet.

They play the mix: everything open (weights and models on Hugging Face), but with commercial bricks (Forge, Nous Chat). Paradigm put in $50M, there are compute partners like Lambda and Together AI, and the company is currently valued at over a billion dollars. There’s also talk of a NOUS token to power Psyche—a serious rumor, but nothing official yet as far as I know.


That’s a quick overview of the project. You might not hear about them everywhere, but they’re starting to carve out a real place in the AI ecosystem.

Hope you enjoyed it! Thanks for reading.

NousResearch
https://blog.ce-dev.eu/posts/en/nousresearch/
Author
Cedev
Published at
2026-01-10
License
CC BY-NC-SA 4.0