Distillation with Reasoning: Can DeepSeek R1 Teach Better Than Humans?

By Aishwarya Srinivasan|1/31/2025

TL;DR

Including reasoning “chains of thought” (CoT) in a model’s output significantly improves answer quality but increases inference cost.
Distillation transfers reasoning knowledge from an expensive teacher model to a more cost-effective student, reducing overall inference cost.
DeepSeek R1 can produce detailed CoT, making it an excellent teacher model.
Synthetic data generated by DeepSeek R1 may outperform data produced by human experts.

Introduction

The recent release of DeepSeek R1 has taken the AI community by storm, offering performance on par with leading frontier models—such as OpenAI’s o1—at a fraction of the cost. Still, R1 can be expensive for use cases with high traffic or low latency requirements.

DeepSeek R1’s strength lies in its explicit step-by-step reasoning. Before generating a final answer, it creates an internal “chain of thought” (CoT) to systematically reason through each problem. This process is a form of test-time computation, allowing the model to dynamically allocate more compute to complex problems. However, these extended reasoning sequences typically increase inference cost.

Distillation

Distillation is a method for transferring knowledge from a large, more powerful teacher model to a smaller, more cost-effective student model. According to the DeepSeek R1 paper, R1 is highly effective in this teacher role. Its detailed CoT sequences guide the student model to break down complex tasks into smaller, more manageable steps.


Comparing Distillation to Human-Labeled Data

Although fine-tuning with human-labeled data can produce specialized models, collecting both final answers and their corresponding reasoning steps is expensive. Distillation scales more easily: rather than relying on human annotations, the teacher model automatically generates the training data for the student.

A Side Note on Terminology

The term “distillation” can refer to different methods:

Distribution Distillation
- Aligns the student model’s output token distribution with the teacher’s using Kullback–Leibler divergence (KL-divergence).
- Works best when both models share the same architecture, tokenizer, and pre-training data.
Data Distillation
- Uses the teacher model to generate completions for a set of prompts.
- Fine-tunes the student model using a standard cross-entropy loss on these generated outputs, skipping the KL-divergence term.
- Allows the teacher and student to come from different model families and to use different tokenizers (though if the teacher uses specialized tokens like __, it can be beneficial for both models to recognize them).

In this post, we focus on data distillation because it supports a wider variety of student–teacher pairs.
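To make the difference between the two objectives concrete, here is a minimal sketch on a toy four-token vocabulary. The logits are made-up numbers, the function names are illustrative, and the KL direction (teacher relative to student) is an assumption, since the direction is not specified above. It is not the training code used for the experiments in this post.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(teacher_probs, student_probs):
    """KL(teacher || student) for one next-token distribution (assumed direction)."""
    return sum(
        p * math.log(p / q)
        for p, q in zip(teacher_probs, student_probs)
        if p > 0
    )

def cross_entropy(student_probs, target_index):
    """Negative log-likelihood of the single token the teacher actually emitted."""
    return -math.log(student_probs[target_index])

# Toy next-token logits over a shared 4-token vocabulary (made-up numbers).
teacher = softmax([2.0, 1.0, 0.1, -1.0])
student = softmax([1.5, 1.2, 0.3, -0.5])

# Distribution distillation: needs the teacher's full probability vector,
# which is why a shared tokenizer matters.
distribution_loss = kl_divergence(teacher, student)

# Data distillation: needs only the teacher's generated text. Here we pretend
# the teacher emitted token 0, its most likely token.
sampled_token = 0
data_loss = cross_entropy(student, sampled_token)

print(f"KL loss: {distribution_loss:.4f}, cross-entropy loss: {data_loss:.4f}")
```

The cross-entropy path never touches the teacher's probabilities, only the token it produced, which is what lets the teacher and student differ in architecture and tokenizer.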

Data Generation

Training data is often a bottleneck in model development. In a recent post, we explored how to generate labels by combining model output with a verification function. Distillation takes a different approach, using a teacher model to synthesize missing completions.

DeepSeek R1 stands out because it not only provides final answers but also reveals its step-by-step chain of thought, unlike other reasoning models that keep this internal process hidden. If your dataset includes ground truth answers, you can identify high-quality synthetic CoTs through rejection sampling, keeping only the best chains to further improve your fine-tuned model. Rejection sampling removes incorrect examples either by comparing the generated data against ground truth labels or by applying a user-defined validation function. From an interface point of view, the validation function resembles the verifiable reward function used by value-model-free RL methods such as those described in our recent blog post.
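As a rough illustration of rejection sampling against ground truth labels, the sketch below keeps only the teacher completions whose final number matches the label. The function names, the "last number in the text" extraction heuristic, and the example completions are all assumptions made for illustration. A real pipeline would obtain the candidates from the teacher model and would likely use a stricter answer parser.

```python
import re
from typing import Callable, List, Optional

def extract_final_answer(completion: str) -> Optional[str]:
    """Heuristic: treat the last number in the completion as the final answer."""
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", completion)
    return numbers[-1].replace(",", "") if numbers else None

def matches_ground_truth(completion: str, ground_truth: str) -> bool:
    """Validation function: compare the extracted answer to the label numerically."""
    predicted = extract_final_answer(completion)
    if predicted is None:
        return False
    return float(predicted) == float(ground_truth.replace(",", ""))

def rejection_sample(
    candidates: List[str],
    ground_truth: str,
    is_valid: Callable[[str, str], bool] = matches_ground_truth,
) -> List[str]:
    """Keep only candidates accepted by the validation function."""
    return [c for c in candidates if is_valid(c, ground_truth)]

# Made-up teacher completions for a single prompt whose label is "18".
candidates = [
    "16 - 3 - 4 = 9 eggs are left. 9 * 2 = 18. The answer is 18",
    "16 - 3 = 13 eggs are left. 13 * 2 = 26. The answer is 26",
]
kept = rejection_sample(candidates, "18")
print(len(kept))
```

The `is_valid` parameter is where a user-defined validation function would plug in when no ground truth label is available.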


Case Study: GSM8K

GSM8K (Grade School Math 8K) is a dataset of 8.5K diverse grade-school math word problems. Each data point includes:

A problem description.
A human expert’s chain of thought.
The final answer.

We expanded this dataset by adding:

Synthetic R1 reasoning, i.e., the CoT generated by DeepSeek R1.

Then, we fine-tuned three variants of Llama-3.1-8B-Instruct with LoRA, each with a different training target:

Direct Answer Only: Generate the final answer without showing reasoning.
Human Expert CoT: Generate the final answer alongside a reasoning chain resembling the human expert’s.
Synthetic R1 CoT: Generate the final answer alongside DeepSeek R1’s synthetic reasoning chain.
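The three training targets above could be assembled along the following lines. This is a sketch under stated assumptions: it relies on the public GSM8K convention that each answer string ends with a `####` marker followed by the final number, while the `Final answer:` output template and the function names are hypothetical and not necessarily what was used in this study.

```python
from typing import Dict, Tuple

def split_gsm8k_answer(answer: str) -> Tuple[str, str]:
    """Split a GSM8K-style answer into (human reasoning, final answer) at '####'."""
    reasoning, _, final = answer.rpartition("####")
    return reasoning.strip(), final.strip()

def build_targets(answer: str, r1_reasoning: str) -> Dict[str, str]:
    """Build one completion string per fine-tuning variant (hypothetical template)."""
    human_reasoning, final = split_gsm8k_answer(answer)
    return {
        "direct_answer_only": final,
        "human_expert_cot": f"{human_reasoning}\nFinal answer: {final}",
        "synthetic_r1_cot": f"{r1_reasoning}\nFinal answer: {final}",
    }

# Example record; the R1 reasoning string is a made-up placeholder.
gsm8k_answer = (
    "Natalia sold 48/2 = 24 clips in May.\n"
    "Natalia sold 48+24 = 72 clips altogether.\n"
    "#### 72"
)
targets = build_targets(gsm8k_answer, "Let me work through this step by step ...")
for variant, completion in targets.items():
    print(variant, "->", repr(completion))
```

Each variant sees the same prompt, so any accuracy difference comes from the completion the student is trained to produce.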

The table below summarizes average accuracy and reasoning length:

Variant	Average accuracy	Average reasoning length
0. Llama-3.1-8B-Instruct, 5-shot CoT	0.78*	N/A
1. Direct Answer Only	0.29	N/A
2. Human Expert CoT	0.68	280 chars
3. Synthetic R1 CoT	0.87	2k chars

* Note: The accuracy for the 5-shot baseline may differ from numbers reported elsewhere due to different evaluation setups. The key focus is on comparing relative performance across distillation methods, not on beating other models.
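For readers who want to reproduce the two table columns on their own runs, here is one plausible way to compute them. It assumes exact string match on the final answer and measures reasoning length in characters. The actual evaluation setup behind the table is not described in this post, so treat the function and its input format as illustrative.

```python
from typing import Dict, List, Tuple

def evaluate(
    predictions: List[Tuple[str, str]],
    references: List[str],
) -> Dict[str, float]:
    """Compute average accuracy and average reasoning length in characters.

    predictions: (reasoning, final_answer) pairs produced by the student model.
    references: ground truth final answers, in the same order.
    """
    correct = [
        predicted.strip() == gold.strip()
        for (_, predicted), gold in zip(predictions, references)
    ]
    reasoning_lengths = [len(reasoning) for reasoning, _ in predictions]
    return {
        "average_accuracy": sum(correct) / len(correct),
        "average_reasoning_chars": sum(reasoning_lengths) / len(reasoning_lengths),
    }

# Tiny made-up example: one correct and one incorrect prediction.
result = evaluate(
    predictions=[("16 - 7 = 9, 9 * 2 = 18", "18"), ("3 + 4", "7")],
    references=["18", "8"],
)
print(result)
```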

From this study, synthetic reasoning CoTs from DeepSeek R1 appear superior to human-expert CoTs in boosting performance (0.87 vs. 0.68 average accuracy), albeit with a higher inference cost due to their longer length (about 2k vs. 280 characters).
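The trade-off can be put in rough numbers using only the figures from the table. The sketch below takes "2k chars" as 2,000 characters and treats output length as a proxy for decoding cost. Both are simplifying assumptions, since real cost depends on tokenization, pricing, and serving setup.

```python
# Figures taken from the table above; "2k chars" is assumed to mean 2,000.
human_cot_chars = 280
r1_cot_chars = 2000
human_cot_accuracy = 0.68
r1_cot_accuracy = 0.87

# How much longer the R1-style output is than the human-style output.
length_ratio = r1_cot_chars / human_cot_chars

# Absolute accuracy gain from switching training targets.
accuracy_gain = round(r1_cot_accuracy - human_cot_accuracy, 2)

print(f"Output is about {length_ratio:.1f}x longer for +{accuracy_gain:.2f} accuracy")
```

Under these assumptions, the student trained on R1 reasoning generates roughly seven times more text per answer in exchange for a 0.19 gain in accuracy.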

Fireworks AI Inference and Fine-Tuning Platform

DeepSeek R1 is available on the Fireworks AI platform. A user-friendly distillation interface will soon be part of FireOptimizer. If you need earlier access, please get in touch to explore options.

Conclusions

By incorporating reasoning-based data through distillation, organizations can drastically improve model performance without bearing the full burden of human-annotated datasets. DeepSeek R1’s ability to produce long, high-quality reasoning chains makes it a powerful teacher model—showing that, in some cases, the machine might just out-teach the human.