
Qwen 3 on Fireworks AI: Controllable Chain-of-Thought and Tool Calling at Frontier Scale
By Aishwarya Srinivasan | 5/5/2025
TL;DR
- Reasoning meets function calls. Qwen 3 now streams an explicit reasoning trace and the exact JSON tool call in the same completion.
- Turbo or stealth, your choice. Flip reasoning_effort="none" (or use the /think / /no_think tags) to trade transparency for raw throughput on the fly.
- Mixture-of-Experts giant, pay-as-you-go. The 235B-parameter / 22B-active Qwen3-235B-A22B runs serverlessly on Fireworks.
- Drop-in OpenAI compatibility. Use the Fireworks endpoint with the official OpenAI client; everything else stays the same.
Why this release matters
Until now, open-source LLMs forced a choice: show the chain of thought or call tools deterministically. Qwen 3’s new architecture does both in one pass, and keeps the reasoning block segregated so downstream code can ignore or audit it at will.
Pair that with a 128-expert MoE that activates only eight experts per token (≈22B live parameters) and you get near-frontier quality at a fraction of the compute, fully Apache-2.0 and live on Fireworks today (Fireworks - Qwen3 235B-A22B model).
15-second quick-start
from openai import OpenAI
import os, json

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key=os.environ["FIREWORKS_API_KEY"],
)

messages = [{"role": "user",
             "content": "What's the weather in Boston today?"}]

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return current weather for a US city",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

# Reasoning-based tool call: the model thinks out loud, then calls the tool
resp = client.chat.completions.create(
    model="accounts/fireworks/models/qwen3-235b-a22b",
    messages=messages,
    tools=tools,
    max_tokens=4096,
    temperature=0.6,
)
first = resp.choices[0].message
print(first.content)      # contains <think> … </think>
print(first.tool_calls)

# Non-reasoning tool call: skip the trace, emit only the function call
resp = client.chat.completions.create(
    model="accounts/fireworks/models/qwen3-235b-a22b",
    messages=messages,
    tools=tools,
    max_tokens=4096,
    temperature=0.6,
    extra_body={
        "reasoning_effort": "none",
    },
)
second = resp.choices[0].message
print(second.content)     # does not contain <think> … </think>
print(second.tool_calls)
The first call returns the chain-of-thought trace plus the tool call; the second skips thinking entirely and emits only the tool call.
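To close the loop, execute the returned call and hand the result back in a "tool" message so the model can compose a final answer. A minimal sketch; get_weather here is a hypothetical local stub, not part of the Fireworks API:
def get_weather(location: str) -> str:
    # Hypothetical stub: swap in a real weather service.
    return f"Sunny, 18°C in {location}"

call = first.tool_calls[0]
args = json.loads(call.function.arguments)   # e.g. {"location": "Boston"}

messages.append(first)                       # keep the assistant turn, tool call included
messages.append({
    "role": "tool",
    "tool_call_id": call.id,
    "content": get_weather(**args),
})

final = client.chat.completions.create(
    model="accounts/fireworks/models/qwen3-235b-a22b",
    messages=messages,
    tools=tools,
)
print(final.choices[0].message.content)      # natural-language weather answer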
Under the hood: The Hybrid Thinking Switch
- Thinking mode (reasoning_effort != "none"): generates <think> … </think> followed by a final answer. Recommended sampling: temperature ≈ 0.6, top_p ≈ 0.95, top_k = 20.
- Non-thinking mode (reasoning_effort = "none" or a /no_think tag): omits the reasoning block to save tokens and latency. Use slightly spicier sampling: temperature ≈ 0.7, top_p ≈ 0.8. See the tag-based sketch below.
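Here is that sketch: a tag-based variant of the second quick-start call, reusing the client and tools from above. The exact placement of /no_think at the end of the user turn is an assumption based on Qwen's soft-switch convention:
resp = client.chat.completions.create(
    model="accounts/fireworks/models/qwen3-235b-a22b",
    messages=[{"role": "user",
               "content": "What's the weather in Boston today? /no_think"}],
    tools=tools,
    temperature=0.7,  # non-thinking mode: slightly spicier sampling
    top_p=0.8,
)
print(resp.choices[0].message.content)      # no <think> block
print(resp.choices[0].message.tool_calls)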
Because the trace sits in its own tag, you can log, redact, or meter it independently: the same pattern we covered in Constrained Generation with Reasoning.
Model card (Fireworks hosted)
| Qwen 3-235B-A22B | |
|---|---|
| Total parameters | 235B (Mixture-of-Experts) |
| Active parameters | 22B (8 of 128 experts) |
| Layers | 94 |
| Attention heads (Q / KV) | 64 / 4 |
| Context window | 32,768 tokens (native); 131,072 with YaRN |
| License | Apache-2.0 |
| Endpoint | accounts/fireworks/models/qwen3-235b-a22b |
Performance tips
- Long answers: allocate at least 4K output tokens for essays, and up to 32K for book-length generations.
- Cost & speed control: invoke reasoning only on the turns that need it, then strip <think> before storage, as in the helper sketch below.
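A minimal stripping helper, assuming the trace arrives as a single well-formed <think> … </think> block at the start of the content:
import re

def strip_think(text: str) -> str:
    # Drop the reasoning block (and any trailing whitespace) before persisting.
    return re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL)

clean = strip_think(first.content or "")   # `first` from the quick-start above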
Using the Fireworks endpoint
Our endpoint is fully OpenAI compatible; please give it a try!
curl https://api.fireworks.ai/inference/v1/chat/completions \
-H "Authorization: Bearer $FIREWORKS_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model":"accounts/fireworks/models/qwen3-235b-a22b",
"messages":[{"role":"user","content":"Translate 这本书多少钱?"}],
"reasoning_effort": "none"
}'
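Streaming works through the same interface. A minimal Python sketch with the quick-start client; assuming the <think> block is streamed inline ahead of the answer, the trace arrives token by token:
stream = client.chat.completions.create(
    model="accounts/fireworks/models/qwen3-235b-a22b",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    stream=True,
    max_tokens=1024,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)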
Closing thoughts
With Qwen 3-235B-A22B, open-source finally gets a model that:
- Reveals its chain of thought when you ask.
- Emits tool calls in the exact same request.
- Scales to frontier-size contexts, all under Apache-2.0.
No secret weights, no bespoke SDKs. Just point your existing OpenAI-style client at Fireworks and build.
Questions, feedback, or cool demos? Drop by our Discord or tag us on X.
Happy shipping!