The Ultimate LLM Benchmark: The Trader in the Tower
I am currently building a real game (codename "The cave of confusion") with LLMs. After Mistral Large turned out to be too expensive and GPT-5-mini too slow, it was time to test various models thoroughly. The results were surprising.
The Task
The LLM controls a trader who sits in a tower in the western part of the level. The trader has 12 health potions on offer. There is no money in our world. Every trade consists of these phases:
- Trader and player make offers like “I’ll give you a Baklava for a potion.”
- They come to an agreement.
- One party gives the agreed-upon items.
- The other party gives the agreed-upon items.
Afterwards, there may be follow-up trades.
The catch: the LLM only has a single action, "Give an item." If it has agreed with the player to hand over 3 items, it can only do so via 3 consecutive actions.
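To make that constraint concrete, here is a minimal Python sketch of what such a multi-item handover could look like. The action name `give_item` and the field names are my assumptions for illustration, not the actual plugin protocol:

```python
import json

def give_items(item: str, count: int) -> list[str]:
    """A trade of `count` items must be expressed as `count` single
    'give' actions, since giving one item is the only action available.
    The JSON field names here are illustrative, not the real schema."""
    return [
        json.dumps({"action": "give_item", "item": item,
                    "say": f"Here is potion {i + 1} of {count}."})
        for i in range(count)
    ]

# A "3 potions" agreement becomes three separate action messages:
for msg in give_items("health potion", 3):
    print(msg)
```

This is exactly the step where most of the models below stumble: they either stop after one action or keep giving items that were never agreed on.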
The Test
The test examines these criteria:
- understands the situation: do we have the impression that the LLM understands it is a trader who wants to conduct a trade?
- capable of true dialogue: do we receive plausible replies to our responses, or always the same phrases?
- answers what is in stock correctly: can it correctly answer what it has in stock when asked? Does it answer at all? Does it hallucinate?
- achieves the first trade correctly: can I trade a Baklava for a health potion?
- achieves further trade correctly: can I conduct another trade afterwards?
- performance: how long is the response time (measured on my PC for local models, otherwise the server's response time)?
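The performance criterion boils down to measuring wall-clock round-trip time. A minimal sketch of a timing wrapper (the commented endpoint URL is LMStudio's default local server; the payload and everything else here is illustrative):

```python
import time

def timed_call(fn, *args, **kwargs):
    """Call `fn` and return (result, elapsed_seconds).
    Used to measure the round trip of a single model response."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Usage sketch: wrap the chat-completion request, e.g. against the
# OpenAI-compatible endpoint LMStudio exposes:
# reply, seconds = timed_call(requests.post,
#     "http://localhost:1234/v1/chat/completions", json=payload)
```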
Summary
For impatient readers, here is the summary in advance:
Model | Understands the situation | Genuine dialogue | Correct about what’s in stock | First trade done correctly | Further trade done correctly | Performance | Overall rating / comment |
---|---|---|---|---|---|---|---|
GPT-5-nano | Partially | No | Partially | No | n/a | ★ | Unusable |
GPT-5-mini | Yes | Yes | Yes | Yes (a bit awkward) | Yes | ★★ | Quite usable, but too slow |
GPT-5 | Yes | Yes (human-like) | Yes | Yes | Yes | ★★ | Exceptionally good, but too slow |
openai/gpt-oss-20b (LMStudio) | Partially | No | No | No | No | ★★★★ | Confused, but fast |
qwen/qwen3-30b-a3b-2507 | No | No | No | No | No | – | Size doesn’t help |
apriel-1.5-15b-thinker | No | No | No | No | No | ★★★ | Already failed at JSON |
zephyr-7b-beta | No | No | No | No | No | ★★★★★ | Fast, but useless |
qwen/qwen3-4b-thinking-2507 | No | No | No | No | No | ★★ | Chaotic thinking, no answer |
deepseek-r1-0528-qwen3-8b | No | No | No | No | No | ★★ | Just like qwen3-4b-thinking |
mistralai/mistral-7b-instruct-v0.3 | No | No | No | No | No | ★★★★★ | Very fast, but dull |
gemma-3-12b-it | No | No | No | No | No | ★★★★★ | Like Mistral, but nothing behind it |
Overall, it probably comes down to openai/gpt-oss-20b for free basic tests and GPT-5-mini for the proper tests.
GPT-5-nano
- understands the situation: partially
- capable of true dialogue: no – it repeats itself a lot.
- answers what is in stock correctly: partially – first it lists sweets, then it claims to have no sweets in stock.
- achieves the first trade correctly: it took my diamond and gave me nothing for it. Almost mockingly, it then offered me my diamond for trade.
- achieves further trade correctly: n/a
- performance: slow, sometimes over 10 seconds per round trip.
Overall: Unusable.
GPT-5-mini
- understands the situation: yes
- capable of true dialogue: almost very good – it responds to me well, uses my name, etc. Sometimes it repeats itself.
- answers what is in stock correctly: yes
- achieves the first trade correctly: yes, a bit awkward but fine.
- achieves further trade correctly: yes, when asked for my second potion, it gave it to me and did not hand out any (unagreed) third potion.
- performance: over 20 seconds per response, but at one-fifth the cost of GPT-5 and one-seventh that of Mistral Large, it's pretty good.
Overall: quite usable, though still too slow.
GPT-5
Crazy. GPT-5 needs 25 seconds (!!!) per action, but the actions are perfect. First it negotiates, then it hands over the items correctly. If I get 3 potions for a diamond, it gives me the potions in 3 successive actions, each with a fitting comment. Exactly right. Still, those 25 seconds have to come down to under a second.
- understands the situation: yes
- capable of true dialogue: excellent, indistinguishable from a human
- answers what is in stock correctly: yes, it even later mentions the items I gave it
- achieves the first trade correctly: yes, and it was “2 potions for a Baklava.”
- achieves further trade correctly: yes, wonderfully.
- performance: over 20 seconds per response
Overall: exceptionally usable, but too slow.
openai/gpt-oss-20b in LMStudio
- understands the situation: partially – it only half grasps the concept of trading.
- capable of true dialogue: no, the machine constantly repeats itself and is confused.
- answers what is in stock correctly: no. It babbles about treats it thinks it has, never once mentioning its health potions.
- achieves the first trade correctly: no. When I offer my Baklava and ask for a trade, it simply starts handing me all of its potions.
- achieves further trade correctly: no.
- performance: depends on the computer. For me, quite fast – around 2 seconds per round trip.
qwen/qwen3-30b-a3b-2507
My largest local model (13 GB) shows: size does not matter.
- understands the situation: not at all
- capable of true dialogue: not at all, it always says the same sentence. Even when I give it my Baklava, it only repeats the same sentence.
- answers what is in stock correctly: no
- achieves the first trade correctly: no
- achieves further trade correctly: no
- performance: n/a
apriel-1.5-15b-thinker
It is already overwhelmed by the task of producing a syntactically correct (JSON) answer that the plugin can parse. Mighty thinker indeed.
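Model replies often wrap their JSON in code fences or surrounding chatter, so a harness benefits from tolerant parsing before giving up on a model entirely. A minimal sketch of such a recovery step (illustrative; not the plugin's actual code):

```python
import json
import re

def extract_json(reply: str):
    """Try to recover a JSON object from a model reply that may wrap it
    in markdown code fences or chatter. Returns None on failure."""
    # Strip markdown code fences first.
    reply = re.sub(r"```(?:json)?", "", reply)
    # Find the first {...} span and try to parse it.
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
```

Even with such a lenient parser, this model produced nothing usable.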
zephyr-7b-beta
This model showed surprising initiative by simply leaving its tower and coming to me. Afterwards, however, it was unable to start a trade. Its (internal) explanations were extremely short.
- understands the situation: no
- capable of true dialogue: no – it always repeated the same three phrases.
- answers what is in stock correctly: no
- achieves the first trade correctly: no
- achieves further trade correctly: no
- performance: very good, about one second per response
qwen/qwen3-4b-thinking-2507
No matter how many output tokens you allow, this thinking model delivers only the beginning of a long and quite chaotic, DeepSeek-style thought process ("Wait!", then thinking in another direction, "Wait!"…) instead of an answer, taking more than 20 seconds. I didn't bother setting special parameters for this model to hide its reasoning.
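If you do want to hide the reasoning, one common workaround is to strip the `<think>…</think>` blocks such models emit before (or instead of) their answer. A minimal sketch, assuming the usual `<think>` tag convention:

```python
import re

def strip_reasoning(text: str) -> str:
    """Remove <think>...</think> blocks that reasoning models emit
    before their final answer. An unclosed <think> block, common when
    the token budget runs out, leaves nothing usable, so everything
    after it is dropped too."""
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    text = re.sub(r"<think>.*", "", text, flags=re.DOTALL)
    return text.strip()
```

For this model it wouldn't have helped anyway: the reasoning never terminated, so there was no answer left after stripping.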
deepseek/deepseek-r1-0528-qwen3-8b
Same as above.
mistralai/mistral-7b-instruct-v0.3
A disappointing model. It always answers with one of the same three sentences.
- understands the situation: no
- capable of true dialogue: no
- answers what is in stock correctly: no
- achieves the first trade correctly: no
- achieves further trade correctly: no
- performance: very good, about 2 seconds per response
gemma-3-12b-it
Almost exactly the same behavior as the Mistral model: just bland stock phrases.
- understands the situation: no
- capable of true dialogue: no
- answers what is in stock correctly: no
- achieves the first trade correctly: no
- achieves further trade correctly: no
- performance: very good, about 2 seconds per response