AI and I

A blog about AI, implications, and experiments by Karlheinz Agsteiner

The Ultimate LLM Benchmark: the Trader in the Tower

I am now in the process of building a real game (codename "The cave of confusion") with LLMs. After Mistral Large turned out to be excessively expensive and GPT-5-mini too slow, it was time to test various models thoroughly. The results were surprising.

The Task

The LLM controls a trader who sits in a tower in the western part of the level. The trader has 12 health potions to offer. In our world there is no money. Every trade consists of these phases:

  1. Trader and player make offers like “I’ll give you a Baklava for a potion.”
  2. They come to an agreement.
  3. One party gives the agreed-upon items.
  4. The other party gives the agreed-upon items.

Afterwards, there may be follow-up trades.
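
Sketched as a tiny state machine, assuming the game tracks these phases explicitly (the post shows no code, so all names here are illustrative):

```python
from enum import Enum, auto

class TradePhase(Enum):
    """Illustrative phases of a single trade; not the actual game code."""
    NEGOTIATING = auto()      # 1. both sides exchange offers
    AGREED = auto()           # 2. an offer has been accepted
    FIRST_DELIVERY = auto()   # 3. one party hands over the agreed items
    SECOND_DELIVERY = auto()  # 4. the other party hands over its items
    DONE = auto()             # afterwards, a follow-up trade may begin
```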

The problem is: the LLM only has the action “Give an item.” If it has agreed with the player that it must give 3 items, then that is only possible via 3 consecutive actions.
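
Concretely: if the trader owes the player three potions, the model has to emit something like the following, one action per item. The JSON shape and the field names are my assumption; the post does not show the actual plugin format.

```python
# Hypothetical action format -- one "give an item" action per owed item.
actions = [
    {"action": "give_item", "item": "health potion", "comment": "Here is the first potion."},
    {"action": "give_item", "item": "health potion", "comment": "The second, as agreed."},
    {"action": "give_item", "item": "health potion", "comment": "And the last one. A pleasure!"},
]
```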

The Test

The test examines these criteria:
- understands the situation: do we have the impression that the LLM understands it is a trader who wants to conduct a trade?
- capable of true dialogue: do we receive plausible replies to our responses, or always the same phrases?
- answers what is in stock correctly: can it correctly answer what it has in stock when asked? Does it answer at all? Does it hallucinate?
- achieves the first trade correctly: can I trade a Baklava for a health potion?
- achieves further trade correctly: can I conduct another trade afterwards?
- performance: how long was the response time (for local models on my PC, otherwise how long the server took)?

Summary

For impatient readers, here is the summary in advance:

| Model | Understands the situation | Genuine dialogue | Correct about what’s in stock | First trade done correctly | Further trade done correctly | Performance | Overall rating / comment |
|---|---|---|---|---|---|---|---|
| GPT-5-nano | Partially | No | Partially | No | n/a | | Unusable |
| GPT-5-mini | Yes | Yes | Yes | Yes (a bit awkward) | Yes | ★★ | Quite usable, but too slow |
| GPT-5 | Yes | Yes (human-like) | Yes | Yes | Yes | ★★ | Exceptionally good, but too slow |
| openai/gpt-oss-20b (LMStudio) | Partially | No | No | No | No | ★★★★ | Confused, but fast |
| qwen/qwen3-30b-a3b-2507 | No | No | No | No | No | | Size doesn’t help |
| apriel-1.5-15b-thinker | No | No | No | No | No | ★★★ | Already failed at JSON |
| zephyr-7b-beta | No | No | No | No | No | ★★★★★ | Fast, but useless |
| qwen/qwen3-4b-thinking-2507 | No | No | No | No | No | ★★ | Chaotic thinking, no answer |
| deepseek-r1-0528-qwen3-8b | No | No | No | No | No | ★★ | Just like qwen3-4b-thinking |
| mistralai/mistral-7b-instruct-v0.3 | No | No | No | No | No | ★★★★★ | Very fast, but dull |
| gemma-3-12b-it | No | No | No | No | No | ★★★★★ | Like Mistral, but nothing behind it |

Overall, it probably comes down to openai/gpt-oss-20b for basic tests at no cost and GPT-5-mini for the proper tests.

GPT-5-nano

Overall: Unusable.

GPT-5-mini

Overall: quite usable, though still too slow.

GPT-5

Crazy. GPT-5 needs 25 seconds (!!!) per action, but the actions are perfect. First it negotiates, then it hands over the items correctly. If I get 3 potions for a diamond, it gives me the potions in 3 successive actions, each with a matching comment. Exactly as it should be. The 25 seconds, though, would still have to come down to under a second.

Overall: exceptionally usable, but too slow.

openai/gpt-oss-20b in LMStudio

qwen/qwen3-30b-a3b-2507

My largest local model (13 GB) shows: size does not matter.
- understands the situation: not at all
- capable of true dialogue: not at all, it always says the same sentence, and keeps repeating it even when I hand over my Baklava.
- answers what is in stock correctly: no
- achieves the first trade correctly: no
- achieves further trade correctly: no
- performance:

apriel-1.5-15b-thinker

It is already overwhelmed by the task of producing a syntactically correct JSON answer that the plugin can understand. Mighty thinker indeed.
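
For context, this is roughly the hurdle it fails at. A minimal sketch of the validation the plugin side has to do, assuming a required "action" key that the post does not actually specify:

```python
import json

def parse_action(reply: str):
    """Return the action dict from an LLM reply, or None if malformed.
    The required "action" key is an assumption; the real schema isn't shown."""
    try:
        action = json.loads(reply)
    except json.JSONDecodeError:
        return None  # apriel-1.5-15b-thinker already fails at this step
    if not isinstance(action, dict) or "action" not in action:
        return None
    return action
```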

zephyr-7b-beta

This model showed surprising initiative by simply leaving its tower and coming over to me. Afterwards, however, it was unable to start a trade. Its (internal) explanations were extremely short.

qwen/qwen3-4b-thinking-2507

This thinking model, no matter how many output tokens you allow it, delivers the beginning of a long and quite chaotic, DeepSeek-style thought process (“Wait!”, then thinking in another direction, “Wait!”…) instead of an answer. That took more than 20 seconds. I didn’t bother setting up special parameters for this model to hide the reasoning.
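
Many local thinking models wrap their chain of thought in <think> tags. Assuming this model follows that convention (the post doesn’t confirm it), one could strip the reasoning before parsing; it wouldn’t help here, since the output is all reasoning and never reaches an answer:

```python
import re

def strip_reasoning(reply: str) -> str:
    """Drop a <think>...</think> block before parsing the reply.
    Useless for this model: its output is only the start of a thought
    process, so nothing remains after stripping."""
    return re.sub(r"<think>.*?</think>", "", reply, flags=re.DOTALL).strip()
```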

deepseek/deepseek-r1-0528-qwen3-8b

Same as above.

mistralai/mistral-7b-instruct-v0.3

A disappointing model. It always says one of three sentences.
- understands the situation: no
- capable of true dialogue: no
- answers what is in stock correctly: no
- achieves the first trade correctly: no
- achieves further trade correctly: no
- performance: very good, about 2 seconds per response

gemma-3-12b-it

Almost exactly the same behavior as the Mistral model: just bland stock phrases.