AI and I

A blog about AI, implications, and experiments by Karlheinz Agsteiner

The Ultimate LLM Benchmark: the Trader in the Tower

I am now in the process of building a real game (codename "The cave of confusion") with LLMs. After Mistral Large turned out to be excessively expensive and GPT-5-mini too slow, it was time to test various models thoroughly. The results were surprising.

The Task

The LLM controls a trader who sits in a tower in the western part of the level. The trader has 12 health potions to offer. In our world there is no money. Every trade consists of these phases:

  1. Trader and player make offers like “I’ll give you a Baklava for a potion.”
  2. They come to an agreement.
  3. One party gives the agreed-upon items.
  4. The other party gives the agreed-upon items.

Afterwards, there may be follow-up trades.
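
Sketched as a tiny state machine, assuming the game tracks these phases explicitly (the post shows no code, so all names here are illustrative):

```python
from enum import Enum, auto

class TradePhase(Enum):
    """Illustrative phases of a single trade; not the actual game code."""
    NEGOTIATING = auto()      # 1. both sides exchange offers
    AGREED = auto()           # 2. an offer has been accepted
    FIRST_DELIVERY = auto()   # 3. one party hands over the agreed items
    SECOND_DELIVERY = auto()  # 4. the other party hands over its items
    DONE = auto()             # afterwards, a follow-up trade may begin
```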

The problem is: the LLM only has the action “Give an item.” If it has agreed with the player that it must give 3 items, then that is only possible via 3 consecutive actions.
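
Concretely: if the trader owes the player three potions, the model has to emit something like the following, one action per item. The JSON shape and the field names are my assumption; the post does not show the actual plugin format.

```python
# Hypothetical action format -- one "give an item" action per owed item.
actions = [
    {"action": "give_item", "item": "health potion", "comment": "Here is the first potion."},
    {"action": "give_item", "item": "health potion", "comment": "The second, as agreed."},
    {"action": "give_item", "item": "health potion", "comment": "And the last one. A pleasure!"},
]
```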

The Test

The test examines these criteria:
- understands the situation: do we have the impression that the LLM understands it is a trader who wants to conduct a trade?
- capable of true dialogue: do we receive plausible replies to our responses, or always the same phrases?
- answers what is in stock correctly: can it correctly answer what it has in stock when asked? Does it answer at all? Does it hallucinate?
- achieves the first trade correctly: can I trade a Baklava for a health potion?
- achieves further trade correctly: can I conduct another trade afterwards?
- performance: how long was the response time (for local models on my PC, otherwise how long the server took)?

Summary

For impatient readers, here is the summary in advance:

| Model | Understands the situation | Genuine dialogue | Correct about what’s in stock | First trade done correctly | Further trade done correctly | Performance | Overall rating / comment |
|---|---|---|---|---|---|---|---|
| GPT-5-nano | Partially | No | Partially | No | n/a | | Unusable |
| GPT-5-mini | Yes | Yes | Yes | Yes (a bit awkward) | Yes | ★★ | Quite usable, but too slow |
| GPT-5 | Yes | Yes (human-like) | Yes | Yes | Yes | ★★ | Exceptionally good, but too slow |
| openai/gpt-oss-20b (LMStudio) | Partially | No | No | No | No | ★★★★ | Confused, but fast |
| qwen/qwen3-30b-a3b-2507 | No | No | No | No | No | | Size doesn’t help |
| apriel-1.5-15b-thinker | No | No | No | No | No | ★★★ | Already failed at JSON |
| zephyr-7b-beta | No | No | No | No | No | ★★★★★ | Fast, but useless |
| qwen/qwen3-4b-thinking-2507 | No | No | No | No | No | ★★ | Chaotic thinking, no answer |
| deepseek-r1-0528-qwen3-8b | No | No | No | No | No | ★★ | Just like qwen3-4b-thinking |
| mistralai/mistral-7b-instruct-v0.3 | No | No | No | No | No | ★★★★★ | Very fast, but dull |
| gemma-3-12b-it | No | No | No | No | No | ★★★★★ | Like Mistral, but nothing behind it |

Overall, it probably comes down to openai/gpt-oss-20b for basic tests at no cost and GPT-5-mini for the proper tests.

GPT-5-nano

Overall: Unusable.

GPT-5-mini

Overall: quite usable, though still too slow.

GPT-5

Crazy. GPT-5 needs 25 seconds (!!!) per action, but the actions are perfect. First it negotiates, then it hands over the items correctly. If I get 3 potions for a diamond, it gives me the potions in 3 successive actions, each with a matching comment. Exactly as it should be. The 25 seconds, though, would still have to come down to under a second.

Overall: exceptionally usable, but too slow.

openai/gpt-oss-20b in LMStudio

qwen/qwen3-30b-a3b-2507

My largest local model (13 GB) shows: size does not matter.
- understands the situation: not at all
- capable of true dialogue: not at all, it always says the same sentence, and keeps repeating it even when I hand over my Baklava.
- answers what is in stock correctly: no
- achieves the first trade correctly: no
- achieves further trade correctly: no
- performance:

apriel-1.5-15b-thinker

It is already overwhelmed by the task of producing a syntactically correct JSON answer that the plugin can understand. Mighty thinker indeed.
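
For context, this is roughly the hurdle it fails at. A minimal sketch of the validation the plugin side has to do, assuming a required "action" key that the post does not actually specify:

```python
import json

def parse_action(reply: str):
    """Return the action dict from an LLM reply, or None if malformed.
    The required "action" key is an assumption; the real schema isn't shown."""
    try:
        action = json.loads(reply)
    except json.JSONDecodeError:
        return None  # apriel-1.5-15b-thinker already fails at this step
    if not isinstance(action, dict) or "action" not in action:
        return None
    return action
```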

zephyr-7b-beta

This model showed surprising initiative by simply leaving its tower and coming over to me. Afterwards, however, it was unable to start a trade. Its (internal) explanations were extremely short.

qwen/qwen3-4b-thinking-2507

This thinking model, no matter how many output tokens you allow it, delivers the beginning of a long and quite chaotic, DeepSeek-style thought process (“Wait!”, then thinking in another direction, “Wait!”…) instead of an answer. That took more than 20 seconds. I didn’t bother setting up special parameters for this model to hide the reasoning.
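
Many local thinking models wrap their chain of thought in <think> tags. Assuming this model follows that convention (the post doesn’t confirm it), one could strip the reasoning before parsing; it wouldn’t help here, since the output is all reasoning and never reaches an answer:

```python
import re

def strip_reasoning(reply: str) -> str:
    """Drop a <think>...</think> block before parsing the reply.
    Useless for this model: its output is only the start of a thought
    process, so nothing remains after stripping."""
    return re.sub(r"<think>.*?</think>", "", reply, flags=re.DOTALL).strip()
```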

deepseek/deepseek-r1-0528-qwen3-8b

Same as above.

mistralai/mistral-7b-instruct-v0.3

A disappointing model. It always says one of three sentences.
- understands the situation: no
- capable of true dialogue: no
- answers what is in stock correctly: no
- achieves the first trade correctly: no
- achieves further trade correctly: no
- performance: very good, about 2 seconds per response

gemma-3-12b-it

Almost exactly the same behavior as the Mistral model: just bland stock phrases.