Whoohoo my own little eval app (for my RPGMaker plugin)

Recently I came across this podcast about evals being totally important:

https://www.youtube.com/watch?v=BsWxPI9UM4c

Warning: it is boring. But it is also important, because its core message is: "If you build something with LLMs, you need a structured way to analyze the LLM's behavior, categorize it, and continuously improve it based on data." This core message also holds for my little RPGMaker plugin that turns Events into LLM-controlled non-player characters. I sometimes have the feeling that the LLM doesn't do what it should, and in particular that some LLMs do better and others worse, but I had no solid data to back that up. So I had to write a little eval app.

The app

So I created a specification document that described a Python app. First, the plugin needs to write performance logs and communication logs: it writes the whole prompt and the whole reply of every LLM call into a file, along with data you can use to filter (which LLM, which event name). The app can show you a performance trace, but mostly it gives you a screen to walk through each communication, see the whole prompt and reply, and rate the LLM's reaction as okay or not okay, adding text to explain your (negative) rating. On another screen you can filter by LLM and NPC, see what percentage of "okay" ratings the NPC received with this LLM, and read your comments.
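To make the rating statistics concrete, here is a minimal sketch in Python. It is not the code of the actual app. It assumes the rated communications are stored as JSON Lines, one LLM call per line, and the field names (llm, event_name, prompt, reply, okay, comment), the model names, and the sample entries are all made up for illustration.

```python
import json
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RatedCall:
    """One logged LLM call plus the human rating (hypothetical schema)."""
    llm: str
    event_name: str
    prompt: str
    reply: str
    okay: bool
    comment: str = ""


def parse_log(lines):
    """Parse JSON Lines text into RatedCall records, skipping blank lines."""
    return [RatedCall(**json.loads(line)) for line in lines if line.strip()]


def okay_rates(calls, llm=None, event_name=None):
    """Return {(llm, event_name): percent rated okay}, with optional filters."""
    totals = defaultdict(lambda: [0, 0])  # key -> [okay count, total count]
    for call in calls:
        if llm is not None and call.llm != llm:
            continue
        if event_name is not None and call.event_name != event_name:
            continue
        counts = totals[(call.llm, call.event_name)]
        counts[0] += call.okay  # True counts as 1, False as 0
        counts[1] += 1
    return {key: 100.0 * ok / total for key, (ok, total) in totals.items()}


def comments_for(calls, llm, event_name):
    """Collect the non-empty comments left for one LLM/NPC combination."""
    return [c.comment for c in calls
            if c.llm == llm and c.event_name == event_name and c.comment]


# Invented sample data standing in for the plugin's communication log.
SAMPLE_LOG = [
    json.dumps({"llm": "model-a", "event_name": "OldMan", "prompt": "...",
                "reply": "...", "okay": True}),
    json.dumps({"llm": "model-a", "event_name": "OldMan", "prompt": "...",
                "reply": "...", "okay": False,
                "comment": "NPC moved although it should sit still"}),
    json.dumps({"llm": "model-b", "event_name": "OldMan", "prompt": "...",
                "reply": "...", "okay": True}),
]

calls = parse_log(SAMPLE_LOG)
print(okay_rates(calls))
print(comments_for(calls, "model-a", "OldMan"))
```

Keeping one record per LLM call means the same file can feed both the walk-through screen and the statistics screen. The filters are just optional arguments on the aggregation.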

And with Cursor + GPT-5 it took less than an hour and five "this doesn't work right" prompts to get me to where I wanted to be.

Screenshot showing the annotation screen

And how it actually helped me improve my game

Some of my NPCs behave slightly differently from what I expect. Before I had my app, I needed to wade through huge log files, which is not what a programming hobby should be like. Unexpectedly (really!), the app totally changed this.

So I have this screen where you enter a large cave in a dungeon with a derelict house. Later the player will enter the house and meet an old chap living there who is very unhappy but doesn't want to leave. As a teaser for this, I created an invisible event / NPC in the cave that should sit there, not move, and complain about his miserable life.

The NPC did this, albeit at an unexpectedly low frequency. No clue why. Maybe the LLM was slow.

After using the app, it turned out that the NPC did move a lot. The prompt describing the NPC was not clear enough.

Screenshot showing a poorly behaving NPC

So I went into RPGMaker, changed the description of the NPC, and voilà, the character behaved as it should.

Screenshot showing that the NPC is behaving better now.

My conclusion

Evals are still boring. There are probably expensive pro apps out there that do this for you, but if you are in a similar situation with your own AI-based app, maybe ask an AI to build a hand-crafted eval app for your own purpose.