Scary but fun

Okay, there’s this old song by Grace Jones called “Scary But Fun.” Probably nobody knows it. Still, it came to mind, because the good LLMs can now handle really big refactors without errors. ChatGPT 4o still crashed and burned on tasks like that.

I’m tinkering with a backgammon training quiz that I first developed just for myself. Very simple: all quiz positions are stored in a text file, and so are my user name and password for DailyGammon. There’s only one user, me.

Now I wanted to change that and turn a prototype into a product. That takes effort, including the conceptual, brain-taxing kind. You have to set up a database. You have to change all the code that reads or writes text files so that it uses the database. You have to make sure the data of different users is cleanly separated. You have to build a login screen and an authentication concept. And in my case you have to make sure that 20 users don’t all start GNU Backgammon at the same time, which would overload my small virtual server.
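To make that last point concrete, here is a minimal sketch of how such a limit could look in Python. It is not the code from my app: the class name `AnalysisGate`, the function `fake_analysis`, and the limit of one parallel analysis are assumptions for illustration, and the real engine call is replaced by a short sleep.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class AnalysisGate:
    """Lets at most `max_parallel` analyses run at once; the rest wait."""

    def __init__(self, max_parallel=1):
        self._slots = threading.BoundedSemaphore(max_parallel)
        self._lock = threading.Lock()
        self.running = 0  # analyses currently in progress
        self.peak = 0     # highest number ever in progress at the same time

    def run(self, analyze, *args):
        with self._slots:  # blocks until a slot is free
            with self._lock:
                self.running += 1
                self.peak = max(self.peak, self.running)
            try:
                return analyze(*args)
            finally:
                with self._lock:
                    self.running -= 1


def fake_analysis(user):
    # Hypothetical stand-in for starting the engine and waiting for it.
    time.sleep(0.01)
    return f"analysis for {user}"


gate = AnalysisGate(max_parallel=1)
users = [f"user{i}" for i in range(20)]

# 20 users ask at the same time, but only one analysis runs at any moment.
with ThreadPoolExecutor(max_workers=20) as pool:
    results = list(pool.map(lambda u: gate.run(fake_analysis, u), users))

print(len(results), "analyses done, peak parallelism:", gate.peak)
```

A semaphore only makes the others wait. If the UI should also show each user their place in line, an explicit queue with a single worker would probably be the better fit.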

You see, there’s a lot you can do wrong.

So I wrote a prompt about 30 lines long, more like a specification document. I didn’t go overboard – you could also specify a change like this in 200 lines. And in Cursor I started the AI with the wonderfully simple name GPT-5.1 codex (high).

The AI then worked away for somewhere between 15 and 30 minutes. My job was: click OK from time to time.

After that, the app was refactored. Perfectly. Flawlessly. With a small database. With login. With a user queue for DailyGammon analyses (which is also shown in the UI).

I’m impressed.

Scary but not fun at all

But something was off. In the quiz, I almost always found the right move in exactly those positions I had supposedly misplayed in my real games. Either I’m extremely unfocused in real life, or the app was still doing something wrong. Weeks ago I’d already had a long and unpleasant debugging session because the AI simply wouldn’t generate the so‑called Gnu ID correctly. That’s a 70-character string describing a position, which is first encoded in binary and then in base64.

“Not again,” I thought, but there was no way around it. Once again the AI turned out to be helpless, so I had to debug the LLM-generated code myself. And the code really wasn’t that bad. Every now and then you see these “no code” frameworks where a regular person describes a program graphically and a machine then generates code from that description. Horrible code. Code that no human should ever have to debug. Just take a look at the HTML that graphical editors generate. Something like that.

This code, though, felt like it had been written by a human; debugging it really wasn’t that bad.

In the end it turned out that, unlike in chess, in backgammon a dice roll determines which player moves first. So there are games that start with

First move player 1, first move player 2
Second move player 1, second move player 2

And ones that start with

----, first move player 2
First move player 1, second move player 2

The LLM hadn’t considered that and interpreted the second move list as

First move player 1, ----
Second move player 1, first move player 2

which then led to incorrect positions.
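For illustration, here is a small Python sketch of the correct reading. It is not the app’s actual parser: the `----` placeholder, the function name `flatten_moves`, and the sample moves are assumptions. The idea is to take each row as (player 1 column, player 2 column) and simply skip the empty cell, instead of assuming that player 1 always moves first.

```python
NO_MOVE = "----"  # assumed placeholder for "this player did not move in this row"


def flatten_moves(rows):
    """Turn (player 1 column, player 2 column) rows into moves in playing order.

    Each result entry is (player, move). An empty cell is skipped, so a game
    that player 2 opens starts with a player 2 move, as it should.
    """
    moves = []
    for p1_move, p2_move in rows:
        for player, move in ((1, p1_move), (2, p2_move)):
            if move != NO_MOVE:
                moves.append((player, move))
    return moves


# Made-up game in which player 2 won the opening roll.
rows = [
    (NO_MOVE, "31: 8/5 6/5"),
    ("64: 24/14", "52: 13/8 13/11"),
]

for player, move in flatten_moves(rows):
    print(f"player {player}: {move}")
```

The column a move sits in tells you who played it. Its position in the list alone does not.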

After I found the bug, the AI (this time the brilliant Claude Opus 4.5) fixed it immediately and flawlessly.

And now?

Lesson learned again: sometimes you have to step in yourself. And you have to be able to do that. For anything beyond very simple apps, even the latest AIs still aren’t far enough along, especially with a niche topic like backgammon apps, of which GitHub is not exactly full.