I want to note that if you really wanted an AI to play Pokémon you can do it wit...

futureshock · on Feb 26, 2025

I know what you are saying, but I very much disagree. There are also better chess engines. That’s not the point.

It’s all about the “G” in AGI. This is a nice demonstration of how LLMs are a generalizable intelligence. It was not designed to play Pokémon, Pokémon was no special part of its training set, Pokémon was not part of its evaluation criteria. And yet, it plays Pokémon, and rather well!

And to see each iteration of Claude be able to progress further and faster in Pokémon helps demonstrate that each generation of the LLM is getting smarter in general, not just better fitted to standard benchmarks.

The point is to build the universal hammer that can hammer every nail, just as the human mind is the universal hammer.

deadbabe · on Feb 26, 2025

It is not generalizable intelligence, it's wisdom of the crowds. Claude does not form long-term strategies or create predictions about future states. A simpler GOAP engine could create far more elaborate plans and still run entirely locally on your device (while adapting constantly to changing world states).

And yeah, you could have Claude use a GOAP tool for planning, but all you’re really doing is layering an LLM on top of a conventional AI as a presentation layer to make the lower AI seem far more intelligent than it is. This is why trying to use LLMs for complex decision making about anything that isn’t text and words is a dead end.
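For readers unfamiliar with GOAP (goal-oriented action planning), here is a minimal sketch of the idea: actions declare preconditions and effects over a set of world facts, and a search finds an action sequence that reaches the goal. The action names and facts below are hypothetical, invented only for illustration, and the search is plain breadth-first rather than the cost-weighted A* that GOAP engines typically use.

```python
from collections import deque
from typing import NamedTuple


class Action(NamedTuple):
    name: str
    pre: frozenset  # facts that must already hold
    add: frozenset  # facts the action makes true


# Hypothetical actions and facts, not taken from any real agent.
ACTIONS = [
    Action("heal_at_center", frozenset(), frozenset({"team_healthy"})),
    Action("beat_brock", frozenset({"team_healthy"}), frozenset({"boulder_badge"})),
    Action(
        "cross_mt_moon",
        frozenset({"boulder_badge", "team_healthy"}),
        frozenset({"in_cerulean"}),
    ),
]


def plan(start, goal, actions=ACTIONS):
    """Breadth-first search for the shortest action sequence that reaches goal."""
    start, goal = frozenset(start), frozenset(goal)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, steps = queue.popleft()
        if goal <= state:  # every goal fact is satisfied
            return steps
        for action in actions:
            if action.pre <= state:  # preconditions hold in this state
                nxt = state | action.add
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, steps + [action.name]))
    return None  # no sequence of actions reaches the goal


print(plan(set(), {"in_cerulean"}))
```

Replanning after the world changes amounts to calling `plan` again with the new set of facts as `start`.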

lyu07282 · on Feb 27, 2025

> It is not generalizable intelligence, its wisdom of the crowds.

Did you see twitch chat plays pokemon? There was not much wisdom in that crowd :P

eru · on March 3, 2025

Well, we know that some ways to organise crowds work better than others.

wordpad25 · on Feb 26, 2025

Pokémon guides were definitely part of every LLM training set. The game is so old that there are thousands of guides and videos on the topic.

LLMs will readily offer high-quality Pokémon gameplay advice without needing to search online.

hombre_fatal · on Feb 27, 2025

If you're implying that generalization isn't at play because game knowledge shows up in its training data, you can disabuse yourself of that by watching the stream and how it reasons itself out of situations. You can see its chain of thought.

It spends most of its time stuck and reasoning about what it can do. It might throw back to knowledge like "I know Pokemon games can have a ledge system that you can walk off, so I will try to see if this is a ledge" (and it fails and has to think of something else), but it's not like it knows the moment-to-moment intricacies of the game. It's clearly generalized problem solving.

minimaxir · on Feb 26, 2025

The operative phrase of that comment being “no special part.”

If you watch the Twitch stream it is obvious Claude has general knowledge of what to do to win in Pokémon but cannot recall specifics.

northern-lights · on Feb 26, 2025

E.g., a Bug-type attack is super effective against Poison types in Gen 1 but not very effective in Gen 2 and onwards. But Claude keeps bringing Nidoran into Weedle/Caterpie.
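The generation difference described above can be written as a small lookup. This is only a sketch covering the single Bug-versus-Poison matchup mentioned in the comment; the table and function names are made up for illustration and are not part of any real game data library.

```python
# Damage multiplier for a Bug-type attack hitting a Poison-type defender.
# Only this one matchup from the comment is encoded; names are hypothetical.
BUG_VS_POISON = {
    "gen1": 2.0,   # super effective in Gen 1
    "gen2+": 0.5,  # not very effective from Gen 2 onwards
}


def bug_vs_poison_multiplier(generation: int) -> float:
    """Return the Bug -> Poison damage multiplier for a given generation."""
    if generation < 1:
        raise ValueError("generation must be 1 or higher")
    return BUG_VS_POISON["gen1" if generation == 1 else "gen2+"]


print(bug_vs_poison_multiplier(1))  # Gen 1, the game on the stream
print(bug_vs_poison_multiplier(2))
```

A model that learned the later chart would treat this matchup as safe for a Poison type like Nidoran, which is the mistake being pointed out.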

minimaxir · on Feb 26, 2025

The AI Plays Pokemon project only made it to Mt. Moon (where, coincidentally, ClaudePlaysPokemon is stuck now) with many months of iteration and many, many hours of compute.

The reason Claude 3.7's performance is interesting is that the LLM approach defeated Lt. Surge, far past Mt. Moon. (I wonder how Claude solved the infamous puzzle in Surge's gym)

https://www.anthropic.com/research/visible-extended-thinking

gyomu · on Feb 27, 2025

The fact that these models can only play up to a certain point seems like an interesting indication as to the inherent limitation of their capabilities.

After all, the game does not introduce any significant new mechanics beyond the first couple areas - any human player who has the reading/reasoning ability to make it to Mt Moon/Lt Surge would be able to complete the rest of the game.

So why are these models getting stuck at arbitrary points in the game?

minimaxir · on Feb 27, 2025

There's one major mechanic that opens up shortly after Lt. Surge: nonlinearity. Once you get to Lavender Town, there are several options to go to, and I suspect that will be difficult for an AI to handle over a limited context window.

And if the AI decides to attempt Seafoam Islands, all bets are off.

deadbabe · on Feb 26, 2025

I'm not talking about reinforcement-learning-type AI; I’m talking about classically programmed AI with standard pathfinders, GOAP, behavior trees, etc.

Philpax · on Feb 26, 2025

But how much effort do you have to put in to build an agent that can play a specific game? Can you retarget that agent easily? How well will your agent deal with circumstances that it wasn't designed for?

deadbabe · on Feb 26, 2025

A lot less effort than training a massive LLM.

Also, there’s no point in designing for use cases it will never encounter. A Pokémon rpg AI is never going to have to go play GTA.

Philpax · on Feb 27, 2025

An LLM can be reused for other use cases. Your agent can't.

deadbabe · on Feb 27, 2025

The reusability is overrated.

For every problem that isn’t natural language processing, there exists a far better solution that runs faster and more optimally than an LLM, at the expense of having to actually program the damn thing (for which you can use an LLM to help you anyway).

Who can fight harder and better in a Pokémon battle, a programmed AI or an LLM? The programmed AI, because it has tactics and analysis built in. Even better, the AI’s difficulty can be scaled trivially, whereas with an LLM you can tell it to “go easy” but it doesn’t actually know what that means. There’s no point in wasting time with an LLM for such an application.

adenta · on Feb 26, 2025

Got a link handy?

drusepth · on Feb 26, 2025

I don't think this project is meant to "solve" a task (hammer, nail) so much as it's just an interesting "what if" experiment to observe and play around with new technology.

adenta · on Feb 26, 2025

I disagree. Getting a computer to play a game like a human has an incredibly broad range of applications. Imagine a system like this that is on autopilot, but can get suggestions from a Twitch chat, nudging its behavior in a specific direction. Two such systems could be run by two teams, and they could do a weekly battle.

This isn’t an exercise in AI, it’s an exercise in TV production IMO.

imtringued · on Feb 26, 2025

It's a publicity stunt by Anthropic (Claude plays Pokémon).

Obviously they are going to show off their LLM.