My experience with LLMs currently is that they can handle any level of abstraction and focus, but you have to discern the "layer" to isolate and resolve.
The next improvement may be something like "abstraction isolation" but for now I can vibe code a new feature which will produce something mediocre. Then I ask "is that the cleanest approach?" and it will improve it.
Then I might ask "is this performant?" Or "does this follow the structure used elsewhere?" Or "does this use existing data structures appropriately?" Etc.
Much like the blind men describing an elephant, each answer might be right, yet collectively they can still be wrong. Newer, slower models are definitely better at this, but rather than throwing infinite context at problems, I think that if they were designed with a more top-down architectural view and a checklist of competing concerns we might get a lot further in less time.
This seems to be how a lot of people are using them effectively right now - create an architecture, implement piecemeal.
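As a rough sketch, the loop looks something like this (ask_llm here is just a stand-in for whatever model client you actually call, not a real API):

    # Sketch only: ask_llm is a placeholder that takes a prompt string
    # and returns the model's reply.
    REVIEW_CHECKLIST = [
        "Is this the cleanest approach?",
        "Is this performant?",
        "Does this follow the structure used elsewhere in the codebase?",
        "Does this use existing data structures appropriately?",
    ]

    def review_passes(code: str, ask_llm) -> str:
        for question in REVIEW_CHECKLIST:
            code = ask_llm(f"{question}\n\n{code}\n\nRevise the code if needed.")
        return code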
vjerancrnjak 30 minutes ago [-]
I’ve found it never able to produce the cleanest approach. I can spend 5 hours and get something very simple, and even when I give it that as an example it cannot extrapolate.
It can’t even write algorithms that rely on the fact that something is sorted. It adds intermediate glue that isn’t necessary, etc. Massive noise.
Tried single-allocation algorithms: it can’t do that. Tried guiding it to exploit invariants: it can’t find single-pass workflows.
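For instance, the kind of single-pass, invariant-exploiting code I mean - a minimal sketch of intersecting two already-sorted lists with no intermediate glue:

    # One pass, O(n + m): no sets, no re-sorting, just the sortedness invariant.
    def sorted_intersection(a, b):
        out, i, j = [], 0, 0
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                out.append(a[i])
                i += 1
                j += 1
            elif a[i] < b[j]:
                i += 1
            else:
                j += 1
        return out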
The public training data is just bad, and it can’t really understand what good actually is.
nonethewiser 48 minutes ago [-]
This has been my experience too. It's largely about breaking things down into smaller problems. LLMs just stop being effective when the scope gets too large.
Architecture documentation is helpful too, as you mentioned. It's basically a set of rules and intentions - kind of a compressed version of your codebase.
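A few lines of the kind of rules such a doc might contain (purely illustrative - the conventions here are made up):

    - All database access goes through repository modules; nothing else imports the driver.
    - Controllers own HTTP concerns; services never see request/response objects.
    - Background jobs must be idempotent and safe to retry.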
Of course, this means the programmer still has to do all the real work.
cryptonym 29 minutes ago [-]
Sounds like coding with extra steps.
nonethewiser 2 minutes ago [-]
What is the extra step? You have to do the upfront legwork either way.
mrklol 24 minutes ago [-]
"But what if the real issue is an N+1 query pattern that the index would merely mask? What if the performance problem stems from inefficient data modeling that a different approach might solve more elegantly?"
In the best case you would have to feed every important piece of information into the context: these are my indexes, this is my function, these are my models. After that the model can find the problematic code. So the main problem is getting all of the important information to your model; that has to be fixed if the user isn’t doing that part. (Obviously that doesn’t mean the LLM will find the actual problem, but it can improve the results.)
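To make the N+1 case from that quote concrete, this is roughly the shape of it (illustrative Python - fetch_orders and fetch_orders_bulk are hypothetical stand-ins for whatever data layer is in use):

    def orders_per_user_n_plus_1(user_ids, fetch_orders):
        # One query per user: an index makes each query faster,
        # but the N round trips remain.
        return {uid: fetch_orders(user_id=uid) for uid in user_ids}

    def orders_per_user_batched(user_ids, fetch_orders_bulk):
        # A single query keyed by user_id removes the N+1 entirely.
        grouped = {uid: [] for uid in user_ids}
        for order in fetch_orders_bulk(user_ids=user_ids):
            grouped[order.user_id].append(order)
        return grouped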
jbellis 2 hours ago [-]
I think this is correct, and I also think it holds for reviewing human-authored code: it's hard to do the job well without first having your own idea in your head of what the correct solution looks like [even if that idea is itself flawed].
danielbln 6 hours ago [-]
I put the examples he gave into Claude 4 (Sonnet), purely asking it to evaluate the code, and it pointed out every single issue in the snippets (N+1 query, race condition, memory leak). The article doesn't mention which model was used, how exactly it was used, or in which environment/IDE.
The rest of the advice in there is sound, but without more specifics I don't know how actionable the section "The spectrum of AI-appropriate tasks" really is.
metalrain 2 hours ago [-]
It's not about "model quality". Most models can improve their code output when asked, but the problem is the lack of introspection by the user.
Basically the same problem as copy-paste coding, but the LLM can (sometimes) know your exact variable names and types, so it's easier to forget that you still need to understand and check the code.
shayonj 4 hours ago [-]
My experience hasn't changed between models, given the core issue mentioned in the article. Primarily I have used Gemini and Claude 3.x and 4. Some GPT 4.1 here and there.
All via Cursor, some internal tools and Tines Workbench
soulofmischief 1 hours ago [-]
My experience changes throughout the day on the same model. It seems pretty clear that during peak hours (lately most of the daytime) Anthropic is degrading their models in order to meet demand. Claude becomes a confident idiot, and the difference is quite noticeable.
genewitch 7 minutes ago [-]
this is on paid plans?
lazy_moderator1 22 minutes ago [-]
Did it detect the N+1 in the first one, the race condition in the second, and the memory leak in the third?
danielbln 2 minutes ago [-]
It did, yeah.
nonethewiser 57 minutes ago [-]
This largely seems like an alternative way of saying "you have to validate the results of an LLM." Is there any "premature closure" risk if you simply validate the results?
Premature closure is definitely a risk with LLMs, but I think code is much less at risk because you can and SHOULD test it. It's a bigger problem for things you can't validate.
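Even a minimal check does a lot. A sketch, where llm_generated_slugify stands in for whatever the model produced and the tests are the part you write yourself:

    def llm_generated_slugify(title: str) -> str:
        # Pretend this body came from the model.
        return "-".join(title.lower().split())

    def test_slugify():
        assert llm_generated_slugify("Hello World") == "hello-world"
        assert llm_generated_slugify("  spaced  out ") == "spaced-out"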
I might start calling this "the original sin" with LLMs... not validating the output. There are many problems people have identified with using LLMs, and perhaps all of them come back to not validating.
bwfan123 36 minutes ago [-]
> I might start calling this "the original sin" with LLMs... not validating the output.
I would rephrase it: the original sin of LLMs is not "understanding" what they output. By "understanding" I mean "the why of the output" - starting from the original problem and reasoning through to the output solution, i.e. the causation process behind the output. What LLMs do is pattern match the most "plausible" output to the input. The output is not born out of a process of causation; it is born out of pattern matching.
Humans can find meaning in the output of LLMs, but machines choke on it, which is why LLM code looks fine at first glance until someone tries to run it. Another way to put it: LLMs sound persuasive to humans, but at their core they are rote students who don't understand what they are saying.
MontagFTB 1 hours ago [-]
This isn’t new. Haven’t we seen this everywhere already? The example at the top of the article (in a completely different field, no less) just goes to show humans had this particular sin nailed well before AI came along.
Bloated software and unstable code bases abound. This is especially prevalent in legacy code whose maintenance is handed down from one developer to the next, where each maintainer’s understanding of the code base differs from their predecessor’s. Combine that with pressure to ship now vs. getting it right, and you have the perfect recipe for an insidious form of technical debt.
suddenlybananas 5 hours ago [-]
I initially thought the layout of the sections was an odd and terrible poem.
tempodox 1 hours ago [-]
Now that you mention it, me too.
mock-possum 1 hours ago [-]
Oh wow me too - I kinda like it that way.
But if it’s meant to be a table of contents, it really should be styled like a list, rather than a block quote.
shayonj 4 hours ago [-]
haha! I didn't see it that way originally. Shall take it as a compliment and rework that ToC UI a bit :D.