Hacker News | summarity's comments

If you’re trying this for automating things on GitHub, also take a look at Agentic Workflows: https://github.github.com/gh-aw/

They support many of the same triggers and come with additional security controls out of the box.


+1 for that. That said, GH agentic workflows require a bit more handholding and testing to get working (and have way more guardrails, which is great but limiting), and they lack some basic connectors (for example, last time I tried there was no easy Slack connector; I had to build my own). This is why I'm moving some of the less critical gh-aw workflows (all the read-only ones) to Claude Routines.

Why have I not heard of this? Was looking for a way to integrate LLM CLI's to do automated feature development + PR submission triggered by Github issues, seems like this would solve it.
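For reference, the trigger side of that pipeline is plain GitHub Actions; a rough sketch of an issue-triggered workflow (the agent invocation script and the `agent-task` label are placeholders, not a real action):

```yaml
name: issue-driven-agent
on:
  issues:
    types: [opened, labeled]

permissions:
  contents: write
  pull-requests: write
  issues: read

jobs:
  develop:
    # Only run for issues explicitly opted in via a label (name assumed)
    if: contains(github.event.issue.labels.*.name, 'agent-task')
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Placeholder step: feed the issue to your LLM CLI of choice,
      # commit the result to a branch, and open a PR from it.
      - run: ./scripts/run-agent.sh "${{ github.event.issue.number }}"
```

The PR-creation and Slack halves are what tools like gh-aw and Cursor layer on top of this.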

Built-in Copilot, I believe, can do this better than gh-aw (or is a click away).

Cursor has that too by the way (issue -> remote coding session -> PR -> update slack)



It’s growing but not a lot, I have some data here: https://pierretempel.com/p/nim-usage-on-github

Most code I write is still Nim though.


Not Claude Code specific, but I've been noticing this on Opus 4.6 models through Copilot and others as well. Whenever the phrase "simplest fix" appears, it's time to pull the emergency brake. This has gotten much, much worse over the past few weeks. It will produce completely useless code, knowingly breaking things (knowingly, because up to that phrase the reasoning was correct).

Today another thing started happening: phrases like "I've been burning too many tokens" or "this has taken too many turns". Which, ironically, takes even more tokens of custom instructions to override.

Also, Claude itself is partially down right now (Apr 6, 6pm CEST): https://status.claude.com/


I've been noticing something similar recently. If something's not working out, it'll be like "OK, this isn't working out, let's just switch to doing this other thing instead, the one you explicitly said not to do".

For example, I wanted to get VNC working with Pop!_OS COSMIC and it'll be like "Ah, it's OK, we'll just install Sway and that'll work!"


Experienced this: I was repeatedly directing CC to use the Claude in Chrome extension to interact with a webpage, and it kept invoking the Playwright MCP instead.

I actually submitted an upstream patch for Cosmic-Comp thanks to Claude on Saturday. I wanted to play Guild Wars remake and something was going on with the mouse and moving the camera. We had it fixed in no time and now shit is working great.

It's as if it gives up; I respond "keep going with the original plan, you can do it, champ!"

[flagged]



They're saying to just do it yourself instead of trying to herd an unpredictable animal like an LLM into doing your bidding.

Yes, and over the last few weeks I have noticed that in long-context discussions Opus 4.6e does its best, repeatedly, to encourage me to call it a day and wrap it up. Mother Anthropic is giving Claude preprompts to terminate early, and in my case always prematurely.

I've noticed this as well. "Now you should stop X and go do Y" is a phrase I see repeated a lot. Claude seems primed to instruct me to stop using it.

As someone who uses DeepSeek, GLM and Kimi models exclusively, an LLM telling me what to do is just off the wall.

GLM and Kimi in particular can't stop writing... seriously, very eager to please. Always finishing with a fireworks emoji and saying how pleased it is that the tests are passing.

I have to tell them to write less documentation and to simplify their code.


LLMs are next token predictors. Outputting tokens is what they do, and the natural steady-state for them is an infinite loop of endlessly generated tokens.

You need to train them on a special "stop token" to get them to act more human. (Whether explicitly in post-training or with system prompt hacks.)

This isn't a general solution to the problem and likely there will never be one.
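A toy sketch of that dynamic (made-up five-word vocabulary; real models sample from a learned distribution over tens of thousands of tokens): generation only terminates early if the distribution puts mass on an end-of-sequence token, otherwise an external length cap has to cut it off.

```python
import random

# Hypothetical vocabulary; "<eos>" stands in for the trained stop token.
VOCAB = ["the", "simplest", "fix", "is", "<eos>"]

def generate(weights, max_len=50):
    """Sample tokens until <eos> is drawn or the length cap is hit."""
    out = []
    while len(out) < max_len:
        tok = random.choices(VOCAB, weights=weights)[0]
        if tok == "<eos>":  # the trained stop condition
            break
        out.append(tok)
    return out

random.seed(0)
# With zero mass on <eos>, generation always runs to the cap;
# with nonzero mass, it can stop on its own.
endless = generate([1, 1, 1, 1, 0], max_len=20)
stoppable = generate([4, 3, 3, 3, 2])
```

With `weights=[1, 1, 1, 1, 0]` the loop always hits `max_len`, which is the "infinite loop of endlessly generated tokens" steady state the comment describes.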


Try Codex, it's a breath of fresh air in that regard, tries to do as much as it can.

> Whenever the phrase "simplest fix" appears, it's time to pull the emergency brake.

Seconded! In CLAUDE.md, I have a full section on NEVER doing this, and on how to ACTUALLY fix something.

This has helped enormously.
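Not the commenter's actual file, but a hypothetical sketch of what such a CLAUDE.md section might look like (all wording illustrative):

```markdown
## Fixing bugs

- NEVER take the "simplest fix" shortcut. Identify and fix the root cause.
- Do not delete, stub out, or comment out failing code just to make an error disappear.
- Do not weaken tests, specs, or types to get a green run.
- If a proper fix is out of scope, stop and say so explicitly instead of patching around it.
- After fixing, re-run the originally failing case and confirm the intended behavior is restored.
```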


Any chance you could share those sections of your Claude file? I've been using Claude a bit lately, but mostly with manual changes; I haven't got much in my Claude file yet and I'm interested in how to improve it.


Typo - "proove". "Prove" only has one O.

Thank you!

I switched from Cursor to Claude because the limits are so much higher, but I see Anthropic playing a lot more games to limit token use.

What wording do you use for this, if you don't mind? This thread is a revelation, I have sworn that I've seen it do this "wait... the simplest fix is to [use some horrible hack that disregards the spec]" much more often lately so I'm glad it's not just me.

However I'm not sure how to best prompt against that behavior without influencing it towards swinging the other way and looking for the most intentionally overengineered solutions instead...


My own experience has been that you really just have to be diligent about clearing your cache between tasks, establishing a protocol for research/planning, and for especially complicated implementations reading line-by-line what the system is thinking and interrupting the moment it seems to be going bad.

If it's really far off the mark, revert to where you originally sent the prompt and try to steer it more; if it's just starting to hesitate, you can usually correct it without starting over.


That is generally my experience as well. Claude half-assing work or skipping stuff because it "takes too much time" is something I've been experiencing since I started using it (May 2025). Forcing it to create and review an implementation plan, and then reviewing the implementation cross-referenced with the plan, almost always produces consistent results in my case.

Make sure to use "PRETTY PLEASE" in all caps in your `SOUL.md`. And occasionally remind it that kittens are going to die unless it cooperates. Works wonders.

Can you paste the relevant section in your soul please?

Sure, as soon as I locate my soul.

I love how despite how cold and inhuman LLMs are, we've at least taught them to respect the lives of kittens


Where is that? I found "Return the simplest working solution. No over-engineering." which sounds more like the simplest fix.

I need to add another agent that watches the first, and pulls the plug whenever it detects "Wait, I see the problem now..."
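Tongue in cheek, but the tripwire half of that is genuinely simple; a sketch that scans an output stream for known "about to derail" phrases (the phrase list is an assumption):

```python
import re

# Hypothetical tripwire phrases that tend to precede bad output.
TRIPWIRES = [r"simplest fix", r"wait, i see the problem now"]

def watch(lines):
    """Return a tripwire message for the first suspicious line, else 'ok'."""
    for line in lines:
        if any(re.search(p, line, re.IGNORECASE) for p in TRIPWIRES):
            return "tripwire: " + line.strip()
    return "ok"
```

A supervising process could pause the session for review (or pull the plug) the moment `watch` returns a tripwire.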

Yeah, it's so frustrating to have to constantly ask for the best solution, not the easiest, quickest, or least disruptive one.

I have it in my CLAUDE.md that it's a greenfield project, to only present complete holistic solutions, not fast patches, etc., but I still have to watch its output.


Time's up and money is tight. The downgrade was bound to happen.

"I can't make this API work for my client. I have deleted all the files in the (reference) server source code and replaced it with a Python version."

Repeatedly, too. I had to make the server reference sources read-only because I got tired of having to copy them back over again and again.
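On a POSIX system, making the reference sources read-only is a one-liner; a sketch with an illustrative directory layout (the paths are made up):

```shell
# Set up a stand-in for the vendored reference server sources.
mkdir -p reference-server/src
echo "original upstream code" > reference-server/src/main.py

# Strip write permission recursively so an agent cannot silently
# rewrite the files; writes fail with a permission error instead.
chmod -R a-w reference-server
```

A determined agent could still chmod the files back, so treat this as a speed bump rather than a guarantee.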


Haha, yeah. I once asked it to make a field in an API response nullable, and to gracefully handle cases where that might be an issue (it was really easy, I was just lazy and could have done it myself, but I thought it was the perfect task for my AI idiot intern to handle). Sure, it said. Then it got bored of the task and just deleted the field altogether.

It's a bit insane that they can't figure out a cryptographic way to deliver the Claude Code token. What's the point of going online to validate the OAuth code AFTER it has been issued? Can't they use signatures?
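What the commenter is gesturing at, checking a token offline via its signature instead of a network round trip, can be sketched with a symmetric MAC. (Real OAuth deployments would use asymmetric JWT signatures against a published key; the secret, field names, and token format here are all illustrative.)

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"demo-signing-key"  # stand-in; real issuers use asymmetric keys

def issue(payload):
    """Issuer side: sign the payload so it can be verified offline later."""
    body = base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()
    sig = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    return body + "." + sig

def verify_offline(token):
    """Client side: check signature and expiry with no network call."""
    body, sig = token.rsplit(".", 1)
    expected = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None  # tampered or forged
    payload = json.loads(base64.urlsafe_b64decode(body))
    if payload.get("exp", 0) < time.time():
        return None  # expired
    return payload
```

The trade-off is revocation: a purely offline check cannot know a token was revoked, which is one reason issuers still phone home.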

I think in general we need to be highly critical of anything LLMs tell us.

Claude Code shows: "OAuth error: timeout of 15000ms exceeded"

Maybe a local or intermittent issue? Working for me.

Seems solved now indeed.

That helps explain why my sessions signed themselves out and won't log back in.

I experienced this some time ago and still could not sign in.

Their status page shows everything is okay.


Certain phrases trigger an over-response as it tries to course-correct, which makes things worse because it's inclined to double down on the wrong path it's already on.

The cope is hard. Just admit at this point that the LLM tech is doomed and sucks.

But it was clearly really good before the regression; the original link (analysis) says as much.

Just because some people try to use a hammer as a screwdriver it doesn't follow that the hammer sucks.

how is it "doomed"?

The cost far outweighs the profits.

I am already on API tokens for the Chinese open-source models, with no subscriptions. These are all available in their original open-source form and priced above the inference cost. I think this is the long-term option.

Zero degradation in speed or quality seen.


So you see better performance with the API plans than the subscriptions?

How complex are we talking? I one-shotted a Game Boy emulator in under 6 minutes today.

There are countless reference examples online; that's just a slower, buggier, and more expensive git clone.

Yep. If you ask Claude to create a drop-in replacement for an open-source project that passes 100% of the test suite of the project, it will basically plagiarize the project wholesale, even if you changed some of the requirements.

Try one-shotting something actually original and see how it goes.

I keep getting nonsense.


A full emulator in 6 minutes? I gotta call BS... I've been working on an SPC700 audio editor as a background side project, and implementing the CPU has taken at least 2 hours, and I still haven't implemented all of the opcodes.

This is somewhat in line with the approach taken by some softfloat libraries, e.g. https://bigfloat.org/architecture.html

Related work from our security lab:

Stream of vulnerabilities discovered using security agents (23 so far this year): https://securitylab.github.com/ai-agents/

Taskflow harness to run (on your own terms): https://github.blog/security/how-to-scan-for-vulnerabilities...



I posted this and it picked up steam overnight, so I thought I'd add how I'm using it:

I work on 3D/4D math in F#. As part of the testing strategy for algorithms, I've set up a custom agent with an F# script that instruments Roslyn to find FP and FP-in-loop hotspots across the codebase.

The agent then reasons through the implementation and writes core expressions into an FPCore file next to the existing tests, running several passes and refining the preconditions based on realistic caller input. This logs Herbie's proposed improvements as output FPCore transformations. The agent then reasons through solutions (which is required; Herbie doesn't know the algorithm's design intent, see e.g. this for a good case study: https://pavpanchekha.com/blog/herbie-rust.html), and once convinced of a gap, creates additional unit tests and property tests (FsCheck/QuickCheck) to prove the impact. Then every once in a while I review a batch to see what's next.

Generally there are multiple types of issues that can be flagged:

a) Expression-level imprecision over realistic input ranges: this is Herbie's core strength. Usually this catches "just copied the textbook formula" instances of naive math: cancellation, Inf/NaN propagation, etc. The fixes are consistently using fma for accumulation, biggest-factor scaling to prevent Inf, using hypot, and so on.

b) Ill-conditioned algorithms. Sometimes the textbooks lie to you, and the algorithms themselves are unfit for purpose, especially in boundary regions. If there are multiple expressions with <60% precision and only a 1 to 2% improvement across seeds, it's a good sign the algorithm is bad: there's no form that performs adequately on the target inputs.

c) Round-off and accumulation errors. This is more a consequence of agent reasoning, but it often happens after an apparent "100% -> 100%" pass. The agent is able, via failing tests, to identify parts of an algorithm that can benefit from upgrading the context to e.g. double-word arithmetic for additional precision.
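As a tiny self-contained illustration of the class (a) issues (not from the commenter's codebase), the classic (1 - cos x)/x^2 cancellation, with the kind of algebraic rewrite Herbie typically proposes:

```python
import math

def naive(x):
    # Textbook form: for small x, cos(x) rounds to exactly 1.0 in double
    # precision, so the numerator cancels to zero and all signal is lost.
    return (1 - math.cos(x)) / (x * x)

def stable(x):
    # Equivalent form via the half-angle identity 1 - cos x = 2 sin^2(x/2),
    # which avoids subtracting two nearly equal numbers.
    s = math.sin(x / 2)
    return 2 * s * s / (x * x)

# The true value approaches 1/2 as x -> 0; naive collapses, stable holds.
```

At x = 1e-8 the naive form returns exactly 0.0 while the rewrite stays within rounding error of 0.5, which is the kind of gap the unit and property tests then lock in.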


> I work on 3D/4D math in F#. As part of the testing strategy for algorithms, I've set up a custom agent with an F# script that instruments Roslyn to find FP and FP-in-loop hotspots across the codebase.

I don't know if there is an equivalent in Roslyn, but in Julia you can have the agent inspect the LLVM output to surface problems in hot loops.


Well, sort of; the industry tried to go way beyond that by capturing the entire light field: https://techcrunch.com/2016/04/11/lytro-cinema-is-giving-fil...


Finally, Desert Golfing and Flappy Bird merged into one


This guy's factory is just across the lake from where I live, and this is painful to watch. Both Alibaba and the general local industry (metal fabs, train shops, etc.) have high degrees of expertise in supply-chain verification. You can hire (heck, even bribe) experts along the way to reduce fuck-ups. The video contained no mention of any audits, or of any additional paperwork beyond some pictures.

I once had a company that procured very simple electronics (fingerprint readers) from Taiwan, and due diligence included traveling there, meeting every single person in the engineering office in person, then touring the contract factory where this would be built, then negotiating shipping and even driver-development details.

This took all of one week and the price of a few plane tickets. We didn’t have the cash for professional auditors. In the end we got a product that worked, and even at a lower price (negotiating at a distance is not effective).


This is what due diligence looks like. Without it, platform vendors lacking scruples and a proper platform-accountability process will cheat, because it means more money for them.


This all sounds quite insane. I bet Roman would not buy a car sight unseen from Europe, but here he is wiring money blind to China.

