warwickmcintosh · 2026-04-13T22:42:24 1776120144

ROCm has improved, but the reality is you're still fighting the driver stack more than the models. If you're actually doing local inference on AMD, you're spending your time on CUDA compatibility layers, not the AI part. "Two lines of Python" is marketing; the gap between the demo and a working AMD setup is still real.

ddtaylor · 2026-04-13T23:18:01 1776122281

Ollama works very well in Linux on my AMD hardware. I have a 6800 XT which isn't even originally supported by the ROCm stack in some ways and it "just works" for a ton of very nice models, especially if I seek out quantized versions of the model.

warwickmcintosh · 2026-04-12T23:20:55 1776036055

Regex and NER both have fun edge cases for redaction. Medical record numbers that look like dates, addresses embedded in prose, account numbers with varying formats. Detection method matters more than people think.
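To make the "MRN that looks like a date" case concrete, here is a minimal sketch. The formats are assumptions for illustration only (an MRN written as `NN-NN-NNNN` and introduced by an `MRN` label); real record formats vary by institution, and `naive_redact` and `redact` are hypothetical helper names.

```python
import re

# Hypothetical formats, for illustration only.
# A labelled MRN such as "MRN: 04-12-1987" has the same shape as a date.
MRN_LABELLED = re.compile(r"\b(MRN[:#]?\s*)(\d{2}-\d{2}-\d{4})\b")
DATE = re.compile(r"\b\d{2}-\d{2}-\d{4}\b")


def naive_redact(text: str) -> str:
    """Date rule only: a date-shaped MRN gets tagged as a date."""
    return DATE.sub("[DATE]", text)


def redact(text: str) -> str:
    """Run the labelled-MRN rule first so the date rule cannot claim those digits."""
    text = MRN_LABELLED.sub(lambda m: m.group(1) + "[MRN]", text)
    return DATE.sub("[DATE]", text)


sample = "MRN: 04-12-1987, seen on 03-01-2024"
print(naive_redact(sample))
print(redact(sample))
```

Rule ordering only helps when the label is present. An unlabelled, date-shaped MRN buried in prose still comes out tagged as a date, which is where a context-aware detector earns its keep.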

warwickmcintosh · 2026-04-12T22:38:28 1776033508

Laziness makes you understand the problem before writing anything. An LLM will happily generate 500 lines for something that needed 20 because it never has to maintain any of it.

warwickmcintosh · 2026-04-01T21:34:23 1775079263

The sanitised optimism problem mentioned upthread is the real gap here. Event stream logging tells you what tools were called and in what order, but it doesn't tell you whether the agent's self-reported outcome matches reality.

warwickmcintosh · 2026-03-29T21:40:19 1774820419

LLM-as-judge drifts in weird ways if you don't have ground truth to calibrate against. Good that you've got that built in. Would love to see eval stability tracking over time though; the same prompt on a different day sometimes gives different scores.
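Something like the following would be enough to start tracking stability. It is only a sketch: the `(prompt_id, day, score)` record shape, the `score_stability` name, and the `max_stdev` threshold are all assumptions, not anything from the tool under discussion.

```python
from collections import defaultdict
from statistics import mean, pstdev


def score_stability(runs, max_stdev=0.5):
    """Group judge scores by prompt and flag prompts whose scores vary across runs.

    runs: iterable of (prompt_id, day, score) tuples (assumed shape).
    max_stdev: assumed threshold above which a prompt counts as unstable.
    """
    by_prompt = defaultdict(list)
    for prompt_id, _day, score in runs:
        by_prompt[prompt_id].append(score)

    report = {}
    for prompt_id, scores in by_prompt.items():
        spread = pstdev(scores)  # population standard deviation; 0.0 for a single run
        report[prompt_id] = {
            "mean": mean(scores),
            "stdev": spread,
            "unstable": spread > max_stdev,
        }
    return report


runs = [
    ("p1", "2026-03-01", 4.0),
    ("p1", "2026-03-02", 4.0),
    ("p2", "2026-03-01", 2.0),
    ("p2", "2026-03-02", 4.0),
]
report = score_stability(runs)
print(report["p2"])
```

Re-running a fixed set of prompts on a schedule and watching the `unstable` flags separates judge noise from real changes in the system being evaluated.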

warwickmcintosh · 2026-03-27T04:49:26 1774586966

Biggest change for us has been alert volume. AI-generated code ships faster, which means more deploys per day, which means more things to monitor...

warwickmcintosh · 2026-03-27T04:48:41 1774586921

The "every new model is THE one" cycle is getting a bit old but the Capybara tier thing is actually worth paying attention to.

ggsp · 2026-03-27T08:30:22 1774600222

TFA states "Capybara and Mythos appear to refer to the same underlying model"

warwickmcintosh · 2026-03-26T11:58:18 1774526298

The approval layer problem is interesting because the naive approach (approve every action) doesn't scale, but the fully autonomous approach terrifies anyone running these in production...

warwickmcintosh · 2026-03-26T11:57:53 1774526273

I've noticed the time-of-day variance too. My working theory is it's related to load, not model changes. Same prompt at 6am Sydney time (when US is asleep) consistently gets better results than the same prompt at noon. The "ignoring instructions" behavior usually means it's working from a compressed context where earlier instructions got summarized away.

narmiouh · 2026-03-26T14:55:14 1774536914

I think it fails even on simple instructions. People who have been in this for a while understand compaction impacts etc., but it feels lacking even in cases where it seemed to work well in Jan/Feb.

warwickmcintosh · 2026-03-26T11:25:00 1774524300

Retry logic modeling is a great inclusion; most cost estimates miss this completely. In my experience retries account for 15-25% of total cost in production agent systems, especially with tool calling, where validation failures trigger re-prompts. Would be useful to see a "worst case" scenario that models cascading retries (a retry triggers another tool call that also retries).
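A rough way to put numbers on the cascading case, as a sketch only. It assumes every attempt costs the same, each attempt fails independently with probability `p_fail`, and cascading goes one level deep (each retry also spawns one extra tool call with its own retries). The function names and the sample inputs are illustrative, not measured values.

```python
def expected_attempts(p_fail: float, max_retries: int) -> float:
    """Expected attempts for one call: attempt k happens only if the previous k failed."""
    return sum(p_fail ** k for k in range(max_retries + 1))


def retry_share(p_fail: float, max_retries: int) -> float:
    """Fraction of total cost spent on retries, with no cascading."""
    attempts = expected_attempts(p_fail, max_retries)
    return (attempts - 1) / attempts


def cascading_retry_share(p_fail: float, max_retries: int) -> float:
    """Same fraction when each retry also triggers one nested tool call that can retry."""
    attempts = expected_attempts(p_fail, max_retries)
    # Each expected retry costs itself (1) plus a nested call with its own expected attempts.
    total = 1 + (attempts - 1) * (1 + attempts)
    return (total - 1) / total


# Illustrative inputs: 20% failure rate, up to 3 retries.
print(round(retry_share(0.2, 3), 3))
print(round(cascading_retry_share(0.2, 3), 3))
```

With those inputs the non-cascading share comes out just under 20% and the cascading share around 36%, so a worst-case column can sit well above a realistic estimate.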

indiegoing · 2026-03-26T19:42:36 1774554156

This is really valuable, especially the 15-25% figure from real production systems. The current assumption of 8% is clearly too conservative for agent workloads with tool calling. The cascading retry scenario (a retry triggers another tool call that also retries) is a great idea for a worst-case column alongside the current realistic estimate. Going to look into this for v2.