ROCm has improved, but the reality is you're still fighting the driver stack more than the models. If you're actually doing local inference on AMD, you're spending your time on CUDA compatibility layers, not the AI part. "Two lines of Python" is marketing; the gap between the demo and a working AMD setup is still real.
Ollama works very well on Linux with my AMD hardware. I have a 6800 XT, which in some ways wasn't even originally supported by the ROCm stack, and it "just works" for a ton of very nice models, especially if I seek out quantized versions of the model.
Regex and NER both have fun edge cases for redaction. Medical record numbers that look like dates, addresses embedded in prose, account numbers with varying formats. Detection method matters more than people think.
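To make the "MRN that looks like a date" case concrete, here is a minimal sketch. The 8-digit MRN format, the `YYYYMMDD` date layout, and the names `MRN_PATTERN`, `looks_like_date`, and `classify_tokens` are all assumptions for illustration, not any real system's format.

```python
import re
from datetime import date

# Assumed for illustration: an MRN is any standalone run of exactly 8 digits.
MRN_PATTERN = re.compile(r"\b\d{8}\b")

def looks_like_date(token: str) -> bool:
    """True if an 8-digit token is also a valid YYYYMMDD calendar date."""
    try:
        date(int(token[:4]), int(token[4:6]), int(token[6:]))
        return True
    except ValueError:
        return False

def classify_tokens(text: str) -> list[tuple[str, str]]:
    """Label each 8-digit token as 'mrn' or 'ambiguous' (could be a date)."""
    results = []
    for match in MRN_PATTERN.finditer(text):
        token = match.group()
        label = "ambiguous" if looks_like_date(token) else "mrn"
        results.append((token, label))
    return results
```

With these assumptions, `20240115` comes back as ambiguous because it is also a valid calendar date, while `93817264` is labelled an MRN. A regex alone can't settle the ambiguous case; that needs context, which is where the detection method starts to matter.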
laziness makes you understand the problem before writing anything. an LLM will happily generate 500 lines for something that needed 20 because it never has to maintain any of it.
The sanitised optimism problem mentioned upthread is the real gap here. Event stream logging tells you what tools were called and in what order, but it doesn't tell you whether the agent's self-reported outcome matches reality.
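One way to close that gap is to check each self-reported outcome against an independent probe. This is a rough sketch only: `audit_claims`, the claim names, and the probe callables are hypothetical, and what a real probe looks like (file check, DB query, HTTP call) depends entirely on the tool.

```python
from typing import Callable

def audit_claims(
    claims: dict[str, bool],
    probes: dict[str, Callable[[], bool]],
) -> list[str]:
    """Return the names of claims that disagree with their probe.

    claims: what the agent reported (name -> claimed success).
    probes: independent checks of the real state (name -> callable).
    A claim with no probe is treated as suspect rather than trusted.
    """
    mismatches = []
    for name, claimed in claims.items():
        probe = probes.get(name)
        if probe is None or probe() != claimed:
            mismatches.append(name)
    return mismatches
```

The event stream still tells you what was called; this just adds a second column for whether the reported result held up.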
LLM as judge drifts in weird ways if you don't have ground truth to calibrate against. Good that you've got that built in. Would love to see eval stability tracking over time though, same prompt different day sometimes gives different scores.
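A minimal sketch of what stability tracking could look like, assuming you store the judge's scores for a fixed prompt keyed by day. The function name `stability_report` and the `max_spread` threshold are made up for illustration; pick a threshold that fits your scoring scale.

```python
from statistics import mean

def stability_report(
    scores_by_day: dict[str, list[float]],
    max_spread: float = 0.5,
) -> dict:
    """Summarise day-to-day drift in judge scores for one fixed prompt.

    scores_by_day: day label -> scores the judge gave on that day.
    max_spread: assumed tolerance for the gap between the best and
                worst daily mean before flagging the eval as unstable.
    """
    daily_means = {day: mean(scores) for day, scores in scores_by_day.items()}
    spread = max(daily_means.values()) - min(daily_means.values())
    return {
        "daily_means": daily_means,
        "spread": spread,
        "stable": spread <= max_spread,
    }
```

Running this over the same prompts you already have ground truth for would separate "the judge drifted" from "the system under test changed".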
The approval layer problem is interesting because the naive approach (approve every action) doesn't scale, but the fully autonomous approach terrifies anyone running these in production...
I've noticed the time-of-day variance too. My working theory is it's related to load, not model changes. Same prompt at 6am Sydney time (when US is asleep) consistently gets better results than the same prompt at noon. The "ignoring instructions" behavior usually means it's working from a compressed context where earlier instructions got summarized away.
I think it fails even on simple instructions. People who have been at this for a while understand compaction impacts etc., but it feels lacking even in cases where it seemed to work well in Jan/Feb.
Retry logic modeling is a great inclusion; most cost estimates miss this completely. In my experience retries account for 15-25% of total cost in production agent systems, especially with tool calling, where validation failures trigger re-prompts. It would be useful to see a "worst case" scenario that models cascading retries (a retry triggers another tool call that also retries).
This is really valuable, especially the 15-25% figure from real production systems. The current assumption of 8% is clearly too conservative for agent workloads with tool calling. The cascading retry scenario (a retry triggers another tool call that also retries) is a great idea for a worst-case column alongside the current realistic estimate. Going to look into this for v2.
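For anyone wanting to play with the numbers, here is a rough sketch of the arithmetic under simplifying assumptions: every attempt costs the same, each attempt fails independently with probability `p_fail`, and retries are capped at `max_retries`. The cascading model (each outer retry triggers exactly one inner tool call with the same retry behaviour) is just one possible reading of the worst case, not the estimator's actual method, and all function names are hypothetical.

```python
def expected_attempts(p_fail: float, max_retries: int) -> float:
    """Expected attempts per call: attempt k+1 happens only if the
    first k all failed, so the expectation is sum of p_fail**k."""
    return sum(p_fail ** k for k in range(max_retries + 1))

def retry_share(p_fail: float, max_retries: int) -> float:
    """Fraction of total cost spent on retries (equal cost per attempt)."""
    attempts = expected_attempts(p_fail, max_retries)
    return (attempts - 1.0) / attempts

def cascading_attempts(p_fail: float, max_retries: int) -> float:
    """Assumed worst case: every outer retry triggers one inner tool
    call, and that inner call retries with the same p_fail and cap."""
    outer = expected_attempts(p_fail, max_retries)
    inner = expected_attempts(p_fail, max_retries)
    return outer + (outer - 1.0) * inner
```

With no cap on retries the share works out to exactly `p_fail`, so under this model an 8% retry share corresponds to roughly an 8% per-attempt failure rate, and 15-25% to a failure rate in that same range. The cascading function gives a number for the worst-case column, with the caveat that real failures are rarely independent.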