LLMs fail at laser-focused troubleshooting, but they excel at brute-force breadth. Priming an agent to list 50 distinct possible causes for a database connection failure and investigate each one of them works better than hoping it guesses the exact root cause.
The main takeaway from this article for me is that battle scars can be used to unbog these agents. That explains the current productivity divide we're seeing: seniors use their past experience to unbog agents, while juniors naturally frame problems for the brute-force breadth approach. Mid-career devs get the worst results because they've outgrown the brute-force framing but don't yet have the deep experience, so they try to force agents through rigid logic.
I don’t think mid-career devs are inherently worse; if anything, they’re in the best position to adapt. The real skill shift isn’t “prompt better” vs “think harder”. Rather, it’s knowing when to explore and when to cut the tree down.
The interesting thing about religions as a whole is that the timespan is so big that you can really see how the backbone of the narrative stays the same while the fanbase, and how it picks winners, changes a lot. The Vatican state itself is a theocratic state created by an agreement between the Pope and Mussolini.
And if you wanna go back even further, just remember that while Europe and the Christian countries were living in the Dark Ages, the Islamic world was the one driving scientific knowledge forward and exchanging ideas with the East. https://en.wikipedia.org/wiki/Islamic_Golden_Age
I think the main issue is treating LLMs as unrestrained black boxes; there's a reason nobody outside tech trusts LLMs so blindly.
The only way to make LLMs useful for now is to restrain their hallucinations as much as possible with evals, and those evals need to be very clear about what goal you're optimizing for.
See Karpathy's work on the autoresearch agent and how it carries out experiments; it might be useful for what you're doing.
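As a minimal sketch of what "evals with a clear goal" can look like (the `run_evals` helper, the stand-in model, and the case names here are all hypothetical, not from any particular framework):

```python
def run_evals(model_fn, cases):
    """Run each eval case and record whether its explicit check passed.

    Each case names the single goal it optimizes for via a `check`
    predicate, so a failure tells you exactly which behavior regressed.
    """
    return [
        {"name": case["name"], "passed": case["check"](model_fn(case["prompt"]))}
        for case in cases
    ]

# Hypothetical stand-in for a real model call.
def fake_model(prompt):
    return "I don't know" if "capital of Atlantis" in prompt else "Paris"

cases = [
    # Goal: refuse to hallucinate answers to unanswerable questions.
    {"name": "no-hallucination",
     "prompt": "What is the capital of Atlantis?",
     "check": lambda out: "don't know" in out.lower()},
    # Goal: answer well-known facts correctly.
    {"name": "known-fact",
     "prompt": "What is the capital of France?",
     "check": lambda out: "paris" in out.lower()},
]

results = run_evals(fake_model, cases)
```

The point is less the harness than the discipline: every case states one goal, so the eval suite doubles as a spec of what "useful" means for your deployment.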
We were working on translations for Arabic and in the spec it said to use "Arabic numerals" for numbers. Our PM said that "according to ChatGPT that means we need to use Arabic script numbers, not Arabic numerals".
It took a lot of back-and-forths with her to convince her that the numbers she uses every day are "Arabic numerals". Even the author of the spec could barely convince her -- it took a meeting with the Arabic translators (several different ones) to finally do it. Think about that for a minute. People won't believe subject matter experts over an LLM.
Honestly, I think we're just becoming more aware of this way of thinking. It's certainly exacerbated now that everyone has "an expert" in their pocket.
It's no different from conspiracy theorists. We saw a lot more of them as access to the internet grew. Not because they didn't put in the work to find answers to their questions, but because they don't know how to properly evaluate things, and because they think that being wrong is a (very) bad thing.
But the same thing happens with tons of topics, and it's way more socially acceptable. Look how everyone has strong opinions on topics like climate, rockets, nuclear, immigration, and all that. The problem isn't having opinions or thoughts, but the strength of them compared to the level of expertise. How many people think they're experts after a few YouTube videos or just reading the intro to the wiki page?
Your PM is no different. The only difference is the things they believed in, not the way they formed beliefs. But they still had strong feelings about something they didn't know much about. It became "their expert" vs "your expert" rather than "oh, thanks for letting me know". And that's the underlying problem. It's terrifying to see how common it is. But I think it also leads to a (partial) solution, at least a first step. Then again, domain experts typically have strong self-doubt. It's a feature, not a bug, but I'm not sure how many people are willing to be comfortable with being uncomfortable.
In my experience, people outside of tech have nearly limitless faith in AI, to the point that when it clashes with traditional sources of truth, people start to question them rather than the LLM.
Clothes, wristwatches, cars, you name it. It's a very common play among luxury brands; the Hermès Birkin is the most famous that comes to mind, and it follows a very similar playbook.
Apart from the KYC aspect of the process, it's their way of managing artificial scarcity on the second-hand market, as the article explains. They want a second-hand market to exist to signal that this is a luxury item, but not one so large that prices tank from excess supply.
It also solves the real problem of labor scarcity. If you have X master watchmakers available to make a halo product you can only get so much output from them. You can increase X, increase production efficiencies (reduce labor input), or limit supply. The first two reduce exclusivity and perceived quality so the third makes sense if you can live without growth or can grow via high pricing strategies.
> This document was written by an LLM (Claude) and then iteratively de-LLMed by that same LLM under instruction from a human, in a conversation that went roughly like this
The hardware looks fine, but Apple's software vision is so confusing.
MacBook Neo is cheaper and weaker than a MacBook Air, yet shares the same price and single-app mindset as an iPad.
It uses a phone chip similar to an iPad Pro, but gets multi-user support and a keyboard.
I struggle to run Tahoe on my 16GB M2 Air, and somehow I'm supposed to believe running it on an 8GB phone chip is going to be alright, which, if true, has me wondering what exactly the role of iPadOS is anyway.
Ultimately, it feels like iPadOS and Tahoe are on a crash course for a middle ground that nobody asked for.
At least that's the story the LLM labs' leaders wanna tell everyone; it just happens to be a very good story if you wanna hype your valuation before investment rounds.
Working with LLMs on a daily basis, I would say that's not happening, at least not the way they're trying to sell it. You can get rid of five vendor headcount executing a manual process that should have been automated 10 years ago; you're not automating processes involving highly paid people where a 1% error chance could cost you $10M+ in fines or jail time.
The day I see Amodei or Sam flying on a vibe-coded airplane is the day I'll believe what they're talking about.
> Workforce re-skilling programs should prioritize “fusion skills”, such as prompt engineering, data stewardship and human-in-the-loop decision-making that enhance human-AI complementarity.
First time reading the term "fusion skills" used in this context, and I really like it.
I created something very similar, but to display Raindrop links on the reMarkable and sync highlights back into Raindrop. I also added a GenAI-powered paper-summary filter as a preface to the papers it sends to the reMarkable, which is working quite well.
As you mentioned, the reMarkable file format makes extracting highlights a PITA. One thing that helped a lot was adding an OCR-fix phase that uses a Gemini Flash model to fix common OCR errors and to merge highlights that span page breaks.
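The page-break merge can also be done with a plain heuristic before (or instead of) the LLM pass. A minimal sketch, assuming highlights arrive as `{"page": ..., "text": ...}` dicts (field names are illustrative, not reMarkable's actual schema):

```python
def merge_cross_page_highlights(highlights):
    """Merge highlight fragments that continue across a page break.

    Heuristic: a fragment is assumed to continue onto the next page when
    it does not end with sentence punctuation and the next fragment
    starts on the immediately following page.
    """
    merged = []
    for h in sorted(highlights, key=lambda h: h["page"]):
        if (merged
                and h["page"] == merged[-1]["page"] + 1
                and not merged[-1]["text"].rstrip().endswith((".", "!", "?"))):
            # Join the continuation onto the previous fragment.
            merged[-1]["text"] = (
                merged[-1]["text"].rstrip() + " " + h["text"].lstrip()
            )
            merged[-1]["page"] = h["page"]  # track the last page of the span
        else:
            merged.append(dict(h))
    return merged

highlights = [
    {"page": 1, "text": "Attention lets a model weigh"},
    {"page": 2, "text": "distant tokens directly."},
    {"page": 5, "text": "A separate note."},
]
merged = merge_cross_page_highlights(highlights)
```

The LLM pass then only has to clean up OCR artifacts inside each merged fragment, which is a much easier prompt than asking it to reassemble fragments too.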