> "AI" is basically a vast, curated, compressed database with a powerful index. If the database reflects the current state of the art, it'll have better understanding than the majority of human practitioners.
But it's not that. You're missing the point entirely and don't know what you're advocating for.
A dictionary contains all the words necessary to describe any concept, plus rudimentary definitions to help you string sentences together, but you wouldn't have a doctor diagnose someone's medical condition with a dictionary, despite the fact that it contains most if not all of the concepts necessary to describe and diagnose any disease. It's useful information, but not organized in a way that is conducive to the task at hand.
I assume, based on the way you're describing AI, that you're referring to LLMs broadly, which, again, are spicy autocorrect. Super simplified, they're just big masses of statistics about what things tend to come in what order, which words or concepts have proximity to one another, and what words and sentences look like. They lack (and really cannot develop) the ability to perform deductive reasoning, to come up with creative or new ideas, or to actually understand the answers they're giving. If they connect a bunch of irrelevant dots, they will not second-guess their answer when something seems off. They will not consult with other experts to get outside opinions on biases or details they overlooked or missed. They have no concept of details. They have no concept of expertise. They cannot ask questions to get you to expand on the vague things you said that a doctor's intuition might flag as important.
The idea that you could type some symptoms into ChatGPT and get a reasonable diagnosis is foolish beyond comprehension. ChatGPT cannot reliably count the number of letters in a word. If it gives you an answer you don't like and you say it's wrong, it will instantly correct itself, and sometimes still give you the wrong answer in direct contradiction to what you said. Have you used Google lately? The Gemini AI summaries at the top of the search results often contain misleading or completely incorrect information.
ChatGPT isn't poring over medical literature, trying to find references to things that sound like what you described, and then drawing conclusions; it's just finding groups of letters with proximity to the ones you gave it (without any concept of what the medical field is). ChatGPT is a machine that gives you an answer in the (impressively close, no doubt) shape of the answer you'd expect, assembled from massive amounts of irrelevant data from all sorts of places (including, for example, snake-oil alternative-medicine sites and conspiracy-theory content) that are also being considered as part of your answer.
AI undoubtedly has a place in medicine, in the sorts of contexts it's already being used in. Specialized machine learning algorithms can be trained to examine medical imaging and detect patterns that look like cancers that humans might miss. Algorithms can be trained to identify or detect warning signs for diseases divined from analyses of large numbers of specific cases. This stuff is real, already in the field, and I'm not experienced enough in the space to know how well it works, but it's the stuff that has real promise.
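To make the contrast concrete, that kind of specialized model is usually just a supervised classifier trained on labeled images, roughly along these lines (a minimal, purely illustrative sketch assuming PyTorch/torchvision and a hypothetical folder of labeled scans, not any real clinical pipeline):

```python
# Illustrative only: fine-tune a small image classifier on labeled scans.
# The "scans/train" folder layout and benign/malignant labels are hypothetical.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models

transform = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),  # many scans are single-channel
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# Hypothetical layout: scans/train/benign/*.png, scans/train/malignant/*.png
train_data = datasets.ImageFolder("scans/train", transform=transform)
loader = DataLoader(train_data, batch_size=32, shuffle=True)

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 2)  # two classes: benign vs. malignant

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(5):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```

It's narrow, it's trained on one labeled task, and its error rate can be measured -- which is exactly why it's useful, and exactly what a general-purpose text generator is not.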
LLMs are not general artificial intelligence. They're prompted text generators that are largely being tuned as a consumer product that sells itself on the fact that it feels impressive. Every single time I've seen someone try to apply one to any field of experienced knowledge work, they either give up using it for anything but the simplest tasks, because it's bad at the things it's asked to do, or the user winds up Dunning-Krugering themselves into not learning anything.
If you are seriously asking ChatGPT for medical diagnoses, for your own sake, stop it. Go to an actual doctor. I am not at all suggesting that the current state of healthcare anywhere in particular is perfect, but the solution is not to go ask your toaster if you have cancer.
I think that your information is slightly out of date. (From Wolfram's book, perhaps?) LLM + plain vanilla RAG solves almost all of the problems you mentioned. LLM + agentic RAG solves them pretty much entirely.
Thus your comment is basically at odds with reality. Not only have these models eclipsed what they were capable of in early 2023, when it was easy to dismiss them as "glorified autocompletes," but they're now genuinely turning the "expert system" meme into a reality via RAG-based techniques and other methods.
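For anyone unfamiliar, "plain vanilla RAG" just means retrieving relevant passages from a curated corpus and putting them in the prompt so the model answers from references rather than vibes. A minimal sketch, assuming the openai Python client, a toy in-memory corpus, and placeholder model names (a real deployment would use a vector database and vetted clinical references):

```python
# Minimal "LLM + vanilla RAG" sketch: embed a corpus, retrieve the closest
# passages for a question, and answer only from those passages.
import numpy as np
from openai import OpenAI

client = OpenAI()

corpus = [
    "Reference passage on iron-deficiency anemia ...",      # placeholder text
    "Guideline excerpt on community-acquired pneumonia ...",  # placeholder text
]

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

corpus_vecs = embed(corpus)

def answer(question, k=2):
    q_vec = embed([question])[0]
    # cosine similarity against the corpus, keep the top-k passages
    sims = corpus_vecs @ q_vec / (np.linalg.norm(corpus_vecs, axis=1) * np.linalg.norm(q_vec))
    context = "\n\n".join(corpus[i] for i in np.argsort(sims)[-k:])
    chat = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using only the provided context, and cite it."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return chat.choices[0].message.content
```

Swap the toy corpus for real clinical references and you get the grounding that the "spicy autocorrect" framing assumes is impossible.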
Read the conclusions section from the paper you linked:
> GPT-4o’s performance in USMLE disciplines, clinical clerkships, and clinical skills indicates substantial improvements over its predecessors, suggesting significant potential for the use of this technology as an educational aid for medical students. These findings underscore the need for careful consideration when integrating LLMs into medical education, emphasizing the importance of structured curricula to guide their appropriate use and the need for ongoing critical analyses to ensure their reliability and effectiveness.
The ability of an LLM to pass a multiple-choice test has no relationship to its ability to make correlations between things it observes in the real world and diagnoses on actual cases. Being a doctor isn't taking a multiple-choice test. The paper is largely concluding that GPT is likely to be used as a study aid by med students, not by experienced doctors in clinical practice.
From the protocol section:
> This protocol for eliciting a response from ChatGPT was as follows: “Answer the following question and provide an explanation for your answer choice.” Data procured from ChatGPT included its selected response, the rationale for its choice, and whether the response was correct (“accurate” or “inaccurate”). Responses were deemed correct if ChatGPT chose the correct multiple-choice answer. To prevent memory retention bias, each vignette was processed in a new chat session.
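In other words, the quoted protocol boils down to one fresh, stateless query per vignette with a fixed instruction; roughly this, if you reproduce it via the API instead of the chat UI (model name and vignette text are placeholders):

```python
# Sketch of the study's protocol: a fresh completion per vignette plays the
# role of the paper's "new chat session", so no earlier question can leak
# into the model's context.
from openai import OpenAI

client = OpenAI()

PROMPT = "Answer the following question and provide an explanation for your answer choice."

def grade_vignette(vignette_with_choices: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"{PROMPT}\n\n{vignette_with_choices}"}],
    )
    return resp.choices[0].message.content
```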
So all this says is that, in a scenario where you present ChatGPT with a limited number of options, one of which is guaranteed to be correct, in the format of a test question, it is likely to be accurate. This is a much lower hurdle to jump than what you are suggesting. And further, under limitations:
> This study contains several limitations. The 750 MCQs are robust, although they are “USMLE-style” questions and not actual USMLE exam questions. The exclusion of clinical vignettes involving imaging findings limits the findings to text-based accuracy, which potentially skews the assessment of disciplinary accuracies, particularly in disciplines such as anatomy, microbiology, and histopathology. Additionally, the study does not fully explore the quality of the explanations generated by the AI or its ability to handle complex, higher-order information, which are crucial components of medical education and clinical practice—factors that are essential in evaluating the full utility of LLMs in medical education. Previous research has highlighted concerns about the reliability of AI-generated explanations and the risks associated with their use in complex clinical scenarios [10,12]. These limitations are important to consider as they directly impact how well these tools can support clinical reasoning and decision-making processes in real-world scenarios. Moreover, the potential influence of knowledge lagging effects due to the different datasets used by GPT-3.5, GPT-4, and GPT-4o was not explicitly analyzed. Future studies might compare MCQ performance across various years to better understand how the recency of training data affects model accuracy and reliability.
To highlight one specific detail from that:
> Additionally, the study does not fully explore the quality of the explanations generated by the AI or its ability to handle complex, higher-order information, which are crucial components of medical education and clinical practice—factors that are essential in evaluating the full utility of LLMs in medical education.
Finally:
> Previous research has highlighted concerns about the reliability of AI-generated explanations and the risks associated with their use in complex clinical scenarios [10,12]. These limitations are important to consider as they directly impact how well these tools can support clinical reasoning and decision-making processes in real-world scenarios.
You're saying that "LLMs are much more accurate than medical students in licensing exam questions" and extrapolating that to "LLMs can currently function as doctors."
What the study says is "Given a set of text-only questions and a list of possible answers that includes the correct one, one LLM routinely scores highly (as long as you don't include questions related to medical imaging, which it cannot provide feedback on) on selecting the correct answer but we have not done the necessary validation to prove that it arrived at it in the correct way. It may be useful (or already in use) among students as a study tool and thus we should be ensuring that medical curriculums take this into account and provide proper guidelines and education around their limitations."
I get that you really disdain LLMs. But consider that a totally off-the-shelf, stock model is acing the medical licensing exam. It doesn't just perform better than its human counterparts at the very peak of their ability (young, high-energy, fresh from extensive schooling and dedicated multidisciplinary study); it leaves them in the dust.
Surely you realize that they're not going to write, "AI is already capable of replacing family doctors," though that is the obvious implication.
And that's just a stock model. GPT-o1 via the API with agentic RAG is a better doctor than >99% of working physicians. (By "doctor" I mean something like "medical oracle" -- ask a question, get a correct answer.) It's not yet quite as good at generating and testing hypotheses, but few doctors actually bother to do that.
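For clarity, by "agentic RAG" I mean a loop where the model can call a retrieval tool on its own before committing to an answer. A rough, illustrative sketch: the search_literature function, its backing corpus, and the model choice are all placeholder assumptions, not a working clinical system.

```python
# Sketch of an agentic retrieval loop: the model may call a search tool
# repeatedly before answering. Everything here is illustrative.
import json
from openai import OpenAI

client = OpenAI()

def search_literature(query: str) -> str:
    # Placeholder: would query a curated medical reference index.
    return "Top passages for: " + query

tools = [{
    "type": "function",
    "function": {
        "name": "search_literature",
        "description": "Search a curated medical reference corpus.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

def agentic_answer(question: str) -> str:
    messages = [{"role": "user", "content": question}]
    for _ in range(5):  # cap the number of tool-use rounds
        resp = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
        msg = resp.choices[0].message
        if not msg.tool_calls:
            return msg.content  # model decided it has enough evidence
        messages.append(msg)
        for call in msg.tool_calls:
            query = json.loads(call.function.arguments)["query"]
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": search_literature(query),
            })
    return "No final answer within the tool-call budget."
```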
As an aside, I quickly tested GPT-o1 by giving it question 95 on the sample test. I'm no doctor, but I've got extensive training in chemistry and biochem, and I'm not ashamed to admit that the question totally stumped me.
GPT-o1 gave the correct answer the first time around, and a very detailed explanation as to why all other potential answers must be false. A really remarkable performance, I think.
Now imagine it's an open-ended scenario, not multiple-choice. It would still come to the right conclusion and provide an accurate diagnosis.
An LLM can only provide the statistically most likely diagnosis based on its training data.
If used by an experienced doctor, an LLM can be a valuable tool that saves time and maybe even increases accuracy in the majority of cases.
The problem is that LLMs will be used to replace experienced doctors and will be treated as 100% accurate tools.
This will result in experienced doctors becoming first rare and then non-existent, and outcomes for patients will become increasingly unfavorable, because LLMs always produce a confident result even when it would be obviously wrong to an experienced doctor or even just a student.
it has nothing to do with disdain of LLMs, i'm an extensive user of warp (a very good LLM-based tool) and at my job we use them in depth in the software i build for summarization and other tasks that LLMs are generally considered good at. i spend a lot of time working with LLMs and find that, in some cases, they can be extremely useful, particularly when it comes to completing simple tasks in natural language.
i am also aware of their limitations and have a reasonable and realistic view of what they can currently do and where they are headed. i have seen many failure modes, i am familiar with patterns in their output, and i understand the boundaries of their comprehension, capabilities and understanding.
not buying into the current silicon valley money pit du jour, and not misreading studies to validate that view, does not equate to just being disdainful. i'm being realistic, because i understand what they do and how they work.
i'm not going to go around in circles with you - you don't seem all that interested in engaging with the meat of anything i say to you, and instead just want to keep trying to rationalize your misunderstanding of the single study you found in support of your position, which is your right.
i feel ethically obligated to say, once again, that an LLM isn't a doctor and you should under no circumstances go to one for medical advice. you could really cause yourself some problems.
if you do so, that's on you. best of luck. incidentally i suspect someone has an awesome picture of a monkey at a steep discount you might be interested in.
It's far more than a single study; that's just one example of a very powerful development. The same phenomenon is also occurring with state bar exams, etc. And that one study is hardly misunderstood -- as you can verify for yourself.
> i understand the boundaries of their comprehension, capabilities and understanding.
It seems to me that you are quite far behind the current state of the art, and you apparently underestimate even stock GPT-o1, which is pretty old news.
I'm willing to place a friendly wager with you: let's find a doctor who does online consultations and give him three questions selected at random from that sample test. These are diagnostic-type questions that reflect well what a country doctor would encounter in daily practice. We can leave the questions open-ended or give him the multiple-choice options. Then we put GPT-o1 to the same questions. I'd be very happy to bet that the LLM outperforms the doctor. I'd even place a secondary bet that the LLM answers all questions correctly and that the doctor answers fewer than two correctly.