> "AI" is basically a vast, curated, compressed database with a powerful index. ...

A_D_E_P_T · 2024-11-12T18:04:18 1731434658

I think that your information is slightly out of date. (From Wolfram's book, perhaps?) LLM + plain vanilla RAG solves almost all of the problems you mentioned. LLM + agentic RAG solves them pretty much entirely.

Even as of right now, stock LLMs are much more accurate than medical students in licensing exam questions: https://mededu.jmir.org/2024/1/e63430

Thus your comment is basically at odds with reality. Not only have these models eclipsed what they were capable of in early 2023, when it was easy to dismiss them as "glorified autocompletes," but they're now genuinely turning the "expert system" meme into a reality via RAG-based techniques and other methods.

bangaroo · 2024-11-12T19:06:58 1731438418

Read the conclusions section from the paper you linked:

> GPT-4o’s performance in USMLE disciplines, clinical clerkships, and clinical skills indicates substantial improvements over its predecessors, suggesting significant potential for the use of this technology as an educational aid for medical students. These findings underscore the need for careful consideration when integrating LLMs into medical education, emphasizing the importance of structured curricula to guide their appropriate use and the need for ongoing critical analyses to ensure their reliability and effectiveness.

The ability of an LLM to pass a multiple-choice test has no relationship to its ability to make correlations between things it's observing in the real world and diagnoses on actual cases. Being a doctor isn't doing a multiple choice test. The paper is largely making the determination that GPT might likely be used as a study aid by med students, not by experienced doctors in clinical practice.

From the protocol section:

> This protocol for eliciting a response from ChatGPT was as follows: “Answer the following question and provide an explanation for your answer choice.” Data procured from ChatGPT included its selected response, the rationale for its choice, and whether the response was correct (“accurate” or “inaccurate”). Responses were deemed correct if ChatGPT chose the correct multiple-choice answer. To prevent memory retention bias, each vignette was processed in a new chat session.

So all this says is in a scenario where you present ChatGPT with a limited number of options and one of them is guaranteed to be correct, in the format of a test question, it is likely accurate. This is a much lower hurdle to jump than what you are suggesting. And further, under limitations:

> This study contains several limitations. The 750 MCQs are robust, although they are “USMLE-style” questions and not actual USMLE exam questions. The exclusion of clinical vignettes involving imaging findings limits the findings to text-based accuracy, which potentially skews the assessment of disciplinary accuracies, particularly in disciplines such as anatomy, microbiology, and histopathology. Additionally, the study does not fully explore the quality of the explanations generated by the AI or its ability to handle complex, higher-order information, which are crucial components of medical education and clinical practice—factors that are essential in evaluating the full utility of LLMs in medical education. Previous research has highlighted concerns about the reliability of AI-generated explanations and the risks associated with their use in complex clinical scenarios [10,12]. These limitations are important to consider as they directly impact how well these tools can support clinical reasoning and decision-making processes in real-world scenarios. Moreover, the potential influence of knowledge lagging effects due to the different datasets used by GPT-3.5, GPT-4, and GPT-4o was not explicitly analyzed. Future studies might compare MCQ performance across various years to better understand how the recency of training data affects model accuracy and reliability.

To highlight one specific detail from that:

> Additionally, the study does not fully explore the quality of the explanations generated by the AI or its ability to handle complex, higher-order information, which are crucial components of medical education and clinical practice—factors that are essential in evaluating the full utility of LLMs in medical education.

Finally:

> Previous research has highlighted concerns about the reliability of AI-generated explanations and the risks associated with their use in complex clinical scenarios [10,12]. These limitations are important to consider as they directly impact how well these tools can support clinical reasoning and decision-making processes in real-world scenarios.

You're saying that "LLMs are much more accurate than medical students in licensing exam questions" and extrapolating that to "LLMs can currently function as doctors."

What the study says is "Given a set of text-only questions and a list of possible answers that includes the correct one, one LLM routinely scores highly (as long as you don't include questions related to medical imaging, which it cannot provide feedback on) on selecting the correct answer but we have not done the necessary validation to prove that it arrived at it in the correct way. It may be useful (or already in use) among students as a study tool and thus we should be ensuring that medical curriculums take this into account and provide proper guidelines and education around their limitations."

This is not the success you believe it to be.

A_D_E_P_T · 2024-11-12T19:35:20 1731440120

I get that you really disdain LLMs. But consider that a totally off-the-shelf, stock model is acing the medical licensing exam. It doesn't only perform better than human counterparts at the very peak of their ability (young, high-energy, immediately following extensive schooling and dedicated multidisciplinary study) it leaves them in the dust.

If you think that the test is simple or even text-only, here are some sample questions: https://www.usmle.org/sites/default/files/2021-10/Step_1_Sam...

> What the study says is ...

Surely you realize that they're not going to write, "AI is already capable of replacing family doctors," though that is the obvious implication.

And that's just a stock model. GPT-o1 via the API /w agentic RAG is a better doctor than >99% of working physicians. (By "doctor" I mean something like "medical oracle" -- ask a question, get a correct answer.) It's not yet quite as good at generating and testing hypotheses, but few doctors actually bother to do that.

A_D_E_P_T · 2024-11-12T23:37:05 1731454625

As an aside, I quickly tested GPT-o1 by giving it question 95 on the sample test. I'm no doctor, but I've got extensive training in chemistry and biochem, and I'm not ashamed to admit that the question totally stumped me.

GPT-o1 gave the correct answer the first time around, and a very detailed explanation as to why all other potential answers must be false. A really remarkable performance, I think.

Now imagine it's an open-ended scenario, not multiple-choice. It would still come to the right conclusion and provide an accurate diagnosis.

dmz73 · 2024-11-13T23:29:44 1731540584

LLM can only provide statistically most likely diagnosis based on the training data. If used by the experienced doctor LLM can be a valuable tool that saves time and maybe even increases accuracy in majority of cases. The problem is that LLMs will be used to replace experienced doctors and will be used as 100% accurate tools. This will result in experienced doctors becoming first rare and then non-existent and outcomes for patients will become increasingly unfavorable due to LLMs always producing a confident result even when it would be obviously wrong to an experienced doctor or even just a student.

bangaroo · 2024-11-13T14:13:21 1731507201

it has nothing to do with disdain of LLMs, i'm an extensive user of warp (a very good LLM-based tool) and at my job we use them in depth in the software i build for summarization and other tasks that LLMs are generally considered good at. i spend a lot of time working with LLMs and find that, in some cases, they can be extremely useful, particularly when it comes to completing simple tasks in natural language.

i am also aware of their limitations and have a reasonable and realistic view of what they can currently do and where they are headed. i have seen many failure modes, i am familiar with patterns in their output, and i understand the boundaries of their comprehension, capabilities and understanding.

not buying into the current silicon valley money pit du jour and misunderstanding studies to validate that view does not equate to just being disdainful. i'm being realistic, because i understand what they do and how they work.

i'm not going to circle with you - you don't seem all that interested in engaging with the meat of anything i say to you, and instead just want to continue to try and rationalize your misunderstanding of the single study you found in support of your position, which is your right.

i feel ethically obligated to say, once again, that an LLM isn't a doctor and you should under no circumstances go to one for medical advice. you could really cause yourself some problems.

if you do so, that's on you. best of luck. incidentally i suspect someone has an awesome picture of a monkey at a steep discount you might be interested in.

A_D_E_P_T · 2024-11-13T17:55:09 1731520509

It's far more than a single study, just one example of a very powerful development. The same phenomenon is also occurring in state bar exams, etc. And that one study is hardly misunderstood -- as you can verify for yourself.

> i understand the boundaries of their comprehension, capabilities and understanding.

It seems to me that you are quite far behind the current state of the art, and you apparently underestimate even stock GPT-o1, which is pretty old news.

I'm willing to place a friendly wager with you: Let's find a doctor who does online consultations and give him three questions selected at random from that sample test. These are diagnostic-type questions that well reflect what a country doctor would encounter in daily practice. We can leave the questions open-ended or give him the multiple-choice options. Then we put GPT-o1 to the same questions. I'd be very happy to bet that the LLM outperforms the doctor. I'd even place a secondary bet that the LLM answers all questions correctly and that the doctor answers less than two questions correctly.