Poster | Poster Session 04 Program Schedule
02/15/2024
12:00 pm - 01:15 pm
Room: Majestic Complex (Posters 61-120)
Poster Session 04: Neuroimaging | Neurostimulation/Neuromodulation | Teleneuropsychology/Technology
Final Abstract #95
Toward Automated Neuropsychology: Evaluating the Utility of ChatGPT in Neuropsychological Testing with Older Adults
Hanan Rafiuddin, University of North Texas, Denton, United States; Allan Dimmick, University of North Texas, Denton, United States; Charlie Su, University of North Texas, Denton, United States; April Wiechmann, University of North Texas Health Science Center, Fort Worth, United States; David Cicero, University of North Texas, Denton, United States
Category: Assessment/Psychometrics/Methods (Adult)
Keyword 1: neuropsychological assessment
Keyword 2: technology
Keyword 3: mild cognitive impairment
Objective:
ChatGPT, a prominent OpenAI language model, has gained attention for its potential applications in clinical practice. Previous research has primarily focused on ChatGPT's accuracy in screening decision-making, differential diagnoses, and report generation within the medical field. To date, no studies have explored ChatGPT's utility in neuropsychology, particularly for diagnosing individuals based on neurocognitive assessment results. The present study evaluated the agreement between provisional neurocognitive diagnoses generated by ChatGPT-4 and diagnoses made by clinicians in a sample of older adults.
Participants and Methods:
Participants were a community sample of 165 older adults (mean age = 77.88 years; 60% female) seeking a neuropsychological evaluation at an outpatient geriatric clinic in the Dallas-Fort Worth area.
ChatGPT-4 was used to provide a provisional neurocognitive diagnosis (normal changes, amnestic mild cognitive impairment, nonamnestic mild cognitive impairment, or dementia). ChatGPT's understanding of each neurocognitive measure was validated and enhanced as needed. ChatGPT was provided with brief demographic information and performance on the following measures: Test of Executive Functioning in an Emergency, Behavioral Dyscontrol Scale, Trails A & B, Similarities, Digit Span, American National Adult Reading Test, Mini-Mental State Examination, Boston Naming Test, FAS Phonemic Fluency Test, Animals Semantic Fluency Test, Constructional Praxis, Hopkins Verbal Learning Test, Clock, Logical Memory, Visual Recognition, and the Clinical Dementia Rating (CDR). Score ranges, means, standard deviations, classifications, and the score reporting format (T-scores, scaled scores, standard scores, or raw scores) were provided for all measures.
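The abstract does not report the exact prompt or interface used. Purely as an illustration of how per-participant demographics and test scores could be passed to ChatGPT-4 programmatically, the following is a minimal sketch using the OpenAI Python client; the field names, score values, and prompt wording are hypothetical and are not the study's materials.

```python
# Minimal sketch (not the authors' actual prompt): passing one participant's
# demographics and test scores to ChatGPT-4 and requesting a provisional
# neurocognitive diagnosis. All field names and values are illustrative only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

participant = {
    "age": 78, "sex": "female", "education_years": 14,
    "scores": {
        "MMSE (raw, 0-30)": 26,
        "Trails A (T-score)": 42,
        "Trails B (T-score)": 38,
        "HVLT Delayed Recall (T-score)": 31,
        "CDR total": 3.0,
    },
}

prompt = (
    "You are assisting with neuropsychological classification. "
    "Given the demographics and test scores below, assign ONE provisional "
    "diagnosis: normal changes, amnestic MCI, nonamnestic MCI, or dementia. "
    f"Participant data: {participant}"
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # near-deterministic output aids reproducibility across cases
)
print(response.choices[0].message.content)
```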
We evaluated interrater reliability (IRR) with Cohen’s kappa and a chi-square test of association. Receiver operating characteristic (ROC) curve analysis was then conducted to investigate which measures predicted diagnostic agreement.
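As a rough sketch of how this analysis plan could be implemented, assuming the two sets of diagnoses are stored as parallel per-participant arrays (the arrays below are toy placeholders, not study data):

```python
# Sketch of the analysis pipeline: interrater reliability, association,
# and ROC analysis. Arrays below are toy placeholders, not study data.
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
from sklearn.metrics import cohen_kappa_score, roc_auc_score, roc_curve

clinician = np.array(["normal", "aMCI", "dementia", "naMCI", "dementia"])
chatgpt   = np.array(["normal", "aMCI", "aMCI",     "naMCI", "dementia"])

# Interrater reliability between ChatGPT and clinician diagnoses
kappa = cohen_kappa_score(clinician, chatgpt)

# Chi-square test of association and Cramer's V effect size
table = pd.crosstab(clinician, chatgpt)
chi2, p, dof, _ = chi2_contingency(table)
n = table.values.sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))

# ROC analysis: does a measure (e.g., CDR total) predict diagnostic agreement?
agreement = (clinician == chatgpt).astype(int)
cdr_total = np.array([0.5, 3.0, 6.5, 2.0, 9.0])        # placeholder scores
auc = roc_auc_score(agreement, cdr_total)
fpr, tpr, thresholds = roc_curve(agreement, cdr_total)  # candidate cut-offs

print(f"kappa={kappa:.2f}, chi2={chi2:.2f}, V={cramers_v:.2f}, AUC={auc:.2f}")
```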
Results:
Cohen's kappa indicated fair diagnostic agreement (κ = .33; 95% CI: .23-.42; p < .001). A chi-square test [χ²(9) = 87.39, p < .001] revealed a significant but moderate association (Cramér's V = .44, p < .001) between diagnoses made by ChatGPT and those made by clinicians; overall, ChatGPT and the neuropsychologists agreed on 66% of cases. Follow-up ROC curves showed that only the CDR total score yielded an acceptable area under the curve (AUC = .84). Regarding sensitivity, ChatGPT and clinicians agreed 91% of the time when the CDR total score was above 2.75; regarding 1-specificity, they disagreed 55% of the time when the CDR was below 2.75. Raising the cut-off score to 5.75 decreased sensitivity (.71) but yielded a more acceptable 1-specificity (.14).
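In ROC terms, sensitivity at a given cut-off is the proportion of agreeing cases whose CDR total falls at or above the cut-off, and 1-specificity is the corresponding proportion among disagreeing cases. A minimal sketch of computing both at the reported cut-offs of 2.75 and 5.75, again with placeholder data rather than the study's scores:

```python
# Sketch: sensitivity and 1-specificity at specific CDR cut-offs, where
# "CDR >= cut-off" predicts ChatGPT/clinician agreement. Arrays are toy
# placeholders; only the cut-off values (2.75, 5.75) come from the abstract.
import numpy as np

cdr_total = np.array([0.5, 1.5, 3.0, 4.5, 6.0, 9.0])  # placeholder scores
agreement = np.array([0,   0,   1,   1,   0,   1])    # 1 = same diagnosis

def metrics_at_cutoff(scores, agreement, cutoff):
    predicted = scores >= cutoff
    sensitivity = predicted[agreement == 1].mean()            # true positive rate
    one_minus_specificity = predicted[agreement == 0].mean()  # false positive rate
    return sensitivity, one_minus_specificity

for cutoff in (2.75, 5.75):
    sens, fpr = metrics_at_cutoff(cdr_total, agreement, cutoff)
    print(f"CDR cut-off {cutoff}: sensitivity={sens:.2f}, 1-specificity={fpr:.2f}")
```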
Conclusions:
This study highlights the potential of integrating large language models (LLMs) into neuropsychological practice, even without access to clinical interview transcripts. Findings demonstrate significant IRR; however, LLMs remain in early development, and further investigation is needed before they can be relied on in the field. This research serves as an initial exploration, paving the way for more rigorous evaluation of AI in neuropsychological diagnostics.