European annals of dental sciences (Online), vol. 52, no. 1, pp. 10-16, 2025 (TRDizin)
Purpose: This study aimed to evaluate the accuracy and comprehensiveness of the responses generated by GPT-4o and Claude 3.5 Sonnet to the most frequently asked questions about endodontic emergencies.

Materials and Methods: The most frequently asked questions about nine different topics in endodontics (inferior alveolar nerve block, sodium hypochlorite accidents, aspiration of dental materials, separated instruments, perforation, transportation, Ca(OH)2 extrusion, root filling, and flare-up) were generated by GPT-3.5. Each question was asked to both GPT-4o and Claude 3.5 Sonnet. Two authors independently scored the responses. Accuracy and comprehensiveness were assessed for each question using Likert scales. The data were statistically analyzed using the Mann–Whitney U test and the Kruskal–Wallis test. The significance level was set at 0.05.

Results: Responses generated by both GPT-4o and Claude 3.5 Sonnet to a total of 81 open-ended questions were evaluated. The two models yielded similar results in terms of accuracy and comprehensiveness (p > 0.05). For GPT-4o, the topics of root filling, perforation, and flare-up had the lowest accuracy scores, and root filling and separated instruments had the lowest comprehensiveness scores (p < 0.05). The accuracy of Claude 3.5 Sonnet's responses did not differ significantly between topics (p > 0.05); however, separated instruments had the lowest comprehensiveness scores (p < 0.05).

Conclusions: The accuracy and comprehensiveness scores of GPT-4o and Claude 3.5 Sonnet are statistically similar. Despite the high levels of accuracy and comprehensiveness shown by GPT-4o and Claude 3.5 Sonnet, they cannot yet replace the operator in endodontic procedures.
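The statistical workflow named in the Methods can be illustrated with a minimal Python sketch. This is not the authors' code; all scores and group sizes below are hypothetical stand-ins for the study's Likert-scale ratings, shown only to make the two tests concrete.

```python
# Minimal sketch of the abstract's statistical analyses (hypothetical data).
from scipy.stats import mannwhitneyu, kruskal

# Hypothetical 5-point Likert accuracy scores per model (one value per question).
gpt4o_scores = [5, 4, 5, 3, 4, 5, 4, 2, 5]
claude_scores = [4, 5, 5, 4, 3, 5, 4, 3, 5]

# Mann-Whitney U test: compare the two models' score distributions.
u_stat, p_between = mannwhitneyu(gpt4o_scores, claude_scores,
                                 alternative="two-sided")

# Kruskal-Wallis test: compare scores across topics within one model.
# Three hypothetical topic groups are shown; the study covered nine topics.
root_filling = [2, 3, 3, 2]
perforation = [3, 2, 3, 3]
flare_up = [3, 3, 2, 2]
h_stat, p_topics = kruskal(root_filling, perforation, flare_up)

alpha = 0.05  # significance level used in the study
print(f"Mann-Whitney U: p = {p_between:.3f}, significant: {p_between < alpha}")
print(f"Kruskal-Wallis: p = {p_topics:.3f}, significant: {p_topics < alpha}")
```

Both tests are nonparametric, which matches the ordinal Likert-scale ratings described in the Methods: the Mann–Whitney U test handles the two-model comparison, while the Kruskal–Wallis test handles the comparison across more than two topic groups.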