Tropical Medicine and Infectious Disease

Article

Enhancing the Interpretability of Malaria and Typhoid Diagnosis with Explainable AI and Large Language Models

Kingsley Attai 1,*, Moses Ekpenyong 2,3, Constance Amannah 4, Daniel Asuquo 5, Peterben Ajuga 6, Okure Obot 2, Ekemini Johnson 1, Anietie John 1, Omosivie Maduka 7, Christie Akwaowo 8 and Faith-Michael Uzoka 9

1 Department of Mathematics and Computer Science, Ritman University, Ikot Ekpene 530101, Nigeria; eke5461@gmail.com (E.J.); aniettejohn5@gmail.com (A.J.)
2 Department of Computer Science, Faculty of Computing, University of Uyo, Uyo 520103, Nigeria; mosesekpenyong@uniuyo.edu.ng (M.E.); okureobot@uniuyo.edu.ng (O.O.)
3 Science, Technology, Engineering and Mathematics (STEM) Centre and Centre for Research, University of Uyo, Uyo 520103, Nigeria
4 Department of Computer Science, Ignatius Ajuru University of Education, Port Harcourt 500102, Nigeria; aftermymsc@gmail.com
5 Department of Information Systems, Faculty of Computing, University of Uyo, Uyo 520103, Nigeria; danielasuquo@uniuyo.edu.ng
6 Department of Computer Engineering, Faculty of Engineering, Gregory University, Uturu 441106, Nigeria; ajugapeterben@gmail.com
7 University of Port Harcourt Teaching Hospital, Port Harcourt 500102, Nigeria; omosivie.maduka@gmail.com
8 University of Uyo Teaching Hospital, Uyo 520103, Nigeria; christieakwaowo@uniuyo.edu.ng
9 Department of Mathematics and Computing, Mount Royal University, Calgary, AB T3E 6K6, Canada; fuzoka@mtroyal.ca
* Correspondence: attai.kingsley@ritmanuniversity.edu.ng; Tel.: +234-8101250218

Citation: Attai, K.; Ekpenyong, M.; Amannah, C.; Asuquo, D.; Ajuga, P.; Obot, O.; Johnson, E.; John, A.; Maduka, O.; Akwaowo, C.; et al. Enhancing the Interpretability of Malaria and Typhoid Diagnosis with Explainable AI and Large Language Models. Trop. Med. Infect. Dis. 2024, 9, 216. https://doi.org/10.3390/tropicalmed9090216

Academic Editor: John Frean
Received: 7 August 2024; Revised: 13 September 2024; Accepted: 14 September 2024; Published: 16 September 2024

Copyright: © 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Abstract: Malaria and typhoid fever are prevalent diseases in tropical regions, and both are exacerbated by unclear protocols, drug resistance, and environmental factors. Prompt and accurate diagnosis is crucial to improve accessibility and reduce mortality rates. Traditional diagnosis methods cannot effectively capture the complexities of these diseases due to the presence of similar symptoms. Although machine learning (ML) models offer accurate predictions, they operate as “black boxes” with non-interpretable decision-making processes, making it challenging for healthcare providers to comprehend how the conclusions are reached. This study employs explainable AI (XAI) models such as Local Interpretable Model-agnostic Explanations (LIME), and Large Language Models (LLMs) like GPT, to clarify diagnostic results for healthcare workers, building trust and transparency in medical diagnostics by describing which symptoms had the greatest impact on the model's decisions and providing clear, understandable explanations. The models were implemented on Google Colab and Visual Studio Code because of their rich libraries and extensions. Results showed that the Random Forest model outperformed the other tested models; in addition, important features were identified with the LIME plots, while ChatGPT 3.5 had a comparative advantage over the other LLMs.
The study integrates RF, LIME, and GPT in building a mobile app to enhance the interpretability and transparency of a malaria and typhoid diagnosis system. Despite its promising results, the system's performance is constrained by the quality of the dataset. Additionally, while LIME and GPT improve transparency, they may introduce complexities in real-time deployment due to computational demands and the need for internet service to maintain relevance and accuracy. The findings suggest that AI-driven diagnostic systems can significantly enhance healthcare delivery in environments with limited resources, and future work can explore the applicability of this framework to other medical conditions and datasets.

Keywords: malaria diagnosis; typhoid diagnosis; machine learning; XAI; LIME; GPT; BERT; ChatGPT; Gemini; perplexity; explainability; interpretability

1. Introduction

Typhoid fever and malaria are two of the most prevalent febrile diseases in the world, presenting serious public health issues, especially in tropical and subtropical areas. Typhoid and malaria are common in these areas due to the high humidity and temperatures, inadequate healthcare facilities, and the shortage of qualified healthcare providers [1]. Despite these diseases being caused by different pathogens and transmitted by diverse vectors, they share several similarities as regards epidemiology, clinical manifestation, and co-infection. Their prevalence is attributed to environmental and healthcare factors, including a warm and humid climate, rapid urbanization without adequate infrastructure, which results in crowded living conditions and poor sanitation, limited access to high-quality healthcare, a lack of preventive measures, and weak disease surveillance systems in these regions. Typhoid fever and malaria continue to be leading causes of morbidity and mortality [2].

Salmonella enterica serotype Typhi is the bacterium that causes typhoid fever (enteric fever), which affects millions of people worldwide and can have serious consequences if left untreated [3–5]. Malaria, on the other hand, is caused by Plasmodium parasites that are transmitted by Anopheles mosquito bites, infecting millions of people and claiming the lives of hundreds of thousands every year [6–8]. Malaria remains one of the world's most serious health problems [9] and the second most studied disease according to a systematic review [10]; this is due to its widespread prevalence, high mortality rate, drug resistance, and environmental factors such as climate change in tropical regions.

The prompt and effective diagnosis of these febrile diseases is essential for efficient treatment and care, but current diagnostic techniques often face limitations in accessibility, specificity, and sensitivity. Blood smear examination (microscopy) and rapid diagnostic tests (RDTs) are the current diagnostic techniques for malaria, while the Widal test and blood culture are the tests for typhoid fever. Since blood smear microscopy is low-cost, effective, and capable of differentiating between malaria species and quantifying parasites, it is the gold standard for diagnosing malaria.
However, it does require a functional infrastructure and skilled, qualified microscopists. RDTs identify malaria antigens in a small volume of blood by using monoclonal antibodies that are directed against the target parasite antigen and impregnated on a test strip, but they may be less sensitive in identifying mixed or non-Plasmodium falciparum infections [11]. The Widal test detects typhoid fever in patients' serum using a suspension of dead Salmonella enterica as an antigen. Still, it has low specificity and sensitivity, which can result in incorrect diagnosis and treatment. In contrast, blood culture has high specificity but can have compromised sensitivity due to low bacterial loads or previous antibiotic use [12,13].

Machine learning (ML) algorithms are frequently used in the healthcare sector to help decision-makers make well-informed decisions [14,15]. Medical diagnostics has found ML to be a potent tool that can improve the efficiency and accuracy of diagnosis, but to guarantee that medical professionals can rely on and comprehend the judgments made by these models, the use of ML models in clinical settings demands a high level of interpretability and transparency. Studies have applied numerous ML techniques in diagnosing malaria [16–18] and typhoid fever [19], as well as both conditions together [20–22]. Even though ML models are frequently used to diagnose diseases, the lack of integrated explainability in previous research makes it difficult for medical professionals to have high confidence in the predictions. According to Anderson and Thomas [23], concerns about ML algorithms' lack of interpretability frequently impede their acceptance in the healthcare sector. Since the healthcare sector is highly regulated, there is a high demand for accountability and transparency in the decision-making processes of ML models before their adoption [24]. Healthcare practitioners must be able to comprehend and interpret the predictions made by ML models for them to be used safely, as these models are used to supplement clinical decision-making. This capacity to comprehend and interpret the choices made by ML models is critical in this sector, as decisions can have a significant impact on patient outcomes.

To address this challenge, an explainable AI (XAI) technique like Local Interpretable Model-agnostic Explanations (LIME) offers insights into how models arrive at their predictions, thereby promoting trust and aiding in clinical decision-making by healthcare professionals. XAI is becoming increasingly important in the healthcare sector, where decision-making has extremely high stakes, because it is challenging for healthcare professionals to trust and comprehend the decisions made by traditional machine learning models. In clinical settings, where comprehension of the reasoning behind a diagnosis is critical for patient safety, regulatory compliance, and ethical considerations, the lack of interpretability may impede the adoption of AI [25]. Therefore, XAI offers solutions to these problems by facilitating AI models' transparent and intelligible decision-making processes. XAI techniques such as LIME are widely utilized to clarify the inner workings of complex models. LIME operates by using an interpretable model local to the prediction to approximate the black-box model. It modifies the input data, tracks how the predictions change as a result, and then fits a straightforward, understandable model to these modified samples [26,27].
In situations where individual case explanations are required, LIME is especially helpful, as it helps determine which characteristics are most important for a particular prediction. The interpretability of ML models in the healthcare industry is greatly enhanced by LIME, which allows physicians to better comprehend and rely on AI-driven insights, and its capacity to offer concise, useful explanations improves the usefulness of AI systems in the processes of diagnosis and treatment planning. LIME has been applied in several healthcare settings, such as diagnosing diabetes [28], classifying co-morbidities associated with febrile diseases in children and pregnant women [29], and producing transparent health predictions [30].

To further improve accuracy and explainability, incorporating large language models (LLMs) into diagnostic processes seems promising in combination with XAI techniques. LLMs are advanced AI systems built using deep learning techniques and trained on vast amounts of data to accomplish a wide range of natural language processing (NLP) tasks. These models can bridge the gap between complex ML algorithms and clinical understanding. They are trained on a wealth of medical data and can provide distinctive interpretations and generate detailed, contextually relevant explanations for diagnostic outcomes. The use of LLMs in medical contexts has advanced significantly thanks to projects like the Generative Pre-trained Transformer (GPT) and Bidirectional Encoder Representations from Transformers (BERT). These models can produce human-like text and comprehend intricate linguistic patterns because they have been trained on enormous volumes of text data. The applications of BERT go beyond identifying pandemic illnesses; it can also be used to process electronic medical records and evaluate the results of goals-of-care discussions in clinical trials [30–33]. GPT has proven to be remarkably adept at producing coherent and contextually relevant text in various domains [34]. GPT can help in the healthcare industry by delivering comprehensive patient reports, producing justifications for medical diagnoses, and offering assistance during clinical decision-making processes [35]. The accuracy and explainability of diagnostic systems can be greatly improved by integrating these LLMs, and they can produce thorough narratives that clarify the reasoning behind diagnostic predictions, which facilitates clinician comprehension and validation of AI recommendations. This ability is essential for bridging the gap between cutting-edge AI models and real-world, routine clinical use, raising the standard of healthcare delivery as a whole.

Several other studies have integrated ML and XAI in diagnosis, such as predicting the risk of hypertension [36], preventing breast cancer [37], differentiating bipolar disorder [38], predicting hepatitis C [39], and modeling comorbidity in patients with febrile diseases [29]. Other studies have proposed LLMs for healthcare purposes such as the prediction of potential diseases [40], multimodal diagnosis [41], answering cardiology and vascular pathology questions [42], and answering questions on health diagnosis [43], but there appears to be a gap in the literature regarding the combined use of all three methods (ML, XAI, and LLMs) in diagnosing febrile diseases such as malaria and typhoid fever.
This study aims to enhance the interpretability of typhoid and malaria diagnosis using ML techniques such as Extreme Gradient Boost (XGBoost), Random Forest (RF), and Support Vector Machine (SVM) with LIME, and LLMs such as GPT, Gemini, and Perplexity. RF reduces the chance of overfitting and produces a robust result by combining multiple decision trees. The XGBoost algorithm is highly scalable and effective, capable of managing both linear and nonlinear relationships, while SVM can generalize well to new data, making it a dependable tool for diagnosing diseases with similar symptoms. The XAI tool gives healthcare workers concise explanations for every diagnosis, assisting them in determining which symptoms had the greatest influence on the diagnosis. The LLMs further improve the output and increase the tool's usability for non-specialists by converting complex technical explanations into plain language. This study emphasizes the potential of integrating these tools to interpret and contextualize medical data, hence bridging the gap between healthcare worker comprehension and complex ML diagnoses. A dataset consisting of patients' symptoms and diagnoses of malaria and typhoid was collected from healthcare facilities across the Niger Delta region of Nigeria. By leveraging these advanced tools, we seek to develop a diagnostic model that delivers precise diagnoses and provides transparent and understandable insights into its decision-making process. This research holds significant potential to improve diagnostic practices, ultimately contributing to better patient outcomes and advancing the field of medical diagnostics.

This study's primary contributions are:
• The consideration of multiple diseases (typhoid fever and malaria) allows for a thorough evaluation of the patient's health, which is essential for managing co-infection and comorbidity.
• Using real-world data ensures that the models are trained and validated on clinical cases, thereby enhancing the practical relevance and applicability of our findings.
• The black-box nature of many ML models is addressed by the integration of an XAI method, which gives medical professionals transparent and comprehensible insights into how each feature influences the diagnosis, ensuring that diagnostic results are presented in a way that is meaningful for easier interpretation. By focusing on interpretability, healthcare workers can make more accurate and timely diagnoses.
• LLMs give the diagnosis process an extra layer of context-aware understanding, and incorporating them makes it possible to better understand and analyze complex medical outcomes.
• The combination of LLMs and conventional ML models enables a thorough comparison of various diagnosis strategies. This demonstrates not only the models' efficacy but also the advantages and disadvantages of each approach to managing medical data.
• The integration of XAI, LLMs, and ML puts this work at the forefront of medical AI research. It demonstrates the viability and benefits of using a hybrid approach to address difficult diagnostic problems, establishing a standard for further study in the area.
The study is organized as follows: The methodology is presented in Section 2, including data collection, preprocessing, and the application of XAI and ML models, along with the incorporation of LLMs for improved diagnostic interpretability. The results are discussed in Section 3, evaluating the effectiveness of the various algorithms and illustrating how XAI offers insights into model decisions, along with the implications for clinical practice. Section 4 concludes the study, highlighting its limitations and offering recommendations for further research to advance diagnostic techniques.

2. Methodology

2.1. Malaria and Typhoid Diagnosis Framework

The proposed diagnosis framework for malaria and typhoid fever is presented in Figure 1. The major components of the framework include a healthcare worker, medical experts, and a mobile device for the collection, processing, and storage of information locally and on cloud-based storage for decision making. Patient data were obtained from medical experts and pre-processed into a format suitable for machine learning modeling and processing by large language models. Pre-processing ensures data quality, selects and encodes pertinent features, balances the dataset, and normalizes inputs, which contributes to the model's ability to make more dependable predictions. The proposed model can be utilized in the diagnosis of typhoid fever and malaria with enhanced accuracy and explainability by a healthcare worker through a mobile device. Through the user-friendly interface, healthcare workers can input a patient's vitals and symptoms using dropdown menus and sliders. After the data are entered, the model can process them and instantly diagnose the patient as likely having typhoid fever, malaria, neither, or both.

Figure 1. Malaria and Typhoid Fever Diagnosis Framework.

2.2. Description of the Dataset Used for the Study

The New Frontiers in Research Fund project's dataset instrument, designed by a team of medical experts in the field of febrile diseases and computer scientists, was used in this study.
The dataset, comprising 4870 patient records, was organized into six sections, including demographic data, patient symptoms, risk factors, and diagnosis information [44]. The first section contains the patient demographics, as shown in Table 1, and the diagnosing physician's information. The second section contains the patient's symptoms on a five-point scale (1 = absent; 2 = mild; 3 = moderate; 4 = severe; 5 = very severe), along with the doctor's level of confidence (a numerical rating scale from 1 to 10). The five-point symptom scale is based on clinical reality, where symptoms vary in severity, and it ensures that data collection is consistent across various doctors and cases, reducing variability and potential bias. The patient's degree of susceptibility to other non-clinical risk factors was listed in the third section, and the doctor's initial diagnosis was listed in the fourth. The confirmed diagnosis was included in the sixth section of the dataset after further investigations such as full blood count, blood film, and serology were conducted on the patient in the fifth section. A linguistic scale (1 = absent; 2 = very low; 3 = low; 4 = moderate; 5 = high; 6 = very high) was used to rate the intensity of attack for both preliminary and confirmed diagnoses (Sections 4 and 6), along with the doctor's degree of confidence (1–10) in each case. The dataset contained malaria, typhoid, HIV, respiratory tract infection, urinary tract infection, tuberculosis, Lassa fever, yellow fever, and dengue fever, with a total of 50 symptoms.

Table 1. Statistics of male and female patients in the dataset.

Patient Age (Years)    Male    Female    Total
<5                      534     419       953
5–12                    346     323       669
13–19                   150     213       363
20–64                  1012    1605      2617
≥65                     133     135       268
Total                  2175    2695      4870

2.3. Data Preprocessing and Oversampling

The collected dataset comprised columns with both numeric and string data types, along with a few missing values.
Missing values are a common problem in datasets and can cause bias, reduce model accuracy, and complicate data preprocessing, all of which can negatively affect ML model performance. Data preprocessing was conducted, including feature selection, feature scaling, and data cleaning. Records with missing features, irrelevant data, and columns that were not needed were eliminated during the data-cleaning process. Records with missing symptoms were likewise eliminated to maintain the integrity of the dataset. Because the patient consultation tool did not include symptoms for patients under the age of five, records of those patients were removed; patients in this age group may not be able to accurately express certain symptoms, leaving them to rely entirely on their parents' interpretation. To carry out the feature selection process, the most pertinent and significant features for modeling febrile illness (malaria and typhoid fever symptoms) were selected. The dataset was reduced to 3914 records, including only the confirmed malaria and typhoid fever diagnoses and their twelve (12) symptoms. These two diseases with their 12 symptoms were selected because the rest of the diseases were underrepresented in the dataset, leading to an imbalanced dataset. The scope was narrowed to these two diseases to enhance the model's ability to detect and differentiate between them more effectively. A patient's symptoms, the intensity of each symptom, and the confirmed diagnoses (malaria and typhoid fever) are all included in the processed dataset. The list of symptoms and diseases with abbreviations is presented in Table 2. As shown in Figure 2, custom mapping was used to map the disease severity 'Absent' (1) to binary 0 and 'Very-low' to 'Very-severe' (2 to 6) to binary 1.

Table 2. Symptoms and diseases with abbreviations.

Symptom/Disease                Abbreviation
Abdominal pains                ABDPN
Bitter taste in mouth          BITAIM
Chills and rigors              CHLNRIG
Constipation                   CNST
Fatigue                        FTG
Fever                          FVR
Generalized body pain          GENBDYPN
Headaches                      HDACH
High-grade fever               HGGDFVR
Lethargy                       LTG
Muscle and body pain           MSCBDYPN
Stepwise rise fever            SWRFVR
Malaria                        MAL
Typhoid fever/Enteric fever    ENFVR

Figure 2. Pre-processed dataset.

After further analysis, we noticed that of the 3914 patients, 1088 had neither malaria nor typhoid fever, 1669 had only malaria, 107 had only typhoid fever, and 1050 had both diseases.
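As an illustration of this mapping step, the following is a minimal sketch (not the authors' released code) that binarizes the confirmed-diagnosis columns and builds the single four-class target later used as the 'Disease' label; the column names follow the Table 2 abbreviations, and the input file name is hypothetical.

```python
import pandas as pd

# Illustrative sketch (not the authors' released script).
# Column names follow the Table 2 abbreviations; the file name is hypothetical.
SYMPTOMS = ["ABDPN", "BITAIM", "CHLNRIG", "CNST", "FTG", "FVR",
            "GENBDYPN", "HDACH", "HGGDFVR", "LTG", "MSCBDYPN", "SWRFVR"]

df = pd.read_csv("febrile_dataset_cleaned.csv")

# Confirmed-diagnosis intensity: 1 = 'Absent' -> 0; 2..6 = 'Very-low'..'Very-severe' -> 1
severity_to_binary = {1: 0, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1}
for disease in ["MAL", "ENFVR"]:
    df[disease] = df[disease].map(severity_to_binary)

# Single target: 0 = no disease, 1 = typhoid only, 2 = malaria only, 3 = both
df["Disease"] = df["ENFVR"] + 2 * df["MAL"]
print(df["Disease"].value_counts())  # expected: 1088, 107, 1669, 1050 for classes 0-3
```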
The Synthetic Minority Oversampling Technique (SMOTE) was employed to handle the class imbalance. SMOTE has several advantages: compared to simply replicating minority class instances, it lowers the chance of overfitting by creating synthetic samples; it improves model performance; it is compatible with most ML techniques; and it is useful for various types of data. SMOTE identified minority class instances, selected k-nearest neighbors, and generated and added synthetic samples to the original dataset, as presented in Figure 3. The oversampled dataset contains 6676 patient records, with the class labels 0 (No disease), 1 (Typhoid only), 2 (Malaria only), and 3 (Both diseases) in the 'Disease' column.

Figure 3. Oversampled dataset with SMOTE.
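A minimal sketch of this oversampling step is shown below; it assumes the imbalanced-learn (imblearn) implementation of SMOTE and the DataFrame from the previous sketch, and the random seed is an assumption rather than a reported setting.

```python
from collections import Counter

from imblearn.over_sampling import SMOTE

# Illustrative sketch: balance the four 'Disease' classes with SMOTE.
X = df[SYMPTOMS]        # the 12 symptom columns from Table 2
y = df["Disease"]

smote = SMOTE(random_state=42)          # the seed is an assumption
X_res, y_res = smote.fit_resample(X, y)

print(Counter(y))       # imbalanced original distribution
print(Counter(y_res))   # balanced distribution (6676 records in total)
```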
2.4. Diagnostic Models and Model Optimization

We used Google Colaboratory (Colab), a free cloud-based platform from Google that offers a Python programming environment with quick access to robust graphics processing unit (GPU) resources and ML libraries. Additionally, the platform provides a CPU runtime and easily integrates Google Drive for storage. Python packages and libraries such as NumPy, Pandas, Scikit-Learn, and Matplotlib, which are necessary for creating classification models, were utilized. The ML algorithms used in building our diagnostic models and the performance metrics are presented in the following subsections, incorporating hyperparameter tuning via grid search cross-validation (GridSearchCV), which is used to increase the precision of the diagnosis. GridSearchCV is an expanded method for optimizing hyperparameters that enables customized search spaces for each hyperparameter, using designated ranges. The hyperparameter settings used were: SVM ('C': [0.1, 1, 10, 100], 'gamma': ['scale', 'auto', 0.001, 0.01, 0.1], 'kernel': ['rbf', 'linear']), where 'C' controls the trade-off between achieving a low error on the training data and minimizing the model complexity, gamma defines how far the influence of a single training example reaches, and the kernel function transforms the data into a higher-dimensional space to make them easier to separate using a linear boundary; XGBoost ('max_depth': [3, 4, 5, 6], 'learning_rate': [0.01, 0.1, 0.2], 'n_estimators': [100, 200, 300], 'colsample_bytree': [0.3, 0.7]), where max_depth determines the maximum depth of the trees, learning_rate controls how much the model's weights are adjusted to the loss gradient, n_estimators indicates the number of trees to be built, and colsample_bytree defines the subsample ratio of columns when constructing each tree; and RF ('n_estimators': [100, 200, 300], 'max_depth': [None, 10, 20, 30], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4], 'bootstrap': [True, False]), where min_samples_split determines the minimum number of samples required to split an internal node, min_samples_leaf specifies the minimum number of samples required to be at a leaf node, and bootstrap determines whether bootstrap samples are used when building trees. Each of these hyperparameters aids in fine-tuning the behavior of the model, enhancing its functionality and its ability to diagnose the febrile illnesses considered in this study with good generalization. These hyperparameters were derived from built-in functions of the corresponding ML algorithms.
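For illustration, the sketch below tunes the SVM grid above with GridSearchCV; the XGBoost and RF grids follow the same pattern. The 80/20 split and five-fold stratified cross-validation mirror Section 2.6, while the random seeds and the default accuracy scoring are assumptions.

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.svm import SVC

# Illustrative sketch: tune the SVM grid from Section 2.4 on the oversampled data.
X_train, X_test, y_train, y_test = train_test_split(
    X_res, y_res, test_size=0.2, stratify=y_res, random_state=42)

param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": ["scale", "auto", 0.001, 0.01, 0.1],
    "kernel": ["rbf", "linear"],
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(SVC(probability=True), param_grid, cv=cv, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```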
The local laptops utilized for this study were a Dell Latitude 7480 (Core i5-7200U CPU @ 2.50 GHz, 4 CPUs, ~2.7 GHz) with 16 GB RAM for the ML and XAI modeling, while a Samsung 950QDB (Core i7-1165G7 @ 2.80 GHz, 8 CPUs, ~2.8 GHz) with 16 GB RAM was used for the large language modeling. We used Visual Studio Code, a free code editor that supports several extensions and allows for quick coding initiation. LLMs are easily accessible thanks to the Python foundation of our development environment. The process was automated by utilizing core Python packages and libraries such as pandas, numpy, flet, matplotlib, flask, flask-sqlalchemy, seaborn, sk-learn, and joblib for loading models. The information extractor comprises a prompt generator, automator, and interpreter. The Malaria and Typhoid Diagnosis System interacts with various application programming interfaces (APIs) for database communication and diagnosis management. It has two main components: the front-end, built using Flet with Matplotlib and Seaborn for visualizations, and the back-end, powered by Flask for API integration and MySQL database management via Flask-SQLAlchemy. The prompt generator converts data into a readable format, saves prompts in a JSON file, and organizes the patient's symptoms and severity into manageable prompts. The prompt used by Caruccio et al. [45] mimics the conversation of a physician seeking assistance in diagnosing a patient based on particular symptoms. The template is "The patient has these symptoms: [S] Tell me which of the following diagnoses is most related to the symptoms: [D]? [H]", where [S] is all of the symptoms listed in the prompt, [D] the diagnoses the LLM must decide on, and [H] the answer or diagnoses provided by the LLM. This template was modified to arrive at our prompt: "The patient has these symptoms with severity levels, listed in the table below. (create a table with only the diagnosis column filled in), the output should be in CSV format, diagnosis [Malaria, Typhoid Fever, Both, None]?". The automator manages data flow by retrieving outputs and storing them in a JSON file; it feeds these prompts into the large language models (GPT, Gemini, and Perplexity). After that, the interpreter transforms the JSON output into an Excel file so that reporting and analysis of the data can be carried out. The scripts are available in the following GitHub repository: https://github.com/FebrileDiseasesDiagnoses/Auto_tool.git (accessed on 2 August 2024).
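The sketch below illustrates what the prompt generator could look like for one batch of patients; it is indicative only, assuming the pre-processed DataFrame from the earlier sketches, and the exact wording and JSON layout used by the authors' tool may differ.

```python
import json

import pandas as pd

# Illustrative sketch of the prompt generator; wording and layout are indicative only.
SEVERITY = {1: "absent", 2: "mild", 3: "moderate", 4: "severe", 5: "very severe"}

def build_prompt(row: pd.Series, symptoms: list) -> str:
    table = "\n".join(f"{s}: {SEVERITY.get(row[s], row[s])}" for s in symptoms)
    return ("The patient has these symptoms with severity levels, listed in the "
            "table below. The output should be in CSV format with only the "
            "diagnosis column filled in, diagnosis "
            "[Malaria, Typhoid Fever, Both, None]?\n" + table)

# One prompt per patient for the first 100 records, saved to a JSON file
prompts = [build_prompt(row, SYMPTOMS) for _, row in df.head(100).iterrows()]
with open("prompts.json", "w") as f:
    json.dump(prompts, f, indent=2)
```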
2.4.1. Random Forest

The Random Forest algorithm is an ensemble ML technique with robust resistance to over-fitting that combines several decision trees to increase prediction accuracy [46], as shown in Figure 4. RF trains its trees concurrently, operates well on large datasets, and is good at estimating missing data [47]. RF can easily resolve high-dimensional and complex problems such as the prediction of disease conditions [48–50]. By combining individual tree predictions via voting, the final prediction is produced. This approach increases the model's robustness, decreases overfitting, and can aid in diagnosing febrile diseases.

Figure 4. Random Forest schematic diagram.

2.4.2. Extreme Gradient Boost

The XGBoost algorithm is a component of the gradient boosting framework, which can be applied to regression or classification predictive modeling problems. Figure 5 depicts the computation process used by XGBoost as it introduces weak learners into the ensemble, focusing each new learner on correcting the mistakes made by the previous ones. Because of its reputation for managing structured data, XGBoost is extensively utilized in numerous applications, including the prediction of disease [51].

Figure 5. Extreme gradient boosting schematic diagram.

2.4.3. Support Vector Machine

SVM is well-known for working well in high-dimensional spaces and for handling non-linearly separable data by utilizing kernel functions. It seeks to determine which hyperplane best divides the data into distinct classes. The margin is the distance between the hyperplane and the closest observations, and the support vectors are the points that are closest to it, as shown in Figure 6. SVM uses little memory, performs well with a wide range of features, and can be tailored with various kernel functions for intricate decision boundaries. SVM is resistant to overfitting and can handle high-dimensional data as well as binary and multi-class classification issues in medical diagnosis, making it an effective tool for diagnosing diseases [52].

Figure 6. Support Vector Machine diagram.
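To make Sections 2.4.1–2.4.3 concrete, the following sketch instantiates the three classifiers with example values drawn from the tuning grids in Section 2.4; these are illustrative settings rather than the best parameters selected by the grid search.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

# Illustrative sketch: the three diagnostic models with example settings taken
# from the tuning grids (not the reported best parameters).
models = {
    "RF": RandomForestClassifier(n_estimators=200, max_depth=None,
                                 min_samples_split=2, min_samples_leaf=1,
                                 bootstrap=True, random_state=42),
    "XGBoost": XGBClassifier(max_depth=4, learning_rate=0.1,
                             n_estimators=200, colsample_bytree=0.7),
    "SVM": SVC(C=10, gamma="scale", kernel="rbf", probability=True),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))
```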
2.5. Interpretability and Explainability Methods

Local Interpretable Model-agnostic Explanations (LIME) approximates the complex model near a particular prediction with an interpretable model, such as a linear model, to provide local explanations. The integration of LIME into our model follows these key steps: (i) Instance Selection: LIME was applied to each instance in the test dataset, generating localized explanations for the model's predictions on a case-by-case basis; (ii) Feature Contribution Analysis: LIME produces visualizations that indicate the contribution of each feature to the prediction. Features that positively influence the likelihood of a specific disease are displayed on the right side of the plot, while those that decrease the likelihood are shown on the left; (iii) Global Insight Aggregation: By aggregating LIME explanations across multiple instances, we can identify patterns and key features that consistently influence the model's decisions, providing a broader understanding of the model's behavior across the dataset.
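A minimal sketch of steps (i)–(iii) using the lime package's LimeTabularExplainer is given below, assuming the fitted Random Forest model and the train/test arrays from the earlier sketches; the global aggregation in step (iii) is deliberately simplified.

```python
from collections import defaultdict

import numpy as np
from lime.lime_tabular import LimeTabularExplainer

# Illustrative sketch of steps (i)-(iii); class names follow the 'Disease' encoding.
explainer = LimeTabularExplainer(
    training_data=np.asarray(X_train),
    feature_names=SYMPTOMS,
    class_names=["No disease", "Typhoid only", "Malaria only", "Both"],
    mode="classification",
)

# (i)-(ii) Explain one test instance and list its signed feature contributions.
exp = explainer.explain_instance(np.asarray(X_test)[0],
                                 models["RF"].predict_proba,
                                 num_features=12, top_labels=1)
top_label = exp.available_labels()[0]
print(exp.as_list(label=top_label))   # e.g. [('FVR > 3.00', 0.12), ...]

# (iii) Crude global aggregation: mean absolute contribution per feature rule
# over a subset of test instances.
totals = defaultdict(float)
for i in range(50):
    e = explainer.explain_instance(np.asarray(X_test)[i],
                                   models["RF"].predict_proba,
                                   num_features=12, top_labels=1)
    for rule, weight in e.as_list(label=e.available_labels()[0]):
        totals[rule] += abs(weight) / 50
print(sorted(totals.items(), key=lambda kv: -kv[1])[:5])
```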
Generative Pre-trained Transformer (GPT) is pre-trained using unsupervised learning on a large corpus of text data, where it learns to predict the word that will appear next in a sequence based on every word that has come before it. This pre-training gives GPT a thorough grasp of language syntax, semantics, and context. When GPT is fine-tuned on particular tasks, like text generation, question answering, or text completion, it uses its learned representations to produce outputs that make sense and are relevant to the context. GPT is an effective tool for NLP applications because it can produce text similar to that of a human being and manage a wide range of linguistic tasks.

Bidirectional Encoder Representations from Transformers (BERT) is a transformer-based model that is trained to predict missing words in both directions by masking certain words in the input and predicting them using both left and right context. Thanks to this bidirectional training, it can capture more complex contextual meanings and relationships within text, producing more accurate language representations. BERT can comprehend subtleties in language and performs well on a variety of natural language understanding tasks, including named entity recognition, sentiment analysis, and machine translation, thanks to its large-scale pre-training. BERT is a flexible and potent model for a range of NLP applications, and its performance can be further improved by fine-tuning it for particular tasks.

2.6. Model Performance Metrics

The dataset used for this study initially contained 4870 patient records with symptoms of febrile diseases. After preprocessing, the records were reduced to 3914, and after oversampling, 6676 patient records with relevant features were retained for ML modeling. In total, 80% of the dataset was used for training and 20% for testing. GridSearchCV was employed to optimize model performance, and StratifiedKFold was used for cross-validation, dividing the dataset into five stratified folds and shuffling the data before splitting to ensure robust and unbiased results.

The experimental models were evaluated using key performance metric components. True Positives (TP) are cases where the model correctly predicts the positive class, represented by the diagonal elements of the confusion matrix, while True Negatives (TN) are cases where the model correctly predicts the negative class; TN is the sum of all the cells that are in neither the row nor the column corresponding to the class being considered. False Positives (FP) are cases where the model incorrectly predicts the positive class, while False Negatives (FN) are cases where the model incorrectly predicts the negative class. These metric components are helpful when evaluating the sensitivity and specificity of diagnostic tests. The evaluation metrics listed below were used in this paper.

Accuracy is a measurement of how well a model predicts all labels linked to each data point in a dataset. Datasets with a balanced distribution of positive and negative samples are good candidates for accuracy; for unbalanced datasets, it is less helpful because it can be deceiving.

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (1)

Precision is a metric that expresses how accurately a model predicts positive outcomes; it measures the model's capacity to correctly identify true positive instances while avoiding false positives. Precision matters when false positives come at a high cost; in the context of medical diagnosis, for instance, a false positive may result in needless treatments.

Precision = TP / (TP + FP)    (2)

Recall is a metric used to assess a model's capacity to locate every positive instance in a dataset. It measures how sensitive the model is to True Positive cases. When the cost of false negatives is significant, as in medical screenings, recall is crucial because missing a positive case (a false negative) can be critical.

Recall = TP / (TP + FN)    (3)

F1-Score is a metric that represents the harmonic mean of Recall and Precision. The F1-score ranges from 0 to 1, where 1 denotes that every data point was correctly predicted and 0 denotes that no data point was correctly predicted. The F1-Score is helpful when a balance between Recall and Precision must be struck, particularly when the class distribution is not uniform.

F1 = 2 × (Precision × Recall) / (Precision + Recall)    (4)

Log Loss is a measure of the probability of a prediction's accuracy; it penalizes the difference between the predicted probabilities and the actual class labels, punishing incorrect predictions more severely. It is defined as:

Log Loss = −(1/N) Σ_{i=1}^{N} [ y_i log(p_i) + (1 − y_i) log(1 − p_i) ]    (5)

where N is the total number of samples in the dataset, y_i is the actual label of the i-th instance, p_i is the predicted probability of the i-th instance being in the positive class, and log(p_i) is the natural logarithm of the predicted probability for the positive class.
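The sketch below shows how these metrics can be computed for the held-out test split with scikit-learn, assuming the fitted models from the earlier sketches; weighted averaging over the four classes is an assumption, since the paper reports a single summary value per model.

```python
from sklearn.metrics import accuracy_score, log_loss, precision_recall_fscore_support

# Illustrative sketch: score one fitted model on the held-out 20% test split.
y_pred = models["RF"].predict(X_test)
y_prob = models["RF"].predict_proba(X_test)

acc = accuracy_score(y_test, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(y_test, y_pred, average="weighted")
ll = log_loss(y_test, y_prob)
print(f"accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f} "
      f"f1={f1:.4f} log_loss={ll:.4f}")
```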
3. Results and Discussion

The results of our assessment of the models' performance are shown in this section, including the XAI method adopted as well as the experimental assessment of the LLMs for malaria and typhoid fever diagnoses. Figures 7–9 present the confusion matrices, an essential instrument for assessing how well a classification ML model performs.

Figure 7. XGBoost Algorithm Confusion Matrix.

Figure 8. RF Algorithm Confusion Matrix.

Figure 9. SVM Algorithm Confusion Matrix.
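Confusion matrices such as those in Figures 7–9 can be generated directly from the fitted models; the sketch below assumes the models and test split from the earlier sketches and is illustrative only.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Illustrative sketch: one confusion matrix per fitted model, as in Figures 7-9.
labels = ["No disease", "Typhoid only", "Malaria only", "Both"]
for name, model in models.items():
    ConfusionMatrixDisplay.from_estimator(model, X_test, y_test,
                                          display_labels=labels)
    plt.title(f"{name} confusion matrix")
    plt.show()
```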
Table 3 presents the values of these metrics and the computation time of each model, while Figure 10 is a pictorial representation of the models' performance based on the considered metrics. The results show that RF (accuracy = 71.99%, precision = 71.29%, recall = 71.99%, F1-Score = 71.45%) demonstrates superior performance, outperforming XGBoost (accuracy = 71.29%, precision = 70.56%, recall = 71.29%, F1-Score = 70.66%) and SVM (accuracy = 68.60%, precision = 68.65%, recall = 68.60%, F1-Score = 68.21%). High recall and precision are essential for diagnosing diseases like typhoid and malaria by guaranteeing that the majority of real cases are identified; in this case, high precision helps prevent needless treatments for illnesses that are not present. Because both XGBoost and RF balance these metrics well, they are better suited for clinical applications where false positives and false negatives can have detrimental effects. Also, XGBoost has a smaller log loss, which suggests more accurate and well-calibrated probability estimates as well as stronger diagnosis confidence. This may be critical in medical diagnostics, where confidence in the presence of a disease matters as much as raw accuracy. In medical scenarios where treatment decisions are influenced by the certainty of a diagnosis, the lower log loss values for XGBoost indicate that its probability estimates are more reliable. Because of RF's higher log loss, its probability estimates are less trustworthy, which could cause uncertainty when making decisions. SVM performs worse than the other two models in terms of performance metrics and computation time (running time exceeds one hour), implying that it might not work as well for diagnosing typhoid and malaria in this specific dataset. Therefore, ensemble techniques (XGBoost and Random Forest) may be better at capturing the intricate relationships between symptoms and diseases than the SVM model. RF combines the predictions of multiple decision trees to make a final prediction, which results in a slightly higher accuracy but at the cost of increased computational complexity, while XGBoost optimizes each tree by minimizing errors from the previous ones, leading to faster convergence and efficient model optimization. The moderate F1-scores of these models are a result of typhoid fever and malaria having very similar symptoms, making it challenging for the models to differentiate between the two illnesses. This overlap can impair the models' predictive accuracy, especially concerning recall and precision.

Table 3. Diagnostic model performance.

Algorithm    Accuracy    Precision    Recall    F1-Score    Log Loss    Computation Time
XGBoost      0.7129      0.7056       0.7129    0.7066      0.7808      2 min, 32 s
RF           0.7199      0.7129       0.7199    0.7145      1.0548      14 min, 9 s
SVM          0.6860      0.6865       0.6860    0.6821      0.8016      1 h, 22 min, 7 s

Figure 10. Performance Evaluation of the Machine Learning Models.
The LIME plots (Figures 11–13) provide a global view of how the features (symptoms) contribute to the model's diagnoses across the entire test dataset, identifying features with the highest average contributions, both positively and negatively, across all diagnoses. The XGBoost LIME diagram in Figure 11 shows symptoms such as SWRFVR, HDACH, and CNST, as specified by their negative contributions on the left side of the plot, suggesting that lower levels or the absence of these symptoms are associated with a lower likelihood of a patient having malaria and typhoid. Meanwhile, symptoms such as BITAIM, LTG, CHLNRIG, MSCBDYPN, and FVR are the most influential symptoms, consistently contributing to the diagnoses of malaria and typhoid across numerous patients.

Figure 11. XGBoost Algorithm LIME diagram.

The RF LIME diagram in Figure 12 also points out that the same symptoms (SWRFVR, HDACH, and CNST) are associated with a lower likelihood of having malaria and typhoid, whereas BITAIM, CHLNRIG, ABDPN, LTG, GENBDYPN, MSCBDYPN, FTG, and HGGDFVR are influential symptoms that contribute to the diagnoses of malaria and typhoid among patients.

Figure 12. RF Algorithm LIME diagram.

Figure 13 shows the SVM LIME diagram, indicating that CHLNRIG has the highest feature importance, followed by MSCBDYPN, LTG, ABDPN, BITAIM, FTG, and CNST as the influential symptoms that contribute to the diagnoses of malaria and typhoid among patients.
Figure 13 shows the SVM LIME diagram, indicating that CHLNRIG has the highest feature importance, followed by MSCBDYPN, LTG, ABDPN, BITAIM, FTG, and CNST as the influential symptoms that contribute to the diagnoses of malaria and typhoid among patients. Meanwhile, GENBDYPN, SWRFVR, FVR, HGGDFVR, and HDACH are associated with a lower likelihood of having malaria and typhoid.

Figure 13. SVM Algorithm LIME diagram.

It is observed that medical experts should focus on the following influential symptoms for the diagnosis of malaria and typhoid fever in patients: BITAIM, CHNLNRIG, LTG, ABDPN, MSCBDYPN, FVR, GENBDYPN, FTG, and HGGDFVR. This is consistent with the results of Asuquo et al. [53], where GENBDYPN, CHNLNRIG, ABPN, FVR, FTG, and HGDFVR were observable symptoms.

LIME has numerous advantages. It explains an individual diagnosis in a form that is relatively easy for humans to comprehend, helping healthcare workers understand why a model made a specific diagnosis.
LIME can be applied to many ML models, and this versatility makes it suitable for various medical diagnostic systems. In addition, LIME is suitable for generating explanations using local approximations [54]. The limitation of LIME is that it is computationally intensive and expensive to generate explanations for individual diagnoses, especially for large datasets and complex models.

Three sets of experiments were conducted to evaluate the performance of ChatGPT, Gemini, and Perplexity in diagnosing malaria and typhoid. In Experiment 1, one prompt at a time was sent to the LLMs for the first 100 patients in the dataset, recording the outputs in a CSV format to see how they performed with a single set of prompts. Experiment 2 involved sending 100 prompts from the first 100 patients in the dataset to the LLMs and storing the outputs in a CSV format to observe their responses to a series of prompts. In Experiment 3, 100 unique prompts were sent to the models repeatedly until the entire dataset was exhausted in order to assess how the models performed when given large sets of unique prompts. Table 4 presents the results of the three experiments.

In Exp 1, ChatGPT 3.5 has a slightly better performance with the highest F1-score (30.99%); the F1-score is crucial as it balances recall and precision, providing a comprehensive measure of the model's performance. Although better accuracy and recall are achieved by ChatGPT 3.5 and Gemini (30%), Perplexity is better at minimizing false positives with the highest precision (38.90%). In Exp 2, Perplexity performs better, with the highest F1-score (26.29%), accuracy (28%), and recall (28%). ChatGPT 3.5 is better at reducing false positives with the highest precision, while Gemini has the lowest performance. In Exp 3, ChatGPT 3.5 has better accuracy, precision, and recall, followed by Gemini and Perplexity. However, the model's relatively low F1-score suggests an imbalance between precision and recall, indicating that ChatGPT may have trouble striking a balance between minimizing false positives and identifying true positives.

Table 4. Large language models' performance.

Experiment   Algorithm     Accuracy   Precision   Recall   F1-Score
Exp 1        ChatGPT 3.5   0.3000     0.35562     0.3000   0.30999
             Gemini        0.3000     0.3449      0.3000   0.2908
             Perplexity    0.2600     0.3890      0.2600   0.28736
Exp 2        ChatGPT 3.5   0.2600     0.2909      0.2600   0.2615
             Gemini        0.2700     0.2607      0.2700   0.2296
             Perplexity    0.2800     0.2524      0.2800   0.2629
Exp 3        ChatGPT 3.5   0.3297     0.3324      0.3297   0.2926
             Gemini        0.2895     0.2709      0.2895   0.2728
             Perplexity    0.2632     0.1957      0.2632   0.2171

Although LLMs have a broad range of knowledge, they may not be specialized in diagnosing complicated medical conditions like ML models that have been specifically trained in this area. The low F1-scores in Table 4 may be related to the LLMs' limitations in handling medical diagnosis tasks, particularly for diseases with similar or overlapping symptoms. The three experiments were carried out to test how the LLMs perform in different scenarios. Exp 1 tests the consistency and reliability of the LLMs in diagnosing diseases when a single prompt is used at a time. Exp 2 tests the LLMs' capacity to manage more inputs concurrently, because healthcare systems frequently handle several cases at the same time. Exp 3 tests the LLMs' capacity to identify illnesses across a larger dataset through repeated exposure to various inputs.
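The sketch below illustrates, under stated assumptions, how a per-patient prompt could be assembled from symptom codes and the response logged to CSV, in the spirit of Experiment 1. It is not the authors' script: query_llm is a hypothetical placeholder for whichever chat model (ChatGPT 3.5, Gemini, or Perplexity) is being queried, and the symptom dictionary is illustrative.

```python
# Illustrative sketch of prompt construction and CSV logging for one LLM
# experiment. The actual prompt wording and API calls used in the study
# may differ; query_llm is a stub to be replaced with a real model call.
import csv

def build_prompt(symptoms: dict) -> str:
    present = ", ".join(code for code, value in symptoms.items() if value)
    return (
        "A patient presents with the following symptoms: "
        f"{present}. Based only on these symptoms, is the most likely "
        "diagnosis malaria, typhoid fever, or both? Answer with one label."
    )

def query_llm(prompt: str) -> str:
    # Placeholder: substitute a call to the chat model under evaluation.
    return "Malaria"

patients = [
    {"id": 1, "symptoms": {"FVR": 1, "CHLNRIG": 1, "HDACH": 0, "BITAIM": 1}},
    # ... remaining patients from the dataset
]

with open("llm_experiment1.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["patient_id", "prompt", "llm_answer"])
    for p in patients:
        prompt = build_prompt(p["symptoms"])
        writer.writerow([p["id"], prompt, query_llm(prompt)])
```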
ChatGPT is an innovative technological tool for comprehending and processing natural language, making it suitable for interpreting and summarizing complex, up-to-date information. Gemini is an adaptable tool that can handle various data types such as images and text, making it suitable for diagnostic purposes. Perplexity is specialized in comprehending and generating complex queries as well as maintaining context, which can be vital for the retrieval and analysis of medical research. However, these LLMs lack specialized knowledge and are capable of producing inaccurate answers, which can be critical in a medical context. They require high computational power to generate and process responses, which could limit their use in real-time systems. Data security and patient privacy are concerns when handling sensitive medical data, and the models require proper validation and regulatory approval before they can be trusted and adopted for clinical use. To facilitate healthcare professionals' comprehension of the reasoning behind a diagnostic output, LLMs integrate and analyze large amounts of medical data and produce human-readable explanations for their decisions.

The overall performance of the ML models in the study was moderate, suggesting the need for a sufficient dataset to enhance the diagnostic models. While the traditional SMOTE aided in balancing the dataset, employing an advanced oversampling method may help improve model performance. Even with GridSearchCV, the hyperparameters might still be improved, particularly for SVM; better configurations could result from investigating alternative parameter tuning techniques such as RandomizedSearchCV or Bayesian optimization. To improve the results of the LLMs, they will be fine-tuned with a larger dataset, and an ensemble method will be employed to combine the strengths of different LLMs.

To integrate the ML, XAI, and LLM techniques into an app, we propose two methods.

Method 1: Separate Training and Validation for ML and LLM
1. Train, test, and validate an ML model to diagnose malaria and typhoid based on the patient dataset.
2. Apply LIME to explain the ML model's diagnoses and how each symptom contributed to the diagnoses.
3. Train, test, and validate an LLM independently for generating explanations based on the patient dataset.
4. Integrate the outputs from the ML model, LIME, and the LLM to provide a comprehensive and interpretable diagnosis.

The advantage of Method 1 is that it might lead to higher diagnostic performance, given the task-specific training of the two models (ML and LLM). The disadvantage is that training and validating two independent models would increase the computational complexity of the diagnostic system, especially when combining the results to ensure consistency and coherence.

Method 2: Integrated ML, LIME, and LLM Process
1. Train, test, and validate an ML model to diagnose malaria and typhoid based on the patient dataset.
2. Apply LIME to explain the ML model's diagnoses and how each symptom contributed to the diagnoses.
3. Use the LLM for further explainability by passing the patient symptoms and ML results (with LIME explanations) through the model to generate diagnostic explanations in natural language.

The advantage of Method 2, which we have implemented, is its simplicity: an integrated pipeline reduces complexity, making the system easier to develop, test, and maintain. Streamlining the procedure into a single pipeline increases performance and can decrease computational overhead. The explanations produced by LIME are directly considered by the LLM, which results in more logical and contextually appropriate explanations. The disadvantage is that the quality of the initial ML and LIME outputs determines the quality of the explanations provided by the LLM.
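A minimal sketch of the Method 2 pipeline is shown below, assuming an already trained classifier with predict_proba and a LimeTabularExplainer such as the one sketched earlier. All function names are illustrative assumptions rather than the app's actual code, and generate_narrative stands in for the ChatGPT call.

```python
# Illustrative sketch of Method 2: ML diagnosis -> LIME explanation ->
# LLM narrative. `model` and `explainer` are assumed to come from an
# earlier training step (e.g. the RF/LIME sketch above).
def diagnose(model, patient_features):
    proba = model.predict_proba([patient_features])[0]
    return int(proba.argmax()), proba

def explain_with_lime(explainer, model, patient_features, label, num_features=6):
    exp = explainer.explain_instance(patient_features, model.predict_proba,
                                     labels=(label,), num_features=num_features)
    return exp.as_list(label=label)      # [(symptom condition, weight), ...]

def build_llm_prompt(diagnosis_label, lime_weights):
    lines = [f"{feat}: {weight:+.3f}" for feat, weight in lime_weights]
    return (
        f"The classifier predicts '{diagnosis_label}'. The LIME weights below "
        "show how each symptom influenced the prediction:\n"
        + "\n".join(lines)
        + "\nExplain this result to a community health worker in plain language."
    )

def generate_narrative(prompt: str) -> str:
    # Placeholder for the ChatGPT call used in the deployed app.
    return "(LLM-generated explanation goes here)"

def integrated_diagnosis(model, explainer, patient_features, class_names):
    label_idx, _ = diagnose(model, patient_features)
    lime_weights = explain_with_lime(explainer, model, patient_features, label_idx)
    prompt = build_llm_prompt(class_names[label_idx], lime_weights)
    return {
        "diagnosis": class_names[label_idx],
        "lime": lime_weights,
        "explanation": generate_narrative(prompt),
    }
```

Passing the LIME weights into the prompt is the design choice that grounds the LLM's narrative in the classifier's actual evidence rather than in the LLM's own reasoning.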
The Malaria and Typhoid Fever Diagnosis System is a mobile app that healthcare workers can use to diagnose typhoid fever and malaria through an easy-to-use interface. The system comprises user authentication, a User main dashboard, and a Patient dashboard. The basic app requirements are an Android OS (version 4.0 and above), RAM: 4 GB (2 GB minimum), ROM: 8 GB minimum, Display Layout: Portrait, and an Internet connection. The user login is shown in Figure 14 and the User main dashboard is shown in Figure 15. The healthcare worker can register a patient, view a list of patients, and set up appointments for patients. Figure 16 is the patient registration form, while Figure 17 is the patient dashboard where the patient's vitals can be entered, as well as the history taking and examination in Figure 18. The patient's provisional diagnosis with XAI results is shown in Figure 19, with the LIME plot displaying the symptoms and how they influenced the model's decision and the explanation by the ChatGPT LLM.

Figure 14. User Login.

Figure 15. User Main Dashboard.

Figure 16. Patient Registration.

Figure 17. Patient Account Dashboard.

Figure 18. History Taking and Examination.

Figure 19. XAI Diagnosis Results.

Previous studies [20,55,56] applied ML models for diagnosing malaria and typhoid fever, though these studies lacked appropriate interpretability in the decision-making process, which often results in medical experts having difficulties in comprehending the reasoning behind diagnostic results.
This study integrated ML, XAI, and LLM to enhance transparency and interpretability in the diagnostic processes in alignment with global healthcare goals. The use of LIME for feature importance analyses and ChatGPT for generating context-aware explanations distinguishes the present study. Several factors can contribute to the low performance scores in Table 4. These include: (1) the dataset used during training is limited in size and diversity, affecting the models' ability to generalize to unseen cases; (2) the LLMs may require further fine-tuning and optimization, as the complexity of the diseases being diagnosed may overlap with other illnesses, thereby challenging the models to accurately differentiate between them. Furthermore, the LLMs did not show high domain tolerance to the investigated illnesses; hence, fine-tuning them on domain-specific data can significantly improve their performance.

4. Conclusions

This study creates a medical diagnostic framework for Malaria and Typhoid fever by integrating XAI, LLMs, and ML models. This approach aims to demystify the black-box nature of ML models, offering transparent insights into how each feature or symptom affects the diagnosis. The RF model showed superior prediction performance in terms of accuracy, recall, precision, and F1-score compared to XGBoost and SVM. The high recall and precision values in RF are crucial for accurately diagnosing these diseases and for making appropriate treatment decisions. However, XGBoost exhibited the lowest log loss and fastest computation time. Further analysis indicates that SVM performs worse than the other two models, making it less suitable for this dataset. The study suggests that ensemble techniques like RF and XGBoost better capture the complex relationships between symptoms and diseases. The XAI analysis identified BITAIM, CHNLNRIG, LTG, ABDPN, MSCBDYPN, FVR, GENBDYPN, FTG, and HGGDFVR as key features for predicting Malaria and Typhoid. Among the LLMs, ChatGPT 3.5 performed slightly better than Gemini and Perplexity. This study has shown how RF, LIME, and GPT can be used effectively to diagnose typhoid fever and malaria using a mobile-based system that meets the crucial requirements of interpretability and transparency, improving the diagnostic process's acceptability and understanding among medical professionals. Future research should examine the application of various machine learning models, XAI techniques, and LLMs on a variety of datasets and across other medical conditions, such as in the diagnosis of diabetes, cardiovascular diseases, and cancer detection, to further validate and generalize the findings of this study.
The validity of AI-driven diagnostics can be strengthened by extending their application to additional medical conditions. This will ultimately improve patient outcomes in a range of healthcare domains.

Author Contributions: Conceptualization, F.-M.U. and K.A.; methodology, K.A., C.A. (Constance Amannah), D.A. and M.E.; validation, F.-M.U., O.O., K.A., C.A. (Constance Amannah), D.A. and M.E.; formal analysis, K.A., P.A. and D.A.; data curation, K.A. and P.A.; writing—original draft preparation, K.A., D.A., P.A., E.J., A.J. and M.E.; writing—review and editing, K.A., M.E., D.A., C.A. (Constance Amannah), O.O., C.A. (Christie Akwaowo), O.M. and F.-M.U.; supervision, F.-M.U., C.A. (Constance Amannah), D.A., O.O. and M.E.; project administration, F.-M.U. and O.O.; funding acquisition, F.-M.U. All authors have read and agreed to the published version of the manuscript.

Funding: This research was funded by the New Frontier Research Fund, grant number NFRFE-201901365, between April and March 2024.

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.

Data Availability Statement: The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest: The authors declare no conflicts of interest.

References
1. Asuquo, D.; Attai, K.; Obot, O.; Ekpenyong, M.; Akwaowo, C.; Arnold, K.; Uzoka, F.M. Febrile Disease Modeling and Diagnosis System for Optimizing Medical Decisions in Resource-Scarce Settings. Clin. eHealth 2024, 7, 52–76. [CrossRef]
2. Sohanang-Nodem, F.S.; Ymele, D.; Fadimatou, M.; Fodouop, S.C. Malaria and Typhoid Fever Coinfection among Febrile Patients in Ngaoundéré (Adamawa, Cameroon): A Cross-Sectional Study. J. Parasitol. Res. 2023, 2023, 5334813. [CrossRef] [PubMed]
3. Galán, J.E. Typhoid Toxin Provides a Window into Typhoid Fever and the Biology of Salmonella Typhi. Proc. Natl. Acad. Sci. USA 2016, 113, 6338–6344. [CrossRef]
4. Gashaw, T.; Jambo, A. Typhoid in Less Developed Countries: A Major Public Health Concern. In Hygiene and Health in Developing Countries—Recent Advances; IntechOpen: Rijeka, Croatia, 2022. [CrossRef]
5. Alhumaid, N.K.; Alajmi, A.M.; Alosaimi, N.F.; Alotaibi, M.; Almangour, T.A.; Nassar, M.S.; Tawfik, E.A. Reported Bacterial Infectious Diseases in Saudi Arabia: Overview and Recent Advances. Res. Sq. 2023, 1–39. [CrossRef]
6. Paton, D.G.; Childs, L.M.; Itoe, M.A.; Holmdahl, I.E.; Buckee, C.O.; Catteruccia, F. Exposing Anopheles Mosquitoes to Antimalarials Blocks Plasmodium Parasite Transmission. Nature 2019, 567, 239–243. [CrossRef]
7. Sato, S. Plasmodium—A Brief Introduction to the Parasites Causing Human Malaria and Their Basic Biology. J. Physiol. Anthropol. 2021, 40, 1. [CrossRef]
8. Carson, B.B., III. Mosquitos and Malaria Take a Toll. In Challenging Malaria: The Private and Social Incentives of Mosquito Control; Springer International Publishing: Cham, Switzerland, 2023; pp. 15–25. [CrossRef]
9. Bria, Y.P.; Yeh, C.H.; Bedingfield, S. Significant Symptoms and Nonsymptom-Related Factors for Malaria Diagnosis in Endemic Regions of Indonesia. Int. J. Infect. Dis. 2021, 103, 194–200. [CrossRef] [PubMed]
10. Attai, K.; Amannejad, Y.; Vahdat Pour, M.; Obot, O.; Uzoka, F.M. A Systematic Review of Applications of Machine Learning and Other Soft Computing Techniques for the Diagnosis of Tropical Diseases. Trop. Med. Infect. Dis. 2022, 7, 398. [CrossRef]
11. Bosco, A.B.; Nankabirwa, J.I.; Yeka, A.; Nsobya, S.; Gresty, K.; Anderson, K.; Mbaka, P.; Prosser, C.; Smith, D.; Opigo, J.; et al. Limitations of Rapid Diagnostic Tests in Malaria Surveys in Areas with Varied Transmission Intensity in Uganda 2017–2019: Implications for Selection and Use of HRP2 RDTs. PLoS ONE 2020, 15, e0244457. [CrossRef] [PubMed]
12. Mohan, F.R.; Jaber, A.S. Role of the Widal Test in Diagnosing Typhoid Fever Compared with Culture at Teaching Al-Hussein Hospital in Nasiriyah. Peerian J. 2024, 26, 72–78.
13. Mather, R.G.; Hopkins, H.; Parry, C.M.; Dittrich, S. Redefining Typhoid Diagnosis: What Would an Improved Test Need to Look Like? BMJ Glob. Health 2019, 4, e001831. [CrossRef] [PubMed]
14. Boina, R.; Ganage, D.; Chincholkar, Y.D.; Wagh, S.; Shah, D.U.; Chinthamu, N.; Shrivastava, A. Enhancing Intelligence Diagnostic Accuracy Based on Machine Learning Disease Classification. Int. J. Intell. Syst. Appl. Eng. 2023, 11, 765–774.
15. Asuquo, D.E.; Umoren, I.; Osang, F.; Attai, K. A Machine Learning Framework for Length of Stay Minimization in Healthcare Emergency Department. Stud. Eng. Technol. J. 2023, 10, 1–17. [CrossRef]
16. Muhammad, B.; Varol, A. A Symptom-Based Machine Learning Model for Malaria Diagnosis in Nigeria. In Proceedings of the 2021 9th International Symposium on Digital Forensics and Security (ISDFS), Elazig, Turkey, 28–29 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–6. [CrossRef]
17. Barracloug, P.A.; Were, C.M.; Mwangakala, H.; Fehringer, G.; Ohanya, D.O.; Agola, H.; Nandi, P. Artificial Intelligence System for Malaria Diagnosis. Int. J. Adv. Comput. Sci. Appl. 2024, 15, 100806. [CrossRef]
18. La-Ariandi, H.; Setyanto, A.; Sudarmawan, S. Classification of Malaria Types Using Naïve Bayes Classification. J. Indones. Sos. Teknol. 2024, 5, 2311–2327. [CrossRef]
19. Bhuiyan, M.A.; Rad, S.S.; Johora, F.T.; Islam, A.; Hossain, M.I.; Khan, A.A. Prediction of Typhoid Using Machine Learning and ANN Prior to Clinical Test. In Proceedings of the 2023 International Conference on Computer Communication and Informatics (ICCCI), Coimbatore, India, 23–25 January 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–7. [CrossRef]
20. Awotunde, J.B.; Imoize, A.L.; Salako, D.P.; Farhaoui, Y. An Enhanced Medical Diagnosis System for Malaria and Typhoid Fever Using Genetic Neuro-Fuzzy System. In Proceedings of the International Conference on Artificial Intelligence and Smart Environment, Errachidia, Morocco, 24–26 November 2022; Springer International Publishing: Cham, Switzerland, 2022; pp. 173–183. [CrossRef]
21. Odion, P.O.; Ogbonnia, E.O. Web-Based Diagnosis of Typhoid and Malaria Using Machine Learning. Niger. Def. Acad. J. Mil. Sci. Interdiscip. Stud. 2022, 1, 89–103.
22. Apanisile, T.; Ayeni, J.A. Development of an Extended Medical Diagnostic System for Typhoid and Malaria Fever. Artif. Intell. Adv. 2023, 5, 28–40. [CrossRef]
23. Anderson, J.; Thomas, J. Interpretable Machine Learning Models for Healthcare Applications; EasyChair: Manchester, UK, 2024; p. 12358.
24. Kiseleva, A.; Kotzinos, D.; De Hert, P. Transparency of AI in Healthcare as a Multilayered System of Accountabilities: Between Legal Requirements and Technical Limitations. Front. Artif. Intell. 2022, 5, 879603. [CrossRef]
25. Albahri, A.S.; Duhaim, A.M.; Fadhel, M.A.; Alnoor, A.; Baqer, N.S.; Alzubaidi, L.; Deveci, M. A Systematic Review of Trustworthy and Explainable Artificial Intelligence in Healthcare: Assessment of Quality, Bias Risk, and Data Fusion. Inf. Fusion 2023, 96, 156–191. [CrossRef]
26. Tan, L.; Huang, C.; Yao, X. A Concept-Based Local Interpretable Model-Agnostic Explanation Approach for Deep Neural Networks in Image Classification. In Proceedings of the International Conference on Intelligent Information Processing, Shenzhen, China, 3–6 May 2024; Springer Nature: Cham, Switzerland, 2024; pp. 119–133. [CrossRef]
27. Thombre, A. Comparison of Decision Trees with Local Interpretable Model-Agnostic Explanations (LIME) Technique and Multi-Linear Regression for Explaining Support Vector Regression Model in Terms of Root Mean Square Error (RMSE) Values. arXiv 2024, arXiv:2404.07046. [CrossRef]
28. Okay, F.Y.; Yıldırım, M.; Özdemir, S. Interpretable Machine Learning: A Case Study of Healthcare. In Proceedings of the 2021 International Symposium on Networks, Computers and Communications (ISNCC), Dubai, United Arab Emirates, 31 October–2 November 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–6. [CrossRef]
29. Attai, K.; Akwaowo, C.; Asuquo, D.; Esubok, N.E.; Nelson, U.A.; Dan, E.; Uzoka, F.M. Explainable AI Modelling of Comorbidity in Pregnant Women and Children with Tropical Febrile Conditions. In Proceedings of the International Conference on Artificial Intelligence and Its Applications, Mauritius, East Africa, 9–10 November 2023; pp. 152–159. [CrossRef]
30. Ashraf, K.; Nawar, S.; Hosen, M.H.; Islam, M.T.; Uddin, M.N. Beyond the Black Box: Employing LIME and SHAP for Transparent Health Predictions with Machine Learning Models. In Proceedings of the 2024 International Conference on Advances in Computing, Communication, Electrical, and Smart Systems (iCACCESS), Dhaka, Bangladesh, 8–9 March 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–6. [CrossRef]
31. Li, F.; Jin, Y.; Liu, W.; Rawat, B.P.S.; Cai, P.; Yu, H. Fine-Tuning Bidirectional Encoder Representations from Transformers (BERT)-Based Models on Large-Scale Electronic Health Record Notes: An Empirical Study. JMIR Med. Inform. 2019, 7, e14830. [CrossRef] [PubMed]
32. Nakamura, Y.; Hanaoka, S.; Nomura, Y.; Nakao, T.; Miki, S.; Watadani, T.; Abe, O. Automatic Detection of Actionable Radiology Reports Using Bidirectional Encoder Representations from Transformers. BMC Med. Inform. Decis. Mak. 2021, 21, 262. [CrossRef] [PubMed]
33. Gorenstein, L.; Konen, E.; Green, M.; Klang, E. BERT in Radiology: A Systematic Review of Natural Language Processing Applications. J. Am. Coll. Radiol. 2024, 21, 914–941. [CrossRef]
34. Yenduri, G.; Ramalingam, M.; Selvi, G.C.; Supriya, Y.; Srivastava, G.; Maddikunta, P.K.R.; Gadekallu, T.R. GPT (Generative Pre-Trained Transformer)—A Comprehensive Review on Enabling Technologies, Potential Applications, Emerging Challenges, and Future Directions. IEEE Access 2024, 12, 54608–54649. [CrossRef]
35. Wang, Z.; Guo, R.; Sun, P.; Qian, L.; Hu, X. Enhancing Diagnostic Accuracy and Efficiency with GPT-4-Generated Structured Reports: A Comprehensive Study. J. Med. Biol. Eng. 2024, 44, 144–153. [CrossRef]
36. Islam, M.M.; Alam, M.J.; Maniruzzaman, M.; Ahmed, N.F.; Ali, M.S.; Rahman, M.J.; Roy, D.C. Predicting the Risk of Hypertension Using Machine Learning Algorithms: A Cross Sectional Study in Ethiopia. PLoS ONE 2023, 18, e0289613. [CrossRef]
37. Silva-Aravena, F.; Núñez Delafuente, H.; Gutiérrez-Bahamondes, J.H.; Morales, J. A Hybrid Algorithm of ML and XAI to Prevent Breast Cancer: A Strategy to Support Decision Making. Cancers 2023, 15, 2443. [CrossRef]
38. Zhu, T.; Liu, X.; Wang, J.; Kou, R.; Hu, Y.; Yuan, M.; Zhang, W. Explainable Machine-Learning Algorithms to Differentiate Bipolar Disorder from Major Depressive Disorder Using Self-Reported Symptoms, Vital Signs, and Blood-Based Markers. Comput. Methods Programs Biomed. 2023, 240, 107723. [CrossRef]
39. Fan, Y.; Lu, X.; Sun, G. IHCP: Interpretable Hepatitis C Prediction System Based on Black-Box Machine Learning Models. BMC Bioinf. 2023, 24, 333. [CrossRef]
40. Jin, M.; Yu, Q.; Zhang, C.; Shu, D.; Zhu, S.; Du, M.; Meng, Y. Health-LLM: Personalized Retrieval-Augmented Disease Prediction Model. arXiv 2024, arXiv:2402.00746.
41. Panagoulias, D.P.; Virvou, M.; Tsihrintzis, G.A. Evaluating LLM-Generated Multimodal Diagnosis from Medical Images and Symptom Analysis. arXiv 2024, arXiv:2402.01730.
42. Hariri, W. Analyzing the Performance of ChatGPT in Cardiology and Vascular Pathologies. arXiv 2023, arXiv:2307.02518.
43. Kusa, W.; Mosca, E.; Lipani, A. "Dr LLM, What Do I Have?": The Impact of User Beliefs and Prompt Formulation on Health Diagnoses. In Proceedings of the Third Workshop on NLP for Medical Conversations, Nusa Dua, Indonesia, 1 November 2023; pp. 13–19.
44. University of Uyo Teaching Hospital; Mount Royal University. NFRF Project Patient Dataset with Febrile Diseases [Data Set]; Zenodo: Bern, Switzerland, 2024. [CrossRef]
45. Caruccio, L.; Cirillo, S.; Polese, G.; Solimando, G.; Sundaramurthy, S.; Tortora, G. Can ChatGPT Provide Intelligent Diagnoses? A Comparative Study Between Predictive Models and ChatGPT to Define a New Medical Diagnostic Bot. Expert Syst. Appl. 2024, 235, 121186. [CrossRef]
46. Han, H.; Zhang, Z.; Cui, X.; Meng, Q. Ensemble Learning with Member Optimization for Fault Diagnosis of a Building Energy System. Energy Build. 2020, 226, 110351. [CrossRef]
47. Zhu, L.; Qiu, D.; Ergu, D.; Ying, C.; Liu, K. A Study on Predicting Loan Default Based on the Random Forest Algorithm. Procedia Comput. Sci. 2019, 162, 503–513. [CrossRef]
48. Ghosh, D.; Cabrera, J. Enriched Random Forest for High Dimensional Genomic Data. IEEE/ACM Trans. Comput. Biol. Bioinform. 2021, 19, 2817–2828. [CrossRef]
49. Jackins, V.; Vimal, S.; Kaliappan, M.; Lee, M.Y. AI-Based Smart Prediction of Clinical Disease Using Random Forest Classifier and Naive Bayes. J. Supercomput. 2021, 77, 5198–5219. [CrossRef]
50. Palimkar, P.; Shaw, R.N.; Ghosh, A. Machine Learning Technique to Prognosis Diabetes Disease: Random Forest Classifier Approach. In Advanced Computing and Intelligent Technologies: Proceedings of ICACIT 2021; Springer: Singapore, 2022; pp. 219–244. [CrossRef]
51. Asselman, A.; Khaldi, M.; Aammou, S. Enhancing the Prediction of Student Performance Based on the Machine Learning XGBoost Algorithm. Interact. Learn. Environ. 2023, 31, 3360–3379. [CrossRef]
52. Devikanniga, D.; Ramu, A.; Haldorai, A. Efficient Diagnosis of Liver Disease Using Support Vector Machine Optimized with Crows Search Algorithm. EAI Endorsed Trans. Energy Web 2020, 7, e10. [CrossRef]
53. Asuquo, D.E.; Attai, K.F.; Johnson, E.A.; Obot, O.U.; Adeoye, O.S.; Akwaowo, C.D.; Uzoka, F.M.E. Multi-Criteria Decision Analysis Method for Differential Diagnosis of Tropical Febrile Diseases. Health Inform. J. 2024, 30, 1–41. [CrossRef]
54. Salih, A.M.; Raisi-Estabragh, Z.; Galazzo, I.B.; Radeva, P.; Petersen, S.E.; Lekadir, K.; Menegaz, G. A Perspective on Explainable Artificial Intelligence Methods: SHAP and LIME. arXiv 2024, arXiv:2305.02012. [CrossRef]
55. Maidabara, A.H.; Ahmadu, A.S.; Malgwi, Y.M.; Ibrahim, D. Expert System for Diagnosis of Malaria and Typhoid. Comput. Sci. IT Res. J. 2021, 2, 1–15. [CrossRef]
56. Mariki, M.; Mkoba, E.; Mduma, N. Combining Clinical Symptoms and Patient Features for Malaria Diagnosis: Machine Learning Approach. Appl. Artif. Intell. 2022, 36, 2031826. [CrossRef]

Disclaimer/Publisher's Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.