BACHELOR OF SCIENCE IN COMPUTER SCIENCE AND ENGINEERING

A CNN-based Pipeline using Plane-wise Ensemble Technique for Classifying Alzheimer's Disease from 3D MRI Images

Mahtab Nur Fardin 190041112
Md. Irfanur Rahman Rafio 190041125
Md. Jubayer Islam 190041129

Department of Computer Science and Engineering
Islamic University of Technology
June, 2024

Declaration of Candidate

This is to certify that the work presented in this thesis is the outcome of the analysis and experiments carried out by Mahtab Nur Fardin, Md. Irfanur Rahman Rafio, and Md. Jubayer Islam under the supervision of Dr. Md. Hasanul Kabir, Professor, Department of Computer Science and Engineering, and co-supervision of Sabbir Ahmed, Assistant Professor, Department of Computer Science and Engineering, Islamic University of Technology, Dhaka, Bangladesh. It is also declared that neither this thesis nor any part of it has been submitted anywhere else for any degree or diploma. Information derived from the published and unpublished work of others has been acknowledged in the text and a list of references is given.

Dr. Md. Hasanul Kabir
Professor
Department of Computer Science and Engineering
Islamic University of Technology (IUT)
Date: June 04, 2024

Sabbir Ahmed
Assistant Professor
Department of Computer Science and Engineering
Islamic University of Technology (IUT)
Date: June 04, 2024

Mahtab Nur Fardin
Student ID: 190041112
Date: June 04, 2024

Md. Irfanur Rahman Rafio
Student ID: 190041125
Date: June 04, 2024

Md. Jubayer Islam
Student ID: 190041129
Date: June 04, 2024

Contents

1 Introduction
  1.1 Motivation and Scope
  1.2 Problem Statement
  1.3 Research Challenges
    1.3.1 Variability in Brain Morphology
    1.3.2 Multi-Class Classification Complexity
    1.3.3 Computational Cost
  1.4 Contribution
  1.5 Organization
2 Related Works
  2.1 Conventional Approaches
    2.1.1 Binary Classification Using Machine Learning Techniques
    2.1.2 Image Processing Methods
    2.1.3 Limitations
  2.2 Deep Learning Based Approaches
    2.2.1 Dominance of CNN Models
    2.2.2 Architecture Overview
    2.2.3 Multi-class Classification
    2.2.4 Utilization of Single and Multi-modal Approaches
    2.2.5 Preprocessing Techniques
  2.3 Summary and Limitations of Existing Architectures
3 Proposed Methodology
  3.1 Architecture Overview
  3.2 Model Architecture
  3.3 Model Integration
  3.4 Projector Functions
    3.4.1 Midplane Projector
    3.4.2 Average Projector
    3.4.3 Max Variance Projector
    3.4.4 Variance Weighted Average Projector
    3.4.5 Linear Learnable (LL) Projector
4 Results and Discussion
  4.1 Dataset
  4.2 Experimental Setup
    4.2.1 Data Preparation
    4.2.2 Hyper-parameter Settings
  4.3 Quantitative Evaluation
    4.3.1 Evaluation Metrics
    4.3.2 Baseline Models
    4.3.3 Plane-wise Ensemble Models
    4.3.4 Comparative Analysis and Discussion
5 Conclusion
References

List of Figures

2.1 A simple CNN architecture that extracts features from the spatial information and then classifies them into different classes [33]
2.2 The architecture of AlzheimerNet, a fine-tuned InceptionV3 [34]
2.3 Schematic representation of the proposed ensemble model [21]
2.4 Network architecture of 3MT [19]
2.5 Details of the CNN transformer for encoding 3D images [19]
3.1 Diagram comparing a regular ensemble model (c) with our proposed plane-wise ensemble technique (f)
3.2 Midplane Projections: Axial Plane
3.3 Average Projections: Axial Plane
3.4 Max Variance Projections: Axial Plane
3.5 Variance Weighted Average Projections: Axial Plane
4.1 Skull-stripping: the process of removing non-brain tissues from MRI images using semantic segmentation

List of Tables

2.1 Summary of Representative Works on Alzheimer's Disease Classification
4.1 Dataset Distribution
4.2 Data Split Distribution
4.3 Performance Analysis for AlexNet
4.4 Performance Analysis for VGG-16
4.5 Performance Analysis for ResNet-50
4.6 Performance Analysis for DenseNet-169
4.7 Performance Analysis for Vision Transformer B16
4.8 Performance Analysis for Traditional Ensemble Model (ResNet-50, AlexNet, VGG-16)
4.9 Performance Analysis for Triple ResNet (using Midplane Projector)
4.10 Performance Analysis for Triple DenseNet (using Midplane Projector)
4.11 Performance Analysis for Triple ResNet (using Average Projector)
4.12 Performance Analysis for Triple DenseNet (using Average Projector)
4.13 Performance Analysis for Triple ResNet (using Max Variance Projector)
4.14 Performance Analysis for Triple DenseNet (using Max Variance Projector)
4.15 Performance Analysis for Triple ResNet (using Variance Weighted Average Projector)
4.16 Performance Analysis for Triple DenseNet (using Variance Weighted Average Projector)
4.17 Performance Analysis for Triple ResNet (using LL Projector)
4.18 Performance Analysis for Triple DenseNet (using LL Projector)
4.19 Model Performance Comparison
4.20 Confidence Intervals for Model Accuracy
4.21 Comparison with State-of-the-art Models

Abstract

Alzheimer's disease (AD) is a chronic neurodegenerative condition that progressively damages brain cells, resulting in memory and cognitive decline and eventually impeding basic functionalities. With over 55 million people worldwide affected by dementia, a number anticipated to rise significantly, the urgency for early diagnosis becomes paramount. While a definitive cure remains elusive, early intervention is crucial in mitigating disease progression and enhancing patient outcomes. This research investigates the potential of deep learning models for classifying Alzheimer's disease, emphasizing the challenges in Mild Cognitive Impairment (MCI) classification, and introduces a CNN-based pipeline utilizing a plane-wise ensemble technique for 3D MRI image classification. To manage the complex nature of 3D MRI data, the pipeline decomposes each 3D image into axial, coronal, and sagittal planes and ensembles 2D CNN models trained on these planes, incorporating multi-view information to improve classification accuracy. The methodology also leverages projector functions to map the 3D volumes into a series of 2D images, tackling the computational challenges presented by 3D data and resulting in a more efficient and practical process even with constrained computational resources.

Chapter 1

Introduction

Alzheimer's disease (AD) is a pervasive and devastating neurodegenerative disorder characterized by the gradual deterioration of brain tissues, leading to a decline in memory, cognitive functions, and ultimately, a loss of fundamental abilities. As the global population ages, the prevalence of dementia, including AD, continues to escalate, with over 55 million individuals affected worldwide, a figure expected to rise to 78 million by 2030 and 139 million by 2050 [28].

Traditionally, biomarkers focusing on critical brain regions such as the hippocampus, parietal lobe, and amygdala have been fundamental in identifying atrophy indicative of AD [4]. Recent studies, however, have showcased the high potential of deep learning models in the accurate classification of AD, prompting a paradigm shift in diagnostic methodologies [1].

Before delving into our core research, it is essential to comprehend the intricacies of the domain. Experts categorize potential AD patients into three classes, a departure from the conventional binary classification of AD and normal cognitive function (CN).
The introduction of Mild Cognitive Impairment (MCI) as a category allows for early detection, recognizing that most MCI patients progress to AD within 3 to 6 years [23]. MCI poses a unique challenge, as early-stage MCI resembles CN, while late-stage MCI bears similarity to AD.

Examining the modalities instrumental in diagnosing AD reveals a shifting landscape. Outdated methods like CT scans and EEG have given way to more sophisticated approaches. Magnetic Resonance Imaging (MRI), offering detailed structural images, has become the primary modality due to its widespread availability. Positron Emission Tomography (PET), while providing functional insights, is invasive, limiting its use. Diffusion Tensor Imaging (DTI), though yielding high-quality data, faces challenges in terms of accessibility. Notably, in Alzheimer's classification, the integration of non-imaging data such as age, gender, MMSE scores, and genetic information yields nuanced insights into disease dynamics and further enhances accuracy. Demographic factors such as age and gender capture population-level variation, while MMSE scores provide standardized cognitive assessments. Genetic data contributes to understanding hereditary patterns, collectively enhancing the precision of classification models for a comprehensive analysis of Alzheimer's disease [39].

In recent years, research centers have aggregated substantial medical and imaging data and shared it publicly for the benefit of researchers engaged in Artificial Intelligence (AI) development for Alzheimer's disease. These online datasets provide crucial biomarker information, including neuroimaging modalities, genetic data, and clinical and cognitive assessments. The most prominent datasets are:

• Alzheimer's Disease Neuroimaging Initiative (ADNI) [15]: ADNI, a longitudinal and multicenter study, serves as a prominent dataset. It includes ADNI-1, ADNI-GO, ADNI-2, and ADNI-3. ADNI aims to assess the progression of Mild Cognitive Impairment (MCI) and early AD, utilizing Magnetic Resonance Imaging (MRI), Positron Emission Tomography (PET), biological markers, and clinical and neuropsychological assessments. ADNI datasets encompass various data types, such as clinical information, genetic data, MRI and PET images, and biospecimens.

• Australian Imaging, Biomarker & Lifestyle Flagship Study of Aging (AIBL) [10]: AIBL compiles imaging and medical data from individuals with AD, those with MCI, and cognitively healthy individuals.

• Open Access Series of Imaging Studies (OASIS) [22]: OASIS, designed to share neuroimaging brain datasets, includes OASIS-1 with 434 MRI scans, OASIS-2 with 373 MRI scans, and OASIS-3 with 2,168 MRIs and 1,608 PET scans.

• National Alzheimer's Coordinating Center (NACC) [5]: NACC, established as a cornerstone of Alzheimer's research, not only provides essential data but also serves as a nexus for standardizing and harmonizing diverse datasets. Its comprehensive collection includes clinical, genetic, and neuroimaging data, fostering a holistic approach to Alzheimer's research.

Recent advancements in Alzheimer's disease classification have demonstrated the effectiveness of deep learning models. Building upon that, this thesis presents a Convolutional Neural Network (CNN)-based pipeline for classifying Alzheimer's disease from 3D MRI images.
The pipeline employs a plane-wise ensemble technique, introducing a strategic approach that capitalizes on specialized models for distinct imaging planes. This technique aims to harness anatomical information during both model training and evaluation, yielding notable improvements in the classification accuracy of 3D MRI images.

1.1 Motivation and Scope

Early and accurate diagnosis of AD is crucial for effective patient management and the development of therapeutic strategies. Magnetic Resonance Imaging (MRI) has emerged as a powerful non-invasive tool for diagnosing AD due to its high-resolution imaging capabilities. However, manual analysis of 3D MRI images is time-consuming, subjective, and prone to errors, necessitating the development of automated and reliable diagnostic methods [17].

Recent advancements in deep learning, particularly Convolutional Neural Networks (CNNs), have shown remarkable success in image classification tasks [24]. CNNs have the potential to revolutionize the field of medical imaging by automating the analysis process, reducing human error, and enhancing diagnostic accuracy. Despite these advancements, the complexity of 3D MRI data presents unique challenges, including high dimensionality and computational intensity, which often limit the performance and applicability of standard CNN models.

To address these challenges, the motivation for this thesis is to explore a novel CNN-based pipeline that leverages a plane-wise ensemble technique for classifying Alzheimer's disease from 3D MRI images. By decomposing the 3D MRI data into 2D planes and employing an ensemble of CNN models trained on these planes, the proposed method aims to improve classification accuracy while managing computational complexity. This approach not only capitalizes on the strengths of 2D CNNs but also integrates the multi-view information inherent in 3D data, offering a promising solution for robust AD diagnosis. The significance of this research lies in its potential to enhance the early detection of Alzheimer's disease, thereby enabling timely intervention and improved patient outcomes. Furthermore, by optimizing the computational efficiency of the diagnostic process, this approach can facilitate broader clinical adoption and integration into existing diagnostic workflows.

Recent research has demonstrated the efficacy of CNNs in medical imaging, particularly in the classification of neurological conditions from MRI data. Several studies [20], [21], [29], [38] have shown that deep learning models can achieve high accuracy in distinguishing between AD and healthy controls. Additionally, work by Payan and Montana has explored 3D CNNs for AD diagnosis [27], highlighting the potential of deep learning in this domain. However, these studies often grapple with the computational demands and complexity associated with processing 3D MRI data.

Despite the promising results, existing research is limited by several factors. The high dimensionality of 3D MRI data leads to significant computational requirements, making the training and deployment of 3D CNN models resource-intensive. Moreover, many current approaches overlook the potential benefits of integrating multi-view information from different planes of 3D MRI data, which can enhance diagnostic accuracy.

This thesis aims to bridge these gaps by developing, implementing, and evaluating a CNN-based pipeline designed for the classification of Alzheimer's disease using 3D MRI images.
The key aspects covered include:

1. Data Preprocessing: Detailed examination of preprocessing techniques to standardize 3D MRI data, including normalization, resizing, and slice extraction, ensuring compatibility with the proposed CNN models.

2. Model Architecture: Design and implementation of CNN architectures tailored for 2D plane classification, followed by an ensemble approach that integrates predictions from multiple planes (axial, coronal, and sagittal).

3. Training and Validation: Strategies for effectively training the CNN models, including data augmentation, cross-validation, and optimization of hyperparameters to enhance model performance and generalizability.

4. Performance Evaluation: Comprehensive evaluation of the pipeline using established metrics such as accuracy, sensitivity, specificity, and area under the receiver operating characteristic curve (AUC-ROC), with comparisons against existing methods to highlight improvements and potential benefits.

By addressing these components, the thesis aims to contribute to the field of medical imaging and Alzheimer's disease diagnosis, providing a scalable and efficient solution for early detection and improving the overall quality of patient care. The proposed plane-wise ensemble technique not only enhances diagnostic accuracy but also significantly reduces the computational burden, making the model feasible for clinical application even with limited computational resources. This innovation has the potential to make advanced diagnostic tools accessible to a wider range of healthcare settings, thereby improving patient outcomes on a global scale.

1.2 Problem Statement

Develop a model which, given a 3D MRI image, can classify it into one of the following classes:

• Alzheimer's Disease (AD)
• Cognitive Normal (CN)
• Mild Cognitive Impairment (MCI)

Each 3D MRI image presents a complex and high-dimensional input that requires advanced techniques for accurate classification. Traditional manual analysis methods are labor-intensive and susceptible to subjective bias and errors. Therefore, there is a pressing need for automated, reliable, and efficient diagnostic tools to support the early detection of, and differentiation between, these categories.

The goal is to provide a robust and scalable tool for aiding the diagnosis and monitoring of Alzheimer's disease and related conditions, ultimately improving patient outcomes and supporting clinical decision-making.

1.3 Research Challenges

The development of a CNN-based pipeline for classifying Alzheimer's disease from 3D MRI images involves several significant challenges. These challenges span computational costs, variability in brain morphology, and the complexity of multi-class classification. Addressing them is crucial for the successful implementation and accuracy of the proposed diagnostic tool.

1.3.1 Variability in Brain Morphology

Detecting brain atrophies associated with Alzheimer's disease is complicated by the natural variability in brain structure. Factors contributing to this variability include:

• Gender Differences: Male and female brains exhibit structural differences, which can affect the model's ability to generalize across genders.

• Age-related Changes: The brain undergoes significant changes over a person's lifespan, adding another layer of complexity to the classification task. Age-related atrophies might be mistaken for disease-specific patterns.
• Demographic and Genetic Diversity: Variations in brain structure can also be attributed to demographic factors (e.g., ethnicity, lifestyle) and genetic predispositions, complicating the detection of AD-related changes.

Developing a model that can account for these variabilities requires a robust training dataset that adequately represents these diverse factors. It also necessitates sophisticated preprocessing and augmentation techniques to ensure the model is exposed to a wide range of anatomical variations.

1.3.2 Multi-Class Classification Complexity

Classifying 3D MRI images into three categories (AD, CN, MCI) introduces additional complexity compared to binary classification. Specific challenges include:

• Interclass Similarity: The Mild Cognitive Impairment (MCI) class often exhibits characteristics that overlap with both the Alzheimer's Disease (AD) and Cognitive Normal (CN) classes, making it difficult for the model to distinguish between these states accurately.

• Imbalanced Data: The prevalence of AD, CN, and MCI in available datasets may not be evenly distributed, potentially leading to biased model performance if not adequately addressed.

• Diagnostic Ambiguity: The progression from CN to MCI to AD is a continuum, and the boundaries between these classes are not always clear-cut. This ambiguity can lead to misclassification and reduced model accuracy.

1.3.3 Computational Cost

One of the primary challenges in developing a deep learning model for 3D MRI image classification is the high computational cost associated with training and inference. Key issues include:

• High Dimensionality of Data: 3D MRI images contain a vast amount of data, leading to high memory and processing requirements. The "curse of dimensionality" exacerbates this issue: as the dimensionality of the input grows, so do the model's parameter count and the amount of training data needed to avoid overfitting, making learning more difficult.

• Model Complexity: Standard 3D CNN models are computationally expensive due to the large number of parameters and layers required to capture spatial features in three dimensions. High-dimensional data also complicates optimization, making it more susceptible to local minima and increasing the difficulty of finding a global optimum. Additionally, outlier analysis becomes less meaningful in high-dimensional spaces.

• Hardware Limitations: Many research institutions and healthcare facilities may not have access to high-performance computing resources, limiting the feasibility of training complex 3D CNN models. The significant memory and processing power required to handle 3D MRI data can strain available hardware, leading to slower training times and reduced model performance.

To mitigate these issues, the proposed plane-wise ensemble technique decomposes 3D MRI data into 2D planes, significantly reducing computational demands and making the training process more efficient. This approach reduces the number of parameters, simplifies optimization, and lowers the data requirements, making the model more feasible for training on available hardware. Additionally, by focusing on 2D planes, the method can leverage the strengths of 2D CNNs, which are far less computationally intensive than their 3D counterparts.
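To make the parameter argument concrete, the short sketch below (PyTorch, with hypothetical layer sizes chosen purely for illustration, not taken from the proposed models) counts the weights of a single 3D convolutional layer against a 2D layer of the same kernel size, the kind of per-plane layer the pipeline relies on:

```python
import torch.nn as nn

# Hypothetical layer sizes, chosen only to illustrate the scaling argument.
conv3d = nn.Conv3d(in_channels=1, out_channels=64, kernel_size=5)  # 64*1*5*5*5 + 64 = 8,064
conv2d = nn.Conv2d(in_channels=1, out_channels=64, kernel_size=5)  # 64*1*5*5   + 64 = 1,664

def count(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

print(f"3D conv layer: {count(conv3d):,} parameters")
print(f"2D conv layer: {count(conv2d):,} parameters")
# Even three independent 2D layers (one per anatomical plane) hold fewer
# parameters than the single 3D layer, and the gap widens as kernel size
# and network depth grow.
```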
To address these challenges, the research will focus on:

• Balanced Sampling: Techniques to ensure that the dataset is balanced by sampling from each class multiple times, ensuring statistical significance and robustness of the method. Strong measures are taken to prevent data leakage, maintaining the integrity of the training process.

• Ensemble Techniques: Using multiple CNNs trained on different 2D planes to capture more nuanced features and improve overall classification robustness. This approach leverages the strengths of 2D CNNs while integrating multi-view information from 3D data.

• Enhanced Model Training: Employing strategies such as cross-validation and hyperparameter tuning to optimize model performance and mitigate the effects of interclass similarity. These techniques help in fine-tuning the model parameters and ensuring that the model generalizes well across different subsets of the data.

By tackling these computational, morphological, and classification challenges, this research aims to develop a more accurate, efficient, and generalizable CNN-based diagnostic tool for Alzheimer's disease and related conditions.

1.4 Contribution

This thesis presents several key contributions to the field of medical imaging and Alzheimer's disease diagnosis:

• Introduced a novel plane-wise ensemble technique that addresses the computational challenges of 3D MRI data by decomposing it into 2D planes (axial, coronal, and sagittal) and employing an ensemble of CNN models trained on these planes. This method reduces computational complexity while leveraging the strengths of 2D CNNs, enhancing classification accuracy and efficiency.

• Developed projector functions that map 3D MRI images into 2D inputs for the CNN models. These functions transform high-dimensional 3D data into a manageable format for 2D CNNs, maintaining essential structural information and simplifying data processing.

• Benchmarked the performance of different CNN models, systematically evaluating and comparing them. This benchmarking provides insights into the most effective models and plane orientations for the classification task, ensuring the use of the most accurate and reliable configurations for Alzheimer's disease diagnosis.

1.5 Organization

The thesis is structured to provide a comprehensive exploration of the Plane-wise Ensemble Technique for classifying Alzheimer's disease from 3D MRI images. The organization ensures a logical flow of information, guiding the reader from the introduction to the conclusion while aligning with the research objectives.

Chapter 2 conducts a thorough review of existing literature on Alzheimer's disease diagnosis and medical image classification techniques. It discusses the limitations of traditional methods and highlights the potential benefits of utilizing CNN-based approaches, laying the theoretical groundwork for the Plane-wise Ensemble Technique and its relevance in the field of medical imaging.

Chapter 3 presents an overview of the Plane-wise Ensemble Approach, detailing the decomposition of 3D MRI data into axial, coronal, and sagittal planes. It explains the rationale behind leveraging anatomical information from different imaging planes to enhance classification accuracy and introduces the workflow and integration of specialized models within the ensemble framework.

Chapter 4 details the implementation of the CNN-based pipeline for classifying Alzheimer's disease using the Plane-wise Ensemble Technique.
It discusses the experimental setup, including dataset selection, model training, and evaluation metrics, and analyzes the results to showcase the advancements in classification accuracy achieved through the proposed approach.

Chapter 5 summarizes the key findings of the research, emphasizing the contributions of the Plane-wise Ensemble Technique. It discusses the implications of the findings in relation to the research objectives and existing knowledge in the field, acknowledges study limitations, and suggests future research directions to build upon the current work, providing a comprehensive wrap-up of the thesis.

Chapter 2

Related Works

2.1 Conventional Approaches

Earlier methods primarily focused on binary classification and traditional image processing techniques, which laid the groundwork for more sophisticated models.

2.1.1 Binary Classification Using Machine Learning Techniques

One of the initial strategies for Alzheimer's disease diagnosis involved binary classification, distinguishing between Alzheimer's disease (AD) and Cognitive Normal (CN) individuals. Machine learning techniques such as Support Vector Machines (SVM), Linear Discriminant Analysis (LDA), and Principal Component Analysis (PCA) [2], [20], [41] were commonly employed in these efforts. SVMs, known for their effectiveness in high-dimensional spaces, were utilized to create hyperplanes that best separated the AD and CN classes based on features extracted from neuroimaging data. LDA, on the other hand, aimed to find the linear combinations of features that best separated the two classes. PCA was often used to reduce the dimensionality of the data, retaining the most informative components for subsequent classification tasks. These machine learning approaches provided a foundation for early diagnostic models, achieving reasonable accuracy and helping to highlight the potential of computational methods in AD diagnosis.

2.1.2 Image Processing Methods

Traditional image processing methods were also pivotal in the early stages of Alzheimer's disease research. Techniques such as thresholding, edge detection, and region-based methods were employed to analyze neuroimaging data, particularly MRI and CT scans [1]. Thresholding involved setting intensity thresholds to segment brain images, highlighting regions of interest such as hippocampal atrophy, which is commonly associated with AD. Edge-based techniques focused on detecting boundaries and contours within the images, facilitating the identification of structural changes in the brain. Region-based methods aimed to segment images into meaningful regions, often using criteria such as intensity homogeneity or anatomical knowledge to delineate areas affected by the disease. These image processing methods were instrumental in extracting relevant features from neuroimaging data, which were subsequently used in classification models.

2.1.3 Limitations

While these earlier approaches made significant contributions to the field, they also had limitations. Machine learning techniques like SVM and LDA required extensive feature engineering and were often sensitive to the quality and quantity of input features. Additionally, traditional image processing methods were sometimes limited by their reliance on manual parameter tuning [2] and their susceptibility to noise and artifacts in the imaging data.
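As a concrete illustration of the conventional workflow of Section 2.1.1, the sketch below (scikit-learn, with randomly generated stand-in data; in practice the feature vectors would be hand-engineered measurements such as regional volumes or intensity statistics) chains dimensionality reduction and a linear classifier:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# X: hand-engineered features per scan (stand-in random data here)
# y: binary labels (0 = CN, 1 = AD)
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 500)), rng.integers(0, 2, size=200)

# PCA keeps the most informative components; the linear SVM then looks for
# a separating hyperplane between the AD and CN classes.
model = make_pipeline(StandardScaler(), PCA(n_components=50), SVC(kernel="linear"))
model.fit(X[:150], y[:150])
print("held-out accuracy:", model.score(X[150:], y[150:]))
```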
Despite these challenges, the foundational work of binary classification and image processing paved the way for the development of more advanced models, including deep learning and multimodal approaches, which have since demonstrated superior performance in AD diagnosis.

2.2 Deep Learning Based Approaches

The landscape of Alzheimer's disease (AD) diagnosis has evolved significantly with the advent of advanced computational techniques, particularly in the realm of deep learning and multimodal data integration. Recent trends emphasize the use of sophisticated neural networks, leveraging the power of convolutional neural networks (CNNs), transformer-based models, and hybrid approaches to improve diagnostic accuracy and robustness.

2.2.1 Dominance of CNN Models

One of the most prominent trends in recent research is the application of deep learning methods, especially convolutional neural networks (CNNs) [26], to AD diagnosis. Unlike traditional machine learning techniques that require manual feature extraction, CNNs can automatically learn hierarchical features from raw imaging data. These models have been applied extensively to MRI and PET scans, providing state-of-the-art performance in distinguishing between AD, MCI, and CN individuals. For example, models like DenseNet [14] and ResNet [12] have been fine-tuned for the task, demonstrating substantial improvements in classification accuracy and computational efficiency.

2.2.2 Architecture Overview

2D CNN Based Architectures

The use of 2D Convolutional Neural Networks (CNNs) for classifying Alzheimer's Disease (AD) has been a prominent area of research. These methods typically involve extracting 2D slices from 3D MRI scans and using them as input for pre-trained or custom-designed CNN architectures.

Savaş [33] utilized pre-trained deep learning models, specifically VGG-16 and ResNet-50, for classifying AD stages through transfer learning. By fine-tuning these models on 2D MRI slices, the study leveraged learned features from the ImageNet dataset to enhance classification accuracy while reducing training time. Preprocessing ensured consistency and compatibility with the CNN input requirements.

• VGG-16: A deep CNN with 16 layers, pre-trained on ImageNet, served as a feature extractor. The final layers were replaced for AD classification.

• ResNet-50: A 50-layer residual network, also pre-trained on ImageNet, was fine-tuned similarly to VGG-16.

While effective, this approach might miss important 3D spatial information inherent in MRI scans. The fine-tuned models depend heavily on the specific dataset, potentially limiting their generalizability. Additionally, fine-tuning large pre-trained models still requires significant computational resources.

A CNN model built from scratch [31] was developed to automate AD detection using 2D MRI slices. The model involved multiple convolutional layers for feature extraction, followed by ReLU activation functions and max-pooling layers to downsample feature maps and reduce computational complexity. Fully connected layers processed the extracted features for classification (Fig. 2.1). This model, however, requires a large amount of labeled data, which might not always be available. It also risks missing crucial 3D context and may overfit with limited datasets.

Fig. 2.1: A simple CNN architecture that extracts features from the spatial information and then classifies them into different classes [33]

AlzheimerNet [34] is a deep learning model for classifying AD stages from MRI images.
AlzheimerNet incorporated convolutional blocks with batch normalization and ReLU activations, along with residual connections to mitigate vanishing gradients. Max-pooling layers downsampled feature maps, and fully connected layers enabled higher-level feature processing for classification.

Fig. 2.2: The architecture of AlzheimerNet, a fine-tuned InceptionV3 [34]

AlzheimerNet's complex architecture, while powerful, may be computationally intensive for some clinical settings. The model's performance is dependent on the quality and diversity of training data, and like other 2D CNN models, it may overlook important 3D spatial relationships critical for accurate AD diagnosis and staging.

Another study presents a deep-ensemble method [21] combining multiple convolutional neural network (CNN) architectures for robust and accurate classification of Alzheimer's Disease (AD) using MRI and fMRI data. The datasets include MRI and fMRI images of patients with varying degrees of dementia, including healthy controls, very mild AD, mild AD, and moderate AD. The CNN architectures selected, based on their performance in previous AD research and their size/precision ratios, include AlexNet, Inception-ResNet-v2, ResNet-50, ResNet-101, and GoogLeNet. Transfer learning was employed using CNNs pre-trained on the ImageNet dataset, which were fine-tuned on the AD datasets by retraining the networks' final layers while keeping the earlier layers frozen. Features were extracted from the penultimate fully connected layer (FC7) of AlexNet and the last layer of the ResNet and Inception architectures, providing a refined representation of the input images. An ensemble learning approach combined the predictions of the three best-performing networks (AlexNet, ResNet-101, and Inception-ResNet-v2) using a bagged trees model with an averaging strategy, enhancing the robustness and accuracy of classification. To prevent overfitting and improve model generalization, data augmentation techniques such as random rotation between -35° and 35°, random scaling in the x and y directions, and grey-scale preprocessing were applied to the training and validation sets.

Fig. 2.3: Schematic representation of the proposed ensemble model [21]

A robust classification model leveraged transfer learning with DenseNet and was integrated within an embedded healthcare decision support system (DSS) [32]. Preprocessing steps included skull stripping, intensity normalization, and registration to a common reference space. DenseNet-121, pre-trained on the ImageNet dataset, was chosen for its dense connectivity patterns, which enhance feature reuse and mitigate the vanishing gradient problem. The pre-trained DenseNet was fine-tuned on the ADNI dataset by freezing initial layers, adding custom fully connected layers, and incorporating dropout regularization to prevent overfitting. The training process involved data augmentation techniques such as rotation, flipping, and scaling to improve model generalization, with optimal batch size and number of epochs determined experimentally. Despite high performance, limitations included dependency on the quality and availability of labeled MRI data, the substantial computational resources required by the DenseNet architecture, and the need for further validation to ensure generalization to different populations and imaging protocols beyond the ADNI dataset.
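A minimal sketch of the transfer-learning recipe these 2D studies share (load an ImageNet-pretrained backbone, freeze it, and retrain only a replaced classification head) is given below, using PyTorch and recent torchvision; the choice of ResNet-50 and a three-class head are illustrative assumptions rather than the exact configurations of the cited works:

```python
import torch.nn as nn
from torchvision import models

# Load a ResNet-50 pre-trained on ImageNet and freeze its convolutional backbone.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
for param in backbone.parameters():
    param.requires_grad = False

# Replace the final fully connected layer with a new head for AD / MCI / CN.
backbone.fc = nn.Linear(backbone.fc.in_features, 3)

# Only the new head is updated during fine-tuning.
trainable = [p for p in backbone.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable), "trainable parameters")
```

Variants of this recipe differ mainly in which layers are unfrozen and how the extracted features are combined afterwards, for example by feeding them to a bagged-trees ensemble as in the study above.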
3D CNN Based Architectures

3D Convolutional Neural Networks (CNNs) are well-suited for medical imaging tasks involving volumetric data, such as MRI scans. These models process the entire 3D volume, capturing spatial relationships that might be missed by 2D approaches.

A cascaded multi-modal mixing transformer (CM3T) [19] framework was developed for AD classification using incomplete data. The approach integrated 3D CNNs with transformers to handle the spatial complexity of MRI images. The CM3T model processed multi-modal neuroimaging data (structural MRI, fMRI, and PET scans) using dedicated transformer encoders for each modality. These encoders captured modality-specific representations and were fused using a multi-modal mixing module with cross-attention mechanisms (Fig. 2.4). To handle incomplete data, the CM3T model employed a cascaded fusion strategy, progressively combining available modalities. The final multi-modal representation was fed into a classification head for AD stage prediction.

Fig. 2.4: Network architecture of 3MT [19]

Fig. 2.5: Details of the CNN transformer for encoding 3D images [19]

Helaly et al. [13] developed a 3D CNN for AD diagnosis using MRI volumes. Their model captured comprehensive anatomical features by processing the entire 3D brain volume, detecting subtle structural changes indicative of early-stage AD. The architecture consisted of multiple convolutional and pooling layers, followed by fully connected layers for classification. One of the primary strengths of this study lies in its high classification accuracy, particularly in detecting early-stage AD. By utilizing the entire 3D brain volume, the model demonstrated an exceptional ability to capture detailed spatial information, potentially identifying subtle biomarkers that might be missed in 2D slice-based approaches. This comprehensive analysis of brain structure represents a significant advancement in the field of automated AD diagnosis. However, the study also faced several challenges. The 3D CNN approach requires substantial amounts of labeled 3D MRI data for effective training, which can be difficult and expensive to acquire in large quantities. Moreover, processing entire 3D volumes is computationally intensive, demanding significant processing power and memory resources. This could potentially limit the model's applicability in resource-constrained clinical settings. Another limitation noted was the model's propensity for overfitting, especially when trained on smaller datasets. This highlights the critical need for large, diverse datasets in developing robust 3D CNN models for AD diagnosis.

An interpretable deep learning framework [29] using 3D CNNs and MRI data was proposed for AD classification. The model incorporated attention mechanisms to highlight critical brain regions contributing to classification decisions, enhancing interpretability and transparency. The 3D CNN processed full MRI volumes, capturing subtle patterns associated with AD. A significant strength of this study lies in its ability to achieve high classification accuracy while simultaneously providing interpretable results. The attention maps generated by the model offer valuable insights into the neural correlates of AD, potentially aiding clinicians in understanding the structural changes associated with disease progression. This interpretability is crucial for building trust in AI-based diagnostic tools and could facilitate their integration into clinical workflows. However, the study also faced several challenges.
Like many deep learning approaches in medical imaging, the model requires high-quality, diverse training data to perform optimally and generalize well to unseen cases. Ac- quiring such comprehensive datasets in themedical field remains a significant hurdle. Moreover, while the attention mechanisms enhance interpretability, the overall com- plexity of the model can still pose challenges in clinical settings. The intricate nature of deep learning models, even with interpretability features, may make it difficult for non-specialists to fully understand and trust the model’s decisions. This highlights the ongoing challenge of balancing model sophistication with practical clinical appli- cability. Utilizing the 3D CNNs Ebrahimi et al. [9] leveraged the rich spatial information in MRI volumes for AD detection. Their model employedmultiple layers of 3D convolu- tions, followed by pooling and fully connected layers, to capture and process detailed anatomical features. A key strength of this study lies in its superior performance in AD detection compared to traditional 2D approaches. By leveraging the full 3D struc- ture of the brain, the model demonstrated an enhanced ability to identify complex 17 spatial patterns associated with AD. This comprehensive volumetric analysis not only improved overall detection accuracy but also showed promise in enabling earlier and more precise AD diagnosis, a crucial factor in effective treatment and management of the disease. However, the study also faced significant challenges. The use of 3D CNNs necessitates extensive 3DMRI data for effective training, which can be difficult and costly to acquire in large quantities. Moreover, processing entire 3D volumes is computationally intensive, requiring substantial computational resources. This could potentially limit the model’s applicability in resource-constrained clinical settings or research environments without access to high-performance computing facilities. An- other critical consideration is the model’s heavy dependence on the quality of the training data. The effectiveness of the 3D CNN in detecting AD is intrinsically linked to the comprehensiveness and accuracy of the MRI scans used for training. This un- derscores the importance of high-quality, diverse datasets in developing robust and generalizable models for AD detection. Transformer Based Architectures Recent advancements in deep learning have seen the emergence of transformer-based models, which have shown remarkable performance in various domains, including medical image analysis [35]. These models leverage the self-attention mechanism, which allows them to capture complex dependencies and interactionswithin the data, making them particularly well-suited for tasks that require detailed spatial and con- textual understanding, such as Alzheimer’s disease (AD) classification fromMRI im- ages. The M3T model(Multi-Plane Multi-Slice Transformer) [16] combines transformers’ self-attention capabilitieswith amulti-plane andmulti-slice representation of 3DMRI volumes. This approach facilitates capturing complex spatial dependencies within the data. The model slices the 3D MRI volume into multiple 2D planes, which are further divided into smaller 2D slices treated as sequential inputs to the transformer model. The self-attention mechanism integrates information across multiple planes and slices, effectively reconstructing the 3D spatial context from the 2D represen- tations. 
A key strength of this study lies in its significant improvement in classi- fication accuracy over traditional CNN-based approaches. By leveraging the trans- former’s ability to model long-range dependencies, the M3T model demonstrates su- perior performance in capturing intricate spatial patterns and relationships within the brain structure. This enhanced capability is particularly crucial in AD classi- fication, where subtle structural changes can be indicative of disease progression. 18 However, the study also faces several challenges. The complex architecture of the M3T model demands high computational resources, potentially limiting its applica- tion in resource-constrained environments. Additionally, likemany deep learning ap- proaches, the model’s performance is heavily dependent on large, high-quality train- ing datasets, which can be challenging to acquire in the medical imaging domain. Another significant consideration is the model’s complexity, which can reduce its interpretability. While the model achieves high accuracy, the intricate nature of its decision-making process may be difficult for clinicians to understand and trust. This lack of transparency could potentially limit the model’s acceptance and integration into clinical workflows, where interpretability is often crucial for decision-making and patient communication. A novel approach combining pixel-level fusion techniques with vision transformers (ViTs) [25] was developed for early AD detection from MRI scans. ViTs effectively capture global relationships and dependencies within images [8]. Their method in- volves splitting high-resolution MRI images into smaller patches, embedding these into a sequence of tokens, and processing them with the transformer network. The pixel-level fusion technique ensures detailed and precise classification by integrating information from every part of the image. One of the primary strengths of this study lies in its superior performance in early AD detection. By leveraging ViTs, the model demonstrates an exceptional ability to capture long-range dependencies and global contextual information within brainMRI scans. This capability is particularly crucial in detecting subtle, early-stage indicators of AD that might be missed by traditional convolutional neural network (CNN) approaches, which typically focus on local fea- tures. However, the study also faces several significant challenges. The complex ar- chitecture of ViTs, combined with pixel-level fusion techniques, demands high com- putational power and memory resources. This requirement could potentially limit themodel’s applicability in resource-constrained clinical settings or research environ- ments without access to high-performance computing facilities. Another notable lim- itation is the long training times associated with such complex model architectures. This not only impacts the development and iterative improvement of the model but also poses challenges for its adaptation to new datasets or different medical imaging modalities. Furthermore, a critical consideration for clinical applications is the diffi- culty in interpreting the model’s decisions. While the model achieves high accuracy, the intricate nature of transformer architectures and pixel-level fusion makes it chal- lenging to provide clear, interpretable explanations for its classifications. 
This lack of transparency could potentially hinder the model’s acceptance and integration into clinical workflows, where interpretability is often crucial for decision-making and pa- 19 tient communication. The Addformermodel [18], which combinesmultiple transformermodules for multi- modal fusion of different MRI sequences (e.g., T1-weighted and T2-weighted images) for AD detection. Each MRI sequence is processed by a separate transformer mod- ule to capture modality-specific features, and the outputs are fused using additional transformer layers. This approach leverages the strengths of each MRI sequence and captures a comprehensive representation of the brain’s structural characteristics. A key strength of this study lies in its enhanced robustness and accuracy in AD classi- fication. By leveraging multiple MRI sequences, the Addformer model demonstrates an improved ability to detect subtle indicators of AD that might be more prominent in one modality than another. This multi-modal approach provides a more compre- hensive view of brain structure and potential AD-related changes, potentially leading to more accurate and reliable diagnoses. Furthermore, the effective combination of multi-modal data represents a significant advancement in the field. The Addformer’s architecture allows for the integration of complementary information from different MRI sequences, potentially capturing a wider range of AD biomarkers and structural changes associated with the disease progression. However, the study also faces sev- eral challenges. One notable limitation is the requirement for careful preprocessing and alignment of images from different modalities. This prerequisite adds complex- ity to the data preparation phase and may introduce potential sources of error if not handled meticulously. Another significant consideration is the increased computa- tional requirements for both training and inference. The complex architecture of the Addformer, processing multiple MRI sequences through separate transformer mod- ules, demands substantial computational resources. This could potentially limit the model’s applicability in resource-constrained clinical settings or research environ- ments without access to high-performance computing facilities. Moreover, the com- plexity of the Addformer model poses challenges in terms of interpretability. While the model achieves high accuracy, the intricate nature of its decision-making process, involving multiple transformer modules and fusion layers, may be difficult for clini- cians to understand and trust. This lack of transparency could potentially hinder the model’s acceptance and integration into clinical workflows, where interpretability is often crucial for decision-making and patient communication. 2.2.3 Multi-class Classification While binary classification has been a common approach in Alzheimer’s disease (AD) diagnosis, recent research has increasingly focused on multi-class classification to 20 better capture the spectrum of cognitive states associated with the disease. This ap- proach distinguishes not only between Alzheimer’s disease (AD) and cognitive nor- mal (CN) individuals but also includes intermediate stages such as mild cognitive impairment (MCI). Multi-class classification is crucial for developing comprehensive diagnostic tools that can provide more nuanced assessments and support early inter- vention strategies. 
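In implementation terms, the move from binary to three-way classification mainly changes the output layer and the loss. The sketch below (PyTorch, with hypothetical class counts chosen for illustration) shows a three-unit head combined with class-weighted cross-entropy, one common way to counter the uneven prevalence of AD, CN, and MCI noted in Section 1.3.2:

```python
import torch
import torch.nn as nn

# Hypothetical training-set class counts: [AD, CN, MCI]
counts = torch.tensor([300.0, 500.0, 200.0])
weights = counts.sum() / (len(counts) * counts)   # inverse-frequency weighting

# Any backbone producing a feature vector can feed this 3-way head.
head = nn.Linear(512, 3)
criterion = nn.CrossEntropyLoss(weight=weights)

features = torch.randn(8, 512)        # stand-in for CNN features of 8 scans
labels = torch.randint(0, 3, (8,))    # 0 = AD, 1 = CN, 2 = MCI
loss = criterion(head(features), labels)
loss.backward()
```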
2.2.4 Utilization of Single andMulti-modal Approaches The diagnosis and classification of Alzheimer’s disease (AD) have significantly advanced with the development of both single and multi-modal approaches. Each method offers unique benefits and, when combined, they provide a comprehensive and robust diagnostic framework. Single-Modal Approaches Single-modal approaches focus on analyzing data from a single imagingmodality, typ- ically MRI or PET scans. These methods have the advantage of being simpler and less resource-intensive compared to multi-modal approaches, making them more acces- sible in many clinical settings [21], [32], [33]. MRI-Based Approaches • AlzheimerNet [34], a CNN-based model specifically designed to classify AD stages from functional brain changes observed in MRI images. The architec- ture utilized multiple convolutional layers to extract hierarchical features from 2D MRI slices, achieving notable accuracy in distinguishing between different stages of AD. • A CNNmodel from scratch was designed to automate AD detection using MRI images [31]. By focusing on extensive data preprocessing, normalization, and augmentation techniques, their model demonstrated high robustness and accu- racy in classification tasks, effectively handling the variability in MRI data. PET-Based Approach: A 3D CNN framework [29] was developed to classify AD stages using PET images. Their model emphasized the interpretability of predictions, allowing clinicians to understand the decision-making process, thereby increasing the trust and usability of the model in clinical practice. 21 Multi-modal Approaches Multi-modal approaches [30], [38] integrate information from multiple imaging modalities, such as MRI and PET, to leverage the complementary strengths of each. These methods provide a more comprehensive understanding of the brain’s structure and function, enhancing diagnostic accuracy and robustness. Fusion Techniques Combining data from MRI and PET scans, multi-modal approaches use advanced deep learning architectures to integrate and analyze this diverse information. • A proposed multimodal image fusion method [38] demonstrated superior per- formance in classifying Alzheimer’s Disease (AD), Mild Cognitive Impairment (MCI), and Normal Control (NC) compared to single-modality approaches by leveraging both structural information from MRI and functional insights from Positron Emission Tomography (PET). The study utilized the Alzheimer’s Dis- ease Neuroimaging Initiative (ADNI) dataset, including MRI and PET scans. Preprocessing involved skull stripping, intensity normalization, and registra- tion for MRI, and intensity normalization and spatial alignment with MRI for PET. The core methodology included aligning PET images withMRI, extracting features from both using a modified ResNet, and combining features through a weighted averaging fusion strategy. The CNN architecture incorporated a modified ResNet and specialized 3D CNN models, including a U-shaped net- work with skip connections to extract multi-scale features. Data augmentation techniques like rotation, flipping, and scaling were applied to enhance variabil- ity. The model was optimized using the Adam optimizer with a learning rate scheduler and categorical cross-entropy loss function, reflecting a comprehen- sive approach to training and optimization. 
Despite high performance on the ADNI dataset, limitations include dependency on MRI and PET image quality and availability, additional computational complexity, and generalization chal- lenges to other datasets and real-world scenarios. • Addformer [18], a transformer-based model that fuses information from differ- ent MRI sequences. This model utilized multiple transformer modules to in- tegrate data across various imaging modalities, enhancing robustness and ac- curacy in AD detection. The study demonstrated the potential of transformer- based models in leveraging multi-modal data for comprehensive AD classifica- tion. 22 Hybrid and Ensemble Methods Hybrid approaches combine different deep learning models to utilize their respective strengths, often leading to superior performance in AD diagnosis. • A cascaded multi-modal mixing transformer framework [19] that combines 3D CNNs with transformers. This hybrid method effectively handles the spatial complexity ofMRI images, achieving robust classification evenwith incomplete data. The integration of CNNs and transformers highlighted the potential of hybrid models in improving diagnostic performance. • A pixel-level fusion approach using vision transformers [25] for early AD detec- tion. By processingMRI images at the pixel level, their model achieved detailed and precise classification, underscoring the advantages of transformers in han- dling high-resolution medical images. The study’s multi-modal classification framework effectively distinguished between AD, MCI, and CN classes. Transfer Learning and Domain Adaptation Transfer learning, where models pre-trained on large datasets are fine-tuned for specific tasks, has also been instrumen- tal in multi-modal approaches. It allows the models to leverage learned features from extensive, general-purpose datasets, reducing the need for large amounts of domain- specific data. While multi-modal approaches offer significant advantages, they also present chal- lenges such as increased computational complexity, the need for synchronized multi- modal datasets, and the difficulty of integrating diverse data types. However, ongoing advancements in deep learning and computational power are addressing these chal- lenges, paving the way for more efficient and effective multi-modal diagnostic tools. In summary, the utilization of single and multi-modal approaches has greatly en- riched the field of AD diagnosis. Single-modal methods, particularly those based on MRI and PET, provide valuable insights into the brain’s structure and function. Multi- modal approaches, by integrating these insights, offer a more comprehensive and ac- curate diagnostic framework, ultimately enhancing the early detection and manage- ment of Alzheimer’s disease. 2.2.5 Preprocessing Techniques Preprocessing is crucial for the accuracy and efficiency of Convolutional Neural Net- work (CNN)-based models in MRI image analysis. Recent trends and popular tech- niques include: 23 • Intensity Normalization: Standardizes image intensities to aid feature learn- ing. Common methods include Z-score normalization (mean of zero, standard deviation of one) and Min-Max scaling (fixed range, typically [0, 1] or [-1, 1]). • Bias Field Correction: Corrects intensity inhomogeneities using algorithms like N4ITK to enhance image consistency [40]. • Skull Stripping: Removes non-brain tissues to focus on brain structures, im- proving classification accuracy. Common tools are Brain Extraction Tool (BET) from FSL [37], BrainSuite [36], and FreeSurfer [11]. 
• Spatial Normalization: Aligns MRI images to a common reference space, such as the MNI space, using affine or non-linear transformations to account for anatomical variability [6].
• Segmentation: Divides MRI images into tissue types (e.g., gray matter, white matter, cerebrospinal fluid) to isolate relevant brain structures. Tools include SPM [3] and FSL [37].
• Data Augmentation: Increases training dataset size and model robustness through random transformations such as rotation, translation, scaling, flipping, and adding Gaussian noise.
• Smoothing: Reduces noise and enhances the signal-to-noise ratio using techniques like Gaussian smoothing to better delineate brain structures.
• Patch Extraction: Divides 3D MRI volumes into smaller patches for input to the CNN, reducing computational load and focusing on local features.
• Histogram Equalization: Enhances contrast by redistributing intensity values, improving visibility of brain structures for better feature learning.
• Deep Learning-based Preprocessing: Utilizes autoencoders and Generative Adversarial Networks (GANs) for denoising, normalizing, and enhancing MRI images in a data-driven manner.
• Multimodal Image Fusion: Combines different MRI modalities (e.g., T1-weighted, T2-weighted) through alignment and integration to provide richer information for better classification performance.
• Domain Adaptation Techniques: Addresses variations between training and testing data using techniques like adversarial domain adaptation, improving generalization by making learned features invariant to differences in scanners or protocols.
2.3 Summary and Limitations of Existing Architectures
In order to provide a concise comparison and highlight the key contributions of the representative studies analyzed in detail, which are aligned with our proposed methodology, we have summarized their core methodologies, results, and comments in Table 2.1. This table offers a clear overview of the advancements and limitations observed in these studies, facilitating a better understanding of the current state of Alzheimer’s disease classification using deep learning approaches.
In summary, the reviewed literature underscores the effectiveness of advanced deep learning models, including 2D CNNs, 3D CNNs, and transformer-based classifiers, in Alzheimer’s disease detection. However, each of these architectures has inherent limitations:
• 2D CNNs: These models analyze individual 2D slices of MRI images, often failing to capture the interdependencies among slices. This slice-by-slice approach can lead to a loss of crucial 3D spatial information, which is vital for accurate AD diagnosis. Additionally, 2D CNNs can be computationally expensive when processing multiple slices separately.
• 3D CNNs: While 3D CNNs are capable of analyzing volumetric data, they suffer from the "Curse of Dimensionality." The high number of parameters in these models makes them prone to overfitting, especially with limited training data.
The non-convex nature of neural networks further complicates the learning process, reducing the chances of finding optimal parameters.
• Transformer-based Classifiers: These models leverage self-attention mechanisms to capture long-range dependencies in data. However, transformers require large amounts of data and computational resources for training, making them less suitable for datasets with limited samples. Additionally, they can be sensitive to the quality of input data and preprocessing techniques.
Table 2.1: Summary of Representative Works on Alzheimer’s Disease Classification
Loddo et al. [21]
Core Methodology: ensemble of multiple CNNs; pre-trained on ImageNet; simple average function for the ensemble model (averaging the top 3).
Results: binary-class accuracy: 99.29%; dataset: ADNI.
Comments: introduction of ensemble learning; significant improvement in accuracy.
Song et al. [38]
Core Methodology: multimodal image fusion (MRI + PET) to create "GM-PET" images; 3D CNN with a U-Net-like architecture.
Results: binary-class accuracy: 94.11%; multi-class accuracy: 74.54%; dataset: ADNI.
Comments: highlights the benefit of multimodal data; need for additional data for optimization.
Qiu et al. [30]
Core Methodology: hybrid approach (CNN + CatBoost); integration of imaging and non-imaging data; trained a CNN model on MRI data to compute cognitive scores.
Results: multi-class test accuracy: (87.9 ± 1.3)%; multiple independent datasets.
Comments: omitted CNN architecture details for the MRI model; extensive validation.
Saleh et al. [32]
Core Methodology: transfer learning with DenseNet; data augmentation.
Results: multi-class training accuracy: 96.05%; multi-class testing accuracy: 90.01%; dataset: Kaggle.
Comments: dataset seems to contain less challenging data than ADNI; indications of overfitting.
Chapter 3
Proposed Methodology
3.1 Architecture Overview
In this study, we employ an ensemble learning approach to enhance the classification performance of 3D MRI images for Alzheimer’s Disease. The ensemble is composed of three distinct models, each specifically trained to process one of the three primary anatomical planes: coronal, sagittal, and axial. This approach, which we refer to as ‘plane-wise ensemble’, is designed to leverage the unique structural information inherent in each imaging plane. By integrating the outputs from models trained on these different planes, the approach aims to provide a more comprehensive and accurate classification than any single model could achieve alone. Fig. 3.1 provides a visual representation of this ensemble learning framework, highlighting the workflow and integration of the different models.
The rationale behind this plane-wise ensemble approach lies in the fact that different anatomical planes can reveal complementary aspects of brain structure and pathology. The coronal plane captures the frontal and posterior regions, the sagittal plane provides a lateral view, and the axial plane offers a top-down perspective. Each plane emphasizes different anatomical features, which, when combined, enhance the overall diagnostic accuracy.
This method is particularly advantageous in the context of 3D MRI image analysis, where the complexity and variability of brain structures necessitate a robust and multifaceted approach. By utilizing an ensemble of models, each attuned to specific structural information, our approach mitigates the limitations of individual models and capitalizes on the strengths of each perspective.
Fig. 3.1: Diagram comparing a regular ensemble model (c) with our proposed plane-wise ensemble technique (f). Panels: (a) training (forward pass), (b) evaluation, (c) regular ensemble model; (d) training (forward pass), (e) evaluation, (f) plane-wise ensemble technique.
3.2 Model Architecture
The three specialized models are trained on distinct MRI slice orientations to enhance Alzheimer’s Disease classification. The Coronal Model focuses on coronal slices, capturing frontal to posterior brain structures such as the lateral ventricles and frontal lobes, which aids in identifying features unique to this plane. The Sagittal Model uses sagittal slices to highlight midline structures like the corpus callosum and brainstem, essential for detecting lateralized features and asymmetries. The Axial Model, trained on axial slices, captures horizontal structures including the cerebral cortex and basal ganglia, improving the detection of cortical thickness and basal ganglia configuration changes. Each model’s specialization in its respective plane enhances its ability to identify Alzheimer’s Disease-related patterns.
This segmentation of training data by plane is intended to enhance the models’ ability to understand and interpret slice-specific characteristics. By training models on specific imaging planes, each model can capture intricate details and features unique to its respective plane. This targeted approach minimizes internal confusion and maximizes the ability to identify subtle abnormalities, leading to an improvement in overall ensemble performance.
Each model’s outputs are then integrated to form an ensemble, capitalizing on the strengths of each plane-specific model. The coronal, sagittal, and axial models collectively contribute to a more comprehensive analysis, with each model providing insights from different anatomical perspectives. This ensemble approach ensures that the final classification leverages the diverse and complementary information obtained from all three planes, resulting in a more robust and accurate diagnosis.
3.3 Model Integration
In the traditional ensemble model, when a single slice is provided as input, the models must first determine the anatomical plane to which the slice belongs before proceeding with classification. This approach effectively transforms the original 3-class classification problem into a more complex 9-class classification problem, as the model must now account for three planes for each class.
Unlike these traditional ensembles that evaluate individual 2D slices, our plane-wise ensemble processes entire 3D volumes, thereby providing a more comprehensive context for accurate classification. During the evaluation phase of the ensemble, an entire 3D MRI volume is used as input. This approach leverages the full spatial context of the MRI data, capturing inter-slice relationships and volumetric features that are critical for accurate disease classification.
The 3D MRI volume is automatically sliced into the three primary anatomical planes: coronal, sagittal, and axial. Each set of slices corresponding to these planes is then fed into its respective specialized model within the ensemble, as illustrated in the sketch below. The coronal slices are processed by the coronal model, sagittal slices by the sagittal model, and axial slices by the axial model. This ensures that each model can apply its specialized knowledge to the appropriate set of slices, enhancing the detection of plane-specific features.
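To make the slicing and routing concrete, the sketch below shows one way this step could look in code. It is our illustration, not the thesis implementation: the function names, the axis-to-plane mapping, and the stub classifiers are assumptions, and a real model would return softmax probabilities from a trained CNN.

```python
import numpy as np

def plane_slices(volume):
    """Split a 3D MRI volume of shape (D, H, W) into stacks of 2D slices,
    one stack per anatomical plane. The axis-to-plane mapping is an
    assumption here; it depends on how the volume was loaded."""
    axial    = [volume[i, :, :] for i in range(volume.shape[0])]
    coronal  = [volume[:, j, :] for j in range(volume.shape[1])]
    sagittal = [volume[:, :, k] for k in range(volume.shape[2])]
    return {"coronal": coronal, "sagittal": sagittal, "axial": axial}

def predict_volume(volume, models):
    """Route each plane's slices to its specialized model and average the
    slice-level class probabilities within each plane. The per-plane outputs
    are later fused with the weighted average described in Section 3.3."""
    per_plane = {}
    for plane, slices in plane_slices(volume).items():
        probs = np.stack([models[plane](s) for s in slices])   # (n_slices, 3)
        per_plane[plane] = probs.mean(axis=0)                  # (3,) per plane
    return per_plane

# Stub classifiers standing in for the trained coronal/sagittal/axial CNNs.
def dummy_model(slice_2d):
    return np.full(3, 1.0 / 3.0)   # uniform AD/CN/MCI probabilities

models = {"coronal": dummy_model, "sagittal": dummy_model, "axial": dummy_model}
print(predict_volume(np.random.rand(64, 64, 64), models))
```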
Following the classification of the slices by their respective models, the ensemble integrates the predictions by calculating a weighted average of the probabilities obtained from each model. These weights are not arbitrarily assigned; rather, they are learned through a separate training phase designed to optimize the combination of model outputs. This weighted averaging allows the ensemble to balance the contributions of each model according to its performance and relevance to the final classification.
By combining the strengths of the coronal, sagittal, and axial models, the plane-wise ensemble provides an accurate classification of the 3D MRI volume. This integration ensures that the diverse and complementary information from each anatomical plane is utilized effectively, leading to a more robust and reliable diagnosis.
3.4 Projector Functions
Projector functions play a critical role in the proposed plane-wise ensemble technique by transforming high-dimensional 3D MRI images into 2D inputs suitable for Convolutional Neural Network (CNN) models.
A projector function is a mathematical function that maps three-dimensional (3D) volumetric data into a series of two-dimensional (2D) images. Formally, let 𝑉 ⊆ ℝ³ represent the domain of 3D volumetric data. A projector function 𝑃 can be defined as follows:

𝑃 : 𝑉 → (ℝ²)ⁿ    (3.1)

where (ℝ²)ⁿ denotes an 𝑛-tuple of elements in ℝ², representing an ordered sequence of 2D images. For any point 𝐯 ∈ 𝑉, the projector function 𝑃 maps 𝐯 to a sequence of 2D images (𝐢₁, 𝐢₂, …, 𝐢ₙ) such that 𝐢ⱼ ∈ ℝ² for 𝑗 = 1, 2, …, 𝑛.
In the context of MRI image classification, these functions are used to decompose a 3D MRI scan into three sets of 2D planes: axial, coronal, and sagittal. This decomposition allows for the application of 2D CNN models, which are less computationally intensive than their 3D counterparts.
Projector functions are impactful for several reasons:
• Reducing Computational Complexity: Handling 3D MRI data directly with 3D CNNs can be computationally prohibitive due to the high dimensionality of the data. By projecting the 3D images into 2D planes, the projector functions significantly reduce the computational complexity, making it feasible to train and deploy CNN models on standard hardware.
• Leveraging the Strengths of 2D CNNs: 2D CNNs are well-established and widely used in image classification tasks, benefiting from extensive research and optimization. Projector functions enable the use of these robust 2D CNN architectures by converting the 3D MRI data into a format that these models can process effectively.
• Maintaining Essential Structural Information: Despite reducing the data dimensionality, projector functions try to ensure that the essential structural information of the brain is retained. By extracting slices in three orthogonal planes, the functions capture comprehensive views of the brain’s anatomy, which is crucial for accurate disease diagnosis.
• Enhancing Classification Accuracy and Efficiency: The use of projector functions, in conjunction with an ensemble of CNN models trained on different planes, enhances the overall classification accuracy and efficiency. This approach allows the models to focus on specific aspects of the brain’s structure, leveraging the strengths of each plane orientation to improve diagnostic performance.
Projector functions are a pivotal component of the proposed plane-wise ensemble technique for MRI image classification.
By transforming 3D MRI data into 2D slices, they facilitate the use of efficient and accurate 2D CNN models, thereby addressing the computational challenges associated with 3D MRI data. This approach enhances the performance and reliability of Alzheimer’s disease diagnosis, contributing to the field of medical imaging.
3.4.1 Midplane Projector
This function selects the middle slice of the 3D volume.
A 3D MRI scan can be considered as a stack of 2D images (slices) layered on top of each other. The midplane projector picks the slice that is exactly in the middle of this stack. This slice is considered representative of the entire volume and often captures a central view of the brain’s structure.
Fig. 3.2: Midplane Projections: Axial Plane
3.4.2 Average Projector
This function calculates the average of all slices in the 3D volume.
The average projector combines information from all slices by taking the average value for each pixel across the stack of slices. The resulting 2D image represents an averaged view, where each pixel’s value is the mean of the corresponding pixels in the original 3D volume. This method aims to capture the overall intensity patterns in the brain.
Fig. 3.3: Average Projections: Axial Plane
3.4.3 Max Variance Projector
This function selects the slice with the highest variance.
Variance measures how much the pixel values differ from the mean value within a slice. The max variance projector identifies the slice where the pixel values vary the most, indicating a high level of structural detail and differences.
Fig. 3.4: Max Variance Projections: Axial Plane
3.4.4 Variance Weighted Average Projector
This function creates a weighted average of all slices, giving more importance to slices with higher variance.
Instead of treating all slices equally, the variance weighted average projector assigns greater weight to slices with higher variance, meaning those with more detail and differences. It then computes a weighted average, where each slice contributes to the final 2D image based on its variance. This approach aims to enhance the overall detail and information content in the projected image by emphasizing the most informative slices. Importantly, slices with very low variance, such as those that are fully black and contain no information, are assigned a weight of zero. This means they do not contribute to the final average, ensuring that only the slices with meaningful information are used in the projection.
Fig. 3.5: Variance Weighted Average Projections: Axial Plane
3.4.5 Linear Learnable (LL) Projector
This function learns the optimal weights to assign to each slice in the 3D volume through a training process and then creates a weighted average of all slices using those weights.
Instead of manually defining the weights based on variance or other criteria, the LL projector uses a machine learning model to determine how much weight each slice should contribute to the final 2D image.
The working of this model can be seen as similar to an encoder-decoder model. The LL projector acts as the encoder, learning to compress and represent the 3D volume into a 2D plane through learned weights. The classifier that uses the 2D projection is analogous to the decoder, interpreting the 2D representation to make predictions. During training, both the encoder and decoder update their weights based on the training samples and backpropagation. While evaluating novel images, the LL projector uses the learned weights to project the 3D volume into 2D images, which are then fed to the plane-wise models to make the prediction.
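The following minimal sketch shows how the five projectors described above could be realized for a single plane. It is our illustration rather than the thesis code: the axis convention and function names are assumptions, and the Linear Learnable projector is shown with externally supplied weights, whereas in the actual pipeline those weights are learned jointly with the classifier through backpropagation.

```python
import numpy as np

def midplane(volume, axis=0):
    """Midplane projector: take the central slice along the given axis."""
    return np.take(volume, volume.shape[axis] // 2, axis=axis)

def average(volume, axis=0):
    """Average projector: per-pixel mean of all slices along the axis."""
    return volume.mean(axis=axis)

def max_variance(volume, axis=0):
    """Max variance projector: the slice whose pixel values vary the most."""
    moved = np.moveaxis(volume, axis, 0)                    # (n_slices, H, W)
    variances = moved.reshape(moved.shape[0], -1).var(axis=1)
    return moved[variances.argmax()]

def variance_weighted_average(volume, axis=0, eps=1e-8):
    """Variance-weighted average: high-variance slices get more weight;
    (near-)zero-variance slices contribute (almost) nothing."""
    moved = np.moveaxis(volume, axis, 0)
    w = moved.reshape(moved.shape[0], -1).var(axis=1)
    w = w / (w.sum() + eps)
    return np.tensordot(w, moved, axes=1)                   # weighted sum over slices

def linear_learnable(volume, weights, axis=0):
    """Linear Learnable projector: a weighted sum whose weights are, in the
    actual pipeline, optimized by backpropagation; here they are passed in."""
    moved = np.moveaxis(volume, axis, 0)
    return np.tensordot(weights, moved, axes=1)

vol = np.random.rand(64, 64, 64)            # toy volume; axis 0 assumed axial
print(midplane(vol).shape, variance_weighted_average(vol).shape)
```

In the proposed pipeline, each projector is applied along all three anatomical axes, producing one 2D input per plane-specific model.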
Chapter 4
Results and Discussion
In this chapter, we discuss the dataset, experimental setup, and analysis process, emphasizing the advancements achieved in classification accuracy. It begins with a detailed explanation of dataset selection, preprocessing steps, and the methodology for splitting the data into training, validation, and test sets. Following this, we evaluate the performance of several widely-used convolutional neural network architectures, including AlexNet, VGG-16, and ResNet-50, using a comprehensive set of classification metrics. The chapter then introduces our novel plane-wise ensemble approach, explaining its implementation and demonstrating its advantages over traditional ensemble models. Through detailed performance metrics and comparative analysis, we demonstrate the effectiveness of our proposed method in improving classification accuracy across the three primary classes. The chapter concludes with a discussion of key evaluation metrics, providing a comprehensive view of model performance.
4.1 Dataset
The dataset utilized for our experiments was sourced from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset, which is widely recognized as a benchmark dataset for Alzheimer’s Disease classification. The ADNI dataset includes a variety of imaging data, clinical information, and biomarkers collected from participants over multiple visits. This rich dataset is crucial for developing and validating models aimed at detecting and classifying Alzheimer’s Disease.
As detailed in Table 4.1, the dataset comprises three primary classes: Alzheimer’s Disease (AD), Cognitive Normal (CN), and Mild Cognitive Impairment (MCI). Each class includes a significant number of samples, with the MCI class having the largest representation.
Table 4.1: Dataset Distribution
Class / Number of Samples
AD (Alzheimer’s Disease): 602
CN (Cognitive Normal): 998
MCI (Mild Cognitive Impairment): 1832
One important aspect of the ADNI dataset is that it includes multiple imaging data points for individual patients collected across different visits. To ensure the integrity of the experimental results and prevent data leakage, we grouped all images from the same patient together. During the process of splitting the data into training, validation, and test sets, all images from a single patient are kept within the same split. This approach ensures that the models are evaluated on entirely unseen patients, thus providing a more realistic assessment of their generalizability and performance.
This careful handling of the dataset helps in maintaining the robustness of the training and evaluation process, ensuring that the models are not inadvertently trained on data that could appear in the test set. This practice is crucial for developing reliable models for Alzheimer’s Disease classification, as it closely mirrors real-world scenarios where models must generalize well to new patients.
4.2 Experimental Setup
4.2.1 Data Preparation
Data Preprocessing
As illustrated in Fig. 4.1, the acquired MRI images underwent skull-stripping using the FreeSurfer software. FreeSurfer plays a vital role in preparing structural MRI data for Alzheimer’s disease (AD) classification, offering a comprehensive pipeline to extract relevant features from brain images.
The following steps outline the preprocessing process:
• Intensity Normalization: The intensity values of the input MRI scans are normalized to ensure consistency across different acquisitions. This step corrects variations in signal intensity caused by scanner differences and acquisition protocols.
• Denoising Mechanisms: FreeSurfer incorporates denoising mechanisms to reduce noise and enhance image quality. This includes the use of non-local means denoising, which preserves important anatomical details while effectively removing noise. This step is crucial for improving the accuracy of subsequent image processing tasks.
• Skull Stripping and Semantic Segmentation: FreeSurfer employs a unified approach for skull stripping and semantic segmentation. By leveraging probabilistic atlases and machine learning algorithms, it accurately delineates brain structures while removing non-brain tissues such as the skull, scalp, and dura mater. This step is crucial for eliminating extraneous structures and focusing on relevant brain regions for AD classification.
• Tissue Segmentation: FreeSurfer performs tissue segmentation to classify voxels in the brain into different tissue types, including gray matter, white matter, and cerebrospinal fluid (CSF). This segmentation provides valuable information for subsequent analyses and feature extraction.
• Cortical Surface Reconstruction: FreeSurfer reconstructs the cortical surface of the brain from MRI data, identifying the pial surface (outer boundary of the cortex) and the white matter surface (inner boundary of the cortex). Accurate cortical surface reconstruction is essential for capturing cortical morphological changes associated with AD.
• Parcellation and Labeling: Automated parcellation of the cerebral cortex into distinct anatomical regions is performed, enabling detailed analysis of cortical morphology and regional differences. Additionally, subcortical segmentation is carried out to delineate structures such as the hippocampus and basal ganglia, which are implicated in AD pathology.
By employing this preprocessing pipeline, FreeSurfer prepares structural MRI data for AD classification studies, extracting relevant features and facilitating accurate analysis of brain morphometry and pathology.
Train-Test Split
The dataset underwent a partitioning process into train, validation (val), and test sets following an 8:1:1 ratio. To create the 5 folds for validation, each fold was generated independently by randomly selecting a subset of samples from the dataset, ensuring that the classes are balanced within each subset. Each fold was produced using a ’sampling with replacement’ approach, ensuring that the folds are independent of each other.
Fig. 4.1: Skull-stripping: The process of removing non-brain tissues from MRI images using semantic segmentation
The process of generating each fold involved the following steps:
• Randomly shuffle the dataset to ensure a fair distribution of samples.
• From the shuffled dataset, randomly select a subset such that the class distribution is balanced.
• Assign 80% of this subset to the training set, 10% to the validation set, and 10% to the test set.
• Repeat this process five times, independently, to create five separate folds.
To prevent data leakage, rigorous precautions were taken to ensure that multiple images from the same patient did not inadvertently end up across different splits during the random split generation, as sketched below.
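One way to enforce this constraint is a group-aware split keyed on patient identifiers. The snippet below is our illustration, not the thesis code: the variable names and data are hypothetical, scikit-learn's GroupShuffleSplit is just one possible tool, and it shows only the grouping idea rather than the balanced, five-fold sampling procedure described above.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical arrays: one entry per MRI volume.
image_paths = np.array(["scan_%03d.nii" % i for i in range(30)])
labels      = np.random.choice(["AD", "CN", "MCI"], size=30)
patient_ids = np.random.randint(0, 10, size=30)   # several scans per patient

# First hold out roughly 10% of patients for the test set...
outer = GroupShuffleSplit(n_splits=1, test_size=0.10, random_state=0)
trainval_idx, test_idx = next(outer.split(image_paths, labels, groups=patient_ids))

# ...then 1/9 of the remaining patients for validation (about 10% overall),
# so every image of a given patient stays in exactly one split.
inner = GroupShuffleSplit(n_splits=1, test_size=1 / 9, random_state=0)
train_idx, val_idx = next(inner.split(image_paths[trainval_idx],
                                      labels[trainval_idx],
                                      groups=patient_ids[trainval_idx]))
train_idx = trainval_idx[train_idx]
val_idx   = trainval_idx[val_idx]

# Sanity check: no patient appears in both the training and test sets.
assert set(patient_ids[train_idx]).isdisjoint(patient_ids[test_idx])
print(len(train_idx), len(val_idx), len(test_idx))
```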
This meticulous approach was crucial for maintaining the integrity of the evaluation process. The distribution of data across these subsets is delineated in Table 4.2, providing transparency regarding the allocation of samples for training and evaluation. 4.2.2 Hyper-parameter Settings The plane-wise ensemble model was trained using the following hyperparameters: • Batch Size: 16 The batch size determines the number of samples processed before the model’s 38 Table 4.2: Data Split Distribution Fold Set AD Samples MCI Samples CN Samples 1 Train 482 484 484 Val 54 52 51 Test 57 53 52 2 Train 480 484 482 Val 51 47 49 Test 61 58 56 3 Train 474 474 477 Val 66 64 62 Test 53 53 49 4 Train 474 477 473 Val 63 61 62 Test 57 54 57 5 Train 495 496 495 Val 49 47 47 Test 56 55 55 internal parameters are updated. A batch size of 16 strikes a balance between computational efficiency and the stability of the gradient descent process, enabling effective learning without overwhelming the memory capacity of the training hardware. • MaximumNumber of Epochs: 100 An epoch refers to one complete pass through the entire training dataset. Setting the maximum number of epochs to 100 allows the model sufficient iterations to learn from the data while preventing excessive training time and overfitting. • Early Stopping: Implemented to prevent overfitting, with a patience threshold set at 10 epochs Early stopping is a regularization technique used to terminate training when the model’s performance on a validation set stops improving. The patience pa- rameter of 10 epochs means that training will halt if there is no improvement in the validation loss for 10 consecutive epochs, thereby avoiding overfitting and reducing unnecessary computation. • Initial Learning Rate: 0.001 The learning rate controls the step size at each iteration while moving toward a minimum of the loss function. An initial learning rate of 0.001 is chosen as it is small enough to ensure stable convergence and large enough to expedite the learning process. 39 • Scheduler Step Size: 7 epochs The learning rate scheduler reduces the learning rate by a factor (gamma) ev- ery 7 epochs. This step size ensures periodic adjustments to the learning rate, facilitating finer learning adjustments as training progresses. • Gamma (Scheduler Factor): 0.1 The gamma parameter is the factor by which the learning rate is reduced. A gamma of 0.1 means the learning rate is multiplied by 0.1 every 7 epochs, allow- ing the model to fine-tune its weights with smaller learning rates in later stages of training for better accuracy. • Loss Function: Cross-entropy loss Cross-entropy loss is employed due to its effectiveness in multiclass classifica- tion tasks. It measures the performance of the classification model whose out- put is a probability value between 0 and 1, helping to quantify the difference between predicted probabilities and the actual class labels. • Optimizer: Adam Adam optimizer is used as it is the most popular default choice of optimizer in deep learning. It dynamically adjusts learning rates for individual parameters based on gradient magnitudes, smoothing convergence and reducing oscilla- tions. Incorporating momentum, it aids in navigating loss landscapes, helping to escape shallow local optima. These features mitigate issues like jittering and local optima, enhancing training stability and speed. • Ensemble Weights Training: The weights of the ensemble were fine-tuned separately for 30 epochs using the same hyperparameter settings. 
After training the individual models, the ensemble weights are optimized to combine the outputs of the models effectively. This separate training for 30 epochs with consistent hyperparameters ensures that the ensemble can leverage the strengths of each model and improve overall prediction accuracy.
4.3 Quantitative Evaluation
4.3.1 Evaluation Metrics
In the context of evaluating machine learning models, especially in classification tasks, several metrics are employed to gauge the performance of the model. These metrics offer insights into different aspects of the model’s predictive capabilities. This section delves into the specifics of accuracy, precision, recall, F1 score, and AUC-ROC, with a particular emphasis on their application in multi-class classification scenarios.
Accuracy
Accuracy is the simplest and most intuitive metric, representing the proportion of correctly classified instances out of the total instances. It is calculated as follows:

Accuracy = (Number of Correct Predictions) / (Total Number of Predictions)    (4.1)

In a multi-class classification setting, accuracy alone may not provide a complete picture, especially if the class distribution is imbalanced. For example, if one class dominates the dataset, a model that always predicts the majority class could still achieve high accuracy but would perform poorly on the minority classes.
Precision
Precision measures the proportion of true positive predictions among all positive predictions made by the model. It is an important metric when the cost of false positives is high. For multi-class classification, precision is calculated for each class individually:

Precisionᵢ = TPᵢ / (TPᵢ + FPᵢ)    (4.2)

where TPᵢ and FPᵢ are the true positives and false positives for class 𝑖, respectively. The overall precision for the model can be obtained by averaging the precision values for each class (macro-averaging) or by weighting them by the number of instances in each class (weighted averaging).
Recall
Recall, also known as sensitivity or true positive rate, measures the proportion of true positive predictions among all actual positive instances. It is crucial when the cost of false negatives is high. For multi-class classification, recall is calculated for each class as follows:

Recallᵢ = TPᵢ / (TPᵢ + FNᵢ)    (4.3)

where TPᵢ and FNᵢ are the true positives and false negatives for class 𝑖, respectively. Similar to precision, overall recall can be obtained through macro-averaging or weighted averaging.
F1 Score
The F1 score is a metric that combines precision and recall into a single number by calculating their harmonic mean. It is particularly useful when dealing with imbalanced datasets, where one class may be significantly more frequent than others. In such cases, relying solely on accuracy can be misleading, as a model might perform well overall by favoring the majority class but poorly on the minority class.
The F1 score is also known as the Dice Score or Dice Coefficient in the context of certain applications like image segmentation. It is calculated for each class in a multi-class classification problem using the following formula:

F1 Scoreᵢ = (2 · Precisionᵢ · Recallᵢ) / (Precisionᵢ + Recallᵢ)    (4.4)

The harmonic mean of two numbers, a and b, is given by:

H(a, b) = 2 / (1/a + 1/b) = 2ab / (a + b)    (4.5)

One of the key properties of the harmonic mean is that it tends to be closer to the smaller of the two numbers. This is beneficial in the context of the F1 score because it penalizes models that have a significant imbalance between precision and recall.
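As a small numerical illustration (the numbers are ours, chosen only for the example):

```python
# Suppose a class is predicted with precision 0.9 but recall 0.3.
precision, recall = 0.9, 0.3
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean = 0.45
arithmetic_mean = (precision + recall) / 2           # = 0.60
print(f1, arithmetic_mean)  # F1 stays close to the weaker of the two metrics
```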
In other words, if a model has high precision but low recall, or vice versa, the F1 score will be closer to the lower value, highlighting the model’s deficiency. AUC-ROC The Receiver Operating Characteristic (ROC) curve is a graphical representation of a model’s diagnostic ability. It plots the true positive rate (recall) against the false positive rate (1 - specificity) at various threshold settings. The Area Under the ROC Curve (AUC-ROC) summarizes the performance of the model across all thresholds: AUC-ROC = ∫ 1 0 ROC curve(𝑥)𝑑𝑥 (4.6) 42 For multi-class classification, a common approach is to compute the ROC curve and AUC for each class against all other classes (one-vs-rest) and then average the results. • True Positive Rate (TPR) or Recall: This is the 𝑦-axis of the ROC curve and represents the proportion of actual positives correctly identified by the model. • False Positive Rate (FPR): This is the 𝑥-axis of the ROC curve and repre- sents the proportion of actual negatives incorrectly identified as positives by the model. A model with a high AUC-ROC value (closer to 1) indicates better performance, as it suggests that the model has a good measure of separability between the classes. While the confusion matrix provides information about the actual counts of true pos- itives, false positives, true negatives, and false negatives, it is dependent on a spe- cific threshold. In contrast, the AUC-ROC offers a threshold-independent evaluation, summarizing the model’s performance across all possible thresholds. This makes the AUC-ROCamore robust and comprehensivemetric for assessingmodel performance, especially when comparing models. In summary, each metric provides unique insights into different aspects of model performance. Accuracy offers a general overview, while precision and recall provide deeper insights into the model’s behavior with respect to false positives and false neg- atives. The F1 score balances precision and recall, and the AUC-ROC provides a com- prehensive evaluation across all thresholds. Together, these metrics form a holistic view of the model’s performance in multi-class classification scenarios. 4.3.2 Baseline Models We employed common 2D CNN models pretrained on the ImageNet [7] dataset as well as a traditional ensemble model, where each model was trained on the entirety of MRI slices, and predictions were combined through probability averaging during evaluation. AlexNet Table 4.3 shows the metrics for accuracy, classwise precision, recall, F1 score, and AUC-ROC for each of the 5 folds for AlexNet. 43 Table 4.3: Performance Analysis for AlexNet Fold Accuracy (%) Class Precision (%) Recall (%) F1 Score (%) AUC-ROC (%) 1 60.494 AD 74.000 64.912 69.159 80.033 MCI 45.946 65.385 53.968 61.626 CN 71.053 50.943 59.341 71.733 2 60.000 AD 70.833 55.738 62.385 74.590 MCI 46.667 62.500 53.435 66.161 CN 69.231 62.069 65.455 78.912 3 59.355 AD 86.111 58.491 69.663 77.839 MCI 43.210 71.429 53.846 66.712 CN 68.421 49.057 57.143 71.162 4 55.357 AD 61.702 50.877 55.769 64.738 MCI 43.750 61.404 51.095 62.162 CN 70.732 53.704 61.053 70.110 5 60.843 AD 70.370 67.857 69.091 75.292 MCI 45.161 50.909 47.863 58.624 CN 70.000 63.636 66.667 71.122 VGG-16 Table 4.4 shows the metrics for accuracy, classwise precision, recall, F1 score, and AUC-ROC for each of the 5 folds for VGG-16. 
Table 4.4: Performance Analysis for VGG-16 Fold Accuracy (%) Class Precision (%) Recall (%) F1 Score (%) AUC-ROC (%) 1 61.111 AD 79.070 59.649 68.000 79.866 MCI 47.297 67.308 55.556 69.073 CN 66.667 56.604 61.224 68.929 2 62.857 AD 76.471 63.934 69.643 74.863 MCI 48.718 67.857 56.716 66.221 CN 71.739 56.897 63.462 73.637 3 61.935 AD 88.889 60.377 71.910 72.660 MCI 46.575 69.388 55.738 67.077 CN 65.217 56.604 60.606 73.825 4 66.667 AD 80.000 56.140 65.979 79.501 MCI 53.846 73.684 62.222 73.921 CN 76.000 70.370 73.077 80.815 5 61.446 AD 70.833 60.714 65.385 75.633 MCI 46.875 54.545 50.420 60.164 CN 70.370 69.091 69.725 80.655 ResNet-50 Table 4.5 shows the metrics for accuracy, classwise precision, recall, F1 score, and AUC-ROC for each of the 5 folds for ResNet-50. 44 Table 4.5: Performance Analysis for ResNet-50 Fold Accuracy (%) Class Precision (%) Recall (%) F1 Score (%) AUC-ROC (%) 1 72.840 AD 75.000 78.947 76.923 84.712 MCI 63.158 69.231 66.055 76.914 CN 82.222 69.811 75.510 83.469 2 73.714 AD 80.769 68.852 74.336 79.062 MCI 62.121 73.214 67.213 71.924 CN 80.702 79.310 80.000 85.927 3 72.903 AD 69.091 71.698 70.370 74.769 MCI 72.549 75.510 74.000 77.801 CN 77.551 71.698 74.510 78.376 4 69.643 AD 68.254 75.439 71.667 78.125 MCI 66.038 61.404 63.636 71.946 CN 75.000 72.222 73.585 80.539 5 69.277 AD 68.966 71.429 70.175 79.140 MCI 66.667 72.727 69.565 77.821 CN 72.917 63.636 67.961 74.382 DenseNet-169 Table 4.6 shows the metrics for accuracy, classwise precision, recall, F1 score, and AUC-ROC for each of the 5 folds for DenseNet-169. Table 4.6: Performance Analysis for DenseNet-169 Fold Accuracy (%) Class Precision (%) Recall (%) F1 Score (%) AUC-ROC (%) 1 77.778 AD 83.636 80.702 82.143 87.168 MCI 69.231 69.231 69.231 79.528 CN 80.000 83.019 81.481 88.783 2 74.857 AD 78.182 70.492 74.138 81.622 MCI 67.143 83.929 74.603 83.283 CN 82.000 70.690 75.926 81.359 3 74.839 AD 80.769 79.245 80.000 85.775 MCI 65.000 79.592 71.560 80.400 CN 81.395 66.038 72.917 84.462 4 75.000 AD 79.245 73.684 76.364 82.646 MCI 68.852 73.684 71.186 78.694 CN 77.778 77.778 77.778 86.160 5 73.494 AD 80.357 80.357 80.357 86.055 MCI 62.687 76.364 68.852 77.674 CN 81.395 63.636 71.429 78.870 Vision Transformer B16 Table 4.7 shows the metrics for accuracy, classwise precision, recall, F1 score, and AUC-ROC for each of the 5 folds for Vision Transformer B16. 45 Table 4.7: Performance Analysis for Vision Transformer B16 Fold Accuracy (%) Class Precision (%) Recall (%) F1 Score (%) AUC-ROC (%) 1 64.815 AD 72.222 68.421 70.270 82.256 MCI 56.667 65.385 60.714 70.472 CN 66.667 60.377 63.366 70.573 2 60.571 AD 75.000 59.016 66.055 77.380 MCI 51.351 67.857 58.462 70.273 CN 60.377 55.172 57.658 66.858 3 70.968 AD 74.468 66.038 70.000 78.117 MCI 64.912 75.510 69.811 78.283 CN 74.510 71.698 73.077 79.967 4 61.310 AD 66.667 70.175 68.376 74.696 MCI 53.125 59.649 56.198 67.520 CN 65.909 53.704 59.184 70.744 5 65.663 AD 66.038 62.500 64.220 73.198 MCI 60.000 65.455 62.609 73.726 CN 71.698 69.091 70.370 79.230 Traditional Ensemble Model (ResNet-50, AlexNet, VGG-16) The traditional ensemblemodel combines the strengths of threewell-establishedCon- volutional Neural Networks (CNNs): ResNet-50, AlexNet, and VGG-16. In this en- semble approach, each model is independently trained on the same dataset, and their predictions are aggregated to make the final decision. Table 4.8 shows the metrics for accuracy, classwise precision, recall, F1 score, and AUC-ROC for each of the 5 folds for Traditional Ensemble Model (ResNet-50, AlexNet, VGG-16). 
Table 4.8: Performance Analysis for Traditional Ensemble Model (ResNet-50, AlexNet, VGG-16) Fold Accuracy (%) Class Precision (%) Recall (%) F1 Score (%) AUC-ROC (%) 1 74.691 AD 84.211 84.211 84.211 87.347 MCI 62.500 67.308 64.815 70.359 CN 77.551 71.698 74.510 80.898 2 73.714 AD 80.392 67.213 73.214 77.858 MCI 65.079 73.214 68.908 75.040 CN 77.049 81.034 78.992 85.427 3 81.935 AD 90.909 75.472 82.474 86.632 MCI 74.000 75.510 74.747 80.485 CN 81.967 94.340 87.719 90.972 4 74.405 AD 81.481 77.193 79.279 83.750 MCI 71.429 61.404 66.038 71.189 CN 70.769 85.185 77.311 81.453 5 73.494 AD 75.806 83.929 79.661 82.976 MCI 64.516 72.727 68.376 73.095 CN 83.333 63.636 72.165 76.717 46 4.3.3 Plane-wise Ensemble Models The proposed plane-wise ensemble, withmodels trained specifically on coronal, sagit- tal, and axial slices, demonstrated superior performance compared to traditional en- semblemodels and standalone 2DCNNs. The holistic approach of combining special- ized models using a weighted average of their predictions yielded improved accuracy in the classification of 3D MRI images. Triple ResNet (using Midplane Projector) Triple ResNet model utilizes a plane-wise ensemble technique, involving three sepa- rate ResNet-50 models trained on coronal, sagittal, and axial planes of MRI images. Each ResNet-50model is trained independently on 2D slices from one of these planes, capturing unique anatomical features specific to that orientation. During evaluation, the entire 3D MRI volume is processed by a midplane projector function, which ex- tracts the central slice from each of the three planes. These slices are then fed into their respective ResNet-50 models. The predictions from the three models are sub- sequently combined to make the final classification decision. Table 4.9 shows the metrics for accuracy, classwise precision, recall, F1 score, and AUC-ROC for each of the 5 folds for Triple ResNet (Midplane Projector). Table 4.9: Performance Analysis for Triple ResNet (using Midplane Projector) Fold Accuracy (%) Class Precision (%) Recall (%) F1 Score (%) AUC-ROC (%) 1 80.247 AD 85.455 82.456 83.929