BACHELOR OF SCIENCE IN COMPUTER SCIENCE AND ENGINEERING

A CNN-based Pipeline using Plane-wise Ensemble Technique for Classifying Alzheimer's Disease from 3D MRI Images

Mahtab Nur Fardin 190041112
Md. Irfanur Rahman Rafio 190041125
Md. Jubayer Islam 190041129

Department of Computer Science and Engineering
Islamic University of Technology
June, 2024

Declaration of Candidate

This is to certify that the work presented in this thesis is the outcome of the analysis and experiments carried out by Mahtab Nur Fardin, Md. Irfanur Rahman Rafio, and Md. Jubayer Islam under the supervision of Dr. Md. Hasanul Kabir, Professor, Department of Computer Science and Engineering, and co-supervision of Sabbir Ahmed, Assistant Professor, Department of Computer Science and Engineering, Islamic University of Technology, Dhaka, Bangladesh. It is also declared that neither this thesis nor any part of it has been submitted anywhere else for any degree or diploma. Information derived from the published and unpublished work of others has been acknowledged in the text and a list of references is given.

Dr. Md. Hasanul Kabir
Professor
Department of Computer Science and Engineering
Islamic University of Technology (IUT)
Date: June 04, 2024

Sabbir Ahmed
Assistant Professor
Department of Computer Science and Engineering
Islamic University of Technology (IUT)
Date: June 04, 2024

Mahtab Nur Fardin
Student ID: 190041112
Date: June 04, 2024

Md. Irfanur Rahman Rafio
Student ID: 190041125
Date: June 04, 2024

Md. Jubayer Islam
Student ID: 190041129
Date: June 04, 2024

Contents

1 Introduction
  1.1 Motivation and Scope
  1.2 Problem Statement
  1.3 Research Challenges
    1.3.1 Variability in Brain Morphology
    1.3.2 Multi-Class Classification Complexity
    1.3.3 Computational Cost
  1.4 Contribution
  1.5 Organization
2 Related Works
  2.1 Conventional Approaches
    2.1.1 Binary Classification Using Machine Learning Techniques
    2.1.2 Image Processing Methods
    2.1.3 Limitations
  2.2 Deep Learning Based Approaches
    2.2.1 Dominance of CNN Models
    2.2.2 Architecture Overview
    2.2.3 Multi-class Classification
    2.2.4 Utilization of Single and Multi-modal Approaches
    2.2.5 Preprocessing Techniques
  2.3 Summary and Limitations of Existing Architectures
3 Proposed Methodology
  3.1 Architecture Overview
  3.2 Model Architecture
  3.3 Model Integration
  3.4 Projector Functions
    3.4.1 Midplane Projector
    3.4.2 Average Projector
    3.4.3 Max Variance Projector
    3.4.4 Variance Weighted Average Projector
    3.4.5 Linear Learnable (LL) Projector
4 Results and Discussion
  4.1 Dataset
  4.2 Experimental Setup
    4.2.1 Data Preparation
    4.2.2 Hyper-parameter Settings
  4.3 Quantitative Evaluation
    4.3.1 Evaluation Metrics
    4.3.2 Baseline Models
    4.3.3 Plane-wise Ensemble Models
    4.3.4 Comparative Analysis and Discussion
5 Conclusion
References

List of Figures

2.1 A simple CNN architecture that extracts features from the spatial information and then classifies them into different classes [33]
2.2 The architecture of AlzheimerNet, a fine-tuned InceptionV3 [34]
2.3 Schematic representation of the proposed ensemble model [21]
2.4 Network architecture of 3MT [19]
2.5 Details of the CNN transformer for encoding 3D images [19]
3.1 Diagram comparing a regular ensemble model (c) with our proposed plane-wise ensemble technique (f)
3.2 Midplane Projections: Axial Plane
3.3 Average Projections: Axial Plane
3.4 Max Variance Projections: Axial Plane
3.5 Variance Weighted Average Projections: Axial Plane
4.1 Skull-stripping: the process of removing non-brain tissues from MRI images using semantic segmentation

List of Tables

2.1 Summary of Representative Works on Alzheimer's Disease Classification
4.1 Dataset Distribution
4.2 Data Split Distribution
4.3 Performance Analysis for AlexNet
4.4 Performance Analysis for VGG-16
4.5 Performance Analysis for ResNet-50
4.6 Performance Analysis for DenseNet-169
4.7 Performance Analysis for Vision Transformer B16
4.8 Performance Analysis for Traditional Ensemble Model (ResNet-50, AlexNet, VGG-16)
4.9 Performance Analysis for Triple ResNet (using Midplane Projector)
4.10 Performance Analysis for Triple DenseNet (using Midplane Projector)
4.11 Performance Analysis for Triple ResNet (using Average Projector)
4.12 Performance Analysis for Triple DenseNet (using Average Projector)
4.13 Performance Analysis for Triple ResNet (using Max Variance Projector)
4.14 Performance Analysis for Triple DenseNet (using Max Variance Projector)
4.15 Performance Analysis for Triple ResNet (using Variance Weighted Average Projector)
4.16 Performance Analysis for Triple DenseNet (using Variance Weighted Average Projector)
4.17 Performance Analysis for Triple ResNet (using LL Projector)
4.18 Performance Analysis for Triple DenseNet (using LL Projector)
4.19 Model Performance Comparison
4.20 Confidence Intervals for Model Accuracy
4.21 Comparison with State-of-the-art Models

Abstract

Alzheimer's disease (AD) is a chronic neurodegenerative condition that progressively damages brain cells, resulting in memory and cognitive decline and eventually impeding basic functionalities. With over 55 million people worldwide affected by dementia, a number anticipated to rise significantly, the urgency for early diagnosis becomes paramount. While a definitive cure remains elusive, early intervention is crucial in mitigating disease progression and enhancing patient outcomes. This research investigates the potential of deep learning models for classifying Alzheimer's disease, emphasizing the challenges in Mild Cognitive Impairment (MCI) classification, and introduces a CNN-based pipeline utilizing a plane-wise ensemble technique for 3D MRI image classification. To manage the complex nature of 3D MRI data, the pipeline decomposes each 3D image into axial, coronal, and sagittal planes and ensembles 2D CNN models trained on these planes, incorporating multi-view information to improve classification accuracy. The methodology also leverages projector functions to map the 3D volumes into a series of 2D images, tackling the computational challenges presented by 3D data and resulting in a more efficient and practical process even with constrained computational resources.

Chapter 1

Introduction

Alzheimer's disease (AD) is a pervasive and devastating neurodegenerative disorder characterized by the gradual deterioration of brain tissues, leading to a decline in memory, cognitive functions, and ultimately, a loss of fundamental abilities. As the global population ages, the prevalence of dementia, including AD, continues to escalate, with over 55 million individuals affected worldwide, a figure expected to rise to 78 million by 2030 and 139 million by 2050 [28].

Traditionally, biomarkers focusing on critical brain regions such as the hippocampus, parietal lobe, and amygdala have been fundamental in identifying atrophy indicative of AD [4]. Recent studies, however, have showcased the high potential of deep learning models in the accurate classification of AD, prompting a paradigm shift in diagnostic methodologies [1].

Before delving into our core research, it is essential to comprehend the intricacies of the domain. Experts categorize potential AD patients into three classes, a departure from the conventional binary classification of AD and normal cognitive function (CN).
The introduction of Mild Cognitive Impairment (MCI) as a category allows for early detection, recognizing that most MCI patients progress to AD within 3 to 6 years [23]. MCI poses a unique challenge, as early-stage MCI resembles CN, while late-stage MCI bears similarity to AD.

Examining the modalities instrumental in diagnosing AD reveals a shifting landscape. Outdated methods like CT scans and EEG have given way to more sophisticated approaches. Magnetic Resonance Imaging (MRI), offering detailed structural images, has become the primary modality due to its widespread availability. Positron Emission Tomography (PET), while providing functional insights, is invasive, limiting its use. Diffusion Tensor Imaging (DTI), though yielding high-quality data, faces challenges in terms of accessibility. Notably, in Alzheimer's classification, the integration of non-imaging data such as age, gender, MMSE scores, and genetic information yields nuanced insights into disease dynamics and further enhances accuracy. Demographic factors such as age and gender capture population-level variation, while MMSE scores provide standardized cognitive assessments. Genetic data contributes to understanding hereditary patterns, collectively enhancing the precision of classification models for a comprehensive analysis of Alzheimer's disease [39].

In recent years, research centers have aggregated substantial medical and imaging data and shared it publicly for the benefit of researchers engaged in Artificial Intelligence (AI) development for Alzheimer's disease. These online datasets provide crucial biomarker information, including neuroimaging modalities, genetic data, and clinical and cognitive assessments. The most prominent datasets are:

• Alzheimer's Disease Neuroimaging Initiative (ADNI) [15]: ADNI, a longitudinal and multicenter study, serves as a prominent dataset. It includes ADNI-1, ADNI-GO, ADNI-2, and ADNI-3. ADNI aims to assess the progression of Mild Cognitive Impairment (MCI) and early AD, utilizing Magnetic Resonance Imaging (MRI), Positron Emission Tomography (PET), biological markers, and clinical and neuropsychological assessments. ADNI datasets encompass various data types, such as clinical information, genetic data, MRI and PET images, and biospecimens.

• Australian Imaging, Biomarker & Lifestyle Flagship Study of Aging (AIBL) [10]: AIBL compiles imaging and medical data from individuals with AD, those with MCI, and cognitively healthy individuals.

• Open Access Series of Imaging Studies (OASIS) [22]: OASIS, designed to share neuroimaging brain datasets, includes OASIS-1 with 434 MRI scans, OASIS-2 with 373 MRI scans, and OASIS-3 with 2,168 MRIs and 1,608 PET scans.

• National Alzheimer's Coordinating Center (NACC) [5]: NACC, established as a cornerstone of Alzheimer's research, not only provides essential data but also serves as a nexus for standardizing and harmonizing diverse datasets. Its comprehensive collection includes clinical, genetic, and neuroimaging data, fostering a holistic approach to Alzheimer's research.

Recent advancements in Alzheimer's disease classification have demonstrated the effectiveness of deep learning models. Building upon that, this thesis presents a Convolutional Neural Network (CNN)-based pipeline for classifying Alzheimer's disease from 3D MRI images.
The pipeline employs a plane-wise ensemble technique, introducing a strategic approach that capitalizes on specialized models for distinct imaging planes. This technique aims to harness anatomical information during both model training and evaluation, yielding notable improvements in the classification accuracy of 3D MRI images.

1.1 Motivation and Scope

Early and accurate diagnosis of AD is crucial for effective patient management and the development of therapeutic strategies. Magnetic Resonance Imaging (MRI) has emerged as a powerful non-invasive tool for diagnosing AD due to its high-resolution imaging capabilities. However, manual analysis of 3D MRI images is time-consuming, subjective, and prone to errors, necessitating the development of automated and reliable diagnostic methods [17].

Recent advancements in deep learning, particularly Convolutional Neural Networks (CNNs), have shown remarkable success in image classification tasks [24]. CNNs have the potential to revolutionize the field of medical imaging by automating the analysis process, reducing human error, and enhancing diagnostic accuracy. Despite these advancements, the complexity of 3D MRI data presents unique challenges, including high dimensionality and computational intensity, which often limit the performance and applicability of standard CNN models.

To address these challenges, the motivation for this thesis is to explore a novel CNN-based pipeline that leverages a plane-wise ensemble technique for classifying Alzheimer's disease from 3D MRI images. By decomposing the 3D MRI data into 2D planes and employing an ensemble of CNN models trained on these planes, the proposed method aims to improve classification accuracy while managing computational complexity. This approach not only capitalizes on the strengths of 2D CNNs but also integrates the multi-view information inherent in 3D data, offering a promising solution for robust AD diagnosis. The significance of this research lies in its potential to enhance the early detection of Alzheimer's disease, thereby enabling timely intervention and improved patient outcomes. Furthermore, by optimizing the computational efficiency of the diagnostic process, this approach can facilitate broader clinical adoption and integration into existing diagnostic workflows.

Recent research has demonstrated the efficacy of CNNs in medical imaging, particularly in the classification of neurological conditions from MRI data. Several studies [20], [21], [29], [38] have shown that deep learning models can achieve high accuracy in distinguishing between AD and healthy controls. Additionally, work by Payan and Montana has explored 3D CNNs for AD diagnosis [27], highlighting the potential of deep learning in this domain. However, these studies often grapple with the computational demands and complexity associated with processing 3D MRI data.

Despite the promising results, existing research is limited by several factors. The high dimensionality of 3D MRI data leads to significant computational requirements, making the training and deployment of 3D CNN models resource-intensive. Moreover, many current approaches overlook the potential benefits of integrating multi-view information from different planes of 3D MRI data, which can enhance diagnostic accuracy.

This thesis aims to bridge these gaps by developing, implementing, and evaluating a CNN-based pipeline designed for the classification of Alzheimer's disease using 3D MRI images.
The key aspects covered include:

1. Data Preprocessing: Detailed examination of preprocessing techniques to standardize 3D MRI data, including normalization, resizing, and slice extraction, ensuring compatibility with the proposed CNN models.

2. Model Architecture: Design and implementation of CNN architectures tailored for 2D plane classification, followed by an ensemble approach that integrates predictions from multiple planes (axial, coronal, and sagittal).

3. Training and Validation: Strategies for effectively training the CNN models, including data augmentation, cross-validation, and optimization of hyperparameters to enhance model performance and generalizability.

4. Performance Evaluation: Comprehensive evaluation of the pipeline using established metrics such as accuracy, sensitivity, specificity, and area under the receiver operating characteristic curve (AUC-ROC), with comparisons against existing methods to highlight improvements and potential benefits.

By addressing these components, the thesis aims to contribute to the field of medical imaging and Alzheimer's disease diagnosis, providing a scalable and efficient solution for early detection and improving the overall quality of patient care. The proposed plane-wise ensemble technique not only enhances diagnostic accuracy but also significantly reduces the computational burden, making the model feasible for clinical application even with limited computational resources. This innovation has the potential to make advanced diagnostic tools accessible to a wider range of healthcare settings, thereby improving patient outcomes on a global scale.

1.2 Problem Statement

Develop a model which, given a 3D MRI image, can classify it into one of the following classes:

• Alzheimer's Disease (AD)
• Cognitive Normal (CN)
• Mild Cognitive Impairment (MCI)

Each 3D MRI image presents a complex and high-dimensional input that requires advanced techniques for accurate classification. Traditional manual analysis methods are labor-intensive and susceptible to subjective bias and errors. Therefore, there is a pressing need for automated, reliable, and efficient diagnostic tools to support the early detection of, and differentiation between, these categories.

The goal is to provide a robust and scalable tool for aiding the diagnosis and monitoring of Alzheimer's disease and related conditions, ultimately improving patient outcomes and supporting clinical decision-making.

1.3 Research Challenges

The development of a CNN-based pipeline for classifying Alzheimer's disease from 3D MRI images involves several significant challenges. These challenges span computational costs, variability in brain morphology, and the complexity of multi-class classification. Addressing them is crucial for the successful implementation and accuracy of the proposed diagnostic tool.

1.3.1 Variability in Brain Morphology

Detecting brain atrophies associated with Alzheimer's disease is complicated by the natural variability in brain structure. Factors contributing to this variability include:

• Gender Differences: Male and female brains exhibit structural differences, which can affect the model's ability to generalize across genders.

• Age-related Changes: The brain undergoes significant changes over a person's lifespan, adding another layer of complexity to the classification task. Age-related atrophies might be mistaken for disease-specific patterns.
• Demographic and Genetic Diversity: Variations in brain structure can also be attributed to demographic factors (e.g., ethnicity, lifestyle) and genetic predispositions, complicating the detection of AD-related changes.

Developing a model that can account for these variabilities requires a robust training dataset that adequately represents these diverse factors. It also necessitates sophisticated preprocessing and augmentation techniques to ensure the model is exposed to a wide range of anatomical variations.

1.3.2 Multi-Class Classification Complexity

Classifying 3D MRI images into three categories (AD, CN, MCI) introduces additional complexity compared to binary classification. Specific challenges include:

• Interclass Similarity: The Mild Cognitive Impairment (MCI) class often exhibits characteristics that overlap with both the Alzheimer's Disease (AD) and Cognitive Normal (CN) classes, making it difficult for the model to distinguish between these states accurately.

• Imbalanced Data: The prevalence of AD, CN, and MCI in available datasets may not be evenly distributed, potentially leading to biased model performance if not adequately addressed.

• Diagnostic Ambiguity: The progression from CN to MCI to AD is a continuum, and the boundaries between these classes are not always clear-cut. This ambiguity can lead to misclassification and reduced model accuracy.

1.3.3 Computational Cost

One of the primary challenges in developing a deep learning model for 3D MRI image classification is the high computational cost associated with training and inference. Key issues include:

• High Dimensionality of Data: 3D MRI images contain a vast amount of data, leading to high memory and processing requirements. The "curse of dimensionality" exacerbates this issue: as the dimensionality of the input grows, so do the model's parameter count and the amount of training data needed to avoid overfitting, making learning more difficult.

• Model Complexity: Standard 3D CNN models are computationally expensive due to the large number of parameters and layers required to capture spatial features in three dimensions. High-dimensional data also complicates optimization, making it more susceptible to local minima and increasing the difficulty of finding a global optimum. Additionally, outlier analysis becomes less meaningful in high-dimensional spaces.

• Hardware Limitations: Many research institutions and healthcare facilities may not have access to high-performance computing resources, limiting the feasibility of training complex 3D CNN models. The significant memory and processing power required to handle 3D MRI data can strain available hardware, leading to slower training times and reduced model performance.

To mitigate these issues, the proposed plane-wise ensemble technique decomposes 3D MRI data into 2D planes, significantly reducing computational demands and making the training process more efficient. This approach reduces the number of parameters, simplifies optimization, and lowers the data requirements, making the model more feasible for training on available hardware. Additionally, by focusing on 2D planes, the method can leverage the strengths of 2D CNNs, which are far less computationally intensive than their 3D counterparts.
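To make the parameter argument concrete, the short sketch below (PyTorch, with hypothetical layer sizes chosen purely for illustration, not taken from the proposed models) counts the weights of a single 3D convolutional layer against a 2D layer of the same kernel size, the kind of per-plane layer the pipeline relies on:

```python
import torch.nn as nn

# Hypothetical layer sizes, chosen only to illustrate the scaling argument.
conv3d = nn.Conv3d(in_channels=1, out_channels=64, kernel_size=5)  # 64*1*5*5*5 + 64 = 8,064
conv2d = nn.Conv2d(in_channels=1, out_channels=64, kernel_size=5)  # 64*1*5*5   + 64 = 1,664

def count(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

print(f"3D conv layer: {count(conv3d):,} parameters")
print(f"2D conv layer: {count(conv2d):,} parameters")
# Even three independent 2D layers (one per anatomical plane) hold fewer
# parameters than the single 3D layer, and the gap widens as kernel size
# and network depth grow.
```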
To address these challenges, the research will focus on:

• Balanced Sampling: Techniques to ensure that the dataset is balanced by sampling from each class multiple times, ensuring statistical significance and robustness of the method. Strong measures are taken to prevent data leakage, maintaining the integrity of the training process.

• Ensemble Techniques: Using multiple CNNs trained on different 2D planes to capture more nuanced features and improve overall classification robustness. This approach leverages the strengths of 2D CNNs while integrating multi-view information from 3D data.

• Enhanced Model Training: Employing strategies such as cross-validation and hyperparameter tuning to optimize model performance and mitigate the effects of interclass similarity. These techniques help in fine-tuning the model parameters and ensuring that the model generalizes well across different subsets of the data.

By tackling these computational, morphological, and classification challenges, this research aims to develop a more accurate, efficient, and generalizable CNN-based diagnostic tool for Alzheimer's disease and related conditions.

1.4 Contribution

This thesis presents several key contributions to the field of medical imaging and Alzheimer's disease diagnosis:

• Introduced a novel plane-wise ensemble technique that addresses the computational challenges of 3D MRI data by decomposing it into 2D planes (axial, coronal, and sagittal) and employing an ensemble of CNN models trained on these planes. This method reduces computational complexity while leveraging the strengths of 2D CNNs, enhancing classification accuracy and efficiency.

• Developed projector functions that map 3D MRI images into 2D inputs for the CNN models. These functions transform high-dimensional 3D data into a manageable format for 2D CNNs, maintaining essential structural information and simplifying data processing.

• Benchmarked the performance of different CNN models, systematically evaluating and comparing them. This benchmarking provides insights into the most effective models and plane orientations for the classification task, ensuring the use of the most accurate and reliable configurations for Alzheimer's disease diagnosis.

1.5 Organization

The thesis is structured to provide a comprehensive exploration of the Plane-wise Ensemble Technique for classifying Alzheimer's disease from 3D MRI images. The organization ensures a logical flow of information, guiding the reader from the introduction to the conclusion while aligning with the research objectives.

Chapter 2 conducts a thorough review of existing literature on Alzheimer's disease diagnosis and medical image classification techniques. It discusses the limitations of traditional methods and highlights the potential benefits of utilizing CNN-based approaches, laying the theoretical groundwork for the Plane-wise Ensemble Technique and its relevance in the field of medical imaging.

Chapter 3 presents an overview of the Plane-wise Ensemble Approach, detailing the decomposition of 3D MRI data into axial, coronal, and sagittal planes. It explains the rationale behind leveraging anatomical information from different imaging planes to enhance classification accuracy and introduces the workflow and integration of specialized models within the ensemble framework.

Chapter 4 details the implementation of the CNN-based pipeline for classifying Alzheimer's disease using the Plane-wise Ensemble Technique.
It discusses the experimental setup, including dataset selection, model training, and evaluation metrics, and analyzes the results to showcase the advancements in classification accuracy achieved through the proposed approach.

Chapter 5 summarizes the key findings of the research, emphasizing the contributions of the Plane-wise Ensemble Technique. It discusses the implications of the findings in relation to the research objectives and existing knowledge in the field, acknowledges study limitations, and suggests future research directions to build upon the current work, providing a comprehensive wrap-up of the thesis.

Chapter 2

Related Works

2.1 Conventional Approaches

Earlier methods primarily focused on binary classification and traditional image processing techniques, which laid the groundwork for more sophisticated models.

2.1.1 Binary Classification Using Machine Learning Techniques

One of the initial strategies for Alzheimer's disease diagnosis involved binary classification, distinguishing between Alzheimer's disease (AD) and Cognitive Normal (CN) individuals. Machine learning techniques such as Support Vector Machines (SVM), Linear Discriminant Analysis (LDA), and Principal Component Analysis (PCA) [2], [20], [41] were commonly employed in these efforts. SVMs, known for their effectiveness in high-dimensional spaces, were utilized to create hyperplanes that best separated the AD and CN classes based on features extracted from neuroimaging data. LDA, on the other hand, aimed to find the linear combinations of features that best separated the two classes. PCA was often used to reduce the dimensionality of the data, retaining the most informative components for subsequent classification tasks. These machine learning approaches provided a foundation for early diagnostic models, achieving reasonable accuracy and helping to highlight the potential of computational methods in AD diagnosis.

2.1.2 Image Processing Methods

Traditional image processing methods were also pivotal in the early stages of Alzheimer's disease research. Techniques such as thresholding, edge detection, and region-based methods were employed to analyze neuroimaging data, particularly MRI and CT scans [1]. Thresholding involved setting intensity thresholds to segment brain images, highlighting regions of interest such as hippocampal atrophy, which is commonly associated with AD. Edge-based techniques focused on detecting boundaries and contours within the images, facilitating the identification of structural changes in the brain. Region-based methods aimed to segment images into meaningful regions, often using criteria such as intensity homogeneity or anatomical knowledge to delineate areas affected by the disease. These image processing methods were instrumental in extracting relevant features from neuroimaging data, which were subsequently used in classification models.

2.1.3 Limitations

While these earlier approaches made significant contributions to the field, they also had limitations. Machine learning techniques like SVM and LDA required extensive feature engineering and were often sensitive to the quality and quantity of input features. Additionally, traditional image processing methods were sometimes limited by their reliance on manual parameter tuning [2] and their susceptibility to noise and artifacts in the imaging data.
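As a concrete illustration of the conventional workflow of Section 2.1.1, the sketch below (scikit-learn, with randomly generated stand-in data; in practice the feature vectors would be hand-engineered measurements such as regional volumes or intensity statistics) chains dimensionality reduction and a linear classifier:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# X: hand-engineered features per scan (stand-in random data here)
# y: binary labels (0 = CN, 1 = AD)
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 500)), rng.integers(0, 2, size=200)

# PCA keeps the most informative components; the linear SVM then looks for
# a separating hyperplane between the AD and CN classes.
model = make_pipeline(StandardScaler(), PCA(n_components=50), SVC(kernel="linear"))
model.fit(X[:150], y[:150])
print("held-out accuracy:", model.score(X[150:], y[150:]))
```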
Despite these challenges, the foundational work of binary classification and image processing paved the way for the development of more advanced models, including deep learning and multimodal approaches, which have since demonstrated superior performance in AD diagnosis.

2.2 Deep Learning Based Approaches

The landscape of Alzheimer's disease (AD) diagnosis has evolved significantly with the advent of advanced computational techniques, particularly in the realm of deep learning and multimodal data integration. Recent trends emphasize the use of sophisticated neural networks, leveraging the power of convolutional neural networks (CNNs), transformer-based models, and hybrid approaches to improve diagnostic accuracy and robustness.

2.2.1 Dominance of CNN Models

One of the most prominent trends in recent research is the application of deep learning methods, especially convolutional neural networks (CNNs) [26], to AD diagnosis. Unlike traditional machine learning techniques that require manual feature extraction, CNNs can automatically learn hierarchical features from raw imaging data. These models have been applied extensively to MRI and PET scans, providing state-of-the-art performance in distinguishing between AD, MCI, and CN individuals. For example, models like DenseNet [14] and ResNet [12] have been fine-tuned for the task, demonstrating substantial improvements in classification accuracy and computational efficiency.

2.2.2 Architecture Overview

2D CNN Based Architectures

The use of 2D Convolutional Neural Networks (CNNs) for classifying Alzheimer's Disease (AD) has been a prominent area of research. These methods typically involve extracting 2D slices from 3D MRI scans and using them as input for pre-trained or custom-designed CNN architectures.

Savaş [33] utilized pre-trained deep learning models, specifically VGG-16 and ResNet-50, for classifying AD stages through transfer learning. By fine-tuning these models on 2D MRI slices, the study leveraged learned features from the ImageNet dataset to enhance classification accuracy while reducing training time. Preprocessing ensured consistency and compatibility with the CNN input requirements.

• VGG-16: A deep CNN with 16 layers, pre-trained on ImageNet, served as a feature extractor. The final layers were replaced for AD classification.

• ResNet-50: A 50-layer residual network, also pre-trained on ImageNet, was fine-tuned similarly to VGG-16.

While effective, this approach might miss important 3D spatial information inherent in MRI scans. The fine-tuned models depend heavily on the specific dataset, potentially limiting their generalizability. Additionally, fine-tuning large pre-trained models still requires significant computational resources.

A CNN model built from scratch [31] was developed to automate AD detection using 2D MRI slices. The model involved multiple convolutional layers for feature extraction, followed by ReLU activation functions and max-pooling layers to downsample feature maps and reduce computational complexity. Fully connected layers processed the extracted features for classification (Fig. 2.1). This model, however, requires a large amount of labeled data, which might not always be available. It also risks missing crucial 3D context and may overfit with limited datasets.

Fig. 2.1: A simple CNN architecture that extracts features from the spatial information and then classifies them into different classes [33]

AlzheimerNet [34] is a deep learning model for classifying AD stages from MRI images.
AlzheimerNet incorporated convolutional blocks with batch normalization and ReLU activations, along with residual connections to mitigate vanishing gradients. Max-pooling layers downsampled feature maps, and fully connected layers enabled higher-level feature processing for classification.

Fig. 2.2: The architecture of AlzheimerNet, a fine-tuned InceptionV3 [34]

AlzheimerNet's complex architecture, while powerful, may be computationally intensive for some clinical settings. The model's performance is dependent on the quality and diversity of training data, and like other 2D CNN models, it may overlook important 3D spatial relationships critical for accurate AD diagnosis and staging.

Another study presents a deep-ensemble method [21] combining multiple convolutional neural network (CNN) architectures for robust and accurate classification of Alzheimer's Disease (AD) using MRI and fMRI data. The datasets include MRI and fMRI images of patients with varying degrees of dementia, including healthy controls, very mild AD, mild AD, and moderate AD. The CNN architectures selected, based on their performance in previous AD research and their size/precision ratios, include AlexNet, Inception-ResNet-v2, ResNet-50, ResNet-101, and GoogLeNet. Transfer learning was employed using CNNs pre-trained on the ImageNet dataset, which were fine-tuned on the AD datasets by retraining the networks' final layers while keeping the earlier layers frozen. Features were extracted from the penultimate fully connected layer (FC7) of AlexNet and the last layer of the ResNet and Inception architectures, providing a refined representation of the input images. An ensemble learning approach combined the predictions of the three best-performing networks (AlexNet, ResNet-101, and Inception-ResNet-v2) using a bagged trees model with an averaging strategy, enhancing the robustness and accuracy of classification. To prevent overfitting and improve model generalization, data augmentation techniques such as random rotation between -35° and 35°, random scaling in the x and y directions, and grey-scale preprocessing were applied to the training and validation sets.

Fig. 2.3: Schematic representation of the proposed ensemble model [21]

A robust classification model leveraged transfer learning with DenseNet and was integrated within an embedded healthcare decision support system (DSS) [32]. Preprocessing steps included skull stripping, intensity normalization, and registration to a common reference space. DenseNet-121, pre-trained on the ImageNet dataset, was chosen for its dense connectivity patterns, which enhance feature reuse and mitigate the vanishing gradient problem. The pre-trained DenseNet was fine-tuned on the ADNI dataset by freezing initial layers, adding custom fully connected layers, and incorporating dropout regularization to prevent overfitting. The training process involved data augmentation techniques such as rotation, flipping, and scaling to improve model generalization, with optimal batch size and number of epochs determined experimentally. Despite high performance, limitations included dependency on the quality and availability of labeled MRI data, the substantial computational resources required by the DenseNet architecture, and the need for further validation to ensure generalization to different populations and imaging protocols beyond the ADNI dataset.
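A minimal sketch of the transfer-learning recipe these 2D studies share (load an ImageNet-pretrained backbone, freeze it, and retrain only a replaced classification head) is given below, using PyTorch and recent torchvision; the choice of ResNet-50 and a three-class head are illustrative assumptions rather than the exact configurations of the cited works:

```python
import torch.nn as nn
from torchvision import models

# Load a ResNet-50 pre-trained on ImageNet and freeze its convolutional backbone.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
for param in backbone.parameters():
    param.requires_grad = False

# Replace the final fully connected layer with a new head for AD / MCI / CN.
backbone.fc = nn.Linear(backbone.fc.in_features, 3)

# Only the new head is updated during fine-tuning.
trainable = [p for p in backbone.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable), "trainable parameters")
```

Variants of this recipe differ mainly in which layers are unfrozen and how the extracted features are combined afterwards, for example by feeding them to a bagged-trees ensemble as in the study above.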
3D CNN Based Architectures

3D Convolutional Neural Networks (CNNs) are well-suited for medical imaging tasks involving volumetric data, such as MRI scans. These models process the entire 3D volume, capturing spatial relationships that might be missed by 2D approaches.

A cascaded multi-modal mixing transformer (CM3T) [19] framework was developed for AD classification using incomplete data. The approach integrated 3D CNNs with transformers to handle the spatial complexity of MRI images. The CM3T model processed multi-modal neuroimaging data (structural MRI, fMRI, and PET scans) using dedicated transformer encoders for each modality. These encoders captured modality-specific representations and were fused using a multi-modal mixing module with cross-attention mechanisms (Fig. 2.4). To handle incomplete data, the CM3T model employed a cascaded fusion strategy, progressively combining available modalities. The final multi-modal representation was fed into a classification head for AD stage prediction.

Fig. 2.4: Network architecture of 3MT [19]

Fig. 2.5: Details of the CNN transformer for encoding 3D images [19]

Helaly et al. [13] developed a 3D CNN for AD diagnosis using MRI volumes. Their model captured comprehensive anatomical features by processing the entire 3D brain volume, detecting subtle structural changes indicative of early-stage AD. The architecture consisted of multiple convolutional and pooling layers, followed by fully connected layers for classification. One of the primary strengths of this study lies in its high classification accuracy, particularly in detecting early-stage AD. By utilizing the entire 3D brain volume, the model demonstrated an exceptional ability to capture detailed spatial information, potentially identifying subtle biomarkers that might be missed in 2D slice-based approaches. This comprehensive analysis of brain structure represents a significant advancement in the field of automated AD diagnosis. However, the study also faced several challenges. The 3D CNN approach requires substantial amounts of labeled 3D MRI data for effective training, which can be difficult and expensive to acquire in large quantities. Moreover, processing entire 3D volumes is computationally intensive, demanding significant processing power and memory resources. This could potentially limit the model's applicability in resource-constrained clinical settings. Another limitation noted was the model's propensity for overfitting, especially when trained on smaller datasets. This highlights the critical need for large, diverse datasets in developing robust 3D CNN models for AD diagnosis.

An interpretable deep learning framework [29] using 3D CNNs and MRI data was proposed for AD classification. The model incorporated attention mechanisms to highlight critical brain regions contributing to classification decisions, enhancing interpretability and transparency. The 3D CNN processed full MRI volumes, capturing subtle patterns associated with AD. A significant strength of this study lies in its ability to achieve high classification accuracy while simultaneously providing interpretable results. The attention maps generated by the model offer valuable insights into the neural correlates of AD, potentially aiding clinicians in understanding the structural changes associated with disease progression. This interpretability is crucial for building trust in AI-based diagnostic tools and could facilitate their integration into clinical workflows. However, the study also faced several challenges.
Like many deep learning approaches in medical imaging, the model requires high-quality, diverse training data to perform optimally and generalize well to unseen cases. Ac- quiring such comprehensive datasets in themedical field remains a significant hurdle. Moreover, while the attention mechanisms enhance interpretability, the overall com- plexity of the model can still pose challenges in clinical settings. The intricate nature of deep learning models, even with interpretability features, may make it difficult for non-specialists to fully understand and trust the model’s decisions. This highlights the ongoing challenge of balancing model sophistication with practical clinical appli- cability. Utilizing the 3D CNNs Ebrahimi et al. [9] leveraged the rich spatial information in MRI volumes for AD detection. Their model employedmultiple layers of 3D convolu- tions, followed by pooling and fully connected layers, to capture and process detailed anatomical features. A key strength of this study lies in its superior performance in AD detection compared to traditional 2D approaches. By leveraging the full 3D struc- ture of the brain, the model demonstrated an enhanced ability to identify complex 17 spatial patterns associated with AD. This comprehensive volumetric analysis not only improved overall detection accuracy but also showed promise in enabling earlier and more precise AD diagnosis, a crucial factor in effective treatment and management of the disease. However, the study also faced significant challenges. The use of 3D CNNs necessitates extensive 3DMRI data for effective training, which can be difficult and costly to acquire in large quantities. Moreover, processing entire 3D volumes is computationally intensive, requiring substantial computational resources. This could potentially limit the model’s applicability in resource-constrained clinical settings or research environments without access to high-performance computing facilities. An- other critical consideration is the model’s heavy dependence on the quality of the training data. The effectiveness of the 3D CNN in detecting AD is intrinsically linked to the comprehensiveness and accuracy of the MRI scans used for training. This un- derscores the importance of high-quality, diverse datasets in developing robust and generalizable models for AD detection. Transformer Based Architectures Recent advancements in deep learning have seen the emergence of transformer-based models, which have shown remarkable performance in various domains, including medical image analysis [35]. These models leverage the self-attention mechanism, which allows them to capture complex dependencies and interactionswithin the data, making them particularly well-suited for tasks that require detailed spatial and con- textual understanding, such as Alzheimer’s disease (AD) classification fromMRI im- ages. The M3T model(Multi-Plane Multi-Slice Transformer) [16] combines transformers’ self-attention capabilitieswith amulti-plane andmulti-slice representation of 3DMRI volumes. This approach facilitates capturing complex spatial dependencies within the data. The model slices the 3D MRI volume into multiple 2D planes, which are further divided into smaller 2D slices treated as sequential inputs to the transformer model. The self-attention mechanism integrates information across multiple planes and slices, effectively reconstructing the 3D spatial context from the 2D represen- tations. 
A key strength of this study lies in its significant improvement in classi- fication accuracy over traditional CNN-based approaches. By leveraging the trans- former’s ability to model long-range dependencies, the M3T model demonstrates su- perior performance in capturing intricate spatial patterns and relationships within the brain structure. This enhanced capability is particularly crucial in AD classi- fication, where subtle structural changes can be indicative of disease progression. 18 However, the study also faces several challenges. The complex architecture of the M3T model demands high computational resources, potentially limiting its applica- tion in resource-constrained environments. Additionally, likemany deep learning ap- proaches, the model’s performance is heavily dependent on large, high-quality train- ing datasets, which can be challenging to acquire in the medical imaging domain. Another significant consideration is the model’s complexity, which can reduce its interpretability. While the model achieves high accuracy, the intricate nature of its decision-making process may be difficult for clinicians to understand and trust. This lack of transparency could potentially limit the model’s acceptance and integration into clinical workflows, where interpretability is often crucial for decision-making and patient communication. A novel approach combining pixel-level fusion techniques with vision transformers (ViTs) [25] was developed for early AD detection from MRI scans. ViTs effectively capture global relationships and dependencies within images [8]. Their method in- volves splitting high-resolution MRI images into smaller patches, embedding these into a sequence of tokens, and processing them with the transformer network. The pixel-level fusion technique ensures detailed and precise classification by integrating information from every part of the image. One of the primary strengths of this study lies in its superior performance in early AD detection. By leveraging ViTs, the model demonstrates an exceptional ability to capture long-range dependencies and global contextual information within brainMRI scans. This capability is particularly crucial in detecting subtle, early-stage indicators of AD that might be missed by traditional convolutional neural network (CNN) approaches, which typically focus on local fea- tures. However, the study also faces several significant challenges. The complex ar- chitecture of ViTs, combined with pixel-level fusion techniques, demands high com- putational power and memory resources. This requirement could potentially limit themodel’s applicability in resource-constrained clinical settings or research environ- ments without access to high-performance computing facilities. Another notable lim- itation is the long training times associated with such complex model architectures. This not only impacts the development and iterative improvement of the model but also poses challenges for its adaptation to new datasets or different medical imaging modalities. Furthermore, a critical consideration for clinical applications is the diffi- culty in interpreting the model’s decisions. While the model achieves high accuracy, the intricate nature of transformer architectures and pixel-level fusion makes it chal- lenging to provide clear, interpretable explanations for its classifications. 
This lack of transparency could potentially hinder the model’s acceptance and integration into clinical workflows, where interpretability is often crucial for decision-making and pa- 19 tient communication. The Addformermodel [18], which combinesmultiple transformermodules for multi- modal fusion of different MRI sequences (e.g., T1-weighted and T2-weighted images) for AD detection. Each MRI sequence is processed by a separate transformer mod- ule to capture modality-specific features, and the outputs are fused using additional transformer layers. This approach leverages the strengths of each MRI sequence and captures a comprehensive representation of the brain’s structural characteristics. A key strength of this study lies in its enhanced robustness and accuracy in AD classi- fication. By leveraging multiple MRI sequences, the Addformer model demonstrates an improved ability to detect subtle indicators of AD that might be more prominent in one modality than another. This multi-modal approach provides a more compre- hensive view of brain structure and potential AD-related changes, potentially leading to more accurate and reliable diagnoses. Furthermore, the effective combination of multi-modal data represents a significant advancement in the field. The Addformer’s architecture allows for the integration of complementary information from different MRI sequences, potentially capturing a wider range of AD biomarkers and structural changes associated with the disease progression. However, the study also faces sev- eral challenges. One notable limitation is the requirement for careful preprocessing and alignment of images from different modalities. This prerequisite adds complex- ity to the data preparation phase and may introduce potential sources of error if not handled meticulously. Another significant consideration is the increased computa- tional requirements for both training and inference. The complex architecture of the Addformer, processing multiple MRI sequences through separate transformer mod- ules, demands substantial computational resources. This could potentially limit the model’s applicability in resource-constrained clinical settings or research environ- ments without access to high-performance computing facilities. Moreover, the com- plexity of the Addformer model poses challenges in terms of interpretability. While the model achieves high accuracy, the intricate nature of its decision-making process, involving multiple transformer modules and fusion layers, may be difficult for clini- cians to understand and trust. This lack of transparency could potentially hinder the model’s acceptance and integration into clinical workflows, where interpretability is often crucial for decision-making and patient communication. 2.2.3 Multi-class Classification While binary classification has been a common approach in Alzheimer’s disease (AD) diagnosis, recent research has increasingly focused on multi-class classification to 20 better capture the spectrum of cognitive states associated with the disease. This ap- proach distinguishes not only between Alzheimer’s disease (AD) and cognitive nor- mal (CN) individuals but also includes intermediate stages such as mild cognitive impairment (MCI). Multi-class classification is crucial for developing comprehensive diagnostic tools that can provide more nuanced assessments and support early inter- vention strategies. 
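In implementation terms, the move from binary to three-way classification mainly changes the output layer and the loss. The sketch below (PyTorch, with hypothetical class counts chosen for illustration) shows a three-unit head combined with class-weighted cross-entropy, one common way to counter the uneven prevalence of AD, CN, and MCI noted in Section 1.3.2:

```python
import torch
import torch.nn as nn

# Hypothetical training-set class counts: [AD, CN, MCI]
counts = torch.tensor([300.0, 500.0, 200.0])
weights = counts.sum() / (len(counts) * counts)   # inverse-frequency weighting

# Any backbone producing a feature vector can feed this 3-way head.
head = nn.Linear(512, 3)
criterion = nn.CrossEntropyLoss(weight=weights)

features = torch.randn(8, 512)        # stand-in for CNN features of 8 scans
labels = torch.randint(0, 3, (8,))    # 0 = AD, 1 = CN, 2 = MCI
loss = criterion(head(features), labels)
loss.backward()
```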
2.2.4 Utilization of Single andMulti-modal Approaches The diagnosis and classification of Alzheimer’s disease (AD) have significantly advanced with the development of both single and multi-modal approaches. Each method offers unique benefits and, when combined, they provide a comprehensive and robust diagnostic framework. Single-Modal Approaches Single-modal approaches focus on analyzing data from a single imagingmodality, typ- ically MRI or PET scans. These methods have the advantage of being simpler and less resource-intensive compared to multi-modal approaches, making them more acces- sible in many clinical settings [21], [32], [33]. MRI-Based Approaches • AlzheimerNet [34], a CNN-based model specifically designed to classify AD stages from functional brain changes observed in MRI images. The architec- ture utilized multiple convolutional layers to extract hierarchical features from 2D MRI slices, achieving notable accuracy in distinguishing between different stages of AD. • A CNNmodel from scratch was designed to automate AD detection using MRI images [31]. By focusing on extensive data preprocessing, normalization, and augmentation techniques, their model demonstrated high robustness and accu- racy in classification tasks, effectively handling the variability in MRI data. PET-Based Approach: A 3D CNN framework [29] was developed to classify AD stages using PET images. Their model emphasized the interpretability of predictions, allowing clinicians to understand the decision-making process, thereby increasing the trust and usability of the model in clinical practice. 21 Multi-modal Approaches Multi-modal approaches [30], [38] integrate information from multiple imaging modalities, such as MRI and PET, to leverage the complementary strengths of each. These methods provide a more comprehensive understanding of the brain’s structure and function, enhancing diagnostic accuracy and robustness. Fusion Techniques Combining data from MRI and PET scans, multi-modal approaches use advanced deep learning architectures to integrate and analyze this diverse information. • A proposed multimodal image fusion method [38] demonstrated superior per- formance in classifying Alzheimer’s Disease (AD), Mild Cognitive Impairment (MCI), and Normal Control (NC) compared to single-modality approaches by leveraging both structural information from MRI and functional insights from Positron Emission Tomography (PET). The study utilized the Alzheimer’s Dis- ease Neuroimaging Initiative (ADNI) dataset, including MRI and PET scans. Preprocessing involved skull stripping, intensity normalization, and registra- tion for MRI, and intensity normalization and spatial alignment with MRI for PET. The core methodology included aligning PET images withMRI, extracting features from both using a modified ResNet, and combining features through a weighted averaging fusion strategy. The CNN architecture incorporated a modified ResNet and specialized 3D CNN models, including a U-shaped net- work with skip connections to extract multi-scale features. Data augmentation techniques like rotation, flipping, and scaling were applied to enhance variabil- ity. The model was optimized using the Adam optimizer with a learning rate scheduler and categorical cross-entropy loss function, reflecting a comprehen- sive approach to training and optimization. 
Despite high performance on the ADNI dataset, limitations include dependency on MRI and PET image quality and availability, additional computational complexity, and generalization chal- lenges to other datasets and real-world scenarios. • Addformer [18], a transformer-based model that fuses information from differ- ent MRI sequences. This model utilized multiple transformer modules to in- tegrate data across various imaging modalities, enhancing robustness and ac- curacy in AD detection. The study demonstrated the potential of transformer- based models in leveraging multi-modal data for comprehensive AD classifica- tion. 22 Hybrid and Ensemble Methods Hybrid approaches combine different deep learning models to utilize their respective strengths, often leading to superior performance in AD diagnosis. • A cascaded multi-modal mixing transformer framework [19] that combines 3D CNNs with transformers. This hybrid method effectively handles the spatial complexity ofMRI images, achieving robust classification evenwith incomplete data. The integration of CNNs and transformers highlighted the potential of hybrid models in improving diagnostic performance. • A pixel-level fusion approach using vision transformers [25] for early AD detec- tion. By processingMRI images at the pixel level, their model achieved detailed and precise classification, underscoring the advantages of transformers in han- dling high-resolution medical images. The study’s multi-modal classification framework effectively distinguished between AD, MCI, and CN classes. Transfer Learning and Domain Adaptation Transfer learning, where models pre-trained on large datasets are fine-tuned for specific tasks, has also been instrumen- tal in multi-modal approaches. It allows the models to leverage learned features from extensive, general-purpose datasets, reducing the need for large amounts of domain- specific data. While multi-modal approaches offer significant advantages, they also present chal- lenges such as increased computational complexity, the need for synchronized multi- modal datasets, and the difficulty of integrating diverse data types. However, ongoing advancements in deep learning and computational power are addressing these chal- lenges, paving the way for more efficient and effective multi-modal diagnostic tools. In summary, the utilization of single and multi-modal approaches has greatly en- riched the field of AD diagnosis. Single-modal methods, particularly those based on MRI and PET, provide valuable insights into the brain’s structure and function. Multi- modal approaches, by integrating these insights, offer a more comprehensive and ac- curate diagnostic framework, ultimately enhancing the early detection and manage- ment of Alzheimer’s disease. 2.2.5 Preprocessing Techniques Preprocessing is crucial for the accuracy and efficiency of Convolutional Neural Net- work (CNN)-based models in MRI image analysis. Recent trends and popular tech- niques include: 23 • Intensity Normalization: Standardizes image intensities to aid feature learn- ing. Common methods include Z-score normalization (mean of zero, standard deviation of one) and Min-Max scaling (fixed range, typically [0, 1] or [-1, 1]). • Bias Field Correction: Corrects intensity inhomogeneities using algorithms like N4ITK to enhance image consistency [40]. • Skull Stripping: Removes non-brain tissues to focus on brain structures, im- proving classification accuracy. Common tools are Brain Extraction Tool (BET) from FSL [37], BrainSuite [36], and FreeSurfer [11]. 
• Spatial Normalization: Aligns MRI images to a common reference space, such as the MNI space, using affine or non-linear transformations to account for anatomical variability [6].
• Segmentation: Divides MRI images into tissue types (e.g., gray matter, white matter, cerebrospinal fluid) to isolate relevant brain structures. Tools include SPM [3] and FSL [37].
• Data Augmentation: Increases training dataset size and model robustness through random transformations such as rotation, translation, scaling, flipping, and adding Gaussian noise.
• Smoothing: Reduces noise and enhances the signal-to-noise ratio using techniques like Gaussian smoothing to better delineate brain structures.
• Patch Extraction: Divides 3D MRI volumes into smaller patches for input to the CNN, reducing computational load and focusing on local features.
• Histogram Equalization: Enhances contrast by redistributing intensity values, improving visibility of brain structures for better feature learning.
• Deep Learning-based Preprocessing: Utilizes autoencoders and Generative Adversarial Networks (GANs) for denoising, normalizing, and enhancing MRI images in a data-driven manner.
• Multimodal Image Fusion: Combines different MRI modalities (e.g., T1-weighted, T2-weighted) through alignment and integration to provide richer information for better classification performance.
• Domain Adaptation Techniques: Addresses variations between training and testing data using techniques like adversarial domain adaptation, improving generalization by making learned features invariant to differences in scanners or protocols.
2.3 Summary and Limitations of Existing Architectures
In order to provide a concise comparison and highlight the key contributions of the representative studies analyzed in detail, which are aligned with our proposed methodology, we have summarized their core methodologies, results, and comments in Table 2.1. This table offers a clear overview of the advancements and limitations observed in these studies, facilitating a better understanding of the current state of Alzheimer’s disease classification using deep learning approaches.
In summary, the reviewed literature underscores the effectiveness of advanced deep learning models, including 2D CNNs, 3D CNNs, and transformer-based classifiers, in Alzheimer’s disease detection. However, each of these architectures has inherent limitations:
• 2D CNNs: These models analyze individual 2D slices of MRI images, often failing to capture the interdependencies among slices. This slice-by-slice approach can lead to a loss of crucial 3D spatial information, which is vital for accurate AD diagnosis. Additionally, 2D CNNs can be computationally expensive when processing multiple slices separately.
• 3D CNNs: While 3D CNNs are capable of analyzing volumetric data, they suffer from the "Curse of Dimensionality." The high number of parameters in these models makes them prone to overfitting, especially with limited training data.
The non-convex nature of neural networks further complicates the learning process, reducing the chances of finding optimal parameters.
• Transformer-based Classifiers: These models leverage self-attention mechanisms to capture long-range dependencies in data. However, transformers require large amounts of data and computational resources for training, making them less suitable for datasets with limited samples. Additionally, they can be sensitive to the quality of input data and preprocessing techniques.
Table 2.1: Summary of Representative Works on Alzheimer’s Disease Classification
Loddo et al. [21]
Core Methodology: ensemble of multiple CNNs; pre-trained on ImageNet; simple average function for the ensemble model (averaging the top 3).
Results: binary-class accuracy: 99.29%; dataset: ADNI.
Comments: introduction of ensemble learning; significant improvement in accuracy.
Song et al. [38]
Core Methodology: multimodal image fusion (MRI + PET) to create "GM-PET" images; 3D CNN with a U-Net-like architecture.
Results: binary-class accuracy: 94.11%; multi-class accuracy: 74.54%; dataset: ADNI.
Comments: highlights the benefit of multimodal data; need for additional data for optimization.
Qiu et al. [30]
Core Methodology: hybrid approach (CNN + CatBoost); integration of imaging and non-imaging data; trained a CNN model on MRI data to compute cognitive scores.
Results: multi-class test accuracy: (87.9 ± 1.3)%; multiple independent datasets.
Comments: omitted CNN architecture details for the MRI model; extensive validation.
Saleh et al. [32]
Core Methodology: transfer learning with DenseNet; data augmentation.
Results: multi-class training accuracy: 96.05%; multi-class testing accuracy: 90.01%; dataset: Kaggle.
Comments: dataset seems to contain less challenging data than ADNI; indications of overfitting.
Chapter 3
Proposed Methodology
3.1 Architecture Overview
In this study, we employ an ensemble learning approach to enhance the classification performance of 3D MRI images for Alzheimer’s Disease. The ensemble is composed of three distinct models, each specifically trained to process one of the three primary anatomical planes: coronal, sagittal, and axial. This approach, which we refer to as ‘plane-wise ensemble’, is designed to leverage the unique structural information inherent in each imaging plane. By integrating the outputs from models trained on these different planes, the approach aims to provide a more comprehensive and accurate classification than any single model could achieve alone. Fig. 3.1 provides a visual representation of this ensemble learning framework, highlighting the workflow and integration of the different models.
The rationale behind this plane-wise ensemble approach lies in the fact that different anatomical planes can reveal complementary aspects of brain structure and pathology. The coronal plane captures the frontal and posterior regions, the sagittal plane provides a lateral view, and the axial plane offers a top-down perspective. Each plane emphasizes different anatomical features, which, when combined, enhance the overall diagnostic accuracy.
This method is particularly advantageous in the context of 3D MRI image analysis, where the complexity and variability of brain structures necessitate a robust and multifaceted approach. By utilizing an ensemble of models, each attuned to specific structural information, our approach mitigates the limitations of individual models and capitalizes on the strengths of each perspective.
Fig. 3.1: Diagram comparing a regular ensemble model (c) with our proposed plane-wise ensemble technique (f). Panels: (a) training (forward pass), (b) evaluation, (c) regular ensemble model; (d) training (forward pass), (e) evaluation, (f) plane-wise ensemble technique.
3.2 Model Architecture
The three specialized models are trained on distinct MRI slice orientations to enhance Alzheimer’s Disease classification. The Coronal Model focuses on coronal slices, capturing frontal to posterior brain structures such as the lateral ventricles and frontal lobes, which aids in identifying features unique to this plane. The Sagittal Model uses sagittal slices to highlight midline structures like the corpus callosum and brainstem, essential for detecting lateralized features and asymmetries. The Axial Model, trained on axial slices, captures horizontal structures including the cerebral cortex and basal ganglia, improving the detection of cortical thickness and basal ganglia configuration changes. Each model’s specialization in its respective plane enhances its ability to identify Alzheimer’s Disease-related patterns.
This segmentation of training data by plane is intended to enhance the models’ ability to understand and interpret slice-specific characteristics. By training models on specific imaging planes, each model can capture intricate details and features unique to its respective plane. This targeted approach minimizes internal confusion and maximizes the ability to identify subtle abnormalities, leading to an improvement in overall ensemble performance.
Each model’s outputs are then integrated to form an ensemble, capitalizing on the strengths of each plane-specific model. The coronal, sagittal, and axial models collectively contribute to a more comprehensive analysis, with each model providing insights from different anatomical perspectives. This ensemble approach ensures that the final classification leverages the diverse and complementary information obtained from all three planes, resulting in a more robust and accurate diagnosis.
3.3 Model Integration
In the traditional ensemble model, when a single slice is provided as input, the models must first determine the anatomical plane to which the slice belongs before proceeding with classification. This approach effectively transforms the original 3-class classification problem into a more complex 9-class classification problem, as the model must now account for three planes for each class.
Unlike these traditional ensembles that evaluate individual 2D slices, our plane-wise ensemble processes entire 3D volumes, thereby providing a more comprehensive context for accurate classification. During the evaluation phase of the ensemble, an entire 3D MRI volume is used as input. This approach leverages the full spatial context of the MRI data, capturing inter-slice relationships and volumetric features that are critical for accurate disease classification.
The 3D MRI volume is automatically sliced into the three primary anatomical planes: coronal, sagittal, and axial. Each set of slices corresponding to these planes is then fed into its respective specialized model within the ensemble, as illustrated in the sketch below. The coronal slices are processed by the coronal model, sagittal slices by the sagittal model, and axial slices by the axial model. This ensures that each model can apply its specialized knowledge to the appropriate set of slices, enhancing the detection of plane-specific features.
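To make the slicing and routing concrete, the sketch below shows one way this step could look in code. It is our illustration, not the thesis implementation: the function names, the axis-to-plane mapping, and the stub classifiers are assumptions, and a real model would return softmax probabilities from a trained CNN.

```python
import numpy as np

def plane_slices(volume):
    """Split a 3D MRI volume of shape (D, H, W) into stacks of 2D slices,
    one stack per anatomical plane. The axis-to-plane mapping is an
    assumption here; it depends on how the volume was loaded."""
    axial    = [volume[i, :, :] for i in range(volume.shape[0])]
    coronal  = [volume[:, j, :] for j in range(volume.shape[1])]
    sagittal = [volume[:, :, k] for k in range(volume.shape[2])]
    return {"coronal": coronal, "sagittal": sagittal, "axial": axial}

def predict_volume(volume, models):
    """Route each plane's slices to its specialized model and average the
    slice-level class probabilities within each plane. The per-plane outputs
    are later fused with the weighted average described in Section 3.3."""
    per_plane = {}
    for plane, slices in plane_slices(volume).items():
        probs = np.stack([models[plane](s) for s in slices])   # (n_slices, 3)
        per_plane[plane] = probs.mean(axis=0)                  # (3,) per plane
    return per_plane

# Stub classifiers standing in for the trained coronal/sagittal/axial CNNs.
def dummy_model(slice_2d):
    return np.full(3, 1.0 / 3.0)   # uniform AD/CN/MCI probabilities

models = {"coronal": dummy_model, "sagittal": dummy_model, "axial": dummy_model}
print(predict_volume(np.random.rand(64, 64, 64), models))
```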
Following the classification of the slices by their respective models, the ensemble integrates the predictions by calculating a weighted average of the probabilities obtained from each model. These weights are not arbitrarily assigned; rather, they are learned through a separate training phase designed to optimize the combination of model outputs. This weighted averaging allows the ensemble to balance the contributions of each model according to its performance and relevance to the final classification.
By combining the strengths of the coronal, sagittal, and axial models, the plane-wise ensemble provides an accurate classification of the 3D MRI volume. This integration ensures that the diverse and complementary information from each anatomical plane is utilized effectively, leading to a more robust and reliable diagnosis.
3.4 Projector Functions
Projector functions play a critical role in the proposed plane-wise ensemble technique by transforming high-dimensional 3D MRI images into 2D inputs suitable for Convolutional Neural Network (CNN) models.
A projector function is a mathematical function that maps three-dimensional (3D) volumetric data into a series of two-dimensional (2D) images. Formally, let 𝑉 ⊆ ℝ³ represent the domain of 3D volumetric data. A projector function 𝑃 can be defined as follows:

𝑃 : 𝑉 → (ℝ²)ⁿ    (3.1)

where (ℝ²)ⁿ denotes an 𝑛-tuple of elements in ℝ², representing an ordered sequence of 2D images. For any point 𝐯 ∈ 𝑉, the projector function 𝑃 maps 𝐯 to a sequence of 2D images (𝐢₁, 𝐢₂, …, 𝐢ₙ) such that 𝐢ⱼ ∈ ℝ² for 𝑗 = 1, 2, …, 𝑛.
In the context of MRI image classification, these functions are used to decompose a 3D MRI scan into three sets of 2D planes: axial, coronal, and sagittal. This decomposition allows for the application of 2D CNN models, which are less computationally intensive than their 3D counterparts.
Projector functions are impactful for several reasons:
• Reducing Computational Complexity: Handling 3D MRI data directly with 3D CNNs can be computationally prohibitive due to the high dimensionality of the data. By projecting the 3D images into 2D planes, the projector functions significantly reduce the computational complexity, making it feasible to train and deploy CNN models on standard hardware.
• Leveraging the Strengths of 2D CNNs: 2D CNNs are well-established and widely used in image classification tasks, benefiting from extensive research and optimization. Projector functions enable the use of these robust 2D CNN architectures by converting the 3D MRI data into a format that these models can process effectively.
• Maintaining Essential Structural Information: Despite reducing the data dimensionality, projector functions try to ensure that the essential structural information of the brain is retained. By extracting slices in three orthogonal planes, the functions capture comprehensive views of the brain’s anatomy, which is crucial for accurate disease diagnosis.
• Enhancing Classification Accuracy and Efficiency: The use of projector functions, in conjunction with an ensemble of CNN models trained on different planes, enhances the overall classification accuracy and efficiency. This approach allows the models to focus on specific aspects of the brain’s structure, leveraging the strengths of each plane orientation to improve diagnostic performance.
Projector functions are a pivotal component of the proposed plane-wise ensemble technique for MRI image classification.
By transforming 3D MRI data into 2D slices, they facilitate the use of efficient and accurate 2D CNN models, thereby addressing the computational challenges associated with 3D MRI data. This approach enhances the performance and reliability of Alzheimer’s disease diagnosis, contributing to the field of medical imaging.
3.4.1 Midplane Projector
This function selects the middle slice of the 3D volume.
A 3D MRI scan can be considered as a stack of 2D images (slices) layered on top of each other. The midplane projector picks the slice that is exactly in the middle of this stack. This slice is considered representative of the entire volume and often captures a central view of the brain’s structure.
Fig. 3.2: Midplane Projections: Axial Plane
3.4.2 Average Projector
This function calculates the average of all slices in the 3D volume.
The average projector combines information from all slices by taking the average value for each pixel across the stack of slices. The resulting 2D image represents an averaged view, where each pixel’s value is the mean of the corresponding pixels in the original 3D volume. This method aims to capture the overall intensity patterns in the brain.
Fig. 3.3: Average Projections: Axial Plane
3.4.3 Max Variance Projector
This function selects the slice with the highest variance.
Variance measures how much the pixel values differ from the mean value within a slice. The max variance projector identifies the slice where the pixel values vary the most, indicating a high level of structural detail and differences.
Fig. 3.4: Max Variance Projections: Axial Plane
3.4.4 Variance Weighted Average Projector
This function creates a weighted average of all slices, giving more importance to slices with higher variance.
Instead of treating all slices equally, the variance weighted average projector assigns greater weight to slices with higher variance, meaning those with more detail and differences. It then computes a weighted average, where each slice contributes to the final 2D image based on its variance. This approach aims to enhance the overall detail and information content in the projected image by emphasizing the most informative slices. Importantly, slices with very low variance, such as those that are fully black and contain no information, are assigned a weight of zero. This means they do not contribute to the final average, ensuring that only the slices with meaningful information are used in the projection.
Fig. 3.5: Variance Weighted Average Projections: Axial Plane
3.4.5 Linear Learnable (LL) Projector
This function learns the optimal weights to assign to each slice in the 3D volume through a training process and then creates a weighted average of all slices using those weights.
Instead of manually defining the weights based on variance or other criteria, the LL projector uses a machine learning model to determine how much weight each slice should contribute to the final 2D image.
The working of this model can be seen as similar to an encoder-decoder model. The LL projector acts as the encoder, learning to compress and represent the 3D volume into a 2D plane through learned weights. The classifier that uses the 2D projection is analogous to the decoder, interpreting the 2D representation to make predictions. During training, both the encoder and decoder update their weights based on the training samples and backpropagation. While evaluating novel images, the LL projector uses the learned weights to project the 3D volume into 2D images, which are then fed to the plane-wise models to make the prediction.
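The following minimal sketch shows how the five projectors described above could be realized for a single plane. It is our illustration rather than the thesis code: the axis convention and function names are assumptions, and the Linear Learnable projector is shown with externally supplied weights, whereas in the actual pipeline those weights are learned jointly with the classifier through backpropagation.

```python
import numpy as np

def midplane(volume, axis=0):
    """Midplane projector: take the central slice along the given axis."""
    return np.take(volume, volume.shape[axis] // 2, axis=axis)

def average(volume, axis=0):
    """Average projector: per-pixel mean of all slices along the axis."""
    return volume.mean(axis=axis)

def max_variance(volume, axis=0):
    """Max variance projector: the slice whose pixel values vary the most."""
    moved = np.moveaxis(volume, axis, 0)                    # (n_slices, H, W)
    variances = moved.reshape(moved.shape[0], -1).var(axis=1)
    return moved[variances.argmax()]

def variance_weighted_average(volume, axis=0, eps=1e-8):
    """Variance-weighted average: high-variance slices get more weight;
    (near-)zero-variance slices contribute (almost) nothing."""
    moved = np.moveaxis(volume, axis, 0)
    w = moved.reshape(moved.shape[0], -1).var(axis=1)
    w = w / (w.sum() + eps)
    return np.tensordot(w, moved, axes=1)                   # weighted sum over slices

def linear_learnable(volume, weights, axis=0):
    """Linear Learnable projector: a weighted sum whose weights are, in the
    actual pipeline, optimized by backpropagation; here they are passed in."""
    moved = np.moveaxis(volume, axis, 0)
    return np.tensordot(weights, moved, axes=1)

vol = np.random.rand(64, 64, 64)            # toy volume; axis 0 assumed axial
print(midplane(vol).shape, variance_weighted_average(vol).shape)
```

In the proposed pipeline, each projector is applied along all three anatomical axes, producing one 2D input per plane-specific model.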
Chapter 4
Results and Discussion
In this chapter, we discuss the dataset, experimental setup, and analysis process, emphasizing the advancements achieved in classification accuracy. It begins with a detailed explanation of dataset selection, preprocessing steps, and the methodology for splitting the data into training, validation, and test sets. Following this, we evaluate the performance of several widely-used convolutional neural network architectures, including AlexNet, VGG-16, and ResNet-50, using a comprehensive set of classification metrics. The chapter then introduces our novel plane-wise ensemble approach, explaining its implementation and demonstrating its advantages over traditional ensemble models. Through detailed performance metrics and comparative analysis, we demonstrate the effectiveness of our proposed method in improving classification accuracy across the three primary classes. The chapter concludes with a discussion of key evaluation metrics, providing a comprehensive view of model performance.
4.1 Dataset
The dataset utilized for our experiments was sourced from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset, which is widely recognized as a benchmark dataset for Alzheimer’s Disease classification. The ADNI dataset includes a variety of imaging data, clinical information, and biomarkers collected from participants over multiple visits. This rich dataset is crucial for developing and validating models aimed at detecting and classifying Alzheimer’s Disease.
As detailed in Table 4.1, the dataset comprises three primary classes: Alzheimer’s Disease (AD), Cognitive Normal (CN), and Mild Cognitive Impairment (MCI). Each class includes a significant number of samples, with the MCI class having the largest representation.
Table 4.1: Dataset Distribution
Class / Number of Samples
AD (Alzheimer’s Disease): 602
CN (Cognitive Normal): 998
MCI (Mild Cognitive Impairment): 1832
One important aspect of the ADNI dataset is that it includes multiple imaging data points for individual patients collected across different visits. To ensure the integrity of the experimental results and prevent data leakage, we grouped all images from the same patient together. During the process of splitting the data into training, validation, and test sets, all images from a single patient are kept within the same split. This approach ensures that the models are evaluated on entirely unseen patients, thus providing a more realistic assessment of their generalizability and performance.
This careful handling of the dataset helps in maintaining the robustness of the training and evaluation process, ensuring that the models are not inadvertently trained on data that could appear in the test set. This practice is crucial for developing reliable models for Alzheimer’s Disease classification, as it closely mirrors real-world scenarios where models must generalize well to new patients.
4.2 Experimental Setup
4.2.1 Data Preparation
Data Preprocessing
As illustrated in Fig. 4.1, the acquired MRI images underwent skull-stripping using the FreeSurfer software. FreeSurfer plays a vital role in preparing structural MRI data for Alzheimer’s disease (AD) classification, offering a comprehensive pipeline to extract relevant features from brain images.
The following steps outline the preprocessing process:
• Intensity Normalization: The intensity values of the input MRI scans are normalized to ensure consistency across different acquisitions. This step corrects variations in signal intensity caused by scanner differences and acquisition protocols.
• Denoising Mechanisms: FreeSurfer incorporates denoising mechanisms to reduce noise and enhance image quality. This includes the use of non-local means denoising, which preserves important anatomical details while effectively removing noise. This step is crucial for improving the accuracy of subsequent image processing tasks.
• Skull Stripping and Semantic Segmentation: FreeSurfer employs a unified approach for skull stripping and semantic segmentation. By leveraging probabilistic atlases and machine learning algorithms, it accurately delineates brain structures while removing non-brain tissues such as the skull, scalp, and dura mater. This step is crucial for eliminating extraneous structures and focusing on relevant brain regions for AD classification.
• Tissue Segmentation: FreeSurfer performs tissue segmentation to classify voxels in the brain into different tissue types, including gray matter, white matter, and cerebrospinal fluid (CSF). This segmentation provides valuable information for subsequent analyses and feature extraction.
• Cortical Surface Reconstruction: FreeSurfer reconstructs the cortical surface of the brain from MRI data, identifying the pial surface (outer boundary of the cortex) and the white matter surface (inner boundary of the cortex). Accurate cortical surface reconstruction is essential for capturing cortical morphological changes associated with AD.
• Parcellation and Labeling: Automated parcellation of the cerebral cortex into distinct anatomical regions is performed, enabling detailed analysis of cortical morphology and regional differences. Additionally, subcortical segmentation is carried out to delineate structures such as the hippocampus and basal ganglia, which are implicated in AD pathology.
By employing this preprocessing pipeline, FreeSurfer prepares structural MRI data for AD classification studies, extracting relevant features and facilitating accurate analysis of brain morphometry and pathology.
Train-Test Split
The dataset underwent a partitioning process into train, validation (val), and test sets following an 8:1:1 ratio. To create the 5 folds for validation, each fold was generated independently by randomly selecting a subset of samples from the dataset, ensuring that the classes are balanced within each subset. Each fold was produced using a ’sampling with replacement’ approach, ensuring that the folds are independent of each other.
Fig. 4.1: Skull-stripping: The process of removing non-brain tissues from MRI images using semantic segmentation
The process of generating each fold involved the following steps:
• Randomly shuffle the dataset to ensure a fair distribution of samples.
• From the shuffled dataset, randomly select a subset such that the class distribution is balanced.
• Assign 80% of this subset to the training set, 10% to the validation set, and 10% to the test set.
• Repeat this process five times, independently, to create five separate folds.
To prevent data leakage, rigorous precautions were taken to ensure that multiple images from the same patient did not inadvertently end up across different splits during the random split generation, as sketched below.
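One way to enforce this constraint is a group-aware split keyed on patient identifiers. The snippet below is our illustration, not the thesis code: the variable names and data are hypothetical, scikit-learn's GroupShuffleSplit is just one possible tool, and it shows only the grouping idea rather than the balanced, five-fold sampling procedure described above.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical arrays: one entry per MRI volume.
image_paths = np.array(["scan_%03d.nii" % i for i in range(30)])
labels      = np.random.choice(["AD", "CN", "MCI"], size=30)
patient_ids = np.random.randint(0, 10, size=30)   # several scans per patient

# First hold out roughly 10% of patients for the test set...
outer = GroupShuffleSplit(n_splits=1, test_size=0.10, random_state=0)
trainval_idx, test_idx = next(outer.split(image_paths, labels, groups=patient_ids))

# ...then 1/9 of the remaining patients for validation (about 10% overall),
# so every image of a given patient stays in exactly one split.
inner = GroupShuffleSplit(n_splits=1, test_size=1 / 9, random_state=0)
train_idx, val_idx = next(inner.split(image_paths[trainval_idx],
                                      labels[trainval_idx],
                                      groups=patient_ids[trainval_idx]))
train_idx = trainval_idx[train_idx]
val_idx   = trainval_idx[val_idx]

# Sanity check: no patient appears in both the training and test sets.
assert set(patient_ids[train_idx]).isdisjoint(patient_ids[test_idx])
print(len(train_idx), len(val_idx), len(test_idx))
```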
This meticulous approach was crucial for maintaining the integrity of the evaluation process. The distribution of data across these subsets is delineated in Table 4.2, providing transparency regarding the allocation of samples for training and evaluation. 4.2.2 Hyper-parameter Settings The plane-wise ensemble model was trained using the following hyperparameters: • Batch Size: 16 The batch size determines the number of samples processed before the model’s 38 Table 4.2: Data Split Distribution Fold Set AD Samples MCI Samples CN Samples 1 Train 482 484 484 Val 54 52 51 Test 57 53 52 2 Train 480 484 482 Val 51 47 49 Test 61 58 56 3 Train 474 474 477 Val 66 64 62 Test 53 53 49 4 Train 474 477 473 Val 63 61 62 Test 57 54 57 5 Train 495 496 495 Val 49 47 47 Test 56 55 55 internal parameters are updated. A batch size of 16 strikes a balance between computational efficiency and the stability of the gradient descent process, enabling effective learning without overwhelming the memory capacity of the training hardware. • MaximumNumber of Epochs: 100 An epoch refers to one complete pass through the entire training dataset. Setting the maximum number of epochs to 100 allows the model sufficient iterations to learn from the data while preventing excessive training time and overfitting. • Early Stopping: Implemented to prevent overfitting, with a patience threshold set at 10 epochs Early stopping is a regularization technique used to terminate training when the model’s performance on a validation set stops improving. The patience pa- rameter of 10 epochs means that training will halt if there is no improvement in the validation loss for 10 consecutive epochs, thereby avoiding overfitting and reducing unnecessary computation. • Initial Learning Rate: 0.001 The learning rate controls the step size at each iteration while moving toward a minimum of the loss function. An initial learning rate of 0.001 is chosen as it is small enough to ensure stable convergence and large enough to expedite the learning process. 39 • Scheduler Step Size: 7 epochs The learning rate scheduler reduces the learning rate by a factor (gamma) ev- ery 7 epochs. This step size ensures periodic adjustments to the learning rate, facilitating finer learning adjustments as training progresses. • Gamma (Scheduler Factor): 0.1 The gamma parameter is the factor by which the learning rate is reduced. A gamma of 0.1 means the learning rate is multiplied by 0.1 every 7 epochs, allow- ing the model to fine-tune its weights with smaller learning rates in later stages of training for better accuracy. • Loss Function: Cross-entropy loss Cross-entropy loss is employed due to its effectiveness in multiclass classifica- tion tasks. It measures the performance of the classification model whose out- put is a probability value between 0 and 1, helping to quantify the difference between predicted probabilities and the actual class labels. • Optimizer: Adam Adam optimizer is used as it is the most popular default choice of optimizer in deep learning. It dynamically adjusts learning rates for individual parameters based on gradient magnitudes, smoothing convergence and reducing oscilla- tions. Incorporating momentum, it aids in navigating loss landscapes, helping to escape shallow local optima. These features mitigate issues like jittering and local optima, enhancing training stability and speed. • Ensemble Weights Training: The weights of the ensemble were fine-tuned separately for 30 epochs using the same hyperparameter settings. 
After training the individual models, the ensemble weights are optimized to combine the outputs of the models effectively. This separate training for 30 epochs with consistent hyperparameters ensures that the ensemble can leverage the strengths of each model and improve overall prediction accuracy.
4.3 Quantitative Evaluation
4.3.1 Evaluation Metrics
In the context of evaluating machine learning models, especially in classification tasks, several metrics are employed to gauge the performance of the model. These metrics offer insights into different aspects of the model’s predictive capabilities. This section delves into the specifics of accuracy, precision, recall, F1 score, and AUC-ROC, with a particular emphasis on their application in multi-class classification scenarios.
Accuracy
Accuracy is the simplest and most intuitive metric, representing the proportion of correctly classified instances out of the total instances. It is calculated as follows:

Accuracy = (Number of Correct Predictions) / (Total Number of Predictions)    (4.1)

In a multi-class classification setting, accuracy alone may not provide a complete picture, especially if the class distribution is imbalanced. For example, if one class dominates the dataset, a model that always predicts the majority class could still achieve high accuracy but would perform poorly on the minority classes.
Precision
Precision measures the proportion of true positive predictions among all positive predictions made by the model. It is an important metric when the cost of false positives is high. For multi-class classification, precision is calculated for each class individually:

Precisionᵢ = TPᵢ / (TPᵢ + FPᵢ)    (4.2)

where TPᵢ and FPᵢ are the true positives and false positives for class 𝑖, respectively. The overall precision for the model can be obtained by averaging the precision values for each class (macro-averaging) or by weighting them by the number of instances in each class (weighted averaging).
Recall
Recall, also known as sensitivity or true positive rate, measures the proportion of true positive predictions among all actual positive instances. It is crucial when the cost of false negatives is high. For multi-class classification, recall is calculated for each class as follows:

Recallᵢ = TPᵢ / (TPᵢ + FNᵢ)    (4.3)

where TPᵢ and FNᵢ are the true positives and false negatives for class 𝑖, respectively. Similar to precision, overall recall can be obtained through macro-averaging or weighted averaging.
F1 Score
The F1 score is a metric that combines precision and recall into a single number by calculating their harmonic mean. It is particularly useful when dealing with imbalanced datasets, where one class may be significantly more frequent than others. In such cases, relying solely on accuracy can be misleading, as a model might perform well overall by favoring the majority class but poorly on the minority class.
The F1 score is also known as the Dice Score or Dice Coefficient in the context of certain applications like image segmentation. It is calculated for each class in a multi-class classification problem using the following formula:

F1 Scoreᵢ = (2 · Precisionᵢ · Recallᵢ) / (Precisionᵢ + Recallᵢ)    (4.4)

The harmonic mean of two numbers, a and b, is given by:

H(a, b) = 2 / (1/a + 1/b) = 2ab / (a + b)    (4.5)

One of the key properties of the harmonic mean is that it tends to be closer to the smaller of the two numbers. This is beneficial in the context of the F1 score because it penalizes models that have a significant imbalance between precision and recall.
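As a small numerical illustration (the numbers are ours, chosen only for the example):

```python
# Suppose a class is predicted with precision 0.9 but recall 0.3.
precision, recall = 0.9, 0.3
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean = 0.45
arithmetic_mean = (precision + recall) / 2           # = 0.60
print(f1, arithmetic_mean)  # F1 stays close to the weaker of the two metrics
```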
In other words, if a model has high precision but low recall, or vice versa, the F1 score will be closer to the lower value, highlighting the model’s deficiency. AUC-ROC The Receiver Operating Characteristic (ROC) curve is a graphical representation of a model’s diagnostic ability. It plots the true positive rate (recall) against the false positive rate (1 - specificity) at various threshold settings. The Area Under the ROC Curve (AUC-ROC) summarizes the performance of the model across all thresholds: AUC-ROC = ∫ 1 0 ROC curve(𝑥)𝑑𝑥 (4.6) 42 For multi-class classification, a common approach is to compute the ROC curve and AUC for each class against all other classes (one-vs-rest) and then average the results. • True Positive Rate (TPR) or Recall: This is the 𝑦-axis of the ROC curve and represents the proportion of actual positives correctly identified by the model. • False Positive Rate (FPR): This is the 𝑥-axis of the ROC curve and repre- sents the proportion of actual negatives incorrectly identified as positives by the model. A model with a high AUC-ROC value (closer to 1) indicates better performance, as it suggests that the model has a good measure of separability between the classes. While the confusion matrix provides information about the actual counts of true pos- itives, false positives, true negatives, and false negatives, it is dependent on a spe- cific threshold. In contrast, the AUC-ROC offers a threshold-independent evaluation, summarizing the model’s performance across all possible thresholds. This makes the AUC-ROCamore robust and comprehensivemetric for assessingmodel performance, especially when comparing models. In summary, each metric provides unique insights into different aspects of model performance. Accuracy offers a general overview, while precision and recall provide deeper insights into the model’s behavior with respect to false positives and false neg- atives. The F1 score balances precision and recall, and the AUC-ROC provides a com- prehensive evaluation across all thresholds. Together, these metrics form a holistic view of the model’s performance in multi-class classification scenarios. 4.3.2 Baseline Models We employed common 2D CNN models pretrained on the ImageNet [7] dataset as well as a traditional ensemble model, where each model was trained on the entirety of MRI slices, and predictions were combined through probability averaging during evaluation. AlexNet Table 4.3 shows the metrics for accuracy, classwise precision, recall, F1 score, and AUC-ROC for each of the 5 folds for AlexNet. 43 Table 4.3: Performance Analysis for AlexNet Fold Accuracy (%) Class Precision (%) Recall (%) F1 Score (%) AUC-ROC (%) 1 60.494 AD 74.000 64.912 69.159 80.033 MCI 45.946 65.385 53.968 61.626 CN 71.053 50.943 59.341 71.733 2 60.000 AD 70.833 55.738 62.385 74.590 MCI 46.667 62.500 53.435 66.161 CN 69.231 62.069 65.455 78.912 3 59.355 AD 86.111 58.491 69.663 77.839 MCI 43.210 71.429 53.846 66.712 CN 68.421 49.057 57.143 71.162 4 55.357 AD 61.702 50.877 55.769 64.738 MCI 43.750 61.404 51.095 62.162 CN 70.732 53.704 61.053 70.110 5 60.843 AD 70.370 67.857 69.091 75.292 MCI 45.161 50.909 47.863 58.624 CN 70.000 63.636 66.667 71.122 VGG-16 Table 4.4 shows the metrics for accuracy, classwise precision, recall, F1 score, and AUC-ROC for each of the 5 folds for VGG-16. 
Table 4.4: Performance Analysis for VGG-16 Fold Accuracy (%) Class Precision (%) Recall (%) F1 Score (%) AUC-ROC (%) 1 61.111 AD 79.070 59.649 68.000 79.866 MCI 47.297 67.308 55.556 69.073 CN 66.667 56.604 61.224 68.929 2 62.857 AD 76.471 63.934 69.643 74.863 MCI 48.718 67.857 56.716 66.221 CN 71.739 56.897 63.462 73.637 3 61.935 AD 88.889 60.377 71.910 72.660 MCI 46.575 69.388 55.738 67.077 CN 65.217 56.604 60.606 73.825 4 66.667 AD 80.000 56.140 65.979 79.501 MCI 53.846 73.684 62.222 73.921 CN 76.000 70.370 73.077 80.815 5 61.446 AD 70.833 60.714 65.385 75.633 MCI 46.875 54.545 50.420 60.164 CN 70.370 69.091 69.725 80.655 ResNet-50 Table 4.5 shows the metrics for accuracy, classwise precision, recall, F1 score, and AUC-ROC for each of the 5 folds for ResNet-50. 44 Table 4.5: Performance Analysis for ResNet-50 Fold Accuracy (%) Class Precision (%) Recall (%) F1 Score (%) AUC-ROC (%) 1 72.840 AD 75.000 78.947 76.923 84.712 MCI 63.158 69.231 66.055 76.914 CN 82.222 69.811 75.510 83.469 2 73.714 AD 80.769 68.852 74.336 79.062 MCI 62.121 73.214 67.213 71.924 CN 80.702 79.310 80.000 85.927 3 72.903 AD 69.091 71.698 70.370 74.769 MCI 72.549 75.510 74.000 77.801 CN 77.551 71.698 74.510 78.376 4 69.643 AD 68.254 75.439 71.667 78.125 MCI 66.038 61.404 63.636 71.946 CN 75.000 72.222 73.585 80.539 5 69.277 AD 68.966 71.429 70.175 79.140 MCI 66.667 72.727 69.565 77.821 CN 72.917 63.636 67.961 74.382 DenseNet-169 Table 4.6 shows the metrics for accuracy, classwise precision, recall, F1 score, and AUC-ROC for each of the 5 folds for DenseNet-169. Table 4.6: Performance Analysis for DenseNet-169 Fold Accuracy (%) Class Precision (%) Recall (%) F1 Score (%) AUC-ROC (%) 1 77.778 AD 83.636 80.702 82.143 87.168 MCI 69.231 69.231 69.231 79.528 CN 80.000 83.019 81.481 88.783 2 74.857 AD 78.182 70.492 74.138 81.622 MCI 67.143 83.929 74.603 83.283 CN 82.000 70.690 75.926 81.359 3 74.839 AD 80.769 79.245 80.000 85.775 MCI 65.000 79.592 71.560 80.400 CN 81.395 66.038 72.917 84.462 4 75.000 AD 79.245 73.684 76.364 82.646 MCI 68.852 73.684 71.186 78.694 CN 77.778 77.778 77.778 86.160 5 73.494 AD 80.357 80.357 80.357 86.055 MCI 62.687 76.364 68.852 77.674 CN 81.395 63.636 71.429 78.870 Vision Transformer B16 Table 4.7 shows the metrics for accuracy, classwise precision, recall, F1 score, and AUC-ROC for each of the 5 folds for Vision Transformer B16. 45 Table 4.7: Performance Analysis for Vision Transformer B16 Fold Accuracy (%) Class Precision (%) Recall (%) F1 Score (%) AUC-ROC (%) 1 64.815 AD 72.222 68.421 70.270 82.256 MCI 56.667 65.385 60.714 70.472 CN 66.667 60.377 63.366 70.573 2 60.571 AD 75.000 59.016 66.055 77.380 MCI 51.351 67.857 58.462 70.273 CN 60.377 55.172 57.658 66.858 3 70.968 AD 74.468 66.038 70.000 78.117 MCI 64.912 75.510 69.811 78.283 CN 74.510 71.698 73.077 79.967 4 61.310 AD 66.667 70.175 68.376 74.696 MCI 53.125 59.649 56.198 67.520 CN 65.909 53.704 59.184 70.744 5 65.663 AD 66.038 62.500 64.220 73.198 MCI 60.000 65.455 62.609 73.726 CN 71.698 69.091 70.370 79.230 Traditional Ensemble Model (ResNet-50, AlexNet, VGG-16) The traditional ensemblemodel combines the strengths of threewell-establishedCon- volutional Neural Networks (CNNs): ResNet-50, AlexNet, and VGG-16. In this en- semble approach, each model is independently trained on the same dataset, and their predictions are aggregated to make the final decision. Table 4.8 shows the metrics for accuracy, classwise precision, recall, F1 score, and AUC-ROC for each of the 5 folds for Traditional Ensemble Model (ResNet-50, AlexNet, VGG-16). 
Table 4.8: Performance Analysis for Traditional Ensemble Model (ResNet-50, AlexNet, VGG-16) Fold Accuracy (%) Class Precision (%) Recall (%) F1 Score (%) AUC-ROC (%) 1 74.691 AD 84.211 84.211 84.211 87.347 MCI 62.500 67.308 64.815 70.359 CN 77.551 71.698 74.510 80.898 2 73.714 AD 80.392 67.213 73.214 77.858 MCI 65.079 73.214 68.908 75.040 CN 77.049 81.034 78.992 85.427 3 81.935 AD 90.909 75.472 82.474 86.632 MCI 74.000 75.510 74.747 80.485 CN 81.967 94.340 87.719 90.972 4 74.405 AD 81.481 77.193 79.279 83.750 MCI 71.429 61.404 66.038 71.189 CN 70.769 85.185 77.311 81.453 5 73.494 AD 75.806 83.929 79.661 82.976 MCI 64.516 72.727 68.376 73.095 CN 83.333 63.636 72.165 76.717 46 4.3.3 Plane-wise Ensemble Models The proposed plane-wise ensemble, withmodels trained specifically on coronal, sagit- tal, and axial slices, demonstrated superior performance compared to traditional en- semblemodels and standalone 2DCNNs. The holistic approach of combining special- ized models using a weighted average of their predictions yielded improved accuracy in the classification of 3D MRI images. Triple ResNet (using Midplane Projector) Triple ResNet model utilizes a plane-wise ensemble technique, involving three sepa- rate ResNet-50 models trained on coronal, sagittal, and axial planes of MRI images. Each ResNet-50model is trained independently on 2D slices from one of these planes, capturing unique anatomical features specific to that orientation. During evaluation, the entire 3D MRI volume is processed by a midplane projector function, which ex- tracts the central slice from each of the three planes. These slices are then fed into their respective ResNet-50 models. The predictions from the three models are sub- sequently combined to make the final classification decision. Table 4.9 shows the metrics for accuracy, classwise precision, recall, F1 score, and AUC-ROC for each of the 5 folds for Triple ResNet (Midplane Projector). Table 4.9: Performance Analysis for Triple ResNet (using Midplane Projector) Fold Accuracy (%) Class Precision (%) Recall (%) F1 Score (%) AUC-ROC (%) 1 80.247 AD 85.455 82.456 83.929