Optimizing CAM Refinement with Swin-based Feature Affinity for Weakly Supervised Semantic Segmentation

Farzana, Anika; Ahsan, K. M. Abesh; Amin, Sayemah

dc.contributor.author	Farzana, Anika
dc.contributor.author	Ahsan, K. M. Abesh
dc.contributor.author	Amin, Sayemah
dc.date.accessioned	2026-06-23T09:47:57Z
dc.date.issued	2025-10-25
dc.description	Supervised by Dr. Md. Hasanul Kabir, Professor, Department of Computer Science and Engineering (CSE), Islamic University of Technology (IUT), Board Bazar, Gazipur, Bangladesh. This thesis is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering, 2025.
dc.description.abstract	Semantic segmentation is a fundamental computer vision task requiring pixel-level understanding of images. While fully supervised methods achieve high accuracy, they rely on costly pixel-level annotations. Weakly Supervised Semantic Segmentation (WSSS) mitigates this by using weaker supervision, such as image-level labels, to train effective models. Recent WSSS progress leverages Class Activation Maps (CAMs), though their sparsity and poor boundary localization remain challenges. This study enhances CAM quality through multi-modal backbones like UniCL and hierarchical transformers such as Swin Transformer for stronger feature extraction, coupled with an affinity-based framework that fuses encoder and decoder affinities for semantically coherent pseudo-labels. A Pixel-Adaptive Refinement (PAR) module further improves object boundaries using local similarity cues. Experiments on the PASCAL VOC 2012 dataset yield mean IoUs of 50.3% (validation) and 50.8% (test), with strong performance on large, distinctive classes but weaker results for small or human-centric ones due to CAM bias and dataset imbalance. Overall, our findings demonstrate that UniCL and Swin Transformer significantly improve CAM quality and segmentation under weak supervision while highlighting the need for strategies that handle object size variation and reduce model bias.
dc.identifier.uri	https://repository.iutoic-dhaka.edu/handle/123456789/2616
dc.language.iso	en
dc.publisher	Department of Computer Science and Engineering (CSE), Islamic University of Technology (IUT), Board Bazar, Gazipur-1704, Bangladesh
dc.title	Optimizing CAM Refinement with Swin-based Feature Affinity for Weakly Supervised Semantic Segmentation
dc.type	Thesis

Files

Original bundle

Name: 26 Fulltext_ CSE_Optimizing CAM Refinement with Swin-based_200041204_200041225_200041234_.pdf
Size: 18.6 MB
Format: Adobe Portable Document Format

Name: 26 Turnitin Report_ CSE_200041204_200041225_200041234_.pdf
Size: 491.6 KB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission

Collections

2025