Long-tailed class distributions pose a significant challenge for multi-label chest X-ray (CXR) classification, where rare but clinically important findings are severely underrepresented. We present a systematic empirical evaluation of loss functions, CNN backbone architectures and post-training strategies on the CXR-LT 2026 benchmark, comprising approximately 143K images with 30 disease labels from PadChest.
Our experiments demonstrate that LDAM with deferred re-weighting (LDAM-DRW) consistently outperforms standard BCE and asymmetric losses for rare class recognition. Amongst the architectures evaluated, ConvNeXt-Large achieves the best single-model performance with 0.5220 mAP and 0.3765 F1 on our development set, whilst classifier re-training and test-time augmentation further improve ranking metrics. On the official test leaderboard, our submission achieved 0.3950 mAP, ranking 5th amongst 68 participating teams with a total of 1528 submissions.
Given a chest X-ray image, the goal is to predict a binary label vector over 30 disease classes; in this multi-label setting, multiple findings can co-occur in a single image. To address the extreme class imbalance and the multi-label nature of the task, we investigated three axes: loss functions, backbone architectures and post-training techniques. For loss functions, we compared Label-Distribution-Aware Margin loss combined with Deferred Re-Weighting (LDAM-DRW) and Asymmetric Loss against standard Binary Cross-Entropy (BCE) as a baseline. For backbone architectures, we evaluated ResNet-50/101, DenseNet-121/169, EfficientFormerV2-S and ConvNeXt-Base/Large. Post-training strategies included classifier re-training (cRT), test-time augmentation (TTA), probability calibration (Prob Calib.) and ensembling.
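The paper does not include code, but the key loss is straightforward to sketch. Below is a minimal PyTorch illustration of one way to adapt LDAM (per-class margins proportional to n_j^{-1/4}) with a DRW schedule to the multi-label setting; the class name `MultiLabelLDAMLoss`, the helper `drw_weights` and the hyperparameters (`max_margin`, `scale`, `beta`) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLabelLDAMLoss(nn.Module):
    """Multi-label adaptation of LDAM (Cao et al., 2019).

    Each class j gets a margin m_j proportional to n_j^(-1/4); the margin is
    subtracted from the logit wherever class j is positive, so rare classes
    must be predicted with a larger decision margin.
    """

    def __init__(self, class_counts, max_margin=0.5, scale=30.0):
        super().__init__()
        m = 1.0 / torch.sqrt(torch.sqrt(class_counts.float()))  # n_j^(-1/4)
        m = m * (max_margin / m.max())        # rarest class gets max_margin
        self.register_buffer("margins", m)    # shape: (num_classes,)
        self.scale = scale                    # logit scaling, as in multi-class LDAM
        self.weight = None                    # set later by the DRW schedule

    def forward(self, logits, targets):
        # Subtract the class margin from positive logits only.
        logits_m = logits - self.margins * targets
        return F.binary_cross_entropy_with_logits(
            self.scale * logits_m, targets, pos_weight=self.weight
        )

def drw_weights(class_counts, beta=0.9999):
    """Effective-number re-weighting (Cui et al., 2019), switched on late in training."""
    eff_num = 1.0 - torch.pow(beta, class_counts.float())
    w = (1.0 - beta) / eff_num
    return w / w.mean()

# DRW schedule: train with uniform weights first, enable re-weighting later, e.g.
#   criterion = MultiLabelLDAMLoss(counts)
#   if epoch >= drw_start_epoch:
#       criterion.weight = drw_weights(counts).to(device)
```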
We first train a ResNet-50 with different loss functions and post-training strategies to isolate the impact of each component. Switching the loss from BCE to LDAM-DRW yields a 30.5% relative increase in mAP. Adding cRT improves mAP by a further 1.4%, and cRT combined with TTA improves it by 1.9% over the LDAM-DRW baseline. We therefore fix LDAM-DRW for the remaining architecture experiments: ResNet-101 reaches 0.4584 mAP, DenseNet-121 0.3967, DenseNet-169 0.3981, EfficientFormerV2-S 0.4869, ConvNeXt-Base 0.4855 and ConvNeXt-Large the highest at 0.5220. cRT lifts ConvNeXt-Base by 3.8%, and adding TTA on top increases it by 7.4% over the baseline. In contrast, cRT with probability calibration degrades performance, and even the ensemble of ConvNeXt-Large and EfficientFormerV2-S with cRT scores lower than the plain ConvNeXt-Large trained with LDAM-DRW.
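cRT decouples representation and classifier learning: after standard training, the backbone is frozen and only a freshly initialized classifier head is re-trained on class-balanced data. The sketch below assumes a torchvision-style model with an `fc` head and per-image sample weights (e.g. inverse frequency of the rarest positive label); the paper does not specify its exact balancing scheme, so treat these choices as assumptions.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, WeightedRandomSampler

def classifier_retrain(model, dataset, sample_weights, num_classes,
                       epochs=10, lr=1e-3, device="cuda"):
    """cRT (Kang et al., 2020): keep the learned representation, re-train only
    the classifier head on a class-balanced re-sampling of the data."""
    for p in model.parameters():                  # freeze the backbone
        p.requires_grad = False
    in_features = model.fc.in_features            # assumes a torchvision-style `fc` head
    model.fc = nn.Linear(in_features, num_classes).to(device)  # fresh classifier

    sampler = WeightedRandomSampler(sample_weights,
                                    num_samples=len(dataset), replacement=True)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)
    optim = torch.optim.SGD(model.fc.parameters(), lr=lr, momentum=0.9)
    criterion = nn.BCEWithLogitsLoss()

    model.train()
    for _ in range(epochs):
        for images, targets in loader:
            images, targets = images.to(device), targets.to(device)
            loss = criterion(model(images), targets)
            optim.zero_grad()
            loss.backward()
            optim.step()
    return model
```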
Comparison across architectures, loss functions, and post-training strategies:
| Model | Loss | Post-training | mAP | AUC | F1 | ECE |
|---|---|---|---|---|---|---|
| ResNet-50 | BCE | — | 0.3248 | 0.8410 | 0.3222 | 0.8884 |
| ResNet-50 | Asymmetric | — | 0.0667 | 0.5603 | 0.0843 | 0.9526 |
| ResNet-50 | LDAM+DRW | — | 0.4241 | 0.8435 | 0.2676 | 0.5575 |
| ResNet-50 | LDAM+DRW | cRT | 0.4303 | 0.8828 | 0.3233 | 0.8300 |
| ResNet-50 | LDAM+DRW | cRT + TTA | 0.4325 | 0.8864 | 0.3102 | 0.8247 |
| ResNet-101 | LDAM+DRW | — | 0.4584 | 0.8679 | 0.2564 | 0.5332 |
| DenseNet-121 | LDAM+DRW | — | 0.3967 | 0.8334 | 0.2119 | 0.5422 |
| DenseNet-169 | LDAM+DRW | — | 0.3981 | 0.8520 | 0.1819 | 0.5316 |
| EfficientFormerV2-S | LDAM+DRW | — | 0.4869 | 0.8818 | 0.3161 | 0.5215 |
| ConvNeXt-Base | LDAM+DRW | — | 0.4855 | 0.8931 | 0.3081 | 0.5319 |
| ConvNeXt-Base | LDAM+DRW | cRT | 0.5039 | 0.8902 | 0.2548 | 0.8932 |
| ConvNeXt-Base | LDAM+DRW | cRT + TTA | 0.5217 | 0.8961 | 0.2659 | 0.8936 |
| ConvNeXt-Base | LDAM+DRW | cRT + Prob Calib. | 0.4539 | 0.8948 | 0.2974 | 0.8250 |
| ConvNeXt-Large | LDAM+DRW | — | 0.5220 | 0.8832 | 0.3765 | 0.5506 |
| ConvNeXt-Large | LDAM+DRW | cRT + Prob Calib. | 0.5116 | 0.8939 | 0.3669 | 0.5488 |
| ConvNeXt-Large + EfficientFormerV2-S | LDAM+DRW | cRT + Ensemble | 0.4990 | 0.8951 | 0.2556 | 0.7037 |
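The TTA rows above average predictions over augmented views of each test image. A minimal sketch, assuming horizontal-flip averaging (the paper does not list the exact augmentations used):

```python
import torch

@torch.no_grad()
def predict_tta(model, images):
    """Average sigmoid probabilities over the original and horizontally
    flipped views; averaging probabilities tends to help ranking metrics
    such as mAP."""
    model.eval()
    views = [images, torch.flip(images, dims=[-1])]  # original + horizontal flip
    probs = torch.stack([torch.sigmoid(model(v)) for v in views])
    return probs.mean(dim=0)
```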
Task 1: In-distribution Multi-label Classification (primary metric: macro-averaged mAP).
| Rank | Team | Affiliation | mAP ↑ | AUC ↑ | F1 ↑ |
|---|---|---|---|---|---|
| 1 | CVMAIL x MIHL | Vietnam National University, Vietnam | 0.5854 | 0.9259 | 0.3518 |
| 2 | Cool Peace | KAIST Graduate School of AI, South Korea | 0.4827 | 0.9186 | 0.3162 |
| 3 | VIU | Vietnam National University, Vietnam | 0.4599 | 0.8827 | 0.4504 |
| 4 | Bibimbap-Bueno | Case Western Reserve University, USA | 0.4297 | 0.8753 | 0.2482 |
| 5 | Nikhil Rao Sulake | RGUKT Nuzvid, India | 0.3950 | 0.8591 | 0.0945 |
| 6 | UGIVIA team | Universitat de les Illes Balears, Spain | 0.2362 | 0.7756 | 0.2353 |
68 participating teams, 1528 total submissions.
Class-activation maps overlaid on test images. The model localizes findings correctly (kyphosis, hernia, azygos lobe), but poor probability calibration causes instance-level misses.
Our findings establish that LDAM-DRW loss combined with modern CNN architectures, particularly ConvNeXt, forms a strong baseline for long-tailed multi-label CXR classification, achieving 0.5220 mAP on the development set. The consistent advantage of LDAM-DRW across all architectures suggests that margin-based losses with deferred re-weighting should be the default choice for clinical long-tailed tasks.
However, good ranking performance alone is not sufficient: the gap between development and test mAP (0.5220 vs. 0.3950) and the very low test F1 (0.0945) highlight the pressing need for better generalization and calibration strategies. Per-class threshold optimization, temperature scaling and techniques such as Sharpness-Aware Minimization appear to be the most promising directions for improving instance-level predictions.
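As an illustration of the per-class thresholding direction (a sketch of the proposed future work, not the method used for the submission), the snippet below picks one F1-maximizing decision threshold per class on held-out validation predictions and applies it at test time.

```python
import numpy as np
from sklearn.metrics import f1_score

def per_class_thresholds(probs, labels, grid=np.linspace(0.05, 0.95, 19)):
    """Choose, for each class, the threshold in `grid` that maximizes F1 on a
    validation set. `probs` and `labels` are (num_samples, num_classes) arrays."""
    thresholds = np.zeros(probs.shape[1])
    for c in range(probs.shape[1]):
        scores = [f1_score(labels[:, c], probs[:, c] >= t, zero_division=0)
                  for t in grid]
        thresholds[c] = grid[int(np.argmax(scores))]
    return thresholds

# Usage: tune on validation predictions, apply to test probabilities.
# thr = per_class_thresholds(val_probs, val_labels)
# test_preds = (test_probs >= thr).astype(int)
```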
@article{sulake2026lossdesignarchitectureselection,
title={Loss Design and Architecture Selection for Long-Tailed Multi-Label Chest X-Ray Classification},
author={Nikhileswara Rao Sulake},
year={2026},
eprint={2603.02294},
archivePrefix={arXiv},
primaryClass={eess.IV},
url={https://arxiv.org/abs/2603.02294},
}