Long-tailed class distributions pose a significant challenge for multi-label chest X-ray (CXR) classification, where rare but clinically important findings are severely underrepresented. We present a systematic empirical evaluation of loss functions, CNN backbone architectures and post-training strategies on the CXR-LT 2026 benchmark, comprising approximately 143K images with 30 disease labels from PadChest.
Our experiments demonstrate that LDAM with deferred re-weighting (LDAM-DRW) consistently outperforms standard BCE and asymmetric losses for rare class recognition. Amongst the architectures evaluated, ConvNeXt-Large achieves the best single-model performance with 0.5220 mAP and 0.3765 F1 on our development set, whilst classifier re-training and test-time augmentation further improve ranking metrics. On the official test leaderboard, our submission achieved 0.3950 mAP, ranking 5th amongst 68 participating teams with a total of 1528 submissions.
Given a chest X-ray image, the goal is to predict a binary label vector over 30 disease classes; in this multi-label setting, multiple findings can co-occur in a single image. To address the extreme class imbalance and the multi-label nature of the task, we investigated three axes: loss functions, backbone architectures and post-training techniques. For loss functions, we compared Label-Distribution-Aware Margin loss combined with Deferred Re-Weighting (LDAM-DRW) and Asymmetric Loss against standard Binary Cross-Entropy (BCE) as a baseline. For backbone architectures, we evaluated ResNet-50/101, DenseNet-121/169, EfficientFormerV2-S and ConvNeXt-Base/Large. Post-training strategies included classifier re-training (cRT), test-time augmentation (TTA), probability calibration (Prob Calib.) and ensembling.
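The paper does not include code, but the key loss is straightforward to sketch. Below is a minimal PyTorch illustration of one way to adapt LDAM (per-class margins proportional to n_j^{-1/4}) with a DRW schedule to the multi-label setting; the class name `MultiLabelLDAMLoss`, the helper `drw_weights` and the hyperparameters (`max_margin`, `scale`, `beta`) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLabelLDAMLoss(nn.Module):
    """Multi-label adaptation of LDAM (Cao et al., 2019).

    Each class j gets a margin m_j proportional to n_j^(-1/4); the margin is
    subtracted from the logit wherever class j is positive, so rare classes
    must be predicted with a larger decision margin.
    """

    def __init__(self, class_counts, max_margin=0.5, scale=30.0):
        super().__init__()
        m = 1.0 / torch.sqrt(torch.sqrt(class_counts.float()))  # n_j^(-1/4)
        m = m * (max_margin / m.max())        # rarest class gets max_margin
        self.register_buffer("margins", m)    # shape: (num_classes,)
        self.scale = scale                    # logit scaling, as in multi-class LDAM
        self.weight = None                    # set later by the DRW schedule

    def forward(self, logits, targets):
        # Subtract the class margin from positive logits only.
        logits_m = logits - self.margins * targets
        return F.binary_cross_entropy_with_logits(
            self.scale * logits_m, targets, pos_weight=self.weight
        )

def drw_weights(class_counts, beta=0.9999):
    """Effective-number re-weighting (Cui et al., 2019), switched on late in training."""
    eff_num = 1.0 - torch.pow(beta, class_counts.float())
    w = (1.0 - beta) / eff_num
    return w / w.mean()

# DRW schedule: train with uniform weights first, enable re-weighting later, e.g.
#   criterion = MultiLabelLDAMLoss(counts)
#   if epoch >= drw_start_epoch:
#       criterion.weight = drw_weights(counts).to(device)
```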
We first train a ResNet-50 with different loss functions and post-training strategies to isolate the impact of each component. Switching the loss from BCE to LDAM-DRW yields a 30.5% relative increase in mAP. Adding cRT improves mAP by a further 1.4%, and cRT combined with TTA improves it by 1.9% over the LDAM-DRW baseline. We therefore fix LDAM-DRW for the remaining architecture experiments: ResNet-101 reaches 0.4584 mAP, DenseNet-121 0.3967, DenseNet-169 0.3981, EfficientFormerV2-S 0.4869, ConvNeXt-Base 0.4855 and ConvNeXt-Large the highest at 0.5220. cRT lifts ConvNeXt-Base by 3.8%, and adding TTA on top increases it by 7.4% over the baseline. In contrast, cRT with probability calibration degrades performance, and even the ensemble of ConvNeXt-Large and EfficientFormerV2-S with cRT scores lower than the plain ConvNeXt-Large trained with LDAM-DRW.
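cRT decouples representation and classifier learning: after standard training, the backbone is frozen and only a freshly initialized classifier head is re-trained on class-balanced data. The sketch below assumes a torchvision-style model with an `fc` head and per-image sample weights (e.g. inverse frequency of the rarest positive label); the paper does not specify its exact balancing scheme, so treat these choices as assumptions.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, WeightedRandomSampler

def classifier_retrain(model, dataset, sample_weights, num_classes,
                       epochs=10, lr=1e-3, device="cuda"):
    """cRT (Kang et al., 2020): keep the learned representation, re-train only
    the classifier head on a class-balanced re-sampling of the data."""
    for p in model.parameters():                  # freeze the backbone
        p.requires_grad = False
    in_features = model.fc.in_features            # assumes a torchvision-style `fc` head
    model.fc = nn.Linear(in_features, num_classes).to(device)  # fresh classifier

    sampler = WeightedRandomSampler(sample_weights,
                                    num_samples=len(dataset), replacement=True)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)
    optim = torch.optim.SGD(model.fc.parameters(), lr=lr, momentum=0.9)
    criterion = nn.BCEWithLogitsLoss()

    model.train()
    for _ in range(epochs):
        for images, targets in loader:
            images, targets = images.to(device), targets.to(device)
            loss = criterion(model(images), targets)
            optim.zero_grad()
            loss.backward()
            optim.step()
    return model
```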
Comparison across architectures, loss functions, and post-training strategies:
| Model | Loss | Post-training | mAP | AUC | F1 | ECE |
|---|---|---|---|---|---|---|
| ResNet-50 | BCE | — | 0.3248 | 0.8410 | 0.3222 | 0.8884 |
| ResNet-50 | Asymmetric | — | 0.0667 | 0.5603 | 0.0843 | 0.9526 |
| ResNet-50 | LDAM+DRW | — | 0.4241 | 0.8435 | 0.2676 | 0.5575 |
| ResNet-50 | LDAM+DRW | cRT | 0.4303 | 0.8828 | 0.3233 | 0.8300 |
| ResNet-50 | LDAM+DRW | cRT + TTA | 0.4325 | 0.8864 | 0.3102 | 0.8247 |
| ResNet-101 | LDAM+DRW | — | 0.4584 | 0.8679 | 0.2564 | 0.5332 |
| DenseNet-121 | LDAM+DRW | — | 0.3967 | 0.8334 | 0.2119 | 0.5422 |
| DenseNet-169 | LDAM+DRW | — | 0.3981 | 0.8520 | 0.1819 | 0.5316 |
| EfficientFormerV2-S | LDAM+DRW | — | 0.4869 | 0.8818 | 0.3161 | 0.5215 |
| ConvNeXt-Base | LDAM+DRW | — | 0.4855 | 0.8931 | 0.3081 | 0.5319 |
| ConvNeXt-Base | LDAM+DRW | cRT | 0.5039 | 0.8902 | 0.2548 | 0.8932 |
| ConvNeXt-Base | LDAM+DRW | cRT + TTA | 0.5217 | 0.8961 | 0.2659 | 0.8936 |
| ConvNeXt-Base | LDAM+DRW | cRT + Prob Calib. | 0.4539 | 0.8948 | 0.2974 | 0.8250 |
| ConvNeXt-Large | LDAM+DRW | — | 0.5220 | 0.8832 | 0.3765 | 0.5506 |
| ConvNeXt-Large | LDAM+DRW | cRT + Prob Calib. | 0.5116 | 0.8939 | 0.3669 | 0.5488 |
| ConvNeXt-Large + EfficientFormerV2-S | LDAM+DRW | cRT + Ensemble | 0.4990 | 0.8951 | 0.2556 | 0.7037 |
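The TTA rows above average predictions over augmented views of each test image. A minimal sketch, assuming horizontal-flip averaging (the paper does not list the exact augmentations used):

```python
import torch

@torch.no_grad()
def predict_tta(model, images):
    """Average sigmoid probabilities over the original and horizontally
    flipped views; averaging probabilities tends to help ranking metrics
    such as mAP."""
    model.eval()
    views = [images, torch.flip(images, dims=[-1])]  # original + horizontal flip
    probs = torch.stack([torch.sigmoid(model(v)) for v in views])
    return probs.mean(dim=0)
```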
Task 1: In-distribution Multi-label Classification (primary metric: macro-averaged mAP).
| Rank | Team | Affiliation | mAP ↑ | AUC ↑ | F1 ↑ |
|---|---|---|---|---|---|
| 1 | CVMAIL x MIHL | Vietnam National University, Vietnam | 0.5854 | 0.9259 | 0.3518 |
| 2 | Cool Peace | KAIST Graduate School of AI, South Korea | 0.4827 | 0.9186 | 0.3162 |
| 3 | VIU | Vietnam National University, Vietnam | 0.4599 | 0.8827 | 0.4504 |
| 4 | Bibimbap-Bueno | Case Western Reserve University, USA | 0.4297 | 0.8753 | 0.2482 |
| 5 | Nikhil Rao Sulake | RGUKT Nuzvid, India | 0.3950 | 0.8591 | 0.0945 |
| 6 | UGIVIA team | Universitat de les Illes Balears, Spain | 0.2362 | 0.7756 | 0.2353 |
68 participating teams, 1528 total submissions.
Class-activation maps overlaid on test images. The model localizes findings correctly (kyphosis, hernia, azygos lobe), but poor probability calibration causes instance-level misses.
Our findings establish that LDAM-DRW loss combined with modern CNN architectures, particularly ConvNeXt, forms a strong baseline for long-tailed multi-label CXR classification, achieving 0.5220 mAP on the development set. The consistent advantage of LDAM-DRW across all architectures suggests that margin-based losses with deferred re-weighting should be the default choice for clinical long-tailed tasks.
However, good ranking performance alone is not sufficient: the gap between development and test mAP (0.5220 vs. 0.3950) and the very low test F1 (0.0945) highlight the pressing need for better generalization and calibration strategies. Per-class threshold optimization, temperature scaling and techniques such as Sharpness-Aware Minimization appear to be the most promising directions for improving instance-level predictions.
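As an illustration of the per-class thresholding direction (a sketch of the proposed future work, not the method used for the submission), the snippet below picks one F1-maximizing decision threshold per class on held-out validation predictions and applies it at test time.

```python
import numpy as np
from sklearn.metrics import f1_score

def per_class_thresholds(probs, labels, grid=np.linspace(0.05, 0.95, 19)):
    """Choose, for each class, the threshold in `grid` that maximizes F1 on a
    validation set. `probs` and `labels` are (num_samples, num_classes) arrays."""
    thresholds = np.zeros(probs.shape[1])
    for c in range(probs.shape[1]):
        scores = [f1_score(labels[:, c], probs[:, c] >= t, zero_division=0)
                  for t in grid]
        thresholds[c] = grid[int(np.argmax(scores))]
    return thresholds

# Usage: tune on validation predictions, apply to test probabilities.
# thr = per_class_thresholds(val_probs, val_labels)
# test_preds = (test_probs >= thr).astype(int)
```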
@article{sulake2026lossdesignarchitectureselection,
title={Loss Design and Architecture Selection for Long-Tailed Multi-Label Chest X-Ray Classification},
author={Nikhileswara Rao Sulake},
year={2026},
eprint={2603.02294},
archivePrefix={arXiv},
primaryClass={eess.IV},
url={https://arxiv.org/abs/2603.02294},
}