Loss Design and Architecture Selection for Long-Tailed Multi-Label Chest X-Ray Classification

RGUKT Nuzvid, India
CXR-LT 2026 Challenge (ISBI 2026): 5th Place

Abstract

Long-tailed class distributions pose a significant challenge for multi-label chest X-ray (CXR) classification, where rare but clinically important findings are severely underrepresented. We present a systematic empirical evaluation of loss functions, CNN backbone architectures and post-training strategies on the CXR-LT 2026 benchmark, comprising approximately 143K images with 30 disease labels from PadChest.

Our experiments demonstrate that LDAM with deferred re-weighting (LDAM-DRW) consistently outperforms standard BCE and asymmetric losses for rare class recognition. Amongst the architectures evaluated, ConvNeXt-Large achieves the best single-model performance with 0.5220 mAP and 0.3765 F1 on our development set, whilst classifier re-training and test-time augmentation further improve ranking metrics. On the official test leaderboard, our submission achieved 0.3950 mAP, ranking 5th amongst 68 participating teams with a total of 1528 submissions.

Method Overview

Given a chest X-ray image, the goal is to predict a binary label vector over 30 disease classes; in this multi-label setting, several findings may co-occur in a single image. To address the extreme class imbalance and multi-label nature of the task, we investigated three axes of design choices: loss functions, backbone architectures, and post-training techniques. For losses, we compared Label-Distribution-Aware Margin loss with Deferred Re-Weighting (LDAM-DRW) and Asymmetric Loss against a standard Binary Cross-Entropy (BCE) baseline. For backbones, we evaluated ResNet-50/101, DenseNet-121/169, EfficientFormerV2-S and ConvNeXt-Base/Large. Post-training strategies included classifier re-training (cRT), test-time augmentation (TTA), probability calibration (Prob Calib.) and ensembling.
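To make the margin mechanism concrete, the snippet below is a minimal NumPy sketch of a multi-label LDAM-style loss with effective-number class re-weighting in the DRW stage. The constants `C`, `s` and `beta`, and the exact multi-label adaptation (subtracting the margin only at positive labels), are illustrative assumptions, not our exact training code.

```python
import numpy as np

def ldam_drw_bce(logits, labels, n_pos, C=0.5, s=30.0, beta=0.9999, drw=True):
    """Sketch of a multi-label LDAM loss with optional deferred re-weighting.

    logits : (N, K) raw scores
    labels : (N, K) binary targets
    n_pos  : (K,) positive counts per class from the training set
    """
    # Class-dependent margin: rarer classes get a larger margin (n_j^{-1/4}).
    delta = C / np.power(n_pos, 0.25)                 # (K,)
    # Subtract the margin from the logit only where the label is positive.
    z = s * (logits - delta * labels)
    # Numerically stable binary cross-entropy with logits.
    bce = np.maximum(z, 0) - z * labels + np.log1p(np.exp(-np.abs(z)))
    if drw:
        # Effective-number class weights, applied after a warm-up phase (DRW).
        w = (1.0 - beta) / (1.0 - np.power(beta, n_pos))
        bce = bce * (w / w.mean())                    # normalise to mean 1
    return bce.mean()
```

In actual training the `drw` flag would be switched on only after the warm-up epochs, so the model first learns a general representation before rare classes are up-weighted.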

We first train a ResNet-50 with different loss functions and post-training strategies to isolate the impact of each component. Switching the loss from BCE to LDAM-DRW yields a 30.5% relative increase in mAP (0.3248 to 0.4241). Adding cRT improves mAP by a further 1.4%, and cRT combined with TTA improves it by 1.9% over the LDAM-DRW baseline. Fixing LDAM-DRW as the loss for the remaining experiments, ResNet-101 reaches 0.4584 mAP, DenseNet-121 0.3967, DenseNet-169 0.3981, EfficientFormerV2-S 0.4869, ConvNeXt-Base 0.4855, and ConvNeXt-Large the highest at 0.5220. cRT lifts ConvNeXt-Base by 3.8%, and adding TTA lifts it by 7.4% in total. In contrast, cRT with probability calibration degrades performance, and even an ensemble of ConvNeXt-Large and EfficientFormerV2-S with cRT falls below the plain ConvNeXt-Large trained with LDAM-DRW.
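In cRT, the backbone is frozen after the first stage and only the linear classifier is re-trained on class-balanced data. The sketch below shows one possible way to derive class-balanced sampling weights for multi-label images; the per-image rule (averaging inverse class frequencies over an image's positive labels) is an illustrative assumption, since cRT was originally defined for single-label sampling.

```python
import numpy as np

def crt_sample_weights(labels):
    """Class-balanced sampling weights for a cRT-style second stage (sketch).

    labels : (N, K) binary label matrix. Each image is weighted by the mean
    inverse frequency of its positive classes, so rare-class images are
    drawn more often when re-training the classifier on a frozen backbone.
    """
    n_pos = labels.sum(axis=0)                        # (K,) positives per class
    inv_freq = 1.0 / np.clip(n_pos, 1, None)          # inverse class frequency
    per_image = (inv_freq * labels).sum(axis=1) / np.clip(labels.sum(axis=1), 1, None)
    # Images with no positive finding fall back to the mean class weight.
    w = np.where(labels.sum(axis=1) > 0, per_image, inv_freq.mean())
    return w / w.sum()                                # normalise to a distribution
```

These weights could feed a weighted sampler so that each classifier-retraining batch is approximately balanced across classes.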

Results

Development Set Performance

Comparison across architectures, loss functions, and post-training strategies:

| Model | Loss | Post-training | AP | AUC | F1 | ECE |
|---|---|---|---|---|---|---|
| ResNet-50 | BCE | - | 0.3248 | 0.8410 | 0.3222 | 0.8884 |
| ResNet-50 | Asymmetric | - | 0.0667 | 0.5603 | 0.0843 | 0.9526 |
| ResNet-50 | LDAM+DRW | - | 0.4241 | 0.8435 | 0.2676 | 0.5575 |
| ResNet-50 | LDAM+DRW | cRT | 0.4303 | 0.8828 | 0.3233 | 0.8300 |
| ResNet-50 | LDAM+DRW | cRT + TTA | 0.4325 | 0.8864 | 0.3102 | 0.8247 |
| ResNet-101 | LDAM+DRW | - | 0.4584 | 0.8679 | 0.2564 | 0.5332 |
| DenseNet-121 | LDAM+DRW | - | 0.3967 | 0.8334 | 0.2119 | 0.5422 |
| DenseNet-169 | LDAM+DRW | - | 0.3981 | 0.8520 | 0.1819 | 0.5316 |
| EfficientFormerV2-S | LDAM+DRW | - | 0.4869 | 0.8818 | 0.3161 | 0.5215 |
| EfficientFormerV2-S | LDAM+DRW | - | 0.4869 | 0.8818 | 0.3161 | 0.8250 |
| ConvNeXt-Base | LDAM+DRW | - | 0.4855 | 0.8931 | 0.3081 | 0.5319 |
| ConvNeXt-Base | LDAM+DRW | cRT | 0.5039 | 0.8902 | 0.2548 | 0.8932 |
| ConvNeXt-Base | LDAM+DRW | cRT + TTA | 0.5217 | 0.8961 | 0.2659 | 0.8936 |
| ConvNeXt-Base | LDAM+DRW | cRT + Prob Calib. | 0.4539 | 0.8948 | 0.2974 | 0.8250 |
| ConvNeXt-Large | LDAM+DRW | - | 0.5220 | 0.8832 | 0.3765 | 0.5506 |
| ConvNeXt-Large | LDAM+DRW | cRT + Prob Calib. | 0.5116 | 0.8939 | 0.3669 | 0.5488 |
| ConvNeXt-Large + EfficientFormerV2-S | LDAM+DRW | cRT + Ensemble | 0.4990 | 0.8951 | 0.2556 | 0.7037 |

Official CXR-LT 2026 Test Leaderboard

Task 1: In-distribution Multi-label Classification (primary metric: macro-averaged mAP).

| Rank | Team | Affiliation | mAP ↑ | AUC ↑ | F1 ↑ |
|---|---|---|---|---|---|
| 1 | CVMAIL x MIHL | Vietnam National University, Vietnam | 0.5854 | 0.9259 | 0.3518 |
| 2 | Cool Peace | KAIST Graduate School of AI, South Korea | 0.4827 | 0.9186 | 0.3162 |
| 3 | VIU | Vietnam National University, Vietnam | 0.4599 | 0.8827 | 0.4504 |
| 4 | Bibimbap-Bueno | Case Western Reserve University, USA | 0.4297 | 0.8753 | 0.2482 |
| 5 | Nikhil Rao Sulake | RGUKT Nuzvid, India | 0.3950 | 0.8591 | 0.0945 |
| 6 | UGIVIA team | Universitat de les Illes Balears, Spain | 0.2362 | 0.7756 | 0.2353 |

68 participating teams, 1528 total submissions.
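The ranking metric above, macro-averaged mAP, is the per-class average precision averaged over classes. The sketch below is a minimal NumPy implementation for reference; it breaks score ties arbitrarily and skips classes with no positives, so the official evaluator may differ in detail.

```python
import numpy as np

def average_precision(y_true, y_score):
    """AP for one class: precision at each positive, averaged over positives."""
    order = np.argsort(-y_score)           # rank by descending score
    y = y_true[order]
    cum_pos = np.cumsum(y)
    precision = cum_pos / (np.arange(len(y)) + 1)
    return (precision * y).sum() / max(y.sum(), 1)

def macro_map(Y_true, Y_score):
    """Macro-averaged mAP over K classes (classes with no positives skipped)."""
    aps = [average_precision(Y_true[:, k], Y_score[:, k])
           for k in range(Y_true.shape[1]) if Y_true[:, k].sum() > 0]
    return float(np.mean(aps))
```

Because mAP is ranking-based, a model can score well on it while its thresholded predictions (and hence F1) remain poor, which is exactly the gap our leaderboard entry exhibits.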

Qualitative Results

Class activation maps on CXR images

Class-activation maps overlaid on test images. The model localizes findings correctly (kyphosis, hernia, azygos lobe) but probability calibration causes instance-level misses.

Interactive GradCAM Explorer

Compare class activation maps across different models and disease classes. Select multiple models and classes to see side-by-side comparisons. The number mentioned on each image is the probability score for that class. Click on any image to view a larger version with detailed information.


Conclusion

Our findings establish that LDAM-DRW loss combined with modern CNN architectures, particularly ConvNeXt, forms a strong baseline for long-tailed multi-label CXR classification, achieving 0.5220 mAP on the development set. The consistent advantage of LDAM-DRW across all architectures suggests that margin-based losses with deferred re-weighting should be the default choice for clinical long-tailed tasks.

However, good ranking performance alone is not sufficient — the gap between development and test mAP (0.52 vs. 0.395) and very low test F1 (0.0945) highlight the pressing need for better generalization and calibration strategies. Per-class threshold optimization, temperature scaling, and techniques like Sharpness Aware Minimization appear to be the most promising directions for improving instance-level predictions.
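Per-class threshold optimisation, for instance, can be sketched as a grid search for the F1-maximising decision threshold per class on a held-out set. The grid and search rule below are illustrative assumptions, not a tuned procedure.

```python
import numpy as np

def f1(tp, fp, fn):
    return 2 * tp / max(2 * tp + fp + fn, 1e-12)

def per_class_thresholds(y_true, probs, grid=np.linspace(0.05, 0.95, 19)):
    """Pick the F1-maximising threshold per class on held-out predictions."""
    K = y_true.shape[1]
    thresholds = np.empty(K)
    for k in range(K):
        best_t, best_f1 = 0.5, -1.0
        for t in grid:
            pred = probs[:, k] >= t
            tp = np.sum(pred & (y_true[:, k] == 1))
            fp = np.sum(pred & (y_true[:, k] == 0))
            fn = np.sum(~pred & (y_true[:, k] == 1))
            score = f1(tp, fp, fn)
            if score > best_f1:
                best_f1, best_t = score, t
        thresholds[k] = best_t
    return thresholds
```

Tuning one threshold per class rather than a global 0.5 cut-off directly targets the instance-level F1 weakness, while leaving the ranking-based mAP untouched.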

BibTeX

@article{sulake2026lossdesignarchitectureselection,
  title={Loss Design and Architecture Selection for Long-Tailed Multi-Label Chest X-Ray Classification},
  author={Nikhileswara Rao Sulake},
  year={2026},
  eprint={2603.02294},
  archivePrefix={arXiv},
  primaryClass={eess.IV},
  url={https://arxiv.org/abs/2603.02294}
}