3D Left Ventricular Analysis

Abstract

Automated analysis of the left ventricle (LV) from echocardiography, encompassing both segmentation and ejection fraction (EF) estimation, is a clinically important but technically challenging problem. This work presents a two-phase research program exploring the potential and limits of synthetic 3D data generation for multitask volumetric cardiac learning. In Phase One, a framework is proposed for joint 3D LV segmentation and EF estimation by leveraging synthetic volumetric data generated from an ensemble of fourteen diverse 2D segmentation models trained exclusively on end-diastolic (ED) and end-systolic (ES) frame annotations from the EchoNet-Dynamic dataset. A lightweight 3D multitask network with a shared encoder achieved a maximum Dice score of 0.9701 and a minimum EF MSE of 163.9, with only 1,183,682 parameters and a model size of 4.73 MB, orders of magnitude more compact than DeepLabResNet101 (233 MB) or ViT-L-16 (1183 MB). In Phase Two, these results are critically re-examined through controlled experiments anchored to real ground truth. The 3D model evaluated on real 2D ground-truth masks achieves a Dice of only 0.6782, substantially below 2D baselines (up to 0.8932), exposing a fundamental performance ceiling imposed by synthetic label training. Further experiments on the real-world 3D MITEA dataset confirm that complex models overfit on limited 3D clinical data, with the standard 3D UNet achieving the best test MSE of 85.77 at test Dice of 0.8464.

• • •

Introduction

The left ventricular ejection fraction (LVEF) is one of the most clinically consequential parameters in cardiology, used to diagnose heart failure, guide therapy, and stratify long-term patient risk. LVEF is defined as the ratio of stroke volume to end-diastolic volume, a quantity that fundamentally requires accurate LV segmentation across the cardiac cycle. Despite decades of research, automated and reliable computation of LVEF from echocardiographic video remains an open challenge due to the low signal-to-noise ratio of ultrasound images, significant inter-patient anatomical variability, and the inherent difficulty of constructing volumetric representations from 2D imaging data.

Modern deep learning has substantially advanced both segmentation and regression in isolation. Fully convolutional networks, attention-based U-Nets, and Vision Transformers have brought 2D segmentation Dice scores above 0.93 on well-annotated datasets such as EchoNet-Dynamic. Concurrently, end-to-end video regression models have reduced EF mean absolute error (MAE) to below 4%. However, the integration of precise volumetric segmentation with functional regression in a unified 3D framework, and rigorous evaluation against ground-truth baselines, remains comparatively underexplored.

A core practical barrier is the scarcity of 3D echocardiographic annotations. While datasets such as EchoNet-Dynamic provide 2D apical four-chamber video with sparse ED/ES masks, fully annotated 3D volumetric data is far rarer and more costly to acquire. One natural strategy is to use trained 2D models to generate dense synthetic 3D supervision by predicting masks for every frame and stacking them temporally. However, the faithfulness of such synthetic labels relative to true clinical annotations, and the degree to which a 3D model trained on them can generalize, are empirical questions that prior literature has not directly addressed.

This paper presents a rigorous two-phase investigation of this strategy. Phase One develops and evaluates a lightweight 3D multitask framework trained entirely on synthetic volumes derived from an ensemble of fourteen 2D segmentation architectures, demonstrating high performance on within-distribution (synthetic) evaluation. Phase Two then systematically interrogates the validity and limits of those results through three controlled experiments: (1) full test-set evaluation of all 2D generators against real ground-truth labels; (2) direct evaluation of the 3D model on real 2D ground-truth masks to enable an unbiased comparison; and (3) experiments on the genuine 3D MITEA dataset to assess how well various architectures generalize when trained on authentic volumetric annotations.

        Main Contributions
        A comprehensive benchmark of fourteen 2D segmentation models on EchoNet-Dynamic, covering U-Net variants, FCN backbones, DeepLabV3+ configurations, and Vision Transformers.
A lightweight 3D multitask architecture (1.18M parameters, 4.73 MB) substantially more compact and faster than all 2D counterparts.
Methodologically rigorous Phase Two establishing: the 3D model's Phase One accuracy is bounded by synthetic label quality; on real ground truth the 3D model (Dice 0.6782) is outperformed by leading 2D models (Dice up to 0.8932); on the real 3D MITEA dataset, model complexity must be carefully controlled to avoid overfitting.
Umbra UNet, a novel hybrid 3D encoder-decoder integrating InceptionNeXt, FasterNet, and ConvNeXtV2 blocks, evaluated for the first time on 3D echocardiographic LV segmentation.
Clear articulation of the synthetic-to-real domain gap and the conflation problem of jointly learning segmentation alongside a derived metric (EF).

      

• • •

Related Work

Automated LV segmentation and EF estimation from echocardiography has evolved rapidly. Early probabilistic approaches such as dual-view Bayesian fusion achieved EF MAE of 3.9% and established the clinical value of uncertainty-aware predictions. Video-based deep learning with Vision Transformers and lightweight 3D CNNs achieved MAE around 5–6% for EF regression but produced no segmentation output, limiting interpretability.

On the multitask front, dual-encoder U-Nets fusing orthogonal apical views improved spatial context, while hierarchical transformer approaches achieved MAE around 4.3%. EFNet, a dual-branch CNN, achieved simultaneous LV segmentation (Dice ≈ 0.91) and EF estimation (MAE 4.5%) on 2D frames, a strong 2D multitask baseline directly motivating our unified design.

Semi-supervised and weakly supervised methods have pushed toward minimal annotation regimes. UniLVSeg and SimLVSeg combine weak ED/ES labels with self-supervised temporal masking to propagate segmentation across the full cardiac cycle (Dice ≈ 0.91), directly inspiring our strategy of using 2D model predictions to densify sparse ED/ES annotations into full temporal volumes.

Recent specialized architectures include dynamic-gating spatiotemporal attention reaching Dice 0.92, and fully volumetric 3D CNNs jointly learning motion fields and LV segmentation from genuine 3D echo data (Dice 0.90). On the clinical validation side, CardiacField combines 2D U-Net segmentation with a neural field regressor (Dice > 0.88, EF error ~5%), while robustness studies on point-of-care ultrasound identify segmentation quality as the key limiting factor.

Against this backdrop, our work makes two specific advances: rather than using a single 2D model to generate pseudo-labels, we systematically exploit an ensemble of fourteen heterogeneous architectures for denser synthetic supervision; and critically, we do not stop at evaluating on synthetic data, we rigorously re-evaluate on real ground truth to quantify exactly what the synthetic training ceiling costs in terms of true generalization.

• • •

Phase One: Synthetic Volume Generation and 3D Multitask Learning

Dataset: EchoNet-Dynamic

All Phase One experiments use EchoNet-Dynamic, a large-scale public echocardiography benchmark released by Stanford University comprising 10,030 apical four-chamber video clips. Each video is annotated with a single LVEF value and binary LV segmentation masks at the ED and ES frames. The official train/validation/test split (7465 / 1288 / 1277 videos) is followed. All frames are resized to 224×224 pixels and normalized to [0, 1].

Stage 1: Training the 2D Segmentation Ensemble

Fourteen 2D segmentation architectures are trained exclusively on ED/ES frame pairs from EchoNet-Dynamic. The ensemble spans four architectural families: the U-Net family (UNet, UNet++, R2U-Net, R2AttU-Net, Double U-Net, U²-Net, and AttU-Net); Fully Convolutional Networks (FCN-ResNet50 and FCN-ResNet101); DeepLabV3+ with three backbones (MobileNetV3-Large, ResNet50, and ResNet101); and Vision Transformer segmenters (ViT-B-16 and ViT-L-16).

Each model minimizes a combined binary cross-entropy and Dice loss on predicted versus ground-truth binary masks. All models are optimized with Adam (learning rate 1×10⁻⁴, weight decay 1×10⁻⁵) for 100 epochs, selecting the best checkpoint by validation Dice.

Stage 2: Synthetic 3D Volume Generation

Each trained 2D model is applied frame-by-frame to every training video to produce a sequence of binary masks stacked temporally into a synthetic volume. To enforce spatiotemporal continuity: morphological closing (disk kernel, radius 3) is applied slice-by-slice to fill intra-mask holes, and connected components with fewer than 50 pixels are removed as isolated artifacts. This generates fourteen distinct synthetic volume sets, one per 2D generator, each approximating the complete volumetric anatomy of the LV across the cardiac cycle from only two annotated frames.

Stage 3: 3D Multitask Network Architecture

The 3D multitask network takes a full video volume as input and simultaneously predicts a binary segmentation mask volume and a scalar EF value. The architecture has three components. A shared 3D encoder, a lightweight 3D convolutional backbone inspired by ResNet50, modified for single-channel volumetric inputs, produces a hierarchy of spatiotemporal feature maps across four resolution levels via strided 3D convolutions with batch normalization and ReLU. A segmentation decoder reconstructs the full-resolution 3D mask via transposed convolutions and skip connections. An EF regression head applies global average pooling at the bottleneck followed by two fully connected layers (512→128→1) with dropout (p=0.3).

Training minimizes a joint loss with equal weighting: the sum of binary cross-entropy and Dice loss for segmentation plus MSE for EF regression, where the segmentation target is the synthetic volume from the respective 2D generator and the EF target is the ground-truth EF from EchoNet-Dynamic.

Full methodology overview: 2D ensemble synthetic volume generation pipeline to 3D multitask network

Fig. 1: Complete methodology overview — from 2D segmentation ensemble and synthetic 3D volume generation (Phase One) through real-world validation on MITEA (Phase Two).

Important Note on Phase One Evaluation

The 3D models are trained on synthetic volumes (pseudo-labels produced by the 2D generators) and are also evaluated by comparing predictions against those same synthetic volumes on the held-out test set. The Dice scores reported in Phase One therefore measure how well the 3D model reproduces the style of its specific 2D generator, they do not directly measure agreement with human-annotated ground truth. The rigorous ground-truth comparison is presented in Phase Two.

• • •

Phase One: Results and Analysis

Model Efficiency Comparison

Table 1 contextualizes the computational footprint of all models. The proposed 3D multitask model is dramatically more compact and faster than every 2D model in the ensemble, over 26× fewer parameters than the lightest 2D model (DeepLabV3+ MobileNetV3-L, 11M) and over 260× fewer than ViT-L-16. Its inference time of 0.6156 s is 3.4× faster than FCN-ResNet50 (1.87 s) and 11× faster than ViT-L-16 (6.79 s).

Model	Parameters	Size (MB)	Inference (s)
DeepLabV3+ ResNet50	41,998,934	160.43	2.0907
DeepLabV3+ ResNet101	60,991,062	233.08	3.9142
DeepLabV3+ MobileNetV3-L	11,024,188	42.16	2.1450
FCN-ResNet50	35,311,958	134.91	1.8698
FCN-ResNet101	54,304,086	207.56	3.4331
UNet	34,527,041	131.76	2.0870
UNet++	36,629,633	139.79	4.4062
U²-Net	44,009,869	168.00	5.4076
Double UNet	29,288,886	111.76	2.8303
R2U-Net	39,091,393	149.17	5.5711
AttU-Net	34,878,573	133.11	2.3156
R2AttU-Net	39,442,925	150.52	5.7607
ViT-B-16	91,560,145	349.29	2.9406
ViT-L-16	310,204,625	1183.35	6.7873
Proposed 3D Multitask Model	1,183,682	4.73	0.6156

Table 1: Model parameter counts, storage sizes, and single-frame inference times.

2D Segmentation Model Performance

Table 2 summarizes train and validation Dice scores for all fourteen 2D segmentation models on EchoNet-Dynamic ED/ES frames. These results establish the quality of the synthetic labels generated in Stage 2. The UNet family consistently leads in validation performance, with UNet achieving the best validation Dice (0.9272). Attention-based recurrent variants exhibit high training scores (0.98+) but show evidence of overfitting, R2AttU-Net's validation Dice drops to 0.7543. Vision Transformer models are competitive (validation ≈ 0.9069) but consistently trail the best CNNs.

Model	Train Dice	Validation Dice
DeepLabV3+ MobileNetV3-L	0.8978	0.8941
DeepLabV3+ ResNet50	0.9155	0.9132
DeepLabV3+ ResNet101	0.9193	0.9174
FCN-ResNet50	0.9171	0.9173
FCN-ResNet101	0.9193	0.9174
UNet	0.9514	0.9272
UNet++	0.9585	0.9254
U²-Net	0.9329	0.9002
AttU-Net	0.9841	0.9268
Double UNet	0.9481	0.9216
R2AttU-Net	0.9822	0.7543
R2U-Net	0.9822	0.9152
ViT-B-16	0.9186	0.9069
ViT-L-16	0.9124	0.9055

Table 2: Train and validation Dice similarity coefficients for all 2D models on EchoNet-Dynamic (ED/ES frames). Evaluated against synthetic labels.

3D Multitask Model Performance on Synthetic Data

Table 3 reports test Dice scores and EF MSE for the fourteen 3D multitask models, each trained and evaluated on synthetic volumes from its respective 2D generator. FCN-ResNet50 as generator yields the highest test Dice (0.9701), while Double UNet achieves the lowest EF MSE (163.93) with a near-identical Dice (0.9698), representing the Pareto-optimal frontier. Transformer-based generators consistently produce lower Dice (≈0.946) and higher EF MSE (≈170–176), indicating ViT models trained solely on sparse ED/ES annotations do not generalize well to mid-cycle frames. Deeper CNN backbones outperform lighter counterparts in Dice, but EF error differences remain narrow (165–168 range), suggesting EF regression is relatively robust to generator choice as long as the generator is CNN-based.

Synthetic Generator	Test Dice	Test EF MSE
FCN-ResNet50	0.9701	165.78
Double UNet	0.9698	163.93
FCN-ResNet101	0.9692	167.83
UNet	0.9672	166.36
DeepLabV3+ ResNet101	0.9666	166.74
AttU-Net	0.9633	168.63
DeepLabV3+ ResNet50	0.9626	167.21
UNet++	0.9618	167.07
R2U-Net	0.9610	175.07
R2AttU-Net	0.9608	169.43
ViT-L-16	0.9465	176.41
DeepLabV3+ MobileNetV3-L	0.9462	169.24
ViT-B-16	0.9455	170.12
U²-Net	0.9297	164.75

Table 3: Test Dice and EF MSE for 3D multitask models. Note: metrics are measured against synthetic labels, not human-annotated ground truth.

• • •

Phase Two: Empirical Validation and Methodological Refinements

Experiment 1: 2D Model Test Performance on Real Ground Truth

Phase One reported only train and validation Dice scores. To establish a fair reference point, all fourteen 2D segmentation models were evaluated on the official EchoNet-Dynamic test set against human-annotated ED/ES masks. Table 4 presents the complete results.

Several important patterns emerge. A systematic drop from validation to test Dice is observed across all models. R2AttU-Net, despite its poor validation Dice (0.7543), achieves the best test Dice (0.8932), a reversal explained by the attention-gated recurrent structure's expressive recovery beyond the specific validation split. R2U-Net closely follows at 0.8890. The U²-Net test Dice (0.7300) is notably lower than its validation score (0.9002), indicating overfitting. Vision Transformer models suffer a catastrophic drop: ViT-B-16 drops from 0.9069 to 0.5867, and ViT-L-16 from 0.9055 to 0.6422, suggesting pure transformer architectures are ill-suited for LV segmentation under the limited-label (ED/ES only) regime.

Model	Train Dice	Val Dice	Test Dice
DeepLabV3+ MobileNetV3-L	0.8978	0.8941	0.8045
DeepLabV3+ ResNet50	0.9155	0.9132	0.8253
DeepLabV3+ ResNet101	0.9193	0.9174	0.8624
FCN-ResNet50	0.9171	0.9173	0.8514
FCN-ResNet101	0.9193	0.9174	0.8624
UNet	0.9514	0.9272	0.8511
UNet++	0.9585	0.9254	0.8654
U²-Net	0.9329	0.9002	0.7300
AttU-Net	0.9841	0.9268	0.8867
Double UNet	0.9481	0.9216	0.8398
R2AttU-Net	0.9822	0.7543	0.8932
R2U-Net	0.9822	0.9152	0.8890
ViT-B-16	0.9186	0.9069	0.5867
ViT-L-16	0.9124	0.9055	0.6422

Table 4: Complete train/validation/test Dice scores for all 2D segmentation models evaluated against human-annotated ground-truth masks on EchoNet-Dynamic.

Experiment 2: Fair Comparison of the 3D Model Against Real Ground Truth

The Phase One 3D model was assessed against synthetic labels it was trained to reproduce, not a fair comparison to the 2D models evaluated against human annotations. To construct a fair comparison, the best-performing 3D multitask model was evaluated on the original 2D ground-truth masks from EchoNet-Dynamic. Since the 3D network requires volumetric input, each 2D ground-truth mask was expanded with a singleton temporal dimension into a 1×224×224 volume; the model's predicted 3D mask was then projected back to 2D and compared using the Dice coefficient.

Result: Test Dice of 0.6782 on Real Ground Truth

The 3D model, which appeared highly competitive in Phase One (Dice up to 0.9701), is in fact substantially weaker than its own generators when judged against human annotations. It falls below every CNN-based 2D model and is comparable only to the worst-performing ViT models (ViT-B-16: 0.5867, ViT-L-16: 0.6422).

This result demonstrates a fundamental performance ceiling effect. The 3D model's training labels are themselves the predictions of a 2D model, predictions that contain that model's specific errors and biases. The 3D model learns to reproduce those imperfect predictions, and its accuracy is therefore bounded above by the fidelity of its synthetic labels. In practice, the 3D model performs worse than this bound suggests, because it additionally suffers from the domain mismatch between smooth, model-biased synthetic masks and irregular human-annotated contours.

This finding also reveals a conceptual weakness in the joint learning objective. EF is a derived clinical metric computed from the volumetric ratio of end-diastolic to end-systolic LV volumes. Asking the network to simultaneously regress EF while learning segmentation from pseudo-labels creates a circular learning signal, the EF target is computed from ground-truth anatomy, but the segmentation target is synthetic and may not accurately encode that anatomy. This conflation hampers both tasks and suggests that EF should be treated as a post-processing output derived deterministically from the predicted segmentation volume, rather than as an independent regression target during training.

Experiment 3: Transition to Real 3D Data with the MITEA Dataset

To entirely bypass the synthetic data bottleneck, experiments were extended to MITEA, a real-world 3D transesophageal echocardiography (3D TEE) dataset providing genuine volumetric annotations of the LV with voxel-level ground-truth segmentation and associated clinical measurements including EF.

Umbra UNet: A Novel Hybrid 3D Architecture

Fig. 2: Architecture of the proposed Umbra UNet. The encoder integrates modular vision blocks (InceptionNeXt, FasterNet, ConvNeXtV2) in parallel streams, fusing local-detail and global-context features before passing them through the skip-connection decoder for volumetric mask reconstruction.

Motivated by the limitations of the simple Phase One architecture, Umbra UNet is a hybrid 3D encoder-decoder integrating multiple modern lightweight vision blocks within a U-Net-style skeleton. The multi-branch parallel encoder employs three modern vision modules: InceptionNeXt blocks using decomposed depth-wise convolutions with multi-kernel branches to capture multi-scale local texture efficiently; FasterNet blocks using a partial convolution strategy that applies convolutions only to a subset of feature channels, dramatically reducing computational cost; and ConvNeXtV2 blocks with depthwise convolutions, global response normalization, and inverted bottleneck design. Features from all branches are fused via channel concatenation and a 1×1×1 projection convolution at each encoder level. A symmetric transposed-convolution decoder with skip connections reconstructs the full-resolution 3D segmentation mask, and the EF regression head remains at the bottleneck.

Comparative Evaluation on MITEA

Model	Train Dice	Train MSE	Val Dice	Val MSE	Test Dice	Test MSE
Proposed 3D Model (random init.)	0.8846	374.18	0.8747	81.02	0.8540	141.58
Proposed 3D Model (EchoNet pretrained)	0.9085	373.41	0.9115	80.95	0.7815	120.14
Umbra UNet 3D (full)	0.9387	59.33	0.9383	13.13	0.7474	283.71
Umbra UNet 3D (FasterNet only)	0.9327	58.48	0.9316	13.16	0.8456	320.58
Umbra UNet 3D (InceptionNeXt only)	0.9192	56.36	0.9219	12.87	0.8269	542.60
Umbra UNet 3D (ConvNeXtV2 only)	0.9441	71.44	0.9420	15.90	0.7353	628.34
3D UNet	0.9327	446.12	0.9281	96.37	0.8464	85.77

Table 5: Performance on the MITEA real-world 3D echocardiography dataset. Best test Dice and test MSE values highlighted.

Synthetic pretraining hurts generalization. The 3D model trained from random initialization achieves test Dice 0.8540, whereas the same model pretrained on EchoNet-Dynamic synthetic volumes drops to 0.7815 after fine-tuning on MITEA. The model pretrained on synthetic labels has learned features specific to the 2D-generator output style, smooth, morphologically post-processed mask boundaries, incompatible with the real 3D annotation style in MITEA. Pretraining thus serves as domain corruption rather than beneficial initialization. Interestingly, EF MSE improves slightly with pretraining (120.14 vs. 141.58), suggesting the functional regression branch transfers more readily than the segmentation branch.

Model complexity must be matched to data volume. The full Umbra UNet achieves excellent train and validation performance (Dice >0.93, Val MSE ≈13) but generalizes poorly to the test set (Dice 0.7474, MSE 283.71). By contrast, simpler architectures, the standard 3D UNet and the single-block FasterNet variant, achieve considerably better test Dice (≈0.85), confirming that on limited real 3D medical data, inductive biases constraining model capacity are more valuable than raw representational power. The ConvNeXtV2-only variant, which has the highest training Dice (0.9441), generalizes the least (test Dice 0.7353).

Segmentation and EF regression are difficult to jointly optimize. The 3D UNet achieves strong test Dice (0.8464) and the best test MSE (85.77), suggesting its conservative capacity provides implicit regularization beneficial to both tasks. However, models that achieve the best validation EF MSE (Umbra UNet InceptionNeXt: 12.87) perform poorly on test MSE (542.60), a factor of 42, indicating severe overfitting of the regression head on the limited MITEA dataset.

• • •

Discussion

The gap between a test Dice of 0.9701 (synthetic) and 0.6782 (real ground truth) for the same model is the most significant quantitative finding of this work. This gap arises from at least three sources: label noise from systematic prediction errors in mid-cycle frames far from ED/ES where generator confidence is lowest; style mismatch between smooth model-predicted boundaries and more conservative human annotations following the endocardial border; and temporal smoothing artifacts from morphological post-processing that may further deviate from annotator behavior.

A core design issue is the joint optimization of LV segmentation and EF prediction. EF is a clinically defined quantity computed deterministically from segmentation volumes: EF = (V_ED − V_ES) / V_ED × 100%. Treating EF as an independent regression target forces the model to learn a quantity that can be exactly derived from its segmentation output, introducing gradient conflicts and diluting the segmentation-specific learning signal. Future work should either train exclusively for segmentation and compute EF analytically, or pair segmentation with a more complementary secondary task such as motion field estimation or myocardial strain quantification.

The Umbra UNet demonstrates that hybrid multi-block encoders can achieve excellent validation performance (Dice >0.93, Val MSE <14), substantially better than the simple Phase One model on real 3D data. However, the full architecture overtly overfits on the MITEA dataset scale. Future deployment should incorporate stronger regularization (stochastic depth, mixup for 3D volumes, test-time augmentation) and evaluation on larger 3D clinical datasets. The FasterNet-only variant's balance of test Dice (0.8456) and structural accuracy suggests partial convolution-based feature extraction is a promising direction for data-efficient 3D echocardiographic segmentation.

This work illustrates a pitfall not unique to echocardiography: when synthetic data generation pipelines are evaluated only within their own distribution, performance appears far better than in clinical reality. The field must be vigilant about establishing rigorous ground-truth benchmarks and reporting test performance on independent held-out splits with human annotations. The Phase Two single-frame 3D expansion protocol for fair 2D-versus-3D comparison provides a reusable methodology for future researchers.

• • •

Conclusion and Future Directions

This paper presented a rigorous two-phase investigation of 3D multitask learning for left ventricular segmentation and ejection fraction estimation in echocardiography. Phase One developed a compact and efficient 3D multitask network (1.18M parameters, 4.73 MB, sub-second inference) achieving Dice up to 0.9701 on synthetic evaluation data. Phase Two systematically exposed the limitations: when evaluated against real human-annotated ground truth, the model's Dice of 0.6782 falls significantly below the best 2D models (up to 0.8932), demonstrating a substantial synthetic-to-real domain gap. Further experiments on the authentic 3D MITEA dataset revealed that overly complex architectures overfit on limited clinical data, that synthetic pretraining from EchoNet actively harms MITEA generalization, and that jointly optimizing segmentation and EF regression remains challenging, with the 3D UNet emerging as the most robust baseline (test Dice 0.8464, test EF MSE 85.77).

These findings motivate several concrete future directions: domain generalization strategies including adversarial domain adaptation or confidence-weighted label propagation; decoupled task design where EF is computed analytically from the predicted segmentation volume; 3D ultrasound-specific augmentations including acoustic shadowing simulation and synthetic cardiac motion perturbations; architecture search targeting generalization under limited 3D clinical data; and replacing EF regression with motion analysis or myocardial strain estimation as a secondary task, providing information orthogonal to the segmentation mask rather than derived from it.

The negative results reported in this work are, in themselves, a positive scientific contribution: they clearly delineate where synthetic-data-based approaches succeed and where they fall short, providing future researchers with principled guidance for designing more clinically valid automated cardiac analysis systems.

References

Behnami, D., et al. "Dual-view joint estimation of left ventricular ejection fraction with uncertainty modelling in echocardiograms." MICCAI, pp. 256–264. Springer, 2019.
Reynaud, H., et al. "Ultrasound video transformers for cardiac ejection fraction estimation." MICCAI, pp. 516–525. Springer, 2021.
Tabuco, F. C. A., et al. "Two-View Left Ventricular Segmentation and Ejection Fraction Estimation in 2D Echocardiograms." BMVC, 2022.
Kang, X., et al. "A light-weight deep video network: towards robust assessment of ejection fraction on mobile devices." Medical Imaging 2022: Image-Guided Procedures, Vol. 12034. SPIE, 2022.
Fazry, L., et al. "Hierarchical vision transformers for cardiac ejection fraction estimation." IWBIS 2022. IEEE, 2022.
Blaivas, M., and L. Blaivas. "Machine learning algorithm using publicly available echo database for simplified visual estimation of LVEF." World Journal of Experimental Medicine 12.2 (2022): 16–28.
Dai, W., et al. "Cyclical self-supervision for semi-supervised ejection fraction prediction from echocardiogram videos." IEEE Transactions on Medical Imaging 42.5 (2022): 1446–1461.
Maani, F. A., et al. "UniLVSeg: Unified Left Ventricular Segmentation with Sparsely Annotated Echocardiogram Videos." arXiv:2304.01723, 2023.
Varalakshmi, P., et al. "Left Ventricular Ejection Fraction Estimation for Pediatric Patients using CNN." ICSCAN 2023. IEEE, 2023.
Rahman, S., et al. "Deep learning-based left ventricular ejection fraction estimation from echocardiographic videos." EASCT 2023. IEEE, 2023.
Muldoon, M., and N. Khan. "Lightweight and interpretable left ventricular ejection fraction estimation using mobile U-Net." IEEE ISBI 2023. IEEE, 2023.
Maani, F., et al. "SimLVSeg: Simplifying left ventricular segmentation in 2-D+time echocardiograms." Ultrasound in Medicine & Biology 50.12 (2024): 1945–1954.
Carrera-Pinzón, A. F., et al. "Characterizing the Left Ventricular Ultrasound Dynamics in the Frequency Domain to Estimate the Cardiac Function." MICCAI. Springer, 2024.
Batool, S., I. A. Taj, and M. Ghafoor. "EFNet: A multitask deep learning network for simultaneous quantification of left ventricle structure and function." Physica Medica 125 (2024): 104505.
Chen, X., et al. "Research on automatic segmentation of the left ventricular echocardiogram and calculation of ejection fraction." AEMCSE 2024, Vol. 13229. SPIE, 2024.
Lin, J., et al. "Dynamic-guided spatiotemporal attention for echocardiography video segmentation." IEEE Transactions on Medical Imaging (2024).
Luong, C. L., et al. "Validation of machine learning models for estimation of LVEF on point-of-care ultrasound." Echo Research & Practice 11.1 (2024): 9.
Ta, K., et al. "Multi-task learning for motion analysis and segmentation in 3D echocardiography." IEEE Transactions on Medical Imaging 43.5 (2024): 2010–2020.
Shen, C., et al. "CardiacField: computational echocardiography for automated heart function estimation." European Heart Journal–Digital Health 6.1 (2025): 137–146.
Pieszko, K., et al. "Artificial intelligence to measure left atrial ejection fraction in transthoracic echocardiography videos." European Heart Journal–Cardiovascular Imaging 26.Supplement_1 (2025): jeae333-049.
Ouyang, D., et al. "Video-based AI for beat-to-beat assessment of cardiac function." Nature 580 (2020): 252–256.
Yu, W., et al. "InceptionNeXt: When Inception Meets ConvNeXt." CVPR 2024.
Chen, J., et al. "Run, Don't Walk: Chasing Higher FLOPS for Faster Neural Networks." CVPR 2023.
Woo, S., et al. "ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders." CVPR 2023.

Contact

Nikhileswara Rao Sulake, nikhil01446@gmail.com · LinkedIn · GitHub

A Two-Phase Framework for 3D Left Ventricular Analysis: From Synthetic Data Generation to Real-World Clinical Validation

Abstract

Introduction

Main Contributions

Related Work

Phase One: Synthetic Volume Generation and 3D Multitask Learning

Dataset: EchoNet-Dynamic

Stage 1: Training the 2D Segmentation Ensemble

Stage 2: Synthetic 3D Volume Generation

Stage 3: 3D Multitask Network Architecture

Important Note on Phase One Evaluation

Phase One: Results and Analysis

Model Efficiency Comparison

2D Segmentation Model Performance

3D Multitask Model Performance on Synthetic Data

Phase Two: Empirical Validation and Methodological Refinements

Experiment 1: 2D Model Test Performance on Real Ground Truth

Experiment 2: Fair Comparison of the 3D Model Against Real Ground Truth

Result: Test Dice of 0.6782 on Real Ground Truth

Experiment 3: Transition to Real 3D Data with the MITEA Dataset

Umbra UNet: A Novel Hybrid 3D Architecture

Comparative Evaluation on MITEA

Discussion

Conclusion and Future Directions

References

Contact