NVIDIA's Star Elastic: A Single Checkpoint with Multiple Model Sizes via Nested Weight-Sharing

NVIDIA's Star Elastic embeds multiple nested submodels in one parent checkpoint, reducing training costs. It uses importance estimation and a learnable router to extract variants without fine-tuning.

What Is Star Elastic?

Training a family of large language models (LLMs) has traditionally been a costly endeavor. Each variant—whether 8B, 30B, or 70B—typically requires its own full training run, separate storage, and a dedicated deployment stack. This duplication of effort can significantly inflate compute costs for teams running inference at scale. Now, NVIDIA researchers propose a novel solution called Star Elastic, a post-training method that embeds multiple nested submodels of varying parameter budgets within a single parent reasoning model. All variants are trained together in one run and stored in a single checkpoint, eliminating the need for separate training runs and infrastructure.

How Nesting Works: A Single Model That Houses Smaller Ones

If you are unfamiliar with elastic or nested architectures, the concept is elegant: instead of training three separate models (e.g., 30B, 23B, and 12B parameters), you train one model that contains the smaller ones as subsets. The smaller submodels reuse the most important weights from the parent, identified through a process called importance estimation.
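
To make the nesting concrete, here is a minimal sketch (in NumPy, purely illustrative and not NVIDIA's implementation): once a weight matrix's channels have been reordered by importance, each smaller budget is simply a leading slice of the same tensor, so no weights are duplicated.

```python
import numpy as np

# Minimal sketch of nested weight-sharing. Assume the parent FFN weight
# matrix has already had its channels sorted by importance, most important
# first. A smaller submodel then reuses the leading slice of the same
# tensor -- no copy, no retraining. Shapes here are hypothetical.

parent_ffn = np.random.randn(4096, 16384)  # parent FFN projection weights

def nested_slice(weight: np.ndarray, keep_channels: int) -> np.ndarray:
    """Return the submodel's view: the `keep_channels` most important columns."""
    return weight[:, :keep_channels]

mid_ffn = nested_slice(parent_ffn, 12288)   # e.g. a mid-size budget
small_ffn = nested_slice(parent_ffn, 8192)  # e.g. the smallest budget

# Both smaller matrices are views into the parent tensor, so all three
# model sizes can live in one checkpoint.
assert mid_ffn.base is parent_ffn and small_ffn.base is parent_ffn
```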

Applied to Nemotron Nano v3—a hybrid Mamba–Transformer–MoE model with 30B total parameters and 3.6B active parameters—Star Elastic produces two nested variants: 23B (2.8B active) and 12B (2.0B active). These variants are trained with approximately 160 billion tokens, and all three coexist in a single checkpoint. They can be extracted without any additional fine-tuning.

Importance Estimation: Finding What Matters

Star Elastic scores each model component for its contribution to overall accuracy. The components evaluated include embedding channels, attention heads, Mamba SSM heads, MoE experts, and FFN channels. After scoring, components are sorted by importance, so for any target budget the corresponding submodel automatically uses the highest-ranked contiguous subset. This prefix structure is what enables nested weight-sharing: every smaller variant is literally a slice of the parent's reordered weights.
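
As a rough illustration of scoring and ranking along one axis (FFN channels), the sketch below uses mean absolute activation on a calibration batch as a stand-in importance signal; the actual metric NVIDIA uses, and the other axes listed above, are more involved.

```python
import torch

# Illustrative importance estimation for FFN channels (a sketch of the
# idea, not the exact Nemotron procedure). Each channel is scored by its
# mean absolute activation over calibration data, then both FFN matrices
# are permuted so the highest-scoring channels come first. Any parameter
# budget then maps to a contiguous prefix of channels.

def rank_ffn_channels(w_in: torch.Tensor, w_out: torch.Tensor,
                      calib_acts: torch.Tensor):
    """w_in: (d_model, d_ff); w_out: (d_ff, d_model);
    calib_acts: (n_tokens, d_ff) hidden activations on calibration data."""
    scores = calib_acts.abs().mean(dim=0)           # one score per channel
    order = torch.argsort(scores, descending=True)  # most important first
    return w_in[:, order], w_out[order, :]          # consistently permuted

# After reordering, a half-size submodel keeps only the first d_ff // 2
# channels of both matrices, and larger budgets extend that same prefix.
```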

The method supports nesting along multiple axes: the SSM dimension, embedding channels, attention heads, Mamba heads and their channels, MoE expert count, and FFN intermediate dimension. For MoE layers specifically, Star Elastic introduces Router-Weighted Expert Activation Pruning (REAP). REAP ranks experts by combining both routing gate values and expert output magnitudes—a more principled signal than naive frequency-based pruning, which often overlooks how much each expert truly contributes to the layer output.
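
The REAP criterion can be sketched in a few lines; the exact combination rule below (gate probability times output norm, averaged over tokens) is an assumed simplification of the published method.

```python
import torch

# Sketch of Router-Weighted Expert Activation Pruning (REAP): an expert's
# saliency combines how strongly the router selects it (gate values) with
# how large its outputs actually are, so rarely-but-strongly contributing
# experts are not discarded the way naive frequency counting would.

def reap_ranking(gates: torch.Tensor, expert_outputs: torch.Tensor):
    """gates: (n_tokens, n_experts) routing probabilities;
    expert_outputs: (n_tokens, n_experts, d_model) per-expert outputs."""
    out_norms = expert_outputs.norm(dim=-1)      # (n_tokens, n_experts)
    saliency = (gates * out_norms).mean(dim=0)   # router-weighted magnitude
    return torch.argsort(saliency, descending=True)

# A submodel with a smaller expert budget keeps the top-k experts of this
# ranking, matching the nested-prefix structure used for the other axes.
```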

A Learnable Router, Not a Fixed Compression Recipe

What sets Star Elastic apart from earlier compression methods like Minitron is its use of an end-to-end trainable router to determine nested submodel architectures. The router accepts a target budget as a one-hot input (e.g., “give me a 2.8B active parameter model”) and outputs differentiable masks that select which components are active at that budget. These masks are trained jointly with the model using Gumbel-Softmax, allowing gradients to flow through discrete architectural decisions.
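
A minimal PyTorch sketch of such a budget-conditioned router follows; the shapes, names, and two-way keep/drop parameterization are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of a budget-conditioned router. A one-hot budget vector is mapped
# to per-component logits, and Gumbel-Softmax turns them into keep/drop
# masks that are discrete in the forward pass but differentiable in the
# backward pass, so gradients reach the architectural decision.

class BudgetRouter(nn.Module):
    def __init__(self, n_budgets: int, n_components: int):
        super().__init__()
        # logits over {drop, keep} for every prunable component, per budget
        self.to_logits = nn.Linear(n_budgets, n_components * 2)
        self.n_components = n_components

    def forward(self, budget_onehot: torch.Tensor, tau: float = 1.0):
        logits = self.to_logits(budget_onehot).view(-1, self.n_components, 2)
        # hard=True: discrete 0/1 masks forward, soft gradients backward
        mask = F.gumbel_softmax(logits, tau=tau, hard=True)[..., 1]
        return mask  # (batch, n_components), entries in {0., 1.}

router = BudgetRouter(n_budgets=3, n_components=64)
budget = F.one_hot(torch.tensor([1]), num_classes=3).float()  # "2.8B active"
keep_mask = router(budget)  # multiply component outputs by this mask
```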

The loss function combines knowledge distillation (KD), where the non-elastified parent model acts as the teacher, with additional regularization to encourage clean submodel extraction. This training ensures that each nested submodel performs competitively without requiring its own separate training regimen.
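
Put together, the objective might be sketched as below; the regularizer's form and the weighting factor are assumptions rather than the published loss.

```python
import torch.nn.functional as F

# Sketch of the combined objective: KL-based knowledge distillation from
# the frozen, non-elastified parent, plus a small term pushing the
# router's soft keep-probabilities toward 0/1 so each submodel extracts
# cleanly. `beta` and the temperature T are assumed hyperparameters.

def elastic_loss(student_logits, teacher_logits, soft_mask,
                 beta: float = 0.01, T: float = 2.0):
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    binarize = (soft_mask * (1.0 - soft_mask)).mean()  # 0 when mask is 0/1
    return kd + beta * binarize
```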

Advantages and Practical Implications

The primary advantage of Star Elastic is reduced computational overhead. Instead of training multiple models from scratch, developers train one model that serves multiple size requirements. This saves on GPU hours, storage space, and deployment complexity. Moreover, because all variants share a single checkpoint, switching between model sizes for different tasks or latency budgets becomes instantaneous—no need to load separate weights.

Another benefit is zero-shot slicing: once the parent model is trained, you can extract any nested submodel without any additional fine-tuning. This flexibility is particularly valuable for real-world applications where inference costs must be dynamically adjusted based on request volume or throughput constraints.
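
In practice, zero-shot slicing might look like the hypothetical extraction flow below; the checkpoint file name, layout, and metadata keys are all illustrative assumptions.

```python
import torch

# Hypothetical extraction flow: load the single parent checkpoint once,
# then materialize whichever nested budget current traffic calls for.

state = torch.load("star_elastic_parent.pt", map_location="cpu")

def extract_submodel(state: dict, budget: str) -> dict:
    """Slice every nested tensor down to `budget` ("12B", "23B", ...).
    Per-tensor keep sizes would come from the trained router's masks."""
    keep = state["router_keep_sizes"][budget]  # assumed metadata dict
    return {
        name: tensor[..., : keep.get(name, tensor.shape[-1])]
        for name, tensor in state["weights"].items()
    }

small = extract_submodel(state, "12B")  # no additional fine-tuning needed
```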

Star Elastic also opens the door to more efficient model compression and deployment strategies. By incorporating importance estimation and a learnable router, the method adapts to the model’s internal structure rather than applying a fixed compression recipe. This leads to better preserved accuracy in smaller submodels compared to traditional pruning or distillation approaches.

Future Directions and Conclusion

NVIDIA’s Star Elastic represents a step forward in making large language model families more economical to train and deploy. Future work may explore extending the method to even larger models, incorporating more nesting depths, or integrating it with other efficiency techniques like quantization and pruning. As LLMs continue to grow, approaches like Star Elastic will be crucial for managing the trade-offs between model size, accuracy, and cost.

In summary, Star Elastic offers a pragmatic solution to the “one model per size” problem. By embedding multiple nested submodels into a single parent checkpoint, it reduces training and storage overhead while maintaining flexibility and performance. For teams working with large-scale LLMs, this method could significantly simplify their infrastructure and lower operational costs.
