This page contains press release content distributed by XPR Media. Members of the editorial and news staff of the USA TODAY Network were not involved in the creation of this content.

Clockwork.io Introduces A New Class of Fault Tolerance to End Failure-Driven GPU Waste in AI Training

New TorchPass solution addresses a multi-million dollar challenge with AI infrastructure; uses Live GPU Migration to keep large-scale AI training running through hardware failures instead of forcing costly restarts

PALO ALTO, CA / ACCESS Newswire / March 11, 2026 / Clockwork.io, the leader in Software-Driven AI Fabrics (a programmable, vendor-neutral software layer that optimizes large-scale GPU clusters for real-time observability, fault tolerance, and deterministic performance), today announced the general availability of TorchPass Workload Fault Tolerance. This new class of software-driven fault tolerance eliminates one of the most costly failure modes in large-scale AI training: catastrophic job restarts caused by infrastructure faults.

Delivered as a core capability of the Clockwork.io FleetIQ platform, TorchPass applies the principles of Software-Driven AI Fabrics to distributed training, using Live GPU Migration to allow workloads to continue running through GPU failures, network disruptions, driver bugs, and even full node crashes, without checkpoint restarts or lost progress.

“Companies are investing billions in next-gen chips, yet the cost of running distributed AI jobs remains grossly inflated because the ecosystem has accepted failure as a constant,” said Suresh Vasudevan, CEO of Clockwork.io. “We built TorchPass to fundamentally reject that premise. Instead of treating failure as inevitable and restarting after the fact, TorchPass makes infrastructure faults invisible to the workload: training continues through failures transparently, in software. For a typical 2,048-GPU deployment, that translates into over $6 million a year in recovered compute. This is what our Software-Driven AI Fabric approach was designed to deliver: fault-tolerant AI infrastructure.”

Dylan Patel, Founder and CEO of SemiAnalysis, agreed that large-scale training jobs are limited by interruptions.

“As Blackwell clusters roll out with an NVL72 domain, and we look to the future with Rubin Ultra’s NVL576 domain, the idea that a single GPU error or network link flap can take down an entire run is totally unacceptable,” said Patel. “TorchPass solves a huge challenge with cluster reliability: it provides transparent failover and live workload migration that keeps MFU high, which in turn drives better GPU economics.”

Why AI Training Fails at Scale

Distributed AI training remains one of the most failure-prone workloads in modern infrastructure. As cluster sizes grow, fragility increases sharply. Research from Meta FAIR shows that mean time to failure drops to 7.9 hours in a 1,024-GPU cluster and to just 1.8 hours at 16,384 GPUs. For most large AI-focused enterprises and AI clouds, failure-driven restarts are therefore inevitable, which makes them a major barrier to scaling AI’s impact.
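To put the quoted mean-time-to-failure (MTTF) figures in daily terms, the short Python sketch below converts each one into an expected number of failures per 24 hours. It assumes failures arrive at a steady average rate of one per MTTF interval, which is a simplification introduced here for illustration and not a claim from the Meta FAIR research; the function and variable names are likewise illustrative.

```python
# Expected failures per day implied by a mean time to failure (MTTF).
# Assumption (not from the press release): failures arrive at a steady
# average rate of one per MTTF interval.

HOURS_PER_DAY = 24.0


def failures_per_day(mttf_hours: float) -> float:
    """Return the average number of failures in 24 hours for a given MTTF."""
    if mttf_hours <= 0:
        raise ValueError("MTTF must be positive")
    return HOURS_PER_DAY / mttf_hours


# MTTF values quoted in the text, keyed by cluster size in GPUs.
quoted_mttf_hours = {1024: 7.9, 16384: 1.8}

for gpus, mttf in quoted_mttf_hours.items():
    print(f"{gpus:>6} GPUs: about {failures_per_day(mttf):.1f} failures per day")
```

Under this assumption, the quoted figures work out to roughly 3 failures per day at 1,024 GPUs and roughly 13 per day at 16,384 GPUs.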

Each failure forces training jobs to roll back to the most recent checkpoint, discarding minutes or hours of completed work and wasting additional time on manual intervention, reprovisioning resources and restarting training. These restarts silently cap GPU utilization, making reliability one of the largest hidden costs in AI infrastructure.

TorchPass addresses this problem by handling costly AI workload failures proactively, resolving them before the job stops or needs to restart. Vital for enterprises running large AI workloads and AI clouds alike, TorchPass dramatically improves workload reliability and cluster utilization. For AI clouds, which can now service impacted GPUs while preserving the training run as planned, this translates into better customer SLAs and stronger overall economics, improving their ability to protect margins and deliver new models sooner.

“Managing compute output across large-scale GPU clusters is vital to ensuring we’re delivering reliable capacity to our customers. By using TorchPass we have the support of a company that focuses on resilience like it is a core business function: it replaces any specific failing GPU and keeps the rest of the job moving, rather than making one small problem impact our large-scale operations,” said David Power, CTO of Nscale. “In our evaluation, Live GPU Migration preserved both run continuity and throughput under real fault conditions, which is exactly what you need to deliver predictable time-to-train and a better customer experience at scale.”

How Live GPU Migration Works: Reliability Without Restart

TorchPass performs transparent, in-flight migration of impacted training ranks to spare resources when failures occur. TorchPass typically completes recovery in approximately three minutes while the training process continues uninterrupted.

It supports resilience across three failure scenarios:

  • Unplanned migration, handling sudden events such as kernel crashes, power failures, or GPU faults by reconstructing state from healthy replicas

  • Pre-emptive migration, triggered by early warning signals such as rising temperatures or ECC memory errors, enabling controlled migration before a hard failure

  • Planned migration, enabling maintenance, patching, and workload rebalancing without interrupting training
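The three scenarios above can be read as a simple classification of trigger events. The Python sketch below is purely illustrative: the event labels, the `MigrationKind` enum, and `classify_event` are hypothetical names invented for this example and do not describe TorchPass's actual interface or detection logic. The groupings follow the examples given in the list.

```python
from enum import Enum


class MigrationKind(Enum):
    """The three migration scenarios named in the list above."""
    UNPLANNED = "unplanned"
    PREEMPTIVE = "pre-emptive"
    PLANNED = "planned"


# Hypothetical event labels, grouped by the examples given in the text.
HARD_FAILURES = {"kernel_crash", "power_failure", "gpu_fault"}
EARLY_WARNINGS = {"rising_temperature", "ecc_memory_error"}
OPERATOR_ACTIONS = {"maintenance", "patching", "workload_rebalancing"}


def classify_event(event: str) -> MigrationKind:
    """Map a (hypothetical) event label to the scenario that would handle it."""
    if event in HARD_FAILURES:
        return MigrationKind.UNPLANNED
    if event in EARLY_WARNINGS:
        return MigrationKind.PREEMPTIVE
    if event in OPERATOR_ACTIONS:
        return MigrationKind.PLANNED
    raise ValueError(f"unrecognized event: {event!r}")


for label in ("gpu_fault", "ecc_memory_error", "patching"):
    print(f"{label} -> {classify_event(label).value} migration")
```

A real system would act on live telemetry rather than string labels; the sketch only shows how the three categories partition the kinds of events the release describes.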

This approach reduces wasted training progress by 95%, cutting lost time from approximately three hours per day to under ten minutes in a 1,024-GPU cluster.
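As a rough consistency check on these figures, the Python sketch below combines the 7.9-hour MTTF quoted earlier for a 1,024-GPU cluster with the approximately three-minute recovery time. It assumes, conservatively, that every failure costs the full recovery time and nothing more; that assumption and the function names are introduced here for illustration and are not taken from Clockwork.io's own methodology.

```python
# Rough check of the "three hours per day to under ten minutes" claim.
# Assumptions (introduced here, not from the release): failures occur at an
# average rate of one per MTTF, and each failure costs exactly the recovery time.

MINUTES_PER_DAY_HOURS = 24.0


def daily_loss_minutes(mttf_hours: float, minutes_lost_per_failure: float) -> float:
    """Average minutes lost per day given an MTTF and a per-failure cost."""
    return (MINUTES_PER_DAY_HOURS / mttf_hours) * minutes_lost_per_failure


def reduction_fraction(baseline_minutes: float, new_minutes: float) -> float:
    """Fraction by which lost time shrinks relative to the baseline."""
    return 1.0 - new_minutes / baseline_minutes


baseline = 3 * 60.0                          # "approximately three hours per day"
with_migration = daily_loss_minutes(7.9, 3.0)  # quoted MTTF and recovery time
saved = reduction_fraction(baseline, with_migration)

print(f"Lost time with migration: about {with_migration:.1f} minutes per day")
print(f"Reduction versus baseline: about {saved:.1%}")
```

Under these assumptions the result is about 9.1 minutes of lost time per day and a reduction of about 94.9%, in line with the "under ten minutes" and 95% figures.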

Jordan Nanos, Member of Technical Staff and lead author of ClusterMAX (SemiAnalysis’ independent benchmark for large-scale AI training), stress-tested Clockwork.io TorchPass and found that it delivered leading performance and efficiency for large-scale distributed training, enabling users to reduce checkpointing overhead. He shared the following results:

“In our testing, Clockwork.io TorchPass delivered the fastest and most efficient fault-tolerant performance for a gpt-oss-120B training run. We used TorchTitan on a Kubernetes cluster with 64x H200 GPUs. During our testing we measured job completion time (JCT) and Model FLOPs Utilization (MFU) against a standard approach (checkpoint-restart) and the leading open-source fault-tolerant training framework (TorchFT). We simulated multiple hardware failures on the cluster in order to stress test the fault-tolerant training frameworks.

“When compared to checkpoint-restart, TorchPass was significantly faster to recover from failures. This reduced overall JCT and maintained high MFU. And when compared to TorchFT, TorchPass had a significantly higher MFU. This reduced overall JCT while also maintaining an equal time to recover from failures.

“Using TorchPass also has a downstream effect: it gives users an opportunity to reduce or even remove checkpointing from their training code. This means larger effective batch sizes, lower risk of out-of-memory errors (OOMs), and less time spent thinking about storage. For a research organization, this can ultimately mean a faster time to reach their training objective,” concluded Nanos.

Measurable Business Impact from Software-Driven Fault-Tolerance

For customers operating large AI clusters, the impact is immediate and measurable. In a typical 2,048-GPU H200 deployment, TorchPass Workload Fault Tolerance delivers over $6 million in annual savings by preventing wasted compute.

These savings come from eliminating hundreds of thousands of GPU-hours that would otherwise be lost to failure-driven restarts, cascading retries, and idle recovery time. By keeping training jobs running through infrastructure faults instead of restarting them, TorchPass converts lost GPU time into productive training, significantly improving the return on GPU investments that today often operate at just 30-50% of theoretical performance.

Enabling the Next Generation of AI Infrastructure

By making reliability a software-defined capability rather than a hardware constraint, TorchPass provides the operational confidence required to deploy next-generation, tightly coupled systems such as NVIDIA GB200 and GB300 NVL72 and future rack-scale systems, where dense architectures amplify the cost of even small failures.

TorchPass builds on Clockwork.io’s prior release of Network Fault Tolerance, which applies the same Software-Driven AI Fabric principles to network resilience by transparently rerouting traffic around link failures.

Together, these capabilities form Clockwork.io’s Software-Driven AI Fabric, a vendor-neutral software layer spanning network, compute, and storage. As modern AI workloads run on tightly coupled clusters where hundreds or thousands of processors must operate in coordinated lockstep, infrastructure behaves as a single system, where reliability and performance directly determine overall efficiency. By managing this complexity in software, Clockwork.io enables operators to run heterogeneous AI infrastructure as a unified platform, maintaining high utilization, predictable performance, and resilience while preserving the flexibility to evolve hardware and improve the economics of large-scale AI deployments.

To learn more about the launch of TorchPass, visit the Clockwork.io team in person at NVIDIA GTC from March 16-19, Booth #205, or visit https://clockwork.io.

About Clockwork.io
Clockwork.io pioneers Software-Driven AI Fabrics™, delivering a programmable software layer that makes large-scale AI clusters observable, deterministic, and resilient by design to drive continuous workload progress and peak cluster utilization. Its FleetIQ platform enables enterprises to train, deploy, and serve the world’s most demanding AI workloads faster, more reliably, and at lower cost. Companies including Uber, Wells Fargo, DCAI, Nebius, Nscale, and White Fiber trust Clockwork.io to power their AI infrastructure. Learn more at www.clockwork.io.

Media Contact
Dana Trismen
clockwork@unshakablemarketinggroup.com
650-269-7478

SOURCE: Clockwork


