This page contains press release content distributed by XPR Media. Members of the editorial and news staff of the USA TODAY Network were not involved in the creation of this content.

Clockwork.io Introduces A New Class of Fault Tolerance to End Failure-Driven GPU Waste in AI Training

New TorchPass solution addresses a multimillion-dollar challenge in AI infrastructure; uses Live GPU Migration to keep large-scale AI training running through hardware failures instead of forcing costly restarts

PALO ALTO, CA / ACCESS Newswire / March 11, 2026 / Clockwork.io, the leader in Software-Driven AI Fabrics (a programmable, vendor-neutral software layer that optimizes large-scale GPU clusters for real-time observability, fault tolerance, and deterministic performance), today announced the general availability of TorchPass Workload Fault Tolerance. This new class of software-driven fault tolerance eliminates one of the most costly failure modes in large-scale AI training: catastrophic job restarts caused by infrastructure faults.

Delivered as a core capability of the Clockwork.io FleetIQ platform, TorchPass applies the principles of Software-Driven AI Fabrics to distributed training, using Live GPU Migration to allow workloads to continue running through GPU failures, network disruptions, driver bugs, and even full node crashes, without checkpoint restarts or lost progress.

“Companies are investing billions in next-gen chips, yet the cost of running distributed AI jobs remains grossly inflated because the ecosystem has accepted failure as a constant,” said Suresh Vasudevan, CEO of Clockwork.io. “We built TorchPass to fundamentally reject that premise. Instead of treating failure as inevitable and restarting after the fact, TorchPass makes infrastructure faults invisible to the workload: training continues through failures transparently, in software. For a typical 2,048-GPU deployment, that translates into over $6 million a year in recovered compute. This is what our Software-Driven AI Fabric approach was designed to deliver: fault-tolerant AI infrastructure.”

Dylan Patel, Founder and CEO of SemiAnalysis, agreed that large-scale training jobs are limited by interruptions.

“As Blackwell clusters roll out with an NVL72 domain, and we look to the future with Rubin Ultra’s NVL576 domain, the idea that a single GPU error or network link flap can take down an entire run is totally unacceptable,” said Patel. “TorchPass solves a huge challenge with cluster reliability: it provides transparent failover and live workload migration that keeps MFU high, which in turn drives better GPU economics.”

Why AI Training Fails at Scale

Distributed AI training remains one of the most failure-prone workloads in modern infrastructure. As cluster sizes grow, fragility increases sharply. Research from Meta FAIR shows that mean time to failure drops to 7.9 hours in a 1,024-GPU cluster and to just 1.8 hours at 16,384 GPUs. For most large AI-focused enterprises and AI clouds, failure-driven restarts are therefore inevitable, making them a major barrier to scaling AI’s impact.
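As a rough illustration of what those figures imply, the Python sketch below converts the two quoted MTTF values into expected failures per day and GPU-hours accumulated between failures. It also compares them with a naive model in which failures are independent and MTTF shrinks in proportion to GPU count. That model and all function names are assumptions made here for illustration; they are not taken from the Meta FAIR research or from Clockwork.io.

```python
# MTTF figures quoted above, in hours, keyed by cluster size in GPUs.
REPORTED_MTTF_HOURS = {1024: 7.9, 16384: 1.8}


def failures_per_day(cluster_mttf_hours: float) -> float:
    """Expected failures in 24 hours if failures arrive at a steady rate."""
    return 24.0 / cluster_mttf_hours


def gpu_hours_between_failures(num_gpus: int, cluster_mttf_hours: float) -> float:
    """Aggregate GPU-hours of work accumulated between two failures."""
    return num_gpus * cluster_mttf_hours


def naive_scaled_mttf(base_gpus: int, base_mttf_hours: float, target_gpus: int) -> float:
    """MTTF at target_gpus under the ASSUMED 1/N scaling (independent failures)."""
    return base_mttf_hours * base_gpus / target_gpus


for gpus, mttf in REPORTED_MTTF_HOURS.items():
    print(f"{gpus:>6} GPUs: {failures_per_day(mttf):.1f} failures/day, "
          f"{gpu_hours_between_failures(gpus, mttf):,.0f} GPU-hours between failures")

naive = naive_scaled_mttf(1024, REPORTED_MTTF_HOURS[1024], 16384)
print(f"Naive 1/N estimate at 16,384 GPUs: {naive:.2f} hours")
```

Under these assumptions, a 16,384-GPU cluster would see roughly 13 failures a day, against about 3 at 1,024 GPUs. The quoted 1.8 hours is longer than the naive 1/N estimate of about half an hour, which suggests the simple independence model is too pessimistic and should be read only as a back-of-the-envelope check.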

Each failure forces training jobs to roll back to the most recent checkpoint, discarding minutes or hours of completed work and wasting additional time on manual intervention, reprovisioning resources and restarting training. These restarts silently cap GPU utilization, making reliability one of the largest hidden costs in AI infrastructure.

TorchPass tackles this problem proactively, resolving costly AI workload failures before the job stops or needs to restart. Vital for enterprises running large AI workloads and for AI clouds alike, TorchPass dramatically improves workload reliability and cluster utilization. AI clouds can now address impacted GPUs while preserving the training run as planned, which translates into better customer SLAs and stronger overall economics, improving their ability to protect margin and deliver new models sooner.

“Managing compute output across large-scale GPU clusters is vital to ensuring we’re delivering reliable capacity to our customers. By using TorchPass we have the support of a company that focuses on resilience like it is a core business function: it replaces any specific failing GPU and keeps the rest of the job moving, rather than making one small problem impact our large-scale operations,” said David Power, CTO of Nscale. “In our evaluation, Live GPU Migration preserved both run continuity and throughput under real fault conditions, which is exactly what you need to deliver predictable time-to-train and a better customer experience at scale.”

How Live GPU Migration Works: Reliability Without Restart

TorchPass performs transparent, in-flight migration of impacted training ranks to spare resources when failures occur. TorchPass typically completes recovery in approximately three minutes while the training process continues uninterrupted.

It supports resilience across three failure scenarios:

  • Unplanned migration, handling sudden events such as kernel crashes, power failures, or GPU faults by reconstructing state from healthy replicas

  • Pre-emptive migration, triggered by early warning signals such as rising temperatures or ECC memory errors, enabling controlled migration before a hard failure

  • Planned migration, enabling maintenance, patching, and workload rebalancing without interrupting training

This approach reduces wasted training progress by 95%, cutting lost time from approximately three hours per day to under ten minutes in a 1,024-GPU cluster.
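The relationship between those two figures can be checked with a few lines of arithmetic. The sketch below treats "approximately three hours" as 180 minutes and "under ten minutes" as an upper bound of 10 minutes; both readings, and the function names, are assumptions made for illustration.

```python
GPUS = 1024                 # cluster size quoted in the text
LOST_MIN_BEFORE = 3 * 60    # "approximately three hours per day", read as 180 minutes
LOST_MIN_AFTER = 10         # "under ten minutes", taken as an upper bound


def reduction_fraction(before_min: float, after_min: float) -> float:
    """Fraction of lost time eliminated when going from before_min to after_min."""
    return 1.0 - after_min / before_min


def recovered_gpu_hours_per_day(gpus: int, before_min: float, after_min: float) -> float:
    """GPU-hours per day returned to training across the whole cluster."""
    return gpus * (before_min - after_min) / 60.0


reduction = reduction_fraction(LOST_MIN_BEFORE, LOST_MIN_AFTER)
recovered = recovered_gpu_hours_per_day(GPUS, LOST_MIN_BEFORE, LOST_MIN_AFTER)

print(f"Reduction in lost time: at least {reduction:.1%}")
print(f"Recovered per day: at least {recovered:,.0f} GPU-hours")
```

With ten minutes as the upper bound, the reduction is at least about 94.4%, consistent with the rounded 95% figure, and corresponds to roughly 2,900 GPU-hours per day recovered on a 1,024-GPU cluster.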

Jordan Nanos, Member of Technical Staff and lead author of ClusterMAX (SemiAnalysis’ independent benchmark for large-scale AI training), stress tested Clockwork.io TorchPass and found it delivered leading performance and efficiency for large-scale distributed training, enabling users to reduce checkpointing overhead in training. He shared the following results:

“In our testing, Clockwork.io TorchPass delivered the fastest and most efficient fault-tolerant performance for a gpt-oss-120B training run. We used TorchTitan on a Kubernetes cluster with 64x H200 GPUs. During our testing we measured job completion time (JCT) and Model FLOPs Utilization (MFU) against a standard approach (checkpoint-restart) and the leading open-source fault-tolerant training framework (TorchFT). We simulated multiple hardware failures on the cluster in order to stress test the fault-tolerant training frameworks.

When compared to checkpoint-restart, TorchPass was significantly faster to recover from failures. This reduced overall JCT and maintained high MFU. And when compared to TorchFT, TorchPass had a significantly higher MFU. This reduced overall JCT while also maintaining an equal time to recover from failures.

Using TorchPass also has a downstream effect where it provides users with an opportunity to reduce or even remove checkpointing from their training code. This means larger effective batch sizes, lower risk of out of memory errors (OOMs), and less time spent thinking about storage. For a research organization, this can ultimately mean a faster time to reach their training objective,” concluded Nanos.

Measurable Business Impact from Software-Driven Fault-Tolerance

For customers operating large AI clusters, the impact is immediate and measurable. In a typical 2,048-GPU H200 deployment, TorchPass Workload Fault Tolerance delivers over $6 million in annual savings by preventing wasted compute.

These savings come from eliminating hundreds of thousands of GPU-hours that would otherwise be lost to failure-driven restarts, cascading retries, and idle recovery time. By keeping training jobs running through infrastructure faults instead of restarting them, TorchPass converts lost GPU time into productive training, significantly improving the return on GPU investments that today often operate at just 30-50% of theoretical performance.
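To put the headline figure in per-GPU terms, the sketch below normalizes the quoted annual savings by the size of the deployment. It uses only the two numbers stated above plus a 365-day year of continuous operation, which is an assumption made here; because the release says "over $6 million", the results are lower bounds.

```python
ANNUAL_SAVINGS_USD = 6_000_000   # "over $6 million", used as a lower bound
GPUS = 2048                      # deployment size quoted in the text
HOURS_PER_YEAR = 24 * 365        # ASSUMED: cluster runs around the clock all year


def savings_per_gpu_year(savings_usd: float, gpus: int) -> float:
    """Annual savings attributed to each GPU in the deployment."""
    return savings_usd / gpus


def savings_per_capacity_gpu_hour(savings_usd: float, gpus: int, hours: int) -> float:
    """Savings spread over every GPU-hour of installed capacity."""
    return savings_usd / (gpus * hours)


per_gpu = savings_per_gpu_year(ANNUAL_SAVINGS_USD, GPUS)
per_hour = savings_per_capacity_gpu_hour(ANNUAL_SAVINGS_USD, GPUS, HOURS_PER_YEAR)

print(f"Savings per GPU per year: at least ${per_gpu:,.2f}")
print(f"Savings per GPU-hour of capacity: at least ${per_hour:.3f}")
```

On these assumptions, the quoted figure works out to a little over $2,900 per GPU per year, or about $0.33 for every GPU-hour of installed capacity.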

Enabling the Next Generation of AI Infrastructure

By making reliability a software-defined capability rather than a hardware constraint, TorchPass provides the operational confidence required to deploy next-generation, tightly coupled systems such as NVIDIA GB200 and GB300 NVL72 and future rack-scale systems, where dense architectures amplify the cost of even small failures.

TorchPass builds on Clockwork.io’s prior release of Network Fault Tolerance, which applies the same Software-Driven AI Fabric principles to network resilience by transparently rerouting traffic around link failures.

Together, these capabilities form Clockwork.io’s Software-Driven AI Fabric, a vendor-neutral software layer spanning network, compute, and storage. As modern AI workloads run on tightly coupled clusters where hundreds or thousands of processors must operate in coordinated lockstep, infrastructure behaves as a single system, where reliability and performance directly determine overall efficiency. By managing this complexity in software, Clockwork.io enables operators to run heterogeneous AI infrastructure as a unified platform, maintaining high utilization, predictable performance, and resilience while preserving the flexibility to evolve hardware and improve the economics of large-scale AI deployments.

To learn more about the launch of TorchPass, visit the Clockwork.io team in person at NVIDIA GTC, March 16-19, Booth #205, or visit https://clockwork.io.

About Clockwork.io
Clockwork.io pioneers Software-Driven AI Fabrics™, delivering a programmable software layer that makes large-scale AI clusters observable, deterministic, and resilient by design to drive continuous workload progress and peak cluster utilization. Its FleetIQ platform enables enterprises to train, deploy, and serve the world’s most demanding AI workloads faster, more reliably, and at lower cost. Companies including Uber, Wells Fargo, DCAI, Nebius, Nscale, and White Fiber trust Clockwork.io to power their AI infrastructure. Learn more at www.clockwork.io.

Media Contact
Dana Trismen
clockwork@unshakablemarketinggroup.com
650-269-7478

SOURCE: Clockwork
