This page contains press release content distributed by XPR Media. Members of the editorial and news staff of the USA TODAY Network were not involved in the creation of this content.

Clockwork.io Introduces A New Class of Fault Tolerance to End Failure-Driven GPU Waste in AI Training

New TorchPass solution addresses a multi-million dollar challenge with AI infrastructure; uses Live GPU Migration to keep large-scale AI training running through hardware failures instead of forcing costly restarts

PALO ALTO, CA / ACCESS Newswire / March 11, 2026 / Clockwork.io, the leader in Software-Driven AI Fabrics (a programmable, vendor-neutral software layer that optimizes large-scale GPU clusters for real-time observability, fault tolerance, and deterministic performance), today announced the general availability of TorchPass Workload Fault Tolerance. This new class of software-driven fault tolerance eliminates one of the most costly failure modes in large-scale AI training: catastrophic job restarts caused by infrastructure faults.

Delivered as a core capability of the Clockwork.io FleetIQ platform, TorchPass applies the principles of Software-Driven AI Fabrics to distributed training, using Live GPU Migration to allow workloads to continue running through GPU failures, network disruptions, driver bugs, and even full node crashes, without checkpoint restarts or lost progress.

“Companies are investing billions in next-gen chips, yet the cost of running distributed AI jobs remains grossly inflated because the ecosystem has accepted failure as a constant,” said Suresh Vasudevan, CEO of Clockwork.io. “We built TorchPass to fundamentally reject that premise. Instead of treating failure as inevitable and restarting after the fact, TorchPass makes infrastructure faults invisible to the workload: training continues through failures transparently, in software. For a typical 2,048-GPU deployment, that translates into over $6 million a year in recovered compute. This is what our Software-Driven AI Fabric approach was designed to deliver: fault-tolerant AI infrastructure.”

Dylan Patel, Founder and CEO of SemiAnalysis, agreed that large-scale training jobs are limited by interruptions.

“As Blackwell clusters roll out with an NVL72 domain, and we look to the future with Rubin Ultra’s NVL576 domain, the idea that a single GPU error or network link flap can take down an entire run is totally unacceptable,” said Patel. “TorchPass solves a huge challenge with cluster reliability: it provides transparent failover and live workload migration that keeps MFU high, which in turn drives better GPU economics.”

Why AI Training Fails at Scale

Distributed AI training remains one of the most failure-prone workloads in modern infrastructure. As cluster sizes grow, fragility increases sharply. Research from Meta FAIR shows that mean time to failure drops to 7.9 hours in a 1,024-GPU cluster and to just 1.8 hours at 16,384 GPUs. For most large AI-focused enterprises and AI clouds, this means failure-driven restarts are inevitable, which makes reliability a major barrier to scaling AI’s impact.
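To put those mean-time-to-failure figures in perspective, they can be converted into an expected number of interruptions per day. The short sketch below is illustrative only: the function name is hypothetical, and it assumes failures arrive at a steady average rate, which real clusters may not follow.

```python
def failures_per_day(mttf_hours: float) -> float:
    """Expected failures in 24 hours, assuming a constant average failure rate."""
    if mttf_hours <= 0:
        raise ValueError("mttf_hours must be positive")
    return 24.0 / mttf_hours


# Mean-time-to-failure figures quoted in this release (attributed to Meta FAIR).
cited_mttf_hours = {1_024: 7.9, 16_384: 1.8}

for gpus, mttf in cited_mttf_hours.items():
    rate = failures_per_day(mttf)
    print(f"{gpus:>6} GPUs: about {rate:.1f} failures per day")
```

Under that assumption, the quoted figures work out to roughly three interruptions per day at 1,024 GPUs and about thirteen per day at 16,384 GPUs.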

Each failure forces training jobs to roll back to the most recent checkpoint, discarding minutes or hours of completed work and wasting additional time on manual intervention, reprovisioning resources and restarting training. These restarts silently cap GPU utilization, making reliability one of the largest hidden costs in AI infrastructure.
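The cost of a rollback can be estimated with a common rule of thumb that is not taken from this release: on average a failure lands halfway between two checkpoints, so the expected loss per failure is half the checkpoint interval plus the time needed to restart. The sketch below applies that rule. The function name and the checkpoint and restart values are hypothetical and chosen only for illustration.

```python
def lost_hours_per_day(
    mttf_hours: float,
    checkpoint_interval_hours: float,
    restart_overhead_hours: float,
) -> float:
    """Estimate training hours lost per day under checkpoint-restart.

    Assumes a steady failure rate and that each failure lands, on average,
    halfway through a checkpoint interval.
    """
    failures = 24.0 / mttf_hours
    lost_per_failure = checkpoint_interval_hours / 2.0 + restart_overhead_hours
    return failures * lost_per_failure


# Hypothetical inputs: 30-minute checkpoints and a 15-minute restart,
# combined with the 7.9-hour MTTF quoted for a 1,024-GPU cluster.
example = lost_hours_per_day(
    mttf_hours=7.9,
    checkpoint_interval_hours=0.5,
    restart_overhead_hours=0.25,
)
print(f"Estimated loss: {example:.2f} hours per day")
```

Shorter checkpoint intervals reduce the rollback term but add checkpointing overhead of their own, which this simple model leaves out.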

TorchPass tackles this problem proactively, resolving costly AI workload failures before the job stops or needs to restart. For enterprises running large AI workloads and for AI clouds alike, TorchPass dramatically improves workload reliability and cluster utilization. AI clouds can now address impacted GPUs while preserving the training run as planned, which translates into better customer SLAs and stronger overall economics, improving their ability to protect margin and deliver new models sooner.

“Managing compute output across large-scale GPU clusters is vital to ensuring we’re delivering reliable capacity to our customers. By using TorchPass we have the support of a company that focuses on resilience like it is a core business function: it replaces any specific failing GPU and keeps the rest of the job moving, rather than making one small problem impact our large-scale operations,” said David Power, CTO of Nscale. “In our evaluation, Live GPU Migration preserved both run continuity and throughput under real fault conditions, which is exactly what you need to deliver predictable time-to-train and a better customer experience at scale.”

How Live GPU Migration Works: Reliability Without Restart

TorchPass performs transparent, in-flight migration of impacted training ranks to spare resources when failures occur. TorchPass typically completes recovery in approximately three minutes while the training process continues uninterrupted.

It supports resilience across three failure scenarios:

  • Unplanned migration, handling sudden events such as kernel crashes, power failures, or GPU faults by reconstructing state from healthy replicas

  • Pre-emptive migration, triggered by early warning signals such as rising temperatures or ECC memory errors, enabling controlled migration before a hard failure

  • Planned migration, enabling maintenance, patching, and workload rebalancing without interrupting training

This approach reduces wasted training progress by 95%, cutting lost time from approximately three hours per day to under ten minutes in a 1,024-GPU cluster.
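The two figures in that claim can be checked against each other with simple arithmetic. The sketch below uses only the numbers quoted above; the function name is hypothetical.

```python
def remaining_lost_minutes(baseline_minutes: float, reduction_fraction: float) -> float:
    """Lost time that remains after a fractional reduction is applied."""
    if not 0.0 <= reduction_fraction <= 1.0:
        raise ValueError("reduction_fraction must be between 0 and 1")
    return baseline_minutes * (1.0 - reduction_fraction)


baseline = 3 * 60   # "approximately three hours per day", in minutes
reduction = 0.95    # the quoted 95% reduction

remaining = remaining_lost_minutes(baseline, reduction)
print(f"Remaining lost time: about {remaining:.0f} minutes per day")
```

A 95% reduction of three hours leaves about nine minutes, which is consistent with the "under ten minutes" figure.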

Jordan Nanos, Member of Technical Staff and lead author of ClusterMAX (SemiAnalysis’ independent benchmark for large-scale AI training), stress-tested Clockwork.io TorchPass and found that it delivered leading performance and efficiency for large-scale distributed training, enabling users to reduce checkpointing overhead. He shared the following results:

“In our testing, Clockwork.io TorchPass delivered the fastest and most efficient fault-tolerant performance for a gpt-oss-120B training run. We used TorchTitan on a Kubernetes cluster with 64x H200 GPUs. During our testing we measured job completion time (JCT) and Model FLOPs Utilization (MFU) against a standard approach (checkpoint-restart) and the leading open-source fault-tolerant training framework (TorchFT). We simulated multiple hardware failures on the cluster in order to stress test the fault-tolerant training frameworks.

When compared to checkpoint-restart, TorchPass was significantly faster to recover from failures. This reduced overall JCT and maintained high MFU. And when compared to TorchFT, TorchPass had a significantly higher MFU. This reduced overall JCT while also maintaining an equal time to recover from failures.

Using TorchPass also has a downstream effect where it provides users with an opportunity to reduce or even remove checkpointing from their training code. This means larger effective batch sizes, lower risk of out of memory errors (OOMs), and less time spent thinking about storage. For a research organization, this can ultimately mean a faster time to reach their training objective,” concluded Nanos.
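For readers unfamiliar with the metric, Model FLOPs Utilization is commonly defined as the model FLOPs a job actually achieves per second divided by the theoretical peak of the hardware it runs on. The sketch below shows that ratio, using the widely cited approximation of about six FLOPs per parameter per token for transformer training (for a mixture-of-experts model, the active parameters per token). The function names and every numeric input are hypothetical placeholders, not values from SemiAnalysis' testing.

```python
def approx_training_flops_per_token(active_params: float) -> float:
    """Rule-of-thumb training cost: about 6 FLOPs per active parameter per token."""
    return 6.0 * active_params


def model_flops_utilization(
    tokens_per_second: float,
    flops_per_token: float,
    num_gpus: int,
    peak_flops_per_gpu: float,
) -> float:
    """Achieved model FLOPs per second divided by aggregate peak FLOPs per second."""
    achieved = tokens_per_second * flops_per_token
    peak = num_gpus * peak_flops_per_gpu
    return achieved / peak


# Hypothetical placeholder inputs, for illustration only.
mfu = model_flops_utilization(
    tokens_per_second=400_000,
    flops_per_token=approx_training_flops_per_token(5e9),
    num_gpus=64,
    peak_flops_per_gpu=1e15,
)
print(f"MFU: {mfu:.1%}")
```

Because throughput drops to zero while a job is stopped, time spent restarting lowers MFU measured over a whole run, which is why recovery behavior shows up in this metric.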

Measurable Business Impact from Software-Driven Fault-Tolerance

For customers operating large AI clusters, the impact is immediate and measurable. In a typical 2,048-GPU H200 deployment, TorchPass Workload Fault Tolerance delivers over $6 million in annual savings by preventing wasted compute.

These savings come from eliminating hundreds of thousands of GPU-hours that would otherwise be lost to failure-driven restarts, cascading retries, and idle recovery time. By keeping training jobs running through infrastructure faults instead of restarting them, TorchPass converts lost GPU time into productive training, significantly improving the return on GPU investments that today often operate at just 30-50% of theoretical performance.

Enabling the Next Generation of AI Infrastructure

By making reliability a software-defined capability rather than a hardware constraint, TorchPass provides the operational confidence required to deploy next-generation, tightly coupled systems such as NVIDIA GB200 and GB300 NVL72 and future rack-scale systems, where dense architectures amplify the cost of even small failures.

TorchPass builds on Clockwork.io’s prior release of Network Fault Tolerance, which applies the same Software-Driven AI Fabric principles to network resilience by transparently rerouting traffic around link failures.

Together, these capabilities form Clockwork.io’s Software-Driven AI Fabric, a vendor-neutral software layer spanning network, compute, and storage. As modern AI workloads run on tightly coupled clusters where hundreds or thousands of processors must operate in coordinated lockstep, infrastructure behaves as a single system, where reliability and performance directly determine overall efficiency. By managing this complexity in software, Clockwork.io enables operators to run heterogeneous AI infrastructure as a unified platform, maintaining high utilization, predictable performance, and resilience while preserving the flexibility to evolve hardware and improve the economics of large-scale AI deployments.

To learn more about the launch of TorchPass, visit the Clockwork.io team in-person at NVIDIA GTC from March 16-19, Booth #205, or visit https://clockwork.io.

About Clockwork.io
Clockwork.io pioneers Software-Driven AI Fabrics™, delivering a programmable software layer that makes large-scale AI clusters observable, deterministic, and resilient by design to drive continuous workload progress and peak cluster utilization. Its FleetIQ platform enables enterprises to train, deploy, and serve the world’s most demanding AI workloads faster, more reliably, and at lower cost. Companies including Uber, Wells Fargo, DCAI, Nebius, Nscale, and White Fiber trust Clockwork.io to power their AI infrastructure. Learn more at www.clockwork.io.

Media Contact
Dana Trismen
clockwork@unshakablemarketinggroup.com
650-269-7478

SOURCE: Clockwork

