Machine learning (ML) models continue to evolve in challenging ways, both in size and in technique. Large language models (LLMs) are instances of the former, while Deep Learning Recommendation Models (DLRMs) and the massive computations of Transformers and BERT are examples of the latter. Our ML supercomputer has grown from 256 TPU v2 nodes to 4096 TPU v4 nodes, driven by the sheer scale of recent LLMs. Reaching such a size raises reliability concerns, which are exacerbated by the fact that deep neural network (DNN) training is carried out in an HPC-style, checkpoint/restore, everything-must-work manner. That is very different from the fault-tolerant software approach of the distributed systems that run Google's mainline services.
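To make the checkpoint/restore contrast concrete, here is a minimal sketch of that HPC-style training discipline: the whole job periodically snapshots its full state, and after any failure it restarts from the most recent snapshot, losing only the work done since. All names here (`save_checkpoint`, `train`, the pickle-based state) are illustrative assumptions, not the actual TPU training stack.

```python
# Minimal sketch of HPC-style checkpoint/restore training.
# Hypothetical names; real DNN frameworks provide their own checkpoint APIs.
import os
import pickle
import tempfile

CKPT = os.path.join(tempfile.gettempdir(), "train_state.pkl")

def save_checkpoint(state, path=CKPT):
    # Persist the full training state to stable storage.
    with open(path, "wb") as f:
        pickle.dump(state, f)

def load_checkpoint(path=CKPT):
    # Resume from the last snapshot if one exists, else start fresh.
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "weights": 0.0}

def train(total_steps=100, ckpt_every=10, fail_at=None):
    state = load_checkpoint()
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["weights"] += 0.1          # stand-in for a gradient update
        state["step"] += 1
        if state["step"] % ckpt_every == 0:
            save_checkpoint(state)
    return state

# A failure mid-run loses only the work since the last checkpoint:
try:
    train(fail_at=55)                    # dies at step 55; last snapshot is step 50
except RuntimeError:
    pass
resumed = train()                        # restarts from the step-50 snapshot
print(resumed["step"])                   # → 100
```

The "everything-must-work" character shows in the restart: every node must come back and replay from the checkpoint together, unlike a mainline distributed service that tolerates individual node failures while continuing to serve.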