Spark in me - Internet, data science, math, deep learning, philosophy (@snakers4). 😱 How Nvidia’s CUDA Monopoly In Machine Learning Is Breaking - OpenAI Triton And PyTorch 2.0

www.semianalysis.com/p/nvidi…npytorch

TLDR:
- Nvidia's dominant position in this field, mainly due to its software moat, is being disrupted;
- PyTorch won the hearts of researchers and of small / large firms;
- Nvidia’s FLOPS have increased by multiple orders of magnitude, partly by leveraging Moore’s Law but primarily through architectural changes such as the tensor core and lower-precision floating point formats. In contrast, memory has not followed the same path;
- The next step down in the memory hierarchy is tightly coupled off-chip memory, DRAM. DRAM followed the path of Moore’s Law for many decades. Since ~2012, though, the cost of DRAM has barely improved;
- Comparing Nvidia’s 2016 P100 GPU to their 2022 H100 GPU, which is just starting to ship, there is a 5x increase in memory capacity (16GB -> 80GB) but a 46x increase in FP16 performance (21.2 TFLOPS -> 989.5 TFLOPS);
- From the current generation A100 to the next generation H100, the FLOPS grow by more than 6x, but memory bandwidth only grows by 1.65x;
- One of the principal optimization methods for a model executed in Eager mode is called operator fusion. This optimization often involves writing custom CUDA kernels;
- PyTorch's growth in operators and its position as the default framework have helped Nvidia, as each operator was quickly optimized for Nvidia's architecture but not for any other hardware. If an AI hardware startup wanted to fully implement PyTorch, that meant supporting the growing list of 2,000 operators natively with high performance;
- PyTorch 2.0 brings many changes, but the primary difference is that it adds a compiled solution that supports a graph execution model;
- OpenAI’s Triton is a very disruptive angle on Nvidia’s closed-source software moat for machine learning. Triton takes in Python directly or is fed through the PyTorch Inductor stack. The latter will be the most common use case. Triton then converts the input to an LLVM intermediate representation and then generates code. In the case of Nvidia GPUs, it directly generates PTX code, skipping Nvidia’s closed-source CUDA libraries, such as cuBLAS, in favor of open-source libraries, such as cutlass. The Triton kernels themselves are quite legible to the typical ML researcher, which is huge for usability.
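
To make the memory argument a bit more concrete, here is a rough Python sketch. It recomputes the P100 -> H100 ratios from the numbers quoted above, then counts DRAM traffic for a chain of elementwise ops with and without operator fusion. The traffic model is a deliberately crude assumption (every unfused op reads and writes the whole tensor, a fused kernel does so once), and the names `growth` and `elementwise_traffic` are made up for illustration; none of this is taken from the article or from any PyTorch / Triton API.

```python
# Figures quoted in the summary above (P100 in 2016 vs H100 in 2022).
P100 = {"memory_gb": 16.0, "fp16_tflops": 21.2}
H100 = {"memory_gb": 80.0, "fp16_tflops": 989.5}


def growth(old: float, new: float) -> float:
    """Return how many times larger `new` is than `old`."""
    return new / old


def elementwise_traffic(n_elements: int, n_ops: int,
                        bytes_per_element: int = 2,
                        fused: bool = False) -> int:
    """Toy model (an assumption, not a profiler): bytes moved to and from
    DRAM by a chain of `n_ops` unary elementwise ops over one tensor.

    Unfused: every op reads its input from DRAM and writes its output back.
    Fused: one kernel reads the input once and writes the final result once,
    keeping the intermediates in registers.
    """
    tensor_bytes = n_elements * bytes_per_element
    passes = 1 if fused else n_ops
    return 2 * tensor_bytes * passes  # one read + one write per pass


memory_growth = growth(P100["memory_gb"], H100["memory_gb"])
flops_growth = growth(P100["fp16_tflops"], H100["fp16_tflops"])

# Three chained elementwise ops over a 1M-element FP16 tensor.
unfused = elementwise_traffic(1_000_000, n_ops=3)
fused = elementwise_traffic(1_000_000, n_ops=3, fused=True)

print(f"memory capacity growth: {memory_growth:.1f}x")
print(f"FP16 throughput growth: {flops_growth:.1f}x")
print(f"DRAM traffic, 3 unfused ops:  {unfused} bytes")
print(f"DRAM traffic, 1 fused kernel: {fused} bytes")
```

In this toy model the fused version moves a third of the bytes for the same arithmetic. That is the reason fusion matters more and more as FLOPS outgrow memory capacity and bandwidth.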