
CatOps

4518 @catops


DevOps and other troubles.


@catops 3 years ago

OpenAI shares their story of running large Kubernetes clusters. Their setup is quite unusual, since they mostly run research jobs. Still, there are a couple of takeaways for running large clusters: for example, reducing the number of DaemonSets and the fluctuations in node count. Also, as usual, the most interesting part is the “Unsolved problems” section. #kubernetes

Scaling Kubernetes to 7,500 nodes

We’ve scaled Kubernetes clusters to 7,500 nodes, producing a scalable infrastructure for large models like GPT-3, CLIP, and DALL·E, but also for rapid small-scale iterative research such as Scaling Laws for Neural Language Models.

OpenAI