CatOps

4518 @catops

DevOps and other troubles.

3 years ago
OpenAI shares their story of running large Kubernetes clusters. Their setup is quite unique, since they are mostly running research jobs. Still, there are a couple of takeaways for running large clusters: for example, reducing the number of DaemonSets and limiting fluctuations in node count. Also, as usual, the most interesting part is the “Unsolved problems” section. #kubernetes
Scaling Kubernetes to 7,500 nodes

We’ve scaled Kubernetes clusters to 7,500 nodes, producing a scalable infrastructure for large models like GPT-3, CLIP, and DALL·E, but also for rapid small-scale iterative research such as Scaling Laws for Neural Language Models.

OpenAI