Spark in me - Internet, data science, math, deep learning, philosophy

2440 @snakers4

A channel about topics I find interesting: the Internet, statistics, data science. No ads and no bullshit.

  • Spark in me - Internet, data science, math, deep learning, philosophy

    Getting The Most Out of AMP

    We have been digging into how to use AMP (automatic mixed precision) properly. Surprise, surprise:

    - It works better with large, wide networks
    - It works poorly with separable convolutions
    - You need somewhat more involved design considerations than just "have your channels divisible by 8":

    For matrix multiplication:
    On FP16 inputs, all three dimensions (M, N, K) must be multiples of 8.

    For convolution:
    On FP16 inputs, input and output channels must be multiples of 8.

    Also:

    Prefer dense math operations.
    For example, vanilla convolutions have much higher arithmetic intensity than depth-wise separable convolutions.
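
    To make the arithmetic intensity point concrete, here is a rough back-of-the-envelope sketch. It assumes FP16 tensors, stride 1, "same" padding, and that every input, output and weight element is moved exactly once, which real kernels only approximate. The layer sizes and the helper name conv_intensity are made up for illustration.

```python
def conv_intensity(h, w, c_in, c_out, k, groups=1, bytes_per_elem=2):
    """Rough (flops, bytes_moved, flops_per_byte) estimate for one conv layer.

    Assumes stride 1, "same" padding, and that inputs, outputs and weights
    are each read or written exactly once (an idealized memory model).
    """
    macs = h * w * c_out * (c_in // groups) * k * k
    flops = 2 * macs  # one multiply plus one add per MAC
    weights = c_out * (c_in // groups) * k * k
    elems = h * w * c_in + h * w * c_out + weights
    bytes_moved = elems * bytes_per_elem
    return flops, bytes_moved, flops / bytes_moved


# Hypothetical layer: 56x56 feature map, 128 -> 128 channels, 3x3 kernel
H, W, C, K = 56, 56, 128, 3

vanilla = conv_intensity(H, W, C, C, K)              # dense 3x3 convolution
depthwise = conv_intensity(H, W, C, C, K, groups=C)  # depth-wise 3x3 part
pointwise = conv_intensity(H, W, C, C, 1)            # 1x1 part of a separable conv

for name, (flops, nbytes, intensity) in [("vanilla", vanilla),
                                         ("depthwise", depthwise),
                                         ("pointwise", pointwise)]:
    print(f"{name:10s} flops={flops:>12,d} bytes={nbytes:>10,d} "
          f"flops/byte={intensity:8.2f}")
```

    Under these assumptions the depth-wise part performs only a handful of operations per byte moved, so it tends to be memory-bound and leaves little work for Tensor Cores to accelerate.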

    Also:

    Choose mini-batch to be a multiple of 8
    Choose linear layer dimensions to be a multiple of 8
    Choose convolution layer channel counts to be a multiple of 8
    For classification problems, pad vocabulary to be a multiple of 8
    For sequence problems, pad the sequence length to be a multiple of 8
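
    A minimal sketch of what these rules mean in practice: round every relevant dimension up to the next multiple of 8. The helper names (round_up, pad_sequence), the pad id and the example sizes are hypothetical and not taken from the Nvidia guide.

```python
def round_up(n, multiple=8):
    """Smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple


def pad_sequence(tokens, pad_id=0, multiple=8):
    """Right-pad a list of token ids so its length is a multiple of `multiple`."""
    target = round_up(len(tokens), multiple)
    return tokens + [pad_id] * (target - len(tokens))


# Hypothetical sizes, for illustration only
vocab_size = 33278
padded_vocab = round_up(vocab_size)    # size to use for the output layer
hidden = round_up(300)                 # linear layer / channel dimension
batch = round_up(30)                   # mini-batch size
padded = pad_sequence([5, 3, 9, 2, 7]) # sequence of 5 tokens

print(padded_vocab, hidden, batch, padded)
```

    Extra vocabulary entries and padded positions still have to be masked or ignored in the loss, which is outside the scope of this sketch.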

    Please see
    - https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html


    #deep_learning
  • Spark in me - Internet, data science, math, deep learning, philosophy

    Trying Out New Ampere GPUs and MIG (RU)

    Playing with Nvidia's new Ampere-based GPUs and trying out MIG

    https://habr.com/ru/post/530986/

    Please like / share / repost!

    #hardware
    #deep_learning
  • Spark in me - Internet, data science, math, deep learning, philosophy

    First Experience With A100 GPUs

    (0)
    Under 100% load they are indeed 15-20 degrees cooler, i.e. 60 - 70C (similar to 3090).

    (1)
    ./gpu_burn 120

    - 1080 Ti ~8,000 - 8,500 Gflop/s
    - Titan X (Maxwell) ~4,300 Gflop/s
    - 3090 (Ampere) ~16,500 Gflop/s
    - A100 (w/o MIG) ~16,700 Gflop/s

    ./gpu_burn -tc 120

    - 3090 (Ampere) ~38,500 Gflop/s
    - A100 (w/o MIG) ~81,500 Gflop/s
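
    For a quick read of these numbers, the ratios can be computed directly. This is just arithmetic on the approximate figures above; the dictionary layout and key names are my own.

```python
# Approximate gpu_burn results from this post, in Gflop/s
results = {
    "3090": {"default": 16500, "tc": 38500},
    "A100": {"default": 16700, "tc": 81500},
}

# How much the -tc (Tensor Core) run gains over the default run on each card
tc_uplift = {gpu: r["tc"] / r["default"] for gpu, r in results.items()}

# A100 relative to 3090 in each mode
a100_vs_3090 = {
    mode: results["A100"][mode] / results["3090"][mode]
    for mode in ("default", "tc")
}

for gpu, ratio in tc_uplift.items():
    print(f"{gpu}: -tc run is {ratio:.2f}x the default run")
for mode, ratio in a100_vs_3090.items():
    print(f"A100 vs 3090 ({mode}): {ratio:.2f}x")
```

    So with default settings the two cards are nearly identical in this test, and the gap only opens up in the -tc run.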

    (2)
    Using MIG is fairly straightforward, but unsurprisingly it does not work properly with gpu-burn out of the box.

    The most interesting thing, of course, is to test MIG 2, 3, 7 setups against 2x 3090 / 1080 Ti / Titan X.

    #deep_learning
  • Spark in me - Internet, data science, math, deep learning, philosophy

    2020 DS / ML Digest 13

    Highlights:

    - Silero models now has an experimental Ukrainian model
    - CV inference 101
    - High-Resolution 3D Human Digitization
    - Background Features in Google Meet
    - How to Build an Open-Domain Question Answering System?
    - A case for … Keeping encryption elitist
    - Objectron dataset
    - See the above posts about 3090 ... and hopefully new posts comparing Titan X / 1080 Ti / 3090 / A100 =)

    Please like / share / repost!

    https://spark-in.me/post/2020_ds_ml_digest_13

    #digest
  • Spark in me - Internet, data science, math, deep learning, philosophy

    Some More Observations About 3090

    - torch.cuda.empty_cache() does not seem to do anything for networks with variable depth / sequence length / girth

    - DDP + AMP ... seems 3x slower instead of 2x faster (lol) for some networks, we are looking for the cause

    - For some networks, 2x speed bump using AMP out of the box

    - DDP now prevents me from running 2 processes on 1 GPU, failing with:

    RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1603729096996/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8

    - They look much more efficient at parallelizing and keeping utilization high (80-100%): the same networks train ~2x-3x faster than on Titan X (Maxwell) and 1080 Ti without any tweaks to the code

    - Same networks use more RAM with 3090 compared to 1080 Ti (?)

    - I was kind of afraid that these cards would be under-utilized (50%), but they are just faster. Magic


    #deep_learning
  • Spark in me - Internet, data science, math, deep learning, philosophy

    First Experience With 3090 GPUs

    (0)
    Under 100% load they are indeed 15-20 degrees cooler.

    (1)
    Lol, gpu-burn shows strange results using default settings - 2x less Gflops compared to 1080 Ti

    ./gpu_burn 600:

    - 1080 Ti 8000 - 8500
    - Titan X (Maxwell) ~4300
    - 3090 (Ampere) ~3000

    ./gpu_burn -tc 600
    - 3090 (Ampere) ~3000

    Idk, maybe it's me, maybe it's gpu-test, need to test on real tasks!

    PS
    I had an old image, maybe bumping CUDA / CUDNN will help.

    #deep_learning
  • Spark in me - Internet, data science, math, deep learning, philosophy

    Some Additional Thoughts on DDP

    The DDP docs say that you cannot run multiple DDP processes on one GPU (otherwise you would have to use their RPC framework, which is too much hassle and complication, at least for me for now!).

    Turns out you can. But the speed-up was negligible in my case:

    - GPU utilization: 70-80% with 1 process per GPU => 90-100% with multiple processes per GPU;
    - Total epoch time decreased by 3-5%;
    - Interestingly, I tried 2 DDP workers on 2 GPUs vs 4 DDP workers on 2 GPUs and 3 DDP workers on 2 GPUs (1 on master, 2 on the other GPU), and 3 workers were much slower, so it is probably a compute bottleneck, not a communication bottleneck (we will see with Ampere GPUs!);
    - Following advice from Nvidia, I also tried MPS (which is supposed to help several processes run smoothly on one GPU), but I just could not make it work with DDP: it failed with cryptic errors, at first after torch.cuda.empty_cache() and then just randomly. Sad times;

    #deep_learning
  • Spark in me - Internet, data science, math, deep learning, philosophy

    2020-11-03 [Experimental] Ukrainian Model V1 Released

    - An experimental model
    - Trained from a small community contributed corpus
    - New Full model size reduced to 85 MB
    - New Quantized model is only 25 MB
    - No TF or ONNX models
    - Will be re-released as a model fine-tuned from a larger Russian corpus upon the V3 release

    https://github.com/snakers4/silero-models
  • Spark in me - Internet, data science, math, deep learning, philosophy

    Trying PyTorch DDP Again

    Just a quick note. DDP expects a gradient / backward pass on every worker (or on none of them). Otherwise it hangs.

    So do not forget to use grad scaler with native PyTorch AMP.

    In my particular case, DDP worked well with AMP, but when I added grad scaler it stopped exploding / de-syncing and started converging even faster. If only I had GPUs with FP16 support =)

    I guess nice work, Nvidia?
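
    For reference, a minimal sketch of a training step with native PyTorch AMP and a grad scaler. The toy model, data and hyperparameters are made up; in a real DDP setup the model would additionally be wrapped in torch.nn.parallel.DistributedDataParallel after initializing the process group, which is omitted here. AMP is switched off automatically when no CUDA device is available, so the snippet also runs on CPU.

```python
import torch
from torch import nn


def train_step(model, optimizer, scaler, inputs, targets, use_amp):
    """One optimization step with autocast + GradScaler."""
    optimizer.zero_grad()
    # Run the forward pass in mixed precision (no-op when use_amp is False)
    with torch.cuda.amp.autocast(enabled=use_amp):
        loss = nn.functional.mse_loss(model(inputs), targets)
    # Scale the loss so small FP16 gradients do not underflow
    scaler.scale(loss).backward()
    # Unscales gradients and skips the step if they contain inf / NaN
    scaler.step(optimizer)
    # Adjusts the scale factor for the next iteration
    scaler.update()
    return loss.item()


use_amp = torch.cuda.is_available()
device = "cuda" if use_amp else "cpu"

# Hypothetical toy model; dimensions are multiples of 8 on purpose
model = nn.Linear(16, 8).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

inputs = torch.randn(32, 16, device=device)
targets = torch.randn(32, 8, device=device)

losses = [train_step(model, optimizer, scaler, inputs, targets, use_amp)
          for _ in range(5)]
print(losses)
```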

    #deep_learning
  • Spark in me - Internet, data science, math, deep learning, philosophy

    Torch Dataloader With Workers Leaking RAM

    Everyone has faced this issue with HUGE datasets. It is simply caused by Python itself. If you faced it - you know what I am talking about.

    I do not claim this to be a definitive solution, but it worked for me.

    import time
    import torch
    import random
    import string
    from multiprocessing import Manager
    from torch.utils.data import Dataset, DataLoader


    def id_gen(size=6,
               chars=string.ascii_uppercase):
        # Random uppercase string standing in for a file path
        return ''.join(random.choice(chars)
                       for _ in range(size))


    class DataIter(Dataset):
        def __init__(self):
            # The Manager dict lives in a separate server process;
            # DataLoader workers access it through a proxy
            m = Manager()
            self.data = m.dict({i: {'key': random.random(),
                                    'path': id_gen(size=10)}
                                for i in range(1000000)})

        def __len__(self):
            return len(self.data)

        def __getitem__(self, idx):
            # One inter-process lookup per item
            data = self.data[idx]
            return torch.tensor(data['key']), data['path']


    train_data = DataIter()

    train_loader = DataLoader(train_data,
                              batch_size=60,
                              shuffle=False,
                              drop_last=False,
                              pin_memory=False,
                              num_workers=10)

    tic = time.time()

    # Report the wall-clock time of every 1000 batches
    for i, item in enumerate(train_loader):
        if (i + 1) % 1000 == 0:
            toc = time.time()
            print(f"Time for 1000 batches: {toc - tic} s")
            tic = time.time()

    Be careful with the manager dict though. Although it behaves like a dict, simply iterating over its keys and looking them up one by one will be slow, because every access carries inter-process communication overhead.

    If you just need the whole dict, it has methods that return the whole dict as one big object, which is fast.
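
    A small sketch of the difference, using only the standard library. The proxy's copy() method returns a plain local dict in a single round trip, whereas indexing the proxy key by key costs one round trip per lookup. The toy contents are made up.

```python
from multiprocessing import Manager

manager = Manager()
# Toy shared dict: i -> i squared
shared = manager.dict({i: i * i for i in range(1000)})

# Slow pattern: every shared[k] is a separate request to the manager process
slow_total = sum(shared[k] for k in shared.keys())

# Fast pattern: fetch everything at once, then work on a plain local dict
local = shared.copy()
fast_total = sum(local.values())

print(type(local).__name__, slow_total, fast_total)

manager.shutdown()
```

    Note that the local copy is a snapshot: later writes to the shared dict are not reflected in it.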

    #pytorch
    #deep_learning
  • Spark in me - Internet, data science, math, deep learning, philosophy

    2020 DS / ML Digest 2 Highlights - New STT benchmarks from FAIR - Analysis of GPT-2 by thegradient - Google’s Meena, a 2.6 billion parameter end-to-end trained neural conversational model (not AGI ofc) - OpenAI now uses PyTorch - LaserTag - cool idea on…
  • Spark in me - Internet, data science, math, deep learning, philosophy

    2020 DS / ML Digest 12

    Highlights:

    - Neural network visualization tool
    - Russian large GPT by Sber
    - Some tests of 3090
    - Large radiology dataset
    - New wave of space-tech
    - Containerization landscape

    Please like / share / repost!

    https://spark-in.me/post/2020_ds_ml_digest_12

    #digest