Spark in me - Internet, data science, math, deep learning, philosophy (@snakers4)

Torch Dataloader With Workers Leaking RAM

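Everyone has faced this issue with HUGE datasets. It is just because of Python itself: as far as I understand, the workers start out sharing the parent's memory, but merely reading a Python object updates its reference count, so the shared pages gradually get copied into every worker. If you have faced it, you know what I am talking about.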

I do not claim this to be a definitive solution, but it worked for me.

import time
import torch
import random
import string
from multiprocessing import Manager
from torch.utils.data import Dataset, DataLoader


def id_gen(size=6,
           chars=string.ascii_uppercase):
    return ''.join(random.choice(chars)
                   for _ in range(size))


class DataIter(Dataset):
    def __init__(self):
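        # The Manager dict lives in a single server process;
        # each DataLoader worker only holds a lightweight proxy to it
        m = Manager()
        self.data = m.dict({i: {'key': random.random(),
                                'path': id_gen(size=10)}
                            for i in range(1000000)})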

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        data = self.data[idx]
        # The records only have 'key' and 'path' fields
        return torch.tensor(data['key']), data['path']


train_data = DataIter()

train_loader = DataLoader(train_data,
                          batch_size=60,
                          shuffle=False,
                          drop_last=False,
                          pin_memory=False,
                          num_workers=10)

tic = time.time()

for i, item in enumerate(train_loader):
    if (i + 1) % 1000 == 0:
        toc = time.time()
        print(f"Time for 1000 batches: {toc - tic:.2f} s")
        tic = time.time()

Be careful with the manager dict, though. Although it behaves like a dict, iterating over its keys and reading the values one by one is slow, because every access is a round trip to the manager process.

If you just need the whole dict, it has methods that return its contents as one big object in a single call (copy(), for example), which is fast.
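
To make the difference concrete, here is a minimal sketch that reads the same manager dict both ways. The function name compare_access and the tiny record layout are made up for illustration, and the timings will vary by machine, so treat it as a way to measure the overhead yourself rather than as a benchmark.

```python
import time
from multiprocessing import Manager


def compare_access(n_items=2000):
    """Read a manager dict key by key, then in one copy() call, timing both."""
    with Manager() as manager:
        # Hypothetical records, shaped like the ones in the dataset above
        shared = manager.dict({i: {'key': float(i), 'path': str(i)}
                               for i in range(n_items)})

        # Slow path: every shared[i] is a separate request to the manager process
        tic = time.perf_counter()
        per_key = {i: shared[i] for i in range(n_items)}
        per_key_s = time.perf_counter() - tic

        # Fast path: copy() sends the whole dict back as one ordinary local dict
        tic = time.perf_counter()
        snapshot = shared.copy()
        copy_s = time.perf_counter() - tic

    return per_key, snapshot, per_key_s, copy_s
```

Calling compare_access() returns both local dicts and the two timings. Note that the snapshot is a plain local copy: it does not see later changes to the shared dict, and it costs the full memory of the dict in whichever process makes it, so it is probably best kept to the main process. On platforms that spawn processes instead of forking, the call should sit under an if __name__ == "__main__": guard.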

#pytorch
#deep_learning