Spark in me - Internet, data science, math, deep learning, philosophy

2440 @snakers4

A channel about topics I find interesting: the internet, statistics, data science. No ads and no bullshit.

6 years ago
Torch Dataloader With Workers Leaking RAM

Everyone who works with HUGE datasets has faced this issue. It is caused by Python itself, not by PyTorch. If you have faced it, you know what I am talking about.

I do not claim this to be a definitive solution, but it worked for me.

import time
import torch
import random
import string
from multiprocessing import Manager
from torch.utils.data import Dataset, DataLoader


def id_gen(size=6,
           chars=string.ascii_uppercase):
    # Random uppercase string, used here as a fake file path
    return ''.join(random.choice(chars)
                   for _ in range(size))


class DataIter(Dataset):
    def __init__(self):
        # Keep the samples in a Manager dict: workers read them through
        # a proxy to one server process instead of a plain Python dict
        m = Manager()
        self.data = m.dict({i: {'key': random.random(),
                                'path': id_gen(size=10)}
                            for i in range(1000000)})

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        data = self.data[idx]
        # The stored fields are 'key' and 'path'
        return torch.tensor(data['key']), data['path']


# The guard is needed on platforms that start workers with spawn
if __name__ == '__main__':
    train_data = DataIter()

    train_loader = DataLoader(train_data,
                              batch_size=60,
                              shuffle=False,
                              drop_last=False,
                              pin_memory=False,
                              num_workers=10)

    tic = time.time()

    for i, item in enumerate(train_loader):
        if (i + 1) % 1000 == 0:
            toc = time.time()
            print(f"Time for 1000 batches: {toc - tic:.2f} s")
            tic = time.time()

Be careful with the manager dict, though. It behaves like a dict, but every access goes through inter-process communication, so if you just iterate over its keys and read the values one by one, it will be slow.

If you need the whole dict, it has methods (such as copy()) that return the whole thing as one big object in a single call, which is fast.
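A rough sketch of the difference described above, not a benchmark. The function name snapshot_vs_per_key and the demo size n_items are made up for illustration; copy() and keys() are regular methods of the proxy returned by Manager().dict(). Actual timings depend on the machine, so none are shown.

```python
import time
from multiprocessing import Manager


def snapshot_vs_per_key(n_items=2000):
    """Read a manager dict two ways and time both.

    n_items is an arbitrary demo size, not a recommendation.
    """
    with Manager() as manager:
        shared = manager.dict({i: i * i for i in range(n_items)})

        # Slow path: every shared[key] lookup is a separate round trip
        # to the manager process.
        tic = time.perf_counter()
        per_key = {key: shared[key] for key in shared.keys()}
        per_key_s = time.perf_counter() - tic

        # Fast path: copy() transfers the whole dict in a single call
        # and returns a plain local dict.
        tic = time.perf_counter()
        snapshot = shared.copy()
        snapshot_s = time.perf_counter() - tic

    return per_key, snapshot, per_key_s, snapshot_s


if __name__ == "__main__":
    per_key, snapshot, per_key_s, snapshot_s = snapshot_vs_per_key()
    print(f"per-key reads: {per_key_s:.3f} s, copy(): {snapshot_s:.3f} s")
```

Note that the snapshot is an ordinary local dict, so using it inside the dataset brings back the original per-worker memory behaviour. It is meant for one-off things like computing statistics over the whole dataset in the main process.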

#pytorch
#deep_learning