Everyone has faced this issue with HUGE datasets. It is just because of Python itself. If you have faced it, you know what I am talking about.
I do not claim this to be a definitive solution, but it worked for me.
Be careful with the manager dict, though. It behaves like a dict, but if you just try to iterate over its keys, it will be slow, because each access carries some overhead for inter-process communication.
import time
import torch
import random
import string
from multiprocessing import Manager
from torch.utils.data import Dataset, DataLoader
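def id_gen(size=6,
           chars=string.ascii_uppercase):
    # Random uppercase string, used here as a fake file path
    return ''.join(random.choice(chars)
                   for _ in range(size))


class DataIter(Dataset):
    def __init__(self):
        # Keep the metadata in a Manager dict, so DataLoader workers
        # read it through a proxy to the manager process
        m = Manager()
        self.data = m.dict({i: {'key': random.random(),
                                'path': id_gen(size=10)}
                            for i in range(1000000)})

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        data = self.data[idx]
        # The entries are stored under 'key' and 'path'
        return torch.tensor(data['key']), data['path']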

train_data = DataIter()
train_loader = DataLoader(train_data,
                          batch_size=60,
                          shuffle=False,
                          drop_last=False,
                          pin_memory=False,
                          num_workers=10)

tic = time.time()
for i, item in enumerate(train_loader):
    if (i + 1) % 1000 == 0:
        toc = time.time()
        print(f"Time for 1000 batches in {toc - tic} s")
        tic = time.time()
If you just need the whole dict, the manager dict has methods that return its entire contents as one big object in a single call, which is fast.
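To make the difference concrete, here is a minimal sketch, not a benchmark. It assumes the standard library DictProxy returned by Manager().dict(), whose copy() method returns a plain local dict in one call. The helper names snapshot and lookup_one_by_one and the 5000-entry toy data are made up for illustration.

```python
import time
from multiprocessing import Manager


def snapshot(shared):
    """Fetch the full contents in a single call and return a plain local dict."""
    return shared.copy()


def lookup_one_by_one(shared):
    """Slow path on a manager dict: one inter-process round trip per key."""
    return {k: shared[k] for k in shared.keys()}


if __name__ == "__main__":
    with Manager() as m:
        # Toy data (hypothetical), shaped like the entries in the dataset above
        shared = m.dict({i: {'key': i / 10, 'path': f'file_{i}'}
                         for i in range(5000)})

        tic = time.time()
        slow = lookup_one_by_one(shared)
        print(f"key-by-key lookup: {time.time() - tic:.3f} s")

        tic = time.time()
        fast = snapshot(shared)
        print(f"single copy():     {time.time() - tic:.3f} s")

        # Both paths give the same data, only the number of round trips differs
        assert slow == fast
```

The exact timings depend on your machine, but the key-by-key version pays for one message to the manager process per entry, while copy() pays for one message in total. Once you have the snapshot, it is an ordinary dict, so changes to it are not reflected in the shared one.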
#pytorch
#deep_learning