TensorStore for High-Performance, Scalable Array Storage
ML training engineering gets complicated once you deal with datasets of 100M+ items. Of course, you can get away with basic tools like Redis or Python's multiprocessing Manager; PyTorch even has its own version of Redis.
Surprisingly, if you just implement a naïve on-disk database (i.e. hashed subfolders with a separately stored index), a sufficiently large dataset of small files can run you out of inodes, since every file takes one inode regardless of its size.
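To make the problem concrete, here is a rough sketch of such a hashed-subfolder layout, plus a quick check of how many inodes the filesystem has left. The names `hashed_path` and `inode_usage` and the `/data/store` root are made up for illustration; `os.statvfs` is Unix-only, and some filesystems report 0 for inode counts.

```python
import hashlib
import os
from pathlib import Path


def hashed_path(root: Path, key: str, levels: int = 2, width: int = 2) -> Path:
    """Map a key to root/ab/cd/<digest> so files spread over many subfolders."""
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    # Take the first `levels` groups of `width` hex characters as folder names.
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return root.joinpath(*parts, digest)


def inode_usage(path: str = ".") -> tuple[int, int]:
    """Return (total, free) inode counts for the filesystem holding `path`."""
    st = os.statvfs(path)
    return st.f_files, st.f_ffree


if __name__ == "__main__":
    # Hypothetical root and key, only to show the resulting layout.
    print(hashed_path(Path("/data/store"), "sample-000001"))
    total, free = inode_usage(".")
    print(f"inodes: {free} free of {total}")
```

With one file per sample, 100M samples need at least 100M inodes plus the directories, so it is worth comparing that number against the total reported here before committing to this layout.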
Of course, you can easily implement a simple custom chunking strategy (e.g. packing text data into a dataframe). I wonder if this tool can help with this part.
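For reference, a custom chunking strategy of this kind might look roughly like the sketch below: many small records are packed into one data file, with a JSON index of byte offsets next to it. This is not how TensorStore does it, just a minimal stdlib-only illustration; `write_chunk`, `read_record`, and the file names are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def write_chunk(records: dict[str, bytes], data_path: Path, index_path: Path) -> None:
    """Pack all records into one file and save {key: [offset, length]} as JSON."""
    index = {}
    offset = 0
    with open(data_path, "wb") as f:
        for key, blob in records.items():
            f.write(blob)
            index[key] = [offset, len(blob)]
            offset += len(blob)
    index_path.write_text(json.dumps(index))


def read_record(key: str, data_path: Path, index_path: Path) -> bytes:
    """Look up a record's offset in the index and read only those bytes."""
    offset, length = json.loads(index_path.read_text())[key]
    with open(data_path, "rb") as f:
        f.seek(offset)
        return f.read(length)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data_file = Path(tmp) / "chunk-0000.bin"
        index_file = Path(tmp) / "chunk-0000.idx.json"
        write_chunk({"a": b"first text", "b": b"second text"}, data_file, index_file)
        print(read_record("b", data_file, index_file))
```

Inode usage then grows with the number of chunks rather than the number of samples: two files per chunk here, no matter how many records each chunk holds.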
- ai.googleblog.com/2022/09…nce.html
If anyone has experience, please share.