Basically you do not prune networks (which does not readily transfer into inference) or distill your teacher network into a student, but train a low-rank factorized version of the network with some optimizations from scratch.
This article even has code, but basically ... this is an older fork of fairseq imported via 1 commit. So good luck doing what authors did not bother to do (providing a stand-alone implementation).
So the question is - has anyone seen a minimalist stand-alone implementation for something similar?