CoCa: Contrastive Captioners are Image-Text Foundation Models
Looks like Google is dead set on developing a production-grade dual image-text encoder / captioning model:
we unify single-encoder, dual-encoder and encoder-decoder paradigms, and train one image-text foundation model that subsumes the capabilities of all three approaches
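To make the "one model, several objectives" idea concrete, here is a minimal sketch of a combined objective: a symmetric contrastive loss over paired image/text embeddings plus an autoregressive captioning loss, summed with weights. It is a pure-Python toy under stated assumptions, not the authors' implementation: the function names, the temperature, and the default weights are illustrative placeholders, and the embeddings are assumed to be L2-normalized already.

```python
import math

def log_softmax(row):
    # Numerically stable log-softmax over one list of logits.
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Symmetric image-to-text and text-to-image cross-entropy over a batch.
    # Assumption: row i of image_emb is the true match for row i of text_emb.
    n = len(image_emb)
    sim = [[sum(a * b for a, b in zip(img, txt)) / temperature
            for txt in text_emb] for img in image_emb]
    sim_t = [list(col) for col in zip(*sim)]  # transpose for text-to-image
    i2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    t2i = -sum(log_softmax(sim_t[i])[i] for i in range(n)) / n
    return i2t + t2i

def captioning_loss(token_logits, target_ids):
    # Mean negative log-likelihood of each target caption token.
    nll = [-log_softmax(logits)[t] for logits, t in zip(token_logits, target_ids)]
    return sum(nll) / len(nll)

def combined_loss(image_emb, text_emb, token_logits, target_ids,
                  lambda_con=1.0, lambda_cap=2.0):
    # Weighted sum of both objectives; the weights here are placeholders.
    return (lambda_con * contrastive_loss(image_emb, text_emb)
            + lambda_cap * captioning_loss(token_logits, target_ids))

# Toy batch: two image/text pairs with 2-d unit embeddings,
# and a two-token caption over a 3-word vocabulary.
image_emb = [[1.0, 0.0], [0.0, 1.0]]
text_emb = [[1.0, 0.0], [0.0, 1.0]]
token_logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
target_ids = [0, 1]

total = combined_loss(image_emb, text_emb, token_logits, target_ids)
```

The point of the pattern is that both terms are computed from the same forward pass, so the contrastive and captioning signals share the compute instead of needing separate models.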
Using all of the available noisy data, combining the approaches, and creatively sharing the compute between them is a good pattern, until you read this line:
Pretraining CoCa takes about 5 days on 2,048 CloudTPUv4 chips
That is roughly 10,000 TPU-chip-days for a single pretraining run. Research and compute siloing, of course, but the pattern itself is nice.
#deep_learning