Spark in me - Internet, data science, math, deep learning, philosophy (@snakers4)

CoCa: Contrastive Captioners are Image-Text Foundation Models

Looks like Google is dead set on developing a production-grade dual image-text encoder / captioning model:

we unify single-encoder, dual-encoder and encoder-decoder paradigms, and train one image-text foundation model that subsumes the capabilities of all three approaches
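A rough way to picture how one model can cover these paradigms: the "contrastive captioner" in the title points to training with a contrastive loss over paired image and text embeddings together with a captioning (next-token) loss. The sketch below is a toy pure-Python illustration of such a joint objective, not the authors' code. The function names, the temperature of 0.07, and the weights `w_con` / `w_cap` are all placeholders chosen for illustration.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Assumes the embeddings are already L2-normalised and that
    img_emb[k] is paired with txt_emb[k]. The temperature is a placeholder.
    """
    n = len(img_emb)
    # Similarity matrix: rows are images, columns are texts.
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt_emb]
           for i in img_emb]
    sim_t = [list(col) for col in zip(*sim)]
    image_to_text = -sum(log_softmax(sim[k])[k] for k in range(n)) / n
    text_to_image = -sum(log_softmax(sim_t[k])[k] for k in range(n)) / n
    return image_to_text + text_to_image

def captioning_loss(token_logits, target_ids):
    """Mean negative log-likelihood of the target caption tokens."""
    nll = [-log_softmax(logits)[t] for logits, t in zip(token_logits, target_ids)]
    return sum(nll) / len(nll)

def joint_loss(img_emb, txt_emb, token_logits, target_ids, w_con=1.0, w_cap=1.0):
    """Weighted sum of both objectives; the weights are illustrative."""
    return (w_con * contrastive_loss(img_emb, txt_emb)
            + w_cap * captioning_loss(token_logits, target_ids))
```

The point of the pattern is that both terms are computed from one forward pass over the same image-text batch, so the compute is shared rather than spent on two separate models.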

The idea of combining all of the available noisy data and training approaches while creatively sharing the compute between them is a good pattern, until you read this line:

Pretraining CoCa takes about 5 days on 2,048 CloudTPUv4 chips
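To put that line in perspective, here is the back-of-the-envelope arithmetic. The helper name `chip_hours` is made up for this sketch, and "about 5 days" is treated as exactly 5, so the result is only a rough order of magnitude.

```python
def chip_hours(days, chips):
    """Total accelerator-hours for a run that keeps `chips` chips busy for `days` days."""
    return days * 24 * chips

total = chip_hours(days=5, chips=2048)
print(total)  # 245760 chip-hours, i.e. 10240 chip-days
```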

That is research and compute siloing, of course, but the pattern itself is nice. #deep_learning