Self-supervision and natural language supervision have emerged as two exciting ways to train general-purpose image encoders that excel at a variety of downstream tasks. Recent works such as M3AE [31] and SLIP [64] have suggested that these approaches can be effectively combined, but notably their results use small (<20M examples) pre-training datasets and do not reflect the large-scale regime (>100M samples) that is commonly used for these methods. Here we investigate whether a similar approach can be effective when trained on a much larger amount of data. We find that a combination of two state-of-the-art approaches, masked auto-encoders (MAE) [38] and contrastive language-image pre-training (CLIP) [68], provides a benefit over CLIP alone when trained on a corpus of 11.3M image-text pairs, but little to no benefit (as evaluated on a suite of common vision tasks) when trained on a large corpus of 1.4B images. Our work provides some much-needed clarity into the effectiveness (or lack thereof) of self-supervision for large-scale image-text training.
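For readers unfamiliar with how such a combination is typically set up, the sketch below shows a common formulation (an illustration under stated assumptions, not necessarily the exact objective used in this study): the CLIP contrastive loss and the MAE reconstruction loss are optimized jointly through a shared image encoder, with a weighting hyperparameter $\lambda$ balancing the two terms.

$$
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{CLIP}} \;+\; \lambda\,\mathcal{L}_{\mathrm{MAE}},
$$
$$
\mathcal{L}_{\mathrm{CLIP}} \;=\; -\frac{1}{2N}\sum_{i=1}^{N}\left[
\log\frac{e^{\langle z_i^{I},\, z_i^{T}\rangle/\tau}}{\sum_{j=1}^{N} e^{\langle z_i^{I},\, z_j^{T}\rangle/\tau}}
\;+\;
\log\frac{e^{\langle z_i^{I},\, z_i^{T}\rangle/\tau}}{\sum_{j=1}^{N} e^{\langle z_j^{I},\, z_i^{T}\rangle/\tau}}
\right],
\qquad
\mathcal{L}_{\mathrm{MAE}} \;=\; \frac{1}{|M|}\sum_{p\in M}\lVert \hat{x}_p - x_p\rVert_2^2,
$$

where $z_i^{I}$ and $z_i^{T}$ are the normalized image and text embeddings of the $i$-th pair in a batch of size $N$, $\tau$ is a learned temperature, $M$ is the set of masked image patches whose pixels $x_p$ are reconstructed as $\hat{x}_p$ by the MAE decoder, and $\lambda$ is an assumed weighting hyperparameter. In prior combinations such as SLIP and M3AE, both objectives are driven through the same image backbone, with only the text tower and a lightweight decoder being objective-specific.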