{"data":{"id":"10.48550/arxiv.2107.07651","type":"dois","attributes":{"doi":"10.48550/arxiv.2107.07651","prefix":"10.48550","suffix":"arxiv.2107.07651","identifiers":[{"identifier":"2107.07651","identifierType":"arXiv"}],"alternateIdentifiers":[{"alternateIdentifierType":"arXiv","alternateIdentifier":"2107.07651"}],"creators":[{"name":"Li, Junnan","nameType":"Personal","givenName":"Junnan","familyName":"Li","affiliation":[],"nameIdentifiers":[]},{"name":"Selvaraju, Ramprasaath R.","nameType":"Personal","givenName":"Ramprasaath R.","familyName":"Selvaraju","affiliation":[],"nameIdentifiers":[]},{"name":"Gotmare, Akhilesh Deepak","nameType":"Personal","givenName":"Akhilesh Deepak","familyName":"Gotmare","affiliation":[],"nameIdentifiers":[]},{"name":"Joty, Shafiq","nameType":"Personal","givenName":"Shafiq","familyName":"Joty","affiliation":[],"nameIdentifiers":[]},{"name":"Xiong, Caiming","nameType":"Personal","givenName":"Caiming","familyName":"Xiong","affiliation":[],"nameIdentifiers":[]},{"name":"Hoi, Steven","nameType":"Personal","givenName":"Steven","familyName":"Hoi","affiliation":[],"nameIdentifiers":[]}],"titles":[{"title":"Align before Fuse: Vision and Language Representation Learning with Momentum Distillation"}],"publisher":"arXiv","container":{},"publicationYear":2021,"subjects":[{"lang":"en","subject":"Computer Vision and Pattern Recognition (cs.CV)","subjectScheme":"arXiv"},{"lang":"en","subject":"Artificial Intelligence (cs.AI)","subjectScheme":"arXiv"},{"subject":"FOS: Computer and information sciences","subjectScheme":"Fields of Science and Technology (FOS)"},{"subject":"FOS: Computer and information sciences","schemeUri":"http://www.oecd.org/science/inno/38235147.pdf","subjectScheme":"Fields of Science and Technology (FOS)"}],"contributors":[],"dates":[{"date":"2021-07-16T00:19:22Z","dateType":"Submitted","dateInformation":"v1"},{"date":"2021-07-19T00:05:45Z","dateType":"Updated","dateInformation":"v1"},{"date":"2021-10-07T04:06:04Z","dateType":"Submitted","dateInformation":"v2"},{"date":"2021-10-08T00:08:08Z","dateType":"Updated","dateInformation":"v2"},{"date":"2021-07","dateType":"Available","dateInformation":"v1"},{"date":"2021","dateType":"Issued"}],"language":null,"types":{"ris":"GEN","bibtex":"misc","citeproc":"article","schemaOrg":"CreativeWork","resourceType":"Article","resourceTypeGeneral":"Preprint"},"relatedIdentifiers":[],"relatedItems":[],"sizes":[],"formats":[],"version":"2","rightsList":[{"rights":"Creative Commons Attribution 4.0 International","rightsUri":"https://creativecommons.org/licenses/by/4.0/legalcode","schemeUri":"https://spdx.org/licenses/","rightsIdentifier":"cc-by-4.0","rightsIdentifierScheme":"SPDX"}],"descriptions":[{"description":"Large-scale vision and language representation learning has shown promising improvements on various vision-language tasks. Most existing methods employ a transformer-based multimodal encoder to jointly model visual tokens (region-based image features) and word tokens. Because the visual tokens and word tokens are unaligned, it is challenging for the multimodal encoder to learn image-text interactions. In this paper, we introduce a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention, which enables more grounded vision and language representation learning. Unlike most existing methods, our method does not require bounding box annotations nor high-resolution images. In order to improve learning from noisy web data, we propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model. We provide a theoretical analysis of ALBEF from a mutual information maximization perspective, showing that different training tasks can be interpreted as different ways to generate views for an image-text pair. ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks. On image-text retrieval, ALBEF outperforms methods that are pre-trained on orders of magnitude larger datasets. On VQA and NLVR$^2$, ALBEF achieves absolute improvements of 2.37% and 3.84% compared to the state-of-the-art, while enjoying faster inference speed. Code and pre-trained models are available at https://github.com/salesforce/ALBEF/.","descriptionType":"Abstract"}],"geoLocations":[],"fundingReferences":[],"xml":"PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0idXRmLTgiPz4KPHJlc291cmNlIHhtbG5zPSJodHRwOi8vZGF0YWNpdGUub3JnL3NjaGVtYS9rZXJuZWwtNCIgeG1sbnM6eHNpPSJodHRwOi8vd3d3LnczLm9yZy8yMDAxL1hNTFNjaGVtYS1pbnN0YW5jZSIgeHNpOnNjaGVtYUxvY2F0aW9uPSJodHRwOi8vZGF0YWNpdGUub3JnL3NjaGVtYS9rZXJuZWwtNCBodHRwOi8vc2NoZW1hLmRhdGFjaXRlLm9yZy9tZXRhL2tlcm5lbC00LjMvbWV0YWRhdGEueHNkIj4KICA8aWRlbnRpZmllciBpZGVudGlmaWVyVHlwZT0iRE9JIj4xMC40ODU1MC9BUlhJVi4yMTA3LjA3NjUxPC9pZGVudGlmaWVyPgogIDxhbHRlcm5hdGVJZGVudGlmaWVycz4KICAgIDxhbHRlcm5hdGVJZGVudGlmaWVyIGFsdGVybmF0ZUlkZW50aWZpZXJUeXBlPSJhclhpdiI+MjEwNy4wNzY1MTwvYWx0ZXJuYXRlSWRlbnRpZmllcj4KICA8L2FsdGVybmF0ZUlkZW50aWZpZXJzPgogIDxjcmVhdG9ycz4KICAgIDxjcmVhdG9yPgogICAgICA8Y3JlYXRvck5hbWUgbmFtZVR5cGU9IlBlcnNvbmFsIj5MaSwgSnVubmFuPC9jcmVhdG9yTmFtZT4KICAgICAgPGdpdmVuTmFtZT5KdW5uYW48L2dpdmVuTmFtZT4KICAgICAgPGZhbWlseU5hbWU+TGk8L2ZhbWlseU5hbWU+CiAgICA8L2NyZWF0b3I+CiAgICA8Y3JlYXRvcj4KICAgICAgPGNyZWF0b3JOYW1lIG5hbWVUeXBlPSJQZXJzb25hbCI+U2VsdmFyYWp1LCBSYW1wcmFzYWF0aCBSLjwvY3JlYXRvck5hbWU+CiAgICAgIDxnaXZlbk5hbWU+UmFtcHJhc2FhdGggUi48L2dpdmVuTmFtZT4KICAgICAgPGZhbWlseU5hbWU+U2VsdmFyYWp1PC9mYW1pbHlOYW1lPgogICAgPC9jcmVhdG9yPgogICAgPGNyZWF0b3I+CiAgICAgIDxjcmVhdG9yTmFtZSBuYW1lVHlwZT0iUGVyc29uYWwiPkdvdG1hcmUsIEFraGlsZXNoIERlZXBhazwvY3JlYXRvck5hbWU+CiAgICAgIDxnaXZlbk5hbWU+QWtoaWxlc2ggRGVlcGFrPC9naXZlbk5hbWU+CiAgICAgIDxmYW1pbHlOYW1lPkdvdG1hcmU8L2ZhbWlseU5hbWU+CiAgICA8L2NyZWF0b3I+CiAgICA8Y3JlYXRvcj4KICAgICAgPGNyZWF0b3JOYW1lIG5hbWVUeXBlPSJQZXJzb25hbCI+Sm90eSwgU2hhZmlxPC9jcmVhdG9yTmFtZT4KICAgICAgPGdpdmVuTmFtZT5TaGFmaXE8L2dpdmVuTmFtZT4KICAgICAgPGZhbWlseU5hbWU+Sm90eTwvZmFtaWx5TmFtZT4KICAgIDwvY3JlYXRvcj4KICAgIDxjcmVhdG9yPgogICAgICA8Y3JlYXRvck5hbWUgbmFtZVR5cGU9IlBlcnNvbmFsIj5YaW9uZywgQ2FpbWluZzwvY3JlYXRvck5hbWU+CiAgICAgIDxnaXZlbk5hbWU+Q2FpbWluZzwvZ2l2ZW5OYW1lPgogICAgICA8ZmFtaWx5TmFtZT5YaW9uZzwvZmFtaWx5TmFtZT4KICAgIDwvY3JlYXRvcj4KICAgIDxjcmVhdG9yPgogICAgICA8Y3JlYXRvck5hbWUgbmFtZVR5cGU9IlBlcnNvbmFsIj5Ib2ksIFN0ZXZlbjwvY3JlYXRvck5hbWU+CiAgICAgIDxnaXZlbk5hbWU+U3RldmVuPC9naXZlbk5hbWU+CiAgICAgIDxmYW1pbHlOYW1lPkhvaTwvZmFtaWx5TmFtZT4KICAgIDwvY3JlYXRvcj4KICA8L2NyZWF0b3JzPgogIDx0aXRsZXM+CiAgICA8dGl0bGU+QWxpZ24gYmVmb3JlIEZ1c2U6IFZpc2lvbiBhbmQgTGFuZ3VhZ2UgUmVwcmVzZW50YXRpb24gTGVhcm5pbmcgd2l0aCBNb21lbnR1bSBEaXN0aWxsYXRpb248L3RpdGxlPgogIDwvdGl0bGVzPgogIDxwdWJsaXNoZXI+YXJYaXY8L3B1Ymxpc2hlcj4KICA8cHVibGljYXRpb25ZZWFyPjIwMjE8L3B1YmxpY2F0aW9uWWVhcj4KICA8c3ViamVjdHM+CiAgICA8c3ViamVjdCB4bWw6bGFuZz0iZW4iIHN1YmplY3RTY2hlbWU9ImFyWGl2Ij5Db21wdXRlciBWaXNpb24gYW5kIFBhdHRlcm4gUmVjb2duaXRpb24gKGNzLkNWKTwvc3ViamVjdD4KICAgIDxzdWJqZWN0IHhtbDpsYW5nPSJlbiIgc3ViamVjdFNjaGVtZT0iYXJYaXYiPkFydGlmaWNpYWwgSW50ZWxsaWdlbmNlIChjcy5BSSk8L3N1YmplY3Q+CiAgICA8c3ViamVjdCBzdWJqZWN0U2NoZW1lPSJGaWVsZHMgb2YgU2NpZW5jZSBhbmQgVGVjaG5vbG9neSAoRk9TKSI+Rk9TOiBDb21wdXRlciBhbmQgaW5mb3JtYXRpb24gc2NpZW5jZXM8L3N1YmplY3Q+CiAgPC9zdWJqZWN0cz4KICA8ZGF0ZXM+CiAgICA8ZGF0ZSBkYXRlVHlwZT0iU3VibWl0dGVkIiBkYXRlSW5mb3JtYXRpb249InYxIj4yMDIxLTA3LTE2VDAwOjE5OjIyWjwvZGF0ZT4KICAgIDxkYXRlIGRhdGVUeXBlPSJVcGRhdGVkIiBkYXRlSW5mb3JtYXRpb249InYxIj4yMDIxLTA3LTE5VDAwOjA1OjQ1WjwvZGF0ZT4KICAgIDxkYXRlIGRhdGVUeXBlPSJTdWJtaXR0ZWQiIGRhdGVJbmZvcm1hdGlvbj0idjIiPjIwMjEtMTAtMDdUMDQ6MDY6MDRaPC9kYXRlPgogICAgPGRhdGUgZGF0ZVR5cGU9IlVwZGF0ZWQiIGRhdGVJbmZvcm1hdGlvbj0idjIiPjIwMjEtMTAtMDhUMDA6MDg6MDhaPC9kYXRlPgogICAgPGRhdGUgZGF0ZVR5cGU9IkF2YWlsYWJsZSIgZGF0ZUluZm9ybWF0aW9uPSJ2MSI+MjAyMS0wNzwvZGF0ZT4KICA8L2RhdGVzPgogIDxyZXNvdXJjZVR5cGUgcmVzb3VyY2VUeXBlR2VuZXJhbD0iUHJlcHJpbnQiPkFydGljbGU8L3Jlc291cmNlVHlwZT4KICA8dmVyc2lvbj4yPC92ZXJzaW9uPgogIDxyaWdodHNMaXN0PgogICAgPHJpZ2h0cyByaWdodHNVUkk9Imh0dHA6Ly9jcmVhdGl2ZWNvbW1vbnMub3JnL2xpY2Vuc2VzL2J5LzQuMC8iIHJpZ2h0c0lkZW50aWZpZXJTY2hlbWU9IlNQRFgiIHJpZ2h0c0lkZW50aWZpZXI9IkNDLUJZLTQuMCI+Q3JlYXRpdmUgQ29tbW9ucyBBdHRyaWJ1dGlvbiA0LjAgSW50ZXJuYXRpb25hbDwvcmlnaHRzPgogIDwvcmlnaHRzTGlzdD4KICA8ZGVzY3JpcHRpb25zPgogICAgPGRlc2NyaXB0aW9uIGRlc2NyaXB0aW9uVHlwZT0iQWJzdHJhY3QiPkxhcmdlLXNjYWxlIHZpc2lvbiBhbmQgbGFuZ3VhZ2UgcmVwcmVzZW50YXRpb24gbGVhcm5pbmcgaGFzIHNob3duIHByb21pc2luZyBpbXByb3ZlbWVudHMgb24gdmFyaW91cyB2aXNpb24tbGFuZ3VhZ2UgdGFza3MuIE1vc3QgZXhpc3RpbmcgbWV0aG9kcyBlbXBsb3kgYSB0cmFuc2Zvcm1lci1iYXNlZCBtdWx0aW1vZGFsIGVuY29kZXIgdG8gam9pbnRseSBtb2RlbCB2aXN1YWwgdG9rZW5zIChyZWdpb24tYmFzZWQgaW1hZ2UgZmVhdHVyZXMpIGFuZCB3b3JkIHRva2Vucy4gQmVjYXVzZSB0aGUgdmlzdWFsIHRva2VucyBhbmQgd29yZCB0b2tlbnMgYXJlIHVuYWxpZ25lZCwgaXQgaXMgY2hhbGxlbmdpbmcgZm9yIHRoZSBtdWx0aW1vZGFsIGVuY29kZXIgdG8gbGVhcm4gaW1hZ2UtdGV4dCBpbnRlcmFjdGlvbnMuIEluIHRoaXMgcGFwZXIsIHdlIGludHJvZHVjZSBhIGNvbnRyYXN0aXZlIGxvc3MgdG8gQUxpZ24gdGhlIGltYWdlIGFuZCB0ZXh0IHJlcHJlc2VudGF0aW9ucyBCRWZvcmUgRnVzaW5nIChBTEJFRikgdGhlbSB0aHJvdWdoIGNyb3NzLW1vZGFsIGF0dGVudGlvbiwgd2hpY2ggZW5hYmxlcyBtb3JlIGdyb3VuZGVkIHZpc2lvbiBhbmQgbGFuZ3VhZ2UgcmVwcmVzZW50YXRpb24gbGVhcm5pbmcuIFVubGlrZSBtb3N0IGV4aXN0aW5nIG1ldGhvZHMsIG91ciBtZXRob2QgZG9lcyBub3QgcmVxdWlyZSBib3VuZGluZyBib3ggYW5ub3RhdGlvbnMgbm9yIGhpZ2gtcmVzb2x1dGlvbiBpbWFnZXMuIEluIG9yZGVyIHRvIGltcHJvdmUgbGVhcm5pbmcgZnJvbSBub2lzeSB3ZWIgZGF0YSwgd2UgcHJvcG9zZSBtb21lbnR1bSBkaXN0aWxsYXRpb24sIGEgc2VsZi10cmFpbmluZyBtZXRob2Qgd2hpY2ggbGVhcm5zIGZyb20gcHNldWRvLXRhcmdldHMgcHJvZHVjZWQgYnkgYSBtb21lbnR1bSBtb2RlbC4gV2UgcHJvdmlkZSBhIHRoZW9yZXRpY2FsIGFuYWx5c2lzIG9mIEFMQkVGIGZyb20gYSBtdXR1YWwgaW5mb3JtYXRpb24gbWF4aW1pemF0aW9uIHBlcnNwZWN0aXZlLCBzaG93aW5nIHRoYXQgZGlmZmVyZW50IHRyYWluaW5nIHRhc2tzIGNhbiBiZSBpbnRlcnByZXRlZCBhcyBkaWZmZXJlbnQgd2F5cyB0byBnZW5lcmF0ZSB2aWV3cyBmb3IgYW4gaW1hZ2UtdGV4dCBwYWlyLiBBTEJFRiBhY2hpZXZlcyBzdGF0ZS1vZi10aGUtYXJ0IHBlcmZvcm1hbmNlIG9uIG11bHRpcGxlIGRvd25zdHJlYW0gdmlzaW9uLWxhbmd1YWdlIHRhc2tzLiBPbiBpbWFnZS10ZXh0IHJldHJpZXZhbCwgQUxCRUYgb3V0cGVyZm9ybXMgbWV0aG9kcyB0aGF0IGFyZSBwcmUtdHJhaW5lZCBvbiBvcmRlcnMgb2YgbWFnbml0dWRlIGxhcmdlciBkYXRhc2V0cy4gT24gVlFBIGFuZCBOTFZSJF4yJCwgQUxCRUYgYWNoaWV2ZXMgYWJzb2x1dGUgaW1wcm92ZW1lbnRzIG9mIDIuMzclIGFuZCAzLjg0JSBjb21wYXJlZCB0byB0aGUgc3RhdGUtb2YtdGhlLWFydCwgd2hpbGUgZW5qb3lpbmcgZmFzdGVyIGluZmVyZW5jZSBzcGVlZC4gQ29kZSBhbmQgcHJlLXRyYWluZWQgbW9kZWxzIGFyZSBhdmFpbGFibGUgYXQgaHR0cHM6Ly9naXRodWIuY29tL3NhbGVzZm9yY2UvQUxCRUYvLjwvZGVzY3JpcHRpb24+CiAgPC9kZXNjcmlwdGlvbnM+CjwvcmVzb3VyY2U+","url":"https://arxiv.org/abs/2107.07651","contentUrl":null,"metadataVersion":0,"schemaVersion":"http://datacite.org/schema/kernel-4","source":"mds","isActive":true,"state":"findable","reason":null,"viewCount":0,"viewsOverTime":[],"downloadCount":0,"downloadsOverTime":[],"referenceCount":0,"citationCount":0,"citationsOverTime":[],"partCount":0,"partOfCount":0,"versionCount":0,"versionOfCount":0,"created":"2022-02-21T19:50:33.000Z","registered":"2022-02-21T19:50:34.000Z","published":"2021","updated":"2022-02-21T19:50:34.000Z"},"relationships":{"client":{"data":{"id":"arxiv.content","type":"clients"}},"provider":{"data":{"id":"arxiv","type":"providers"}},"media":{"data":{"id":"10.48550/arxiv.2107.07651","type":"media"}},"references":{"data":[]},"citations":{"data":[]},"parts":{"data":[]},"partOf":{"data":[]},"versions":{"data":[]},"versionOf":{"data":[]}}}}