{"data":{"id":"10.7801/361","type":"dois","attributes":{"doi":"10.7801/361","prefix":"10.7801","suffix":"361","identifiers":[],"alternateIdentifiers":[],"creators":[{"name":"Litschko, Robert","nameType":"Personal","givenName":"Robert","familyName":"Litschko","affiliation":[],"nameIdentifiers":[]}],"titles":[{"lang":"de","title":"Data for paper: \"Evaluating Multilingual Text Encoders for Unsupervised Cross-Lingual Retrieval\""}],"publisher":"Mannheim University Library","container":{},"publicationYear":2021,"subjects":[],"contributors":[],"dates":[{"date":"2021","dateType":"Issued"}],"language":null,"types":{"ris":"DATA","bibtex":"misc","citeproc":"dataset","schemaOrg":"Dataset","resourceTypeGeneral":"Dataset"},"relatedIdentifiers":[],"relatedItems":[],"sizes":["5634460492","5706262811","9400221240","9464752741","9463906740","7283757299"],"formats":["application/gzip","application/gzip","application/gzip","application/gzip","application/gzip","application/gzip"],"version":"1","rightsList":[],"descriptions":[{"lang":"de","description":"Pretrained multilingual text encoders based on neural Transformer architectures, such as multilingual BERT (mBERT) and XLM, have achieved strong performance on a myriad of language understanding tasks. Consequently, they have been adopted as a go-to paradigm for multilingual and cross-lingual representation learning and transfer, rendering cross-lingual word embeddings (CLWEs) effectively obsolete. However, questions remain to which extent this finding generalizes 1) to unsupervised settings and 2) for ad-hoc cross-lingual IR (CLIR) tasks. Therefore, in this work we present a systematic empirical study focused on the suitability of the state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks across a large number of language pairs. In contrast to supervised language understanding, our results indicate that for unsupervised document-level CLIR -- a setup with no relevance judgments for IR-specific fine-tuning -- pretrained encoders fail to significantly outperform models based on CLWEs. For sentence-level CLIR, we demonstrate that state-of-the-art performance can be achieved. However, the peak performance is not met using the general-purpose multilingual text encoders `off-the-shelf', but rather relying on their variants that have been further specialized for sentence understanding tasks.","descriptionType":"Abstract"}],"geoLocations":[],"fundingReferences":[],"xml":"PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiPz4KPHJlc291cmNlIHhtbG5zPSJodHRwOi8vZGF0YWNpdGUub3JnL3NjaGVtYS9rZXJuZWwtNCIgeG1sbnM6eHNpPSJodHRwOi8vd3d3LnczLm9yZy8yMDAxL1hNTFNjaGVtYS1pbnN0YW5jZSIgeHNpOnNjaGVtYUxvY2F0aW9uPSJodHRwOi8vZGF0YWNpdGUub3JnL3NjaGVtYS9rZXJuZWwtNCBodHRwOi8vc2NoZW1hLmRhdGFjaXRlLm9yZy9tZXRhL2tlcm5lbC00L21ldGFkYXRhLnhzZCI+CiAgPGlkZW50aWZpZXIgaWRlbnRpZmllclR5cGU9IkRPSSI+MTAuNzgwMS8zNjE8L2lkZW50aWZpZXI+CiAgPGNyZWF0b3JzPgogICAgPGNyZWF0b3I+CiAgICAgIDxjcmVhdG9yTmFtZSBuYW1lVHlwZT0iUGVyc29uYWwiPkxpdHNjaGtvLCBSb2JlcnQ8L2NyZWF0b3JOYW1lPgogICAgPC9jcmVhdG9yPgogIDwvY3JlYXRvcnM+CiAgPHRpdGxlcz4KICAgIDx0aXRsZSB4bWw6bGFuZz0iZGUiPkRhdGEgZm9yIHBhcGVyOiAiRXZhbHVhdGluZyBNdWx0aWxpbmd1YWwgVGV4dCBFbmNvZGVycyBmb3IgVW5zdXBlcnZpc2VkIENyb3NzLUxpbmd1YWwgUmV0cmlldmFsIjwvdGl0bGU+CiAgPC90aXRsZXM+CiAgPHB1Ymxpc2hlcj5NYW5uaGVpbSBVbml2ZXJzaXR5IExpYnJhcnk8L3B1Ymxpc2hlcj4KICA8cHVibGljYXRpb25ZZWFyPjIwMjE8L3B1YmxpY2F0aW9uWWVhcj4KICA8cmVzb3VyY2VUeXBlIHJlc291cmNlVHlwZUdlbmVyYWw9IkRhdGFzZXQiLz4KICA8c2l6ZXM+CiAgICA8c2l6ZT41NjM0NDYwNDkyPC9zaXplPgogICAgPHNpemU+NTcwNjI2MjgxMTwvc2l6ZT4KICAgIDxzaXplPjk0MDAyMjEyNDA8L3NpemU+CiAgICA8c2l6ZT45NDY0NzUyNzQxPC9zaXplPgogICAgPHNpemU+OTQ2MzkwNjc0MDwvc2l6ZT4KICAgIDxzaXplPjcyODM3NTcyOTk8L3NpemU+CiAgPC9zaXplcz4KICA8Zm9ybWF0cz4KICAgIDxmb3JtYXQ+YXBwbGljYXRpb24vZ3ppcDwvZm9ybWF0PgogICAgPGZvcm1hdD5hcHBsaWNhdGlvbi9nemlwPC9mb3JtYXQ+CiAgICA8Zm9ybWF0PmFwcGxpY2F0aW9uL2d6aXA8L2Zvcm1hdD4KICAgIDxmb3JtYXQ+YXBwbGljYXRpb24vZ3ppcDwvZm9ybWF0PgogICAgPGZvcm1hdD5hcHBsaWNhdGlvbi9nemlwPC9mb3JtYXQ+CiAgICA8Zm9ybWF0PmFwcGxpY2F0aW9uL2d6aXA8L2Zvcm1hdD4KICA8L2Zvcm1hdHM+CiAgPHZlcnNpb24+MTwvdmVyc2lvbj4KICA8ZGVzY3JpcHRpb25zPgogICAgPGRlc2NyaXB0aW9uIHhtbDpsYW5nPSJkZSIgZGVzY3JpcHRpb25UeXBlPSJBYnN0cmFjdCI+UHJldHJhaW5lZCBtdWx0aWxpbmd1YWwgdGV4dCBlbmNvZGVycyBiYXNlZCBvbiBuZXVyYWwgVHJhbnNmb3JtZXIgYXJjaGl0ZWN0dXJlcywgc3VjaCBhcyBtdWx0aWxpbmd1YWwgQkVSVCAobUJFUlQpIGFuZCBYTE0sIGhhdmUgYWNoaWV2ZWQgc3Ryb25nIHBlcmZvcm1hbmNlIG9uIGEgbXlyaWFkIG9mIGxhbmd1YWdlIHVuZGVyc3RhbmRpbmcgdGFza3MuIENvbnNlcXVlbnRseSwgdGhleSBoYXZlIGJlZW4gYWRvcHRlZCBhcyBhIGdvLXRvIHBhcmFkaWdtIGZvciBtdWx0aWxpbmd1YWwgYW5kIGNyb3NzLWxpbmd1YWwgcmVwcmVzZW50YXRpb24gbGVhcm5pbmcgYW5kIHRyYW5zZmVyLCByZW5kZXJpbmcgY3Jvc3MtbGluZ3VhbCB3b3JkIGVtYmVkZGluZ3MgKENMV0VzKSBlZmZlY3RpdmVseSBvYnNvbGV0ZS4gSG93ZXZlciwgcXVlc3Rpb25zIHJlbWFpbiB0byB3aGljaCBleHRlbnQgdGhpcyBmaW5kaW5nIGdlbmVyYWxpemVzIDEpIHRvIHVuc3VwZXJ2aXNlZCBzZXR0aW5ncyBhbmQgMikgZm9yIGFkLWhvYyBjcm9zcy1saW5ndWFsIElSIChDTElSKSB0YXNrcy4gVGhlcmVmb3JlLCBpbiB0aGlzIHdvcmsgd2UgcHJlc2VudCBhIHN5c3RlbWF0aWMgZW1waXJpY2FsIHN0dWR5IGZvY3VzZWQgb24gdGhlIHN1aXRhYmlsaXR5IG9mIHRoZSBzdGF0ZS1vZi10aGUtYXJ0IG11bHRpbGluZ3VhbCBlbmNvZGVycyBmb3IgY3Jvc3MtbGluZ3VhbCBkb2N1bWVudCBhbmQgc2VudGVuY2UgcmV0cmlldmFsIHRhc2tzIGFjcm9zcyBhIGxhcmdlIG51bWJlciBvZiBsYW5ndWFnZSBwYWlycy4gSW4gY29udHJhc3QgdG8gc3VwZXJ2aXNlZCBsYW5ndWFnZSB1bmRlcnN0YW5kaW5nLCBvdXIgcmVzdWx0cyBpbmRpY2F0ZSB0aGF0IGZvciB1bnN1cGVydmlzZWQgZG9jdW1lbnQtbGV2ZWwgQ0xJUiAtLSBhIHNldHVwIHdpdGggbm8gcmVsZXZhbmNlIGp1ZGdtZW50cyBmb3IgSVItc3BlY2lmaWMgZmluZS10dW5pbmcgLS0gcHJldHJhaW5lZCBlbmNvZGVycyBmYWlsIHRvIHNpZ25pZmljYW50bHkgb3V0cGVyZm9ybSBtb2RlbHMgYmFzZWQgb24gQ0xXRXMuIEZvciBzZW50ZW5jZS1sZXZlbCBDTElSLCB3ZSBkZW1vbnN0cmF0ZSB0aGF0IHN0YXRlLW9mLXRoZS1hcnQgcGVyZm9ybWFuY2UgY2FuIGJlIGFjaGlldmVkLiBIb3dldmVyLCB0aGUgcGVhayBwZXJmb3JtYW5jZSBpcyBub3QgbWV0IHVzaW5nIHRoZSBnZW5lcmFsLXB1cnBvc2UgbXVsdGlsaW5ndWFsIHRleHQgZW5jb2RlcnMgYG9mZi10aGUtc2hlbGYnLCBidXQgcmF0aGVyIHJlbHlpbmcgb24gdGhlaXIgdmFyaWFudHMgdGhhdCBoYXZlIGJlZW4gZnVydGhlciBzcGVjaWFsaXplZCBmb3Igc2VudGVuY2UgdW5kZXJzdGFuZGluZyB0YXNrcy48L2Rlc2NyaXB0aW9uPgogIDwvZGVzY3JpcHRpb25zPgo8L3Jlc291cmNlPg==","url":"https://madata.bib.uni-mannheim.de/361","contentUrl":null,"metadataVersion":3,"schemaVersion":"http://datacite.org/schema/kernel-4","source":"api","isActive":true,"state":"findable","reason":null,"viewCount":0,"viewsOverTime":[],"downloadCount":0,"downloadsOverTime":[],"referenceCount":0,"citationCount":0,"citationsOverTime":[],"partCount":0,"partOfCount":0,"versionCount":0,"versionOfCount":0,"created":"2021-01-25T16:56:44.000Z","registered":"2021-01-25T16:56:44.000Z","published":"2021","updated":"2024-06-21T20:13:42.000Z"},"relationships":{"client":{"data":{"id":"gesis.ubma","type":"clients"}},"provider":{"data":{"id":"jjuz","type":"providers"}},"media":{"data":{"id":"10.7801/361","type":"media"}},"references":{"data":[]},"citations":{"data":[]},"parts":{"data":[]},"partOf":{"data":[]},"versions":{"data":[]},"versionOf":{"data":[]}}}}