10.1 NLP

CCMatrix: A billion-scale bitext data set for training translation models

Facebook learned a multilingual sentence embedding to help it represent sentences from different languages in the same featurespace, and then using the distance between sentences in featurespace to help the system figure out if they’re parallel sentences from two different languages.

https://ai.facebook.com/blog/ccmatrix-a-billion-scale-bitext-data-set-for-training-translation-models/