MegaWika dataset:

https://arxiv.org/abs/2307.07049

  1. 13 million Wikipedia articles in 50 languages
  2. 71 million referenced source materials
  3. Supports cross-lingual question answering and citation retrieval

Wikipedia passages and the web sources they reference are extracted → translated to English → semantically analyzed.

For each of the 71 million passage-source pairs, questions are extracted, yielding more than 120 million auto-generated question-answer pairs.
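To make the passage-source-QA relationship concrete, here is a minimal sketch of how one such record could be modeled. The class and field names (`QAPair`, `PassageSourcePair`, `passage`, `source_text`) are hypothetical and not the dataset's actual schema; the only numbers used are the headline figures quoted above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class QAPair:
    # Hypothetical container for one auto-generated question and its answer.
    question: str
    answer: str


@dataclass
class PassageSourcePair:
    # Hypothetical record: one Wikipedia passage plus the web source it cites.
    passage: str
    source_text: str
    qa_pairs: List[QAPair] = field(default_factory=list)


def mean_qa_per_pair(pairs: List[PassageSourcePair]) -> float:
    """Average number of QA pairs attached to each passage-source pair."""
    if not pairs:
        return 0.0
    return sum(len(p.qa_pairs) for p in pairs) / len(pairs)


# Rough ratio from the headline figures: "more than 120 million" QA pairs
# over 71 million passage-source pairs, so this is a lower bound.
approx_qa_per_pair = 120_000_000 / 71_000_000
```

By these figures, each passage-source pair carries at least about 1.7 questions on average.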

Dataset: the full collection is 1.1 TB on Hugging Face.

Translation: 1. M2M-100 (an MT model that focuses on balancing data for language pairs beyond English) 2. Google Translate (for the 10 lowest-frequency languages).
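The split between the two translators can be sketched as a small routing function. This is only an illustration: `choose_translator`, the backend labels, and the toy counts are hypothetical, and ranking languages by a per-language count is an assumption about what "lowest frequency" means here.

```python
from typing import Dict


def choose_translator(lang: str, counts: Dict[str, int], n_lowest: int = 10) -> str:
    """Pick a translation backend label for a language code.

    `counts` maps language code -> some frequency measure (assumed here to be
    a per-language count). The `n_lowest` least frequent languages go to
    Google Translate; the rest go to M2M-100.
    """
    if lang == "en":
        return "none"  # English passages need no translation
    lowest = set(sorted(counts, key=counts.get)[:n_lowest])
    return "google_translate" if lang in lowest else "m2m100"


# Toy counts, invented purely for the example (not real dataset statistics).
toy_counts = {"fr": 900, "ar": 400, "xh": 3}
backend_for_xhosa = choose_translator("xh", toy_counts, n_lowest=1)
backend_for_french = choose_translator("fr", toy_counts, n_lowest=1)
```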

QA generation is based on the English version of each passage.

Worth noting: cross-linguality:

  1. English Wikipedia cites non-English sources quite frequently: 11% of its online source citations (~2 million).
  2. In non-English Wikipedias, the majority of online source citations are in a language other than the Wikipedia's native language (48% English, 19% another non-native language, 33% the same language as the Wikipedia passage).
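The three-way breakdown for non-English Wikipedias can be reproduced with a simple bucketing rule. The sketch below assumes each citation is available as a (Wikipedia language, source language) pair; the function names and the toy citation list are hypothetical, and only the 48/19/33 split comes from the notes above.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def citation_bucket(wiki_lang: str, source_lang: str) -> str:
    """Classify a citation relative to the citing Wikipedia's language."""
    if source_lang == wiki_lang:
        return "native"
    if source_lang == "en":
        return "english"
    return "other"


def bucket_shares(citations: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Fraction of citations in each bucket (all zeros for empty input)."""
    counts = Counter(citation_bucket(w, s) for w, s in citations)
    total = sum(counts.values())
    buckets = ("english", "other", "native")
    if total == 0:
        return {b: 0.0 for b in buckets}
    return {b: counts[b] / total for b in buckets}


# Toy citations, invented for illustration: (wiki language, source language).
toy = [("ar", "en"), ("ar", "ar"), ("ar", "fr"), ("ar", "en")]
toy_shares = bucket_shares(toy)

# Reported percentages for non-English Wikipedias, as quoted in the notes.
reported = {"english": 48, "other": 19, "native": 33}
non_native_share = reported["english"] + reported["other"]
```

The reported shares sum to 100%, so 67% of online citations in non-English Wikipedias point to sources outside the native language.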

Xhosa (a language of South Africa) has no Xhosa sources at all. Arabic Wikipedia cites Arabic websites only 9% of the time.

LASER encodings:

https://github.com/facebookresearch/LASER

LASER2 and LASER3