- Professor: Philipp Koehn @ JHU
- Brief: The problem we aim to address is retrieving resources in different languages for a given query in a QA system. We propose a Retrieval-Augmented Generation (RAG) method that combines Meta's LASER encodings with Llama 3 large language models, trained on the MegaWika dataset, which contains QA pairs generated from multilingual Wikipedia articles and their reference sources, organized in English.
MegaWika dataset:
https://arxiv.org/abs/2307.07049
- 13 million Wikipedia articles (50 languages)
- 71 million referenced source materials
- supports cross-lingual question answering and citation retrieval
Wikipedia passages and referenced web source materials are extracted → translated into English → semantically analyzed.
For each of the 71 million passage-source pairs, questions are extracted, yielding more than 120 million auto-generated question-answer pairs.
Dataset: the full collection is 1.1 TB on Hugging Face.
Translation: 1. M2M-100 (an MT model focused on balancing training data for language pairs beyond English) 2. Google Translate (for the 10 lowest-frequency languages).
QA generation is based on the English-version passages.
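The passage-source-QA layout described above can be sketched as a single record. The field names below are assumptions for illustration, not the exact Hugging Face schema (the real dataset would be loaded with `datasets.load_dataset`; the repo id is assumed to be `hltcoe/megawika`):

```python
# Illustrative sketch of one MegaWika passage-source pair with its
# auto-generated QA. Field names are assumptions, not the exact schema.
entry = {
    "lang": "de",                       # language of the Wikipedia article
    "passage": "Ein Absatz aus dem Artikel ...",
    "en_passage": "A paragraph from the article ...",  # MT into English
    "source_url": "https://example.org/ref",           # cited web source
    "source_lang": "en",
    # QA pairs are generated from the English-version passage
    "qa_pairs": [
        {"question": "What does the paragraph describe?",
         "answer": "An example."},
    ],
}

def questions(e):
    """Pull the auto-generated questions out of one passage-source pair."""
    return [qa["question"] for qa in e["qa_pairs"]]

print(questions(entry))
```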
Worth noting: cross-linguality:
- English Wikipedia cites non-English sources quite frequently: 11% of online source citations (~2 million).
- For non-English Wikipedias, the majority of online source citations are in a language other than the wiki's native language (48% English, 19% other non-native languages, 33% same language as the wiki passage).
Xhosa (a South African language) has no Xhosa sources at all; Arabic Wikipedia cites Arabic websites only 9% of the time.
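The native/English/other breakdown above can be computed by bucketing each citation's source language against the wiki's language. A minimal sketch; the sample citation list is made up for illustration:

```python
from collections import Counter

def citation_buckets(wiki_lang, source_langs):
    """Bucket cited sources as native-language, English, or other."""
    buckets = Counter()
    for s in source_langs:
        if s == wiki_lang:
            buckets["native"] += 1
        elif s == "en":
            buckets["english"] += 1
        else:
            buckets["other"] += 1
    return buckets

# Hypothetical source languages cited by an Arabic-wiki article.
sample = ["en", "en", "ar", "fr", "en", "ar", "de", "en", "en"]
print(citation_buckets("ar", sample))  # English dominates, native is a minority
```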
LASER Encodings:
https://github.com/facebookresearch/LASER
- LASER2 and LASER3 (LASER3 adds teacher-student-trained encoders for low-resource languages)
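LASER maps sentences from many languages into one shared embedding space, so cross-lingual retrieval reduces to nearest-neighbor search over vectors. A minimal sketch with cosine similarity; in the real pipeline the vectors would come from LASER2/LASER3 encoders, while here stand-in vectors are used so the sketch runs without downloading the models:

```python
import numpy as np

def retrieve(query_vec, doc_vecs, k=1):
    """Return indices of the k docs most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q
    return np.argsort(-sims)[:k]

# Stand-ins for LASER embeddings of source passages (any language).
docs = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.9, 0.1, 0.0]])
# Stand-in for the LASER embedding of a query in another language.
query = np.array([1.0, 0.05, 0.0])
print(retrieve(query, docs, k=2))  # nearest passages first
```

Because the embedding space is language-agnostic, the same `retrieve` call works whether the query and passages share a language or not, which is what makes LASER a natural retriever for the RAG setup sketched in the brief.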