Multilingual Evidence Against Hallucinations

Professor: Philipp Koehn @ JHU
Brief: The problem we aim to address is retrieving resources in different languages for a given query in a QA system. We propose a Retrieval-Augmented Generation (RAG) system that combines Meta's LASER encoder with Llama3 large language models, trained on the MegaWika dataset, which contains QA pairs generated from multilingual Wikipedia and its reference articles, organized in English.

Literature Reading Notes:

Nvidia ChatQA

ChatRAG Bench: 10 datasets.

Introduction: Mentions the advantages of GPT-4: 1. easy interaction for follow-up questions. 2. handles provided context that is longer than the context window. 3. zero-shot, with the same accuracy as fine-tuned models.

Proposes a two-stage instruction tuning method and designs a dataset.

Utilizes synthetic data generation to train a customized retriever.

ChatRAG Bench: 10 conversational QA datasets, 5 with long documents, 3 with tabular data and arithmetic calculations.

“Unanswerable scenario”: answer “cannot” instead of hallucinating.

Related Works:

conversational QA and RAG: the conversational format improves the user experience, and the model asks for clarification when needed.
retrieval for multi-turn QA: dense retrievers, trained to retrieve the top-k relevant chunks given a single question.
conversational query rewriting: reinforcement learning methods; few-shot generative models like GPT-2; instruction-tuned GPT-3.5-Turbo.
fine-tuning retriever for multi-turn QA: zero-shot evaluation. Fine-tune a single-turn query retriever on a high-quality multi-turn dataset, then evaluate the zero-shot capability of the fine-tuned retriever on 5 benchmark datasets.
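
To make the top-k retrieval step in the notes above concrete, here is a minimal sketch, not the ChatQA retriever itself. It assumes a toy bag-of-words vector as a stand-in for a learned dense encoder so that it runs without model weights. The names `embed`, `cosine`, `top_k_chunks`, and `multi_turn_query` are hypothetical, and concatenating the dialogue history with the current question is only one simple way to feed a multi-turn conversation to a single-question retriever.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a dense encoder: lowercase bag-of-words counts."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_chunks(query, chunks, k=2):
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def multi_turn_query(history, question):
    """Assumed strategy: concatenate earlier turns with the current question."""
    return " ".join(history + [question])


chunks = [
    "LASER encodes sentences in many languages",
    "Llama3 is a large language model",
    "MegaWika pairs questions with Wikipedia references",
]
history = ["What is LASER?", "A multilingual sentence encoder."]
query = multi_turn_query(history, "Which languages does it cover?")
best = top_k_chunks(query, chunks, k=1)
```

In a real system, `embed` would be replaced by a trained encoder and the sorted scan by an approximate nearest-neighbor index.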

Note: a “turn” in a conversation is marked by one back-and-forth interaction.

Instruction Tuning: high-quality instruction tuning datasets: FLAN, Self-Instruct, Unnatural Instructions, Dolly, OpenAssistant.

Adding unanswerable samples reduces hallucinations.
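
As a rough illustration of what an unanswerable sample might look like in an instruction tuning set, the sketch below pairs a question with a context that does not contain its answer and sets the target to a refusal. The refusal wording, the field names, and the helper `make_unanswerable` are assumptions for illustration, not the format used in the paper.

```python
# Hypothetical refusal string; the exact wording is an assumption.
REFUSAL = "Sorry, I cannot find the answer in the given context."


def make_unanswerable(sample, unrelated_context):
    """Build an unanswerable variant: same question, context without the answer."""
    if sample["answer"].lower() in unrelated_context.lower():
        raise ValueError("replacement context still contains the answer")
    return {
        "context": unrelated_context,
        "question": sample["question"],
        "answer": REFUSAL,
    }


answerable = {
    "context": "MegaWika is built from multilingual Wikipedia.",
    "question": "What is MegaWika built from?",
    "answer": "multilingual Wikipedia",
}
unanswerable = make_unanswerable(
    answerable, "LASER is a multilingual sentence encoder."
)
training_set = [answerable, unanswerable]
```

The substring check is only a crude guard; it does not prove that the new context truly lacks the information needed to answer.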

ChatQA 2-stage instruction tuning method