🎉🎉🎉 Our work **FELM** is accepted by the NeurIPS 2023 Datasets and Benchmarks track and will be on arXiv soon!

FELM is a benchmark for factuality evaluation of large language models. (The FELM dataset on Hugging Face can be found here.)

Authors: Shiqi Chen, Yiran Zhao, Jinghan Zhang, I-Chun Chern, Siyang Gao, Pengfei Liu, and Junxian He.

The benchmark comprises 847 questions spanning five distinct domains: world knowledge, science/technology, writing/recommendation, reasoning, and math. We gather prompts for each domain from various sources, including standard datasets such as TruthfulQA, online platforms such as GitHub repositories, ChatGPT generation, and prompts drafted by the authors.

We then obtain responses from ChatGPT for these prompts. For each response, we provide fine-grained annotation at the segment level, including reference links, identified error types, and the reasons for these errors as given by our annotators.
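To make the segment-level annotation concrete, here is a minimal sketch of what one annotated segment might look like as a Python record. The field names (`segment`, `is_factual`, `error_type`, `reason`, `reference_links`) are illustrative assumptions, not the dataset's exact schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SegmentAnnotation:
    """One segment of a ChatGPT response with its factuality annotation.

    Field names are hypothetical, chosen to mirror the annotation
    described above (reference links, error type, reason).
    """
    segment: str                      # a single segment of the response
    is_factual: bool                  # annotator's factuality label
    error_type: str = ""              # identified error type (empty if factual)
    reason: str = ""                  # annotator's explanation of the error
    reference_links: List[str] = field(default_factory=list)

# Example: a segment flagged as non-factual.
ann = SegmentAnnotation(
    segment="The Great Wall of China is visible from the Moon.",
    is_factual=False,
    error_type="knowledge error",
    reason="This is a common misconception; the wall is not visible "
           "to the naked eye from the Moon.",
    reference_links=["https://example.com/source"],
)
print(ann.error_type)
```

A factual segment would simply carry `is_factual=True` with the optional fields left empty.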
