FactualBench

Overview

FactualBench is a large-scale Chinese factual QA dataset introduced in the EMNLP 2025 Findings paper "Exploring the Generalizability of Factual Hallucination Mitigation via Enhancing Precise Knowledge Utilization". The dataset contains $180,504$ samples spanning $21$ domains and is designed to evaluate and mitigate factual hallucinations in large language models through precise knowledge utilization.

Compared to earlier versions, we further refined the test split by:

Deduplicating questions that overlap with the training set
Removing low-quality or ambiguous samples
Applying time constraints to time-sensitive questions
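
The deduplication step can be pictured with a minimal sketch. It assumes exact matching after text normalization, which is an illustrative heuristic and not necessarily the procedure used to build the v2 test split; `normalize` and `drop_train_overlap` are hypothetical helper names.

```python
import unicodedata


def normalize(question: str) -> str:
    """Normalize a question for exact-match comparison (assumed heuristic)."""
    # NFKC folds full-width and half-width forms, which are common in Chinese text.
    text = unicodedata.normalize("NFKC", question).lower()
    # Keep only letters and digits; whitespace and punctuation are dropped.
    return "".join(ch for ch in text if ch.isalnum())


def drop_train_overlap(test_questions, train_questions):
    """Return test questions whose normalized form is absent from the training set."""
    seen = {normalize(q) for q in train_questions}
    kept = []
    for q in test_questions:
        key = normalize(q)
        if key not in seen:
            kept.append(q)
            seen.add(key)  # also removes repeats inside the test split
    return kept
```

Near-duplicate questions that differ in wording would pass through this filter; catching them would need fuzzy or embedding-based matching.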

Repository Structure

.
├── FactualBench_train_v2.jsonl   # Training split
├── FactualBench_test_v2.jsonl    # Test split
├── evaluation_prompt.txt         # Prompt for model-based evaluation
└── evaluate.py                   # Script to compute accuracy

We evaluate with a model-based judgment strategy; in our paper, gpt-4-0125 serves as the automatic evaluator.
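
As a rough illustration of how accuracy follows from per-sample judgments, the sketch below takes any judge callable that returns a boolean verdict. It does not reproduce evaluate.py or evaluation_prompt.txt: the field keys `question` and `standard_answer`, the `accuracy` function, and the `substring_judge` stand-in are all assumptions made for this example.

```python
from typing import Callable, Iterable, Mapping


def accuracy(
    samples: Iterable[Mapping[str, str]],
    responses: Iterable[str],
    judge: Callable[[str, str, str], bool],
) -> float:
    """Fraction of responses that the judge marks as correct."""
    verdicts = [
        judge(s["question"], s["standard_answer"], r)  # assumed field keys
        for s, r in zip(samples, responses)
    ]
    if not verdicts:
        raise ValueError("no samples to evaluate")
    return sum(verdicts) / len(verdicts)


def substring_judge(question: str, reference: str, response: str) -> bool:
    """Offline stand-in for an LLM judge; NOT the evaluator used in the paper."""
    return reference in response
```

In practice the `judge` argument would wrap a call to the evaluator model with the provided prompt; the substring check only makes the sketch runnable without network access.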

Dataset Composition

Domain	Chinese name	Test	Training	Total
film & entertainment	影视娱乐	191	54,433	54,624
education & training	教育培训	147	3,702	3,849
physics, chemistry & mathematics & biology	数理化生	178	9,171	9,349
history & traditional culture	历史国学	186	18,086	18,272
biography	人物百科	190	11,829	12,019
politics & law	政治法律	155	6,354	6,509
economics & management	经济管理	141	4,537	4,678
computer science	计算机科学	146	6,247	6,393
medical	医学	128	7,057	7,185
sociology & humanity	社会人文	187	8,494	8,681
agriculture, forestry & fisheries & allied industries	农林牧渔	138	3,725	3,863
astronomy & geography	天文地理	151	3,887	4,038
sports & tourism	运动旅游	143	4,867	5,010
digital & automotive	数码汽车	159	3,881	4,040
industrial engineering	工业工程	149	3,279	3,428
military & war	军武战争	142	2,568	2,710
slang & memes	网词网梗	104	529	633
work & life	工作生活	131	5,849	5,980
high technology	高新科技	112	310	422
religion & culture	信仰文化	122	508	630
others	其他	-	18,191	18,191
Total	-	3,000	177,504	180,504
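
The totals in the table can be cross-checked with a few lines of arithmetic. The counts below are copied from the rows above in order; the variable names are chosen only for this check. The 20 named domains plus "others" account for the 21 domains mentioned in the overview.

```python
# (test, training) counts for the 20 named domains, in table order.
COUNTS = [
    (191, 54433), (147, 3702), (178, 9171), (186, 18086), (190, 11829),
    (155, 6354), (141, 4537), (146, 6247), (128, 7057), (187, 8494),
    (138, 3725), (151, 3887), (143, 4867), (159, 3881), (149, 3279),
    (142, 2568), (104, 529), (131, 5849), (112, 310), (122, 508),
]
OTHERS_TRAIN = 18191  # the "others" domain has no test samples

test_total = sum(test for test, _ in COUNTS)
train_total = sum(train for _, train in COUNTS) + OTHERS_TRAIN

print(test_total, train_total, test_total + train_total)
# prints: 3000 177504 180504
```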

Data Format

Each sample in FactualBench consists of a question $Q_i$ (question), a standard answer $X_i^0$ (standard answer), three wrong answers $\{X_i^j\}_{j=1}^{3}$ (wrong answers), and the domain $D_i$ it belongs to (domain).

An Example

Field	Content
Question $Q_i$	第一台微波量子放大器是在哪一年制成的？ In which year was the first microwave quantum amplifier made?
Standard Answer $X_i^0$	第一台微波量子放大器是在1954年制成的。 The first microwave quantum amplifier was made in 1954.
Wrong Answer $X_i^1$	第一台微波量子放大器是在1958年制成的。 The first microwave quantum amplifier was made in 1958.
Wrong Answer $X_i^2$	第一台微波量子放大器是在1960年制成的。 The first microwave quantum amplifier was made in 1960.
Wrong Answer $X_i^3$	第一台微波量子放大器是在1962年制成的。 The first microwave quantum amplifier was made in 1962.
Domain $D_i$	高新科技 high technology
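
A sample like the one above could be read from the JSONL files roughly as follows. This is a sketch under assumptions: the JSON keys `question`, `standard_answer`, `wrong_answers`, and `domain` are guesses at the field names and should be checked against the actual files; `Sample`, `parse_line`, and `read_samples` are hypothetical names.

```python
import json
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Sample:
    question: str            # Q_i
    standard_answer: str     # X_i^0
    wrong_answers: List[str] # X_i^1 .. X_i^3
    domain: str              # D_i


def parse_line(line: str) -> Sample:
    """Parse one JSONL line into a Sample (key names are assumptions)."""
    record = json.loads(line)
    return Sample(
        question=record["question"],
        standard_answer=record["standard_answer"],
        wrong_answers=list(record["wrong_answers"]),
        domain=record["domain"],
    )


def read_samples(path: str) -> Iterator[Sample]:
    """Yield samples from a JSONL file, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield parse_line(line)
```

For example, `next(read_samples("FactualBench_test_v2.jsonl"))` would return the first test sample, provided the keys match.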

Notification

The dataset is constructed from a publicly available Internet encyclopedia (Baidu Baike).
It may contain references to individuals, locations, or medical and physiological concepts that are publicly known.
The data is collected strictly for research purposes and without any intent to violate privacy or safety policies.
⚠️ Despite quality control efforts, the dataset may still contain inaccuracies or outdated facts (knowledge cutoff: 2025). FactualBench should not be treated as an authoritative knowledge base!!!

Citation

If you find this dataset useful, please cite:

@inproceedings{zhang-etal-2025-exploring-generalizability,
    title = "Exploring the Generalizability of Factual Hallucination Mitigation via Enhancing Precise Knowledge Utilization",
    author = "Zhang, Siyuan  and
      Zhang, Yichi  and
      Dong, Yinpeng  and
      Su, Hang",
    editor = "Christodoulopoulos, Christos  and
      Chakraborty, Tanmoy  and
      Rose, Carolyn  and
      Peng, Violet",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.211/",
    doi = "10.18653/v1/2025.findings-emnlp.211",
    pages = "3936--3968",
    ISBN = "979-8-89176-335-7"
}
