Abstract: We introduce EventNarrative, a knowledge graph-to-text dataset built from publicly available open-world knowledge graphs. Given the recent advances in event-driven Information Extraction (IE), and the fact that prior graph-to-text research has focused only on entity-driven KGs, this paper focuses on event-centric data. However, our data generation system can still be adapted to other types of KG data. Existing large-scale datasets in the graph-to-text area are non-parallel, meaning there is a large disconnect between the KGs and the text. The datasets that do pair KGs with text are small-scale and either manually generated or generated without a rich ontology, making the corresponding graphs sparse. Furthermore, these datasets contain many unlinked entities between their KG and text pairs. EventNarrative consists of approximately 230,000 graphs and their corresponding natural language text, 6 times larger than the current largest parallel dataset. It makes use of a rich ontology, all of the KG's entities are linked to the text, and our manual annotations confirm a high data quality. Our aim is two-fold: to help break new ground in event-centric research where data is lacking, and to give researchers a well-defined, large-scale dataset in order to better evaluate existing and future knowledge graph-to-text models. We also evaluate two types of baselines on EventNarrative: a graph-to-text-specific model and two state-of-the-art language models, which previous work has shown to be adaptable to the knowledge graph-to-text domain.
Anthony Colas, Ali Sadeghian, Yue Wang, Daisy Wang
University of Florida
https://www.kaggle.com/datasets/acolas1/eventnarration
Accepted at the Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1). 2021.
Paper can be found here.
Please cite:
@inproceedings{colas2021eventnarrative,
title={EventNarrative: A Large-scale Event-centric Dataset for Knowledge Graph-to-Text Generation},
author={Colas, Anthony and Sadeghian, Ali and Wang, Yue and Wang, Daisy Zhe},
booktitle={Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1)},
year={2021}
}
We describe here how to pre-process the data. We start with a copy of the events in EventKG that have both a Wikidata and a Wikipedia source, along with the Wikidata triples for those events. For further information on EventKG, please see the links below.
To train the baseline models please see "Preprocessing for Models".
EventKG,
WikiData,
Wikipedia Local
We use the wikipedia_en_all_nopic_2021-01.zim dump and the WikiMapper library.
eventkg_wikidata_augmented_events_with_types.json, triples.json can be found here.
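The two input files above pair event records with their Wikidata triples. A minimal sketch of how such a join can be done on the event's Wikidata ID; the field names (`wikidata_id`, `subject`, etc.) are hypothetical and may differ from the actual JSON schema:

```python
import json

# Hypothetical record shapes -- the real field names in
# eventkg_wikidata_augmented_events_with_types.json and triples.json may differ.
events = [
    {"wikidata_id": "Q362", "label": "World War II"},
    {"wikidata_id": "Q8740", "label": "Vietnam War"},
]
triples = [
    {"subject": "Q362", "predicate": "P580", "object": "1939-09-01"},
    {"subject": "Q362", "predicate": "P582", "object": "1945-09-02"},
]

def merge_events_with_triples(events, triples):
    """Attach each event's Wikidata triples to its event record."""
    by_subject = {}
    for t in triples:
        by_subject.setdefault(t["subject"], []).append(t)
    return [{**e, "triples": by_subject.get(e["wikidata_id"], [])} for e in events]

merged = merge_events_with_triples(events, triples)
print(json.dumps(merged[0], indent=2))
```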
- move files to data folder
mv <file> preprocess/data/
- cd into preprocess/
- merge the eventkg and wikidata events
python eventKG_preprocess.py
- get the full text of wikipedia
python modify_full_TEXT.py -input <input_path> -output_dir <output_dir>
Note: We use the wikipedia_en_all_nopic_2021-01 version of Wikipedia and host it on localhost:8080. Please change accordingly.
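Fetching article text from a locally hosted ZIM dump amounts to an HTTP GET plus HTML-to-text conversion. A stdlib-only sketch; the `/A/<title>` URL layout in `fetch_article` is an assumption about the local server and should be adjusted to match your setup:

```python
from html.parser import HTMLParser
from urllib.request import urlopen

class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style content."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    # Collapse runs of whitespace into single spaces.
    return " ".join(" ".join(parser.parts).split())

def fetch_article(title, host="http://localhost:8080"):
    # The URL path is hypothetical -- change it to match how your
    # local wikipedia_en_all_nopic_2021-01 server exposes articles.
    with urlopen(f"{host}/A/{title}") as resp:
        return html_to_text(resp.read().decode("utf-8", errors="replace"))

sample = "<html><body><h1>Battle of X</h1><script>x=1</script><p>It began in 1914.</p></body></html>"
```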
- get entities which match KGs and text (normalization)
python normalize_triples_text.py -input <input_dir> -output_dir <output_file>
Note: change lines 30 and 32 accordingly, as these point to the paths of our WikiMapper DBs. For more information on setting up WikiMapper, please see the WikiMapper link above.
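WikiMapper stores its index in a SQLite database, so the QID-to-title lookup that normalization relies on can be illustrated with the stdlib `sqlite3` module. The table and column names below (`mapping`, `wikipedia_title`, `wikidata_id`) are an assumption; inspect your own WikiMapper DB (the files referenced on lines 30 and 32 of `normalize_triples_text.py`) for the real schema:

```python
import sqlite3

# In-memory stand-in for a WikiMapper index DB; table/column names are assumed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mapping (wikipedia_title TEXT, wikidata_id TEXT)")
conn.executemany(
    "INSERT INTO mapping VALUES (?, ?)",
    [("World_War_II", "Q362"), ("Vietnam_War", "Q8740")],
)

def qid_to_title(conn, qid):
    """Look up the Wikipedia title for a Wikidata QID, or None if unmapped."""
    row = conn.execute(
        "SELECT wikipedia_title FROM mapping WHERE wikidata_id = ?", (qid,)
    ).fetchone()
    return row[0] if row else None

print(qid_to_title(conn, "Q362"))
```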
- postprocess KGs/text: replace entities in text with entities in KG and match KG to text
python postprocess.py
Note: We place the data from the last step into data/full_entities_in_text/. Change the path to the data if needed on line 176.
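The entity-replacement step boils down to substituting each known surface form in the text with its KG entity marker. A minimal sketch, with a hypothetical bracketed-QID marker format; matching longer labels first avoids partial overlaps:

```python
import re

def link_entities(text, label_to_entity):
    """Replace each known surface form with its KG entity marker.
    Longer labels are matched first so 'World War II' wins over 'World War'."""
    out = text
    for label in sorted(label_to_entity, key=len, reverse=True):
        out = re.sub(re.escape(label), label_to_entity[label], out)
    return out

# Hypothetical marker format for illustration only.
labels = {"World War II": "[Q362]", "Poland": "[Q36]"}
text = "World War II began with the invasion of Poland."
print(link_entities(text, labels))
```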
The workload for pre-processing and fetching the data was distributed across multiple machines.
We provide the training/validation/testing data split here.
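The released split should be used for comparability with our results. If you do need to regenerate a split of your own, a simple seeded sketch:

```python
import random

def split_dataset(examples, seed=0, train_frac=0.8, val_frac=0.1):
    """Shuffle with a fixed seed and split into train/val/test.
    The fractions here are illustrative, not the dataset's actual ratios."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

train, val, test = split_dataset(list(range(100)))
```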
- insert data into data/split_data/
- preprocess graphwriter data
mkdir data/split_data/graphwriter/
python preprocess_baselines/graphwriter.py
- preprocess BART and T5 data
mkdir data/split_data/huggingface_bart/
mkdir data/split_data/huggingface_t5/
python preprocess_baselines/invest_huggingface.py
cp data/split_data/huggingface_t5/dev.source
- GraphWriter
We use the version of GraphWriter found here. Please see their requirements.txt file.
- BART/T5
We train BART/T5 following the repo here. Please see their requirements.txt file.
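The BART/T5 preprocessing above serializes each graph into a flat token sequence for the seq2seq models. A hedged sketch of the `<H>`/`<R>`/`<T>` linearization; the exact format is assumed from the plms-graph2text repo's WebNLG conventions, so check the generated `*.source` files against it:

```python
def linearize(triples):
    """Linearize (head, relation, tail) triples into a <H>/<R>/<T> token
    sequence (format assumed from plms-graph2text's WebNLG preprocessing)."""
    parts = []
    for head, rel, tail in triples:
        parts.extend(["<H>", head, "<R>", rel, "<T>", tail])
    return " ".join(parts)

src = linearize([("World War II", "start time", "1 September 1939")])
print(src)
```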
Train a model on GraphWriter:
cd models/CycleGT/
Make sure to have dev.json, train.json, and test.json
python main.py
Move to folder for BART/T5:
cd models/plms-graph2text/
- move all BART data to models/plms-graph2text/event/data/huggingface_bart/
./finetune_EVENT_bart.sh facebook/bart-base <gpu_id>
Train a model on T5:
- move all T5 data to plms-graph2text/event/data/huggingface_t5/
./finetune_EVENT_t5.sh t5-base <gpu_id>
- All outputs can be found in the plms-graph2text/outputs/ folder, for their respective model.
- Test results are found in val_outputs, with the last epoch contained in the file name (10 in our case). Please use the '.tok' file.
- Although the models above evaluate the test set, we decided to evaluate each test set ourselves using the same uniform libraries for every model.
- Use tokenized files for evaluating metrics. We used pycocoevalcap for ROUGE, CIDEr, and METEOR. We used Huggingface datasets for BLEU score. We used the chrF++ package for chrF++.
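The reported scores come from the libraries named above. For a quick sanity check of hypothesis/reference files, a minimal stdlib sentence-level BLEU sketch (uniform 4-gram weights, brevity penalty; not a substitute for the libraries' corpus-level scores):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Sentence-level BLEU with uniform weights and brevity penalty.
    A sanity-check sketch only; report scores from the evaluation libraries."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_ngrams & ref_ngrams).values())
        total = max(sum(hyp_ngrams.values()), 1)
        if overlap == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(log_precisions) / max_n)

score = bleu("the battle began in 1914", "the battle began in 1914")
```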
- WikiMapper library for efficient Wikidata QID to Wikipedia ID matching.
- Our GraphWriter code is borrowed from CycleGT.
- Our BART/T5 code on KG to text is borrowed from the UKPLab.
- Metrics were calculated with pycocoevalcap, BLEU score, and chrF++.
We thank all the authors for their useful code.