
EventNarrative: A large-scale Event-centric Dataset for Knowledge Graph-to-Text Generation

Abstract: We introduce EventNarrative, a knowledge graph-to-text dataset built from publicly available open-world knowledge graphs. Given the recent advances in event-driven Information Extraction (IE), and the fact that prior graph-to-text research has focused only on entity-driven KGs, this paper focuses on event-centric data; our data generation system can nevertheless be adapted to other types of KG data. Existing large-scale datasets in the graph-to-text area are non-parallel, meaning there is a large disconnect between the KGs and text. The datasets that do have paired KGs and text are small-scale and either manually generated or generated without a rich ontology, making the corresponding graphs sparse. Furthermore, these datasets contain many unlinked entities between their KG and text pairs. EventNarrative consists of approximately 230,000 graphs and their corresponding natural-language text, 6 times larger than the current largest parallel dataset. It makes use of a rich ontology, all of its KG entities are linked to the text, and our manual annotations confirm a high data quality. Our aim is two-fold: to help break new ground in event-centric research where data is lacking, and to give researchers a well-defined, large-scale dataset with which to better evaluate existing and future knowledge graph-to-text models. We also evaluate two types of baselines on EventNarrative: a graph-to-text-specific model and two state-of-the-art language models, which previous work has shown to be adaptable to the knowledge graph-to-text domain.

Anthony Colas, Ali Sadeghian, Yue Wang, Daisy Wang
University of Florida

Dataset Link

https://www.kaggle.com/datasets/acolas1/eventnarration

Paper

Accepted at the Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1). 2021.
The paper can be found here.

Please cite:

@inproceedings{colas2021eventnarrative,
  title={EventNarrative: A Large-scale Event-centric Dataset for Knowledge Graph-to-Text Generation},
  author={Colas, Anthony and Sadeghian, Ali and Wang, Yue and Wang, Daisy Zhe},
  booktitle={Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1)},
  year={2021}
}

Getting Started

We describe here how to pre-process the data. Start with a copy of the events in EventKG that have both a Wikidata and a Wikipedia source, along with the Wikidata triples for those events. For further information on EventKG, please see the links below.

To train the baseline models please see "Preprocessing for Models".

Data Sources

  • EventKG
  • Wikidata
  • Wikipedia (local): we use wikipedia_en_all_nopic_2021-01.zim
  • WikiMapper

Initial data files

The files eventkg_wikidata_augmented_events_with_types.json and triples.json can be found here.

  • move files to data folder
mv <file> preprocess/data/

Data Preprocessing

  • cd into preprocess/
  • merge the eventkg and wikidata events
python eventKG_preprocess.py
  • get the full text of wikipedia
python modify_full_TEXT.py -input <input_path> -output_dir <output_dir>

Note: We use the wikipedia_en_all_nopic_2021-01 version of Wikipedia and host it on localhost:8080. Please change accordingly.
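A minimal sketch of fetching an article from such a local mirror is below; the base URL and the `/A/<title>` path layout are assumptions about the local serving setup (they vary between server versions), not guarantees of the tools:

```python
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://localhost:8080"  # local server for the ZIM archive (assumed)

def article_url(title: str, base: str = BASE_URL) -> str:
    """Build the URL for a Wikipedia article in the local ZIM mirror.

    The '/A/<title>' path layout is an assumption; check your server's
    actual URL scheme.
    """
    return f"{base}/A/{quote(title.replace(' ', '_'))}"

def fetch_article_html(title: str) -> str:
    """Fetch the raw article HTML from the local mirror."""
    with urlopen(article_url(title)) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Fetching locally avoids rate limits and keeps the corpus pinned to the exact Wikipedia snapshot used by the dataset.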

  • get entities which match KGs and text (normalization)
python normalize_triples_text.py -input <input_dir> -output_dir <output_file>

Note: change lines 30 and 32 accordingly, as these point to our WikiMapper DBs. For more information on setting up WikiMapper, please see the WikiMapper link above.
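The linking step can be sketched as follows. `title_to_qid` is any lookup callable (in the real pipeline it would wrap WikiMapper's database lookup); dropping triples with unmappable entities is an illustrative policy, not necessarily the script's exact behavior:

```python
def link_entities(triples, title_to_qid):
    """Replace Wikipedia titles in (head, relation, tail) triples with
    Wikidata QIDs, dropping triples whose entities cannot be mapped.

    `title_to_qid` is any callable returning a QID string or None.
    """
    linked = []
    for head, rel, tail in triples:
        h, t = title_to_qid(head), title_to_qid(tail)
        if h is not None and t is not None:
            linked.append((h, rel, t))
    return linked
```

With the WikiMapper library, the callable would come from its title-to-QID lookup over the prebuilt index DB; a plain dict's `.get` works as a stand-in for testing.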

  • postprocess KGs/text: replace entities in text with entities in KG and match KG to text
python postprocess.py

Note: We place the data from the last step into data/full_entities_in_text/. Change the path to the data if needed on line 176.
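A simplified sketch of this replacement step, assuming a mapping from surface mentions to canonical KG entity names (the label format is illustrative, not the dataset's exact one); matching longer surface forms first avoids clobbering them with their own substrings:

```python
import re

def mark_entities(text, surface_to_entity):
    """Replace each surface form in `text` with its canonical KG entity
    label, matching longer surface forms first.

    `surface_to_entity` maps mention strings to canonical entity names;
    this is a toy stand-in for the repo's postprocess.py logic.
    """
    for surface in sorted(surface_to_entity, key=len, reverse=True):
        text = re.sub(re.escape(surface), surface_to_entity[surface], text)
    return text
```

Sorting by descending length means "New York City" is rewritten before "New York" can match inside it.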

The workload for pre-processing and fetching the data was distributed across multiple machines.

Preprocessing for Models

We provide the training/validation/testing data split here.

  • insert data into data/split_data/

GraphWriter

  • preprocess graphwriter data
mkdir data/split_data/graphwriter/
python preprocess_baselines/graphwriter.py
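The exact record format consumed by the GraphWriter/CycleGT loader is defined in that repo; as a hypothetical sketch, one KG/text pair might be bundled into a JSON-serializable record like this (the field names `entities`/`relations`/`text` are assumptions, not the script's actual schema):

```python
def to_record(triples, text):
    """Bundle one KG/text pair into a dict with index-based relations.

    Entities are deduplicated and sorted, and each relation refers to
    entities by their index, a common graph-to-text input layout.
    """
    entities = sorted({e for h, _, t in triples for e in (h, t)})
    index = {e: i for i, e in enumerate(entities)}
    relations = [[index[h], r, index[t]] for h, r, t in triples]
    return {"entities": entities, "relations": relations, "text": text}
```

Index-based relations keep the record compact and let the model embed each entity once regardless of how many triples mention it.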

BART/T5

  • preprocess BART and T5 data
mkdir data/split_data/huggingface_bart/
mkdir data/split_data/huggingface_t5/
python preprocess_baselines/invest_huggingface.py
cp data/split_data/huggingface_t5/dev.source <destination>
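For BART/T5, the preprocessing linearizes each graph into a flat token sequence. A sketch of one common `<H>`/`<R>`/`<T>` tagging scheme is below; check the plms-graph2text scripts for the exact special tokens they expect:

```python
def linearize(triples):
    """Linearize (head, relation, tail) triples into a seq2seq source
    string, one <H>/<R>/<T> segment per triple."""
    return " ".join(f"<H> {h} <R> {r} <T> {t}" for h, r, t in triples)
```

Each line of a .source file would then hold one linearized graph, aligned with the reference narrative on the same line of the .target file.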

Training and Evaluation

  • GraphWriter
    We use the version of GraphWriter found here. Please see their requirements.txt file.
  • BART/T5
    We train BART/T5 following the repo here. Please see their requirements.txt file.

Train a model on GraphWriter:

cd models/CycleGT/

Make sure to have dev.json, train.json, and test.json

python main.py

Train a model on BART:

cd models/plms-graph2text/
  • move all BART data to models/plms-graph2text/event/data/huggingface_bart/
./finetune_EVENT_bart.sh facebook/bart-base <gpu_id>

Train a model on T5:

  • move all T5 data to plms-graph2text/event/data/huggingface_t5/
./finetune_EVENT_t5.sh t5-base <gpu_id>
  • All outputs can be found in the plms-graph2text/outputs/ folder, for their respective model.
  • Test results are found in val_outputs, with the last epoch contained in the file name (10 in our case). Please use the '.tok' file.

Metrics

  • Although the models above evaluate the test set themselves, we evaluated each model's test output ourselves with the same libraries for every model, for uniformity.
  • Use the tokenized files for evaluating metrics. We used pycocoevalcap for ROUGE, CIDEr, and METEOR, Huggingface datasets for the BLEU score, and the chrF++ package for chrF++.
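As a rough illustration of what chrF measures, here is a simplified character n-gram F-score in pure Python; it computes a single n-gram order and omits the word n-grams of chrF++, so use the official package for any reported numbers:

```python
from collections import Counter

def char_ngram_f(hyp, ref, n=3, beta=2.0):
    """Simplified character n-gram F-score (chrF-style, single n-gram
    order, recall weighted by beta). Not the official chrF++."""
    def ngrams(s):
        s = s.replace(" ", "")
        return Counter(s[i:i + n] for i in range(len(s) - n + 1))
    h, r = ngrams(hyp), ngrams(ref)
    overlap = sum((h & r).values())
    if not h or not r or overlap == 0:
        return 0.0
    prec, rec = overlap / sum(h.values()), overlap / sum(r.values())
    return (1 + beta**2) * prec * rec / (beta**2 * prec + rec)
```

With beta=2 the score weights recall twice as heavily as precision, which is the standard chrF setting.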

Acknowledgments

  • WikiMapper library for efficient Wikidata QID to Wikipedia ID matching.
  • Our GraphWriter code is borrowed from CycleGT.
  • Our BART/T5 code on KG to text is borrowed from the UKPLab.
  • Metrics were calculated with pycocoevalcap, BLEUscore, and chrF++.

We thank all the authors for their useful code.

About

Repository for the instructions, code, and resources regarding the NeurIPS 2021 Datasets and Benchmarks paper "EventNarrative: A Large-scale Event-centric Dataset for Knowledge Graph-to-Text Generation".
