applicaai/CCpdf

Data and scripts accompanying CCpdf paper

This repository contains data and simple scripts accompanying the "CCpdf: Building a High Quality Corpus for Visually Rich Documents from Web Crawl Data" paper.

The data provided here is a subset of the data made public by the Common Crawl organization; see https://commoncrawl.org/2022/06/may-2022-crawl-archive-now-available/

Files

  • ccpdf.tsv — metadata of CCpdf files
  • run.sh — main script for downloading CCpdf files from publicly available sources
  • download-from-crawl.sh — script for the actual downloading
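
The scripts themselves are not reproduced in this README, so the following is only a minimal Python sketch of how a metadata row could be turned into a download request, not the repository's actual method. It assumes that ccpdf.tsv has a header row and that each row carries a WARC file path, a byte offset, and a record length; the column names `warc_filename`, `offset`, and `length` are hypothetical. The base URL `https://data.commoncrawl.org/` is Common Crawl's public HTTP endpoint.

```python
import csv
import io

# Public HTTP endpoint for Common Crawl data.
CC_BASE_URL = "https://data.commoncrawl.org/"


def read_tsv_rows(text):
    """Parse TSV text with a header row into a list of dicts.

    Assumption: ccpdf.tsv has a header row; check the real file first.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)


def build_range_request(warc_filename, offset, length):
    """Return (url, headers) for fetching one WARC record by byte range.

    The parameter names mirror hypothetical columns of ccpdf.tsv.
    """
    start = int(offset)
    end = start + int(length) - 1  # HTTP byte ranges are inclusive
    url = CC_BASE_URL + warc_filename
    headers = {"Range": f"bytes={start}-{end}"}
    return url, headers


if __name__ == "__main__":
    # Illustrative row only; these values are not taken from ccpdf.tsv.
    sample = "url\twarc_filename\toffset\tlength\n" \
             "http://example.com/a.pdf\tcrawl-data/example.warc.gz\t100\t50\n"
    for row in read_tsv_rows(sample):
        url, headers = build_range_request(
            row["warc_filename"], row["offset"], row["length"]
        )
        print(url, headers)
```

The pair returned by `build_range_request` can be passed to any HTTP client. The response would be a single gzip-compressed WARC record, from which the PDF payload still has to be extracted.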

About

Index of URLs to PDF files from across the internet, with accompanying download scripts
