Arthurlpgc/InfoRetrievalProject

Websites included

The crawler retrieves information from the following online judges:

Running Crawler

To run the crawler, follow these steps:

First, make sure you have Python 3.6 and pip installed on your system. Then:
  1. Go to src folder: cd src
  2. Install project requirements: pip install -r requirements.txt
  3. Run the crawler: scrapy runspider crawler/questions.py

This starts a heuristic-guided breadth-first search in the spider module, which downloads all pages in the specified domain. You can watch the retrieved pages appear on the fly in the src/retrieved/documents and src/retrieved/objects folders.
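The actual spider is a Scrapy module, but the breadth-first traversal it performs can be sketched with the standard library alone. The snippet below is an illustrative sketch, not the project's code: the `LINKS` graph and function names are hypothetical stand-ins for links that would really come from parsing downloaded HTML.

```python
from collections import deque

# Hypothetical in-memory "site" standing in for real pages; in the
# actual spider these links come from parsing downloaded HTML.
LINKS = {
    "/": ["/problems", "/contests"],
    "/problems": ["/problems/1", "/problems/2"],
    "/contests": ["/problems/1"],
    "/problems/1": [],
    "/problems/2": [],
}

def bfs_crawl(start):
    """Visit pages level by level, skipping URLs already seen."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)  # here the real spider would save the page to disk
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

# bfs_crawl("/") visits "/" first, then both of its children,
# then their children: a level-by-level (breadth-first) order.
```

The `seen` set is what keeps the crawl from revisiting pages reachable through multiple links, which matters on online judges where many contest pages link to the same problems.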

Creating an Index

After running the crawler and retrieving documents, you have to set up an index manually before you can work with the data. To do this:

  1. Go to src folder: cd src
  2. Run the indexer: python3 indexer/indexer.py

It searches for documents stored at src/retrieved/objects and creates the corresponding indexes. The indexes will be available for later queries in the src/indexes folder.
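The project's indexer lives in indexer/indexer.py; as a sketch of the core structure such an indexer typically builds, here is a minimal inverted index over in-memory documents. The document contents and function name are illustrative assumptions, not taken from the repository.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it.

    `docs` is a dict of {doc_id: text}; the real indexer would read the
    crawled objects from src/retrieved/objects instead.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # naive whitespace tokenizer
            index[term].add(doc_id)
    # Sorted posting lists make later query intersection straightforward.
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical crawled problem statements:
docs = {1: "two sum problem", 2: "shortest path problem", 3: "two pointers"}
index = build_inverted_index(docs)
```

A query for a term then reduces to a dictionary lookup (e.g. `index["two"]` yields the posting list `[1, 3]`), and multi-term queries intersect posting lists.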
