High-performance, multi-threaded C++ application demonstrating modern networking, HTML parsing, and asynchronous system architecture.
Note for Recruiters/Reviewers: This project was developed as an educational exploration of C++17 standards, multithreading (std::thread, std::future), and HTTP protocol implementation (cpr/libcurl). It showcases the ability to handle complex system challenges such as rate limiting, concurrency control, and data sanitization.
- Consolidated Architecture: Aggregates data from multiple sources (LinkedIn, Indeed, StepStone) into a unified data structure.
- Intelligent Pipeline: Implements tokenization and strict location filtering (locations.txt) to ensure high data relevance.
- Data Integrity:
- Deduplication algorithm: Uses hash-based sets to eliminate redundant entries.
- Transactional Saving: Implements incremental IO flushing to ensure data persistence during long-running tasks.
- Deep Extraction: Asynchronous traversal of detail pages to parse metadata (e.g. contact info) via Regex/XPath.
- Webhook Integration (n8n): Automatically serializes and pushes results via POST requests to external automation workflows.
- Resilient Request System:
- Configurable proxy rotation middleware.
- Ethical rate-limiting implementation (Jitter/Backoff strategies).
- Concurrency: Scalable thread-pool implementation for parallel processing.
- CMake 3.14+
- C++17 Compiler (MSVC, GCC, Clang)
- Internet Connection
Configure:
cmake -B build -S .
Build:
cmake --build build --config Release
Run the scraper from the command line:

./build/Release/scraper.exe --sites all --location "Stuttgart" --keywords "Fachinformatiker,C++" --output jobs.json --pages 5

- --sites: Comma-separated sites (indeed, stepstone, linkedin) or 'all'.
- --location: Target city or region.
- --keywords: Search terms.
- --output: Path to save JSON results (default: jobs.json).
- --threads: Number of threads.
- --pages: Number of pages to scrape per site.
- --deep: (New) Enable deep scraping to visit job pages and extract emails.
- --webhook: (New) URL to send the JSON result payload to (POST request).
- --proxies: Path to proxy list (default: proxies.txt).
- --headers: Path to header config (default: headers.json).
Standard Run:

./build/Release/scraper.exe --sites all --location "Stuttgart" --pages 1

Deep Scan with n8n Integration:

./build/Release/scraper.exe --sites all --location "Stuttgart" --pages 1 --deep --webhook "http://localhost:5678/webhook"

- proxies.txt: Add proxies in scheme://ip:port or ip:port:user:pass format (handled by cpr/curl).
- headers.json: Modify user agents and referers to match your target demographic.
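A proxies.txt following the formats above might look like this (the addresses are placeholders from documentation-reserved IP ranges, not working proxies):

```text
http://203.0.113.10:8080
socks5://203.0.113.11:1080
198.51.100.5:3128:user:pass
```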
This software is a student project created strictly for educational and learning purposes (researching C++, HTTP networking, and HTML parsing). It is NOT intended for commercial use, data mining, or mass scraping.
- No Liability: The author assumes no responsibility for any consequences arising from the use of this software. You use it entirely at your own risk.
- Respect robots.txt: This tool technically ignores robots.txt to function as a browser emulator. By using this tool, you acknowledge that you are bypassing standard automated access controls.
- Terms of Service: Scraping data may violate the Terms of Service (ToS) of the target platforms (LinkedIn, Indeed, StepStone). Use of this tool may lead to your IP address or account being blocked.
- GDPR/DSGVO Compliance: If you extract personal data (like names or email addresses via --deep), you are responsible for handling this data in compliance with local data protection laws (e.g., GDPR in Europe). Do not publish or sell scraped data.
- Do Not Misuse: Do not use this tool to spam, harass, or overload the servers of the target platforms.
By downloading or running this software, you agree to use it only for personal learning and testing.