
feat(devtools): add RSS feed validation script #1438

Open
BurhanAbdullah wants to merge 9 commits into koala73:main from BurhanAbdullah:feat/rss-feed-validator

Conversation

@BurhanAbdullah

Adds a developer utility script that scans the repository for RSS feed URLs and checks their availability.

The script follows redirects and reports feeds returning non-200 status codes, helping maintain the large number of RSS sources used in World Monitor.

@vercel

vercel bot commented Mar 11, 2026

@BurhanAbdullah is attempting to deploy a commit to the Elie Team on Vercel.

A member of the Team first needs to authorize it.

Owner

@koala73 koala73 left a comment

Thanks for the contribution, @BurhanAbdullah! Useful utility for maintaining our RSS sources. A few items to address:

BLOCKING — Command injection risk (Security)
Line 16: curl ... "$url" where $url comes from grepped file contents. A malformed URL containing shell metacharacters (backticks, $(...), semicolons, pipes) could execute arbitrary commands. The grep pattern https?://[^"' ]+ doesn't filter those characters out.
Fix: pipe URLs through a stricter allowlist regex (e.g., only allow [a-zA-Z0-9./:_?&=-]) or use curl --url with additional sanitization.
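A minimal sketch of that allowlist approach (the helper name `filter_urls` and the exact character class are assumptions; widen the class if your feed URLs use other characters):

```shell
# Only pass URLs built entirely from safe characters, so anything containing
# backticks, $(...), semicolons, pipes, or spaces is silently dropped.
# filter_urls is a hypothetical helper name for this sketch.
filter_urls() {
  grep -E '^https?://[a-zA-Z0-9./:_?&=-]+$' || true
}

printf '%s\n' \
  'https://feeds.bbci.co.uk/news/rss.xml' \
  'https://evil.example/feed;rm -rf ~' \
  | filter_urls
# prints only: https://feeds.bbci.co.uk/news/rss.xml
```

Anchoring the pattern with `^` and `$` is what makes it an allowlist: a URL either consists wholly of safe characters or is rejected.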

HIGH — Incomplete directory scanning
Hardcoded DIRS=("api" "server" "shared" "data" "docs") misses src/, scripts/, and config files where RSS URLs also live. Consider scanning . with excludes for node_modules, .git, dist, etc.
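Something along these lines, sketched with a throwaway directory so it is self-contained (the exclude list is illustrative, not exhaustive):

```shell
# Scan the whole tree for URLs, skipping vendored and build directories
# instead of hardcoding which directories to include.
scan_dir=$(mktemp -d)
mkdir -p "$scan_dir/src" "$scan_dir/node_modules"
echo 'const feed = "https://rss.cnn.com/rss/edition.rss";' > "$scan_dir/src/feeds.ts"
echo 'url = "https://should-be-ignored.example/feed"' > "$scan_dir/node_modules/pkg.js"

grep -rEoh "https?://[^\"' ]+" "$scan_dir" \
  --exclude-dir=node_modules --exclude-dir=.git --exclude-dir=dist \
  | sort -u
# prints: https://rss.cnn.com/rss/edition.rss

rm -rf "$scan_dir"
```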

MEDIUM — Grep pattern too broad and too narrow
grep -Ei "rss|feed|xml" matches non-RSS URLs containing "xml" (sitemaps, etc.) while missing feeds without those strings (e.g., Atom feeds at /atom, feeds at /latest).

Suggestions:

  • Add set -euo pipefail at the top for robustness
  • Consider making --max-time configurable via env var or argument
  • A brief --help / usage comment would be nice
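The three suggestions together might look like this sketch (the script name and the `MAX_TIME` variable are assumptions, not names from the PR):

```shell
#!/usr/bin/env bash
# Strict mode: exit on error, on unset variables, and on pipeline failures.
set -euo pipefail

usage() {
  cat <<'EOF'
Usage: check-rss-feeds.sh [--help]
Scans RSS feed sources and reports unreachable feeds.
Environment:
  MAX_TIME  per-request curl timeout in seconds (default: 10)
EOF
}

if [ "${1:-}" = "--help" ]; then
  usage
  exit 0
fi

# Env-var override with a default; passed later as: curl --max-time "$MAX_TIME"
MAX_TIME="${MAX_TIME:-10}"
echo "timeout: ${MAX_TIME}s"
```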

Please fix the security issue and we can move forward. Thanks again!

@koala73
Owner

koala73 commented Mar 12, 2026

Thanks for the contribution, @BurhanAbdullah! Having an RSS feed health checker is a useful idea for maintaining our feed sources.

A few issues that need addressing before this can be merged:

HIGH: Scans all URLs, not just RSS feeds

The grep currently matches every URL in the entire codebase (API endpoints, CDN URLs, status pages, npm packages, auth endpoints, etc.). Running this would fire hundreds of curl requests against services that are not RSS feeds, including rate-limited APIs and authenticated endpoints. It should only target files that contain RSS feed definitions, like shared/rss-allowed-domains.json or the RSS source configs.

HIGH: Missing User-Agent header

Many servers return 403 without a User-Agent. The curl call needs something like:

-H 'User-Agent: Mozilla/5.0 (compatible; WorldMonitor-RSS-Check/1.0)'

MEDIUM: No concurrency control

For the number of URLs this would find, it would either take a very long time (sequential) or hammer services. Consider batching with xargs -P or adding a small delay between requests.
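One way to sketch the `xargs -P` batching (the `check_one` helper is hypothetical, and the real `curl` call is stubbed out here so the example stays offline):

```shell
# Run up to 4 checks concurrently instead of one long sequential loop.
check_one() {
  # Real version would be something like:
  #   curl -L -s -o /dev/null -w '%{http_code}' --max-time 10 "$1"
  echo "checked $1"
}
export -f check_one   # bash-only: makes the function visible to xargs' children

printf '%s\n' \
  https://feeds.bbci.co.uk/news/rss.xml \
  https://rss.cnn.com/rss/edition.rss \
  | xargs -n1 -P4 bash -c 'check_one "$1"' _
```

Output order is nondeterministic under `-P`, so results should be collected per URL rather than relied on positionally.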

LOW: Validates reachability, not RSS validity

The script checks HTTP status codes but doesn't verify the response is actually valid RSS/Atom XML. A URL returning 200 with an HTML error page would show as passing.
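A rough sketch of such a content check, assuming the body has been saved to a file (`is_feed_body` is a hypothetical helper; real-world feeds may need a more forgiving match):

```shell
# After a 200 response, peek at the start of the body and require an
# RSS/Atom/RDF root element rather than trusting the status code alone.
is_feed_body() {
  head -c 2048 "$1" | grep -qiE '<(rss|feed|rdf:rdf)[[:space:]>]'
}

feed=$(mktemp); html=$(mktemp)
printf '<?xml version="1.0"?><rss version="2.0"><channel></channel></rss>' > "$feed"
printf '<html><body>Our feed moved!</body></html>' > "$html"

is_feed_body "$feed" && echo "looks like a feed"
is_feed_body "$html" || echo "200 but not a feed"
rm -f "$feed" "$html"
```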

The intent is good but the implementation needs significant rework. It should target RSS feed source files specifically, not grep the entire repo for all URLs. As-is, running this would hit every URL in the codebase including authenticated APIs.

Happy to help if you have questions on the RSS feed file locations!

@koala73
Owner

koala73 commented Mar 12, 2026

Review: Follow-up on fixes

Thanks for addressing all four items from the previous review, @BurhanAbdullah! The User-Agent, concurrency control, RSS content validation, and scoped file targeting are all in place now. Nice work.

However, there's a new critical issue with the latest version:

Blocking: Script checks zero feeds

shared/rss-allowed-domains.json contains bare domain names (e.g. feeds.bbci.co.uk, rss.cnn.com), not full URLs. The grep -Eo "https?://..." regex matches nothing in that file, so the script silently succeeds without checking any feeds.

Suggested fix (simplest path): Parse the JSON array with jq and prepend https:// to each domain:

urls=$(jq -r '.[]' "${FILES[@]}" | sed 's|^|https://|' | sort -u)

This way each domain becomes https://feeds.bbci.co.uk, https://rss.cnn.com, etc., and the existing curl + RSS content check logic works as-is.
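The pipeline can be sanity-checked against a small sample file (a temp file standing in for shared/rss-allowed-domains.json, which per the above is a flat JSON array of bare domains):

```shell
# Mimic the allowlist file: a JSON array of bare domains, with a duplicate.
sample=$(mktemp)
printf '["feeds.bbci.co.uk","rss.cnn.com","feeds.bbci.co.uk"]' > "$sample"

jq -r '.[]' "$sample" | sed 's|^|https://|' | sort -u
# prints:
#   https://feeds.bbci.co.uk
#   https://rss.cnn.com

rm -f "$sample"
```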

Minor: Double curl is wasteful

The script currently curls each URL twice (once for status code, once for body). You can combine into one call:

tmpfile=$(mktemp)
status=$(curl -L -H "User-Agent: $USER_AGENT" --max-time "$MAX_TIME" -s -w "%{http_code}" -o "$tmpfile" "$url")
# check $status, then grep "$tmpfile" for <rss|<feed
rm -f "$tmpfile"

Once the jq fix is in, this should be ready to merge. Let me know if you have any questions!

Author

@BurhanAbdullah BurhanAbdullah left a comment

Thanks for the feedback!

Updated the script to:

  • restrict scanning to shared/rss-allowed-domains.json
  • add a User-Agent header
  • add concurrency via xargs
  • verify RSS/Atom XML content

Let me know if any further adjustments are needed.

@BurhanAbdullah
Author

Thanks for the suggestion! Updated the script to parse the domains using jq and prepend https:// so the feeds are now actually checked. Let me know if anything else should be adjusted.

@vercel

vercel bot commented Mar 12, 2026

The latest updates on your projects.

Project: worldmonitor | Status: Ready | Actions: Preview, Comment | Updated (UTC): Mar 12, 2026 6:27pm


Repository owner deleted a comment from ashsolei Mar 14, 2026