feat(devtools): add RSS feed validation script #1438
BurhanAbdullah wants to merge 9 commits into koala73:main from
Conversation
@BurhanAbdullah is attempting to deploy a commit to the Elie Team on Vercel. A member of the Team first needs to authorize it.
koala73 left a comment
Thanks for the contribution, @BurhanAbdullah! Useful utility for maintaining our RSS sources. A few items to address:
BLOCKING — Command injection risk (Security)
Line 16: `curl ... "$url"`, where `$url` comes from grepped file contents. A malformed URL containing shell metacharacters (backticks, `$(...)`, semicolons, pipes) could execute arbitrary commands. The grep pattern `https?://[^"' ]+` doesn't filter those characters out.
Fix: pipe URLs through a stricter allowlist regex (e.g., only allow `[a-zA-Z0-9./:_?&=-]`) or use `curl --url` with additional sanitization.
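The allowlist fix can be sketched as a pure filter stage before any URL reaches curl. The character class is the one suggested above; `filter_urls` is a hypothetical helper name, not code from the PR:

```shell
# Hypothetical helper: keep only URLs built entirely from allowlisted
# characters; anything carrying shell metacharacters ($, backticks,
# semicolons, pipes, quotes, spaces) is dropped before curl sees it.
filter_urls() {
  grep -E '^https?://[A-Za-z0-9./:_?&=-]+$' || true
}

# Example: the injection attempt is filtered out, the real feeds survive.
printf '%s\n' \
  'https://example.com/feed.xml' \
  'https://evil.com/$(touch /tmp/pwned)' \
  'https://ok.org/rss?id=1&page=2' \
  | filter_urls
```

The `|| true` keeps the pipeline alive under `set -o pipefail` when no URL survives the filter.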
HIGH — Incomplete directory scanning
Hardcoded `DIRS=("api" "server" "shared" "data" "docs")` misses `src/`, `scripts/`, and config files where RSS URLs also live. Consider scanning `.` with excludes for `node_modules`, `.git`, `dist`, etc.
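A repo-wide scan with excludes might look like this; the exclude directories are the ones named above, and `scan_repo_urls` is an assumed helper name:

```shell
# Hypothetical sketch: scan the whole tree rather than a hardcoded DIRS
# array, skipping vendored and generated directories via grep's own flags.
scan_repo_urls() {
  grep -rhoE "https?://[^\"' ]+" "$1" \
    --exclude-dir=node_modules --exclude-dir=.git --exclude-dir=dist \
    | sort -u
}
```

`-r` recurses, `-h` drops filenames, `-o` prints only the matched URL, and `sort -u` deduplicates across files.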
MEDIUM — Grep pattern too broad and too narrow
`grep -Ei "rss|feed|xml"` matches non-RSS URLs containing "xml" (sitemaps, etc.) while missing feeds without those strings (e.g., Atom feeds at `/atom`, feeds at `/latest`).
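To make the mismatch concrete, here is the current filter applied to the two example shapes (the URLs are made up for illustration):

```shell
# The sitemap slips through (its URL contains "xml") while the Atom
# feed at /atom is dropped (no "rss", "feed", or "xml" in the URL).
printf '%s\n' \
  'https://site.example/sitemap.xml' \
  'https://blog.example/atom' \
  | { grep -Ei 'rss|feed|xml' || true; }
```

Only the sitemap line is printed: a false positive kept and a real feed missed by the same pattern.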
Suggestions:
- Add `set -euo pipefail` at the top for robustness
- Consider making `--max-time` configurable via env var or argument
- A brief `--help` / usage comment would be nice
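Taken together, the three suggestions might look like this at the top of the script; the script name, default timeout, and usage text are assumptions for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical header: fail fast, env-configurable timeout, usage text.
set -euo pipefail

MAX_TIME="${MAX_TIME:-10}"   # override per run: MAX_TIME=30 ./check-feeds.sh

usage() {
  cat <<'EOF'
Usage: check-feeds.sh [-h|--help]
Scans the repository for RSS feed URLs and reports unreachable feeds.
Environment:
  MAX_TIME  per-request curl timeout in seconds (default: 10)
EOF
}

if [ "${1:-}" = "-h" ] || [ "${1:-}" = "--help" ]; then
  usage
  exit 0
fi
```

`${MAX_TIME:-10}` and `${1:-}` stay safe under `set -u` when the variable or argument is unset.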
Please fix the security issue and we can move forward. Thanks again!
Thanks for the contribution, @BurhanAbdullah! Having an RSS feed health checker is a useful idea for maintaining our feed sources. A few issues that need addressing before this can be merged:

HIGH: Scans all URLs, not just RSS feeds
The

HIGH: Missing User-Agent header
Many servers return 403 without a User-Agent. The curl call needs something like: `-H 'User-Agent: Mozilla/5.0 (compatible; WorldMonitor-RSS-Check/1.0)'`

MEDIUM: No concurrency control
For the number of URLs this would find, it would either take a very long time (sequential) or hammer services. Consider batching with

LOW: Validates reachability, not RSS validity
The script checks HTTP status codes but doesn't verify the response is actually valid RSS/Atom XML. A URL returning 200 with an HTML error page would show as passing.

The intent is good but the implementation needs significant rework. It should target RSS feed source files specifically, not grep the entire repo for all URLs. As-is, running this would hit every URL in the codebase, including authenticated APIs. Happy to help if you have questions on the RSS feed file locations!
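A sketch of the content check the LOW item asks for, plus the batching idea; the helper names and the batch size are assumptions, not the PR's code:

```shell
# Hypothetical helper: true if stdin looks like an RSS or Atom document
# rather than, say, an HTML error page served with a 200.
is_rss_or_atom() {
  grep -qE '<(rss|feed)[[:space:]>]'
}

# Concurrency sketch: check 8 URLs at a time instead of sequentially.
# A check_one() function would curl the URL (with the User-Agent above)
# and pipe the body through is_rss_or_atom:
#   printf '%s\n' "${urls[@]}" | xargs -P 8 -n 1 bash -c 'check_one "$1"' _
```

Matching on the opening `<rss` or `<feed` tag is a cheap heuristic; a stricter check would run the body through an XML parser.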
Review: Follow-up on fixes

Thanks for addressing all four items from the previous review, @BurhanAbdullah! The User-Agent, concurrency control, RSS content validation, and scoped file targeting are all in place now. Nice work. However, there's a new critical issue with the latest version:

Blocking: Script checks zero feeds

Suggested fix (simplest path): Parse the JSON array with

```
urls=$(jq -r '.[]' "${FILES[@]}" | sed 's|^|https://|' | sort -u)
```

This way each domain becomes

Minor: Double curl is wasteful

The script currently curls each URL twice (once for the status code, once for the body). You can combine into one call:

```
tmpfile=$(mktemp)
status=$(curl -L -H "User-Agent: $USER_AGENT" --max-time "$MAX_TIME" -s -w "%{http_code}" -o "$tmpfile" "$url")
# check $status, then grep "$tmpfile" for <rss|<feed
rm -f "$tmpfile"
```

Once the
BurhanAbdullah left a comment
Thanks for the feedback!
Updated the script to:
- restrict scanning to `shared/rss-allowed-domains.json`
- add a User-Agent header
- add concurrency via xargs
- verify RSS/Atom XML content

Let me know if any further adjustments are needed.
Thanks for the suggestion! Updated the script to parse the domains
Adds a developer utility script that scans the repository for RSS feed URLs and checks their availability.
The script follows redirects and reports feeds returning non-200 status codes, helping maintain the large number of RSS sources used in World Monitor.
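The described core check (follow redirects, flag non-200 responses) might be sketched like this; a minimal illustration with assumed names, not the PR's actual implementation:

```shell
#!/usr/bin/env bash
# Minimal sketch: follow redirects (-L), discard the body, and report
# any feed that does not come back as HTTP 200.
check_url() {
  local url="$1" status
  status=$(curl -L -s -o /dev/null --max-time 10 -w '%{http_code}' -- "$url") \
    || status="000"   # DNS errors, refused connections, timeouts all count as failures
  if [ "$status" != "200" ]; then
    echo "FAIL $status $url"
  fi
}
```

`-w '%{http_code}'` prints only the final status code after redirects, and `--` prevents a URL that begins with a dash from being parsed as a curl option.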