A Python FastAPI server with a pluggable detector abstraction layer for deepfake detection. Supports multiple models running in parallel across image, video, and audio media types.
```
POST /detect (multipart file upload)
│
├─ detect media type (image/video)
│
├─ route to all registered detectors for that type
│
└─ return combined results from all detectors
```
Every detector implements BaseDetector (load, detect, supported_media_types) and is registered in app/detectors/registry.py. New models plug in by implementing the interface and adding one line to the registry.
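A minimal sketch of that contract, assuming the method names from the description above; the real signatures in app/detectors/base.py and the registry layout in app/detectors/registry.py may differ:

```python
from abc import ABC, abstractmethod

class BaseDetector(ABC):
    """Illustrative version of the BaseDetector interface."""
    name: str = "base"

    @abstractmethod
    def load(self) -> None:
        """Load model weights (called once at startup)."""

    @abstractmethod
    def detect(self, path: str) -> dict:
        """Return a result dict for one media file."""

    @abstractmethod
    def supported_media_types(self) -> set[str]:
        """Media types this detector handles, e.g. {"image"}."""

# Registry keyed by media type; /detect fans out over REGISTRY[media].
REGISTRY: dict[str, list[BaseDetector]] = {}

def register(detector: BaseDetector) -> None:
    """The 'one line in the registry' step: load, then index by media type."""
    detector.load()
    for media in detector.supported_media_types():
        REGISTRY.setdefault(media, []).append(detector)

class DummyImageDetector(BaseDetector):
    name = "dummy"
    def load(self) -> None:
        pass
    def detect(self, path: str) -> dict:
        return {"model": self.name, "fake_score": 0.5}
    def supported_media_types(self) -> set[str]:
        return {"image"}

register(DummyImageDetector())
results = [d.detect("photo.jpg") for d in REGISTRY["image"]]
```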
| Model | Media | Architecture | Source |
|---|---|---|---|
| ViT Deep-Fake Detector v2 | Image | ViT-base (HuggingFace pipeline) | prithivMLmods/Deep-Fake-Detector-v2-Model |
| Frame Sampler | Video | Samples 20 video frames, runs ViT image detector on each, averages scores | Uses the image model above |
| VoiceGen | Audio (from video) | Dual RawNet2 encoders with domain-agnostic feature disentanglement, SAM optimization, 59M params | Purdue-M2/AI-Synthesized-Voice-Generalization |
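The Frame Sampler's strategy (20 evenly spaced frames, per-frame image scores, averaged) can be sketched as follows; the index math and the toy scorer are illustrative stand-ins, not the code in app/detectors/frame_sampler.py:

```python
# Pick n evenly spaced frame indices from a clip of frame_count frames.
def sample_indices(frame_count: int, n: int = 20) -> list[int]:
    n = min(n, frame_count)
    step = frame_count / n
    return [int(i * step) for i in range(n)]

# Score a video as the mean of per-frame image-detector scores.
def video_score(frame_count: int, score_frame) -> float:
    indices = sample_indices(frame_count)
    scores = [score_frame(i) for i in indices]
    return sum(scores) / len(scores)

# Toy per-frame scorer: flags only the second half of a 300-frame clip.
fake_after_150 = lambda i: 1.0 if i >= 150 else 0.0
print(video_score(300, fake_after_150))  # → 0.5
```

Averaging keeps a single manipulated segment from being diluted to zero, but it also means short fake inserts lower the score less than full-clip fakes.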
- Python 3.11+
- ~600MB disk for model weights (downloaded on first run)
```
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

Starts the server, sends a file, prints results, and shuts down:
```
python cli.py path/to/video.mp4
python cli.py path/to/image.jpg
```

Start the server:
```
uvicorn app.main:app --host 0.0.0.0 --port 8000
```

Detect a file:
```
curl -X POST http://localhost:8000/detect \
  -F "file=@video.mp4"
```

Health check:
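The same upload can be made from Python with only the standard library; this sketch builds the multipart/form-data body by hand and targets the same URL and "file" field as the curl example (everything else is generic multipart plumbing):

```python
import urllib.request
import uuid

def build_multipart(field: str, filename: str, data: bytes) -> tuple[bytes, str]:
    """Encode one file as a multipart/form-data body; return (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    )
    body = head.encode() + data + f"\r\n--{boundary}--\r\n".encode()
    return body, f"multipart/form-data; boundary={boundary}"

payload, content_type = build_multipart("file", "video.mp4", b"\x00\x01")
request = urllib.request.Request(
    "http://localhost:8000/detect",
    data=payload,
    headers={"Content-Type": content_type},
    method="POST",
)
# with urllib.request.urlopen(request) as resp:  # needs the server running
#     print(resp.read().decode())
```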
```
curl http://localhost:8000/health
```

Evaluates all models against labeled test files in media/ (filenames prefixed with fake- or real-):
```
python benchmark.py
```

Outputs per-model accuracy and writes detailed results to benchmark_results.csv.
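The benchmark's scoring rule can be sketched like this: ground truth comes from the fake-/real- filename prefix, and accuracy is the fraction of files whose predicted label matches. Function names and the sample predictions here are illustrative, not benchmark.py's:

```python
from pathlib import Path

def label_from_name(path: str) -> str:
    """Ground-truth label from the fake-/real- filename prefix."""
    return "fake" if Path(path).name.startswith("fake-") else "real"

def accuracy(predictions: dict[str, str]) -> float:
    """Fraction of files where the predicted label matches the prefix."""
    hits = sum(label_from_name(p) == pred for p, pred in predictions.items())
    return hits / len(predictions)

preds = {
    "media/fake-clip.mp4": "fake",
    "media/real-clip.mp4": "real",
    "media/fake-talk.mp4": "real",  # one miss
    "media/real-talk.mp4": "real",
}
print(accuracy(preds))  # → 0.75
```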
Environment variables:
| Variable | Default | Description |
|---|---|---|
| HF_TOKEN | — | HuggingFace token (for gated models) |
| DEVICE | cpu | cpu or cuda |
| HOST | 0.0.0.0 | Server bind address |
| PORT | 8000 | Server port |
```
deep-fake-detection/
├── app/
│   ├── main.py            # FastAPI app, /detect and /health endpoints
│   ├── config.py          # Settings from env vars
│   ├── schemas.py         # Pydantic request/response models
│   └── detectors/
│       ├── base.py            # BaseDetector abstract class
│       ├── registry.py        # Detector registry and media type routing
│       ├── hf_image.py        # ViT image detector (HuggingFace)
│       ├── frame_sampler.py   # Video → frame sampling → image detector
│       └── voice_gen/         # VoiceGen dual-RawNet2 audio detector
├── cli.py                 # CLI tool
├── benchmark.py           # Model evaluation script
├── media/                 # Test files (fake-*.mp4, real-*.mp4)
├── requirements.txt
└── Dockerfile
```
Collection of deepfake detectors from TrueMedia.org covering image, video, and audio:
- DistilDIRE (image) — distilled diffusion-based detector, 3.2x faster than DIRE, handles GAN and diffusion outputs
- UniversalFakeDetectV2 (image) — CLIP-ViT feature spaces with nearest-neighbor/linear probing
- GenConViT (video) — ConvNeXt + Swin Transformer hybrid with CNN autoencoder and VAE
- StyleFlow (video) — style-latent flow anomaly detection with StyleGRU + supervised contrastive learning
- FTCN (video) — temporal convolution network for long-term coherence detection
- Transcript Based Detector (audio) — speech recognition + LLM analysis for factual coherence
Repository: https://github.com/truemediaorg/ml-models
Note: TrueMedia model weights require a formal request to aerin@truemedia.org with affiliation and intended use.
- CAEL (image) — Cross-modal Appearance-Edge Learning transformer with multi-grained fusion, 158.63M params, 99.88% within-dataset ACC but 65.04% cross-dataset AUC
- Dataset: GenFace (515K forged + 100K real faces covering GANs and diffusion methods)
- Repository: https://github.com/Jenine-321/GenFace
- Skipped for now: redundant with existing ViT image detector, weak cross-dataset generalization
- GenD CLIP L/14 (video) — CLIP ViT-L/14 + linear probe, 20-frame averaging. yermandy/GenD_CLIP_L_14
- D3 (video) — Dual-branch CLIP ViT-L/14 (shuffled + original patches) + attention head. BigAandSmallq/D3
- AASIST (audio) — Graph Attention Network with SincConv, 297K params. clovaai/aasist
- UniversalFakeDetect (image + video) — Frozen CLIP ViT-L/14 + linear probe, 769 params. WisconsinAIVision/UniversalFakeDetect
- GenD DINOv3 L (image + video) — DINOv3 ViT-L/16 + linear probe, 300M params. yermandy/GenD_DINOv3_L
- Wav2Vec2 Voice Detector (audio) — Fine-tuned Wav2Vec2-XLSR, 300M params. garystafford/wav2vec2-deepfake-voice-detector
- WaveSpect (audio) — hybrid waveform + CQT spectrogram analysis for synthetic audio detection. No public weights or code available yet.
- FakeBrAccent / XGBoost (audio) — XGBoost/CNN on Brazilian-accented speech dataset. Too narrow (Portuguese-only, accent-specific) for general use.
- BRSpeech-DF (audio) — Brazilian Portuguese deepfake speech dataset. Could be used to fine-tune AASIST but no pretrained weights provided.
- SDD-APALLM (audio) — CQT spectrograms + LLM prompting. Interesting but no released code/weights.
- F-SAT / DeepFakeVox-HQ (audio) — frequency-selective adversarial training for robustness. Code and dataset forthcoming.
- deitfake-v2 (image) — DeiT-based image classifier on HuggingFace. Only 2 classes, redundant with existing ViT detector.
- DFD-FCG (video) — frequency-aware CLIP with graph learning. Uses same CLIP ViT-L/14 backbone as GenD/D3; weights not publicly available.
- FakeVLM (video/image) — vision-language model for explainable deepfake detection, 7B+ params. Too heavy for real-time pipeline.
- DTAD (video) — temporal artifact detection. Code available but limited documentation and unclear weight availability.