TaCertoIssoAI/deep-fake-detector

Deep-Fake Detection Service

A Python FastAPI server with a pluggable detector abstraction layer for deepfake detection. Supports multiple models running in parallel across image, video, and audio media types.

Architecture

POST /detect  (multipart file upload)
     │
     ├─ detect media type (image/video)
     │
     ├─ route to all registered detectors for that type
     │
     └─ return combined results from all detectors

Every detector implements BaseDetector (load, detect, supported_media_types) and is registered in app/detectors/registry.py. New models plug in by implementing the interface and adding one line to the registry.
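A minimal sketch of the abstraction. The method names (`load`, `detect`, `supported_media_types`) come from the README; everything else (signatures, return shapes, the registry helpers) is illustrative and may differ from `app/detectors/base.py` and `registry.py`:

```python
from abc import ABC, abstractmethod

class BaseDetector(ABC):
    """Interface every detector implements (method names from the README)."""

    @abstractmethod
    def load(self) -> None:
        """Load model weights (called once, before serving requests)."""

    @abstractmethod
    def detect(self, file_path: str) -> dict:
        """Run detection on one file; result shape here is an assumption."""

    @property
    @abstractmethod
    def supported_media_types(self) -> set[str]:
        """Media types this detector handles, e.g. {"image"}."""

# Hypothetical registry mirroring the one-line registration idea:
REGISTRY: list[BaseDetector] = []

def register(detector: BaseDetector) -> None:
    REGISTRY.append(detector)

def detectors_for(media_type: str) -> list[BaseDetector]:
    """Route a request to every registered detector for this media type."""
    return [d for d in REGISTRY if media_type in d.supported_media_types]
```

With this shape, `/detect` can fan a file out to `detectors_for(media_type)` and return the combined results.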

Current Models

| Model | Media | Architecture | Source |
|-------|-------|--------------|--------|
| ViT Deep-Fake Detector v2 | Image | ViT-base (HuggingFace pipeline) | prithivMLmods/Deep-Fake-Detector-v2-Model |
| Frame Sampler | Video | Samples 20 video frames, runs the ViT image detector on each, averages scores | Uses the image model above |
| VoiceGen | Audio (from video) | Dual RawNet2 encoders with domain-agnostic feature disentanglement, SAM optimization, 59M params | Purdue-M2/AI-Synthesized-Voice-Generalization |
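The Frame Sampler's strategy (20 frames, per-frame image scores, averaged) can be sketched as follows; the helper names and the evenly-spaced sampling policy are assumptions, not necessarily what `frame_sampler.py` does:

```python
def sample_indices(total_frames: int, n_samples: int = 20) -> list[int]:
    """Pick evenly spaced frame indices across the clip (hypothetical policy)."""
    if total_frames <= n_samples:
        return list(range(total_frames))
    step = total_frames / n_samples
    return [int(i * step) for i in range(n_samples)]

def aggregate_scores(frame_scores: list[float]) -> float:
    """Video-level score = mean of the per-frame image-detector scores."""
    return sum(frame_scores) / len(frame_scores)
```

Averaging makes the video verdict robust to a few ambiguous frames, at the cost of diluting a deepfake that appears in only a short segment of the clip.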

Quick Start

Requirements

  • Python 3.11+
  • ~600MB disk for model weights (downloaded on first run)

Setup

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

CLI

Starts the server, sends a file, prints results, and shuts down:

python cli.py path/to/video.mp4
python cli.py path/to/image.jpg

API

Start the server:

uvicorn app.main:app --host 0.0.0.0 --port 8000

Detect a file:

curl -X POST http://localhost:8000/detect \
  -F "file=@video.mp4"

Health check:

curl http://localhost:8000/health

Benchmark

Evaluates all models against labeled test files in media/ (filenames prefixed with fake- or real-):

python benchmark.py

Outputs per-model accuracy and writes detailed results to benchmark_results.csv.

Configuration

Environment variables:

| Variable | Default | Description |
|----------|---------|-------------|
| HF_TOKEN | (unset) | HuggingFace token (for gated models) |
| DEVICE | cpu | cpu or cuda |
| HOST | 0.0.0.0 | Server bind address |
| PORT | 8000 | Server port |
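A minimal sketch of how `app/config.py` might read these. The variable names and defaults come from the table above; the plain `os.environ` approach is an assumption (the actual module may use a settings library):

```python
import os

# HF_TOKEN has no default; it is only needed for gated models.
HF_TOKEN = os.environ.get("HF_TOKEN")
DEVICE = os.environ.get("DEVICE", "cpu")      # "cpu" or "cuda"
HOST = os.environ.get("HOST", "0.0.0.0")      # server bind address
PORT = int(os.environ.get("PORT", "8000"))    # server port
```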

Project Structure

deep-fake-detection/
├── app/
│   ├── main.py               # FastAPI app, /detect and /health endpoints
│   ├── config.py             # Settings from env vars
│   ├── schemas.py            # Pydantic request/response models
│   └── detectors/
│       ├── base.py           # BaseDetector abstract class
│       ├── registry.py       # Detector registry and media type routing
│       ├── hf_image.py       # ViT image detector (HuggingFace)
│       ├── frame_sampler.py  # Video → frame sampling → image detector
│       └── voice_gen/        # VoiceGen dual-RawNet2 audio detector
├── cli.py                    # CLI tool
├── benchmark.py              # Model evaluation script
├── media/                    # Test files (fake-*.mp4, real-*.mp4)
├── requirements.txt
└── Dockerfile

Models to Evaluate in the Future

TrueMedia ML Models

Collection of deepfake detectors from TrueMedia.org covering image, video, and audio:

  • DistilDIRE (image) — distilled diffusion-based detector, 3.2x faster than DIRE, handles GAN and diffusion outputs
  • UniversalFakeDetectV2 (image) — CLIP-ViT feature spaces with nearest-neighbor/linear probing
  • GenConViT (video) — ConvNeXt + Swin Transformer hybrid with CNN autoencoder and VAE
  • StyleFlow (video) — style-latent flow anomaly detection with StyleGRU + supervised contrastive learning
  • FTCN (video) — temporal convolution network for long-term coherence detection
  • Transcript Based Detector (audio) — speech recognition + LLM analysis for factual coherence

Repository: https://github.com/truemediaorg/ml-models

Note: TrueMedia model weights require a formal request to aerin@truemedia.org with affiliation and intended use.

GenFace / CAEL

  • CAEL (image) — Cross-modal Appearance-Edge Learning transformer with multi-grained fusion, 158.63M params, 99.88% within-dataset ACC but 65.04% cross-dataset AUC
  • Dataset: GenFace (515K forged + 100K real faces covering GANs and diffusion methods)
  • Repository: https://github.com/Jenine-321/GenFace
  • Skipped for now: redundant with existing ViT image detector, weak cross-dataset generalization

Previously Integrated (Removed — Underperformed)

Other Evaluated Models

  • WaveSpect (audio) — hybrid waveform + CQT spectrogram analysis for synthetic audio detection. No public weights or code available yet.
  • FakeBrAccent / XGBoost (audio) — XGBoost/CNN on Brazilian-accented speech dataset. Too narrow (Portuguese-only, accent-specific) for general use.
  • BRSpeech-DF (audio) — Brazilian Portuguese deepfake speech dataset. Could be used to fine-tune AASIST but no pretrained weights provided.
  • SDD-APALLM (audio) — CQT spectrograms + LLM prompting. Interesting but no released code/weights.
  • F-SAT / DeepFakeVox-HQ (audio) — frequency-selective adversarial training for robustness. Code and dataset forthcoming.
  • deitfake-v2 (image) — DeiT-based image classifier on HuggingFace. Only 2 classes, redundant with existing ViT detector.
  • DFD-FCG (video) — frequency-aware CLIP with graph learning. Uses same CLIP ViT-L/14 backbone as GenD/D3; weights not publicly available.
  • FakeVLM (video/image) — vision-language model for explainable deepfake detection, 7B+ params. Too heavy for real-time pipeline.
  • DTAD (video) — temporal artifact detection. Code available but limited documentation and unclear weight availability.
