
ClimateSense KG Pipeline


The ClimateSense KG is a continuously updated knowledge graph that integrates climate fact-checking data from multiple sources to combat climate misinformation. It links information from fact-checking organizations with enriched data, giving researchers a more comprehensive view of the problem.

Table of Contents

• Quick Start
• Docker Setup
• Configuration
• Querying the Cache
• Querying the Knowledge Graph
• Development
• Acknowledgments

Quick Start

Install:

git clone https://github.com/climatesense-project/climatesense-kg.git
cd climatesense-kg
just install

Run:

just run config/minimal.yaml

Docker Setup

Requirements:

• Docker Engine with the Docker Compose plugin (the steps below use docker compose)

Initial Setup:

  1. Clone the repository and navigate to the docker directory:

    git clone https://github.com/climatesense-project/climatesense-kg.git
    cd climatesense-kg/docker
  2. Copy and configure environment variables:

    cp .env.example .env

    Edit .env to configure:

    • GITHUB_TOKEN: GitHub token used for private repositories
    • VIRTUOSO_HOST: Virtuoso host name (default virtuoso)
    • VIRTUOSO_PORT: Virtuoso HTTP/SPARQL port (default 8890)
    • VIRTUOSO_ISQL_PORT: Virtuoso ISQL port (default 1111)
    • VIRTUOSO_USER: Virtuoso database user (default dba)
    • VIRTUOSO_PASSWORD: Virtuoso database password (default dba)
    • VIRTUOSO_ISQL_SERVICE_URL: Virtuoso ISQL HTTP endpoint (default http://isql-service:8080)
    • ISQL_SERVICE_PORT: Published port for the ISQL helper service (default 8080)
    • CIMPLE_FACTORS_API_URL: CIMPLE Factors API base URL (default http://localhost:8000)
    • POSTGRES_HOST: Cache database host (default postgres)
    • POSTGRES_PORT: Cache database port (default 5432)
    • POSTGRES_DB: Cache database name (default climatesense_cache)
    • POSTGRES_USER: Cache database user (default postgres)
    • POSTGRES_PASSWORD: Cache database password (required)
    • ANALYTICS_SPARQL_ENDPOINT: Virtuoso SPARQL endpoint for analytics (default http://virtuoso:8890/sparql)
    • ANALYTICS_ALLOWED_ORIGINS: Comma-separated origins permitted to call the analytics API (default http://localhost:3000)
    • ANALYTICS_CACHE_TTL: Analytics API cache TTL in seconds (default 60)
    • ANALYTICS_SPARQL_TIMEOUT: SPARQL timeout in seconds for analytics queries (default 20)
    • NEXT_PUBLIC_ANALYTICS_API_URL: Base URL the dashboard uses for the analytics API (default http://localhost:8000)
    • ANALYTICS_API_PORT: Published port for the analytics API container (default 8000)
    • ANALYTICS_UI_PORT: Published port for the analytics UI container (default 3000)
  3. Start the services:

    docker compose up -d
  4. Run the pipeline:

    docker compose run --rm pipeline run -c config/minimal.yaml
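
The .env file from step 2 uses simple KEY=VALUE lines. As a rough illustration (not project code), such a file can be parsed like this:

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    env: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"')
    return env

sample = """
# Virtuoso connection
VIRTUOSO_HOST=virtuoso
VIRTUOSO_PORT=8890
POSTGRES_PASSWORD="s3cret"
"""
config = parse_env(sample)
print(config["VIRTUOSO_HOST"])  # virtuoso
```

Note that real .env loaders (Docker Compose, python-dotenv) handle more edge cases, such as export prefixes and multi-line values; this sketch only covers the plain KEY=VALUE form used above.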

Configuration

The pipeline uses YAML-based configuration. Example config:

data_sources:
  - name: "claimreview_sample"
    type: "claimreviewdata"
    input_path: "samples/claimreviewdata-data"
  - name: "euroclimatecheck_sample"
    type: "euroclimatecheck"
    input_path: "samples/euroclimatecheck-data"

enrichment:
  url_text_extraction:
    enabled: true
    rate_limit_delay: 0.5
    timeout: 15
    max_retries: 2

  dbpedia_spotlight:
    enabled: true
    api_url: "https://api.dbpedia-spotlight.org/en/annotate"
    confidence: 0.6
    support: 30
    timeout: 20
    rate_limit_delay: 0.2

  bert_factors:
    enabled: true
    batch_size: 32
    max_length: 128
    timeout: 30
    rate_limit_delay: 0.1

output:
  format: "turtle"
  output_path: "data/rdf/{DATE}/{SOURCE}.ttl"
  base_uri: "http://data.climatesense-project.eu"

cache:
  cache_dir: "cache"
  default_ttl_hours: 24.0
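
The output_path value is a template with {DATE} and {SOURCE} placeholders. As a hypothetical illustration of how such a template could expand (the pipeline's actual substitution logic may differ):

```python
from datetime import date

# Expand the {DATE}/{SOURCE} placeholders from the output.output_path template.
# Illustrative only; the values below are made up, not taken from a real run.
template = "data/rdf/{DATE}/{SOURCE}.ttl"
path = template.format(DATE=date(2024, 5, 1).isoformat(), SOURCE="claimreview_sample")
print(path)  # data/rdf/2024-05-01/claimreview_sample.ttl
```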

Querying the Cache

Any PostgreSQL client can connect to the cache database and run SQL queries against it.

Example SQL Queries

-- Processing success rates by step
SELECT step, COUNT(*) AS total, COUNT(*) FILTER (WHERE success) AS successes
FROM cache_entries GROUP BY step;

-- Error analysis by domain
SELECT split_part(payload->'payload'->>'review_url', '/', 3) AS domain, COUNT(*) AS failures
FROM cache_entries WHERE success = false GROUP BY domain;
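
The split_part(url, '/', 3) call in the second query pulls the host out of a URL. The same grouping can be sketched in plain Python (illustrative only; the URLs below are made up):

```python
from collections import Counter
from urllib.parse import urlparse

# Group failing review URLs by domain, mirroring the SQL's split_part logic.
failed_urls = [
    "https://example.org/fact-check/1",
    "https://example.org/fact-check/2",
    "https://other.example.com/check",
]
failures_by_domain = Counter(urlparse(u).netloc for u in failed_urls)
print(failures_by_domain.most_common(1))  # [('example.org', 2)]
```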

Querying the Knowledge Graph

Once loaded into Virtuoso, query the knowledge graph using SPARQL:

Example SPARQL Queries

Find all climate claims:

PREFIX schema: <http://schema.org/>
SELECT ?claim ?text ?rating
WHERE {
  ?claim a schema:ClaimReview ;
         schema:claimReviewed ?text ;
         schema:reviewRating ?rating .
}
LIMIT 10

List claims with their fact-checking organizations:

PREFIX schema: <http://schema.org/>
SELECT ?claim ?author
WHERE {
  ?claim a schema:ClaimReview ;
         schema:author ?author .
}
LIMIT 10
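
These queries can also be issued over HTTP against the Virtuoso endpoint (port 8890 in the Docker setup). A minimal stdlib sketch of building such a request, assuming a locally published endpoint; actually sending it requires the running stack:

```python
from urllib.parse import urlencode
from urllib.request import Request

# Build (but do not send) a SPARQL GET request against the local Virtuoso endpoint.
endpoint = "http://localhost:8890/sparql"
query = """
PREFIX schema: <http://schema.org/>
SELECT ?claim ?text WHERE {
  ?claim a schema:ClaimReview ; schema:claimReviewed ?text .
} LIMIT 10
"""
params = urlencode({"query": query, "format": "application/sparql-results+json"})
req = Request(f"{endpoint}?{params}",
              headers={"Accept": "application/sparql-results+json"})
print(req.full_url[:60])
```

Dispatching the request with urllib.request.urlopen(req) (or any HTTP client) returns the results as SPARQL JSON.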

Development

Setup

just setup-dev

Common Tasks

just format          # Format code with ruff
just check           # Run linting and type checks
just pre-commit-all  # Run pre-commit on all files

CLI Usage

# Display help
uv run climatesense-kg --help

# Run minimal pipeline with debug logging
uv run climatesense-kg run --config config/minimal.yaml --debug

# Run daily pipeline skipping data download and forcing full RDF regeneration
uv run climatesense-kg run --config config/daily.yaml --skip-download --force-regenerate

Acknowledgments

This project builds upon the work of the CIMPLE project and reuses some of its components.
