293 changes: 293 additions & 0 deletions INFRA.MD
@@ -0,0 +1,293 @@
# Testnet Infrastructure Operations Guide

## Architecture Overview

Each HP masternode runs **dashmate** which orchestrates Docker containers for:
- **Core** (`dashmate_testnet-core-1`): Dash Core daemon (dashd)
- **Drive ABCI** (`dashmate_testnet-drive_abci-1`): Platform state machine
- **Drive Tenderdash** (`dashmate_testnet-drive_tenderdash-1`): BFT consensus engine
- **Gateway** (`dashmate_testnet-gateway-1`): Envoy proxy for DAPI
- **RS DAPI** (`dashmate_testnet-rs_dapi-1`): Rust DAPI implementation
- **Dashmate Helper** (`dashmate_testnet-dashmate_helper-1`): Background tasks
- **Gateway Rate Limiter** (`dashmate_testnet-gateway_rate_limiter-1`): Rate limiting (Redis + metrics)

The wallet node (`dashd-wallet-1`) runs standalone dashd with the MNO wallet for managing masternode registrations and collateral.

## Key Files

| File | Purpose |
|------|---------|
| `networks/testnet.yml` | Node keys (owner, collateral, operator, node_key), dashmate version, passwords |
| `networks/testnet.inventory` | Ansible inventory with IPs, protx hashes, host groups |
| `ansible/deploy.yml` | Main deployment playbook with tagged plays |
| `ansible/roles/dashmate/` | Dashmate installation, config, SSL, restart logic |
| `ansible/roles/mn_init/` | Masternode registration (key import, collateral funding, protx register) |
| `ansible/roles/mn_unban/` | ProUpServTx to revive PoSe-banned nodes |

Note: `networks/` is a separate private git repo (`dashpay/dash-network-configs`), gitignored by the parent repo.

## Dashmate Commands

All dashmate commands must be run as the **dashmate** user:

```bash
# SSH to a node
ssh ubuntu@<IP>

# Status
sudo -u dashmate dashmate status

# Start/Stop/Restart
sudo -u dashmate dashmate start --verbose
sudo -u dashmate dashmate stop --verbose
sudo -u dashmate dashmate stop --force --verbose # Skip DKG check
sudo -u dashmate dashmate restart --verbose
sudo -u dashmate dashmate restart --force --verbose # Skip DKG check
sudo -u dashmate dashmate restart --platform --verbose # Platform only, keeps Core running

# Config
sudo -u dashmate dashmate config get <path>
sudo -u dashmate dashmate config set <path> <value>
sudo -u dashmate dashmate config render --verbose # Regenerate docker-compose from config
sudo -u dashmate dashmate config default testnet # Set default config name

# SSL
sudo -u dashmate dashmate ssl obtain --verbose

# Core operations (run as root, not dashmate user)
sudo dashmate core reindex # Interactive prompt - may hang in scripts
```

### Restart Modes

| Mode | Flag | Behaviour |
|------|------|-----------|
| Default | (none) | Refuses to proceed while a DKG session is active; waiting for a safe window can time out |
| Force | `--force` | Always proceeds, risks a brief PoSe penalty |
| Platform only | `--platform` | Restarts platform services, leaves Core running |

## Checking Logs

```bash
# Docker logs (run as ubuntu or root)
sudo docker logs dashmate_testnet-core-1 --tail 50
sudo docker logs dashmate_testnet-drive_tenderdash-1 --tail 50
sudo docker logs dashmate_testnet-gateway-1 --tail 50
sudo docker logs dashmate_testnet-rs_dapi-1 --tail 50
sudo docker logs dashmate_testnet-drive_abci-1 --tail 50

# Log files on disk
ls -lhS /home/dashmate/logs/

# Common log files (can grow very large):
# drive-json.log, drive-pretty.log - Drive logs (can be 6GB+)
# drive-grovedb-operations.log - GroveDB ops (can be 4GB+)
# tenderdash.log - Tenderdash consensus
# core.log - Dash Core
```
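When disk pressure is the concern, a small helper can surface the worst offenders before you decide what to truncate. This is a generic sketch; only the default path is taken from the layout above:

```bash
# biggest_logs: list the N largest files in a directory, largest first.
# The default path matches this guide's log layout; both arguments are optional.
biggest_logs() {
  local dir="${1:-/home/dashmate/logs}" n="${2:-5}"
  # ls -S sorts by size; the extra head line accounts for ls's "total" row
  ls -lhS "$dir" 2>/dev/null | head -n "$((n + 1))"
}
```

Example: `biggest_logs /home/dashmate/logs 3` shows the three largest log files.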

## Common Issues and Fixes

### EvoDB Inconsistency / Core Stuck

**Symptoms**: Core crashes with `Found EvoDB inconsistency, you must reindex to continue` or core is stuck at a height with "Potential stale tip detected" and block headers marked conflicting.

**Fix**: Wipe evoDB and chainstate, let core rebuild from existing block data:

```bash
sudo -u dashmate dashmate stop --force --verbose
sudo docker run --rm -v dashmate_testnet_core_data:/data alpine sh -c \
'rm -rf /data/.dashcore/testnet3/evodb /data/.dashcore/testnet3/chainstate && echo done'
sudo -u dashmate dashmate start --verbose
```

Core will rebuild from the existing block data (it starts from height 0 and takes hours for the full testnet chain). Do NOT use `dashmate core reindex` as it has an interactive prompt that hangs in non-interactive contexts.

### Disk Full

**Symptoms**: Docker logs fail with `no space left on device`, core crashes.

**Fix**: Truncate large log files:

```bash
df -h /
sudo du -sh /home/dashmate/logs/
sudo truncate -s 0 /home/dashmate/logs/drive-json.log \
/home/dashmate/logs/drive-pretty.log \
/home/dashmate/logs/drive-grovedb-operations.log \
/home/dashmate/logs/tenderdash.log \
/home/dashmate/logs/core.log
```
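A hedged variant of the same fix, which truncates anything over a size cap instead of hard-coding file names (the path and 1G threshold mirror the manual commands above):

```bash
# truncate_big_logs: zero out every log file above a size cap.
# Prints each file it touches; defaults mirror the manual fix above.
truncate_big_logs() {
  local dir="${1:-/home/dashmate/logs}" cap="${2:-+1G}"
  find "$dir" -type f -size "$cap" -print -exec truncate -s 0 {} \;
}
```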

### Docker Network Overlap

**Symptoms**: `dashmate start` fails with `Pool overlaps with other one on this address space`.

**Fix**: Old containers and networks from a previous config prefix are conflicting. Remove them and restart (note: the commands below stop and remove **all** containers on the host):

```bash
sudo docker stop $(sudo docker ps -q)
sudo docker rm $(sudo docker ps -aq)
sudo docker network prune -f
sudo -u dashmate dashmate start --verbose
```

### Platform Error (Tenderdash crash-looping)

**Symptoms**: Platform status shows `error`, tenderdash logs show `unexpected masternode state POSE_BANNED`.

**Cause**: Tenderdash refuses to start if the masternode is PoSe-banned. Fix the ban first (see ProUpServTx below), then tenderdash will start automatically on its next restart cycle.

### SSL Certificate Issues

**Symptoms**: Platform in error, gateway can't serve HTTPS.

**Prerequisites for `dashmate ssl obtain`**:
- `externalIp` must be set in config
- `platform.gateway.ssl.enabled` must be `true`
- `platform.gateway.ssl.providerConfigs.zerossl.apiKey` must be set
- SSL directory must contain files, not directories (if directories exist at the `bundle.crt` or `private.key` paths, `rm -rf` them first)

```bash
# Check current SSL config
sudo -u dashmate dashmate config get platform.gateway.ssl

# Set required values if missing
sudo -u dashmate dashmate config set externalIp <IP>
sudo -u dashmate dashmate config set platform.gateway.ssl.enabled true
sudo -u dashmate dashmate config set platform.gateway.ssl.providerConfigs.zerossl.apiKey <key>

# Obtain cert
sudo -u dashmate dashmate ssl obtain --verbose

# Fix if bundle.crt/private.key are directories instead of files
sudo rm -rf /root/.dashmate/testnet/platform/gateway/ssl/bundle.crt
sudo rm -rf /root/.dashmate/testnet/platform/gateway/ssl/private.key
sudo -u dashmate dashmate ssl obtain --verbose
```

### Dashmate Config Not Taking Effect

**Symptom**: Config file on disk has correct values but `dashmate config get` returns null.

**Cause**: The config.json was written by ansible but dashmate's internal state diverged. Use `dashmate config set` to set values explicitly, or `dashmate config render` to regenerate service configs.

## ProTx Lifecycle

### Fresh Registration

Run via ansible:
```bash
./bin/deploy -p --tags=unban_hp_masternodes testnet
```

This handles key import, wallet rescan, collateral funding (4000 DASH), `protx register_evo`, and writing the protx hash to the inventory.

### Unbanning (ProUpServTx)

When a node is PoSe-banned, send a ProUpServTx to revive it:

```bash
# From the wallet node
dash-cli -rpcwallet=dashd-wallet-1-mno protx update_service_evo \
<protx_hash> \
'<IP>:19999' \
<operator_private_key> \
<platform_node_id> \
36656 1443
```

If you get `protx-dup`, it means the on-chain details already match. Use a fee source address to make the transaction unique:

```bash
# Fund the owner address first
dash-cli -rpcwallet=dashd-wallet-1-mno sendtoaddress <owner_address> 0.01

# Then use it as fee source (last parameter)
dash-cli -rpcwallet=dashd-wallet-1-mno protx update_service_evo \
<protx_hash> '<IP>:19999' <operator_private_key> \
<platform_node_id> 36656 1443 '' <owner_address>
```
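The long argument list is easy to fat-finger, so a small helper can compose the command for review before you run it. The wallet name and ports are the ones used in this guide; every other value is a placeholder:

```bash
# unban_cmd: print the ProUpServTx command for review (pipe to sh to execute).
# Ports 36656/1443 and the wallet name follow this guide; adjust if yours differ.
unban_cmd() {
  local protx="$1" ip="$2" opkey="$3" node_id="$4" fee_src="${5:-}"
  local cmd="dash-cli -rpcwallet=dashd-wallet-1-mno protx update_service_evo"
  cmd="$cmd $protx $ip:19999 $opkey $node_id 36656 1443"
  # A fee-source address makes the tx unique when you would otherwise hit protx-dup
  [ -n "$fee_src" ] && cmd="$cmd '' $fee_src"
  printf '%s\n' "$cmd"
}
```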

### Checking ProTx Status

```bash
# From wallet node
dash-cli -rpcwallet=dashd-wallet-1-mno protx info <protx_hash>

# Key fields:
# PoSePenalty: 0 = healthy, 543 = max (banned)
# PoSeBanHeight: -1 = not banned, >0 = banned at this height
# PoSeRevivedHeight: -1 = never revived, >0 = revived at this height
```
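These fields can also be checked mechanically. The sketch below parses the `protx info` JSON from stdin with `sed` (field names as listed above; no `jq` dependency assumed):

```bash
# pose_status: summarise PoSe health from `protx info` JSON on stdin.
# Usage: dash-cli -rpcwallet=dashd-wallet-1-mno protx info <hash> | pose_status
pose_status() {
  local json penalty ban
  json=$(cat)
  penalty=$(printf '%s\n' "$json" | sed -n 's/.*"PoSePenalty": *\([0-9][0-9]*\).*/\1/p' | head -n1)
  ban=$(printf '%s\n' "$json" | sed -n 's/.*"PoSeBanHeight": *\(-\{0,1\}[0-9][0-9]*\).*/\1/p' | head -n1)
  if [ -n "$ban" ] && [ "$ban" != "-1" ]; then
    echo "BANNED at height $ban (penalty ${penalty:-?})"
  elif [ "${penalty:-0}" -gt 0 ]; then
    echo "PENALIZED (penalty $penalty)"
  else
    echo "HEALTHY"
  fi
}
```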

## Ansible Deployment

### Common Commands

```bash
# Full deploy to all nodes
./bin/deploy -p testnet

# Dashmate deploy to specific node(s)
./bin/deploy -p --tags=dashmate_deploy -a='--limit hp-masternode-3' testnet

# Fast mode (skips SSL, filebeat, image updates)
./bin/deploy -p --fast --tags=dashmate_deploy testnet

# Registration / unban only
./bin/deploy -p --tags=unban_hp_masternodes testnet
```

### Ansible Environment Setup

```bash
# Requires nix-shell for nodejs, and ansible venv
nix-shell -p nodejs_20 python3 --run "export PATH=/tmp/ansible-venv/bin:\$PATH && ./bin/deploy ..."

# Required pip packages in /tmp/ansible-venv:
# ansible, netaddr, boto3, botocore

# Required ansible galaxy roles:
# geerlingguy.filebeat, elastic.beats
```

### Known Ansible Gotchas

- **`gather_facts: false`** in deploy.yml (line 338) was changed to `true` because `geerlingguy.filebeat` needs `ansible_facts.os_family`
- **`default()` filter** does NOT trigger for YAML null values, only for undefined. Use `default(value, true)` for falsy values
- **`dashmate_core_rpc_quorum_list_password`** must be explicitly set in testnet.yml (not null) for dashmate 3.0.1 config validation
- **`rescanblockchain`** via ansible can appear to hang - the RPC is synchronous and blocks until complete on the full testnet chain
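The `default()` gotcha is easy to reproduce in a template; a minimal illustration with a placeholder variable name:

```yaml
# vars: password_from_config: null
example_a: "{{ password_from_config | default('fallback') }}"        # null passes through - default() only fires when the variable is undefined
example_b: "{{ password_from_config | default('fallback', true) }}"  # renders 'fallback' - the second argument extends default() to falsy values
```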

## AWS / IP Management

HP masternodes use a mix of standard EIPs and BYOIP addresses.

```bash
# Allocate a specific BYOIP address
aws ec2 allocate-address --region us-west-2 \
--address 68.67.122.X \
--ipam-pool-id ipam-pool-0de83ed8bba5f9b48

# Associate with an instance
aws ec2 associate-address --region us-west-2 \
--allocation-id eipalloc-XXXXX \
--instance-id i-XXXXX

# Release an old EIP
aws ec2 disassociate-address --region us-west-2 --association-id eipassoc-XXXXX
aws ec2 release-address --region us-west-2 --allocation-id eipalloc-XXXXX
```

## Current Node Status (as of 2026-02-26)

| Node | Status | Notes |
|------|--------|-------|
| hp-masternode-3 | READY, PoSe=0, Platform syncing | Freshly registered with new IP 68.67.122.3 |
| hp-masternode-4 | READY, PoSe=0, Platform syncing | Re-registered in previous session |
| hp-masternode-6 | READY, PoSe=0, Platform syncing | Re-registered in previous session |
| hp-masternode-16 | READY, PoSe=0, Platform up | rs-dapi metrics config updated |
| hp-masternode-18 | Syncing (99.97%) | Recently unbanned, waiting for core sync |
| hp-masternode-22 | Rebuilding chainstate | EvoDB corruption + disk full, logs truncated, rebuilding |
| hp-masternode-29 | Rebuilding chainstate | Stuck on conflicting block, evoDB wiped, rebuilding |
Comment on lines +283 to +293
⚠️ Potential issue | 🟠 Major

Avoid committing live node status and concrete infrastructure identifiers.

This section exposes operational state and host mapping (including a concrete public IP). Move this to a private runbook or redact to non-identifying examples to reduce reconnaissance risk.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@INFRA.MD` around lines 283 - 293, The "Current Node Status" table in INFRA.MD
exposes concrete hostnames and an IP (e.g., rows referencing hp-masternode-3,
hp-masternode-4, hp-masternode-6, hp-masternode-16, hp-masternode-18,
hp-masternode-22, hp-masternode-29 and the IP 68.67.122.3); remove or redact
these identifiers and the public IP from the committed doc and either move the
detailed operational status to a private runbook or replace the table with
non-identifying examples (e.g., "masternode-A", "masternode-B", status examples)
and a note pointing to the secure runbook for real values so no live
infrastructure details remain in the public repo.

2 changes: 1 addition & 1 deletion ansible/deploy.yml
@@ -335,7 +335,7 @@
 - name: Set up core and platform on HP masternodes
   hosts: hp_masternodes
   become: true
-  gather_facts: false
+  gather_facts: true

⚠️ Potential issue | 🟠 Major

Keep gather_facts disabled; gather only required facts explicitly.

Turning this on breaks the fast-deploy baseline for this play. Keep gather_facts: false and add a minimal setup pre-task to fetch only the facts needed by downstream roles (for example OS family).

Proposed change
 - name: Set up core and platform on HP masternodes
   hosts: hp_masternodes
   become: true
-  gather_facts: true
+  gather_facts: false
   # Using strategy: free for parallel execution to improve deployment speed
   # This is intentional for performance optimization
   strategy: free  # noqa: run-once[play]
   serial: 0
   pre_tasks:
+    - name: Gather required OS fact for role conditionals/templates
+      ansible.builtin.setup:
+        filter:
+          - ansible_os_family
     - name: Check inventory for HP masternodes
       ansible.builtin.set_fact:
         node: "{{ hp_masternodes[inventory_hostname] }}"

As per coding guidelines "ansible/deploy.yml: Add dashmate_deploy tag, set gather_facts: false, and use strategy: free in ansible/deploy.yml to enable fast, parallel deployments".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ansible/deploy.yml` at line 338, Change the play to keep gather_facts: false,
add strategy: free and the dashmate_deploy tag on the play, and create a
pre-task that runs the ansible setup module to collect only required facts
(e.g., ansible_facts.os_family) instead of full facts; update the play header
(gather_facts → false, add strategy: free and tags: [dashmate_deploy]) and add a
lightweight setup pre-task using the setup module with the filter parameter to
fetch only the minimal facts downstream roles need.

# Using strategy: free for parallel execution to improve deployment speed
# This is intentional for performance optimization
strategy: free # noqa: run-once[play]