billing-service/docs/multi-node-https-plan.md
Haitao Pan 886963606f
feat: add multi-source billing ingestion (#3)
Co-authored-by: Haitao Pan <manbuzhe2009@qq.com>
2026-04-12 13:51:47 +08:00

billing-service multi-node HTTPS ingestion plan

This document defines the target evolution path for billing-service from the current single EXPORTER_BASE_URL pull model to a secure multi-node ingestion model.

Goal

Keep billing-service as the single billing write model, but let it ingest snapshots from many remote xray-exporter instances over HTTPS without assuming private-network reachability.

Consistency budget

The target state does not require strong consistency at second-level granularity.

  • minute-level sync drift is acceptable
  • the system should be treated as eventually consistent across a short multi-minute window
  • user-facing reads in accounts.svc.plus and console.svc.plus may lag the newest exporter counters briefly
  • billing correctness matters more than immediate freshness

Operational meaning:

  • collector retries may intentionally overlap prior windows
  • delayed exporter delivery should be repaired by later collect or reconcile runs
  • the write model must converge to the correct minute buckets and ledger state without double charging

Why the current model is not enough

Today:

  • billing-service accepts one EXPORTER_BASE_URL
  • it fetches one GET /v1/snapshots/latest payload
  • it assumes the latest snapshot is enough to advance billing state

This is fine for a single local exporter, but it is not enough for:

  • multiple proxy nodes
  • exporters reachable only over public or cross-region networks
  • outage recovery where latest alone cannot prove whether intermediate windows were missed
  • source-specific authentication and certificate validation

Target design

1. Multi-source registry instead of one base URL

Target state replaces the single EXPORTER_BASE_URL dependency with a source registry owned by billing-service.

Each configured source should define at least:

  • source_id
  • node_id
  • env
  • base_url
  • enabled
  • auth_mode
  • credential_ref
  • ca_bundle_ref or trusted issuer reference
  • server_name
  • collect_interval
  • request_timeout

Rules:

  • target base_url must be https://...
  • node_id and env must match what the exporter emits
  • one source maps to one exporter endpoint, even if several sources later share the same network path
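
As a rough illustration of the registry fields and rules above, a source entry and its validation could look like the following sketch. The field names come from the list above; the `ExporterSource` class, the `validate_source` and `validate_registry` helpers, the default values, the units, and the allowed `auth_mode` values are assumptions made for this example, not part of billing-service.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class ExporterSource:
    """One registry entry; field names follow the list above."""
    source_id: str
    node_id: str
    env: str
    base_url: str
    enabled: bool = True
    auth_mode: str = "mtls"          # assumed values: "mtls" or "bearer"
    credential_ref: str = ""
    ca_bundle_ref: str = ""
    server_name: str = ""
    collect_interval: float = 60.0   # seconds (assumed unit and default)
    request_timeout: float = 10.0    # seconds (assumed unit and default)


def validate_source(src: ExporterSource) -> list[str]:
    """Return rule violations; an empty list means the entry is acceptable."""
    problems = []
    parsed = urlparse(src.base_url)
    if parsed.scheme != "https":
        problems.append("base_url must use https")
    if not parsed.hostname:
        problems.append("base_url must include a host")
    for name in ("source_id", "node_id", "env"):
        if not getattr(src, name):
            problems.append(f"{name} must not be empty")
    if src.auth_mode not in ("mtls", "bearer"):
        problems.append("auth_mode must be mtls or bearer")
    return problems


def validate_registry(sources) -> dict:
    """Validate every entry and flag repeated source_id values."""
    errors = {}
    seen = set()
    for src in sources:
        problems = validate_source(src)
        if src.source_id in seen:
            problems.append("duplicate source_id")
        seen.add(src.source_id)
        if problems:
            errors[src.source_id] = problems
    return errors
```

Checking that node_id and env match what the exporter emits can only happen at fetch time, so it is left out of this static validation.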

2. HTTPS-only upstream interaction

Target state requires secure transport for remote exporter pulls.

Security rules:

  • remote exporter pulls must use HTTPS
  • certificate verification must stay enabled
  • billing-service must not rely on insecure skip-verify mode
  • prefer mTLS for service-to-service trust
  • if mTLS is not yet available, use HTTPS plus a per-source bearer token
  • credentials must be scoped per source, not shared globally across all nodes

Recommended trust order:

  1. HTTPS + mTLS
  2. HTTPS + bearer token + pinned CA / trusted issuer
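
A minimal sketch of these security rules using Python's standard `ssl` module is shown below. The helper names `build_tls_context` and `auth_headers` and their parameters are hypothetical; how `credential_ref` and `ca_bundle_ref` resolve to file paths or tokens is not defined here and is assumed to happen elsewhere.

```python
import ssl


def build_tls_context(ca_bundle_path=None, client_cert_path=None, client_key_path=None):
    """Build a client-side TLS context with certificate verification left on."""
    # create_default_context sets verify_mode to CERT_REQUIRED and enables
    # hostname checking; this sketch never turns either of them off.
    ctx = ssl.create_default_context(cafile=ca_bundle_path)
    if client_cert_path:
        # mTLS: present a per-source client certificate to the exporter.
        ctx.load_cert_chain(certfile=client_cert_path, keyfile=client_key_path)
    return ctx


def auth_headers(auth_mode, bearer_token=None):
    """Return request headers for the source's auth mode (assumed values)."""
    if auth_mode == "bearer":
        if not bearer_token:
            raise ValueError("bearer mode requires a per-source token")
        return {"Authorization": f"Bearer {bearer_token}"}
    # With mTLS the client certificate carries the identity; no header needed.
    return {}
```

The resulting context can be handed to an HTTP client, for example `urllib.request.urlopen(request, context=ctx)`. Passing a pinned CA bundle as `ca_bundle_path` restricts trust to that issuer instead of the system store.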

3. Completeness-first pull contract

To make multi-node billing safe, the upstream contract must evolve from latest to a windowed pull API.

Recommended target contract:

GET /v1/snapshots/window?since=<RFC3339>&until=<RFC3339>&limit=<n>&cursor=<token>

Response shape should include:

  • source_id
  • node_id
  • env
  • window_start
  • window_end
  • items[]
  • next_cursor
  • has_more
  • emitted_at

Each item should still carry:

  • collected_at
  • samples[].uuid
  • samples[].email
  • samples[].inbound_tag
  • samples[].uplink_bytes_total
  • samples[].downlink_bytes_total

Why this matters:

  • latest is enough for observability, but not enough to prove billing completeness
  • windowed pagination lets billing-service resume from checkpoints and catch up after transient failures
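
The pagination loop implied by this contract can be sketched as follows. The query parameters and the `items`, `has_more`, and `next_cursor` fields are taken from the contract above; `fetch_page` is a hypothetical callable that performs the HTTPS GET and returns the decoded JSON body, and `max_pages` is an assumed guard against a misbehaving exporter.

```python
from urllib.parse import urlencode


def window_url(base_url, since, until, limit, cursor=None):
    """Build the windowed pull URL from RFC 3339 timestamps."""
    params = {"since": since, "until": until, "limit": limit}
    if cursor:
        params["cursor"] = cursor
    return f"{base_url.rstrip('/')}/v1/snapshots/window?{urlencode(params)}"


def pull_window(fetch_page, base_url, since, until, limit=100, max_pages=1000):
    """Follow next_cursor until has_more is false and return all items."""
    items = []
    cursor = None
    for _ in range(max_pages):
        page = fetch_page(window_url(base_url, since, until, limit, cursor))
        items.extend(page.get("items", []))
        if not page.get("has_more"):
            return items
        cursor = page.get("next_cursor")
        if not cursor:
            # A page that claims more data but gives no cursor cannot be resumed.
            raise ValueError("has_more is true but next_cursor is missing")
    raise RuntimeError("pagination did not terminate within max_pages")
```

Because the function returns only after the last page, a caller can advance the source checkpoint once the whole window has been read rather than after each page.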

4. Source checkpoints and replay safety

billing-service should track fetch progress per source, not globally.

Recommended source checkpoint fields:

  • source_id
  • last_successful_until
  • last_cursor
  • last_attempted_at
  • last_succeeded_at
  • last_error

Collection behavior:

  • pull per source using that source's last successful checkpoint
  • always overlap a small safety window during retries
  • rely on idempotent minute-bucket writes so overlap does not double-charge
  • expose source-level health in /v1/status
  • treat short multi-minute lag as acceptable if replay convergence is preserved
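
One way to picture the checkpoint behavior is the sketch below. The checkpoint fields are the ones listed above; the two-minute overlap, the ten-minute initial lookback, and the helper names are assumptions chosen for illustration. For simplicity the sketch overlaps on every pull, which is a conservative superset of overlapping only on retries.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class SourceCheckpoint:
    source_id: str
    last_successful_until: Optional[datetime] = None
    last_cursor: Optional[str] = None
    last_attempted_at: Optional[datetime] = None
    last_succeeded_at: Optional[datetime] = None
    last_error: Optional[str] = None


def next_window(cp, now, overlap=timedelta(minutes=2),
                initial_lookback=timedelta(minutes=10)):
    """Return (since, until) for the next pull of this source."""
    if cp.last_successful_until is None:
        since = now - initial_lookback
    else:
        # Step back past the checkpoint so late-arriving data is re-read.
        since = cp.last_successful_until - overlap
    return since, now


def record_success(cp, until, now):
    """Advance the checkpoint only after the whole window was ingested."""
    cp.last_successful_until = until
    cp.last_cursor = None
    cp.last_attempted_at = now
    cp.last_succeeded_at = now
    cp.last_error = None


def record_failure(cp, error, now):
    """Keep last_successful_until unchanged so the next run retries the gap."""
    cp.last_attempted_at = now
    cp.last_error = error
```

The per-source `last_error` and timestamps are the kind of data a source-level health entry in /v1/status could report.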

5. Safe write semantics

Security alone is not enough; the write path must remain replay-safe.

Target write-path rules:

  • billing facts remain keyed by node_id, env, uuid, inbound_tag, and bucket time
  • re-fetching the same source window must not duplicate usage or ledger rows
  • reconcile jobs must be able to replay a source or time range intentionally
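
The replay-safety requirement can be illustrated with a keyed write, sketched here with SQLite from the Python standard library. The table name, column names, and the use of SQLite are assumptions for the example; only the key (node_id, env, uuid, inbound_tag, bucket time) comes from the rules above, and how per-bucket byte values are derived from the exporter's cumulative counters is out of scope.

```python
import sqlite3

# Hypothetical schema: the primary key mirrors the billing fact key above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS usage_buckets (
    node_id        TEXT NOT NULL,
    env            TEXT NOT NULL,
    uuid           TEXT NOT NULL,
    inbound_tag    TEXT NOT NULL,
    bucket_start   TEXT NOT NULL,
    uplink_bytes   INTEGER NOT NULL,
    downlink_bytes INTEGER NOT NULL,
    PRIMARY KEY (node_id, env, uuid, inbound_tag, bucket_start)
)
"""


def write_buckets(conn, rows):
    """Write bucket rows; replaying the same rows leaves one row per key."""
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "INSERT OR REPLACE INTO usage_buckets VALUES (?, ?, ?, ?, ?, ?, ?)",
            rows,
        )


def total_uplink(conn):
    """Sum uplink bytes across all buckets, returning 0 for an empty table."""
    row = conn.execute(
        "SELECT COALESCE(SUM(uplink_bytes), 0) FROM usage_buckets"
    ).fetchone()
    return row[0]
```

Writing the same window twice replaces rows instead of adding to them, so an overlapping retry or an intentional reconcile replay does not inflate totals.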

Phase 1. Preserve current runtime

  • keep EXPORTER_BASE_URL as legacy single-source mode
  • keep GET /v1/snapshots/latest for current deployment compatibility

Phase 2. Add source registry support

  • introduce a multi-source config model
  • let billing-service iterate sources internally
  • keep single-source config as a compatibility shim
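
The compatibility shim could be as small as the sketch below, which maps a legacy EXPORTER_BASE_URL onto a one-entry source list when no multi-source configuration is present. The `load_sources` helper, the `legacy-default` source_id, and the `pull_mode` key are hypothetical names used only for illustration.

```python
def load_sources(env, configured_sources=None):
    """Return the effective source list.

    env is a mapping such as os.environ; configured_sources is the
    multi-source config if one exists (its format is assumed here).
    """
    if configured_sources:
        # Multi-source config wins when present.
        return list(configured_sources)
    legacy_url = env.get("EXPORTER_BASE_URL", "").strip()
    if not legacy_url:
        return []
    # Legacy single-source mode: one synthetic entry that keeps using
    # GET /v1/snapshots/latest.
    return [{
        "source_id": "legacy-default",
        "base_url": legacy_url,
        "enabled": True,
        "pull_mode": "latest",
    }]
```

With this shape the collector loop always iterates a list of sources, and the legacy deployment is just the single-element case.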

Phase 3. Add HTTPS window API to exporter

  • extend xray-exporter with a secure windowed snapshot API
  • add source authentication and certificate validation requirements

Phase 4. Dual-read migration

  • let billing-service support both:
    • legacy single-source latest
    • target multi-source HTTPS window pulls
  • compare source-level completeness and write counts during rollout
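
During dual-read, the comparison of completeness and write counts might be sketched as a set difference over bucket keys. The `compare_paths` helper and the shape of its report are assumptions; the keys are taken to be (node_id, env, uuid, inbound_tag, bucket time) tuples produced by each read path.

```python
def compare_paths(legacy_keys, window_keys):
    """Compare the bucket keys written by the legacy and window read paths."""
    legacy, window = set(legacy_keys), set(window_keys)
    return {
        "legacy_count": len(legacy),
        "window_count": len(window),
        # Buckets the legacy path saw but the window path missed.
        "missing_from_window": sorted(legacy - window),
        # Buckets only the window path recovered, e.g. after an outage.
        "only_in_window": sorted(window - legacy),
    }
```

A non-empty `missing_from_window` would be a signal to hold the rollout, while entries in `only_in_window` are expected wherever latest alone skipped intermediate windows.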

Phase 5. Make multi-source HTTPS the default

  • require HTTPS for remote exporter sources
  • reserve plain HTTP for explicit same-host dev or local-only modes
  • retire single global EXPORTER_BASE_URL as the primary production contract
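
The transport policy in this phase can be expressed as a small check, sketched below. The `transport_allowed` helper and its `allow_local_http` flag are hypothetical, and treating "same-host" as loopback addresses plus `localhost` is an assumption of this example.

```python
import ipaddress
from urllib.parse import urlparse


def transport_allowed(base_url, allow_local_http=False):
    """Accept HTTPS always; accept plain HTTP only for explicit local modes."""
    parsed = urlparse(base_url)
    if parsed.scheme == "https":
        return True
    if parsed.scheme != "http" or not allow_local_http:
        return False
    host = parsed.hostname or ""
    if host == "localhost":
        return True
    try:
        # Loopback covers 127.0.0.0/8 and ::1.
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Any other hostname is treated as remote and must use HTTPS.
        return False
```

Requiring the flag in addition to a loopback host keeps plain HTTP an explicit opt-in rather than something inferred from the URL.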

Non-goals

  • exposing billing-service as a user-facing query API
  • moving billing truth into Prometheus
  • weakening TLS verification to simplify rollout
  • making accounts.svc.plus call billing-service for runtime reads