200 lines
5.7 KiB
Markdown
200 lines
5.7 KiB
Markdown
# billing-service multi-node HTTPS ingestion plan
|
|
|
|
This document defines the target evolution path for `billing-service` from the
|
|
current single `EXPORTER_BASE_URL` pull model to a secure multi-node ingestion
|
|
model.
|
|
|
|
## Goal
|
|
|
|
Keep `billing-service` as the single billing write model, but let it ingest
|
|
snapshots from many remote `xray-exporter` instances over HTTPS without
|
|
assuming private-network reachability.
|
|
|
|
## Consistency budget
|
|
|
|
Target state does not require second-level strong consistency.
|
|
|
|
- minute-level sync drift is acceptable
|
|
- the system should be treated as eventually consistent across a short
|
|
multi-minute window
|
|
- user-facing reads in `accounts.svc.plus` and `console.svc.plus` may lag the
|
|
newest exporter counters briefly
|
|
- billing correctness matters more than immediate freshness
|
|
|
|
Operational meaning:
|
|
|
|
- collector retries may intentionally overlap prior windows
|
|
- delayed exporter delivery should be repaired by later collect or reconcile
|
|
runs
|
|
- the write model must converge to the correct minute buckets and ledger state
|
|
without double charging
|
|
|
|
## Why the current model is not enough
|
|
|
|
Today:
|
|
|
|
- `billing-service` accepts one `EXPORTER_BASE_URL`
|
|
- it fetches one `GET /v1/snapshots/latest` payload
|
|
- it assumes the latest snapshot is enough to advance billing state
|
|
|
|
This is fine for a single local exporter, but it is not enough for:
|
|
|
|
- multiple proxy nodes
|
|
- exporters reachable only over public or cross-region networks
|
|
- outage recovery where `latest` alone cannot prove whether intermediate
|
|
windows were missed
|
|
- source-specific authentication and certificate validation
|
|
|
|
## Target design
|
|
|
|
### 1. Multi-source registry instead of one base URL
|
|
|
|
Target state replaces the single `EXPORTER_BASE_URL` dependency with a source
|
|
registry owned by `billing-service`.
|
|
|
|
Each configured source should define at least:
|
|
|
|
- `source_id`
|
|
- `node_id`
|
|
- `env`
|
|
- `base_url`
|
|
- `enabled`
|
|
- `auth_mode`
|
|
- `credential_ref`
|
|
- `ca_bundle_ref` or trusted issuer reference
|
|
- `server_name`
|
|
- `collect_interval`
|
|
- `request_timeout`
|
|
|
|
Rules:
|
|
|
|
- target `base_url` must be `https://...`
|
|
- `node_id` and `env` must match what the exporter emits
|
|
- one source maps to one exporter endpoint, even if several sources later share
|
|
the same network path
|
|
|
|
### 2. HTTPS-only upstream interaction
|
|
|
|
Target state requires secure transport for remote exporter pulls.
|
|
|
|
Security rules:
|
|
|
|
- remote exporter pulls must use HTTPS
|
|
- certificate verification must stay enabled
|
|
- `billing-service` must not rely on insecure skip-verify mode
|
|
- prefer mTLS for service-to-service trust
|
|
- if mTLS is not yet available, use HTTPS plus a per-source bearer token
|
|
- credentials must be scoped per source, not shared globally across all nodes
|
|
|
|
Recommended trust order:
|
|
|
|
1. HTTPS + mTLS
|
|
2. HTTPS + bearer token + pinned CA / trusted issuer
|
|
|
|
### 3. Completeness-first pull contract
|
|
|
|
To make multi-node billing safe, the upstream contract must evolve from
|
|
`latest` to a windowed pull API.
|
|
|
|
Recommended target contract:
|
|
|
|
`GET /v1/snapshots/window?since=<RFC3339>&until=<RFC3339>&limit=<n>&cursor=<token>`
|
|
|
|
Response shape should include:
|
|
|
|
- `source_id`
|
|
- `node_id`
|
|
- `env`
|
|
- `window_start`
|
|
- `window_end`
|
|
- `items[]`
|
|
- `next_cursor`
|
|
- `has_more`
|
|
- `emitted_at`
|
|
|
|
Each item should still carry:
|
|
|
|
- `collected_at`
|
|
- `samples[].uuid`
|
|
- `samples[].email`
|
|
- `samples[].inbound_tag`
|
|
- `samples[].uplink_bytes_total`
|
|
- `samples[].downlink_bytes_total`
|
|
|
|
Why this matters:
|
|
|
|
- `latest` is enough for observability, but not enough to prove billing
|
|
completeness
|
|
- windowed pagination lets `billing-service` resume from checkpoints and catch
|
|
up after transient failures
|
|
|
|
### 4. Source checkpoints and replay safety
|
|
|
|
`billing-service` should track fetch progress per source, not globally.
|
|
|
|
Recommended source checkpoint fields:
|
|
|
|
- `source_id`
|
|
- `last_successful_until`
|
|
- `last_cursor`
|
|
- `last_attempted_at`
|
|
- `last_succeeded_at`
|
|
- `last_error`
|
|
|
|
Collection behavior:
|
|
|
|
- pull per source using that source's last successful checkpoint
|
|
- always overlap a small safety window during retries
|
|
- rely on idempotent minute-bucket writes so overlap does not double-charge
|
|
- expose source-level health in `/v1/status`
|
|
- treat short multi-minute lag as acceptable if replay convergence is preserved
|
|
|
|
### 5. Safe write semantics
|
|
|
|
Security alone is not enough; the write path must remain replay-safe.
|
|
|
|
Target write-path rules:
|
|
|
|
- billing facts remain keyed by `node_id`, `env`, `uuid`, `inbound_tag`, and
|
|
bucket time
|
|
- re-fetching the same source window must not duplicate usage or ledger rows
|
|
- reconcile jobs must be able to replay a source or time range intentionally
|
|
|
|
## Recommended rollout
|
|
|
|
### Phase 1. Preserve current runtime
|
|
|
|
- keep `EXPORTER_BASE_URL` as legacy single-source mode
|
|
- keep `GET /v1/snapshots/latest` for current deployment compatibility
|
|
|
|
### Phase 2. Add source registry support
|
|
|
|
- introduce a multi-source config model
|
|
- let `billing-service` iterate sources internally
|
|
- keep single-source config as a compatibility shim
|
|
|
|
### Phase 3. Add HTTPS window API to exporter
|
|
|
|
- extend `xray-exporter` with a secure windowed snapshot API
|
|
- add source authentication and certificate validation requirements
|
|
|
|
### Phase 4. Dual-read migration
|
|
|
|
- let `billing-service` support both:
|
|
- legacy single-source `latest`
|
|
- target multi-source HTTPS window pulls
|
|
- compare source-level completeness and write counts during rollout
|
|
|
|
### Phase 5. Make multi-source HTTPS the default
|
|
|
|
- require HTTPS for remote exporter sources
|
|
- reserve plain HTTP for explicit same-host dev or local-only modes
|
|
- retire single global `EXPORTER_BASE_URL` as the primary production contract
|
|
|
|
## Non-goals
|
|
|
|
- exposing `billing-service` as a user-facing query API
|
|
- moving billing truth into Prometheus
|
|
- weakening TLS verification to simplify rollout
|
|
- making `accounts.svc.plus` call `billing-service` for runtime reads
|