Monitoring and observability

Public-facing health signals without exposing internal consoles.

A public-safe summary of core platform health. Detailed Grafana and Prometheus views stay behind the executive login, while this page gives a quick signal that the public site and lab services are reachable.

  • Public view, not admin console
  • Hybrid on-prem and edge path
  • Designed for executive readability

Health summary

Public-safe status tiles (values load live on the page):

  • Lab-core API
  • Prometheus
  • Grafana
  • Host exporter

The summary also reports a last-updated time and an overall status.
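
One plausible way to derive the overall status from the individual tiles is a worst-status rollup. The sketch below is illustrative only: the status names (`ok`, `unknown`, `degraded`, `down`), their ordering, and the `overall_status` function are assumptions, not the site's actual implementation.

```python
from typing import Mapping

# Hypothetical severity ordering; a higher number means a worse state.
SEVERITY = {"ok": 0, "unknown": 1, "degraded": 2, "down": 3}


def overall_status(tiles: Mapping[str, str]) -> str:
    """Return the worst status among the tiles, or "unknown" if there are none."""
    if not tiles:
        return "unknown"
    # Unrecognized status strings are treated as "unknown" instead of raising.
    normalized = [s if s in SEVERITY else "unknown" for s in tiles.values()]
    return max(normalized, key=SEVERITY.__getitem__)


# Example tile values are made up for illustration.
tiles = {
    "Lab-core API": "ok",
    "Prometheus": "ok",
    "Grafana": "degraded",
    "Host exporter": "ok",
}
print(overall_status(tiles))  # degraded
```

Taking the worst status keeps the public summary conservative: a single degraded component is enough to change the headline value, without revealing why.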

Health Model

The platform should answer four questions quickly.

Is it up? External reachability, ingress, DNS, TLS, and public routing.
Is it healthy? Host pressure, VM pressure, service latency, and application state.
Is it degrading? Error rates, queueing, model latency, tunnel churn, and restart patterns.
Can it recover? Backups, restart paths, rebuild scripts, and clear ownership of the stack.
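
The "Is it degrading?" question can be approximated by comparing a recent window of a signal against its earlier baseline. This is a minimal sketch under stated assumptions: the `is_degrading` function, the window size, the multiplier, and the floor are all hypothetical choices, not values from this platform.

```python
from statistics import mean
from typing import Sequence


def is_degrading(
    error_rates: Sequence[float],
    recent: int = 3,       # assumed: number of newest samples to treat as "now"
    factor: float = 2.0,   # assumed: how far above baseline counts as degrading
    floor: float = 0.01,   # assumed: ignore changes below this absolute rate
) -> bool:
    """Return True if the recent mean error rate clearly exceeds the baseline mean."""
    if len(error_rates) <= recent:
        return False  # not enough history to form a baseline
    baseline = mean(error_rates[:-recent])
    current = mean(error_rates[-recent:])
    return current > max(baseline * factor, floor)


# Made-up samples: a steady 1% error rate that climbs in the last three readings.
samples = [0.01, 0.01, 0.01, 0.01, 0.05, 0.06, 0.07]
print(is_degrading(samples))  # True
```

The same comparison could be applied to queue depth, model latency, or restart counts; the floor avoids flagging noise when the baseline is close to zero.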

Coverage

What the monitoring layer is intended to cover.

Infrastructure

Host and virtualization layer

  • Proxmox host resource pressure and storage health
  • VM availability, restart history, and network reachability
  • Tunnel continuity between on-prem and OCI edge

Applications

Public and internal service surfaces

  • `cstreicher.com` and Resume AI response health
  • Candidate lab readiness and reset state
  • ERP uptime, job failures, and integration boundaries

Signals

Metrics, logs, and human-readable summaries

  • Executive-safe summary view for quick status checks
  • Internal dashboarding for root-cause analysis
  • Alert paths for drift, failure, and performance change

Operations

How this should mature from a summary page into a true operations view.

Public-safe tiles

  • Public website reachability
  • Resume AI availability
  • Tunnel and edge path status
  • Overall platform status
  • Last notable incident window

Private-only tiles

  • Detailed host metrics and storage usage
  • Grafana drill-downs and Prometheus internals
  • ERP application logs and candidate-lab admin controls

Near-term build order

1. Public monitoring summary
2. Internal Grafana map for host and VM health
3. Candidate-lab telemetry and reset status
4. ERP health and job/error panels

Design rule

  • Public pages explain status and operational discipline.
  • Private dashboards expose the details needed to fix issues.
  • The site should communicate control without leaking internals.
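
The design rule can be enforced in code by publishing only an explicit allowlist of fields, so that anything new added to an internal record stays private by default. The sketch below is an assumption-laden illustration: the field names, the `to_public_view` function, and the sample record (including its hostname and timestamp) are hypothetical placeholders.

```python
from typing import Any, Dict

# Assumed allowlist: only these keys may appear on the public page.
PUBLIC_FIELDS = ("name", "status", "updated")


def to_public_view(internal: Dict[str, Any]) -> Dict[str, Any]:
    """Copy only allowlisted fields; everything else is dropped."""
    return {key: internal[key] for key in PUBLIC_FIELDS if key in internal}


# Placeholder record; none of these values come from the real platform.
internal_tile = {
    "name": "Grafana",
    "status": "ok",
    "updated": "2024-01-01T00:00:00Z",
    "internal_url": "http://grafana.internal.example:3000",
    "disk_used_pct": 71,
}
print(to_public_view(internal_tile))
```

An allowlist is preferred here over a blocklist because a forgotten field then fails safe: it is simply omitted from the public summary.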