Monitoring and observability

Public-facing health signals without exposing internal consoles.

A public-safe summary of core platform health. Detailed Grafana and Prometheus views stay behind the executive login, while this page gives a quick signal that the public site and lab services are reachable.


Public view, not admin console
Hybrid on-prem and edge path
Designed for executive readability

Health summary

public-safe

Status tiles, populated when the page loads:

Lab-core API
Prometheus
Grafana
Host exporter

The summary also reports a last-updated time and an overall status.
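
One way the overall status could be derived from the four tiles is a worst-of rollup. This is a minimal sketch, not the page's actual logic: the `Status` names, their ordering, and the `overall_status` helper are all assumptions made for illustration.

```python
from enum import IntEnum


class Status(IntEnum):
    """Hypothetical states, ordered so a higher value means a worse state."""
    OK = 0
    UNKNOWN = 1    # assumption: an unanswered check ranks worse than OK
    DEGRADED = 2
    DOWN = 3


def overall_status(components: dict[str, Status]) -> Status:
    """Roll component states up to one value by taking the worst one."""
    if not components:
        return Status.UNKNOWN  # nothing reported yet
    return max(components.values())


# Example values only; real tiles would be filled from live checks.
tiles = {
    "Lab-core API": Status.OK,
    "Prometheus": Status.OK,
    "Grafana": Status.DEGRADED,
    "Host exporter": Status.OK,
}
print(overall_status(tiles).name)  # DEGRADED
```

Taking the worst state keeps the headline honest: one degraded component is enough to stop the page from reporting a clean bill of health.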

Health model

The platform should answer four questions quickly.

Is it up? External reachability, ingress, DNS, TLS, and public routing.

Is it healthy? Host pressure, VM pressure, service latency, and application state.

Is it degrading? Error rates, queueing, model latency, tunnel churn, and restart patterns.

Can it recover? Backups, restart paths, rebuild scripts, and clear ownership of the stack.
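
"Is it degrading?" is a question about trend rather than current state. A minimal sketch of one possible check follows, assuming error rates are sampled at regular intervals. The function name, the window size, and the 2x threshold are illustrative assumptions, not values taken from this platform.

```python
from statistics import mean


def is_degrading(error_rates: list[float], recent: int = 3, factor: float = 2.0) -> bool:
    """Return True when the mean of the last `recent` samples exceeds
    `factor` times the mean of all earlier samples (the baseline)."""
    if len(error_rates) <= recent:
        return False  # not enough history to form a baseline
    baseline = mean(error_rates[:-recent])
    latest = mean(error_rates[-recent:])
    if baseline == 0:
        return latest > 0  # any errors after a clean baseline count as a change
    return latest > factor * baseline


# Hypothetical samples: a quiet baseline followed by a climb.
print(is_degrading([0.01, 0.01, 0.02, 0.01, 0.05, 0.06, 0.07]))  # True
print(is_degrading([0.01] * 7))                                  # False
```

The same comparison could be applied to queue depth, model latency, or restart counts. In practice a rule like this would more likely live in the alerting layer than in page code.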

Coverage

What the monitoring layer is intended to cover.

Infrastructure

Host and virtualization layer

Proxmox host resource pressure and storage health
VM availability, restart history, and network reachability
Tunnel continuity between on-prem and OCI edge
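
Network reachability for a VM or a tunnel endpoint can be approximated with a plain TCP connect. The sketch below uses only the Python standard library. The `is_reachable` helper and the host name in the example are hypothetical, and the `connect` parameter exists so the check can be exercised without touching the network.

```python
import socket
from typing import Callable


def is_reachable(host: str, port: int, timeout: float = 3.0,
                 connect: Callable[..., object] = socket.create_connection) -> bool:
    """Return True if a TCP connection to (host, port) opens within `timeout`."""
    try:
        conn = connect((host, port), timeout=timeout)
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
    close = getattr(conn, "close", None)
    if callable(close):
        close()  # release the socket; only the connect result matters here
    return True


def refuse(address, timeout):
    """Stub connector that simulates a port with nothing listening."""
    raise ConnectionRefusedError("stubbed: nothing is listening")


# Placeholder host name; the stub means no real connection is attempted.
print(is_reachable("edge.example.invalid", 443, connect=refuse))  # False
```

A successful connect only shows that the port is open. It says nothing about TLS validity or application health, which belong to separate checks.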

Applications

Public and internal service surfaces

`cstreicher.com` and Resume AI response health
Candidate lab readiness and reset state
ERP uptime, job failures, and integration boundaries

Signals

Metrics, logs, and human-readable summaries

Executive-safe summary view for quick status checks
Internal dashboarding for root-cause analysis
Alert paths for drift, failure, and performance change

Operations

How this should mature from a summary page into a true operations view.

Public-safe tiles

Public website reachability
Resume AI availability
Tunnel and edge path status
Overall platform status
Last notable incident window

Private-only tiles

Detailed host metrics and storage usage
Grafana drill-downs and Prometheus internals
ERP application logs and candidate-lab admin controls
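
The public/private split above can be enforced with an allowlist, so that anything not explicitly named stays private by default. This is a sketch under assumptions: every field name below is hypothetical and chosen only to mirror the tile lists.

```python
# Hypothetical keys mirroring the public-safe tiles.
PUBLIC_FIELDS = {
    "website_reachable",
    "resume_ai_available",
    "tunnel_status",
    "overall_status",
    "last_incident_window",
}


def public_summary(internal: dict[str, object]) -> dict[str, object]:
    """Keep only allowlisted keys; anything not named is dropped."""
    return {key: value for key, value in internal.items() if key in PUBLIC_FIELDS}


# Example internal record with made-up values.
internal_status = {
    "website_reachable": True,
    "resume_ai_available": True,
    "tunnel_status": "up",
    "overall_status": "ok",
    "host_disk_used_pct": 71.4,       # private: detailed host metric
    "erp_last_error": "job timeout",  # private: application log detail
}
print(sorted(public_summary(internal_status)))
```

An allowlist fails closed: a metric added to the internal record later does not reach the public page until someone deliberately lists it.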

Near-term build order

1. Public monitoring summary
2. Internal Grafana map for host and VM health
3. Candidate-lab telemetry and reset status
4. ERP health and job/error panels

Design rule

Public pages explain status and operational discipline.
Private dashboards expose the details needed to fix issues.
The site should communicate control without leaking internals.