Monitoring and observability
Public-facing health signals without exposing internal consoles.
A public-safe summary of core platform health. Detailed Grafana and Prometheus views stay behind the executive login, while this page gives a quick signal that the public site and lab services are reachable.
Public view, not admin console
Hybrid on-prem and edge path
Designed for executive readability
Health summary
public-safe
Health Model
The platform should answer three questions quickly.
Coverage
What the monitoring layer is intended to cover.
Infrastructure
Host and virtualization layer
- Proxmox host resource pressure and storage health
- VM availability, restart history, and network reachability
- Tunnel continuity between on-prem and OCI edge
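The reachability and tunnel-continuity checks in this list could be sketched as a plain TCP probe. This is a minimal illustration, not the platform's actual tooling: the function name `is_reachable`, the default timeout, and the injectable `connect` parameter are all assumptions made for the example.

```python
import socket
from typing import Callable


def is_reachable(
    host: str,
    port: int,
    timeout_s: float = 3.0,  # assumed default, tune per link
    connect: Callable[..., object] = socket.create_connection,
) -> bool:
    """Return True if a TCP connection to (host, port) opens within the timeout.

    `connect` is injectable (hypothetical design choice) so the check can be
    exercised without touching a real network.
    """
    try:
        conn = connect((host, port), timeout=timeout_s)
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
    close = getattr(conn, "close", None)
    if callable(close):
        close()
    return True
```

A probe like this only shows that a port accepts connections. It says nothing about whether the service behind it is healthy, so it suits the VM and tunnel layer better than the application layer.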
Applications
Public and internal service surfaces
- `cstreicher.com` and Resume AI response health
- Candidate lab readiness and reset state
- ERP uptime, job failures, and integration boundaries
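Response health for a public surface such as `cstreicher.com` might be reduced to a three-level label. The sketch below is one possible approach, not the site's real check: the labels `ok`, `degraded`, and `down`, the 2000 ms slowness threshold, and the function names `classify` and `probe` are assumptions.

```python
import time
import urllib.error
import urllib.request
from typing import Optional


def classify(status_code: Optional[int], latency_ms: float, slow_ms: float = 2000.0) -> str:
    """Map an HTTP status and latency to a coarse health label (assumed scheme)."""
    if status_code is None or status_code >= 500:
        return "down"  # no answer at all, or a server-side error
    if status_code >= 400 or latency_ms > slow_ms:
        return "degraded"  # answered, but with a client error or too slowly
    return "ok"


def probe(url: str, timeout_s: float = 5.0) -> str:
    """Fetch `url` once and classify the result."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            status: Optional[int] = resp.status
    except urllib.error.HTTPError as err:
        status = err.code  # the server answered with an error status
    except (urllib.error.URLError, OSError):
        status = None  # DNS failure, refused connection, or timeout
    latency_ms = (time.monotonic() - start) * 1000.0
    return classify(status, latency_ms)
```

Keeping `classify` separate from the network call makes the thresholds easy to test and adjust without issuing real requests.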
Signals
Metrics, logs, and human-readable summaries
- Executive-safe summary view for quick status checks
- Internal dashboarding for root-cause analysis
- Alert paths for drift, failure, and performance change
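An executive-safe summary needs one overall status derived from many component signals. A common way to do that is a worst-of roll-up, sketched here under assumptions: the level names, the component keys in the usage note, and the choice to treat unrecognized labels as `down` are illustrative, not taken from the platform.

```python
from typing import Mapping

# Ordered from healthiest to worst (assumed labels).
LEVELS = ("ok", "degraded", "down")


def overall_status(components: Mapping[str, str]) -> str:
    """Return the worst status among components, or "unknown" if there are none."""
    if not components:
        return "unknown"
    ranks = [
        # Unrecognized labels are treated as the worst level rather than ignored.
        LEVELS.index(status) if status in LEVELS else len(LEVELS) - 1
        for status in components.values()
    ]
    return LEVELS[max(ranks)]
```

For example, `overall_status({"website": "ok", "tunnel": "degraded"})` returns `"degraded"`, so one unhealthy component is never hidden behind healthy ones.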
Operations
How this should mature from a summary page into a true operations view.
Public-safe tiles
- Public website reachability
- Resume AI availability
- Tunnel and edge path status
- Overall platform status
- Last notable incident window
Private-only tiles
- Detailed host metrics and storage usage.
- Grafana drill-downs and Prometheus internals.
- ERP application logs and candidate-lab admin controls.
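The split between public-safe and private-only tiles can be enforced with an allowlist, so a newly added internal field stays private unless someone explicitly publishes it. The following is a minimal sketch with hypothetical key names (`website`, `resume_ai`, and so on); the real snapshot structure is not described on this page.

```python
from typing import Any, Dict

# Hypothetical keys for the public-safe tiles; everything else is private.
PUBLIC_TILES = frozenset(
    {"website", "resume_ai", "tunnel_edge", "overall", "last_incident_window"}
)


def public_view(snapshot: Dict[str, Any]) -> Dict[str, Any]:
    """Return only the allowlisted tiles from a full internal snapshot."""
    return {key: value for key, value in snapshot.items() if key in PUBLIC_TILES}
```

An allowlist fails closed: a field that nobody has classified yet is withheld by default, whereas a denylist would publish it.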
Near-term build order
1. Public monitoring summary
2. Internal Grafana map for host and VM health
3. Candidate-lab telemetry and reset status
4. ERP health and job/error panels
Design rule
- Public pages explain status and operational discipline.
- Private dashboards expose the details needed to fix issues.
- The site should communicate control without leaking internals.