What metrics are available
Jarvis exposes metrics across three categories:

Node health

- CPU, memory, and disk usage per node
- GPU utilization and VRAM consumption on inference nodes
- Network throughput between nodes
- Container uptime and restart counts

Model performance

- Requests per second per model
- Average and p95 inference latency
- Token throughput (tokens/sec)
- Error rates and timeout counts

Agent activity

- Tasks submitted, in-progress, and completed
- Task duration and delegation depth
- Model usage per agent
- Failed task rate
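Most of these are plain counters; the latency metric uses percentiles. As a reference for what p95 means, here is a nearest-rank sketch over raw latency samples (illustrative only — Jarvis computes these metrics for you):

```shell
# Nearest-rank 95th percentile of newline-separated numeric samples on stdin.
p95() {
  sort -n | awk '{ a[NR] = $1 } END { print a[int(NR * 0.95 + 0.999)] }'
}

# Example: seq 1 100 | p95   # prints 95
```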
Check system health
The fastest way to get a health snapshot is the LiteLLM gateway's `/health` endpoint.

Health indicator checklist

Run through this checklist when you want to confirm your deployment is healthy:

- All nodes respond to `ping your-node-hostname`
- `docker ps` on each node shows all expected containers in the `Up` state
- The `/health` endpoint on the LiteLLM gateway returns `200 OK`
- At least one model returns a successful inference via `/models`
- GPU nodes show non-zero GPU utilization when a model is loaded
- n8n is accessible and workflows show recent successful runs
- No containers show restart counts above zero (check with `docker inspect --format '{{.RestartCount}}' <container>`)
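The container-state item in the checklist can be scripted. A minimal sketch, assuming container states are listed with `docker ps --format '{{.Names}}\t{{.Status}}'` (a standard Docker format string):

```shell
# Fails unless every listed container is in an "Up" state.
# Input: lines of "name<TAB>status", e.g. from:
#   docker ps --format '{{.Names}}\t{{.Status}}'
all_up() {
  echo "$1" | awk -F'\t' '$2 !~ /^Up/ { bad = 1 } END { exit bad }'
}

# Example:
#   all_up "$(docker ps --format '{{.Names}}\t{{.Status}}')" && echo "all containers up"
```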
Alerting and notifications via n8n
Jarvis uses n8n to run monitoring workflows that check node and service health on a schedule and send alerts when something is wrong.

Find the monitoring workflows
Look for workflows with names like “Node Health Check”, “Model Latency Monitor”, or “Agent Error Alert”. These run on a cron schedule and check the health endpoints described above.
Configure notification channels
Each monitoring workflow ends with a notification node. Edit it to set your preferred destination, such as a Slack webhook, a Telegram bot, or an email address. Replace the placeholder credentials in the node with your own, then save the workflow.
You can create custom monitoring workflows in n8n by combining the HTTP Request node (to poll health endpoints) with any notification node. See the n8n integration guide for more detail.
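As an illustration of what the HTTP Request + notification combination does, here is the same alert expressed outside n8n. The Slack webhook URL and the message format are placeholder assumptions, not part of Jarvis:

```shell
# Builds a Slack-style incoming-webhook payload for a failed health check.
alert_payload() {
  # $1 = service name, $2 = observed HTTP status
  printf '{"text": "Health check failed for %s (HTTP %s)"}' "$1" "$2"
}

# Example (SLACK_WEBHOOK_URL is a placeholder for your own webhook):
#   curl -s -X POST -H 'Content-Type: application/json' \
#     -d "$(alert_payload litellm 503)" "$SLACK_WEBHOOK_URL"
```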
Common issues and how to address them
A node stops responding

If a node doesn't respond to ping or SSH:

- Check whether the machine is powered on
- Verify it's connected to your LAN
- If it's reachable but SSH is down, try rebooting via your hypervisor or management interface
- Once you're in, run `docker ps` to check whether services restarted automatically
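When ping works but SSH doesn't, a quick way to tell whether sshd itself is down (versus the whole host) is to probe the SSH port directly. A sketch using bash's `/dev/tcp` pseudo-device; the hostname is a placeholder:

```shell
# Succeeds if a TCP connection to host:port opens within 2 seconds.
# Requires bash and coreutils `timeout`.
port_open() {
  timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Example: port_open your-node-hostname 22 && echo "sshd reachable"
```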
Model inference is slow or timing out

High latency usually points to resource contention:

- SSH into the GPU node and run `nvidia-smi`; check VRAM usage and GPU utilization
- If VRAM is saturated, you may have too many models loaded simultaneously. Unload unused models via Ollama
- Check CPU usage with `docker stats`; a non-GPU node doing inference will be slow
- Review LiteLLM logs for timeout errors: `docker logs litellm --tail 50`
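The `nvidia-smi` check above can be automated with its machine-readable query mode. A sketch that flags saturated VRAM; the 90% threshold is an arbitrary example:

```shell
# Exits 0 if any GPU's VRAM usage meets or exceeds the threshold percentage.
# $1 = CSV lines of "used, total" in MiB, as produced by:
#   nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits
# $2 = threshold percent
vram_saturated() {
  echo "$1" | awk -F', ' -v t="$2" '$1 / $2 * 100 >= t { sat = 1 } END { exit !sat }'
}

# Example:
#   vram_saturated "$(nvidia-smi --query-gpu=memory.used,memory.total \
#     --format=csv,noheader,nounits)" 90 && echo "VRAM saturated: unload a model"
```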
An agent task fails or hangs

For failed or hanging tasks:

- Check the agent container logs: `docker logs paperclip --tail 50` or `docker logs hermes --tail 50`
- Confirm the model the agent is using is responding; test it directly via the `/models` endpoint
- If the task is hanging, restart the agent container: `docker restart paperclip`
- For persistent failures, check whether the MCP tool server is reachable if the task involves tool use
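To triage quickly, you can count error lines in the agent logs rather than reading them all. A minimal sketch; the error/timeout pattern is an assumption, so match it to your agents' actual log format:

```shell
# Prints the number of log lines mentioning "error" or "timeout" (case-insensitive).
errors_in_log() {
  grep -ciE 'error|timeout' <<<"$1"
}

# Example (docker logs writes to stderr, hence 2>&1):
#   errors_in_log "$(docker logs paperclip --tail 50 2>&1)"
```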
Disk space is running low
Docker images, model weights, and logs accumulate over time. To reclaim space, prune unused Docker resources (images, stopped containers, dangling volumes) and remove Ollama models you no longer serve. Note that model weights stored by Ollama live outside Docker volumes; check /usr/share/ollama or your configured Ollama data directory.
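A sketch of the cleanup, plus a helper for watching usage. The prune commands delete data, so review what they will remove first; `docker system prune` and `ollama rm` are standard CLI commands, and `disk_used_pct` assumes GNU `df`:

```shell
# Used-space percentage for the filesystem containing the given path (GNU df).
disk_used_pct() {
  df --output=pcent "$1" | tail -n 1 | tr -dc '0-9'
}

# Typical cleanup steps (destructive -- run deliberately):
#   docker system prune -a   # unused images, stopped containers, networks
#   docker volume prune      # dangling volumes
#   ollama list              # model weights currently on disk
#   ollama rm <model>        # delete a model you no longer serve
```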
n8n workflows are failing
If monitoring or automation workflows stop running:
- Open n8n and check the Executions tab for error messages
- Confirm the n8n container is running: `docker ps | grep n8n`
- Re-enter any expired credentials (API keys, webhook tokens) in the Credentials section
- For webhook-triggered workflows, verify the webhook URL is still reachable from the triggering service
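For the webhook reachability check, a curl probe is enough. A sketch; the URL is a placeholder for whatever your workflow's trigger exposes:

```shell
# Succeeds if the URL answers with any HTTP status (curl reports 000 when no
# connection could be made).
webhook_ok() {
  [ "$(curl -s -o /dev/null -w '%{http_code}' -m 5 "$1")" != "000" ]
}

# Example: webhook_ok "https://your-n8n-host/webhook/abc123" || echo "unreachable"
```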
Next steps
- n8n workflows: Build custom monitoring and alerting workflows with n8n.
- Docker operations: Restart services and inspect logs when health checks fail.
- Mesh nodes: Understand the role of each node and what to expect from each.
- API reference: Query health and metrics programmatically via the REST API.