Monitoring & Alerting
The relayer exposes metrics on the port number specified by the `--metrics` argument. Port 9090 is the default, though any valid port can be chosen.
We recommend using Prometheus to scrape these metrics and Grafana to visualize them with this dashboard JSON template, as seen below. The benefit of this dashboard is that it displays multiple validators at the same time. The screenshot shows metrics grouped by chain or by Kubernetes pod.
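If you are setting Prometheus up from scratch, a minimal scrape configuration might look like the sketch below. The job name and target address are placeholders; point the target at the host and port you configured via `--metrics` (9090 by default).

```yaml
# prometheus.yml (sketch) -- job name and target are placeholders.
scrape_configs:
  - job_name: "hyperlane-relayer"
    scrape_interval: 15s
    static_configs:
      # Replace with the host/port exposed by the relayer's --metrics argument.
      - targets: ["relayer-host:9090"]
```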
If running as a Docker image, make sure to port-forward the metrics endpoint port. For instance, to expose the container's metrics port 9090 on local port 80, add the following flag to your `docker run` command: `-p 80:9090`
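If you run the relayer with docker-compose rather than a raw `docker run`, the equivalent port mapping is sketched below. The image name and command are placeholders for whatever you already use to start the relayer; only the `ports` mapping is the point of the example.

```yaml
# docker-compose.yml (sketch) -- image and command are placeholders.
services:
  relayer:
    image: your-relayer-image            # placeholder
    command: ["./relayer", "--metrics", "9090"]   # placeholder invocation
    ports:
      - "80:9090"   # host port 80 -> container metrics port 9090
```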
Metrics
The dashboard template includes the following metrics.
| Metric | Description |
| --- | --- |
| `hyperlane_messages_processed_count` | The messages processed by the relayer, by origin and destination chain ("Messages Processed"). |
| `hyperlane_submitter_queue_length{queue_name="prepare_queue"}` | Length of the submitter's prepare queue ("Prepare Queues"). Each destination chain has its own submitter, and all pending / retriable messages end up in the prepare queue, waiting to be (re)submitted. Growth in this queue can mean that the volume of messages exceeds the relayer's processing throughput, or that submission is failing for various reasons (bad RPCs, low balance). |
| `hyperlane_submitter_queue_length{queue_name="submit_queue"}` | Length of the submitter's submit queue ("Submit Queues"). These are messages that have been prepared but not yet submitted. |
| `hyperlane_submitter_queue_length{queue_name="confirm_queue"}` | Length of the submitter's confirm queue ("Confirm Queues"). These are messages that have been submitted but not yet confirmed (finalized). |
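For reference, the kind of PromQL behind these panels looks roughly like the sketch below. The grouping label names (`origin`, `remote`) are assumptions; check the relayer's `/metrics` output for the exact labels your version exposes.

```promql
# Per-route throughput, roughly what a "Messages Processed" panel would plot.
# Label names are assumptions -- verify against the /metrics output.
sum by (origin, remote) (rate(hyperlane_messages_processed_count[5m]))

# Current prepare-queue depth per destination chain.
sum by (remote) (hyperlane_submitter_queue_length{queue_name="prepare_queue"})
```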
Other relevant metrics include:
| Metric | Description |
| --- | --- |
| `hyperlane_request_count` | Records the number of successful and failed RPCs, with the following labels: `chain`, `status` (failure or success), `provider_node` (the base RPC URL responsible), and the RPC `method`. Present for all agents, but currently only supported on EVM chains. |
| `hyperlane_wallet_balance{agent="relayer"}` | The relayer's balance on every chain, expressed in the lowest denomination unit. |
| `hyperlane_critical_error` | Boolean marker for critical errors on a chain, signalling loss of liveness. |
| `hyperlane_block_height` | The block height of the RPC node the agent is connected to. If this metric is not increasing, the RPC may be unhealthy and need replacing. |
| `hyperlane_span_events_total{agent="relayer", event_level="error"}` | The total number of errors logged. If this metric grows by more than 1 over an hour, at least a low-severity alert is warranted (see the example query after this table). Note that the dashboard query groups metrics by Kubernetes pod name, so you may need to adjust it if you are not running in a Kubernetes environment. |
| `hyperlane_span_events_total{agent="relayer", event_level="warn"}` | The total number of warnings logged. The same guidance as for the error-level metric applies. |
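As an example, the error-rate check above can be written as the following PromQL (a sketch; the `pod` label comes from Kubernetes service discovery, so drop or change the grouping if you run elsewhere):

```promql
# Errors logged over the last hour, grouped by pod as in the dashboard.
# Alert if this exceeds 1.
sum by (pod) (increase(hyperlane_span_events_total{agent="relayer", event_level="error"}[1h]))
```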
Alerts
The metrics above can be combined to create alerts that minimize false positives. Some example critical alerts:
For all agents:

- `hyperlane_block_height` for any chain has not increased in the last 15 minutes.
- The rate of `hyperlane_request_count{status="failure"}` over the last 10 minutes is > 60% of the rate of `hyperlane_request_count{status=~"success|failure"}`. Most agent issues are due to bad RPCs, and this is likely to catch these issues.
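As a sketch, these two conditions could be expressed as Prometheus alerting rules roughly as follows; the rule names, thresholds, and `for` durations are illustrative, not canonical.

```yaml
# alerts-agents.yml (sketch) -- names and thresholds are illustrative.
groups:
  - name: hyperlane-agents
    rules:
      - alert: BlockHeightStalled
        # Block height has not increased over the last 15 minutes for some chain.
        expr: delta(hyperlane_block_height[15m]) == 0
        for: 5m
        labels:
          severity: critical
      - alert: HighRpcFailureRate
        # More than 60% of RPCs failed over the last 10 minutes.
        expr: |
          sum by (chain) (rate(hyperlane_request_count{status="failure"}[10m]))
            / sum by (chain) (rate(hyperlane_request_count{status=~"success|failure"}[10m]))
            > 0.6
        for: 5m
        labels:
          severity: critical
```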
For relayers:

- `hyperlane_critical_error` is `1` for a chain, meaning the relayer has lost liveness there. Operations on other chains are not affected, but this is a high-severity alert, usually caused by unreliable RPCs for the affected chain.
- Using `hyperlane_submitter_queue_length{queue_name="prepare_queue"}`: the prepare queue length has been increasing, the confirm queue length has been zero, and the error/warning counts have been growing, all over the last 30 minutes. This can mean that the relayer has run out of balance and cannot pay for gas, that the destination chain's RPC URL is not working correctly, or simply that all new messages are unprocessable.
- `hyperlane_wallet_balance` has dropped below a certain threshold. For example, the current balance divided by the amount spent over the last 24 hours is less than 2, meaning the balance must be topped up within two days.
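A sketch of the first and last conditions as Prometheus alerting rules is shown below; the prepare-queue condition combines several signals and is easier to assemble from the dashboard's own queries. Rule names, thresholds, and durations are illustrative assumptions.

```yaml
# alerts-relayer.yml (sketch) -- names and thresholds are illustrative.
groups:
  - name: hyperlane-relayer
    rules:
      - alert: RelayerCriticalError
        # The relayer has lost liveness on some chain.
        expr: hyperlane_critical_error == 1
        for: 5m
        labels:
          severity: critical
      - alert: RelayerBalanceRunway
        # Less than ~2 days of runway: balance / amount spent over the last 24h < 2.
        # clamp_min avoids dividing by zero (or by a negative spend after a top-up).
        expr: |
          hyperlane_wallet_balance{agent="relayer"}
            / clamp_min(
                hyperlane_wallet_balance{agent="relayer"} offset 24h
                  - hyperlane_wallet_balance{agent="relayer"},
                1)
            < 2
        for: 30m
        labels:
          severity: critical
```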
If you get alerted, always check the logs to see what the problem could be.