Monitoring & Alerting
The relayer exposes metrics on the port number specified by the `--metrics` argument. Port 9090 is the default, though any valid port can be chosen.
We recommend using Prometheus to scrape these metrics and Grafana to visualize them with this dashboard JSON template, as seen below. The benefit of this dashboard is that it displays multiple validators at the same time. The screenshot shows metrics grouped by chain or by Kubernetes pod.
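If you are setting Prometheus up from scratch, a minimal scrape configuration might look like the sketch below. The job name and target address are placeholders; point the target at the host and port you configured via `--metrics` (9090 by default).

```yaml
# prometheus.yml (sketch) -- job name and target are placeholders.
scrape_configs:
  - job_name: "hyperlane-relayer"
    scrape_interval: 15s
    static_configs:
      # Replace with the host/port exposed by the relayer's --metrics argument.
      - targets: ["relayer-host:9090"]
```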
If running as a Docker image, make sure to port-forward the metrics endpoint port. For instance, to expose the container's metrics port 9090 on local port 80, add the following flag to your `docker run` command: `-p 80:9090`
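If you run the relayer with docker-compose rather than a raw `docker run`, the equivalent port mapping is sketched below. The image name and command are placeholders for whatever you already use to start the relayer; only the `ports` mapping is the point of the example.

```yaml
# docker-compose.yml (sketch) -- image and command are placeholders.
services:
  relayer:
    image: your-relayer-image            # placeholder
    command: ["./relayer", "--metrics", "9090"]   # placeholder invocation
    ports:
      - "80:9090"   # host port 80 -> container metrics port 9090
```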
Metrics
The dashboard template includes the following metrics.
| Metric | Description |
| --- | --- |
| `hyperlane_messages_processed_count` | The messages processed by the relayer, by origin and destination chain ("Messages Processed"). |
| `hyperlane_submitter_queue_length{queue_name="prepare_queue"}` | Length of the submitter's prepare queue ("Prepare Queues"). Each destination chain has its own submitter, and all pending / retriable messages end up in the prepare queue, waiting to be (re)submitted. Growth in this queue can mean that the volume of messages exceeds the relayer's processing throughput, or that submission is failing for various reasons (bad RPCs, low balance). |
| `hyperlane_submitter_queue_length{queue_name="submit_queue"}` | Length of the submitter's submit queue ("Submit Queues"). These are messages that have been prepared but not yet submitted. |
| `hyperlane_submitter_queue_length{queue_name="confirm_queue"}` | Length of the submitter's confirm queue ("Confirm Queues"). These are messages that have been submitted but not yet confirmed (finalized). |
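For reference, the kind of PromQL behind these panels looks roughly like the sketch below. The grouping label names (`origin`, `remote`) are assumptions; check the relayer's `/metrics` output for the exact labels your version exposes.

```promql
# Per-route throughput, roughly what a "Messages Processed" panel would plot.
# Label names are assumptions -- verify against the /metrics output.
sum by (origin, remote) (rate(hyperlane_messages_processed_count[5m]))

# Current prepare-queue depth per destination chain.
sum by (remote) (hyperlane_submitter_queue_length{queue_name="prepare_queue"})
```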
Other relevant metrics include:
| Metric | Description |
| --- | --- |
| `hyperlane_request_count` | Records the number of successful and failed RPCs, with the following labels: `chain`, `status` (failure or success), `provider_node` (the base RPC URL responsible), and the RPC `method`. Present for all agents, but currently only supported on EVM chains. |
| `hyperlane_wallet_balance{agent="relayer"}` | The relayer's balance on every chain, expressed in the lowest denomination unit. |
| `hyperlane_critical_error` | Boolean marker for critical errors on a chain, signalling loss of liveness. |
| `hyperlane_block_height` | The block height of the RPC node the agent is connected to. If this metric is not increasing, the RPC may be unhealthy and need replacing. |
| `hyperlane_span_events_total{agent="relayer", event_level="error"}` | The total number of errors logged. If this metric grows by more than 1 over an hour, at least a low-severity alert is warranted (see the example query after this table). Note that the dashboard query groups metrics by Kubernetes pod name, so you may need to adjust it if you are not running in a Kubernetes environment. |
| `hyperlane_span_events_total{agent="relayer", event_level="warn"}` | The total number of warnings logged. The same guidance as for the error-level metric applies. |
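As an example, the error-rate check above can be written as the following PromQL (a sketch; the `pod` label comes from Kubernetes service discovery, so drop or change the grouping if you run elsewhere):

```promql
# Errors logged over the last hour, grouped by pod as in the dashboard.
# Alert if this exceeds 1.
sum by (pod) (increase(hyperlane_span_events_total{agent="relayer", event_level="error"}[1h]))
```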
Alerts
The metrics above can be combined to create alerts that minimize false positives. Some example critical alerts:
For all agents:

- `hyperlane_block_height` for any chain has not increased in the last 15 minutes.
- The rate of `hyperlane_request_count{status="failure"}` over the last 10 minutes is > 60% of the rate of `hyperlane_request_count{status=~"success|failure"}`. Most agent issues are due to bad RPCs, and this is likely to catch these issues.
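As a sketch, these two conditions could be expressed as Prometheus alerting rules roughly as follows; the rule names, thresholds, and `for` durations are illustrative, not canonical.

```yaml
# alerts-agents.yml (sketch) -- names and thresholds are illustrative.
groups:
  - name: hyperlane-agents
    rules:
      - alert: BlockHeightStalled
        # Block height has not increased over the last 15 minutes for some chain.
        expr: delta(hyperlane_block_height[15m]) == 0
        for: 5m
        labels:
          severity: critical
      - alert: HighRpcFailureRate
        # More than 60% of RPCs failed over the last 10 minutes.
        expr: |
          sum by (chain) (rate(hyperlane_request_count{status="failure"}[10m]))
            / sum by (chain) (rate(hyperlane_request_count{status=~"success|failure"}[10m]))
            > 0.6
        for: 5m
        labels:
          severity: critical
```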
For relayers:

- `hyperlane_critical_error` is `1` for a chain, meaning the relayer has lost liveness there. Operations on other chains are not affected, but this is a high-severity alert, usually caused by unreliable RPCs for the affected chain.
- Using `hyperlane_submitter_queue_length{queue_name="prepare_queue"}`: the prepare queue length has been increasing, the confirm queue length has been zero, and the error/warning counts have been growing, all over the last 30 minutes. This can mean that the relayer has run out of balance and cannot pay for gas, that the destination chain's RPC URL is not working correctly, or simply that all new messages are unprocessable.
- `hyperlane_wallet_balance` has dropped below a certain threshold. For example, the current balance divided by the amount spent over the last 24 hours is less than 2, meaning the balance must be topped up within two days.
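A sketch of the first and last conditions as Prometheus alerting rules is shown below; the prepare-queue condition combines several signals and is easier to assemble from the dashboard's own queries. Rule names, thresholds, and durations are illustrative assumptions.

```yaml
# alerts-relayer.yml (sketch) -- names and thresholds are illustrative.
groups:
  - name: hyperlane-relayer
    rules:
      - alert: RelayerCriticalError
        # The relayer has lost liveness on some chain.
        expr: hyperlane_critical_error == 1
        for: 5m
        labels:
          severity: critical
      - alert: RelayerBalanceRunway
        # Less than ~2 days of runway: balance / amount spent over the last 24h < 2.
        # clamp_min avoids dividing by zero (or by a negative spend after a top-up).
        expr: |
          hyperlane_wallet_balance{agent="relayer"}
            / clamp_min(
                hyperlane_wallet_balance{agent="relayer"} offset 24h
                  - hyperlane_wallet_balance{agent="relayer"},
                1)
            < 2
        for: 30m
        labels:
          severity: critical
```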
If you get alerted, always check the logs to see what the problem could be.