Interweave Exchange Ideas Portal

Status Needs review
Created by Nick White
Created on Oct 24, 2024

Health check repeats on failure

Currently, if a provider fails a health check, it is set to 'offline', which causes errors with PIX registration and other processes.

Regular health checks are carried out, but they can be interrupted by transient issues and end up knocking the system offline, potentially for multiple hours, until somebody manually runs a health check to register the system as online. If a manual health check isn't performed against the host, one won't run automatically until the early hours of the morning (around 2 to 3 am).


My suggestions:

  • A system going offline unexpectedly should count as a P1. This should be raised automatically so that the service desk can investigate.

  • Systems should not be marked as offline after a single failed health check. There should be some leeway for temporary outages, potentially a couple of hours of failed health checks. It would still be good to have a notification on failure so that it can be investigated.

  • When an error message is returned indicating that the system is offline, the health check should be re-run automatically, as this indicates that the offline system is attempting to re-establish its connection.

  • On health check failure, the health check should re-run for the failed system at more frequent intervals, potentially with increasing delays as with the message retries, e.g. after 5 minutes, 10 minutes, 20 minutes, 40 minutes, 1 hour, etc., unless the system is confirmed to be offline.
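
To make the leeway and retry suggestions above concrete, here is a rough sketch in Python. It is not based on how the product is implemented. The names `retry_delays` and `ProviderHealth`, the two-hour grace period, and the choice to repeat hourly once the one-hour cap is reached are all assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterator, Optional


def retry_delays(first: timedelta = timedelta(minutes=5),
                 cap: timedelta = timedelta(hours=1)) -> Iterator[timedelta]:
    """Yield doubling delays (5, 10, 20, 40 minutes), then the cap repeatedly."""
    delay = first
    while True:
        yield min(delay, cap)
        delay = min(delay * 2, cap)


@dataclass
class ProviderHealth:
    """Hypothetical per-provider state; not an existing product class."""
    grace: timedelta = timedelta(hours=2)  # assumed leeway before marking offline
    first_failure: Optional[datetime] = None
    status: str = "online"

    def record(self, ok: bool, now: datetime) -> bool:
        """Record one health check result.

        Returns True when a failure notification should be sent.
        """
        if ok:
            # Any successful check clears the failure window.
            self.first_failure = None
            self.status = "online"
            return False
        if self.first_failure is None:
            self.first_failure = now
        # Only mark offline once failures have persisted for the grace period.
        if now - self.first_failure >= self.grace:
            self.status = "offline"
        return True
```

With this shape, every failed check still produces a notification, but the provider only changes to 'offline' once checks have been failing for the whole grace period, and a scheduler could take the wait before each re-check from `retry_delays()`.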

It is also worth noting that it reportedly takes two health checks to fully restore the connection, with the first returning the error "Data source 'X' is not a registered data provider". This should be fixed so that only one health check is needed.
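
As an illustration of the automatic re-check idea, the sketch below spots the error text quoted above and triggers a health check for the named data source. Everything other than the error wording is assumed: `provider_needing_recheck`, `handle_error`, and the `run_health_check` callback are hypothetical names. The default of two attempts only mirrors the reported behaviour as a stopgap and would not be needed once a single check is enough.

```python
import re
from typing import Callable, Optional

# Matches the error text quoted in this idea; the exact wording in the
# product may differ.
NOT_REGISTERED = re.compile(
    r"Data source '(?P<source>[^']+)' is not a registered data provider"
)


def provider_needing_recheck(error_message: str) -> Optional[str]:
    """Return the data source named in the error, or None if it does not match."""
    match = NOT_REGISTERED.search(error_message)
    return match.group("source") if match else None


def handle_error(error_message: str,
                 run_health_check: Callable[[str], bool],
                 max_attempts: int = 2) -> bool:
    """Re-run the health check when the error suggests a provider is back.

    run_health_check is a hypothetical callback that returns True when the
    provider is registered as online again.
    """
    source = provider_needing_recheck(error_message)
    if source is None:
        return False
    for _ in range(max_attempts):
        if run_health_check(source):
            return True
    return False
```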
