Health check repeats on failure

Currently, if a provider fails a health check it is set to 'offline', which causes errors with PIX registration and other processes.

Regular health checks can be interrupted by transient issues, which knocks the system offline, potentially for several hours, until somebody manually runs a health check to register it as online again. If no manual health check is run against the host, the next automatic one does not happen until the early hours of the morning (around 2 to 3 am).

My suggestions:

A system going offline unexpectedly should count as a P1. This should be raised automatically so that the service desk can investigate.
Systems should not be marked as offline after a single failed health check. There should be some leeway for temporary outages, potentially a couple of hours of failed health checks. It would still be good to have a notification on failure so it can be investigated.
When an error message is returned that indicates the system is offline, the health check should be redone automatically, as this indicates the offline system is attempting to re-establish its connection.
After a failed health check, the check should be re-run for the failed system at more frequent intervals, potentially with increasing delays as with the message retries (e.g. after 5 minutes, 10 minutes, 20 minutes, 40 minutes, 1 hour and so on), unless the system is confirmed to be offline.
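
To make the last two suggestions more concrete, here is a rough sketch in Python of how the re-check schedule and the leeway could fit together. It is only an illustration, not how the product works today: the function names, the 'degraded' status and the 2 hour grace period are my own assumptions, and I have read the 5, 10, 20, 40, 60 minute figures as the gaps between checks.

```python
import itertools
from datetime import timedelta


def retry_delays(first=timedelta(minutes=5), cap=timedelta(hours=1)):
    """Yield the wait before each re-check: doubles each time, capped at `cap`."""
    delay = first
    while True:
        yield delay
        delay = min(delay * 2, cap)


def status_after(failing_for, grace=timedelta(hours=2)):
    """Hypothetical rule: only mark 'offline' once failures span the grace period."""
    return "offline" if failing_for >= grace else "degraded"


def recheck_schedule(grace=timedelta(hours=2)):
    """Return [(elapsed, status)] for a provider whose re-checks all fail."""
    elapsed = timedelta(0)
    schedule = []
    for delay in retry_delays():
        elapsed += delay
        status = status_after(elapsed, grace)
        schedule.append((elapsed, status))
        if status == "offline":
            break  # grace period used up, stop the fast re-checks
    return schedule


# First six waits between re-checks, in minutes.
first_six = [int(d.total_seconds() // 60)
             for d in itertools.islice(retry_delays(), 6)]
```

With these assumed values a provider that keeps failing would be re-checked 5, 15, 35, 75 and 135 minutes after the first failure, and would only be set to 'offline' at the last of those. A successful re-check at any point would simply put it back to online without waiting for the overnight run.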

It is also worth noting that, reportedly, it normally takes two health checks to fully restore the connection, with the first returning the error "Data source 'X' is not a registered data provider". This should be fixed so that only one health check is needed.
