devz-docz

Aggregation of onboarding and general devz standards that I have gathered over my career.


Healthchecks

Overview

In modern API services, it is common to build and rely upon healthcheck endpoints. These endpoints are generally orthogonal to the business logic of the service and are intended to be consumed by the operators of the system (e.g. InfraSec practitioners).

There are many ways that Healthcheck endpoints can be used in a system:

Many apps will try to achieve their subset of the above goals via a single endpoint, but as apps grow in sophistication it is not unheard of to tease those responsibilities out to separate endpoints.

Security

Healthcheck endpoints frequently require no authentication or authorization, meaning anyone who can reach the service over the network can access them.

Leaving healthcheck endpoints unauthenticated implies several other considerations:

Reliability

Particular care should be taken when using healthchecks as part of the reliability story of an app.

Prof. von Neumann is credited with the concept of “synthesis of reliable organisms from unreliable components”. The math works out as follows: if my app is serially dependent on 3 other services that are each 97% reliable, and their failures are independent, my app can be no more than ~91% (0.97 x 0.97 x 0.97) reliable. This means that if an app’s healthcheck is set up to fail whenever any of its dependencies has failed, the healthcheck significantly lowers the upper bound of the app’s reliability.
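To make the arithmetic concrete, here is a minimal sketch of the serial case. The function name `serial_reliability` is made up for illustration, and the calculation assumes the dependencies fail independently of one another.

```python
from math import prod


def serial_reliability(dependency_reliabilities):
    """Upper bound on reliability when every dependency must be up at once.

    Assumes failures are independent, so the probabilities multiply.
    """
    return prod(dependency_reliabilities)


# Three serial dependencies, each 97% reliable.
bound = serial_reliability([0.97, 0.97, 0.97])
print(f"{bound:.4f}")  # 0.9127
```

Each additional serial dependency multiplies the bound by another factor below 1, so a healthcheck that fails on any dependency failure gets less reliable with every dependency added.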

Sometimes an app’s reliability really is limited by a dependency, e.g. a database connection. If every single endpoint and functionality in an app requires interaction with the app’s database, then it is reasonable to call your app fully inoperable when the database connection is not functioning.
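For an app that truly cannot function without its database, a healthcheck along these lines may be reasonable. This is only a sketch: `database_healthcheck` is a hypothetical helper, SQLite stands in for whatever database the app actually uses, and the status codes and response body are assumptions rather than a standard.

```python
import sqlite3


def database_healthcheck(conn):
    """Return (http_status, body) based on a trivial round trip to the database."""
    try:
        conn.execute("SELECT 1").fetchone()
    except sqlite3.Error:
        # Keep the body terse: the endpoint may be reachable without authentication.
        return 503, {"status": "unavailable"}
    return 200, {"status": "ok"}


conn = sqlite3.connect(":memory:")
print(database_healthcheck(conn))  # (200, {'status': 'ok'})

conn.close()  # Simulate a lost connection.
print(database_healthcheck(conn))  # (503, {'status': 'unavailable'})
```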

However, if only a subset of the app’s functionality is impacted by an unreliable dependency, it may be better to handle that instability at runtime rather than declaring the whole app down. Consider the following example: if only one (out of many) of an app’s functionalities depends on sending email, the app should probably not be considered wholly down when the email service is unreachable. It would be better to handle errors from the email service gracefully, or possibly even to use a technique like feature flagging to turn off any attempts to use that functionality until the email service is available again.
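The email example could be handled at runtime roughly as follows. Everything named here (`FEATURE_FLAGS`, `EmailServiceError`, `register_user`, and the `send_email` callable) is hypothetical and exists only to illustrate the idea; a real app would likely read its flags from configuration or a flag service.

```python
import logging

logger = logging.getLogger("signup")

# Hypothetical in-process feature flags.
FEATURE_FLAGS = {"send_welcome_email": True}


class EmailServiceError(Exception):
    """Hypothetical error raised when the email service is unreachable."""


def register_user(username, send_email):
    """Create the user, treating the welcome email as best-effort."""
    result = {"user": username, "created": True, "email_sent": False}
    if not FEATURE_FLAGS["send_welcome_email"]:
        return result  # Flag is off: do not touch the email service at all.
    try:
        send_email(username)
        result["email_sent"] = True
    except EmailServiceError:
        # Degrade gracefully: the core operation still succeeds.
        logger.warning("welcome email failed for %s", username)
    return result


def broken_sender(username):
    """Stand-in for an email client whose backing service is down."""
    raise EmailServiceError("email service unreachable")


print(register_user("ada", broken_sender))
# {'user': 'ada', 'created': True, 'email_sent': False}
```

With this shape, an email outage costs one feature, and operators can set the flag to `False` to stop the failing calls entirely while the rest of the app keeps serving requests.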

Healthchecks in Deployment Pipelines

It is not uncommon to see teams put an app’s healthcheck endpoint on the critical path of a deployment pipeline; however, this should be evaluated for unintended consequences.

In modern software engineering, automated deployments are a critical feature of a software system, facilitating the ability to safely and incrementally improve the quality of a running app.

Again, consider the app example from above that uses an email sending service for one feature: should the external service’s instability be able to jeopardize the ability to roll out new code? Perhaps the new code is useful for diagnosing the issue, or is intended to temporarily turn off the offending email sending; in that case, blocking the rollout makes remediation of the issue harder rather than safer.
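One possible way to keep a pipeline gate from being held hostage by a non-essential dependency is to have it distinguish between kinds of failure. The sketch below assumes the pipeline can obtain a per-dependency health report as a `{name: is_healthy}` mapping; `CRITICAL_DEPENDENCIES` and `rollout_decision` are hypothetical names, and which dependencies count as critical (or whether any should block a deploy at all) is a decision for each team.

```python
# Hypothetical classification: which dependencies are allowed to block a deploy.
CRITICAL_DEPENDENCIES = {"database"}


def rollout_decision(dependency_status):
    """Decide whether a rollout may proceed from a {name: is_healthy} report.

    Only unhealthy critical dependencies block; others are surfaced as warnings.
    """
    failing = {name for name, ok in dependency_status.items() if not ok}
    blockers = sorted(failing & CRITICAL_DEPENDENCIES)
    warnings = sorted(failing - CRITICAL_DEPENDENCIES)
    return {"proceed": not blockers, "blockers": blockers, "warnings": warnings}


# The email service is down, but the rollout is still allowed.
print(rollout_decision({"database": True, "email": False}))
# {'proceed': True, 'blockers': [], 'warnings': ['email']}
```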
