The 4 Golden Signals: An Essential Approach to System Monitoring

Companies today run complex systems and are expected to give their customers a seamless user experience, so IT managers need a clear and accurate view of the health of the systems they are responsible for. The four golden signals are an important approach to this kind of system monitoring. The four golden signals are latency, traffic, errors, and saturation. If you can measure only four metrics of your user-facing system, focus on these four.

Latency: The time it takes to service a request. It is important to distinguish between the latency of successful requests and the latency of failed requests. For example, an HTTP 500 error triggered by the loss of a connection to a critical backend resource such as a database might be served very quickly. Because that 500 is a failed request, folding it into the overall latency figure can make the numbers misleading. On the other hand, a slow failure is even worse than a fast one. It is therefore important to track error latency separately rather than simply filtering errors out.
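As a minimal sketch of tracking success and failure latency separately, the Python below groups request durations by outcome and reports the median of each group. The function name `latency_by_outcome`, the `(status_code, duration_ms)` record shape, and the rule that any 5xx status counts as a failure are assumptions made for illustration, not part of any particular monitoring tool.

```python
from collections import defaultdict
from statistics import median


def latency_by_outcome(requests):
    """Return the median latency (ms) for successful and failed requests.

    `requests` is an iterable of (status_code, duration_ms) pairs.
    Assumption for this sketch: any 5xx status is a failed request.
    """
    buckets = defaultdict(list)
    for status, duration_ms in requests:
        outcome = "failure" if status >= 500 else "success"
        buckets[outcome].append(duration_ms)
    # One figure per outcome, so fast failures cannot flatter the
    # success latency and slow failures are not hidden.
    return {outcome: median(values) for outcome, values in buckets.items()}


sample = [(200, 120), (200, 180), (500, 5), (200, 150), (500, 2400)]
medians = latency_by_outcome(sample)
print(medians)
```

In this sample one failure is very fast and one is very slow. Neither distorts the success figure, and the slow failure remains visible in its own bucket.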

Traffic: This is a measure of the demand placed on your system, typically measured by a high-level system-specific metric. For a web service, this measurement is often in HTTP requests per second, separated by request type (e.g., static content vs. dynamic content). For an audio streaming system, the measurement may focus on network input/output rate or concurrent sessions. For a key-value storage system, the measurement may be in transactions and retrievals per second.  
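For a web service, the per-type request rate described above could be computed roughly as follows. This is a sketch under stated assumptions: `requests_per_second` is a hypothetical helper, and the request-type labels ("static", "dynamic") are assumed to have been attached to each request upstream.

```python
from collections import Counter


def requests_per_second(events, window_seconds):
    """Return the average requests per second for each request type.

    `events` is an iterable of request-type labels (for example "static"
    or "dynamic") observed during a window of `window_seconds` seconds.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    counts = Counter(events)
    return {kind: count / window_seconds for kind, count in counts.items()}


# Hypothetical one-minute window: 600 static and 150 dynamic requests.
events = ["static"] * 600 + ["dynamic"] * 150
rates = requests_per_second(events, window_seconds=60)
print(rates)
```

Keeping the rates separated by type makes it easier to see which kind of demand is growing when total traffic rises.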

Errors: The rate of requests that fail, whether explicitly (e.g., HTTP 500s), implicitly (e.g., an HTTP 200 success response that carries the wrong content), or by policy (e.g., "if you committed to one-second response times, any request above one second is an error"). Monitoring these error rates helps identify weaknesses in the system and protects the quality of service offered to users.
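The three kinds of error can be folded into a single error rate. The sketch below is illustrative only: the `Response` record, its `content_ok` flag (standing in for whatever content validation a service performs), and the 1000 ms default threshold (taken from the one-second example above) are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Response:
    status: int
    duration_ms: float
    content_ok: bool  # outcome of a hypothetical content validation check


def classify(resp, slo_ms=1000):
    """Return the kind of error a response represents, or None if it is fine."""
    if resp.status >= 500:
        return "explicit"   # the server reported a failure
    if not resp.content_ok:
        return "implicit"   # success status, wrong content
    if resp.duration_ms > slo_ms:
        return "policy"     # correct, but slower than the committed time
    return None


def error_rate(responses, slo_ms=1000):
    """Return the fraction of responses that count as errors of any kind."""
    responses = list(responses)
    if not responses:
        return 0.0
    failed = sum(1 for r in responses if classify(r, slo_ms) is not None)
    return failed / len(responses)


sample = [
    Response(500, 40, True),     # explicit
    Response(200, 90, False),    # implicit
    Response(200, 1500, True),   # policy
    Response(200, 120, True),    # healthy
]
print(error_rate(sample))
```

Counting only status codes would report one error in four here. Including implicit and policy errors shows that three of the four requests actually failed the user.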

Saturation: How "full" your service is. It measures how much of the system's capacity is in use, emphasizing the most constrained resources (e.g., in a memory-constrained system, show memory; in an I/O-constrained system, show I/O). Many systems degrade in performance before they reach 100% utilization, so having a utilization target is essential. In complex systems, saturation can be supplemented with a higher-level load measurement: can your service properly handle double the traffic, handle only 10% more traffic, or handle even less traffic than it currently receives? For very simple services that have no parameters altering the complexity of a request (e.g., "Give me a nonce" or "I need a globally unique monotonic number") and that rarely change configuration, a static value from a load test may be adequate. Most services, however, need to rely on indirect signals such as CPU usage or network bandwidth that have a known upper limit. Increases in latency are often an early sign of saturation, so measuring your 99th percentile latency over a small window (e.g., one minute) can provide an early warning.
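A rough sketch of the two saturation checks mentioned above, a 99th percentile latency over a one-minute window and a utilization target, might look like this. The nearest-rank percentile method, the `(timestamp, latency_ms)` sample shape, and the 0.8 default target are assumptions chosen for illustration. A real target depends on where the specific system starts to degrade.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def p99_last_window(samples, now, window_seconds=60):
    """Return p99 latency over the most recent window, or None if no data.

    `samples` is an iterable of (timestamp_seconds, latency_ms) pairs.
    """
    recent = [
        latency for ts, latency in samples
        if now - window_seconds < ts <= now
    ]
    return percentile(recent, 99) if recent else None


def over_target(utilization, target=0.8):
    """True when utilization (0.0 to 1.0) has reached the assumed target."""
    return utilization >= target


# Hypothetical data: one sample per second, latency equal to the timestamp.
samples = [(t, float(t)) for t in range(1, 201)]
print(p99_last_window(samples, now=200))
print(over_target(0.85))
```

A short window keeps the percentile responsive, so a rising p99 can show up before the constrained resource is fully exhausted.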

These four golden signals are crucial for IT management because together they summarize how a user-facing system is behaving. Monitoring them allows IT managers to quickly identify issues, determine their severity, and make informed decisions to resolve them. With an overall view of system performance, IT managers can spot developing problems early, supporting business continuity and helping avoid service disruptions. Monitoring the four signals also highlights areas for improvement in the system, which can help enhance efficiency and scalability.

Transform your IT operation into a successful business with InfraOPS. With an impressive success rate, InfraOPS is the ideal choice for companies seeking professional and reliable solutions to monitor and operate their critical IT environments. Our highly skilled team provides customized solutions that guarantee system availability, data protection, and customer satisfaction. Join our satisfied clients and experience the difference InfraOPS can make for your business. Contact us today.