Test In Production — The Ideal Monitoring

Nagaraj Tantri
4 min read · Feb 12, 2023


Imagine a regression bug (wiki: a software bug where a feature that worked before stops working) in your production system! That is a huge nightmare, and the risk keeps rising as the codebase receives more and more contributions from engineers.

This blog offers a few pointers on monitoring your production system for regression bugs that can creep in while everyone is constantly making changes.

Growth of Software in Production
Photo by Emile Perron on Unsplash

Let us think of a design wherein a service grows into an ecosystem that does some heavy processing and stores data in a data lake, like this:

A sample service in a production system (something like an ETL)

Quick overview of the above diagram:

  • Data is ingested via the website or a set of IoT devices
  • We process the data by cleaning and applying some aggregations
  • Store the processed terabytes of data in our data lake (a minimal sketch of these stages follows this list)
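
To make the flow concrete, here is a minimal sketch of those three stages as plain Python functions. The names (`ingest`, `process`, `store`) and the count-per-device aggregation are hypothetical stand-ins for whatever your real pipeline does.

```python
# Hypothetical sketch of the three pipeline stages; names and types
# are illustrative, not a real implementation.

def ingest(raw_events: list[dict]) -> list[dict]:
    # Data arrives from the website or a set of IoT devices.
    return [e for e in raw_events if e]  # drop empty payloads

def process(events: list[dict]) -> dict:
    # Clean the records, then apply an aggregation (count per device).
    cleaned = [e for e in events if "device_id" in e]
    counts: dict = {}
    for e in cleaned:
        counts[e["device_id"]] = counts.get(e["device_id"], 0) + 1
    return counts

def store(aggregates: dict) -> None:
    # In production this writes terabytes to the data lake (e.g. GCS).
    print(f"writing {len(aggregates)} aggregate rows to the data lake")

store(process(ingest([{"device_id": "sensor-1"}, {}, {"device_id": "sensor-1"}])))
```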

Now imagine a couple of these scenarios:

  • Scenario 1 — We introduced a new version of the SDK that connects to our storage layer; in this case, the storage layer is GCS, which now enforces a configurable rate limit. Let us assume this rate limit kicks in only when you hit a million requests per minute. Unfortunately, your testing cannot account for this (see the backoff sketch after this list).
  • Scenario 2 — We introduced a regex inside our processing that starts throwing stack-overflow exceptions while parsing big strings. The service keeps getting restarted, and incoming requests can get rejected.
  • Scenario 3 — We introduced a new validation rule for data cleaning before we store the data, but the logic was defective, and it now reduces data accuracy.
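
Scenario 1 is the kind of failure a defensive write path can at least soften. Here is a minimal sketch of exponential backoff around a storage write, using plain `requests` against a hypothetical upload endpoint; the real GCS SDK ships its own retry configuration, so treat this purely as an illustration of the pattern.

```python
import time
import requests

UPLOAD_URL = "https://storage.example.com/upload"  # hypothetical endpoint

def upload_with_backoff(payload: bytes, max_retries: int = 5) -> None:
    """Retry on HTTP 429 (rate limited) with exponential backoff."""
    delay = 1.0
    for _ in range(max_retries):
        resp = requests.post(UPLOAD_URL, data=payload, timeout=30)
        if resp.status_code != 429:
            resp.raise_for_status()  # fail loudly on any other error
            return
        time.sleep(delay)  # rate limited: wait, then retry with doubled delay
        delay *= 2
    raise RuntimeError(f"still rate limited after {max_retries} attempts")
```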

In simple terms, the service breaks when integrated with other services or when the data size is massive. So the natural question that runs through our mind is: why not write more tests?

In a microservices world with tera/petabyte-scale storage requirements, it becomes difficult to replicate all those scenarios in a test suite and expect it to pass on every build.

What about the unit tests?

It’s not enough to write unit tests! You cannot unit-test every single method with every possible value, and imagine the ugly mocks you have to write to test service-to-service interactions (a sketch follows).
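
To see why those mocks turn ugly, here is a small sketch using `unittest.mock`; `fetch_user_events` and its client are hypothetical, but even this tiny service-to-service call needs every hop faked by hand.

```python
from unittest import mock

# Hypothetical code under test: calls a downstream "events" service.
def fetch_user_events(client, user_id: str) -> list:
    return client.get(f"/events?user={user_id}").json()["events"]

def test_fetch_user_events():
    # Every hop in the interaction has to be faked by hand.
    fake_response = mock.Mock()
    fake_response.json.return_value = {"events": [{"id": 1}]}
    fake_client = mock.Mock()
    fake_client.get.return_value = fake_response

    assert fetch_user_events(fake_client, "u-42") == [{"id": 1}]
    fake_client.get.assert_called_once_with("/events?user=u-42")

test_fetch_user_events()
```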

What about the integration tests?

You can write them, but then consider these:

  • The service you are integrating with can go down during your tests or respond slowly. That means a flaky build, and we cannot iterate and deploy quickly to the production system.
  • The test resources you create are exclusive to your team, so they cannot faithfully represent the real source data entering your system.
  • We cannot reliably reproduce, for every data pattern, a stack-overflow exception that keeps the service restarting.
  • Production data has a wide variety of values, like nulls and wildcard characters, and we cannot replicate terabytes/petabytes of it to test data integrity.

So, what can we do?

Run your tests in the production system!

Plan to run live tests on a schedule directly in your production system. These are separate from the automated test suites that run as part of your standard build via CI/CD: they should not be part of the CI/CD pipeline, and should run as a separate service within your production system or from outside it.

You need a scheduled test that notifies you of any anomaly in the service. The anomaly can surface when your service cannot connect to another service, when the service starts producing erroneous data records, or when user requests get rejected because the service keeps restarting.
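
As a minimal sketch, assuming a hypothetical synthetic-check endpoint in your production system, such a scheduled live test can be as simple as the loop below; in practice you would drive it from cron, a Kubernetes CronJob, or a scheduler service.

```python
import time
import requests

PROD_ENDPOINT = "https://api.example.com/synthetic-check"  # hypothetical
INTERVAL_SECONDS = 300  # run every five minutes

def run_live_test() -> bool:
    """Send a known synthetic request and validate the response."""
    resp = requests.post(PROD_ENDPOINT,
                         json={"probe": True, "device_id": "synthetic-1"},
                         timeout=10)
    if resp.status_code != 200:
        return False
    # Validate the record the pipeline produced for our synthetic input.
    return resp.json().get("status") == "processed"

while True:
    ok = run_live_test()
    print("live test", "passed" if ok else "FAILED")
    time.sleep(INTERVAL_SECONDS)
```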

The approach you can take is in these simple steps:

Time it!

We should ensure the full validation executes in a finite amount of time. There needs to be an SLA for the expected turnaround time for this test to complete.
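
One simple way to enforce that time budget, sketched with Python's standard `concurrent.futures`; `run_full_validation` is a hypothetical stand-in for your end-to-end check.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

SLA_SECONDS = 600  # hypothetical: the validation must finish in 10 minutes

def run_full_validation() -> bool:
    # Hypothetical stand-in for the end-to-end check against production.
    return True

with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(run_full_validation)
    try:
        passed = future.result(timeout=SLA_SECONDS)
    except TimeoutError:
        # Breaching the time budget is itself a failure worth alerting on.
        # (Note: the worker thread is not killed; we just stop waiting.)
        passed = False

print("validation passed:", passed)
```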

Alert it!

The service should warn you when results do not match your expected outcome. Get alerted even when the test merely takes longer than the desired time, because results arriving late can point to a problem too.
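
A sketch of wiring those warnings to a chat webhook; the URL, threshold, and helper names are placeholders for whatever your pager or chat tool provides.

```python
import requests

WEBHOOK_URL = "https://hooks.example.com/alerts"  # hypothetical webhook
SLA_SECONDS = 600

def alert(message: str) -> None:
    """Push a human-readable alert to the on-call channel."""
    requests.post(WEBHOOK_URL, json={"text": message}, timeout=10)

def report(passed: bool, took_seconds: float) -> None:
    # Alert on both failure modes: wrong results and late results.
    if not passed:
        alert("Live test failed: results did not match the expected outcome")
    elif took_seconds > SLA_SECONDS:
        alert(f"Live test passed but breached its SLA ({took_seconds:.0f}s)")

report(passed=True, took_seconds=721.0)  # example: would fire the SLA alert
```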

Monitor it!

Keep an eye on these tests in the production system. Sometimes these tests quietly stop running, or the scheduler that triggers them stops doing its job. So monitor the tests themselves regularly, as sketched below.
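
A common trick for catching the tests-silently-stopped case is a heartbeat (a dead man's switch): every test run touches a timestamp, and an independent watcher alerts when the timestamp goes stale. A minimal file-based sketch, with a hypothetical heartbeat location:

```python
import time
from pathlib import Path

HEARTBEAT_FILE = Path("/tmp/live-test.heartbeat")  # hypothetical location
MAX_STALENESS = 2 * 3600  # alert if no test has run for two hours

def beat() -> None:
    """Called at the end of every live-test run."""
    HEARTBEAT_FILE.write_text(str(time.time()))

def heartbeat_is_stale() -> bool:
    """Run by an independent watcher (e.g. cron on another host)."""
    if not HEARTBEAT_FILE.exists():
        return True
    last = float(HEARTBEAT_FILE.read_text())
    return time.time() - last > MAX_STALENESS

if heartbeat_is_stale():
    print("ALERT: the scheduled production tests have stopped running")
```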

Save it!

Keep the tests in a separate repository, and save the result of this monitoring by sharing the report via email every day or week. That builds a lot of confidence and provides visibility into the overall progress.
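
A minimal sketch of persisting each run and emailing a periodic digest, using only the standard library; the file path, SMTP host, and addresses are placeholders.

```python
import csv
import smtplib
from datetime import datetime, timezone
from email.message import EmailMessage
from pathlib import Path

RESULTS_FILE = Path("live_test_results.csv")  # hypothetical results log

def save_result(test_name: str, passed: bool, took_seconds: float) -> None:
    new = not RESULTS_FILE.exists()
    with RESULTS_FILE.open("a", newline="") as f:
        writer = csv.writer(f)
        if new:
            writer.writerow(["timestamp", "test", "passed", "seconds"])
        writer.writerow([datetime.now(timezone.utc).isoformat(),
                         test_name, passed, f"{took_seconds:.1f}"])

def email_report() -> None:
    msg = EmailMessage()
    msg["Subject"] = "Daily production live-test report"
    msg["From"] = "live-tests@example.com"      # placeholder sender
    msg["To"] = "team@example.com"              # placeholder recipients
    msg.set_content(RESULTS_FILE.read_text())
    with smtplib.SMTP("smtp.example.com") as smtp:  # placeholder host
        smtp.send_message(msg)
```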

Mechanisms I have used in the past

I will not attempt a detailed review of the best tools on the market. I can nevertheless point you towards some tools that can help you run tests in the production system and provide a monitoring platform that alerts you when things don't run as expected. I have personally used these in my previous projects.

Pingdom Real User Monitoring — helps trigger payloads from multiple regions across the globe to your service. Very useful if your service behaves differently for users in different regions.

Sensu Observability Pipeline — write code that creates user requests for your production service and monitors the success of different scenarios.
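
Sensu executes check commands and interprets their exit codes (0 = OK, 1 = warning, 2 = critical), so a production test can be as small as a script like this sketch; the endpoint and thresholds are placeholders.

```python
#!/usr/bin/env python3
# A Sensu-style check: exit 0 = OK, 1 = warning, 2 = critical.
import sys
import requests

ENDPOINT = "https://api.example.com/synthetic-check"  # placeholder

try:
    resp = requests.get(ENDPOINT, timeout=10)
except requests.RequestException as exc:
    print(f"CRITICAL: request failed: {exc}")
    sys.exit(2)

if resp.status_code != 200:
    print(f"CRITICAL: unexpected status {resp.status_code}")
    sys.exit(2)
if resp.elapsed.total_seconds() > 5:
    print(f"WARNING: slow response ({resp.elapsed.total_seconds():.1f}s)")
    sys.exit(1)
print("OK: production service responded as expected")
sys.exit(0)
```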

There are many more options, but this blog was intended to highlight that production testing is the best bet for ideal monitoring.
