Resco Cloud availability issue

Incident Report for Resco Cloud

Resolved

Yesterday, on November 22nd at approximately 10 AM CET, we started to observe intermittent unavailability of our Resco Cloud services. These outages were caused by an overloaded Redis cache server, which is part of our Resco Cloud infrastructure. Responses from this cache exceeded the recommended size limits, for a reason we are still investigating.

Requests that produce larger responses often cause problems, because the Redis server and client libraries are optimized for rapidly processing and transmitting many small requests. Attempting to move large payloads through this system can severely impact Redis server load and performance, leading to command timeouts and failures in the Resco Cloud client application.
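As an illustration of the size concern described above, here is a minimal Python sketch of a guard that skips caching when a serialized payload is too large. It is not the Resco Cloud implementation. The names cache_if_small, InMemoryCache and MAX_PAYLOAD_BYTES are hypothetical, and the 100 KB threshold is an assumption, since this report does not state the actual limit. The in-memory class stands in for a real cache client so that the example runs on its own.

```python
import json
from typing import Any, Protocol

# Assumed threshold for illustration only; the report does not give the real limit.
MAX_PAYLOAD_BYTES = 100 * 1024


class CacheClient(Protocol):
    """Minimal interface this sketch expects from a cache client (hypothetical)."""

    def set(self, key: str, value: bytes) -> Any: ...


class InMemoryCache:
    """Stand-in for a real cache client, used only to keep the sketch runnable."""

    def __init__(self) -> None:
        self.store = {}

    def set(self, key: str, value: bytes) -> bool:
        self.store[key] = value
        return True


def cache_if_small(cache: CacheClient, key: str, value: Any,
                   max_bytes: int = MAX_PAYLOAD_BYTES) -> bool:
    """Serialize value as JSON and cache it only if it fits within max_bytes.

    Returns True if the value was cached and False if it was skipped.
    """
    payload = json.dumps(value).encode("utf-8")
    if len(payload) > max_bytes:
        # Oversized payload: skip the cache instead of loading the server.
        return False
    cache.set(key, payload)
    return True
```

A caller could treat a False result as a signal to log the key and payload size, which would help trace which requests produce oversized responses.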

We are working on identifying the source of this occasional behavior, which happens only under specific conditions. Our development team is also working on a solution to prevent this behavior in the future. We expect to have a solution by the next release at the latest, but we will do our best to ship it as a bugfix as soon as it is developed and tested.

Currently, resolving the issue requires manual intervention to return the system to normal operation.

Our company’s Status page did not reflect the unavailability of this service because of the way we detect service unavailability: our monitoring API solution did not catch this specific outage. We have already made changes to our monitoring so that similar behavior is detected.

Until a solution is provided, this outage can occur again. We will react as quickly as possible to resolve any such outage.

Posted Nov 22, 2022 - 10:00 CET