On October 24, 2024, customers reported difficulties using Shelf's search functionality. Our automated monitoring first detected performance degradation in the search service, which affected users' ability to find and access their content, and customer reports confirmed the issue.
The search service experienced capacity constraints due to an unexpected pattern in request handling. Our system is designed to make a single attempt to retrieve results for each user search. However, a recent configuration change inadvertently allowed multiple retry attempts per search request. As a result, a single user search could generate several duplicate requests, creating unnecessary load on the system.
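The retry behavior described above can be illustrated with a minimal sketch. It is not Shelf's actual code: the names `search_with_retries`, `max_attempts`, and `OverloadedBackend` are hypothetical, and the backend is a stand-in that always fails so the duplicate requests can be counted.

```python
from typing import Callable, Optional


class SearchBackendError(Exception):
    """Raised by the hypothetical backend when it cannot serve a request."""


def search_with_retries(query: str,
                        backend: Callable[[str], list],
                        max_attempts: int = 1) -> list:
    """Call the backend up to max_attempts times and return the first success."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error: Optional[SearchBackendError] = None
    for _ in range(max_attempts):
        try:
            return backend(query)
        except SearchBackendError as exc:
            last_error = exc  # remember the failure; retry if attempts remain
    raise last_error


class OverloadedBackend:
    """Stand-in backend that always fails and counts the requests it receives."""

    def __init__(self) -> None:
        self.requests_received = 0

    def __call__(self, query: str) -> list:
        self.requests_received += 1
        raise SearchBackendError("overloaded")


def backend_requests_for_one_search(max_attempts: int) -> int:
    """Count how many backend requests one user search produces."""
    backend = OverloadedBackend()
    try:
        search_with_retries("quarterly report", backend, max_attempts)
    except SearchBackendError:
        pass  # the search failed; only the request count matters here
    return backend.requests_received
```

With `max_attempts=1` (the intended single attempt), one failed search costs the backend one request. With a larger value, such as the assumed `max_attempts=3`, the same search costs three requests while the backend is struggling, which is the duplication described above.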
Under normal conditions, our search infrastructure efficiently handles the typical volume of search requests. In this case, the combination of increased search traffic and the multiplicative effect of retry attempts exceeded the system's designed capacity. This overload caused the search service to respond slowly or fail to respond at all.
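The multiplicative effect can be made concrete with some simple arithmetic. The figures below are purely illustrative assumptions, since this report does not state the actual traffic, retry count, or capacity. The calculation is a worst-case bound that assumes every search uses all of its allowed attempts, which is roughly what happens once a service is already failing most requests.

```python
def worst_case_backend_load(user_searches_per_sec: float,
                            attempts_per_search: int) -> float:
    """Upper bound on backend requests per second if every attempt is used."""
    return user_searches_per_sec * attempts_per_search


def exceeds_capacity(user_searches_per_sec: float,
                     attempts_per_search: int,
                     capacity_per_sec: float) -> bool:
    """True when the worst-case load is above what the service can handle."""
    load = worst_case_backend_load(user_searches_per_sec, attempts_per_search)
    return load > capacity_per_sec


# Hypothetical numbers, chosen only to illustrate the effect.
CAPACITY = 1000.0      # assumed requests per second the service can handle
PEAK_SEARCHES = 400.0  # assumed user searches per second during a busy period

single_attempt_overloaded = exceeds_capacity(PEAK_SEARCHES, 1, CAPACITY)
with_retries_overloaded = exceeds_capacity(PEAK_SEARCHES, 3, CAPACITY)
```

Under these assumed numbers, 400 searches per second with a single attempt stays well within a capacity of 1,000 requests per second, while the same traffic with three attempts per search can reach 1,200 requests per second and exceed it.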
Users in the US region experienced difficulties performing searches across the platform for 53 minutes, between 19:28 UTC and 20:21 UTC. While other platform features remained functional, any operations requiring search capabilities were affected. Users could access their content directly through navigation, but could not search for specific items or use search-dependent features. The EU and Canada regions were not impacted by this incident. The exact number of affected users cannot be determined precisely, but the issue was limited to users who actively attempted to use search functionality in the US region during that window.
Our team implemented an immediate fix by adjusting the request handling configuration to prevent excessive retries. We also increased the search service's capacity to ensure stable operation. After implementing these changes, we conducted comprehensive testing across all affected components and monitored the system for an extended period to confirm the solution's effectiveness.