Search Functionality Degradation Affecting Multiple Features

Incident Report for Shelf

Postmortem

What Happened?

On October 24, 2024, our automated monitoring systems detected performance degradation in Shelf's search service, which affected users' ability to find and access their content. Customer reports of difficulties using search confirmed the issue.

Why Did It Happen?

The search service ran out of capacity because of an unexpected pattern in request handling. Our system is designed to make a single attempt to retrieve results for each user search. However, a recent configuration change inadvertently allowed multiple retry attempts per search request. As a result, a single user search could generate several duplicate requests, creating unnecessary load on the system.

Under normal conditions, our search infrastructure efficiently handles the typical volume of search requests. In this case, the combination of increased search traffic and the multiplicative effect of retry attempts exceeded the system's designed capacity. This overload caused the search service to respond slowly or fail to respond at all.
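
To illustrate the multiplicative effect described above, the following is a minimal sketch of how retries inflate backend load. It assumes each attempt fails independently with a fixed probability and that every failure is retried until the retry limit is reached. The function name, parameters, and numbers are hypothetical and are not taken from Shelf's systems or from measurements of this incident.

```python
def expected_backend_requests(user_searches: int, max_retries: int, failure_rate: float) -> float:
    """Estimate backend requests generated by a given number of user searches.

    Assumption (illustrative only): each attempt fails independently with
    probability `failure_rate`, and a failed attempt is retried until
    `max_retries` retries have been used.
    """
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    # Attempt number k (0-indexed) is only issued if the previous k attempts all failed.
    attempts_per_search = sum(failure_rate ** k for k in range(max_retries + 1))
    return user_searches * attempts_per_search


if __name__ == "__main__":
    # Hypothetical figures: 1,000 searches against a service that is already timing out.
    print(expected_backend_requests(1000, max_retries=0, failure_rate=1.0))
    print(expected_backend_requests(1000, max_retries=3, failure_rate=1.0))
```

Under these assumptions, an overloaded service where every attempt fails receives one request per search with no retries, but four requests per search with three retries allowed. The extra load arrives exactly when the service is least able to absorb it.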

Impact

Users in the US region had difficulty performing searches across the platform for approximately one hour, between 19:28 UTC and 20:21 UTC. Other platform features remained functional, but any operation that depends on search was affected: users could reach their content directly through navigation, but could not search for specific items or use search-dependent features. The EU and Canada regions were not impacted by this incident. We cannot determine the exact number of affected users, but the impact was limited to users who actively attempted to use search functionality in the US region during that window.

Solution

Our team implemented an immediate fix by adjusting the request handling configuration to prevent excessive retries. We also increased the search service's capacity to ensure stable operation. After implementing these changes, we conducted comprehensive testing across all affected components and monitored the system for an extended period to confirm the solution's effectiveness.
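
As a rough illustration of the kind of safeguard described above, the sketch below wraps an operation in an explicit attempt limit that defaults to a single attempt. It is not Shelf's actual configuration or code. The helper name `call_with_attempt_limit` and its parameters are hypothetical.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_attempt_limit(operation: Callable[[], T], max_attempts: int = 1) -> T:
    """Run `operation`, making at most `max_attempts` attempts in total.

    The default of 1 means no retries: a failed search surfaces its error
    instead of generating duplicate requests. (Hypothetical helper.)
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            return operation()
        except Exception as exc:  # a real client would catch only retryable errors
            last_error = exc
    # All permitted attempts failed, so re-raise the most recent error.
    assert last_error is not None
    raise last_error
```

Making the limit an explicit, validated parameter with a conservative default means a configuration change has to opt in to retries deliberately. A production client that does allow retries would typically also add backoff between attempts, which this sketch omits.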

Posted Oct 30, 2024 - 19:15 UTC

Resolved

Search functionality has been restored across all major features including Gems search. The system is now operating normally and we continue to monitor for stability. We appreciate your patience during this disruption.
Posted Oct 24, 2024 - 20:21 UTC

Monitoring

We have successfully restored search functionality for Announcements, Feedback lists, and CPW tasks. Gems search remains temporarily affected while we complete the final recovery steps. Our engineering team continues working on full service restoration. We will provide another update once all features are back online.
Posted Oct 24, 2024 - 20:00 UTC

Update

Our engineering team has identified the source of the disruption affecting search functionality. We are actively implementing necessary adjustments to restore service. Search capabilities remain limited across affected features while we work on the resolution. We will provide another update once service is restored.
Posted Oct 24, 2024 - 19:35 UTC

Identified

We are experiencing a service disruption affecting our search-related functionality across multiple features including Gems search, Feedback list, and Announcement list. Our engineering team has been notified and is actively working to restore normal operations. While basic platform functionality remains available, search capabilities are currently limited. We will provide updates as the situation develops.
Posted Oct 24, 2024 - 19:28 UTC
This incident affected: Shelf: US Region (Viewing Content & Search).