
Is there a Scalable Solution in Elasticsearch for Handling Frequent Event Updates in Large-Scale Systems?

0 votes
0 answers
26 views
I’m working on a large-scale event storage system where a significant share (50-75%) of stored events need to be updated over time. These events include fields such as startTimestamp, endTimestamp, and connectionStatus, which may change as the system processes more data.

From my research, I understand that updating documents in Elasticsearch can be inefficient: each update effectively marks the old document as deleted and reindexes the new one in the background. To work around this, I’ve considered an alternative strategy: instead of updating documents directly, I fetch the current version of an event, merge it with the updated data in the application layer, and then insert a new version of the event with a unique document ID. This way the event history is preserved without relying on in-place updates. To retrieve the latest version of an event, I plan to use a collapse query on the eventId field, sorted by version, so that only the most recent version is returned.

Here’s a simplified version of the strategy I’m considering (see the sketch below):

1. **Fetch the current version** of the event using eventId.
2. **Merge the updated fields** with the existing event data at the application level.
3. **Insert the new version** of the event with the same eventId but a new unique document ID and an incremented version.
4. **Use collapse queries** to retrieve only the latest version of the event, based on version.

My questions are:

- **Will this approach handle high volumes of data efficiently?** For context, the system processes thousands of writes per second, and each event may be updated multiple times over its lifecycle.
- **Does the collapse feature perform well when querying millions of documents?** How does it scale across large indices with a mix of new and old events?
- **Is there a better alternative** for handling frequent updates in Elasticsearch at this scale that avoids the inefficiencies of the delete-and-reindex mechanism?

Additionally, I'm wondering whether updates are really as costly as they seem for my use case. Since 50-75% of events will require updates, is my approach of creating new document versions justified, or would Elasticsearch’s standard update operations work well enough without hurting performance too much?

Any advice or insights from those with experience in large-scale Elasticsearch systems would be greatly appreciated!
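For reference, here is a minimal sketch of the versioned-write approach described above, using the official Python client (assuming the 8.x API). The index name `events`, the helper functions, and the `eventId`-plus-`version` document ID scheme are illustrative, not a prescribed design; it also assumes eventId is mapped as a keyword field so that term queries and collapse work on it.

```python
from elasticsearch import Elasticsearch

# Assumed index name; eventId must be mapped as "keyword" for term/collapse.
es = Elasticsearch("http://localhost:9200")
INDEX = "events"


def insert_new_version(event_id: str, merged_event: dict, new_version: int) -> None:
    """Index a new immutable version of the event instead of updating in place.
    The document _id combines eventId and version so versions never collide."""
    doc = {**merged_event, "eventId": event_id, "version": new_version}
    es.index(index=INDEX, id=f"{event_id}-{new_version}", document=doc)


def fetch_latest(event_id: str) -> dict | None:
    """Fetch the most recent version of a single event, sorted by version desc."""
    resp = es.search(
        index=INDEX,
        query={"term": {"eventId": event_id}},
        sort=[{"version": {"order": "desc"}}],
        size=1,
    )
    hits = resp["hits"]["hits"]
    return hits[0]["_source"] if hits else None


def search_latest_versions(status: str) -> list[dict]:
    """Search across many events, collapsing on eventId so only the newest
    version of each matching event is returned."""
    resp = es.search(
        index=INDEX,
        query={"term": {"connectionStatus": status}},
        collapse={"field": "eventId"},
        sort=[{"version": {"order": "desc"}}],
        size=100,
    )
    return [h["_source"] for h in resp["hits"]["hits"]]
```

The merge step itself stays in the application layer (fetch_latest, merge the changed fields into the returned dict, then insert_new_version with an incremented version), which is the flow described in steps 1-3 above.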
Asked by Chillax (131 rep)
Oct 16, 2024, 11:16 AM