In the days before January 12th, performance issues are observed with our backup storage provider for the Visual Tour Builder database.
Support case opened with our Cloud Provider. They confirm a known issue: daily saturation from midnight until early morning at the Nuremberg location.
Workarounds implemented: upload chunk size increased from 5 MB to 256 MB, full backup interval changed from daily to weekly, backup schedule moved outside peak hours.
January 9th at 12:00: last successful full database backup completes before the incident.
January 12th at 00:11: scheduled security maintenance begins on production database cluster to enhance system security.
At 00:14 the database cluster enters an unexpected failure state that requires restoration from backup.
At 00:30 the on-call engineer is alerted and begins immediate recovery procedures.
January 12th (Sunday)
00:30 – 06:00: Multiple recovery attempts fail because of timeouts from our Cloud Provider's Object Storage during the overnight saturation period. Each Point-in-Time Recovery attempt fails partway through due to S3 endpoint instability.
06:46: First recovery cluster deployed using workaround (S3 caching proxy).
08:14: Status page updated - "We are currently facing issues with opening and transferring VTBs. We have identified the issue and working on a solution."
09:15: Initial ETA missed - the Object Storage continues to cause recovery failures.
10:58: Status update - "Our estimate to be back online is now by the end of the day."
18:36: Status update - "We are working around the clock to recover from a database disruption that started last night at 00:30."
21:03: Status update - "We are making good progress and now expect to see initial recovery in the next hours."
January 13th (Monday)
01:05: Engineering team pauses overnight work due to fatigue; recovery strategy refined.
05:09: Recovery strategy pivoted to direct NVMe storage for improved I/O performance. The team accepted restoring an outdated version of the database first, which increased the chances of a successful recovery.
06:55: Production database core restored and validated, albeit as an outdated version.
07:05: Status update - "We have been able to restore all VTB Data that was created or last edited before 09-01-2025 at 12:00 noon. Any changes to existing VTBs or VTBs created after that timestamp are currently unavailable."
09:13 – 14:30: Extended recovery session to restore individual tenant databases with edit timestamps after 09-01-2025 12:00 noon.
17:58: VTB Standalone and primary services fully restored.
January 14th (Tuesday)
09:00: Final database restorations completed.
11:43: Status update - "The script for restoring the missing VTBs and data is running."
14:12 – 14:48: Brief issue with VTB creation resolved.
January 15th (Wednesday)
01:51: Final application updates deployed, resolving the remaining edge cases.
06:47: Status update - "The Visual Tour Builder is fully operational again and all VTB data has been restored."
10:07: Issue marked as resolved.
Primary cause: The production database cluster experienced an unexpected failure during infrastructure maintenance, requiring restoration from backups.
Contributing factor: Cloud Provider Object Storage performance issues during overnight hours (documented by the provider) caused repeated recovery failures between 00:30 and 06:00. Completing the restoration required retry logic and a caching layer.
Scale factor: Multiple tenant databases required sequential restoration, each dependent on downloading data from the affected storage endpoint.
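The retry logic mentioned under the contributing factor can be illustrated with a minimal sketch. It is not the code used during the incident: the function name `retry_with_backoff`, its parameters, and the choice of exponential backoff are assumptions made for illustration. It retries an operation (for example, one download from the storage endpoint) when it times out, waiting longer after each failure.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation` until it succeeds or `max_attempts` is reached.

    Hypothetical helper: names and defaults are illustrative only.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts:
                # Out of attempts: surface the last error to the caller.
                raise
            # Double the wait after each failure, capped at max_delay.
            delay = min(base_delay * 2 ** (attempt - 1), max_delay)
            sleep(delay)
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable so the behaviour can be exercised without real waiting. With the defaults, a call that fails twice and then succeeds waits 1 second and then 2 seconds before returning.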
When storage performance issues arose, temporary workarounds were implemented rather than migrating to an alternative provider. This resulted in the most recent full backup being from January 9, 12:00 CET, necessitating a three-day transaction log replay during recovery.
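To make the size of that replay window concrete, the short sketch below computes the span between the last full backup and the cluster failure, using the timestamps from the timeline. The year 2025 is an assumption taken from the status update wording, and `replay_window` is a hypothetical helper, not part of any tooling used in the recovery.

```python
from datetime import datetime, timedelta

# Timestamps from the timeline; the year 2025 is an assumption.
last_full_backup = datetime(2025, 1, 9, 12, 0)
cluster_failure = datetime(2025, 1, 12, 0, 14)


def replay_window(last_backup: datetime, failure: datetime) -> timedelta:
    """Span of transaction logs to replay for a point-in-time recovery."""
    if failure < last_backup:
        raise ValueError("failure time precedes the last backup")
    return failure - last_backup


window = replay_window(last_full_backup, cluster_failure)
hours = window.total_seconds() / 3600
print(f"Log replay window: {window.days} days, {hours:.1f} hours in total")
```

Under a daily full backup schedule the same calculation is bounded by roughly 24 hours, which is why the weekly interval made the recovery so much more dependent on the affected storage endpoint.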
✅ Migrate all database backups to a different Cloud Provider S3 enterprise storage (completed January 15-17)
✅ Restore daily full backup schedule (completed January 15)
✅ Implement backup health monitoring and alerting (completed January 17)
✅ Deploy secondary recovery clusters in standby (completed January 14)
🔄 Implement cross-region backup replication (planned February 2026)
🔄 Automated backup verification testing (planned February 2026)
🔄 Disaster recovery runbook documentation (in progress)
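As an illustration of the backup health monitoring item, the sketch below flags a full backup that is older than an allowed age. It is a simplified example under stated assumptions: the function name `backup_alert`, the 26-hour threshold, and the message format are hypothetical and do not describe the monitoring that was actually deployed.

```python
from datetime import datetime, timedelta
from typing import Optional


def backup_alert(
    last_full_backup: datetime,
    now: datetime,
    max_age: timedelta = timedelta(hours=26),  # assumed: daily schedule plus slack
) -> Optional[str]:
    """Return an alert message if the last full backup is too old, else None."""
    age = now - last_full_backup
    if age > max_age:
        age_hours = age.total_seconds() / 3600
        limit_hours = max_age.total_seconds() / 3600
        return (
            f"Last full backup is {age_hours:.1f} h old "
            f"(limit {limit_hours:.0f} h)"
        )
    return None


# Example with the timeline's timestamps (year 2025 assumed).
message = backup_alert(
    last_full_backup=datetime(2025, 1, 9, 12, 0),
    now=datetime(2025, 1, 12, 0, 14),
)
if message is not None:
    print(message)
```

A check of this kind, run on a schedule, would have raised an alert once the backup from January 9 passed the threshold, well before the maintenance window on January 12.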