Downtime in storage cluster due to firmware bug
Resolved
Mar 11 at 11:05am CET
Hello everyone,
We want to publish an update, as the Ceph developers have now found the root cause of the issue.
Please find the bug report here: https://tracker.ceph.com/issues/70390
The bug is triggered when new OSDs are added to a cluster that contains an Erasure Coding (EC) pool. We have also seen reports from others who experienced similar issues in the past 1-2 weeks.
To summarize, the issue was caused by a bug in Ceph 19.2 (Squid), and it only affected the Erasure Coding pool.
We are now confident that this specific bug cannot affect us again: it only triggers on Erasure Coding pools, and we have moved everything away from Erasure Coding 4+2 to replication-3 for redundancy.
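For reference, switching a replicated pool to 3x replication is done with the standard Ceph CLI; below is a minimal sketch of the relevant commands, wrapped in Python, where "vm-pool" is a hypothetical pool name (an EC pool cannot be converted in place, so its data has to be migrated to a new replicated pool instead, e.g. with "rbd migration"):

import subprocess

def ceph(*args: str) -> None:
    # Run a ceph CLI command and fail loudly if it errors.
    subprocess.run(["ceph", *args], check=True)

# "vm-pool" is a hypothetical pool name, used for illustration only.
ceph("osd", "pool", "set", "vm-pool", "size", "3")      # keep 3 copies of every object
ceph("osd", "pool", "set", "vm-pool", "min_size", "2")  # serve I/O while at least 2 copies are intact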
Additionally, to reduce the risk of being hit by such a bug again, we will from now on always stay one major version behind. If the current version is 19.2.1, we will stay on 18.2.4 until version 20 is released, and only then upgrade to 19.x. This should give the "new" version more time to be battle-tested by others first.
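As a rough sketch (assuming a cephadm-managed deployment, which depends on the cluster's tooling), pinning the cluster to a specific release can look like this:

import subprocess

# Latest release of the previous major version ("one major version behind").
PINNED_VERSION = "18.2.4"

# Ask the Ceph orchestrator to upgrade to (and hold at) the pinned release
# rather than the newest available one.
subprocess.run(
    ["ceph", "orch", "upgrade", "start", "--ceph-version", PINNED_VERSION],
    check=True,
)
# Progress can then be followed with: ceph orch upgrade status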
Affected services
stockholm1-1-vm
stockholm1-4-vm (HA cluster)
stockholm1-5-vm (HA cluster)
stockholm1-6-vm (HA cluster)
stockholm1-7-vm (HA cluster)
stockholm1-8-vm (HA cluster)
stockholm1-9-vm (HA cluster)
stockholm1-10-vm (HA cluster)
stockholm1-3-vm (25 GbE snapshot)
stockholm1-2-vm (Dedicated core, 200 GB RAM, 128 vCPU)
stockholm1-12-vm (HA cluster)
stockholm1-16-vm (HA cluster - backup spare capacity non production)
stockholm1-13-vm (HA cluster - backup spare capacity non production)
Updated
Mar 05 at 07:29pm CET
The cluster has been in an optimal state with a healthy 3x replication pool for around 24 hours now.
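For those curious, the pool state can be verified with the standard Ceph CLI; a minimal sketch in Python, where "vm-pool" is a hypothetical pool name:

import subprocess

# Overall cluster health; this reports HEALTH_OK once recovery is complete.
subprocess.run(["ceph", "health", "detail"], check=True)

# Confirm that the pool is really 3x replicated ("size: 3").
subprocess.run(["ceph", "osd", "pool", "get", "vm-pool", "size"], check=True)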
An import of your scheduled backups (if set up; 7 free backups are included) or a reinstallation is unfortunately required to bring your VM back up.
Backup management guide: https://hostup.se/en/support/hantera-sakerhetskopior/
Reinstall instructions: https://hostup.se/en/support/installera-nytt-operativsystem-pa-din-vps/
In our close to 5 years of running Ceph-based storage, we have never seen an issue like this one. The 3x replicated pool is now our standard, replacing both EC and 2x replication, for redundancy as well as simplicity.
We are truly sorry for the trouble this has caused everyone, and we are going to make sure something like this never happens again.
Additionally, automated backups (without you having to set a schedule yourself) and a high-speed storage solution for those backups will be high on our priority list.
Affected services
stockholm1-1-vm
stockholm1-4-vm (HA cluster)
stockholm1-5-vm (HA cluster)
stockholm1-6-vm (HA cluster)
stockholm1-7-vm (HA cluster)
stockholm1-8-vm (HA cluster)
stockholm1-9-vm (HA cluster)
stockholm1-10-vm (HA cluster)
stockholm1-3-vm (25 GbE snapshot)
stockholm1-2-vm (Dedicated core, 200 GB RAM, 128 vCPU)
stockholm1-12-vm (HA cluster)
stockholm1-16-vm (HA cluster - backup spare capacity non production)
stockholm1-13-vm (HA cluster - backup spare capacity non production)
Updated
Mar 04 at 07:11pm CET
We at Hostup sincerely apologize for the severe incident that occurred this afternoon, caused by multiple OSD failures leading to irreversible data corruption in our storage cluster. The root cause was identified as a firmware bug in the new disks recently introduced to our Ceph cluster.
As a result, all VPS instances must now be reinstalled. We will work throughout the night to assist you with restoring your backups and getting your services operational again as quickly as possible. Don't worry: your backups are safe, and backups are always included in our services.
Additionally, we’ve transitioned all customers from replication-2 and EC 4+2 to replication-3 to significantly increase redundancy and prevent similar issues in the future.
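For context, here is a small worked comparison of the three schemes (raw-storage overhead versus how many simultaneous OSD losses each survives), sketched in Python:

# Overhead and fault tolerance of the redundancy schemes mentioned above.
schemes = {
    "replication-2": {"overhead": 2.0,   "losses_survived": 1},
    "replication-3": {"overhead": 3.0,   "losses_survived": 2},
    "EC 4+2":        {"overhead": 6 / 4, "losses_survived": 2},  # 4 data + 2 parity chunks
}
for name, s in schemes.items():
    print(f"{name}: {s['overhead']:.1f}x raw storage, survives {s['losses_survived']} OSD loss(es)")

Replication-3 costs twice the raw space of EC 4+2 but tolerates the same two losses with much simpler recovery.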
For instructions on reinstalling your VPS and restoring from backups, please refer to these guides:
Backup management guide: https://hostup.se/en/support/hantera-sakerhetskopior/
Reinstall instructions: https://hostup.se/en/support/installera-nytt-operativsystem-pa-din-vps/
Affected services
stockholm1-1-vm
stockholm1-4-vm (HA cluster)
stockholm1-5-vm (HA cluster)
stockholm1-6-vm (HA cluster)
stockholm1-7-vm (HA cluster)
stockholm1-8-vm (HA cluster)
stockholm1-9-vm (HA cluster)
stockholm1-10-vm (HA cluster)
stockholm1-3-vm (25 GbE snapshot)
stockholm1-2-vm (Dedicated core, 200 GB RAM, 128 vCPU)
stockholm1-12-vm (HA cluster)
Created
Mar 04 at 03:23pm CET
We are currently experiencing a catastrophic firmware bug in our new disks. The firmware sometimes corrupts data on write, and the storage cluster is now down due to that corruption.
We will roll back the system to bring it online again.
Affected services
stockholm1-1-vm
stockholm1-4-vm (HA cluster)
stockholm1-5-vm (HA cluster)
stockholm1-6-vm (HA cluster)
stockholm1-7-vm (HA cluster)
stockholm1-8-vm (HA cluster)
stockholm1-9-vm (HA cluster)
stockholm1-10-vm (HA cluster)
stockholm1-2-vm (Dedicated core, 200 GB RAM, 128 vCPU)
stockholm1-12-vm (HA cluster)