Unexpected server reboot

Nov 22 at 09:35pm CET
Affected services
stockholm1-10-vm (HA cluster)

Resolved
Nov 22 at 09:35pm CET

Incident overview
Affected server: Stockholm1-10vm
Time of incident: 21:20 CET
Duration: Approx. 15 minutes
Impact: Around 20% of VMs in the subnets 95.141.241.0/24 and 91.226.221.0/24 experienced extended downtime due to network connectivity issues.
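
To check whether a given VM address falls inside the two subnets named above, a minimal sketch using Python's standard ipaddress module is shown here. The function name in_affected_subnet and the sample addresses are illustrative assumptions. An address inside these ranges was not necessarily affected, since only around 20% of those VMs lost connectivity.

```python
import ipaddress

# The two subnets named in this report.
AFFECTED_SUBNETS = [
    ipaddress.ip_network("95.141.241.0/24"),
    ipaddress.ip_network("91.226.221.0/24"),
]


def in_affected_subnet(address: str) -> bool:
    """Return True if the address belongs to one of the subnets above."""
    ip = ipaddress.ip_address(address)
    return any(ip in net for net in AFFECTED_SUBNETS)


# Sample addresses chosen for illustration only.
print(in_affected_subnet("95.141.241.17"))  # True
print(in_affected_subnet("192.0.2.10"))     # False
```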

What happened

At 21:20, the Stockholm1-10vm server unexpectedly rebooted due to a hardware issue. The affected VMs were automatically redistributed across the remaining servers and began booting within approximately two minutes.

However, around 20% of the VMs in the affected subnets (95.141.241.0/24 and 91.226.221.0/24) did not regain network connectivity upon reboot. These VMs require VLAN tagging to connect, as their subnets are announced via another ASN. Since the redistribution was random, some VMs were deployed on servers that do not support VLAN tagging, resulting in network downtime.

By 21:35, all VMs were operational after our team manually migrated those affected to servers with proper VLAN tagging support.
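
The condition that left some VMs without connectivity can be expressed as a simple check: a VM that depends on VLAN tagging ended up on a host without VLAN tagging support. The sketch below is illustrative only and does not represent the actual tooling used during the incident. The Placement type and the VM and host names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    vm: str           # hypothetical VM identifier
    needs_vlan: bool  # True if the VM's subnet relies on VLAN tagging
    host: str         # host the VM landed on after redistribution


def stranded_vms(placements, vlan_hosts):
    """Return VMs that need VLAN tagging but sit on a host without it."""
    return [
        p.vm
        for p in placements
        if p.needs_vlan and p.host not in vlan_hosts
    ]


# Hypothetical state after a random redistribution.
placements = [
    Placement("vm-a", needs_vlan=True, host="host-1"),
    Placement("vm-b", needs_vlan=True, host="host-2"),
    Placement("vm-c", needs_vlan=False, host="host-2"),
]
vlan_hosts = {"host-1"}  # assumed set of hosts with VLAN tagging support

print(stranded_vms(placements, vlan_hosts))  # ['vm-b']
```

VMs returned by this check are the ones that would need migrating to a host with VLAN tagging support.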

Root cause

The server rebooted because of a defective RAM stick in the CPU1 G0 slot.

The extended downtime for VMs in the 95.141.241.0/24 and 91.226.221.0/24 subnets was caused by their reliance on VLAN tagging for connectivity, combined with the random redistribution of VMs across servers, some of which do not support VLAN tagging.

Actions taken

All affected VMs were manually migrated to servers with VLAN tagging support to restore network connectivity.
The defective memory stick has been identified and will be removed to prevent further incidents.

Future improvements

Hardware maintenance: The faulty RAM stick will be removed, and the server will be returned to production after thorough testing.

Subnet reconfiguration: We will migrate 95.141.241.0/24 and 91.226.221.0/24 to our own ASN over the weekend, removing the dependency on VLAN tagging. This change will ensure that, in case of future server failures, affected VMs can reboot on any server without network connectivity issues.

Automatic handling: We will enhance our automated recovery systems to prevent similar delays during VM redistribution.
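
One possible shape for such handling is placement that respects the VLAN tagging requirement instead of choosing a server at random. The following is a minimal sketch under stated assumptions, not a description of the recovery system itself. The Host type, the pick_host function, the free_slots capacity measure, and the host names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Host:
    name: str
    supports_vlan: bool
    free_slots: int  # assumed capacity measure: number of VMs it can still take


def pick_host(needs_vlan: bool, hosts: list[Host]) -> Optional[Host]:
    """Pick the eligible host with the most free slots, or None if none fits.

    A host is eligible if it has capacity and, when the VM needs VLAN
    tagging, also supports VLAN tagging.
    """
    eligible = [
        h for h in hosts
        if h.free_slots > 0 and (h.supports_vlan or not needs_vlan)
    ]
    if not eligible:
        return None
    return max(eligible, key=lambda h: h.free_slots)


# Hypothetical cluster state.
hosts = [
    Host("host-a", supports_vlan=False, free_slots=8),
    Host("host-b", supports_vlan=True, free_slots=3),
    Host("host-c", supports_vlan=True, free_slots=5),
]

print(pick_host(True, hosts).name)   # host-c
print(pick_host(False, hosts).name)  # host-a
```

Returning None when no eligible host exists lets the caller hold the VM or raise an alert, so a VM is never started where it cannot reach the network.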

We apologize for any inconvenience caused and are committed to improving our systems to minimize downtime in the future.