HostUp Status

All Systems Operational

99.998% uptime over 90 days

Latest Post-Mortem

26 March 2026 · I/O Degradation · 26 min

VPS - RL1 Node 2 — I/O Storm Caused by krbd Sparse-Read Bug (CVE-2026-23136)

A brief network disruption triggered a known kernel bug in the Ceph storage client (krbd), causing an I/O retry loop that could not recover on its own and stalled all ~65 VMs on this node. No data was lost.

Timeline
00:36 · First occurrence overnight. We recovered the node but did not identify the root cause.

17:40 · It happened again. Server load hit 680+, kernel logs flooded with CRC checksum errors across all OSD connections simultaneously, and storage I/O stalled completely.

17:45 · Throttled the link to 2 Gbit with tc to break the retry loop (see the sketch after this timeline). CRC errors stopped immediately.

17:50 · Identified that the bond hash was set to layer2 instead of layer3+4, funneling all inbound traffic through one 10G NIC.

17:55 · Started migrating VMs to other nodes.

18:06 · All VMs restored. VMs that had gone read-only were rebooted to clear filesystem state.
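
The 2 Gbit throttle was a stopgap to starve the retry loop of bandwidth so the CRC errors would stop. A minimal sketch of that kind of cap with tc, assuming the bond interface is named bond0 (the exact command and parameter values used during the incident are not recorded here):

    # Illustrative sketch only: cap egress on the bonded link to 2 Gbit
    # with a token-bucket (tbf) qdisc. The interface name (bond0) and the
    # burst/latency values are assumptions, not the incident's exact command.
    tc qdisc add dev bond0 root tbf rate 2gbit burst 256k latency 50ms

    # Remove the cap once the retry loop is broken and the bond is rebalanced:
    tc qdisc del dev bond0 root

Note that tc shapes egress by default; capping inbound traffic the same way would need policing or an ifb redirect, so treat this purely as a sketch of the technique.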

Impact

About 65 VMs on Node 2 had disk I/O stall completely. Some guest filesystems went read-only as a protective measure. All VMs were restored with no data loss.

Root Cause

The incident had two contributing factors:

Network bonding imbalance: This node's bond was configured with layer2 hashing, which picks a member link based on MAC addresses alone. With only two MAC endpoints involved (server and switch), the hash always resolves to the same link, so traffic landed almost entirely on one 10G NIC. During a Ceph deep scrub, the increased read traffic was enough to cause packet drops and CRC failures on the saturated link.
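
For operators checking their own bonds: the active policy is readable at runtime, and the fix is a one-line configuration change. A sketch assuming a Linux bond named bond0 managed by Debian-style ifupdown (the node's actual interface names and network tooling are not stated in this report):

    # Inspect the active transmit hash policy (bond name bond0 assumed):
    cat /sys/class/net/bond0/bonding/xmit_hash_policy
    # prints e.g. "layer2 0" on the faulty node, "layer3+4 1" when fixed

    # Persistent fix in /etc/network/interfaces (Debian/ifupdown style;
    # member NIC names are hypothetical):
    #
    #   auto bond0
    #   iface bond0 inet manual
    #       bond-slaves eno1 eno2
    #       bond-mode 802.3ad
    #       bond-xmit-hash-policy layer3+4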

Kernel bug (CVE-2026-23136): When the CRC errors caused libceph to drop and reconnect OSD connections, a bug in the kernel's sparse-read state machine prevented recovery. On reconnect, the client misinterpreted new OSD replies as continuations of previous failed operations, causing every retry to fail immediately and trigger another reconnect. This created a self-sustaining loop that could not resolve on its own.

Because krbd handles all VM storage through a single kernel process, this loop affected every VM on the node simultaneously. With librbd (QEMU's userspace Ceph client), each VM maintains independent connections — the same bug does not exist in the userspace client, and even a connection failure would only affect the individual VM.
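
The per-VM isolation comes from where the Ceph client lives. On Proxmox VE (an assumption on our part; the report confirms QEMU and live migration but does not name the platform), the choice between krbd and librbd is a single flag on the RBD storage definition. A sketch with hypothetical storage, pool, and monitor values:

    # /etc/pve/storage.cfg (Proxmox-style; the storage name, pool, and
    # monitor addresses below are hypothetical)
    # krbd 0: QEMU opens images via userspace librbd, one client per VM
    # krbd 1: images are mapped through the kernel client, shared node-wide
    rbd: ceph-vm
        content images
        krbd 0
        monhost 10.0.0.1 10.0.0.2 10.0.0.3
        pool vm-pool

The flag is read when a disk is next opened, which is consistent with rolling the change out via live migration: each VM picks up librbd as it lands on its new node, with no downtime.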

Resolution
  1. Throttled the link to 2 Gbit to break the retry loop
  2. Fixed the bond hash policy from layer2 to layer3+4
  3. Migrated VMs to other nodes and rebooted those in read-only state
  4. Rebooted the node to clear stale kernel Ceph state
Preventive Measures
  • All nodes confirmed on layer3+4 bond hashing — this was the only node still on layer2
  • Migrated all nodes from krbd to librbd (QEMU) on March 28. With librbd, connection faults are isolated per VM and the kernel sparse-read bug is not in the code path. Done via live migration with no downtime

Scheduled Maintenance

Completed

VPS - RL1 — Migration from krbd to QEMU librbd

All VMs were live-migrated with no downtime. All nodes now run librbd.

Website & Portal

agent

Operational

API

Operational

cloud.hostup.se

Customer portal

Operational

hostup.se

Operational

webmail

Operational

Web Hosting - cPanel

delta

Test site

Operational

lambda

Test site

Operational

mu

Test site

Operational

omega

Test site

Operational

pi

Test site

Operational

srv11

High-frequency cPanel

Operational

Web Hosting - ApisCP (Legacy)

epsilon

Test site

Maintenance

eta

Test site

Operational

orion

Test site

Operational

theta

Test site

Operational

zeta

Test site

Operational

VPS - RL1

Stockholm Älvsjö datacenter

High Frequency Ryzen 9950x

High-performance compute

Maintenance

IPv4 Gateway

IPv4 routing

Operational

IPv6 Gateway

IPv6 routing

Operational

Node 0

Legacy node

Operational

Node 12

HA cluster node

Operational

Node 13

HA cluster node

Operational

Node 16

HA cluster node

Operational

Node 23

HA cluster node

Operational

Node 24

HA cluster node

Operational

Node 25

HA cluster node

Operational

Node 26

HA cluster node

Operational

Node 3

Snapshot storage

Maintenance

Node 4

HA cluster node

Operational

Node 5

HA cluster node

Operational

Node 6

HA cluster node

Operational

Node 7

HA cluster node

Operational

Node 8

HA cluster node

Operational

Node 9

HA cluster node

Operational

VPS - RL2

Stockholm Älvsjö datacenter

IPv4 Gateway

IPv4 routing

Operational

IPv6 Gateway

IPv6 routing

Operational

Node 1

Hypervisor

Operational

Node 2

Hypervisor

Operational

Node 3

Hypervisor

Operational

Node 4

Hypervisor

Operational

Node 5

Hypervisor

Operational

Node 6

Hypervisor

Operational

Node 7

Hypervisor

Operational

Node 8

Hypervisor

Operational

Node 9

Hypervisor

Operational

DNS

Cloudflare whitelabel anycast nameservers

primary.ns.hostup.se

Cloudflare anycast

Operational

secondary.ns.hostup.se

Cloudflare anycast

Operational

Past Incidents

March 2026

Resolved · Duration: 12 min

Node 6 (RL1)

No packets returned by host

26 Mar 2026, 03:47

Resolved · Duration: 0 min

IPv6 Gateway

No packets returned by host

18 Mar 2026, 14:13

Resolved · Duration: 0 min

IPv6 Gateway

No packets returned by host

18 Mar 2026, 14:11

Resolved · Duration: 0 min

IPv6 Gateway

No packets returned by host

18 Mar 2026, 13:48

Resolved · Duration: 0 min

IPv6 Gateway

No packets returned by host

18 Mar 2026, 12:54

Resolved · Duration: 0 min

IPv6 Gateway

No packets returned by host

18 Mar 2026, 10:11

Resolved · Duration: 0 min

IPv6 Gateway

No packets returned by host

18 Mar 2026, 09:56

Issue not listed here?

Try our AI troubleshooting agent — it can check your website, verify DNS records, test if ports are open (SSH, RDP), and help determine if the issue is on your end or ours.

Automated health checks running every 30 seconds. Web hosting monitors use test WordPress sites — brief unavailability (1-2 min) may occur during auto-updates.