All Systems Operational
99.998% uptime over 90 days
Latest Post-Mortem
VPS - RL1 Node 2 — I/O Storm Caused by krbd Sparse-Read Bug (CVE-2026-23136)
A brief network disruption triggered a known kernel bug in the Ceph storage client (krbd), causing an unrecoverable I/O retry loop that stalled all ~65 VMs on this node. No data was lost.
Timeline
First occurrence overnight. We recovered the node but did not identify the root cause.
It happened again. Server load hit 680+, and kernel logs flooded with CRC checksum errors across all OSD connections simultaneously. Storage I/O stalled completely.
Throttled the link to 2 Gbit with tc to break the retry loop; CRC errors stopped immediately.
Identified that the bond hash was set to layer2 instead of layer3+4, funneling all inbound traffic through one 10G NIC.
Started migrating VMs to other nodes.
All VMs restored. VMs that had gone read-only were rebooted to clear filesystem state.
Impact
About 65 VMs on Node 2 had disk I/O stall completely. Some guest filesystems went read-only as a protective measure. All VMs were restored with no data loss.
Root Cause
The incident had two contributing factors:
Network bonding imbalance: This node's bond was configured with layer2 hashing, which selects the outgoing NIC based on MAC address. With only two endpoints (server and switch), inbound traffic landed almost entirely on one 10G NIC. During a Ceph deep scrub, the increased read traffic was enough to cause packet drops and CRC failures on the saturated link.
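For reference, the transmit hash policy of a Linux bond can be inspected and changed as sketched below. The interface name bond0 is a placeholder, not taken from this report, and persistent configuration depends on the distribution's network tooling:

```shell
# Inspect the current transmit hash policy (bond0 is a placeholder name)
grep "Transmit Hash Policy" /proc/net/bonding/bond0

# layer2 hashes on source/destination MAC only -- with a single
# server<->switch MAC pair, every flow maps to the same slave NIC.
# layer3+4 adds IP addresses and L4 ports to the hash, so the many
# Ceph OSD connections spread across both 10G links.
# (Some kernels require the bond to be down before changing this.)
echo layer3+4 > /sys/class/net/bond0/bonding/xmit_hash_policy
```

Note that xmit_hash_policy governs the NIC chosen for outgoing frames; the switch's own hashing determines how inbound traffic is distributed, so both sides typically need a flow-aware policy.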
Kernel bug (CVE-2026-23136): When the CRC errors caused libceph to drop and reconnect OSD connections, a bug in the kernel's sparse-read state machine prevented recovery. On reconnect, the client misinterpreted new OSD replies as continuations of previous failed operations, causing every retry to fail immediately and trigger another reconnect. This created a self-sustaining loop that could not resolve on its own.
Because krbd routes all VM storage through a single shared kernel client, this loop affected every VM on the node simultaneously. With librbd (QEMU's userspace Ceph client), each VM maintains independent connections — the same bug does not exist in the userspace client, and even a connection failure would only affect the individual VM.
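The architectural difference can be illustrated by how a VM disk is attached in each mode. The pool and image names below are hypothetical examples, not from this incident:

```shell
# krbd: the kernel client maps the image to a block device; all mapped
# images on the node share the kernel's Ceph client state
rbd map vms/disk-101          # exposes e.g. /dev/rbd0, which QEMU
                              # then consumes as a plain block device

# librbd: QEMU opens the image itself through userspace librbd, so
# each VM process holds its own independent OSD connections
qemu-system-x86_64 \
  -drive format=raw,file=rbd:vms/disk-101:conf=/etc/ceph/ceph.conf
```

With the librbd path, a connection fault or client bug is confined to one QEMU process rather than stalling every mapped device on the host.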
Resolution
- Throttled the link to 2 Gbit to break the retry loop
- Fixed the bond hash policy from layer2 to layer3+4
- Migrated VMs to other nodes and rebooted those in read-only state
- Rebooted the node to clear stale kernel Ceph state
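The emergency throttle in the first step can be sketched with tc; the interface name and the burst/latency values are assumptions for illustration:

```shell
# Cap egress on the storage-facing link to ~2 Gbit/s with a token
# bucket filter (bond0, burst, and latency here are placeholders)
tc qdisc add dev bond0 root tbf rate 2gbit burst 64kb latency 50ms

# Remove the throttle once the retry loop is broken
tc qdisc del dev bond0 root
```

Slowing the link below the loss threshold stops the CRC failures, which lets in-flight OSD connections complete instead of endlessly reconnecting.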
Preventive Measures
- All nodes confirmed on layer3+4 bond hashing — this was the only node still on layer2
- Migrated all nodes from krbd to librbd (QEMU) on March 28. With librbd, connection faults are isolated per VM and the kernel sparse-read bug is not in the code path. Done via live migration with no downtime
Scheduled Maintenance
VPS - RL1 — Migration from krbd to QEMU librbd
All VMs were live-migrated with no downtime. All nodes now run librbd.
Website & Portal
agent
API
cloud.hostup.se
Customer portal
hostup.se
webmail
Web Hosting - cPanel
delta
Test site
lambda
Test site
mu
Test site
omega
Test site
pi
Test site
srv11
High-frequency cPanel
Web Hosting - ApisCP (Legacy)
epsilon
Test site
eta
Test site
orion
Test site
theta
Test site
zeta
Test site
VPS - RL1
Stockholm Älvsjö datacenter
High Frequency Ryzen 9950x
High-performance compute
IPv4 Gateway
IPv4 routing
IPv6 Gateway
IPv6 routing
Node 0
Legacy node
Node 12
HA cluster node
Node 13
HA cluster node
Node 16
HA cluster node
Node 23
HA cluster node
Node 24
HA cluster node
Node 25
HA cluster node
Node 26
HA cluster node
Node 3
Snapshot storage
Node 4
HA cluster node
Node 5
HA cluster node
Node 6
HA cluster node
Node 7
HA cluster node
Node 8
HA cluster node
Node 9
HA cluster node
VPS - RL2
Stockholm Älvsjö datacenter
IPv4 Gateway
IPv4 routing
IPv6 Gateway
IPv6 routing
Node 1
Hypervisor
Node 2
Hypervisor
Node 3
Hypervisor
Node 4
Hypervisor
Node 5
Hypervisor
Node 6
Hypervisor
Node 7
Hypervisor
Node 8
Hypervisor
Node 9
Hypervisor
DNS
Cloudflare whitelabel anycast nameservers
primary.ns.hostup.se
Cloudflare anycast
secondary.ns.hostup.se
Cloudflare anycast
Past Incidents
March 2026
Node 6 (RL1)
No packets returned by host
IPv6 Gateway
No packets returned by host (6 separate occurrences)
Issue not listed here?
Try our AI troubleshooting agent — it can check your website, verify DNS records, test if ports are open (SSH, RDP), and help determine if the issue is on your end or ours.
Automated health checks running every 30 seconds. Web hosting monitors use test WordPress sites — brief unavailability (1-2 min) may occur during auto-updates.