All Systems Operational
99.998% uptime over 90 days
Latest Post-Mortem
VPS - RL1 Node 2 — I/O Storm Caused by krbd + Network Imbalance
This server was the only one in our cluster still using krbd instead of librbd (QEMU). A network bonding imbalance caused one link to max out, which led to CRC checksum failures on storage network packets — no actual data was lost or corrupted on disk. But because krbd handles storage for all VMs through a single shared kernel client with no retry limit, the CRC errors triggered a retry loop that hit all ~65 VMs at once. With librbd (which all our other nodes use), this would have stayed isolated to individual VMs.
Timeline
- Same issue happened overnight — we fixed it but didn't dig into the root cause. In hindsight, we should have dug deeper then.
- It happened again. The network link was completely maxed out (~14 Gbps attempted on a 10G link), and kernel logs flooded with storage CRC errors. Server load hit 680+.
- Throttled the link to 2 Gbit to break the retry loop — CRC errors stopped immediately.
- Found the root cause: the bond hash was set to layer2 instead of layer3+4, so all inbound traffic was hitting one NIC. Fixed the config.
- Started moving VMs to other nodes.
- All VMs back up. Those that went read-only got a reboot to clear the filesystem state. Done.
Impact
About 65 VMs on Node 2 had their disk I/O stall completely. Some went into read-only mode. We restarted the affected VMs and everything came back — no data was lost.
Root Cause
This server's network bond was set to layer2 hashing, which picks which NIC to use based on MAC address. Since there are only two endpoints — our server and the switch — this meant nearly all incoming traffic ended up on one NIC and all outgoing on the other. Not balanced at all.
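For illustration, this is roughly how a bond with layer3+4 hashing is created with iproute2 (interface names are hypothetical; the node's actual config management may differ):

```shell
# Hypothetical sketch: an 802.3ad bond hashed on layer3+4 (IP + port),
# so different flows spread across both NICs. With the layer2 policy,
# hashing uses MAC addresses only — and with just one server and one
# switch as endpoints, every flow lands on the same link.
ip link add bond0 type bond mode 802.3ad xmit_hash_policy layer3+4
ip link set enp1s0f0 down && ip link set enp1s0f0 master bond0
ip link set enp1s0f1 down && ip link set enp1s0f1 master bond0
ip link set bond0 up
```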
At normal load this didn't matter. But when Ceph storage traffic spiked (probably a scrub job), the receiving NIC hit its 10G ceiling. The NIC's buffer overflowed — over 10 million packets dropped — and incoming network packets failed CRC checksum validation. No data on disk was affected.
Here's where it got bad: when libceph (the kernel storage client) sees a CRC failure, it tears down the connection and retries everything in-flight. Every retry means the storage servers re-read and resend the data, which adds more traffic to the link that's already full, which causes more CRC failures, which triggers more retries. A feedback loop across all 182 storage connections and 78 virtual disks on this server.
The real kicker: libceph has no retry limit and no timeout — this is a known issue in the kernel client. The storm literally cannot stop on its own.
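The dynamic can be sketched with a toy model — the numbers here are illustrative, not measurements from the incident. Traffic above link capacity is dropped, fails CRC, and gets retried in full on top of the steady base load, so once offered load exceeds capacity it only grows:

```shell
# Toy model of the retry feedback loop (illustrative numbers only).
base=12   # steady Ceph traffic in Gbps — already above the link
cap=10    # link capacity in Gbps
load=$base
for step in 1 2 3 4 5; do
    echo "step $step: offered load ${load} Gbps"
    if [ "$load" -gt "$cap" ]; then
        excess=$((load - cap))
    else
        excess=0
    fi
    # Each retry forces the storage side to re-read and resend,
    # roughly doubling the cost of the excess traffic.
    load=$((base + excess * 2))
done
```

With no retry limit or timeout, nothing in this loop ever backs off — which is why throttling the link from outside was the only way to break it.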
This node has 256G RAM and 128 vCPUs — it's used for CPU-heavy VMs that need less RAM, which is why it only has 2x10G networking instead of the 25G or 100G links on our bigger nodes. At normal load that's plenty, but the combination of krbd + imbalanced bonding + smaller pipes made it the one node where this could happen. All our other nodes were already on layer3+4 hashing and librbd. This setting had been wrong since day one.
Resolution
- Throttled the link with `tc` to 2 Gbit — broke the retry loop, CRC errors stopped immediately
- Fixed the bond hash from `layer2` to `layer3+4` so traffic gets spread across both NICs
- Moved VMs to other nodes
- Rebooted VMs that went read-only
- Rebooted the node to clear stale Ceph connections
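The throttle in the first step can be sketched with a token-bucket qdisc — interface name and the burst/latency values are assumptions, not the exact command used:

```shell
# Hypothetical sketch: cap traffic on the bonded interface at 2 Gbit
# with a token bucket filter, leaving enough headroom under the 10G
# ceiling for in-flight retries to drain instead of compounding.
tc qdisc add dev bond0 root tbf rate 2gbit burst 1mb latency 50ms

# Remove the cap once the retry storm has cleared:
# tc qdisc del dev bond0 root
```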
Preventive Measures
- All nodes confirmed on layer3+4 bond hash — this was the only one still on layer2
- Migrating this node from krbd to librbd (QEMU) — with librbd each VM handles its own storage connections, so one bad connection can't take down every VM on the node. All our other clusters already run librbd
- Reviewing per-VM I/O and network limits to make sure no single VM can saturate a link
- Looking into dedicated storage networking (separate NICs for Ceph) to keep VM traffic and storage traffic apart
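For contrast with krbd, a sketch of how a disk attaches through librbd in QEMU (pool and image names hypothetical) — each VM process then owns its own Ceph connections instead of sharing the kernel's:

```shell
# Hypothetical sketch: with librbd, the QEMU process itself speaks to
# Ceph, so a stuck or retrying connection affects only this one VM,
# not every VM on the node.
qemu-system-x86_64 \
  -m 4096 \
  -drive format=raw,if=virtio,file=rbd:vms/vm-disk-1:conf=/etc/ceph/ceph.conf
```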
Scheduled Maintenance
VPS - RL1 — Migrate from krbd to QEMU librbd
VMs will be live-migrated to another server already running librbd, then migrated back. No downtime — same process as the datacenter migration.
Website & Portal
agent
API
cloud.hostup.se
Customer portal
hostup.se
webmail
Web Hosting - cPanel
delta
Test site
lambda
Test site
mu
Test site
omega
Test site
pi
Test site
srv11
High-frequency cPanel
Web Hosting - ApisCP (Legacy)
epsilon
Test site
eta
Test site
orion
Test site
theta
Test site
zeta
Test site
VPS - RL1
Stockholm Älvsjö datacenter
High Frequency Ryzen 9950x
High-performance compute
IPv4 Gateway
IPv4 routing
IPv6 Gateway
IPv6 routing
Node 0
Legacy node
Node 12
HA cluster node
Node 13
HA cluster node
Node 16
HA cluster node
Node 2
Hypervisor
Node 23
HA cluster node
Node 24
HA cluster node
Node 25
HA cluster node
Node 26
HA cluster node
Node 3
Snapshot storage
Node 4
HA cluster node
Node 5
HA cluster node
Node 6
HA cluster node
Node 7
HA cluster node
Node 8
HA cluster node
Node 9
HA cluster node
VPS - RL2
Stockholm Älvsjö datacenter
IPv4 Gateway
IPv4 routing
IPv6 Gateway
IPv6 routing
Node 1
Hypervisor
Node 2
Hypervisor
Node 3
Hypervisor
Node 4
Hypervisor
Node 5
Hypervisor
Node 6
Hypervisor
Node 7
Hypervisor
Node 8
Hypervisor
DNS
Cloudflare whitelabel anycast nameservers
primary.ns.hostup.se
Cloudflare anycast
secondary.ns.hostup.se
Cloudflare anycast
Past Incidents
March 2026
Node 2 (RL1)
No packets returned by host
Node 6 (RL1)
No packets returned by host
IPv6 Gateway
No packets returned by host
IPv6 Gateway
No packets returned by host
IPv6 Gateway
No packets returned by host
IPv6 Gateway
No packets returned by host
IPv6 Gateway
No packets returned by host
IPv6 Gateway
No packets returned by host
Issue not listed here?
Try our AI troubleshooting agent — it can check your website, verify DNS records, test if ports are open (SSH, RDP), and help determine if the issue is on your end or ours.
Automated health checks running every 30 seconds. Web hosting monitors use test WordPress sites — brief unavailability (1-2 min) may occur during auto-updates.