HostUp Status

All Systems Operational

99.998% uptime over 90 days

Latest Post-Mortem

26 March 2026 · I/O Degradation · 26 min

VPS - RL1 Node 2 — I/O Storm Caused by krbd + Network Imbalance

This server was the only one in our cluster still using krbd instead of librbd (QEMU). A network bonding imbalance caused one link to max out, which led to CRC checksum failures on storage network packets — no actual data was lost or corrupted on disk. But because krbd handles storage for all VMs through a single shared kernel client with no retry limit, the CRC errors triggered a retry loop that hit all ~65 VMs at once. With librbd (which all our other nodes use), this would have stayed isolated to individual VMs.
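
For readers unfamiliar with the distinction, here is a rough sketch of how the two attachment paths differ. Pool, image, and device names are placeholders, not our actual configuration.

    # krbd: the host kernel maps the RBD image to a local block device,
    # and every image mapped this way shares one kernel Ceph client.
    rbd map vm-pool/vm-100-disk-0          # appears as e.g. /dev/rbd0
    qemu-system-x86_64 -drive file=/dev/rbd0,format=raw,if=virtio ...

    # librbd: QEMU opens the image over its own userspace Ceph connection,
    # so each VM process has an independent client and its own retries.
    qemu-system-x86_64 -drive file=rbd:vm-pool/vm-100-disk-0:conf=/etc/ceph/ceph.conf,format=raw,if=virtio ...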

Timeline
00:36

Same issue happened overnight — we fixed it but didn't dig into the root cause. In hindsight, we should have dug deeper right then.

17:40

It happened again. The network link was completely maxed out (~14 Gbps on a 10G link), kernel logs flooded with storage CRC errors. Server load hit 680+

17:45

Throttled the link to 2 Gbit to break the retry loop — CRC errors stopped immediately

17:50

Found the root cause: bond hash was set to layer2 instead of layer3+4, so all inbound traffic was hitting one NIC. Fixed the config

17:55

Started moving VMs to other nodes

18:06

All VMs back up. Those that went read-only got a reboot to clear the filesystem state. Done

Impact

About 65 VMs on Node 2 had their disk I/O stall completely. Some went into read-only mode. We restarted the affected VMs and everything came back — no data was lost.

Root Cause

This server's network bond was set to layer2 hashing, which picks which NIC to use based on MAC address. Since there are only two endpoints — our server and the switch — this meant nearly all incoming traffic ended up on one NIC and all outgoing on the other. Not balanced at all.
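
For reference, on a Linux bond the transmit hash policy can be inspected and changed roughly like this. bond0 is an assumed interface name, and how the change is made persistent depends on the network configuration tooling in use.

    # Show the current policy (reports e.g. "Transmit Hash Policy: layer2 (0)")
    grep -i "hash policy" /proc/net/bonding/bond0

    # Switch to layer3+4 (IP address + port based hashing) at runtime
    echo layer3+4 > /sys/class/net/bond0/bonding/xmit_hash_policy

    # Persistent ifupdown-style equivalent: bond-xmit-hash-policy layer3+4
    # Note: this policy governs traffic sent from the host; on an LACP bond
    # the switch applies its own hash for the inbound direction.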

At normal load this didn't matter. But when Ceph storage traffic spiked (probably a scrub job), the receiving NIC hit its 10G ceiling. The NIC's buffer overflowed — over 10 million packets dropped — and incoming network packets failed CRC checksum validation. No data on disk was affected.
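
As a rough illustration of how this shows up on the host (interface names are placeholders and exact counter names vary by driver):

    # Kernel-level RX drop counters for a bond member
    ip -s link show enp65s0f0

    # Driver/firmware counters, including ring buffer overruns
    ethtool -S enp65s0f0 | grep -iE "drop|discard|no_buffer"

    # Compare byte counters on the two bond members to see the imbalance
    grep -H . /sys/class/net/enp65s0f0/statistics/rx_bytes \
              /sys/class/net/enp65s0f1/statistics/rx_bytes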

Here's where it got bad: when libceph (the kernel storage client) sees a CRC failure, it tears down the connection and retries everything in-flight. Every retry means the storage servers re-read and resend the data, which adds more traffic to the link that's already full, which causes more CRC failures, which triggers more retries. A feedback loop across all 182 storage connections and 78 virtual disks on this server.

The real kicker: libceph has no retry limit and no timeout — this is a known issue in the kernel client. The storm literally cannot stop on its own.
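
One way to see why a single bad connection affects everything: all krbd-mapped disks on a host show up under the same kernel client. A rough way to inspect this (assumes root and a mounted debugfs; the directory name is the cluster fsid plus client id):

    # All RBD images mapped through the kernel on this host
    rbd showmapped

    # The kernel client's view: a single osdc file lists the in-flight
    # requests for every mapped image at once
    ls /sys/kernel/debug/ceph/
    cat /sys/kernel/debug/ceph/*/osdc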

This node has 256G RAM and 128 vCPUs — it's used for CPU-heavy VMs that need less RAM, which is why it only has 2x10G networking instead of the 25G or 100G links on our bigger nodes. At normal load that's plenty, but the combination of krbd + imbalanced bonding + smaller pipes made it the one node where this could happen. All our other nodes were already on layer3+4 hashing and librbd. This setting had been wrong since day one.

Resolution
  1. Throttled the link with tc to 2 Gbit (sketched after this list) — broke the retry loop, CRC errors stopped immediately
  2. Fixed the bond hash from layer2 to layer3+4 so traffic gets spread across both NICs
  3. Moved VMs to other nodes
  4. Rebooted VMs that went read-only
  5. Rebooted the node to clear stale Ceph connections
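
The emergency throttle in step 1 was along these lines. This is an illustrative sketch only (token-bucket shaping on an assumed bond0 interface), not our exact command; note that tc shapes egress, while policing or an ifb device would be needed for the inbound direction.

    # Cap the bond to 2 Gbit/s with a token bucket filter
    tc qdisc add dev bond0 root tbf rate 2gbit burst 8mb latency 50ms

    # Remove the cap once the retry storm has died down
    tc qdisc del dev bond0 root
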
Preventive Measures
  • All nodes confirmed on layer3+4 bond hash — this was the only one still on layer2
  • Migrating this node from krbd to librbd (QEMU) — with librbd each VM handles its own storage connections, so one bad connection can't take down every VM on the node. All our other clusters already run librbd
  • Reviewing per-VM I/O and network limits to make sure no single VM can saturate a link
  • Looking into dedicated storage networking (separate NICs for Ceph) to keep VM traffic and storage traffic apart

Scheduled Maintenance

Planned · No downtime

VPS - RL1 — Migrate from krbd to QEMU librbd

VMs will be live-migrated to another server already running librbd, then migrated back. No downtime — same process as the datacenter migration.
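
As a rough sketch of what the switch involves (Proxmox VE style commands shown purely for illustration; storage name, VM id, and node names are placeholders):

    # Turn off kernel RBD for the storage definition
    pvesm set ceph-vm --krbd 0

    # Live-migrate a VM away and back; after the round trip its disks are
    # attached through QEMU's librbd driver instead of a /dev/rbdX device
    qm migrate 100 node6 --online
    qm migrate 100 node2 --online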

Website & Portal

  • agent: Operational
  • API: Operational
  • cloud.hostup.se (Customer portal): Operational
  • hostup.se: Operational
  • webmail: Operational

Web Hosting - cPanel

  • delta (Test site): Operational
  • lambda (Test site): Operational
  • mu (Test site): Operational
  • omega (Test site): Operational
  • pi (Test site): Operational
  • srv11 (High-frequency cPanel): Operational

Web Hosting - ApisCP (Legacy)

  • epsilon (Test site): Maintenance
  • eta (Test site): Operational
  • orion (Test site): Operational
  • theta (Test site): Operational
  • zeta (Test site): Operational

VPS - RL1

Stockholm Älvsjö datacenter

  • High Frequency Ryzen 9950x (High-performance compute): Operational
  • IPv4 Gateway (IPv4 routing): Operational
  • IPv6 Gateway (IPv6 routing): Operational
  • Node 0 (Legacy node): Operational
  • Node 2 (Hypervisor): Maintenance
  • Node 3 (Snapshot storage): Operational
  • Node 4 (HA cluster node): Operational
  • Node 5 (HA cluster node): Operational
  • Node 6 (HA cluster node): Operational
  • Node 7 (HA cluster node): Operational
  • Node 8 (HA cluster node): Operational
  • Node 9 (HA cluster node): Operational
  • Node 12 (HA cluster node): Operational
  • Node 13 (HA cluster node): Operational
  • Node 16 (HA cluster node): Operational
  • Node 23 (HA cluster node): Operational
  • Node 24 (HA cluster node): Operational
  • Node 25 (HA cluster node): Operational
  • Node 26 (HA cluster node): Operational

VPS - RL2

Stockholm Älvsjö datacenter

  • IPv4 Gateway (IPv4 routing): Operational
  • IPv6 Gateway (IPv6 routing): Operational
  • Node 1 (Hypervisor): Operational
  • Node 2 (Hypervisor): Operational
  • Node 3 (Hypervisor): Operational
  • Node 4 (Hypervisor): Operational
  • Node 5 (Hypervisor): Operational
  • Node 6 (Hypervisor): Operational
  • Node 7 (Hypervisor): Operational
  • Node 8 (Hypervisor): Operational

DNS

Cloudflare whitelabel anycast nameservers

  • primary.ns.hostup.se (Cloudflare anycast): Operational
  • secondary.ns.hostup.se (Cloudflare anycast): Operational

Past Incidents

March 2026

  • 26 Mar 2026, 18:31 · Node 2 (RL1) · No packets returned by host · Resolved (duration: 0 min)
  • 26 Mar 2026, 03:47 · Node 6 (RL1) · No packets returned by host · Resolved (duration: 12 min)
  • 18 Mar 2026, 14:13 · IPv6 Gateway · No packets returned by host · Resolved (duration: 0 min)
  • 18 Mar 2026, 14:11 · IPv6 Gateway · No packets returned by host · Resolved (duration: 0 min)
  • 18 Mar 2026, 13:48 · IPv6 Gateway · No packets returned by host · Resolved (duration: 0 min)
  • 18 Mar 2026, 12:54 · IPv6 Gateway · No packets returned by host · Resolved (duration: 0 min)
  • 18 Mar 2026, 10:11 · IPv6 Gateway · No packets returned by host · Resolved (duration: 0 min)
  • 18 Mar 2026, 09:56 · IPv6 Gateway · No packets returned by host · Resolved (duration: 0 min)

Issue not listed here?

Try our AI troubleshooting agent — it can check your website, verify DNS records, test if ports are open (SSH, RDP), and help determine if the issue is on your end or ours.

Automated health checks run every 30 seconds. Web hosting monitors use test WordPress sites; brief unavailability (1-2 min) may occur during auto-updates.