All Systems Operational
99.998% uptime over 90 days
Latest Post-Mortem
VPS - RL1 Node 6 — Intermittent Storage Latency Due to DAC Cable Fault
Node 6 experienced intermittent high storage latency caused by a faulty DAC cable on one port of the Mellanox ConnectX NIC. The resulting port flapping on the LACP bond disrupted Ceph OSD connectivity, causing brief I/O stalls for VMs on this node.
Timeline
- mlx5_core lag map begins flapping between port 1 and port 2 every few seconds
- Ceph monitor sessions lost, osd4 and osd6 marked down. MDS caps go stale. rbd watch errors (-107) on VM disks
- Repeated cycles of port flapping, OSD bounce (osd4/osd6 down→up), monitor session hunting, and rbd watch errors approximately every 5-10 minutes
- bond0 slave enp65s0f0np0 marked "link status definitely down" — bond fails over to remaining port
- Continued intermittent OSD flapping as the faulty port attempts to recover
- Faulty port disabled with ifdown enp65s0f0np0 — all traffic stable on remaining port. Issue resolved
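The flap rate in the timeline above can be confirmed from the kernel log. A minimal sketch, run against illustrative log lines modeled on mlx5_core lag-map messages (exact wording varies by driver version; a real system would read `journalctl -k` or `dmesg` instead of the sample variable):

```shell
# Hypothetical kernel log excerpt -- on a live node, substitute:
#   log=$(journalctl -k --since "-10 min")
log='kernel: mlx5_core 0000:41:00.0: lag map port 1:1 port 2:2
kernel: mlx5_core 0000:41:00.0: lag map port 1:2 port 2:2
kernel: mlx5_core 0000:41:00.0: lag map port 1:1 port 2:2
kernel: bond0: (slave enp65s0f0np0): link status definitely down, disabling slave'

# Count lag-map rebalance events; a high count over a short window
# points at a flapping member port rather than a one-off failover.
printf '%s\n' "$log" | grep -c 'lag map'
```

Watching this count over a few minutes distinguishes a clean single failover (one or two events) from the continuous 3-4 second rebalancing seen here.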
Impact
VMs on Node 6 experienced brief periods of elevated storage latency and intermittent I/O stalls as Ceph OSDs (osd4, osd6) bounced up and down. No complete outage — the LACP bond with dual switches kept the node online throughout, but performance was degraded.
Root Cause
A faulty DAC (Direct Attach Copper) cable on port enp65s0f0np0 of the dual-port Mellanox ConnectX NIC caused continuous link flapping. The mlx5_core driver repeatedly rebalanced the LACP lag map between the two ports every 3-4 seconds, disrupting established Ceph connections.
Each port flap caused the Ceph client on Node 6 to lose its monitor sessions and mark osd4 and osd6 as down. While the OSDs recovered within seconds each time, the repeated cycling caused rbd watch errors (-107 ENOTCONN) and MDS capability timeouts — resulting in elevated I/O latency for VMs whose disks were served by these OSDs.
The LACP bond across two separate switches prevented a full outage — traffic continued flowing through the healthy port — but the constant rebalancing between a good and bad port created the intermittent disruption pattern seen in the logs.
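The down→up bounce cycles described above can be quantified from the cluster log. A minimal sketch over hypothetical, simplified log lines (on a real cluster, `ceph -w` or the monitor's cluster log supplies the equivalent events in a more verbose format):

```shell
# Hypothetical simplified monitor log excerpt; real entries carry
# timestamps and epoch details, but the osd.N down/up pattern is the same.
cephlog='osd.4 down
osd.4 up
osd.6 down
osd.6 up
osd.4 down
osd.4 up'

# Count "down" events per OSD -- repeated bounces on the same OSDs
# implicate a shared network path rather than the disks themselves.
printf '%s\n' "$cephlog" | awk '/down/ {d[$1]++} END {for (o in d) print o, d[o]}' | sort
```

That only osd4 and osd6 bounce, while other OSDs stay up, is what pointed at this node's network path rather than a cluster-wide problem.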
Resolution
Disabled the faulty port with ifdown enp65s0f0np0, leaving the bond running on the remaining healthy port. All Ceph connections stabilized immediately. The DAC cable will be replaced on the next datacenter visit.
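After dropping the faulty slave, bond health can be verified from `/proc/net/bonding/bond0`. A minimal sketch parsing a hypothetical excerpt of that file (field layout follows the kernel bonding driver's output; only the lines relevant here are shown):

```shell
# Hypothetical excerpt -- on a live node, substitute:
#   bondstate=$(cat /proc/net/bonding/bond0)
bondstate='Slave Interface: enp65s0f0np0
MII Status: down
Slave Interface: enp65s0f1np1
MII Status: up'

# Print each slave with its link state; the healthy port should be "up"
# and the disabled faulty port "down" until the DAC cable is replaced.
printf '%s\n' "$bondstate" | awk '/Slave Interface/ {s=$3} /MII Status/ {print s, $3}'
```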
Preventive Measures
- Replace the faulty DAC cable on Node 6 at next datacenter visit
- Benefit of dual-switch LACP confirmed — node stayed online throughout despite a complete port failure
Scheduled Maintenance
Website & Portal
agent
API
cloud.hostup.se
Customer portal
hostup.se
webmail
Web Hosting - cPanel
delta
Test site
lambda
Test site
mu
Test site
omega
Test site
pi
Test site
srv11
High-frequency cPanel
Web Hosting - ApisCP (Legacy)
epsilon
Test site
eta
Test site
orion
Test site
theta
Test site
zeta
Test site
VPS - RL1
Stockholm Älvsjö datacenter
High Frequency Ryzen 9950x
High-performance compute
IPv4 Gateway
IPv4 routing
IPv6 Gateway
IPv6 routing
Node 0
Legacy node
Node 12
HA cluster node
Node 13
HA cluster node
Node 16
HA cluster node
Node 2
Hypervisor
Node 23
HA cluster node
Node 24
HA cluster node
Node 25
HA cluster node
Node 26
HA cluster node
Node 3
Snapshot storage
Node 4
HA cluster node
Node 5
HA cluster node
Node 6
HA cluster node
Node 7
HA cluster node
Node 8
HA cluster node
Node 9
HA cluster node
VPS - RL2
Stockholm Älvsjö datacenter
IPv4 Gateway
IPv4 routing
IPv6 Gateway
IPv6 routing
Node 1
Hypervisor
Node 2
Hypervisor
Node 3
Hypervisor
Node 4
Hypervisor
Node 5
Hypervisor
Node 6
Hypervisor
Node 7
Hypervisor
Node 8
Hypervisor
DNS
Cloudflare whitelabel anycast nameservers
primary.ns.hostup.se
Cloudflare anycast
secondary.ns.hostup.se
Cloudflare anycast
Past Incidents
March 2026
Node 3 (RL2)
No packets returned by host
February 2026
eta
Status 521
mu
Timeout (no headers received)
lambda
Status 502
mu
Couldn't connect to server
Node 1 (RL2)
No packets returned by host
January 2026
Node 3 (RL2)
No packets returned by host
Node 4 (RL2)
No packets returned by host
Node 6 (RL2)
No packets returned by host
Issue not listed here?
Try our AI troubleshooting agent — it can check your website, verify DNS records, test if ports are open (SSH, RDP), and help determine if the issue is on your end or ours.
Automated health checks running every 30 seconds. Web hosting monitors use test WordPress sites — brief unavailability (1-2 min) may occur during auto-updates.