Need assistance solving unexpected and random I/O errors in homelab

tintedundercollar12@sopuli.xyz · 4 days ago

sylver_dragon@lemmy.world · 4 days ago

With intermittent errors like that, I’d work through the following test plan:

1. Check for disk errors - You already did this with the SMART tools.
2. Check for memory errors - Boot memtest86 from a USB drive and run the test.
3. Check for overheating issues - Thermal paste does wear out, so check your logs for overheating warnings.
4. Power issues - Is the system powered straight from the wall or through a surge protector? While it’s less of an issue these days, AC power coming from the wall should have a consistent sine wave. If that wave isn’t consistent, it can cause a voltage ripple on the DC side of the power supply. This can lead to all kinds of weird fuckery. A good surge protector (or UPS) will usually filter out most of the AC inconsistencies.
5. Power Supply - Similar to above, if the power supply is marginal or starting to fail, it can cause issues. If you have a spare one, try swapping it in and seeing if the errors continue.
6. Processor failure - If you have a spare processor which will fit the motherboard, you could try swapping that in and seeing if the errors continue.
7. Motherboard failure - Same type of thing. If you have a spare, swap it in and see if the errors continue.
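
If it helps with the log-checking parts of that plan, here is a rough Python sketch that buckets kernel log lines into the categories above. The regex patterns and the `classify` helper are my own assumptions based on messages Linux kernels commonly print, not an exhaustive or authoritative list, so adjust them to whatever your system actually logs.

```python
import re
from collections import Counter

# Assumed patterns, based on commonly seen Linux kernel messages.
# They are illustrative only; tune them for your own hardware and distro.
PATTERNS = {
    "disk": re.compile(r"I/O error|medium error|ata\d+.*(failed command|exception)", re.I),
    "memory": re.compile(r"EDAC|machine check|\bmce\b", re.I),
    "thermal": re.compile(r"temperature above threshold|thermal.*throttl|clock throttled", re.I),
    "power": re.compile(r"under-?voltage|power (loss|fail)", re.I),
}


def classify(lines):
    """Count how many log lines match each suspected failure category."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

You could save the output of `dmesg` or `journalctl -k` to a file and pass the open file to `classify`. A pile of hits in one bucket is only a hint about which test to run first, not a diagnosis.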

At this point, you’ll have tested basically everything and likely found the error. With errors like this, I’ve rarely had to go past the first two tests (drive/RAM failure), and the third (heat) picks up the majority of the rest. Power issues I’ve only ever seen in old buildings with electrical systems which probably wouldn’t pass an inspection. That said, bad power can cause other hardware failures, which is one reason to have a surge protector in line at all times anyway.