Hardware failures are a fact of life, and they will happen to everyone at one stage or another.
They can happen to any service we operate.
Who is responsible?
If the machine is a shared game server, it is our hardware and we will take care of any replacements.
The most common failures occur in hardware that has moving parts. In most servers the main moving part is the hard drive, unless the server uses solid state drives.
Solid state drives, whilst faster and generally more reliable, have a limited lifespan as well. It varies with age, size, manufacturer, and how many times the drive has been written to.
With both HDDs and SSDs, you can read S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology) data, which can help indicate issues.
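As a rough illustration of reading S.M.A.R.T. data, the sketch below asks the smartctl tool for a JSON report and flags a few attributes that are normally zero on a healthy drive. It assumes smartmontools 7.0 or newer (for JSON output) and an ATA/SATA drive. The watchlist, the function names and the /dev/sda device path are illustrative assumptions, not a recommendation for any particular setup.

```python
import json
import subprocess

# Attributes whose raw count is normally zero on a healthy drive.
# This watchlist is an illustrative assumption, not an exhaustive list.
WATCHLIST = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}


def read_smart_report(device: str) -> dict:
    """Run smartctl and return its JSON report as a dict (usually needs root)."""
    result = subprocess.run(
        ["smartctl", "--json", "-H", "-A", device],
        capture_output=True,
        text=True,
        check=False,  # smartctl also uses non-zero exit codes to flag problems
    )
    return json.loads(result.stdout)


def warning_signs(report: dict) -> list[str]:
    """Return human-readable warnings found in a parsed smartctl report."""
    warnings = []
    # Overall health verdict, if the drive reported one.
    if report.get("smart_status", {}).get("passed") is False:
        warnings.append("overall health check failed")
    # Per-attribute table (present for ATA/SATA drives).
    table = report.get("ata_smart_attributes", {}).get("table", [])
    for attribute in table:
        raw_value = attribute.get("raw", {}).get("value", 0)
        if attribute.get("name") in WATCHLIST and raw_value > 0:
            warnings.append(f"{attribute['name']} = {raw_value}")
    return warnings


if __name__ == "__main__":
    # "/dev/sda" is a placeholder; substitute the device you want to check.
    for line in warning_signs(read_smart_report("/dev/sda")):
        print("WARNING:", line)
```

An empty result does not prove the drive is healthy. As noted elsewhere in this article, some failures give no warning at all.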
Faulty RAM is easily dealt with, as it can simply be replaced.
Motherboards can fail, and this can require a full rebuild and reinstall.
A power outage, though very rare (datacentres have backup power), can cause catastrophic data loss and failures.
A power outage can also affect just a single machine: if its PSU (power supply unit) fails for any reason, it can have the same effect as any other outage.
It is important to note that power outages are very rare.
In some cases, failures cannot be reliably detected before they happen. With others, you can get hints and evidence that a component may be struggling or starting to fail: slow or sluggish response to commands, files failing to open or save, or anything else that is not considered normal.
It is important to note that some HDDs are just naturally slow (older and smaller ones typically fall into this category), and ageing components of any kind can show signs of their age, so factor this in before concluding that something is failing.
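One way to tell "naturally slow" from "getting worse" is to compare a drive against its own earlier behaviour rather than against a fixed number. The sketch below is a minimal example of that idea: it times a small write that is forced to disk and flags it only if it is several times slower than the median of earlier samples. The function names, the 1 MB payload and the factor of 3 are arbitrary assumptions chosen for illustration.

```python
import os
import statistics
import tempfile
import time


def timed_write(directory: str, size_bytes: int = 1_000_000) -> float:
    """Write size_bytes to a temporary file in directory and return the seconds taken."""
    payload = os.urandom(size_bytes)
    with tempfile.NamedTemporaryFile(dir=directory) as handle:
        start = time.perf_counter()
        handle.write(payload)
        handle.flush()
        os.fsync(handle.fileno())  # force the data out of the OS cache to the drive
        elapsed = time.perf_counter() - start
    return elapsed


def is_abnormal(sample: float, baseline: list[float], factor: float = 3.0) -> bool:
    """True if sample exceeds factor times the median of a non-empty baseline."""
    return sample > factor * statistics.median(baseline)


if __name__ == "__main__":
    # In practice the baseline would be recorded over days or weeks;
    # five back-to-back samples are used here only to keep the demo short.
    baseline = [timed_write(".") for _ in range(5)]
    latest = timed_write(".")
    print(f"latest write: {latest:.4f}s, abnormal: {is_abnormal(latest, baseline)}")
```

A single slow sample can simply mean the machine was busy at that moment, so a repeated pattern is a better signal than one reading.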
It is your responsibility to arrange and keep backups of your services. This way you can ensure you have the data you need should you ever need it.
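As a starting point for a backup routine, the sketch below packs a folder into a timestamped .tar.gz archive using only the Python standard library. The function name and both example paths are hypothetical placeholders; adapt them to wherever your service keeps its data.

```python
import tarfile
from datetime import datetime, timezone
from pathlib import Path


def make_backup(source_dir: str, dest_dir: str) -> Path:
    """Archive source_dir into a timestamped .tar.gz inside dest_dir."""
    source = Path(source_dir).resolve()
    destination = Path(dest_dir)  # should be outside source_dir
    destination.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    archive_path = destination / f"{source.name}-{stamp}.tar.gz"
    with tarfile.open(archive_path, "w:gz") as archive:
        # arcname keeps paths inside the archive relative to the folder name.
        archive.add(str(source), arcname=source.name)
    return archive_path


if __name__ == "__main__":
    # Both paths are placeholders for illustration only.
    print(make_backup("/home/gameserver/world", "/mnt/backups"))
```

An archive stored on the same drive as the original is lost along with it if that drive fails, so copy the result to a different machine or storage service.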