Update – 03/13/2015: The recent failover plan that we mentioned in this announcement for our ecbiz97 server has been completed successfully. All account data from March 5th has been fully recovered. Please note: some accounts have not completed restoration, typically due to their size. If you find that you are missing files, they will re-appear in your account shortly, and this is no cause for alarm. For those who opted to bring information from the temporary server, these requests have been completed as well.
As you may be aware, ecbiz97, the server housing your account, encountered a serious issue on Thursday afternoon, March 5th, which resulted in a prolonged service interruption well beyond what we initially expected. While we have posted updates on our status pages, support center, and social media sites, we would like to take the time to fully disclose what happened on ecbiz97. The purpose of this article is to provide more detail on what specifically occurred, what you can expect over the next few days, and what we are doing to help prevent problems like this from occurring in the future.
The Details
On Wednesday evening, March 4th, our monitoring team reported extreme sluggishness on the server, which was initially traced back to a malfunctioning hard drive. The hard drive was replaced, and the problem subsided until the next morning, when it was reported that the latency had returned. Further inspection showed that the RAID card on the system was also not functioning properly, and an emergency maintenance window was authorized by our Architecture team to have it replaced. However, replacing the bad card revealed a second failing hard drive, one that our monitoring system would have detected had the original RAID card been working properly. This type of situation is extremely unusual and not one that we have experienced in the past.
This server runs several hard drives in a RAID 5 array, which tolerates the failure of only one disk at a time. When two hard drives are non-functional at the same time, it can cause system instability and data loss. In this case, the server failed to boot properly due to a corrupted file system, and we were forced to initiate a filesystem check (fsck) to correct this corruption. A combination of causes, including a failing hard drive, a recently replaced drive that had yet to finish rebuilding, and the legacy filesystem type used on this class of server, caused the fsck to take much longer than originally expected.
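To illustrate why a RAID 5 array survives one failed disk but not two, here is a minimal sketch of single-parity protection in Python. It is a simplified model and not a description of the actual controller on ecbiz97: the block contents, the three-data-disk layout, and the helper names (xor_blocks, parity, rebuild_missing) are all hypothetical. Real arrays also rotate parity across disks, which this sketch ignores.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def parity(data_blocks):
    """Parity block for one stripe: the XOR of all data blocks."""
    return xor_blocks(data_blocks)

def rebuild_missing(surviving_blocks):
    """Rebuild a single missing block from every surviving block (data + parity)."""
    return xor_blocks(surviving_blocks)

# Hypothetical stripe: three data disks plus one parity block.
stripe = [b"disk", b"fail", b"over"]
p = parity(stripe)

# One disk lost: XOR of the survivors and the parity recovers it exactly.
recovered = rebuild_missing([stripe[0], stripe[2], p])
print(recovered == stripe[1])

# Two disks lost: the same operation only yields the XOR of the two
# missing blocks, which is not enough to recover either one.
two_lost = rebuild_missing([stripe[2], p])
print(two_lost == xor_blocks([stripe[0], stripe[1]]))
```

With a single parity block per stripe there is one equation available for recovery, so only one unknown block can be solved for. Once a second disk drops out, the stripe is underdetermined, which is why the array can no longer guarantee the data.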
On Friday, March 6th, our Architecture team made the decision to fail the server over to older backups. There were questions as to why we did not do this sooner. The decision was weighed based on how long the server might be down as opposed to how inconvenient the older data would be for customers on the server. Specifically, websites actively writing data to MySQL databases would be severely impacted, since MySQL data cannot be “merged.” With this in mind, and without a reasonable ETA for how long the fsck would take, during which time the server is effectively offline, we ultimately decided to revert to backups on a temporary server.
On Monday, we were able to bring the physical ecbiz97 server back up with all original data intact. However, with the server largely unresponsive and sluggish, we were hesitant to replace the second failing drive. At this point, our Architecture team decided to mount the filesystem in the server’s “rescue” mode and copy all data to a new server, which again is a time-consuming process considering the state of the original ecbiz97 server.
Within the next 24-36 hours, we plan to power up the new server that contains the files that were in place on ecbiz97 as of Thursday 3/5 at approximately 2pm EST. Before we bring the server online, customers on this server will receive an email that they may reply to if they wish to retain the data that is currently in use. You may choose to retain one of the following:
- All databases that have been live for the last week
- All files that have been live for the last week
- Or both databases and files that have been live for the last week
If we receive no response, we will restore your account to the state it was in on March 5th. (This is the most common option for users on this server.) Any modifications made since March 5th at 2pm will need to be made again.
The temporary server will continue to be available for up to two weeks after the new server is online, in case any data needs to be retrieved. You will have access to this server if needed, but please note that it will no longer house your live website, so no changes should be made to it. Our Support Department will be available to assist with any data migrations that may be needed. The information for both your old server and new server will be provided in a separate email when the transition is complete.
While we understand how this may inconvenience you, we believe this is the best-case scenario when taking into consideration the entire population of the server and the age of the data present on the temporary server.
We would also like to inform you that your new server boasts solid-state drives and additional hardware redundancy to provide protection against the type of failure that ecbiz97 endured. You can learn more about our SSD platform here:
https://www.inmotionhosting.com/ssd-hosting#ssd-hosting
Again, we are very sorry for the frustration and inconvenience caused by this unexpected hardware failure.
We want you to know that we appreciate you being a long-standing and loyal customer, and we want to assure you that we are as dedicated to you and your website as you have been to us. If you have any further questions or concerns, please feel free to contact us. We’re available 24/7 via phone (888-321-4678 OPTION 2), chat, and email.