This guide covers failure and maintenance scenarios that can occur in the
OpenStack cloud and what to do to address them.
Things can and will go wrong, so it is good to be prepared for these events.
What should be done if a hardware node fails?
If a hardware node fails or needs to come down for maintenance, you should
know what steps to take. Depending on the maintenance required, you will either
need the assistance of our data center staff or you can do the maintenance
yourself.
This section describes the steps needed to take a hardware compute node out of
the cloud in the event work needs to be done on it or the cloud needs to be
reduced in size.
NOTE – If a node or nodes need maintenance that requires a hardware
modification, create a ticket from the Flex Metal Central control panel so
our data center staff can perform that task for you.
For official documentation on this subject, see OpenStack’s
Compute Node Failures and Maintenance guide.
The general workflow for bringing a compute node down involves first
disabling that node, finding the instances on that node, migrating those
instances to another node, and removing any Ceph Object Storage Daemons
(OSDs). Optionally, you can migrate the instances back to
the original node when the maintenance is done.
OpenStackClient will be required to perform the maintenance.
Procedure for removing a compute node
Start by disabling the nova-compute service on the appropriate node:
$ openstack compute service set --disable --disable-reason maintenance COMPUTE_NODE_NAME nova-compute
List the instances on that node:
$ openstack server list --host COMPUTE_NODE_NAME --all-projects
Migrate the instances to another node:
$ openstack server migrate INSTANCE_UUID --live-migration
NOTE – This deployment of OpenStack uses Ceph as the shared storage backend, so
there is no need to pass the --block-migration flag when live-migrating
instances.
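The three commands above can be chained into one helper. The following is a minimal sketch, not an official procedure: `drain_node` is a hypothetical function name, and the `-f value -c ID` output options are assumed to be available in your OpenStackClient version so that only instance UUIDs are printed.

```shell
#!/usr/bin/env bash
# Sketch: drain a compute node before maintenance.
# drain_node is a hypothetical helper name, not part of OpenStackClient.

drain_node() {
    local node="$1"
    if [ -z "$node" ]; then
        echo "usage: drain_node COMPUTE_NODE_NAME" >&2
        return 2
    fi

    # Disable nova-compute so no new instances are scheduled to the node.
    openstack compute service set --disable --disable-reason maintenance \
        "$node" nova-compute || return 1

    # "-f value -c ID" prints only the instance UUIDs, one per line.
    local uuids
    uuids=$(openstack server list --host "$node" --all-projects -f value -c ID) || return 1

    # Request a live migration for each instance; keep going on failure.
    local uuid failed=0
    for uuid in $uuids; do
        echo "Migrating $uuid"
        if ! openstack server migrate "$uuid" --live-migration; then
            echo "Migration request failed for $uuid" >&2
            failed=1
        fi
    done
    return "$failed"
}
```

Usage would look like `drain_node COMPUTE_NODE_NAME`. The migrate command only submits the request, so it is worth re-running the `openstack server list --host` command and confirming the list is empty before stopping any services on the node.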
Because OpenStack has been deployed using Kolla Ansible, each OpenStack
service runs in a docker container.
Stop the nova_compute docker container:
# docker stop nova_compute
Perform the needed maintenance, and then restart the nova_compute service:
# docker start nova_compute
Confirm the nova_compute docker container is running:
# docker ps | grep nova_compute
286e1b2e2ae5 kolla/centos-binary-nova-compute:train-centos8 "dumb-init --single-…" 2 months ago Up 18 minutes nova_compute
Finally, verify the nova service has connected to the messaging service:
$ grep AMQP /var/log/kolla/nova/nova-compute.log
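The two post-maintenance checks can be combined into a single function. This is a rough sketch under stated assumptions: `check_nova_compute` is a hypothetical name, the log path defaults to the one used above, and the presence of an AMQP line is treated only as a hint, so the printed line should still be read by the operator.

```shell
#!/usr/bin/env bash
# Sketch: confirm nova_compute is running and has logged AMQP activity.
# check_nova_compute is a hypothetical helper name.

check_nova_compute() {
    # Optional first argument overrides the log path used in this guide.
    local log="${1:-/var/log/kolla/nova/nova-compute.log}"

    # List running container names and look for an exact match.
    if ! docker ps --format '{{.Names}}' | grep -qx nova_compute; then
        echo "nova_compute container is not running" >&2
        return 1
    fi

    if ! grep -q AMQP "$log"; then
        echo "No AMQP messages found in $log" >&2
        return 2
    fi

    # Print the most recent AMQP line so it can be inspected by hand.
    grep AMQP "$log" | tail -n 1
}
```

Run as root on the compute node, for example `check_nova_compute`. A return code of 1 means the container is not running and 2 means the log contains no AMQP messages at all.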
There are times when unplanned maintenance is required. This section
describes what can be done in the event a compute node goes down unexpectedly.
The primary concern is that instances associated with the failed compute node
will no longer work.
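A first step after an unexpected failure is finding out which node is down and which instances are affected. The sketch below is one possible way to do that with OpenStackClient; the function names are hypothetical, and it assumes the `Host` and `State` columns of `openstack compute service list` are available in your client version.

```shell
#!/usr/bin/env bash
# Sketch: find compute nodes reported as down and the instances on them.
# Both function names are hypothetical.

down_compute_hosts() {
    # Prints "HOST STATE" pairs; keep only hosts whose state is "down".
    openstack compute service list --service nova-compute \
        -f value -c Host -c State |
        awk '$2 == "down" { print $1 }'
}

instances_on_down_hosts() {
    local host
    for host in $(down_compute_hosts); do
        echo "== $host =="
        openstack server list --host "$host" --all-projects -f value -c ID -c Name
    done
}
```

Calling `instances_on_down_hosts` prints each down node followed by the ID and name of every instance assigned to it.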
Ceph failure scenarios and recovery
Ceph by nature is resilient to hardware failure and self-healing.
The primary concern with Ceph is failed hard drives. How can an operator
be alerted to a failed hard drive? Will Ceph continue to function if a drive
fails?
Generally, Ceph will continue to function if a drive is lost; however, the
drive should be replaced as soon as possible.
How do you know if a hard drive has failed?
Currently there is no monitoring for failed Ceph drives; however, the intention
is to monitor for these events in the future. Until then, it is recommended
that you put drive monitoring in place. Software such as Icinga or
Nagios is a viable option for this.
If it is suspected a drive has failed, you should first determine if this
really is the case.
The overall procedure for determining if a drive has failed is to:
- Check Ceph health
- See if the OSD associated with the drive in question can be started if it is
  down
- Check the OSD’s mount point using `df -h`
- Run `smartctl` on the drive in question
The following explains these steps in more detail.
From one of the hardware nodes, perform the following checks:
Check if Ceph is healthy:
# ceph health
Find the location of the OSD within the CRUSH map:
# ceph osd tree | grep -i down
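The two checks above can be scripted so that the names of down OSDs are extracted rather than read by eye. This is a minimal sketch with hypothetical function names; it assumes `ceph osd tree` prints the OSD name (such as `osd.1`) immediately followed by its status, which is the usual column order.

```shell
#!/usr/bin/env bash
# Sketch: report Ceph health and list OSDs marked down.
# Function names are hypothetical.

down_osds() {
    # Find fields that look like "osd.N" and check the field after them.
    ceph osd tree | awk '{
        for (i = 1; i < NF; i++)
            if ($i ~ /^osd\.[0-9]+$/ && $(i + 1) == "down")
                print $i
    }'
}

ceph_quick_check() {
    local health
    health=$(ceph health) || return 1
    echo "$health"
    case "$health" in
        HEALTH_OK*) return 0 ;;
    esac
    # Anything other than HEALTH_OK: show which OSDs are down, if any.
    echo "OSDs marked down:"
    down_osds
    return 1
}
```

Running `ceph_quick_check` from a hardware node returns 0 only when Ceph reports `HEALTH_OK`; otherwise it prints the health summary and the down OSDs.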
On the node that houses the OSD, try to start the OSD using systemctl,
where OSD is a placeholder for the actual OSD identifier:
# systemctl start ceph-osd@OSD.service
The systemctl unit file for the OSD will vary depending on which OSD has
failed. In this case the systemctl unit file is called
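The remaining checks from the list, the mount point and `smartctl`, could be sketched as follows. `check_drive` is a hypothetical helper, and both arguments are placeholders you must supply: the OSD mount point and the device node of the suspect drive vary per deployment. The sketch assumes `smartctl` from smartmontools is installed and relies on its exit status being non-zero when it flags something.

```shell
#!/usr/bin/env bash
# Sketch: check an OSD mount point and the SMART health of its drive.
# check_drive is a hypothetical helper; both arguments are site-specific.

check_drive() {
    local mount_point="$1" device="$2"
    if [ -z "$mount_point" ] || [ -z "$device" ]; then
        echo "usage: check_drive OSD_MOUNT_POINT DEVICE" >&2
        return 2
    fi

    local rc=0
    # df fails if the path is missing. If the path exists but is not
    # mounted, df shows the parent filesystem, so compare "Mounted on".
    if ! df -h "$mount_point"; then
        echo "Cannot read $mount_point" >&2
        rc=1
    fi

    # -H prints the overall health assessment. A non-zero exit status
    # means smartctl flagged something that should be read carefully.
    if ! smartctl -H "$device"; then
        echo "smartctl reported a problem with $device" >&2
        rc=1
    fi
    return "$rc"
}
```

Invoke it as root with the real paths in place of the placeholders, for example `check_drive OSD_MOUNT_POINT DEVICE`.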
If a hard drive has failed, our data center team will need to replace it.
Create a ticket in Flex Metal Central to alert our team to the failure, and
the drive or drives will be replaced by our team.