OpenStack Hardware Failures


The purpose of this guide is to go over failure and maintenance scenarios that
could occur in the OpenStack cloud and what to do to address them.

Things can and will go wrong. It is good to be prepared for these events.


What should be done if a hardware node fails?

If a hardware node fails or needs to come down for maintenance, you should
know what steps to take. Depending on the maintenance required, you will either
need the assistance of our data center staff or be able to perform the
maintenance yourself.

Planned maintenance

This section describes the steps needed to take a hardware compute node out of
the cloud in the event work needs to be done on it or the cloud needs to be
reduced in size.

NOTE! – If you know a node or nodes need maintenance that requires a hardware
modification, you’ll need to create a ticket in the Flex Metal Central
control panel so our data center staff can perform that task for you.

For official documentation on this subject, see OpenStack’s
Compute Node Failures and Maintenance guide.

The general workflow for bringing a compute node down involves first
disabling that node, finding the instances on that node, migrating those
instances to another node, and removing any ceph Object Storage Daemons
(OSDs) on the node. Optionally, you can migrate the instances back to
the original node when the maintenance is done.
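
The procedure below covers disabling the node and migrating its instances. The
OSD-removal step depends on how ceph was deployed, so the following is only a
rough sketch of taking a single OSD out of service on a Nautilus-era cluster,
where OSD_ID is a placeholder for the OSD's numeric identifier; consult the
ceph documentation before removing OSDs:

# ceph osd out OSD_ID
# ceph health    # wait for rebalancing to finish and health to return to HEALTH_OK
# systemctl stop ceph-osd@OSD_ID.service
# ceph osd purge OSD_ID --yes-i-really-mean-it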

OpenStackClient will be required to perform the maintenance.
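
If the OpenStackClient is not already installed, it can typically be installed
with pip, assuming a suitable Python environment is available:

$ pip install python-openstackclient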

Procedure for removing a compute node

Start with disabling the nova-compute service on the appropriate node:

$ openstack compute service set --disable --disable-reason maintenance COMPUTE_NODE_NAME nova-compute
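
To confirm the service is now disabled, list the nova-compute services and
check the Status column:

$ openstack compute service list --service nova-compute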

List the instances on that node:

$ openstack server list --host COMPUTE_NODE_NAME --all-projects

Migrate the instances to another node:

$ openstack server migrate INSTANCE_UUID --live-migration

NOTE – This deployment of OpenStack uses ceph as the backend shared storage,
so there is no need to pass the --block-migration flag to openstack server
migrate.
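
While the live migration is in progress the instance's status will show as
MIGRATING; once it completes, you can verify which node the instance ended up
on (the host field is typically only visible to admin users):

$ openstack server show INSTANCE_UUID -c status -c OS-EXT-SRV-ATTR:host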

Because OpenStack has been deployed using Kolla Ansible, each OpenStack
service runs in a docker container.

Stop the nova_compute docker container:

# docker stop nova_compute

Perform the needed maintenance, and then restart the nova_compute service:

# docker start nova_compute

Verify the nova_compute docker container is running:

# docker ps | grep nova_compute
286e1b2e2ae5        kolla/centos-binary-nova-compute:train-centos8
"dumb-init --single-…"   2 months ago        Up 18 minutes
nova_compute

Finally, verify the nova-compute service has connected to the messaging
service, AMQP:

$ grep AMQP /var/log/kolla/nova/nova-compute.log
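
Once the service has reconnected, the node can be returned to service by
re-enabling it; if desired, instances can then be live migrated back to it:

$ openstack compute service set --enable COMPUTE_NODE_NAME nova-compute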

Unplanned maintenance

There are times when unplanned maintenance is required. This section describes
what can be done in the event a compute node goes down unexpectedly.

The primary concern is that instances running on the failed compute node will
no longer be accessible.
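
As a starting point, you can confirm that the node's nova-compute service is
reported as down and list the instances that were running on it, using the
same commands shown in the planned-maintenance procedure:

$ openstack compute service list --service nova-compute
$ openstack server list --host COMPUTE_NODE_NAME --all-projects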


Ceph failure scenarios and recovery

Ceph by nature is resilient to hardware failure and self-healing.

The primary concern with ceph is failed hard drives. How can an operator
be alerted to a failed hard drive? Will ceph continue to function if a drive
is lost?

Generally, ceph will continue to function if a drive is lost; however, the
drive should be replaced as soon as possible.

How do you know if a hard drive has failed?

Currently there is no monitoring in place for failed ceph drives, although the
intention is to monitor for these events in the future. Until then, it is
recommended that you put drive monitoring in place yourself. Software such as
Icinga or Nagios is a viable option for monitoring.
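
Until such monitoring is in place, a minimal sketch of an interim check is a
small script run from cron on one of the ceph nodes that sends an alert
whenever the cluster leaves HEALTH_OK; the mail command and address here are
placeholders:

#!/bin/bash
# Minimal example: alert if ceph reports anything other than HEALTH_OK.
STATUS=$(ceph health)
if [ "$STATUS" != "HEALTH_OK" ]; then
    echo "ceph reports: $STATUS" | mail -s "ceph health warning" admin@example.com
fi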

If it is suspected a drive has failed, you should first determine if this
really is the case.

The overall procedure for determining if a drive has failed is to:

  • Check Ceph health
  • If the OSD associated with the drive in question is stopped, see whether
    it can be started
  • Check the OSD’s mount point using `df -h`
  • Use smartctl on the drive in question

The following explains these steps in more detail.


Procedure

Reference:

https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/4/html/operations_guide/handling-a-disk-failure

From one of the hardware nodes, perform the following checks:

Check if ceph is healthy:

# ceph health
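
If the cluster is not healthy, more detail about the affected OSDs and
placement groups is available with:

# ceph health detail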

Find the location of the OSD within the CRUSH map:

# ceph osd tree | grep -i down

On the node that houses the OSD, try to start the OSD using systemctl
where OSD is a placeholder for the actual OSD identifier:

# systemctl start ceph-osd@OSD.service

The systemd unit file for the OSD will vary depending on which OSD has
failed; the unit name follows the pattern ceph-osd@OSD.service, where OSD is
replaced with the failed OSD's identifier.
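
If the OSD will not start, the remaining checks from the list above can be run
on that node. Assuming the default OSD mount point layout, and using /dev/sdX
as a placeholder for the suspect device, those checks might look like:

# df -h /var/lib/ceph/osd/ceph-OSD
# smartctl -H /dev/sdX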

If a hard drive has failed, our data center team will need to replace it. A
ticket will need to be made in Flex Metal Central to alert our team of the
failure, and the drive or drives will then be replaced by our team.

Nick West, Systems Engineer

Nick is an avid aggressive inline skater, nature enthusiast, and loves working with open source software in a Linux environment.
