Frequently Asked Questions
Is there a graphical user interface?
Flex Metal Central is available as a GUI as well as through an API. OpenStack and Ceph both have an administrative GUI and a “Self Service User” GUI. Of note, as OpenStack and Ceph are often considered to be “API first” or “Infrastructure as Code first” applications, more administrative features are available via API or Command Line than within the administrative interface. For users to whom you give Self Service access, OpenStack and Ceph have strong capabilities within the GUI.
What is OpenStack?
OpenStack is the overarching cloud management software and handles networking, compute, storage connection, access levels, and much more. More information can be found here.
What is Ceph?
Ceph provides the network storage including Block Storage, Object Storage, and, if needed, an NFS compatible file storage called CephFS. More information can be found here.
How many IOPS will I get?
These servers are dedicated to you. IOPS will vary by the hardware you purchase and the technology you are using to access the hardware. The drives used are data center grade Intel NVMe or SATA SSDs. Spinning hard drives are also data center grade. Review detailed specs here.
For extremely high IOPS, we recommend using the NVMe or SATA SSD drives directly from your application. This means that you will need to accomplish data integrity and high-availability through your software. By doing this though, many applications like high-performance databases can function extremely well. The NVMe drives on the HC Standards and the Compute Standards, in particular, have extreme IOPS. It bears repeating though - you must handle data integrity and HA yourself.
For very high IOPS with built-in data protection, Ceph with a replication of 2 on NVMe drives is popular. A replica level of 3 will slightly reduce the IOPS but is a recommended choice.
Do you offer GPUs?
We are currently researching the right hardware for bulk availability. Please contact your Account Manager for access to GPUs.
What are the "server to switch" port speeds?
HC Smalls have 2X1gbit ports. All other servers have 2X10gbit ports. They are bonded by default to provide redundancy and greater throughput.
What is your overall connectivity?
Flex Metal Clouds are organized by “Pods”. Each Pod has a minimum of 200gbits of connectivity that can be upgraded based on usage. Pods/overall network may also have direct peering with other cloud providers for maximum throughput.
OpenStack Security Groups
OpenStack Security Groups create firewall rules on the hardware nodes to protect VMs on the individual node. This allows you to have a public IP address on a VM that does not traverse the OpenStack router, but is properly firewalled.
OpenStack Runs VXLAN
OpenStack runs VXLAN in your private network so that within your “hard” VLANs you can also create overlay networks so that individual departments can have their own private network space for their VMs separated from other departments.
Provisioning Bare Metal Servers
When provisioning bare metal servers within your network they will be, by default, on your private VLANs. You can then use OpenStack’s Firewall as a Service to allow selected public traffic through to that bare metal server. You have the option to place any bare metal servers on the public VLAN by overriding the VLAN tagging on that individual server. This is not an automated process as placing a bare metal server on the public VLAN will result in a server without a firewall unless you manually create it on said server. In the case you are running bare metal servers that are not part of the OpenStack cluster, then those bare metal servers will be within the private or public VLAN you assigned and must traverse one of the private OpenStack routers to connect to a VM that is on a VXLAN. This is typical architecture as the bare metal to VM route is entirely within your private network.
How many resources go to the Control Plane?
This depends on the size of the Flex Metal Cloud and the Services being used from the OpenStack Control Plane. For small Flex Metal Clouds, this might only be a few CPU cores and 2-4GB of RAM per Private Cloud Core server. Examples of small would include Flex Metal Clouds that are only made up of a 3 member Private Cloud Core. For very large Flex Metal Clouds, like several hundred server nodes, the Control Plane on the PCC can use enough of the PCC’s resources that best practices will advise against using the PCC for Compute and Storage. We also recommend very large deployments choose the HC Standard X5 to spread the usage across 5 servers versus 3 servers for best performance. HC Standards are very powerful machines though and selected to be able to cover many different situations while supplying Control Plane Services with Compute and Storage.
How are IP Addresses Handled?
We supply IPv4 addresses for lease, and they are terminated on your VLANs. We are aiming to supply IPv6 on a no-charge lease in an upcoming release. You can also SWIP your IPv4 blocks to us.
What is a Control Plane?
In OpenStack the Control Plane is made up of all the services that are necessary to provide networking, management control, control panels, APIs, and more to the Compute and Storage. For more information go here.
Is there any shared hardware in our Flex Metal Cloud?
Your servers are 100% dedicated to you. The crossover between your Flex Metal Cloud and the overall data center comes at the physical switch level for internet traffic and for IPMI traffic. For internet traffic, you are assigned a set of VLANs within the physical switches. Those VLANs only terminate on your hardware. For administrative purposes, your hardware’s IPMI port is connected to an IPMI network that only allows traffic between your port and our central management IP.
How do I give self-service access to different departments or people within my company?
Self service access to VMs, networking space, storage, and other OpenStack services is handled through the Horizon interface or through automation against OpenStack APIs. As the cloud administrator, you will set up Projects for those departments or people. You can set resource limitations that will be enforced by OpenStack. Regardless of whether the Project is being managed via API or through Horizon, OpenStack will enforce your policies. As OpenStack is an API first system, there is often more functionality available via the API than within Horizon. For Cloud Administrators, a robust CLI that uses the API is the most popular way to administer OpenStack.
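As a rough illustration of setting up a Project with enforced limits through the API, the sketch below uses the Python openstacksdk package. Everything specific in it is an assumption made for illustration: the department name, the quota numbers, the clouds.yaml entry name, and the "default" domain. It is a minimal starting point rather than a recommended procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DepartmentQuota:
    """Hypothetical per-department limits; names and numbers are illustrative."""
    project_name: str
    cores: int
    ram_mb: int
    instances: int


def compute_quota_kwargs(quota: DepartmentQuota) -> dict:
    """Translate a plan entry into compute quota fields (ram is in MB)."""
    return {"cores": quota.cores, "ram": quota.ram_mb, "instances": quota.instances}


def apply_quota(cloud_name: str, quota: DepartmentQuota) -> None:
    """Create the Project if it is missing, then set its compute quotas.

    Needs the openstacksdk package and a clouds.yaml entry named cloud_name.
    The "default" domain is an assumption; use your own domain if it differs.
    """
    import openstack  # imported lazily so the planning helper works without it

    conn = openstack.connect(cloud=cloud_name)
    project = conn.get_project(quota.project_name)
    if project is None:
        project = conn.create_project(name=quota.project_name, domain_id="default")
    conn.set_compute_quotas(project.id, **compute_quota_kwargs(quota))


engineering = DepartmentQuota("engineering", cores=32, ram_mb=65536, instances=20)
print(compute_quota_kwargs(engineering))
```

Calling apply_quota("my-flex-metal-cloud", engineering) would apply the limits, where the cloud name is whatever entry you have defined in clouds.yaml. Once set, the quotas apply whether the department works through Horizon or the API.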
Why select a 5 server PCC over a 3 server PCC?
Capacity and redundancy benefits come with the 5 PCC footprint and are typically appropriate for very large deployments. Three areas to consider:
The use of 3 replicas has typically been the standard for storage systems like Ceph. It means that 3 copies exist at all times in normal operation to prevent data loss in the event of a failure. In Ceph’s lingo, if identical data is stored on 3 OSDs, when one of the OSDs fails, the two remaining replicas can still tolerate one of them failing without loss of data. Depending on the Ceph settings and the storage available, when Ceph detects the failed OSD, it will wait in the “degraded” state for a certain time, then begin a copy process to recover back to 3 replicas. During this wait and/or copy process, Ceph is not in danger of data loss if another OSD fails.
There are two downsides to consider. The first downside to 3 replicas is slower maximum performance, as the storage system must write the data 3 times. Your applications may operate well below that maximum, though, so maximum performance may not be a factor.
The second downside is cost: with 3 replicas, storing 1GB of user data consumes 3GB of storage space.
With data center grade SATA SSD and NVMe drives, the mean time between failure (MTBF) is better than traditional spinning drives. Spinning drive reliability is what drove the initial 3 replica standard. Large trustworthy data sets describe a 4X to 6X MTBF advantage to SSDs over HDDs. This advantage has led to many cloud administrators moving to 2 replicas for Ceph when running on data center grade SSDs. Both our HC Smalls and HC Standards use data center grade SSDs.
Considerations for 2 replicas:
First, with two replicas, after one OSD fails there is a window of time during which the loss of a second OSD will result in data loss. This window covers the timeout that allows the first OSD to potentially rejoin the cluster plus the time needed to create a new replica on a different running OSD. This risk is real, but is offset by the very low chance of this occurring and by how easy or difficult it is for you to recover data from a backup.
Storage space is more economical, as 1GB of user data only consumes 2GB of storage space
Maximum IOPS may increase as Ceph only needs to write 2 copies before acknowledging the write
Latency may decrease as Ceph only needs to write 2 copies before acknowledging the write
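To put the "very low chance" in the first consideration into rough numbers, here is a back-of-the-envelope sketch. It assumes drive failures are independent and occur at a constant rate of 1 / MTBF, which is a simplification. The 2 million hour MTBF is the SSD figure quoted elsewhere in this FAQ; the 2-hour window and the count of two exposed OSDs are hypothetical values chosen for illustration.

```python
import math


def second_failure_probability(window_hours: float, mtbf_hours: float,
                               exposed_osds: int = 1) -> float:
    """P(at least one of the exposed OSDs fails within the window).

    Assumes independent failures at a constant rate of 1 / MTBF per OSD.
    """
    if window_hours < 0 or mtbf_hours <= 0 or exposed_osds < 1:
        raise ValueError("window must be >= 0, MTBF > 0, exposed_osds >= 1")
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-exposed_osds * window_hours / mtbf_hours)


# Hypothetical 2-hour timeout-plus-recopy window with two surviving OSDs.
p = second_failure_probability(window_hours=2, mtbf_hours=2_000_000, exposed_osds=2)
print(f"chance of a second failure in the window: {p:.2e}")
```

The result scales almost linearly with the window length, so shortening the recovery time (faster drives and network, less data per OSD) reduces the exposure proportionally.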
For Ceph data redundancy, why choose 3 replicas versus 2 replicas or vice-versa?
If you need to maximize your usable disk space, we generally prefer Replica 2. This preference is based on the following:
We supply only data center grade SATA SSD and NVMe drives. The Mean Time Between Failure of a typical hard drive is 300,000 hours. Most recommendations for 3 replicas, and the history behind that choice, come from hard drive use cases that take this failure rate into account. The MTBF of both our SATA SSDs and our NVMe drives is 2 million hours. Though failures will certainly still occur, a failure is roughly 6 times less likely than with an HDD.
When Ceph has been hyper-converged onto 3 servers with a replica level of 3 and you lose one of the 3 members, Ceph cannot recover itself out of the degraded state until the lost member is restored or replaced. The data is not at risk since two copies remain, but it is now effectively at a replica level of 2. When Ceph has been hyper-converged onto 3 servers with a replica level of 2 and you lose one of the 3 members, Ceph can be set to self-heal by taking any data that has fallen to 1 replica and automatically starting the copy process to recover to a replica level of 2. Your data is only in danger of loss during the time when only 1 replica is present.
Disaster recovery processes for data have progressed significantly. This will depend on your specific situation, but if restoring data from backups to production is straightforward and fast, then recovering from backups is a workable fallback in the extremely rare case that both of the 2 replicas fail during the degraded period.
Usable Ceph disk space savings are significant (estimated, not exact):
HC Small, Replica 3 - 960GB * 3 servers / 3 replicas = 960GB usable
HC Small, Replica 2 - 960GB * 3 servers / 2 replicas = 1440GB usable
HC Standard, Replica 3 - 3.2TB * 3 servers / 3 replicas = 3.2TB usable
HC Standard, Replica 2 - 3.2TB * 3 servers / 2 replicas = 4.8TB usable
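The estimates above all follow the same arithmetic: raw drive space times the number of servers, divided by the replica count. A minimal sketch of that calculation is below. It assumes one Ceph data drive per server, treats 3.2TB as 3200GB, and ignores Ceph overhead and free-space headroom, so like the figures above it is an estimate rather than an exact number.

```python
def usable_ceph_gb(drive_gb: float, servers: int, replicas: int) -> float:
    """Estimate usable capacity: total raw space divided by the replica count.

    Assumes one Ceph data drive per server and ignores Ceph overhead and
    free-space headroom, so treat the result as an upper bound.
    """
    if replicas < 1 or servers < replicas:
        raise ValueError("need at least one replica and one server per replica")
    return drive_gb * servers / replicas


for label, drive_gb in (("HC Small", 960), ("HC Standard", 3200)):
    for replicas in (3, 2):
        usable = usable_ceph_gb(drive_gb, servers=3, replicas=replicas)
        print(f"{label}, Replica {replicas}: {usable:.0f}GB usable")
```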
Do I have access to the OpenStack APIs to automate deployments by using Terraform, Ansible, etc.?
Yes, this is your private cloud!
What are the options to grow my Compute and/or Storage resources?
First, a little background on Ceph and creating Storage Pools. The following is important.
All servers will have at least one usable drive for data storage, including servers labeled as Compute. You have the option to use this drive for LVM based storage, Ephemeral storage, or as part of Ceph. Each drive is typically performing only 1 duty and that is our default recommendation*.
For Ceph, if the drive types differ (i.e., SATA SSD vs NVMe SSD vs Spinners), you should not join them together within one Pool. Ceph can support multiple Pools with different performance levels, but you should not mix drive types within a Pool. In order to create a Pool that can support Replication of 2, you will need at least two servers. For a Replication of 3, you will need 3 servers. For Erasure Coding, you typically need 4 or more separate servers.
If you are creating a large storage Pool with Spinners, we have advice specific to using the NVMe drives as an accelerator for the storage process and as part of the Object Gateway Service. Please check with your Account Manager for more information.
*Of note, though this is not a common scenario yet, with our high performance NVMe drives, the IO is often much, much higher than typical applications require so splitting the drive to be both part of Ceph and as a local high performance LVM is possible with good results.
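The Pool rules from the background above (one drive type per Pool, and a minimum server count per redundancy scheme) can be written down as a small planning check. This is only a sketch: the scheme names and the function are hypothetical, and the erasure coding minimum of 4 is the typical figure given above, not a hard Ceph limit.

```python
# Minimum server counts taken from the rules above; scheme names are hypothetical.
MIN_SERVERS = {"replica2": 2, "replica3": 3, "erasure": 4}


def check_pool(scheme: str, drive_types: list[str]) -> list[str]:
    """Return a list of problems for a planned Pool (empty means it looks fine).

    drive_types holds one entry per server contributing a drive to the Pool.
    """
    if scheme not in MIN_SERVERS:
        raise ValueError(f"unknown scheme: {scheme}")
    problems = []
    if len(set(drive_types)) > 1:
        problems.append("mixed drive types in one Pool")
    if len(drive_types) < MIN_SERVERS[scheme]:
        problems.append(
            f"{scheme} needs at least {MIN_SERVERS[scheme]} servers, "
            f"got {len(drive_types)}"
        )
    return problems


print(check_pool("replica3", ["nvme", "nvme", "nvme"]))
print(check_pool("erasure", ["sata_ssd", "sata_ssd", "spinner"]))
```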
With that being said, there are several ways to grow your Compute and Storage past what is within your PCC.
You can add additional matching or non-matching Compute Nodes. Keep in mind that during a failure scenario, you will need to rebalance the VMs from that Node to Nodes of a different VM capacity. Though not required, it is typical practice to keep a Cloud as homogeneous as possible for management ease.
You can add additional matching Converged Servers to your PCC. Typically you will join the SSD with your Ceph as a new OSD, but the drive on the new Node can be used as Ephemeral storage or as traditional drive storage via LVM. If joined to Ceph, you will see Ceph will automatically balance existing data onto the new capacity. For Compute, once merged with the existing PCC, the new resources will become available inside OpenStack.
You can create a new Converged Cluster. This allows you to select servers that are different from your PCC servers. You will need to use at least 2 servers for the new Ceph Pool; 3 or more is most typical. Of note, one Ceph cluster can manage many different Pools, but you can also have multiple Ceph clusters if you see that as necessary.
You can create a new Storage Cloud. This is typically done for large scale implementations when the economy of scale favors separating Compute and Storage. This is also done when Object Storage is a focus of the Storage Cloud. Our Blended Storage and Large Storage have up to 12 large capacity Spinners. They are available for this use and others.
When should I consider moving from Converged to a Stand-alone Compute and Storage?
This depends on your use case. It typically happens naturally as you scale up. You will find that you have some “marooned” resources in your Cluster. For example, marooned could mean you have disk space left over but your RAM/CPU has been consumed. In this scenario, you will just need to add Compute. If you are out of Storage but have plenty of RAM/CPU, you have two options. You have the choice of creating a new Storage only cluster or shifting your base Converged Node to be a 2+ drive Converged Node. In the second situation you will typically need to move to 3+ of these servers to accommodate Ceph’s Pool rules and then, potentially, retire a few of the single drive Nodes. Consult with your Account Manager for advice.
Can I use the boot drive for data storage or part of a Ceph Pool?
No, that is not recommended unless it is an emergency or similar temporary situation. Those drives are not intended for heavy use and are not rated for high disk writes per day.
I am using a Flex Metal Cloud as part of my Disaster Recovery plan. What do you need to know to be prepared in case we fail over to our Flex Metal Cloud?
Several things come into play here. First, your default available expansion capacity will likely be less than what you need. For example, you keep a 10 node Storage Deployment with us containing 1+ Petabyte of data. Your current deployment using said data is 100 server nodes. Your DR plan would mean you would need to spin up 100 Compute nodes to get back to running order. You will need to work with your Account Manager ahead of time to have that capacity available to you. Very large deployments do require an agreement for this service if it increases our standby server quota.
Second, it is likely that you will be “SWIPing” IP addresses to us to broadcast from our routers.
It is wise to understand the processes above ahead of time and potentially perform a yearly dry run.
What are we responsible for versus what are you responsible for?
As a customer centric business, we want to provide paths to help you succeed. If you are finding a barrier to your success within our system, please escalate your contact within Flex Metal Central. There we provide direct contact to our Product Manager, our Support Manager, and to our company President.
In general, we manage the Networks above your Flex Metal Clouds and we supply the hardware and parts replacements as needed for hardware in your Flex Metal Clouds.
Flex Metal Clouds themselves are managed by your team. If your team has not managed OpenStack and Ceph private clouds before, we have several options to be sure you can succeed.
Complimentary onboarding training matched to your deployment size and our joint agreements
Self paced free onboarding guides.
Free Test Clouds, some limits apply
Paid additional training, coaching, and live assistance
Complimentary Emergency Service - please note this can be limited in the case of overuse. That being said, we are nice people and are driven to see you succeed with an Open Source alternative to the mega clouds.
In addition, we may maintain a free “Cloud in a VM” image you can use for testing and training purposes within your Cloud. This is currently available in our Image Catalog here.
I need help in my Private Cloud, how do I let you in?
You will need to add one of our public keys to the server in question. These keys are rotated periodically. You should remove our public keys after service has been rendered. Please see our current public keys here.
Will you help with Linux questions? Or individual Services running on one of my VMs?
Probably not unless you are in a paid training program. If you are not sure, ask in our public forums. That team can often answer more questions there around general administration.
How do I return servers?
You can return a server by simply removing all running Cloud services then requesting removal via API or from Flex Metal Central. To safely remove the server:
Spin down or move off any VMs.
Direct your OpenStack to drop management of this server.
Detach Ceph from using any drives on this server.
Please see this guide for a more complete explanation.
You can override the safety check via the API or within Flex Metal Central. This is not recommended and can lead to many issues.
When should I add more servers?
We generally recommend that Clouds are not run much over 80% of their theoretical capacity. Performance monitoring and node health are key to track.
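As a simple illustration of the 80% guideline, the sketch below flags any resource whose usage exceeds a threshold share of its theoretical capacity. The resource names and figures are made up for the example; in practice the numbers would come from your own monitoring.

```python
def resources_over_threshold(usage: dict[str, tuple[float, float]],
                             threshold: float = 0.80) -> list[str]:
    """Return the names of resources whose used / capacity exceeds the threshold.

    usage maps a resource name to a (used, capacity) pair in the same unit.
    """
    over = []
    for name, (used, capacity) in usage.items():
        if capacity <= 0:
            raise ValueError(f"capacity for {name} must be positive")
        if used / capacity > threshold:
            over.append(name)
    return over


# Hypothetical snapshot of a small Cloud.
snapshot = {"vcpus": (90, 100), "ram_gb": (300, 512), "ceph_gb": (1200, 1440)}
print(resources_over_threshold(snapshot))
```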
What do you recommend for Disaster Recovery replication or backups?
This will depend on your situation, but Ceph has native remote replication options. Use of more than one of our locations can often meet your DR requirements. There are also several companies that specialize in Ceph data replication if your rules require a third party. Please contact your Account Manager. For backups in general, the Ceph Object Storage system is one of the best in the industry and that is native to any of your Flex Metal Storage Clouds.