Frequently Asked Questions about Sales and Purchases
What is Flex Metal Central?
Flex Metal Central (FMC) is the central overarching API (and dashboard) that is typically used by IT Leadership. Within FMC, hardware is organized by “Cloud Projects”. This allows IT Leadership to spin up multiple separate Clouds and set budget limits on what can be added to an individual project. More information can be found here.
Should we have multiple smaller Private Clouds or one large Private Cloud?
Running a Cloud is much easier with Flex Metal, and multiple smaller Clouds allow:
Autonomous clouds for pre-deployment testing or proof-of-concepts
Separating critical data or workloads from daily usage data or workloads
Separating large departments or projects from each other
Separating data or workloads with geographic sensitivity, like government regulations or user performance needs
Limiting the effect of issues within a cloud to just that small cloud
Large Private Clouds do have an advantage over many small Private Clouds, as large Clouds are more efficient with very large workloads or data storage. Talk to your Account Manager in Flex Metal Central about your particular situation. Many times a mixed set of small and large Clouds is ideal.
What APIs are available?
As this is a fully private OpenStack and Ceph cloud, you will have full access to all of OpenStack and Ceph’s APIs. In addition, Flex Metal Central is an API first application that will allow you to automate adding and removing hardware from your clouds.
Is there a graphical user interface?
We offer Flex Metal Central as a GUI as well as by API. OpenStack and Ceph both have an administrative GUI and a “Self Service User” GUI. Of note, as OpenStack and Ceph are often considered to be “API first” or “Infrastructure as Code first” applications, more administrative features are available via API or Command Line than within the administrative interface. For users that you might give Self Service access, OpenStack and Ceph have strong capabilities within the GUI.
What is OpenStack?
OpenStack is the overarching cloud management software and handles networking, compute, storage connection, access levels, and much more. More information can be found here.
What is Ceph?
Ceph provides the network storage including Block Storage, Object Storage, and, if needed, an NFS compatible file storage called CephFS. More information can be found here.
How many IOPS will I get?
These servers are dedicated to you. IOPS will vary by the hardware you purchase and the technology you are using to access the hardware. The drives used are data center grade Intel NVMe or SATA SSDs. Spinning hard drives are also data center grade. Review detailed specs here.
For extremely high IOPS, we recommend using the NVMe or SATA SSD drives directly from your application. This means that you will need to accomplish data integrity and high-availability through your software. By doing this though, many applications like high-performance databases can function extremely well. The NVMe drives on the HC Standards and the Compute Standards, in particular, have extreme IOPS. It bears repeating though - you must handle data integrity and HA yourself.
For very high IOPS with built-in data protection, Ceph with a replication of 2 on NVMe drives is popular. A replica level of 3 will slightly reduce the IOPS but is a recommended choice.
Do you offer GPUs?
We are currently researching the right hardware for bulk availability. Please contact your Account Manager for access to GPUs.
What are the "server to switch" port speeds?
HC Smalls have 2 x 1Gbit ports. All other servers have 2 x 10Gbit ports. The ports are bonded by default to provide redundancy and greater throughput.
What is your overall connectivity?
Flex Metal Clouds are organized by “Pods”. Each Pod has a minimum of 200Gbit of connectivity that can be upgraded based on usage. Pods and the overall network may also have direct peering with other cloud providers for maximum throughput.
There are two types of networks that come with your Private Cloud: a private network and a public network. The private network is supplied by OpenStack and consists of a set of routers, switches, and on-server switching. This network is run within a set of 5 “hard” VLANs that are exclusively part of your Cloud and terminated only on servers within your project. One of those VLANs is public and the remaining are private. When you set up a Cloud, you must also select public IP space. This subnet will be terminated on your public “hard” VLAN. We recommend, and preconfigure, most of those IPs as “Floating” within OpenStack. This simply means that OpenStack has a pool of public IP addresses that it can allocate when VMs that need public IPs are spun up.
OpenStack Security Groups
Security Groups create firewall rules on the hardware nodes to protect VMs on the individual node. This allows you to have a public IP address on a VM that does not traverse the OpenStack router but is properly firewalled.
OpenStack Runs VXLAN
OpenStack runs VXLAN in your private network so that, within your “hard” VLANs, you can also create overlay networks. This allows individual departments to have their own private network space for their VMs, separated from other departments. For more information go here.
Provisioning Bare Metal Servers
When provisioning bare metal servers within your network they will be, by default, on your private VLANs. You can then use OpenStack’s Firewall as a Service to allow selected public traffic through to that bare metal server. You have the option to place any bare metal servers on the public VLAN by overriding the VLAN tagging on that individual server. This is not an automated process as placing a bare metal server on the public VLAN will result in a server without a firewall unless you manually create it on said server. In the case you are running bare metal servers that are not part of the OpenStack cluster, then those bare metal servers will be within the private or public VLAN you assigned and must traverse one of the private OpenStack routers to connect to a VM that is on a VXLAN. This is typical architecture as the bare metal to VM route is entirely within your private network.
How many resources go to the Control Plane?
This depends on the size of the Flex Metal Cloud and the Services being used from the OpenStack Control Plane. For small Flex Metal Clouds, this might only be a few CPU cores and 2-4GB of RAM per Private Cloud Core server. Examples of small would include Flex Metal Clouds that are only made up of a 3 member Private Cloud Core. For very large Flex Metal Clouds, like several hundred server nodes, the Control Plane on the PCC can use enough of the PCC’s resources that best practices will advise against using the PCC for Compute and Storage. We also recommend very large deployments choose the HC Standard X5 to spread the usage across 5 servers versus 3 servers for best performance. HC Standards are very powerful machines, though, and are selected to cover many different situations while supplying Control Plane Services alongside Compute and Storage.
How are IP Addresses Handled?
We supply IPv4 addresses for lease; they will be terminated on your VLANs. We are aiming to supply IPv6 for a no-charge lease in an upcoming release. You can also SWIP your IPv4 blocks to us.
What is a Control Plane?
In OpenStack, the Control Plane is made up of all the services that are necessary to provide networking, management control, control panels, APIs, and more to the Compute and Storage.
Is there any shared hardware in our Flex Metal Cloud?
Your servers are 100% dedicated to you. The crossover between your Flex Metal Cloud and the overall data center comes at the physical switch level for internet traffic and for IPMI traffic. For internet traffic, you are assigned a set of VLANs within the physical switches. Those VLANs only terminate on your hardware. For administrative purposes, your hardware’s IPMI port is connected to an IPMI network that only allows traffic between your port and our central management IP.
How do I give self-service access to different departments or people within my company?
Self service access to VMs, networking space, storage, and other OpenStack services are handled through the Horizon interface or through automation against OpenStack APIs. As the cloud administrator, you will setup Projects for those departments or people. You can set resource limitations that will be enforced by OpenStack. Regardless of if the Project is being managed via API or through Horizon, OpenStack will enforce your policies. As OpenStack is an API first system, there is often more functionality available via the API than within Horizon. For Cloud Administrators, a robust CLI that uses the API is the most popular way to administer OpenStack.
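As a sketch of that workflow, the standard `openstack` CLI calls for creating a Project and capping its resources can be generated like this (the project name and quota values are illustrative, not defaults):

```python
def project_setup_commands(project: str, cores: int, ram_mb: int, instances: int) -> list[str]:
    """Build the standard OpenStack CLI calls that create a self-service
    Project and set resource quotas OpenStack will enforce on it."""
    return [
        # create the Project the department will work inside
        f"openstack project create {project}",
        # cap CPU cores, RAM (in MB), and instance count for that Project
        f"openstack quota set --cores {cores} --ram {ram_mb} --instances {instances} {project}",
    ]

for cmd in project_setup_commands("engineering", cores=40, ram_mb=102400, instances=20):
    print(cmd)
```

The same quotas can be set through Horizon or via automation; OpenStack enforces them regardless of how the Project is managed.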
Why select a HC Standard vs a HC Small PCC?
The HC Standard has several key attributes to consider for production deployments. As your internal network is handled by the PCC, the HC Standard’s 20Gbit connectivity may be critical if you are moving large amounts of data back and forth internally that must traverse the PCC. The HC Small is limited to 2Gbit.
The HC Standard has NVMe drives versus the SATA SSDs in the HC Small. Though both are Intel Data Center Class, the NVMe drives have a 6X to 10X advantage over the SATA SSDs.
For development, testing, or small deployments, the HC Smalls are very economical and can perform very well. We strongly encourage you to consider them for Proof of Concepts, edge or regional deployments to extend your main Private Cloud, or any time that cost is a critical factor that can not be mitigated.
Of note, as OpenStack provides firewall functionality at the Node level, traffic can safely go directly from your private VLAN to your VM close to the line speed of the Node. This does not traverse the PCC’s network and thus can be almost as fast as the Node connection, often 20gbits.
Each member of a PCC handles compute, storage, and the Control Plane. For the Control Plane, each member of the PCC has a redundant copy of the services for HA reasons but also for capacity reasons. For example, a 3 server PCC will have 3 routers. Failure of one PCC member means the remaining two routers will handle the traffic your cloud is transmitting - each of the 2 servers must handle 50% more traffic. In a 5 server PCC that loses a member, the 4 remaining servers will need to handle only 25% more traffic. The same logic applies for VMs that must be rebalanced to remaining PCC members.
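The failover arithmetic above can be checked with a small calculation (a sketch; it assumes traffic is spread evenly across PCC members):

```python
def failover_traffic_increase(members: int) -> float:
    """Fractional extra traffic each surviving PCC member absorbs
    when one member fails, assuming an even spread of traffic."""
    if members < 2:
        raise ValueError("a PCC needs at least 2 members to survive a failure")
    # each survivor's share grows from 1/members to 1/(members - 1)
    return members / (members - 1) - 1

print(failover_traffic_increase(3))  # 0.5  -> 50% more traffic per survivor
print(failover_traffic_increase(5))  # 0.25 -> 25% more traffic per survivor
```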
Ceph’s minimum footprint is 3 servers to allow for a data replication of 3. Replication of 3 is a popular option, particularly on hyperconverged systems, but it does mean that with 3 servers and 1 drive per server, recovering to 3 replicas requires fixing the failed member by either corrective server action or replacement of said server. Having 5 members avoids this situation and allows the remaining 4 servers to recover the cluster automatically.
Growing a production 3 server PCC to a 5 server PCC requires manual intervention at some level. If you know your deployment will be quite large in the near future, the cost of a 5 server PCC is often fairly trivial when spread across the total value delivered by the large deployment.
Why select a 5 server PCC over a 3 server PCC?
Capacity and redundancy benefits come with the 5 server PCC footprint and are typically appropriate for very large deployments. Three areas to consider:
The use of 3 replicas has typically been the standard for storage systems like Ceph. It means that 3 copies exist at all times in normal operation to prevent data loss in the event of a failure. In Ceph’s lingo, identical data is stored on 3 OSDs; when one of the OSDs fails, the two remaining replicas can still tolerate one of them failing without loss of data. Depending on the Ceph settings and the storage available, when Ceph detects the failed OSD, it will wait in the “degraded” state for a certain time, then begin a copy process to recover back to 3 replicas. During this wait and/or copy process, Ceph is not in danger of data loss even if another OSD fails.
There are two downsides to consider. The first is slower maximum performance, as the storage system must write the data 3 times. Your applications may operate well under the maximum performance, though, so maximum performance may not be a factor.
The second downside is cost: with 3 replicas, storing 1GB of user data consumes 3GB of storage space.
With data center grade SATA SSD and NVMe drives, the mean time between failure (MTBF) is better than traditional spinning drives. Spinning drive reliability is what drove the initial 3 replica standard. Large trustworthy data sets describe a 4X to 6X MTBF advantage to SSDs over HDDs. This advantage has led to many cloud administrators moving to 2 replicas for Ceph when running on data center grade SSDs. Both our HC Smalls and HC Standards use data center grade SSDs.
Considerations for 2 replicas:
First, with two replicas, during a failure of one OSD there is a window when the loss of a second OSD will result in data loss. This window spans the timeout that allows the first OSD to potentially rejoin the cluster plus the time needed to create a new replica on a different running OSD. This risk is real, but it is offset by the very low chance of this occurring and by how readily you can recover data from a backup.
Storage space is more economical, as storing 1GB of user data only consumes 2GB of raw storage
Maximum IOPS may increase as Ceph only needs to write 2 copies before acknowledging the write
Latency may decrease as Ceph only needs to write 2 copies before acknowledging the write
For Ceph data redundancy, why choose 3 replicas versus 2 replicas or vice-versa?
If you need to maximize your usable disk space, our general preference is Replica 2. This choice is based on the following:
We supply only data center grade SATA SSD and NVMe drives. The Mean Time Between Failure (MTBF) of a typical hard drive is 300,000 hours. Most recommendations for, and the history of, selecting 3 replicas come from hard drive use cases taking into account this failure rate. Both our SATA SSDs and our NVMe drives have an MTBF of 2 million hours. Though failures will certainly still occur, they are roughly 6 times less likely than with an HDD.
When Ceph has been hyper-converged onto 3 servers with a replica level of 3 and you lose one of the 3 members, Ceph cannot recover itself out of the degraded state until the lost member is restored or replaced. The data is not at risk since two copies remain, but it is now effectively a Replica level of 2. When Ceph has been hyper-converged onto 3 servers with a replica level of 2 and you lose one of the 3 members, Ceph can be set to self-heal by taking any data that has fallen to 1 replica and automatically starting the copy process to recover to a replica level of 2. Your data-loss danger only occurs during the time when only 1 replica is present.
Disaster recovery processes for data have progressed significantly. This will be based on your specific situation, but if restoring data from backups to production is straightforward and fast, then in the extremely rare case of both of the 2 replicas failing in the degraded period, you will then need to recover from backups.
Usable Ceph disk space savings are significant (estimated, not exact):
HC Small, Replica 3 - 960GB * 3 servers / 3 replicas = 960GB usable
HC Small, Replica 2 - 960GB * 3 servers / 2 replicas = 1440GB usable
HC Standard, Replica 3 - 3.2TB * 3 servers / 3 replicas = 3.2TB usable
HC Standard, Replica 2 - 3.2TB * 3 servers / 2 replicas = 4.8TB usable
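The estimates above follow from one rough formula (a sketch that ignores Ceph overhead and the headroom you should keep free):

```python
def usable_capacity_gb(raw_gb_per_server: float, servers: int, replicas: int) -> float:
    """Rough usable Ceph capacity: total raw space divided by the replica count."""
    return raw_gb_per_server * servers / replicas

print(usable_capacity_gb(960, 3, 3))   # 960.0  (HC Small, Replica 3)
print(usable_capacity_gb(960, 3, 2))   # 1440.0 (HC Small, Replica 2)
print(usable_capacity_gb(3200, 3, 2))  # 4800.0 (HC Standard, Replica 2)
```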
Why do you recommend using less than 100% of a cloud’s resources?
The following are common factors for running your cloud below its theoretical limit:
Workloads will spike and a buffer is important to allow for spikes in a healthy way.
On failure of a node, you will need sufficient resources on the remaining nodes to absorb the workload of the failed node.
Hyper-converged and converged systems have many Control Plane services that require resources. In the event of a spike in resources needed by any one or more of those services, you will want to have some buffer for these cases. Running close to the theoretical limit opens you up to the theory being off by enough to make the system unstable.
If you are running a mixture of different server types, a lower total cloud utilization is typical. Homogeneous clouds are easier to balance and can often be run closer to 100%. We recommend not having a continuous utilization of over 80% on smaller clouds or over 90% on large, homogeneous clouds. Check with your Account Manager for additional advice.
There is a final reason as well. You will most likely have configured self-service resources to departments. You need to be sure they have access to additional resources without waking you up in the middle of the night because the system cannot provision what you allocated to them.
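The utilization guidance above can be expressed as a simple check (thresholds taken from this answer; the helper itself is illustrative):

```python
def utilization_ceiling(large_and_homogeneous: bool = False) -> float:
    """Recommended continuous-utilization ceiling: 80% for smaller or
    mixed clouds, 90% for large, homogeneous clouds."""
    return 0.90 if large_and_homogeneous else 0.80

def should_add_servers(used: float, total: float, large_and_homogeneous: bool = False) -> bool:
    """True when continuous utilization exceeds the recommended ceiling."""
    return used / total > utilization_ceiling(large_and_homogeneous)

print(should_add_servers(85, 100))                              # True  (85% > 80%)
print(should_add_servers(85, 100, large_and_homogeneous=True))  # False (85% < 90%)
```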
How do I add a Flex Metal Cloud as a new Region to my current OpenStack?
This is native functionality for OpenStack. The steps (simplified) are as follows:
Create your new Flex Metal Cloud in the new Pod/geographic location - called Cloud 2
Edit Cloud 2’s OpenStack to use Cloud 1’s Keystone as the source of administrative truth and turn off Cloud 2’s Keystone and Horizon. Then edit Cloud 1 to recognize and administer Cloud 2.
Cloud 2 will now appear as a Region in Cloud 1
For the non-simplified version, see “Adding a Region to an Existing OpenStack”.
Do I have access to the OpenStack APIs to automate deployments by using Terraform, Ansible, etc.?
Yes, this is your private cloud!
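For example, tools like Terraform, Ansible, and the `openstack` CLI can all read a standard `clouds.yaml` file; the endpoint, region, and credentials below are placeholders you would replace with your own:

```yaml
# ~/.config/openstack/clouds.yaml -- placeholder values, not real endpoints
clouds:
  my-flexmetal:
    auth:
      auth_url: https://cloud.example.com:5000/v3
      project_name: admin
      username: admin
      password: CHANGE_ME
      user_domain_name: Default
      project_domain_name: Default
    region_name: RegionOne
```

With this in place, `openstack --os-cloud my-flexmetal server list` from the CLI, or `openstack.connect(cloud="my-flexmetal")` from the openstacksdk Python library, will authenticate against your cloud.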
What are the options to grow my Compute and/or Storage resources?
First, a little background on Ceph and creating Storage Pools. The following is important.
All servers will have at least one usable drive for data storage, including servers labeled as Compute. You have the option to use this drive for LVM based storage, Ephemeral storage, or as part of Ceph. Each drive is typically performing only 1 duty and that is our default recommendation*.
For Ceph, if the drive types differ - i.e., SATA SSD vs NVMe SSD vs Spinners - you should not join them together within one Pool. Ceph can support multiple Pools with different performance characteristics, but you should not mix drive types within a Pool. In order to create a Pool that can support Replication of 2, you will need at least two servers. For a Replication of 3, you will need 3 servers. For Erasure Coding, you typically need 4 or more separate servers.
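The server minimums just described can be summarized in a small helper (an illustrative sketch; the scheme names are ours, not Ceph's):

```python
def min_servers_for_pool(data_protection: str) -> int:
    """Minimum separate servers needed for a Ceph Pool, per the guidance above."""
    minimums = {
        "replica-2": 2,       # two copies -> at least two failure domains
        "replica-3": 3,       # three copies -> at least three servers
        "erasure-coded": 4,   # typical minimum for erasure coding
    }
    return minimums[data_protection]

print(min_servers_for_pool("replica-3"))  # 3
```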
If you are creating a large storage Pool with Spinners, we have advice specific to using the NVMe drives as an accelerator for the storage process and as part of the Object Gateway Service. Please check with your Account Manager for more information.
*Of note, though this is not a common scenario yet: with our high performance NVMe drives, the IO available is often much, much higher than typical applications require, so splitting a drive to be both part of Ceph and a local high performance LVM is possible with good results.
With that being said, there are several ways to grow your Compute and Storage past what is within your PCC.
You can add additional matching or non-matching Compute Nodes. Keep in mind that during a failure scenario, you will need to rebalance the VMs from that Node to Nodes of a different VM capacity. Though not required, it is typical practice to keep a Cloud as homogeneous as possible for management ease.
You can add additional matching Converged Servers to your PCC. Typically you will join the SSD to your Ceph as a new OSD, but the drive on the new Node can also be used as Ephemeral storage or as traditional drive storage via LVM. If joined to Ceph, Ceph will automatically balance existing data onto the new capacity. For Compute, once merged with the existing PCC, the new resources will become available inside OpenStack.
You can create a new Converged Cluster. This allows you to select servers that are different from your PCC servers. You will need to use at least 2 servers for the new Ceph Pool, 3 or greater is the most typical. Of note, one Ceph can manage many different Pools, but you can also have multiple Ceph clusters if you see that as necessary.
You can create a new Storage Cloud. This is typically done for large scale implementations when the economy of scale favors separating Compute and Storage. This is also done when Object Storage is a focus of the Storage Cloud. Our Blended Storage and Large Storage have up to 12 large capacity Spinners. They are available for this use and others.
When should I consider moving from Converged to a Stand-alone Compute and Storage?
This depends on your use case. It typically happens naturally as you scale up. You will find that you have some “marooned” resources in your Cluster. For example, marooned could mean you have disk space left over but your RAM/CPU has been consumed. In this scenario, you will just need to add Compute. If you are out of Storage but have plenty of RAM/CPU, you have two options: create a new Storage-only cluster, or shift your base Converged Nodes to 2+ drive Converged Nodes. In the second case you will typically need to move to 3+ of these servers to accommodate Ceph’s Pool rules and then, potentially, retire a few of the single drive Nodes. Consult with your Account Manager for advice.
Can I use the boot drive for data storage or part of a Ceph Pool?
No, that is not recommended unless it is an emergency or similar temporary situation. Those drives are not intended for heavy use and are not rated for high disk writes per day.
I am using a Flex Metal Cloud as part of my Disaster Recovery plan. What do you need to know to be prepared in case we fail over to our Flex Metal Cloud?
Several things come into play here. First, your default available expansion capacity will likely be less than what you need. For example, you keep a 10 node Storage Deployment with us containing 1+ Petabyte of data. Your current deployment using said data is 100 server nodes. Your DR plan would mean you would need to spin up 100 Compute nodes to get back to running order. You will need to work with your Account Manager ahead of time to have that capacity available to you. Very large deployments do require an agreement for this service if it increases our standby server quota.
Second, it is likely that you will be “SWIPing” IP addresses to us to broadcast from our routers.
It is wise to understand the processes above ahead of time and potentially perform a yearly dry run.
What are we responsible for versus what are you responsible for?
As a customer centric business, we want to provide paths to help you succeed. If you are finding a barrier to your success within our system, please escalate your contact within Flex Metal Central. There we provide direct contact to our Product Manager, our Support Manager, and to our company President.
In general, we manage the Networks above your Flex Metal Clouds and we supply the hardware and parts replacements as needed for hardware in your Flex Metal Clouds.
Flex Metal Clouds themselves are managed by your team. If your team has not managed OpenStack and Ceph private clouds before, we have several options to be sure you can succeed.
Complimentary onboarding training matched to your deployment size and our joint agreements
Self-paced free onboarding guides
Free Test Clouds, some limits apply
Paid additional training, coaching, and live assistance
Complimentary Emergency Service - please note this can be limited in the case of overuse. That being said, we are nice people and are driven to see you succeed with an Open Source alternative to the mega clouds.
In addition, we may maintain a free “Cloud in a VM” image you can use for testing and training purposes within your Cloud. This is currently available in our Image Catalog here.
What is the Network Uptime SLA?
The current network performance for 2020 is 99.994%. The base SLA is 99.96%.
How do I get help with OpenStack or Ceph or other software?
In addition to our complimentary, paid, and self guided training, your Account Manager can also connect you with skilled resources in the OpenStack and Ceph Open Source Communities. Options include support levels appropriate for small, medium, and large customers.
I need help in my Private Cloud, how do I let you in?
You will need to add one of our public keys to the server in question. These keys are rotated periodically. You should remove our public keys after service has been rendered. Please see our current public keys here.
Will you help with Linux questions? Or individual Services running on one of my VMs?
Probably not unless you are in a paid training program. If you are not sure, ask in our public forums. That team can often answer more questions there around general administration.
How do I request servers?
You can do it through an API call or through Flex Metal Central. The API is the recommended way, as you can then also script the deployment of that hardware. If you have retained the “stock” deployment, our deployment system can continue to manage adding and removing servers.
How do I return servers?
You can return a server by simply removing all running Cloud services and then requesting removal via API or from Flex Metal Central. To safely remove the server: spin down or move off any VMs; direct your OpenStack to drop management of the server; and detach Ceph from using any drives on the server.
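As a sketch, those drain steps map to the CLI calls below (the hostname, VM ids, and OSD ids are hypothetical, and exact flags can vary by client version):

```python
def drain_node_commands(hostname: str, vm_ids: list[str], osd_ids: list[int]) -> list[str]:
    """Build the OpenStack/Ceph CLI steps to safely drain a node
    before returning it. All ids shown are examples."""
    cmds = [
        # 1. stop OpenStack from scheduling new VMs onto this node
        f"openstack compute service set --disable {hostname} nova-compute",
    ]
    # 2. live-migrate each running VM off the node
    cmds += [f"openstack server migrate --live-migration {vm}" for vm in vm_ids]
    # 3. tell Ceph to rebalance data off this node's OSDs
    cmds += [f"ceph osd out osd.{osd}" for osd in osd_ids]
    return cmds

for cmd in drain_node_commands("hc-node-4", ["vm-01"], [6, 7]):
    print(cmd)
```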
Please see this guide for a more complete explanation.
You can override the safety check from within an API or within Flex Metal Central. This is not recommended and can lead to many issues.
How many servers can I use?
Based on your history with us, our system sets a limit. Your Account Manager can adjust this limit but may require additional information or a formal agreement to adjust it to higher levels.
When should I add more servers?
We generally recommend that Clouds are not run much over 80% of their theoretical capacity. Performance monitoring and node health are key to track.
What do you recommend for Disaster Recovery replication or backups?
This will depend on your situation, but Ceph has native remote replication options. Use of more than one of our locations can often meet your DR requirements. There are also several companies that specialize in Ceph data replication if your rules require a third party. Please contact your Account Manager. For backups in general, the Ceph Object Storage system is one of the best in the industry and that is native to any of your Flex Metal Storage Clouds.
When do we get billed?
Every 7 days based on previous usage.
How much do we pay for bandwidth?
For Egress, you are billed on the 95th Percentile at $0.045 per Mbit per day, or $0.27 per Mbit per week. Your 95th Percentile is calculated over the billing period, up to 7 days. Currently, Ingress is not charged. There is no current way to limit your bandwidth usage from our side, but we encourage you to use the tools within OpenStack and Linux that can limit bandwidth usage.
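As a sketch of how 95th-percentile billing is commonly computed (assuming evenly spaced bandwidth samples; the weekly rate is the figure quoted above):

```python
import math

def weekly_egress_bill(samples_mbit: list[float], rate_per_mbit: float = 0.27) -> float:
    """Sort the period's bandwidth samples, discard the top 5%,
    and bill the highest remaining sample at the weekly rate."""
    ordered = sorted(samples_mbit)
    index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return round(ordered[index] * rate_per_mbit, 2)

# 100 samples from 1 to 100 Mbit: the 5 highest are discarded,
# so the billable 95th-percentile value is 95 Mbit.
print(weekly_egress_bill([float(m) for m in range(1, 101)]))  # 25.65
```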
Can we set a budget of servers that can be used?
Yes, each Flex Metal Cloud can have limits set to keep your team within the budget.
The fastest way to get familiar with Flex Metal is to take a free test drive of our Hyper-Converged Small Cluster.
Discuss your infrastructure requirements with the Flex Metal Cloud Team to get a personalized assessment.