Storage systems are absolutely fundamental for the future of the cloud, and there is a hunger for ultra-reliable, high-performance, and expansive storage systems. It is now commonplace to see flash-based storage, SSD arrays, and traditional mechanical disks in the data center. Storage hardware has progressed extensively in recent years, and RAID technology is an ever-present design choice.
To quickly recap, RAID technology underpins nearly every server environment, and its purpose is to improve I/O performance (IOPS) and preserve data integrity. There are several RAID configurations that behave differently depending on the scenario: RAID that provides very fast I/O, RAID that creates multiple layers of data protection, and RAID that blends both I/O and data protection.
This article is for advanced users, so we assume that you are already familiar with storage and RAID. If you are just starting the journey, it is advisable to read an introduction to RAID first; you can find introductions on the Atlantic.Net blog here, and here for RAID 10.
Remember that problems with RAID can develop over an extended time, so it’s important to have a number of safeguards in place, such as replication and data backups. RAID is a form of protection, not a backup solution.
Why Is It Best Practice To Stop Using A RAID Array If A Drive Fails?
Local server RAIDs are typically very reliable, but occasionally issues arise that impact the integrity of the array. RAIDs are susceptible to environmental events such as power loss, inadequate cooling, or increased humidity. If you use RAID 1, 5, 6, or a nested level, your data should survive a disk failure provided precautions are followed.
If a drive fails in a RAID 1 configuration, it is recommended to stop using the server and any attached disks. RAID 1 mirrors the data between at least 2 disks, giving you 1:1 copies of the data. If a disk fails, the RAID will go into a degraded state until the disk is replaced and the mirror is rebuilt. In a two-disk mirror, you will effectively be working on a single disk with zero redundancy. If the remaining working disk(s) fail, all of the data will be lost, so swapping out the failed disk should be a top priority.
How Does Each RAID Level Affect Performance?
RAID can provide many performance benefits; it’s possible to get very fast performance and strong data resilience by using nested RAID. In parity-based levels, parity blocks are written to the RAID member disks and the storage controller tracks where each block resides. This is all done in a near-instant fashion, but the type of RAID configuration used can impact performance significantly.
RAID performance depends on several factors: the capacity of the disk, the number of disks in the array, the type of disk (flash / SSD / mechanical), and read/write IOPS. Working out the performance is complicated; however, there is a great calculator here.
RAID 0 Performance: RAID 0 is generally the fastest option, but there is no data protection built in. If any disk in the array fails, all data is gone. RAID 0 divides the data between at least 2 disks, and every striped disk in the array can be used simultaneously to read/write. The more disks you use in RAID 0, the better the performance.
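To make the striping idea concrete, here is a minimal sketch of how a logical block could be mapped onto the member disks of a RAID 0 array. It assumes a simple round-robin layout with one block per stripe unit; the function name `locate_block` is hypothetical, and real controllers use configurable stripe sizes.

```python
def locate_block(logical_block: int, num_disks: int) -> tuple[int, int]:
    """Map a logical block number to (disk index, block offset on that disk).

    Assumes round-robin striping with a stripe unit of one block.
    """
    if num_disks < 2:
        raise ValueError("RAID 0 needs at least 2 disks")
    if logical_block < 0:
        raise ValueError("logical_block must be non-negative")
    return logical_block % num_disks, logical_block // num_disks


# Six consecutive logical blocks spread across a 3-disk stripe.
layout = [locate_block(block, 3) for block in range(6)]
print(layout)
```

Because consecutive blocks land on different disks, a large sequential read or write can keep every disk busy at once, which is why adding disks tends to improve throughput.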
Nested RAID performance: Nested RAID levels like RAID 10, 50, and 60 stripe data across two or more sub-arrays: mirrored sets for RAID 10, RAID 5 sets for RAID 50, and RAID 6 sets for RAID 60. Read performance is great and writing is good. However, write latency is introduced because every write also has to update the mirror copy or the parity inside the sub-array.
RAID 10 has multiple mirrored sets, while RAID 50 has more disks per sub-array and tolerates the failure of 1 disk per sub-array. RAID 60 is very much the same, except that 2 disks can fail in each sub-array. Nested RAID levels require many disks, and the available disk capacity ranges from 50% up to 80% depending on the level and the number of disks.
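The capacity range quoted above can be checked with some simple arithmetic. The sketch below is illustrative only: the function name `usable_fraction` is hypothetical, and it assumes identical disks, RAID 10 built from mirror sets that each hold one disk's worth of data, and one (RAID 50) or two (RAID 60) disks' worth of parity per sub-array.

```python
def usable_fraction(level: str, sub_arrays: int, disks_per_sub_array: int) -> float:
    """Return the fraction of raw capacity left for data in a nested array."""
    if level == "10":
        # Each mirror set stores a single disk's worth of data.
        minimum, data_disks = 2, 1
    elif level == "50":
        # One disk's worth of parity per RAID 5 sub-array.
        minimum, data_disks = 3, disks_per_sub_array - 1
    elif level == "60":
        # Two disks' worth of parity per RAID 6 sub-array.
        minimum, data_disks = 4, disks_per_sub_array - 2
    else:
        raise ValueError(f"unsupported nested level: {level}")
    if sub_arrays < 2 or disks_per_sub_array < minimum:
        raise ValueError("not enough disks for this level")
    total_disks = sub_arrays * disks_per_sub_array
    return (sub_arrays * data_disks) / total_disks


print(f"RAID 10, 2 x 2 disks: {usable_fraction('10', 2, 2):.0%}")  # 50%
print(f"RAID 50, 2 x 5 disks: {usable_fraction('50', 2, 5):.0%}")  # 80%
print(f"RAID 60, 2 x 4 disks: {usable_fraction('60', 2, 4):.0%}")  # 50%
```

Larger sub-arrays raise the usable fraction for RAID 50 and 60, but they also put more disks behind each set of parity.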
RAID 5 Performance: RAID 5 is all about resilience; performance is good for reads and fair for writes. The introduction of parity on a single array creates a write bottleneck. To limit this problem, you can leverage SSD caching or, if feasible, purchase flash memory storage.
RAID 6 Performance: RAID 6 is even more resilient than RAID 5, allowing up to 2 disks to fail at any one time. The trade-off is that 2 parity blocks are written for each stripe update, making write performance noticeably slower than RAID 5.
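The write cost of parity can be approximated with the commonly cited "write penalty" rule of thumb. This is a rough sketch, not a vendor formula: the penalty values (1 for RAID 0, 2 for RAID 1 and 10, 4 for RAID 5, 6 for RAID 6) are a widely used approximation rather than figures from this article, the function name `effective_iops` is hypothetical, and controller caching is ignored.

```python
# Rule-of-thumb back-end I/Os generated by one front-end write (assumed values).
WRITE_PENALTY = {"0": 1, "1": 2, "10": 2, "5": 4, "6": 6}


def effective_iops(level: str, disks: int, iops_per_disk: float,
                   write_fraction: float) -> float:
    """Estimate front-end IOPS for a given read/write mix."""
    if not 0 <= write_fraction <= 1:
        raise ValueError("write_fraction must be between 0 and 1")
    raw_iops = disks * iops_per_disk
    penalty = WRITE_PENALTY[level]
    # Reads cost one back-end I/O each; writes cost `penalty` back-end I/Os.
    return raw_iops / ((1 - write_fraction) + write_fraction * penalty)


# Example: 8 disks at 150 IOPS each with a 30% write workload.
for level in ("0", "10", "5", "6"):
    print(f"RAID {level}: {effective_iops(level, 8, 150, 0.3):.0f} IOPS")
```

With the same disks and workload, the estimate drops from RAID 0 to RAID 10 to RAID 5 to RAID 6, which matches the ordering described above.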
What About Software vs Hardware RAID?
The type of RAID controller impacts performance considerably. Controllers can be either physical, often dedicated riser cards, or software-defined. Software controllers are cheap, but they can impact performance drastically. A software controller runs RAID processing on the server's CPU; this is generally a bad idea in business-critical systems because RAID processing, particularly parity calculation, can be CPU intensive.
Most servers and production systems are configured with physical controller cards. These riser cards plug directly into the system board of the server and process all RAID I/O requests. This is not only much faster than using the server CPU, but it also reduces I/O bottlenecks. Physical controller cards enable advanced features such as multi-pathing and failover channels for redundancy, and larger production servers will often have multiple riser cards installed.
How Is the RAID Rebuild Time Affected by RAID Level?
When a hard disk fails within an array, the RAID controller marks the disk as failed. The controller will automatically use any available hot spare disk to rebuild the array. In parity-based levels, the parity blocks stored across the surviving disks are used by the controller to reconstruct the missing data; in mirrored levels, the data is copied from the surviving mirror.
The time taken to rebuild the array depends on the RAID level, the size and speed of the disks, and the size of the array. RAID 5 is a common configuration because you get good performance and decent resilience, and only 1 disk needs rebuilding at a time. Bear in mind that a RAID 5 array has no redundancy left until the rebuild completes.
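The parity-based rebuild described above can be illustrated with XOR, the operation behind single-parity schemes such as RAID 5. This is a simplified sketch with one stripe and tiny in-memory "blocks"; the helper name `xor_blocks` is hypothetical, and real controllers rotate parity across disks and work on much larger stripe units. RAID 6 adds a second, differently computed parity block that is not shown here.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    if not blocks:
        raise ValueError("at least one block is required")
    if any(len(block) != len(blocks[0]) for block in blocks):
        raise ValueError("all blocks must be the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: three data blocks plus the parity block computed from them.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Simulate losing the second disk, then rebuild it from the survivors.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data[1])  # True
```

Every surviving disk has to be read in full to regenerate the missing one, which is one reason rebuilds put the remaining disks under heavy load.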
The disk utilization, the used space, and the size of the disk all impact the time taken to rebuild the array. For mechanical disks, you are looking at about 24 hours per 1TB disk if the system is heavily used.
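As a back-of-the-envelope check, the figure above can be turned into an implied rebuild rate and reused for other disk sizes. This sketch assumes decimal units (1 TB = 10^12 bytes) and a constant rebuild rate; both function names are hypothetical, and real rebuild speeds vary with controller priority settings and workload.

```python
def implied_rate_mb_per_s(disk_tb: float, hours: float) -> float:
    """Average rebuild rate implied by rebuilding `disk_tb` in `hours`."""
    if disk_tb <= 0 or hours <= 0:
        raise ValueError("disk_tb and hours must be positive")
    return (disk_tb * 1e12) / (hours * 3600) / 1e6


def estimate_rebuild_hours(disk_tb: float, rebuild_mb_per_s: float) -> float:
    """Estimate rebuild time for one disk at a constant rebuild rate."""
    if disk_tb <= 0 or rebuild_mb_per_s <= 0:
        raise ValueError("disk_tb and rebuild_mb_per_s must be positive")
    return (disk_tb * 1e12) / (rebuild_mb_per_s * 1e6) / 3600


# 24 hours per 1 TB corresponds to roughly 11.6 MB/s of sustained rebuild traffic.
rate = implied_rate_mb_per_s(1, 24)
print(f"Implied rate: {rate:.1f} MB/s")

# The same rate applied to a hypothetical 4 TB disk.
print(f"4 TB rebuild: {estimate_rebuild_hours(4, rate):.0f} hours")
```

The estimate scales linearly with disk size, so large mechanical disks on a busy system can spend days in a degraded state.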
How Do You Monitor the RAID Array?
Storage systems are often overlooked when it comes to monitoring. This can have catastrophic consequences for the health of the array. Most storage controllers support SMTP, so email alerts can be sent to a team of support engineers with ease.
Most controllers also include a dial-home feature; when it is configured correctly, the storage controller automatically alerts the hardware vendor of any failure or pending failure of a disk or other storage hardware. Depending on your level of paid tech support, replacement hardware and an engineer will be dispatched automatically.
Despite this useful feature, monitoring is essential, so what events should you alert against?
- Disk Health: Monitor disk state to alert against failed or pending disk failures.
- Controller Failover: It’s critical to ensure controller failover is monitored. Failover is not necessarily a bad thing, but if there are configuration mismatches on the disk paths, access to data can be severed.
- Performance: Monitor I/O looking for bottlenecks and performance degradation.
- RAID Card Health: Use a ping check to ensure the controller card is up, and alert on any state changes, including health checks against the node canister processing units.
- Battery Backup Unit Health: The state of the BBU is critical because it preserves the controller's cached writes if power is lost. If the battery is dead and then the array fails, the chance of data loss is significant.
- Error Logs: The event logs on the storage are detailed and highly accurate. A healthy array rarely alerts, so keep your eyes open for events because they are a very good indicator of an issue.
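As one illustration of the disk health check, the sketch below scans status text in the format of Linux `/proc/mdstat` and reports arrays with missing members. It assumes Linux software RAID (md), which this article does not specifically cover; hardware controllers expose the same information through vendor tools instead. The sample text is hand-written for illustration, and the function name `degraded_arrays` is hypothetical.

```python
import re

ARRAY_HEADER = re.compile(r"^(md\w+)\s*:")
MEMBER_STATUS = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")

# Hand-written sample in /proc/mdstat format (not real output).
SAMPLE_MDSTAT = """\
Personalities : [raid1] [raid6] [raid5] [raid4]
md0 : active raid1 sdb1[1] sda1[0]
      1048512 blocks super 1.2 [2/2] [UU]

md1 : active raid5 sdc1[0] sdd1[1] sde1[3](F)
      2096128 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_]

unused devices: <none>
"""


def degraded_arrays(mdstat_text: str) -> dict[str, str]:
    """Return {array name: member flags} for arrays with missing members."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_HEADER.match(line)
        if header:
            current = header.group(1)
            continue
        status = MEMBER_STATUS.search(line)
        if current and status:
            expected, active, flags = status.groups()
            # "_" marks a missing member; "U" marks a healthy one.
            if int(active) < int(expected) or "_" in flags:
                degraded[current] = flags
            current = None
    return degraded


print(degraded_arrays(SAMPLE_MDSTAT))  # {'md1': 'UU_'}
```

In practice a check like this would read the live status on a schedule and feed the result into whatever alerting system the support team already uses.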
Do RAID Types Matter Now That HDDs Have Been Replaced By SATA SSD and NVMe SSDs?
SSDs are so affordable now that it is commonplace to have SSD or NVMe-based storage systems. RAID is still essential for data protection even when using SSD/NVMe. There is certainly less need for RAID 0 arrays; however, solid-state disks work very well in RAID 5, 6, 50, and 60.
Atlantic.Net RAID Options
The ACP Cloud features an exclusively SSD-based, multi-tiered, and highly redundant array. Every cloud customer benefits from the breakneck performance of our storage systems. Customers also have the option to leverage dedicated servers with either SSD or NVMe storage.
To add a Dedicated Host or learn more about our storage options, please contact the Atlantic.Net Sales Team by calling 888-618-DATA (3282) (toll-free) or +1-321-206-3734 (international) or writing to us via the Contact Page and we will be happy to assist you.