Reliability and Availability

Author: Martin Czekalski
Maxtor Corporation

Reliability, availability, and scalability are the cornerstones of online transaction processing (OLTP) and of enterprise-class computing systems and storage subsystems. In this issue of Serial Storage Wire we will cover the first two as they apply to Serial Attached SCSI (SAS); scalability will be covered in the next issue.

The terms reliability and availability are often misunderstood and used interchangeably, when in fact they refer to two different attributes of a system or its components. A component’s reliability is typically specified as Mean Time Between Failures (MTBF), which represents the mean number of component hours accumulated between failure events when a large sample of components is run in operation. Another important aspect of reliability is the ability to detect errors when they occur and take appropriate action so that they do not adversely affect the integrity of a system or its data (error detection and containment).
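
To make the MTBF definition concrete, here is a minimal Python sketch. The function names and the sample figures are illustrative assumptions, not values from any drive specification, and the annualized failure rate relies on the additional assumption of a constant failure rate.

```python
import math

def estimate_mtbf(units: int, hours_per_unit: float, failures: int) -> float:
    """Mean component-hours accumulated per failure event in a test population."""
    if failures <= 0:
        raise ValueError("need at least one observed failure to estimate MTBF")
    return units * hours_per_unit / failures

def annualized_failure_rate(mtbf_hours: float, hours_per_year: float = 8760.0) -> float:
    """Chance of at least one failure in a year, ASSUMING a constant failure rate."""
    return 1.0 - math.exp(-hours_per_year / mtbf_hours)

# Hypothetical population: 1,000 components run for 1,000 hours each, 2 failures.
mtbf = estimate_mtbf(units=1000, hours_per_unit=1000, failures=2)
print(mtbf)  # 500000.0
```

Note that an MTBF of 500,000 hours does not mean that any one component is expected to run that long; it is a statistic of the population as a whole.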

Availability, on the other hand, is the percentage of time a system or component is available for use. Calculating availability requires more than MTBF: Mean Time to Repair (MTTR) and the overall system architecture must also be taken into account. In many cases individual system components can fail without affecting the availability of the system (e.g. a RAID system can usually remain available even if a single disk drive fails).
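
For a single repairable component, the commonly used steady-state relationship is availability = MTBF / (MTBF + MTTR). The Python sketch below applies it; the function names and the sample numbers are assumptions chosen for illustration only.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of one repairable component."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)

def downtime_hours_per_year(avail: float, hours_per_year: float = 8760.0) -> float:
    """Expected unavailable hours per year for a given availability."""
    return (1.0 - avail) * hours_per_year

# Hypothetical component: fails every 10,000 hours on average, 10 hours to repair.
a = availability(mtbf_hours=10_000, mttr_hours=10)
print(f"{a:.4%} available, about {downtime_hours_per_year(a):.1f} hours down per year")
```

This covers one component in isolation. The system architecture, such as the RAID redundancy mentioned above, determines how component figures combine into system availability.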

SAS System Characteristics
Let’s take a look at a SAS system and its infrastructure and examine its reliability and availability characteristics. SAS disk drives typically have much higher MTBF ratings than SATA drives, and those ratings are specified under 24/7 high-use operating conditions such as OLTP. SATA drive MTBF ratings, by contrast, are based on nine hours of operation per day under much lower-use conditions. Many of these differences were described in the last issue of Serial Storage Wire, so let’s look at some of the other aspects of the system and protocols.

The physical links in a SAS system carry both the SATA and SAS protocols. In both cases, transmitted frames are protected by an Error Detection Code (EDC) that allows the receiver to detect transmission errors. When an error is detected, the protocol reports it and a recovery mechanism is initiated at a higher level in the protocol stack. This allows operations to continue if the error was a transient event, or halts operations if it is a hard error. In both protocols, erroneous data or commands are prevented from corrupting stored data or operations.
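
The detect, notify, and recover pattern described above can be sketched in a few lines of Python. This is an illustration only: `zlib.crc32` stands in for the link's EDC, and the frame layout, exception names, and retry policy are assumptions rather than the behavior defined by the SAS or SATA specifications.

```python
import zlib

class TransmissionError(Exception):
    """The receiver's EDC check failed for one frame."""

class HardLinkError(Exception):
    """The error persisted across retries, so operations are halted."""

def build_frame(payload: bytes) -> bytes:
    # Append a 32-bit check value computed over the payload.
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def check_frame(frame: bytes) -> bytes:
    if len(frame) < 4:
        raise TransmissionError("frame too short to carry an EDC")
    payload, received = frame[:-4], int.from_bytes(frame[-4:], "big")
    if zlib.crc32(payload) != received:
        # Notify the caller; the corrupted payload is never returned.
        raise TransmissionError("EDC mismatch")
    return payload

def receive_with_retry(receive, max_attempts: int = 3) -> bytes:
    """Higher-level recovery: retry transient errors, give up on hard errors."""
    for _ in range(max_attempts):
        try:
            return check_frame(receive())
        except TransmissionError:
            continue
    raise HardLinkError(f"no valid frame after {max_attempts} attempts")

good = build_frame(b"WRITE data")
bad = bytes([good[0] ^ 0x01]) + good[1:]          # one bit flipped in transit
attempts = iter([bad, good])                      # transient error, then success
print(receive_with_retry(lambda: next(attempts)))  # b'WRITE data'
```

The key property is that a frame failing the check is discarded before its contents can reach storage, which is what keeps erroneous data or commands from corrupting stored data.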

To ensure reliable routing through expanders, the SAS protocol includes mechanisms to detect dropped or misrouted frames. Each SAS frame includes a hashed version of the SAS source and destination addresses, as well as a data offset. These features provide added protection against data corruption due to dropped or misrouted frames. In addition to these error detection mechanisms, numerous event counters and logging mechanisms throughout the SAS architecture record the occurrence of error events. These mechanisms give system management functions the visibility needed to isolate such events so that appropriate recovery actions can be taken.
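
A rough Python sketch of how a receiver might use the hashed addresses and the data offset follows. Everything here is an assumption for illustration: `hash_address` is a stand-in (a truncated CRC, not the hash the SAS standard defines), and the `Frame` fields, the addresses, and the error counter are hypothetical.

```python
import zlib
from dataclasses import dataclass

def hash_address(sas_address: int) -> int:
    """Stand-in hash of a 64-bit address to a short value (NOT the SAS-defined hash)."""
    return zlib.crc32(sas_address.to_bytes(8, "big")) & 0xFFFFFF

@dataclass
class Frame:
    hashed_src: int
    hashed_dst: int
    data_offset: int
    payload: bytes

class FrameError(Exception):
    pass

class Receiver:
    def __init__(self, own_address: int, peer_address: int):
        self.own_hash = hash_address(own_address)
        self.peer_hash = hash_address(peer_address)
        self.expected_offset = 0
        self.buffer = bytearray()
        self.error_count = 0  # simple event counter for management software

    def accept(self, frame: Frame) -> None:
        if frame.hashed_dst != self.own_hash or frame.hashed_src != self.peer_hash:
            self.error_count += 1
            raise FrameError("misrouted frame: address hash mismatch")
        if frame.data_offset != self.expected_offset:
            self.error_count += 1
            raise FrameError("dropped or out-of-order frame: unexpected data offset")
        self.buffer += frame.payload
        self.expected_offset += len(frame.payload)

# Hypothetical addresses for an initiator and a target.
INITIATOR, TARGET = 0x5000C50000000001, 0x5000C50000000002
rx = Receiver(own_address=TARGET, peer_address=INITIATOR)
src, dst = hash_address(INITIATOR), hash_address(TARGET)
rx.accept(Frame(src, dst, 0, b"abcd"))
try:
    rx.accept(Frame(src, dst, 8, b"ijkl"))  # the frame at offset 4 was dropped
except FrameError as err:
    print(err, "| errors logged:", rx.error_count)
```

In this sketch a rejected frame never reaches the buffer, and the counter gives management software a record of the event.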

A typical high-availability SAS configuration is shown in the diagram below. This configuration shows two RAID controllers, two sets of expanders (forming two SAS domains) and drives attached to each set of expanders. Each port on the dual-port SAS drives is attached to one set of expanders. Since SATA drives have only a single port, they are attached to each set of expanders through a device called a Port Selector. Let’s consider how this configuration provides high availability.


Redundant Controllers Protect Against Failure
The RAID controllers protect against individual drive failures, while redundant controllers protect against RAID controller failures. Each controller can take over the workload of the other if one of them fails. Because there are two independent sets of expanders, two SAS domains are established, and after a failure in either domain a path can still be established to any of the disk drives so that operation continues. Even though any one system component may fail, at a rate consistent with its specified MTBF, the system as a whole is still able to operate.
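
The path redundancy can be checked with a small Python model. It is a simplification under stated assumptions: the component names are hypothetical, each domain is treated as a single chain of controller, expander set, and drive port, and any cross-connections between domains are ignored.

```python
from typing import Iterable, List

# Hypothetical component names, one chain per SAS domain.
PATHS = {
    "A": ("controller_a", "expanders_a", "drive_port_a"),
    "B": ("controller_b", "expanders_b", "drive_port_b"),
}

def surviving_paths(failed: Iterable[str]) -> List[str]:
    """Domains whose every component is still working."""
    failed_set = set(failed)
    return [name for name, parts in PATHS.items() if not failed_set.intersection(parts)]

def drive_reachable(failed: Iterable[str]) -> bool:
    """True if at least one complete path to the drive remains."""
    return bool(surviving_paths(failed))

print(surviving_paths({"controller_a"}))                 # ['B']
print(drive_reachable({"controller_a", "expanders_b"}))  # False
```

Any single failure leaves one domain intact. Only a combination that breaks both domains, such as the second example, makes the drive unreachable in this model.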

SAS and SATA Operational Differences
There are, however, some operational differences between the way the dual ports on SAS drives operate and the way the Port Selector performs fail-over. The ports on SAS drives are true dual ports that can process commands from both controllers simultaneously. This not only allows either controller to access the drive, but also allows load balancing between domains for improved performance, and faster recovery in the event of a failure in one of the domains. With a Port Selector, one controller is chosen to access the SATA drive and the other port is inactive. When a failure occurs, control must be switched to give the remaining controller access to the drive. This adds time and complexity to the recovery process.
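
The difference between the two fail-over models can be illustrated with a toy Python model. The class and method names are hypothetical, and the model ignores timing; it only shows that the Port Selector path needs an explicit switching step before the second controller can reach the drive.

```python
class PathDown(Exception):
    """A command cannot reach the drive through the requested port."""

class DualPortSasDrive:
    """Both ports are active, so either controller can issue commands at any time."""
    def __init__(self):
        self.failed_ports = set()

    def submit(self, port: str, command: str) -> str:
        if port in self.failed_ports:
            raise PathDown(f"port {port} is down")
        return f"{command} via port {port}"

class PortSelectorSataDrive:
    """Only the selected port is active; the other is idle until fail-over."""
    def __init__(self, active_port: str = "A"):
        self.active_port = active_port
        self.failed_ports = set()

    def submit(self, port: str, command: str) -> str:
        if port in self.failed_ports:
            raise PathDown(f"port {port} is down")
        if port != self.active_port:
            raise PathDown(f"port {port} is inactive; fail-over required")
        return f"{command} via port {port}"

    def fail_over(self, to_port: str) -> None:
        if to_port in self.failed_ports:
            raise PathDown(f"cannot fail over to failed port {to_port}")
        self.active_port = to_port  # the extra recovery step

sas = DualPortSasDrive()
sas.failed_ports.add("A")
print(sas.submit("B", "READ"))   # READ via port B

sata = PortSelectorSataDrive(active_port="A")
sata.failed_ports.add("A")
sata.fail_over("B")              # must happen before port B can be used
print(sata.submit("B", "READ"))  # READ via port B
```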

Calculating availability for such a system requires knowing the MTBF and MTTR of each system component, a thorough understanding of all the possible failure points, the ability to isolate failure events, and the time required for the system hardware, software, and firmware to recover from each event. A failure in one of the system components does not usually translate into a system failure from the user’s viewpoint. The system continues to operate, and the component failure can be treated as a system maintenance event.
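
As a first approximation, component availabilities can be combined with the standard series and parallel reliability formulas. The Python sketch below does this for two redundant paths. It assumes independent failures, omits fail-over and recovery time as well as common-mode failures, and uses hypothetical MTBF and MTTR figures, so it is a starting point rather than a complete availability model.

```python
from math import prod

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of one repairable component."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def series(*avails: float) -> float:
    """All components must work: multiply availabilities."""
    return prod(avails)

def parallel(*avails: float) -> float:
    """At least one branch must work: one minus the product of unavailabilities."""
    return 1.0 - prod(1.0 - a for a in avails)

# Hypothetical figures (hours); not taken from any product specification.
controller = availability(mtbf_hours=300_000, mttr_hours=4)
expanders = availability(mtbf_hours=500_000, mttr_hours=4)

one_path = series(controller, expanders)   # a single SAS domain
two_paths = parallel(one_path, one_path)   # two independent domains

print(f"single path unavailability: {1 - one_path:.2e}")
print(f"dual path unavailability:   {1 - two_paths:.2e}")
```

Under these assumptions the second path reduces unavailability by several orders of magnitude, which is why a single component failure can be handled as a maintenance event rather than an outage.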

SAS is Cost-Effective in Mainstream Applications
The reliability and availability features inherent in the SAS architecture were previously available only on high-end computer and storage systems. With SAS, these capabilities can be cost-effectively included in the mainstream server, storage and application markets, increasing the choices and value propositions to end users and IT professionals.
