Author: Matthew Hallberg, Storage Protocol Specialist
LeCroy Protocol Solutions Group
This article takes a close look at a specific problem which might occur during the development phase of a new SAS-2 product. Although one issue is dealt with here, the approach and techniques utilized can be generalized to the many different issues that might be encountered while developing and debugging new products.
The Problem: Incomplete READ(10) Commands
A problem has been reported during testing that appears to be associated with READ(10) commands. In order to troubleshoot this problem, the user first needs to understand the issue, then develop a means to replicate the problem, and finally to develop a test scenario to ensure the problem is fixed and does not reoccur.
Understanding protocol issues starts with the use of a protocol analyzer. Replicating the problem and establishing a standard test requires a device or host emulator. For this example, we recommend a test tool that can act as both a protocol analyzer and an emulator, and that fully supports the SAS-2 specification.
Step 1: Identifying the Problem
A key first step in identifying the problem is to capture and examine the actual data traffic taking place on a high-speed serial data link between host and target. For this step we will use the tool in the protocol analyzer mode, which allows one to eavesdrop on the traffic and then capture and display specific problem areas to understand the errors which are occurring.
To set up the system, the protocol analyzer is inserted into the link between target and host as shown in Figure 1. Traffic flows through the protocol analyzer and can be recorded and displayed on the host computer system to which the analyzer is connected.
Figure 1: Test Tool Configured in Protocol Analyzer Mode
Looking for specific problems in a high-speed serial data link can be likened to searching for a very small needle in a very large haystack. On a 6Gb/s serial link, the sheer volume of information transmitted can easily overwhelm the user’s ability to comprehend the issues involved. For problem solving on this scale, a protocol analyzer should be used which is equipped with powerful tools to help pinpoint the problem.
First, the analyzer must have a sophisticated trigger system, which includes the ability to identify a specific traffic pattern or complex sequence of traffic patterns, and can be used to control when to start or stop the analyzer. The analyzer provides the flexibility to position this trigger at the start, at the end, or anywhere in the middle of the actual recorded trace, allowing the user to view conditions which led to the error and to see how the system attempted to recover from the error.
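The effect of trigger positioning can be illustrated with a short sketch. It is only a model of the idea, not the analyzer's actual capture engine: the function name `capture` and the parameters `depth` and `pre_fraction` are assumptions made for this example. A bounded history buffer holds the traffic leading up to the trigger, and the rest of the recording depth is filled with what follows.

```python
from collections import deque
from itertools import islice

def capture(events, is_trigger, depth, pre_fraction):
    """Return up to `depth` events around the first event matching `is_trigger`.

    pre_fraction = 0.0 puts the trigger at the start of the recording,
    1.0 puts it at the end, and values in between put it in the middle.
    """
    # Always leave room for the trigger event itself.
    pre_depth = min(int(depth * pre_fraction), depth - 1)
    post_depth = depth - pre_depth
    history = deque(maxlen=pre_depth)   # oldest pre-trigger events fall off
    stream = iter(events)
    for ev in stream:
        if is_trigger(ev):
            after = list(islice(stream, post_depth - 1))
            return list(history) + [ev] + after
        history.append(ev)
    return list(history)                # trigger never fired

# Events are plain integers here; event 50 stands in for the error.
window = capture(range(100), lambda ev: ev == 50, depth=10, pre_fraction=0.5)
print(window)   # [45, 46, 47, 48, 49, 50, 51, 52, 53, 54]
```

With the trigger in the middle, the recording shows both the conditions that led to the error and the recovery attempt that followed it.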
Second, the analyzer should have a powerful set of data filters which can be used to include or to exclude virtually any type of traffic, allowing the user to record only the data that is relevant to the problem being studied and exclude up to 99% of the rest of the data – which represents additional noise that can obscure the problem.
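As a rough illustration of include and exclude filtering, the sketch below drops link-maintenance traffic from a list of decoded events. The event representation (a type name paired with a payload) and the function `apply_filter` are hypothetical and are used only to show the principle.

```python
def apply_filter(events, include=None, exclude=()):
    """Keep events whose type is in `include` (when given) and not in `exclude`."""
    kept = []
    for ev_type, payload in events:
        if include is not None and ev_type not in include:
            continue
        if ev_type in exclude:
            continue
        kept.append((ev_type, payload))
    return kept

# Made-up sample traffic for illustration only.
events = [
    ("ALIGN", None),
    ("COMMAND", "READ(10)"),
    ("ALIGN", None),
    ("DATA", "512 bytes"),
    ("NAK", "CRC ERROR"),
    ("RESPONSE", "CHECK CONDITION"),
]

relevant = apply_filter(events, exclude={"ALIGN"})
print(len(events), "events reduced to", len(relevant))
```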
For the specific problem being addressed here, the user first selects a trigger for the READ(10) command (see Figure 2). The user then adds a second condition, which requires the READ(10) command to be followed by a CHECK CONDITION response (see Figure 3), thereby focusing only on commands which do not complete successfully and excluding the many commands that are not part of the problem.
Figure 2: Triggering on READ(10) Commands
Figure 3: Limiting Trigger Condition to Incomplete READ(10) Commands
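The two-condition trigger described above behaves like a small state machine. The sketch below models it in software under stated assumptions: the `Event` record and its fields are hypothetical and do not represent the analyzer's data model, while the READ(10) operation code (28h) and the CHECK CONDITION status code (02h) are the standard SCSI values.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

READ_10_OPCODE = 0x28     # SCSI READ(10) operation code
WRITE_10_OPCODE = 0x2A    # SCSI WRITE(10), used only in the sample trace
CHECK_CONDITION = 0x02    # SCSI status code for CHECK CONDITION

@dataclass
class Event:
    """Hypothetical decoded trace event."""
    index: int                      # position in the trace
    kind: str                       # "COMMAND" or "RESPONSE"
    tag: int                        # links a response to its command
    opcode: Optional[int] = None
    status: Optional[int] = None

def find_trigger(events: Iterable[Event]) -> Optional[int]:
    """Return the index of the first CHECK CONDITION that answers a READ(10)."""
    armed_tags = set()              # condition 1: READ(10) commands seen
    for ev in events:
        if ev.kind == "COMMAND" and ev.opcode == READ_10_OPCODE:
            armed_tags.add(ev.tag)
        elif ev.kind == "RESPONSE" and ev.tag in armed_tags:
            if ev.status == CHECK_CONDITION:
                return ev.index     # condition 2 met: trigger fires
            armed_tags.discard(ev.tag)   # completed normally: disarm
    return None

trace = [
    Event(0, "COMMAND", tag=1, opcode=READ_10_OPCODE),
    Event(1, "RESPONSE", tag=1, status=0x00),               # GOOD status
    Event(2, "COMMAND", tag=2, opcode=WRITE_10_OPCODE),
    Event(3, "RESPONSE", tag=2, status=CHECK_CONDITION),    # not a READ(10)
    Event(4, "COMMAND", tag=3, opcode=READ_10_OPCODE),
    Event(5, "RESPONSE", tag=3, status=CHECK_CONDITION),
]
print(find_trigger(trace))   # 5
```

The failed WRITE(10) at index 3 is ignored because the first condition was never met for its tag, which is the narrowing effect the second trigger condition is meant to provide.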
The resulting trace recording reveals multiple locations where the READ(10) command has not successfully completed (see Figure 4). By clicking on any given command, the user can drill down to lower levels of the protocol to reveal increasing detail and determine specifically what caused the command to fail (see exploded detail on Figure 4).
Figure 4: Captured Trace Showing Incomplete READ(10) Command and Expanded Detail Showing NAK CRC Error as the Source of the Problem
Through the ability to easily move up and down the protocol stack, the user can quickly zoom in on the actual cause of this error: in this case, a NAK (CRC ERROR). By examining the other errors that occur elsewhere in the trace, the user determines that in every case in which the READ(10) command failed to complete, a NAK (CRC ERROR) was the fundamental cause of the failure.
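When a trace contains many failures, the same check can be done by tallying causes rather than inspecting each command by hand. The sketch below assumes the failed commands have been exported as simple records. The record layout and the sample values are invented for illustration and are not taken from the trace in Figure 4.

```python
from collections import Counter

# Hypothetical export: one record per failed READ(10), listing the
# link-layer primitives observed while the command was outstanding.
failed_reads = [
    {"tag": 0x0011, "primitives": ["SOF", "EOF", "NAK (CRC ERROR)"]},
    {"tag": 0x0027, "primitives": ["SOF", "EOF", "ACK",
                                   "SOF", "EOF", "NAK (CRC ERROR)"]},
    {"tag": 0x0042, "primitives": ["SOF", "EOF", "NAK (CRC ERROR)"]},
]

def failure_causes(commands):
    """Count the first NAK primitive seen in each failed command."""
    causes = Counter()
    for cmd in commands:
        naks = [p for p in cmd["primitives"] if p.startswith("NAK")]
        causes[naks[0] if naks else "no NAK observed"] += 1
    return causes

print(failure_causes(failed_reads))   # Counter({'NAK (CRC ERROR)': 3})
```

A single entry in the tally supports the conclusion that one mechanism is behind every failure; more than one entry would suggest separate problems to investigate.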
To document the source of the error in a way that can be shared with others on the development team, the user bookmarks the error and provides a comment describing the source of the error (see Figure 5). This trace can then be shared with other members of the development team, any of whom can view the original trace along with the user’s comments on their own systems by using the trace viewer.
Figure 5: Adding a Bookmark to a Trace File
Step 2: Replicating the Problem
With specific information in hand on how the error is created, the user can now move on to providing a way to replicate this error for test purposes. For this task the user will use the same system but will use it in the host emulation mode to recreate the error condition in a controlled and repeatable fashion. The problem, in this case, is believed to be associated with the firmware in the drive. For host emulation, the system will be connected directly to the target drive and will take the place of the host system while running the target through a test or series of tests intended to replicate this error.
Using host emulation, the user creates a script that simulates the activity of a host, first to set up the drive using strings of commands, loops and other logic, and then to create the error condition by generating a NAK (CRC ERROR) on the handshake of the incoming data frame from the drive. This task is considerably simplified by the ability of the test tool to use the prerecorded trace as a starting script for the emulator, allowing the bulk of the needed script to simply be imported into the emulator.
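The general shape of such a script can be sketched as follows. This is not the emulator's scripting language: the step names `SEND_COMMAND` and `ON_DATA_FRAME` and the function `build_script` are hypothetical, and the sketch assumes for simplicity that each read returns a single data frame. The 10-byte READ(10) command descriptor block layout (operation code, flags, 32-bit logical block address, group number, 16-bit transfer length, control) follows the standard SCSI definition.

```python
import struct

READ_10_OPCODE = 0x28

def build_read10_cdb(lba: int, blocks: int) -> bytes:
    """Pack a 10-byte READ(10) CDB with big-endian LBA and transfer length."""
    return struct.pack(">BBIBHB", READ_10_OPCODE, 0, lba, 0, blocks, 0)

def build_script(start_lba, blocks, repeats, nak_on_iteration):
    """Build a list of hypothetical emulator steps for a loop of reads."""
    steps = []
    for i in range(repeats):
        cdb = build_read10_cdb(start_lba + i * blocks, blocks)
        steps.append(("SEND_COMMAND", cdb.hex()))
        # On the chosen iteration, answer the incoming data frame with
        # NAK (CRC ERROR) instead of ACK to recreate the failure.
        reply = "NAK (CRC ERROR)" if i == nak_on_iteration else "ACK"
        steps.append(("ON_DATA_FRAME", reply))
    return steps

script = build_script(start_lba=0x1000, blocks=8, repeats=3, nak_on_iteration=1)
for step in script:
    print(step)
```

Keeping the injected error as a parameter makes the same script usable both for reproducing the failure and, with the injection disabled, as a baseline run.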
By testing with the host emulator, the user is able to verify that the emulator can recreate the error in a reliable and repeatable way, and can therefore take the place of a much more complicated host environment while this error is being corrected. New firmware for the target can be quickly and easily tested against the emulator to ensure that the error is now handled correctly and that the system recovers properly. This gives confidence that the new firmware allows the drive to work as designed, and that the drive can be reintroduced into the main system with this error resolved.
Note that if the problems were associated with the driver software in the host, the user would also be able to reproduce the same type of error, while using the test tool in device emulation mode to emulate the drive. The user is able to import fields from a known target and emulate that target with respect to device address and other properties. The user can also create a wide variety of error conditions, including simulating a bad drive while testing the drivers on the host to determine their ability to correctly identify and recover from these errors.
Step 3: Developing Test Suites
As these simple examples demonstrate, product development is often paced by the identification and resolution of numerous small issues that affect the ability of the development team to bring a product to market quickly. In many cases solutions to a problem are not difficult to find once the problem is correctly understood. Finding problems rapidly and tracing those problems quickly and correctly to their sources is a major factor in the overall progress of new product development. This issue is exacerbated when the new product is designed for an emerging new standard and there are few, if any, existing products on the market to test against to ensure interoperability. In such cases the developers must be even more rigorous in testing thoroughly against the requirements of the specification to ensure that the new product will interoperate with other new products as they emerge.
Emulators can be invaluable during this process, both by providing a set of canned tests intended to ensure compliance with the requirements of the new standard and by supporting the development of product-specific tests. Rerunning these tests helps ensure that old problems do not recur after future code or other product changes. Because an emulator can produce special test conditions for error recognition and recovery, including many errors that cannot be easily simulated using standard equipment, it can save critical time to market.
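A regression suite of this kind can be thought of as a table of injected errors and expected outcomes that is rerun after every firmware change. The sketch below is a minimal illustration under stated assumptions: `stub_device` is a stand-in for the real drive behind the emulator, and the outcome strings and case names are hypothetical.

```python
from typing import Callable, List, NamedTuple

class Case(NamedTuple):
    name: str
    injected_error: str   # error condition the emulator injects
    expected: str         # outcome agreed to count as correct handling

def run_suite(device: Callable[[str], str], cases: List[Case]) -> List[str]:
    """Run every case and return the names of the cases that failed."""
    return [c.name for c in cases if device(c.injected_error) != c.expected]

def stub_device(injected_error: str) -> str:
    """Stand-in for the drive under test; replace with a real emulator run."""
    handled = ("none", "NAK (CRC ERROR)")
    return "recovered" if injected_error in handled else "hung"

cases = [
    Case("baseline read", "none", "recovered"),
    Case("read with NAK on data frame", "NAK (CRC ERROR)", "recovered"),
    Case("read with ACK/NAK timeout", "ACK/NAK timeout", "recovered"),
]

failures = run_suite(stub_device, cases)
print(failures)   # ['read with ACK/NAK timeout']
```

Each problem found during development adds a row to the table, so a fix that later regresses shows up as a named failure rather than as a new field report.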
The advantages of emulation in development environments are clear – rapid development cycles, first-to-market with new products and outstanding product reliability – all achieved through the use of host and device emulators.