QAS Boosts SCSI Transmission Rates

by: Charles Gimar
Performance Analyst, Storage Components Division
LSI Logic Corp., Colorado Springs, CO
Also appeared in EE Times, Nov. 3, 2000.

The growing demand for bandwidth to and from storage is putting pressure on protocols like the Small Computer System Interface (SCSI) point-to-point data-transmission standards. Such protocols must continually evolve to provide more data-transmission bandwidth, greater configuration options and improved management tools.

It is not sufficient to double or quadruple the bandwidth to increase performance. Additional changes must also be made to reduce the nondata overhead portions of the protocol. Quick Arbitrate and Select (QAS) is a technique, first included in the SPI-3 (Ultra160) SCSI standard, to reduce protocol overhead when devices arbitrate for and gain access to the SCSI bus. QAS and Information Unit (packetized) data transfers are features of the SPI-3 standard and are carried forward, with modification, into the SPI-4 (Ultra320) SCSI draft standard. Systems containing Ultra320 SCSI production components implementing packetized data transfers using QAS can be expected in the latter half of 2001.

QAS is responsible for overhead reductions of up to 16 percent over Normal arbitration in packetized Ultra320 SCSI data transfers. Packetized transfers using QAS can have 57 percent less overhead than nonpacketized Normal transfers. These reductions are fully realized in I/O workloads that can queue multiple I/O requests to storage targets.

Information transfers in Ultra320 SCSI can occur with either the legacy Data Group (nonpacketized) protocol or the new Information Unit (packetized) protocol. Under the packetized protocol, command and control information is sent at synchronous data rates, up to 320 Mbytes/s. Under the nonpacketized protocol, all command and control information is sent asynchronously and the maximum synchronous data rate is 160 Mbytes/s. There are often significant differences between theoretical protocol timings and timings that factor in expected command and control overheads along with synchronous data-transfer timings.

“Expected” overheads are the theoretical protocol timings plus additional overhead necessary for physical implementations. Each implementation is different. At Ultra320 SCSI data rates and small sequential I/O operations (8 kbytes or smaller), the protocol portion is the dominant contributor to I/O latency. Any reduction in protocol time improves performance, especially at smaller data block sizes. At 64 kbytes, data phase disconnects add further protocol overhead.
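To see why protocol time weighs most heavily on small blocks, the following sketch computes the raw data-phase time at the Ultra320 peak rate and the resulting overhead share of one I/O. The 20-microsecond overhead is a placeholder assumption (the rough order of magnitude cited in the training discussion later in this article), a kbyte is taken as 1,024 bytes, and the function names are illustrative. Extra disconnects on large transfers are not modeled.

```python
# Peak Ultra320 synchronous rate from the article, in bytes per second.
BUS_RATE_BYTES_PER_S = 320e6

# Placeholder protocol overhead per I/O in microseconds (an assumption).
ASSUMED_OVERHEAD_US = 20.0


def data_time_us(block_bytes, rate=BUS_RATE_BYTES_PER_S):
    """Time to move block_bytes across the bus at the peak rate, in microseconds."""
    return block_bytes / rate * 1e6


def overhead_share(block_bytes, overhead_us=ASSUMED_OVERHEAD_US):
    """Fraction of the total bus time for one I/O that is protocol overhead."""
    return overhead_us / (overhead_us + data_time_us(block_bytes))


if __name__ == "__main__":
    for kbytes in (1, 4, 8, 64):
        size = kbytes * 1024
        print(f"{kbytes:>2} kbytes: data {data_time_us(size):6.1f} us, "
              f"overhead share {overhead_share(size):.0%}")
```

Under these assumptions the overhead share shrinks steadily as the block grows, which is why a fixed saving in protocol time matters most at small block sizes.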

QAS is not applicable to all implementations of Ultra320 SCSI. It is possible for packetized and nonpacketized transfers, Normal and QAS arbitration, and two fairness schemes to coexist on the same physical SCSI bus. These options are typically negotiated at start of day and stored in each device. This potential for many protocol options increases development and interoperability testing requirements, and thereby the risk in deploying solutions that implement QAS.

QAS may be initiated only in packetized operation. A nonpacketized device may respond to the QAS request message and win a QAS arbitration. Arbitration priority is also a concern in considering arbitration type. A new priority scheme, called simply Fairness, must be implemented with QAS and is optional with nonpacketized and Normal implementations. Upon arbitration, the Fairness scheme passes control of the bus to the requesting SCSI device with the next lower SCSI ID that previously lost arbitration. The earlier priority scheme used an already assigned priority, based only upon SCSI ID. Fairness does not affect the results presented here, but can have an impact on real performance when many disk drives are attached to a SCSI bus and device starvation can occur.

Single transition data rate supports up to 80 Mbytes/s data transmission over a 16-bit SCSI bus. Double transition data rate supports up to 160 Mbytes/s over a 16-bit SCSI bus. Paced transfers support up to 320 Mbytes/s over a 16-bit SCSI bus.
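The three peak rates quoted above follow from the bus width multiplied by the transfer rate. The short sketch below reproduces that arithmetic. The megatransfer figures (40, 80 and 160 per second) are not stated in this article; they are the commonly cited values for these modes and should be read as an assumption, as should the names used.

```python
BUS_WIDTH_BITS = 16  # wide SCSI bus, as stated in the article

# Megatransfers per second for each mode (assumed, not from the article).
MEGATRANSFERS_PER_S = {
    "single transition": 40,
    "double transition": 80,
    "paced": 160,
}


def peak_mbytes_per_s(megatransfers, bus_width_bits=BUS_WIDTH_BITS):
    """Peak rate in Mbytes/s: transfers per second times bytes per transfer."""
    return megatransfers * bus_width_bits // 8


if __name__ == "__main__":
    for mode, mt in MEGATRANSFERS_PER_S.items():
        print(f"{mode}: {peak_mbytes_per_s(mt)} Mbytes/s")
```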

Nonpacketized devices cannot initiate a QAS. It has been proposed that this be allowed in order to permit devices such as SCSI Enclosure Services chips to operate in QAS-enabled buses. The T10 committee may consider this proposal in the future.

To explore the matter further, it’s useful to contrast the performance of three implementations. The three, all at Ultra320 SCSI synchronous data speeds, are nonpacketized, Normal (with and without command disconnect); packetized, Normal; and packetized, QAS.
 
The SCSI protocol defines transitions between several bus phases to transfer information between initiator and target. The minimum possible protocol time for a nonpacketized I/O to transfer control from one device to another is 5.8 microseconds.
 
This time includes the Bus Free, Arbitrate and Select phases. In practice the time is greater than this minimum because of additional implementation delays. An initiator begins a data or information transfer by giving a task to a target, the peripheral device that performs the task directed by the initiator.
 
A physical connection is made when an initiator and a target negotiate a new connection during which control and data will pass between the two entities. Whenever a target disconnects from the bus, the bus must make a transition through Bus Free and Arbitration phases before connecting a different initiator-target pair. Disconnects may occur between commands, between data phases, at the end of a complete data transfer, or under error conditions. Each disconnect marks the end of a physical initiator-target connection and is expensive in terms of protocol time.
 
In packetized SCSI, all bus phases except Bus Free, Arbitrate and Select occur at the negotiated synchronous speed. With QAS, the protocol timing of Arbitrate is much shorter than with Normal arbitration. The minimum possible protocol time for a packetized I/O to transfer control from one device to another using QAS is 3.4 microseconds. Logical or physical disconnects can occur between command and data phases. A logical connection occurs between initiator-target pairs at each L_Q packet.
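The two minimum control-transfer times quoted in this article (5.8 microseconds for a nonpacketized I/O with Normal arbitration, 3.4 microseconds for a packetized I/O with QAS) can be compared directly. The sketch below is plain arithmetic on those two figures; the function and constant names are illustrative, and real implementations add delays on top of both minimums.

```python
# Minimum control-transfer times from the article, in microseconds.
NORMAL_HANDOFF_US = 5.8  # nonpacketized, Normal arbitration
QAS_HANDOFF_US = 3.4     # packetized, QAS


def handoff_saving(normal_us=NORMAL_HANDOFF_US, qas_us=QAS_HANDOFF_US):
    """Return (absolute saving in microseconds, saving as a fraction of normal_us)."""
    saved = normal_us - qas_us
    return saved, saved / normal_us


def total_handoff_time_us(handoffs, per_handoff_us):
    """Bus time spent purely on transferring control over a number of handoffs."""
    return handoffs * per_handoff_us


if __name__ == "__main__":
    saved, fraction = handoff_saving()
    print(f"saving per handoff: {saved:.1f} us ({fraction:.0%})")
    for n in (1, 100, 10_000):
        normal = total_handoff_time_us(n, NORMAL_HANDOFF_US)
        qas = total_handoff_time_us(n, QAS_HANDOFF_US)
        print(f"{n:>6} handoffs: Normal {normal:9.1f} us, QAS {qas:9.1f} us")
```

On these theoretical minimums alone, each change of bus control is about 2.4 microseconds shorter with QAS.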

Packetized SCSI gains a performance advantage through two mechanisms. First, all command and control occurs at the synchronous data rate. Second, packetized SCSI allows multiple commands or multiple data blocks or both to be transmitted during a single physical connection. Because the Bus Free and Arbitrate phases of Normal arbitration are time-consuming for each physical connection, reducing the number of physical connections also reduces the time spent in those phases and can decrease the average per-I/O protocol overhead. QAS allows transfer of control from a target to another target or initiator without making a transition through a Bus Free phase. That greatly reduces the time required to transfer bus control and increases performance.

Data transfer time is equivalent for all the examined protocol options. We assume a SCSI bus loaded with seven targets. In practice, with sequential I/O and high-performance disks, bus saturation due to data will occur when transferring large data blocks to and from seven disks. Also, bus saturation due to overhead will occur when doing small block I/Os to fewer than seven disks.

Since the benefit of QAS and packetized transfers occurs with sequences of associated I/Os, overhead for one to 16 queued I/Os per disk is examined. To queue I/Os, the initiator physically connects to a target and, under packetized, sends a group of I/O commands to the disk target. The disk can then process those requests as a group and return data (on a read) as it becomes available in the disk track buffer. Disconnects may occur between phases, but don’t have to.

Nonpacketized SCSI disconnects and transitions to Bus Free after the Command Complete message of each data transfer. If the I/Os are disconnected, there is also a Bus Free phase after each command phase. Other disconnects may occur after parts of the data phase. As per-target queue depth increases, the bus still must make a transition through Bus Free after each data block or command phase or both. Thus, there is no overhead reduction benefit to increasing per-target queue depth in this nonpacketized example. The nonpacketized case with command disconnect has the higher overhead because there is an additional disconnect following the command phase.

In packetized SCSI, a string of commands can be sent to a target without going to Bus Free between each command packet. As per-target queue depth increases, the cost of the initial bus free-arbitrate phase is amortized over several I/Os. There is a similar effect during the data portion of the I/O stream. Thus, there is an average overhead reduction as per-target queue depth increases.
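The amortization effect can be illustrated with a deliberately simple model: a fixed connection cost paid once per physical connection, plus a per-I/O cost that does not shrink. The connection cost below reuses the 5.8-microsecond Normal minimum from this article; the 10-microsecond per-I/O cost and the function names are hypothetical placeholders, not measured values.

```python
CONNECT_COST_US = 5.8   # Bus Free + Arbitrate + Select minimum from the article
PER_IO_COST_US = 10.0   # hypothetical remaining per-I/O protocol cost


def packetized_overhead_us(queue_depth, connect_us=CONNECT_COST_US,
                           per_io_us=PER_IO_COST_US):
    """Average per-I/O overhead when one connection carries queue_depth I/Os."""
    if queue_depth < 1:
        raise ValueError("queue_depth must be at least 1")
    return connect_us / queue_depth + per_io_us


def nonpacketized_overhead_us(queue_depth, connect_us=CONNECT_COST_US,
                              per_io_us=PER_IO_COST_US):
    """Per-I/O overhead when every I/O pays the full connection cost."""
    if queue_depth < 1:
        raise ValueError("queue_depth must be at least 1")
    return connect_us + per_io_us


if __name__ == "__main__":
    for depth in (1, 2, 4, 8, 16):
        print(f"depth {depth:>2}: packetized "
              f"{packetized_overhead_us(depth):5.2f} us, nonpacketized "
              f"{nonpacketized_overhead_us(depth):5.2f} us")
```

In this model the packetized average falls toward the per-I/O cost as queue depth grows, while the nonpacketized figure stays flat, matching the qualitative behavior described above.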

Comparing packetized overheads with Normal and QAS arbitration shows a 2- to 3-microsecond advantage for QAS with more than two queued I/Os per target. That is directly attributable to the lower Bus Free and Arbitrate protocol times of QAS.

Packetized QAS has the lowest overhead when two or more I/Os are queued per target. This benefit also holds as the number of targets per SCSI bus is reduced, as well as for any other block size. The packetized and QAS benefit increases as the number of disconnects rises, as would happen at large block sizes with a saturated SCSI bus.

Implementations of Ultra320 SCSI must include bus training, the process that measures bus timing characteristics and adjusts internal receiver and filter circuits to minimize the effects of timing skew. Training may occur on each physical connection for each direction data is sent across the bus. Alternatively, training may occur only when needed, with parameters recovered from local storage at each change in physical connection.

Training has a negative impact on bus overhead. A delay of up to 2 microseconds is incurred for each direction that training is performed. With protocol overhead times already around 20 microseconds, it is apparent that training on every I/O can be quite costly. If training is requested via a PPR message, the bus must first make the transition to Bus Free. In this case, any benefit of a long string of I/Os in which control is transferred via QAS is diminished each time the bus goes to Bus Free. The analysis in this article assumed a consistent training scheme was used on both packetized and nonpacketized transfers and did not include the training latency in the overhead calculations. Training won’t change the conclusions of this analysis.
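The relative cost of training follows from the two figures given above: up to 2 microseconds per trained direction against a protocol overhead of roughly 20 microseconds. The sketch below treats both as upper-bound, round-number inputs; the function name and defaults are illustrative assumptions.

```python
TRAINING_US_PER_DIRECTION = 2.0  # "up to 2 microseconds" per the article
PROTOCOL_OVERHEAD_US = 20.0      # "around 20 microseconds" per the article


def training_penalty(directions, overhead_us=PROTOCOL_OVERHEAD_US,
                     per_direction_us=TRAINING_US_PER_DIRECTION):
    """Worst-case training time as a fraction of the untrained protocol overhead."""
    if directions not in (0, 1, 2):
        raise ValueError("directions must be 0, 1 or 2")
    return directions * per_direction_us / overhead_us


if __name__ == "__main__":
    for d in (1, 2):
        print(f"training in {d} direction(s): up to "
              f"{training_penalty(d):.0%} added to protocol overhead")
```

Training on every I/O in both directions could therefore add up to about a fifth to the protocol overhead under these round numbers.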

QAS is an enhancement of the SCSI standard that reduces protocol overhead by shortening the time it takes to transfer bus control from one device to another. Overhead reduction allows increased data throughput and increased SCSI bus efficiency, beyond levels available without using QAS. QAS does not occur on each I/O; rather, it’s only used between sequences of I/Os to change the physical connection between initiator and target and is primarily used with packetized transfers. A new bus Fairness algorithm is used with QAS, reducing the possibility of device starvation when large numbers of SCSI devices are connected on a bus. The performance benefit of QAS is reduced each time a bus is trained.

The greatest performance benefit of QAS occurs when I/O block size is 8 kbytes or smaller and with more than two I/Os queued to each target. As SCSI bus synchronous speeds increase and smaller I/Os are dominated by protocol, it is up to novel schemes such as QAS to reduce the protocol overhead portion, allowing the higher performance to be realized. As system throughput requirements continue to increase, it’s vital that I/O standards concentrate on improving both the data transfer portion and the command and control overhead.