Physical Volume Library Deadlock Avoidance in a Striped Media Environment

Jeff K. Deutsch
IBM Government Systems
Houston, Texas
Mark R. Gary
Lawrence Livermore National Laboratory
Livermore, California

Abstract:

Most modern high performance storage systems store data in large repositories of removable media volumes. Management of the removable volumes is performed by a software module known as a Physical Volume Library (PVL). To meet performance and scalability requirements, a PVL can be asked to mount multiple removable media volumes for use by a single client for parallel data transfer. Mounting sets of volumes creates an environment in which it is possible for multiple client requests to deadlock while attempting to gain access to storage resources.

Scenarios leading to deadlock in a PVL include multiple client requests that contend for the same cartridge(s), and client requests that vie for a limited set of drive resources. These deadlock scenarios are further complicated by the potential for volumes to be mounted out-of-order (for example, by Automatic Cartridge Loaders or human operators).

This paper begins by introducing those PVL requirements which create the possibility of deadlock. Next we examine traditional approaches to deadlock resolution and how they might be applied in a PVL. This leads to a design for a PVL that addresses deadlock scenarios. Following the design presentation is a discussion of possible design enhancements. We end with a case study of an actual implementation of the PVL design in the High Performance Storage System (HPSS).

Introduction

Processing power and data collection abilities have been increasing faster than storage system bandwidth and capacity for many years ([1],[2]). This growing gap has caused the storage system to become a bottleneck for more and more applications. While techniques such as third party transfer and network attached peripherals [3] address this bottleneck, existing storage systems continue to fall short of meeting the present and predicted data storage and retrieval needs of supercomputers, massively parallel processors, and workstation networks.

In response, researchers working on the next generation of storage systems are looking for innovative, open solutions which will narrow the storage gap. The IEEE Reference Model for Open Storage Systems Interconnection (Project 1244) [4] defines a storage architecture which addresses the needs of high-end storage clients. The reference model defines a set of cooperating modules and interfaces which combine to form a functional storage system. The focus of this paper deals primarily with the design aspects of one of these modules known as the Physical Volume Library (PVL).

A quick overview of the reference model's PVL module and those modules most closely associated with the PVL is necessary:

At this point we need to define few other terms that we will using throughout the paper:

While a physical volume mount request maps directly to a request to mount media, the cartridge containing the media can hold one or more physical volumes. For example, with some optical platters, the transportable object, the cartridge, is the optical platter itself. If the platter is capable of storing data on each of its sides, the cartridge could be considered to hold two physical volumes represented by the two sides of the platter. Tape cartridges capable of partitioned access can be similarly configured to contain multiple physical volumes.

The PVL is the enterprise-wide manager of volume and drive resources and is responsible for queuing mount requests to prevent resource contention. The PVL translates client (VSS) requests to mount a physical volume into requests to mount specific media. When a volume is requested, the PVL identifies the PVR that manages the media. It then allocates the requested media and drive resources, using one of many possible queuing and allocation schemes, and issues mount commands to the PVR. When the PVR has mounted the media, the PVL verifies the internal media label via requests to a Mover.

In a high performance storage environment, it is often necessary that a single bitfile be striped across multiple physical volumes to attain an adequate data transfer rate. For example, a modern high performance tape drive might read and write at about 10 megabytes-per-second (MB/s) and might store 20 gigabytes (20 GB) of data on a cartridge. If a bitfile is striped across four such tape drives, the resulting virtual volume appears to the client to read and write at 40 MB/s and to store 80 GB. It should be noted that, while the advantages of striping are obvious, deciding when and how wide to stripe a bitfile is challenging ([7],[8],[9]).

High performance storage systems that implement striping require that sets of volumes be mounted together in order to satisfy striped data requests. For this paper we make the assumption that striped data can not be accessed until all volumes making up the virtual volume are mounted. This assumption is necessary in cases such as direct tape-to-tape copies, when the source stripe width does not match that of the sink, and tape-to-display copies when a system does not typically have enough memory to buffer data while awaiting mounts. Sets of volumes may also be needed for other purposes such as creating mirrored copies of data. It is in the process of satisfying atomic mounts of sets of volumes that potential for deadlock arises in a PVL.

Problem statement

A system is considered to be deadlocked if every activity in the system is waiting for an event which can only be generated by another activity in the system [10]. Mounting data volumes in a high performance storage environment can cause deadlock in three different ways:

Non-deadlock scenarios involving clients that monopolize resources can also effectively prevent the allocation of storage resources. In this section we discuss these scenarios after first investigating the potential for deadlock caused by each of the three deadlock conditions.

Drive resource contention

A PVL is required to mount multiple physical volumes together for striped bitfile access. In most parallel environments all physical volumes must be mounted concurrently to satisfy the striped request. Mounting three out of four cartridges for a striped tape request does not typically allow the data to flow, and the three drive units occupied by the mounted tapes are not free to satisfy other requests during the wait for the fourth cartridge. Worse yet is the potential for deadlock if the required fourth drive resource never becomes available because another request, which cannot complete until one of the drives occupied by the first request is relinquished, occupies the remaining drives.

This simple deadlock scenario is illustrated in Figure 1. In this example a PVL managing four drives has two separate requests, one for a four-wide stripe and a second for a three-wide stripe. The PVL has mounted two cartridges for each request and is deadlocked waiting for drives to free for each request.

Another aspect of this deadlock scenario is that, even if one of the striped requests is satisfied, data may not be able to flow until both mount requests are satisfied concurrently. If the two clients in Figure 1 were attempting a tape-to-tape copy directly from one set of striped tapes to another (Client 1 to Client 2), all source and sink tapes must be mounted for data to flow. If the two clients involved submit their striped mount requests separately, then even if one of the stripe sets is successfully mounted without deadlocking, that striped set occupies drives without moving data until the second stripe set is mounted. If the second stripe set is unable to mount due to resource contention, PVL deadlock is achieved. In fact, tape-to-tape copy is impossible given the hardware in Figure 1, as seven drives are required for the copy. It should be noted that, if the tape-to-tape copy is between a like number of tapes (four-wide stripe to four-wide stripe) copying could be accomplished by copying each stripe independently.

Out-of-order mounts

The potential for out-of-order volumes being mounted creates another deadlock scenario in a PVL. Out-of-order volumes are those volumes that have been requested by a client, or perhaps will be requested soon by a client, but have not yet been requested of the PVR by the PVL. In addition to out-of-order volume mounts caused by operator and robot error, another common mechanism which can cause out-of-order volume mounts is the use of a traditional sequential Automatic Cartridge Loader, more commonly referred to as a stacker. This device mounts the next cartridge in its stack as soon as the current cartridge is unloaded. No external command is required from the PVL or PVR for such a mount. If the PVL keeps the out-of-order volume mounted, possibly to satisfy another queued request, the same type of deadlock condition we observed previously could occur (see Figure 2).

Note that, in Figure 2, there is no way for both client requests to succeed without operator intervention.

Another scenario created by out-of-order tape mounts is one leading to a state of indefinite postponement of a client request, also called livelock [11]. If, when an out-of-order mount occurs, the PVL uses the volume to satisfy a queued request, other mount requests which may have been older, or of higher priority, are postponed. Theoretically a mount request might never be satisfied because out-of-order mounts could indefinitely monopolize drive resources.

Multiple requests for the same cartridge

It is also possible for a PVL to deadlock based on contention for cartridge resources. If two clients are mounting striped sets that require different physical volumes, but two of the volumes exist on the same cartridge, then deadlock could occur. This deadlock can occur even though clients may be requesting discrete sets of volumes, but in fact, are requesting overlapping sets of cartridges (see Figure 3).

Resource monopolization

In the subsection on out-of-order mounts, we saw how accepting out-of-order mounts could lead to the indefinite postponement of requests even if it did not cause deadlock. The end result is similar to deadlock in that client requests to the storage system are never satisfied. Resource monopolization is another condition that can lead to indefinite postponement even when it does not cause deadlock.

In some environments it is quite possible that an apparently well-behaved client may hold a resource indefinitely. For example, some persistent process may acquire a set of drives and volumes to use as scratch space for calculations. Such a process would prevent other storage system clients from ever accessing those drives. While it is possible for a PVL to force a volume to be dismounted, the result can lead to serious errors in the client and corrupted data on the volume. Because of this, most systems do not allow a PVL to force a client to terminate the use of a drive/volume combination. For the purposes of this paper we assume well-behaved clients which hold resources for some bounded amount of time.

Traditional approaches to deadlock

Deadlock conditions have been well defined in operating systems research. There are four conditions necessary for deadlock to exist. Coffman et al. [12] introduced these conditions which, defined simply are:

When all four of these conditions are satisfied, deadlock will occur. All three of the PVL deadlock cases presented in the last section satisfy all four deadlock conditions.

There are three major approaches to dealing with deadlock. Deitel [13] identifies these approaches as:

We now examine each of these approaches and their application to a PVL.

Deadlock prevention

Deadlock prevention entails eliminating any possibility of a deadlock condition occurring. This is done by ensuring that at least one of the four conditions necessary for deadlock can never occur.

In a PVL it is impractical to try to eliminate the Mutual Exclusion condition because drives cannot be assumed to be concurrently shared between clients. As discussed earlier, preemption is also not practical, so the No Preemption condition also holds. Circular Wait is generally a function of client requests. Because clients do not typically coordinate independent requests with one another, there is no way for them to guarantee that their requests will never result in a circular wait for resources. A PVL cannot prevent such a condition as it is unaware of all of its clients' higher level interdependencies. Because of this we cannot eliminate the Circular Wait condition in the PVL.

Our final chance at eliminating a deadlock condition is to prevent the Hold and Wait condition from ever occurring. As it turns out it would be trivial for a storage system to eliminate this condition and thereby eliminate any potential for deadlock. The PVL could simply request drives for a client, but never hold drive resources while waiting for others to become available. In this manner a PVL mounting a four-wide stripe might request the drives one at a time, releasing any successful drive reservations if any one drive request couldn't be immediately satisfied. While this prevention algorithm eliminates the potential for deadlock, it represents a very inefficient algorithm for obtaining resources. Depending on the implementation, it can also lead to job resource starvation.

Deadlock avoidance

Given information about how serially reusable resources such as cartridges and drives are used, it is possible to construct an algorithm that avoids deadlock. Deadlock avoidance algorithms avoid deadlock occurrence through the judicious allocation of resources. An example of a classical avoidance algorithm is Dijkstra's Banker's Algorithm [14].

Storage system clients typically know the total number of resources a job requires immediately at the start of a job. A four-wide stripe requires four drives and four volumes. A direct tape-to-tape copy of one four-wide stripe to another requires eight drives and eight volumes. If a PVL presents an interface allowing clients to provide information about what mounts need to occur atomically together, PVL deadlock can be avoided and the possibility of client level deadlock, a circular wait, is diminished. Because of this, PVLs are well suited for the application of deadlock avoidance algorithms. It should be noted that clients which are not well behaved can still deadlock themselves by independently requesting mounts that are dependent on each other at a level higher than the PVL.

Deadlock detection and recovery

Unlike deadlock avoidance, deadlock detection algorithms make no effort to prevent deadlock from occurring. Instead, the system is periodically examined to determine if deadlock has occurred. Detection algorithms typically involve checking resource allocation graphs for cycles [15].

When it is determined that deadlock has occurred, deadlock recovery must be invoked. Deadlock recovery involves terminating processes or preempting resources to break the deadlock. Deadlock detection and recovery are often used in environments where deadlock is unlikely and/or when checking for deadlock at each request is impractical.

In order to implement deadlock detection and recovery in a PVL, it must be possible for the deadlock to be broken. As discussed earlier, breaking deadlock by preempting resources or by forcibly unmounting client tapes in a PVL is at best difficult, and often is not allowed.

A design for PVL deadlock avoidance

We would now like to outline the design of a PVL for a high performance storage-striped media-environment. We concentrate on those aspects of the design that address the deadlock issues we have raised, and explain the rationale behind our design decisions.

Design approach

We chose to combine deadlock avoidance techniques with an algorithm designed to prevent indefinite resource postponement. Deadlock prevention was rejected because of its dependence on inefficient resource allocation algorithms. Deadlock detection and recovery was rejected because of the operational ramifications of the requirement that a PVL be able to preempt or terminate requests to recover from deadlock.

Our design presents a set of transactional Application Programming Interfaces (APIs) to the client application. These APIs allow a client to atomically specify all of the resources which will be needed for a single job. This set of required resources is used by the deadlock avoidance algorithm to determine if all or part of the request should be queued. The queuing mechanism is first-come-first-served (nonpreemptive) with defined precedence rules for reserving storage resources. Out-of-order mounts are allowed, but preemption of storage resources is tightly controlled.

APIs for atomic mounts

In order to allow for atomic mounts, our PVL design includes the following APIs:

Using these PVL APIs, a client can build a single or multivolume mount request and submit (commit) it to the PVL. The interface also allows a pair of clients to work together to create a single atomic mount request for applications such as tape-to-tape copying. Imagine two Virtual Storage Servers (VSSs) which need to copy data from a tape virtual volume managed by one VSS to a tape virtual volume managed by the other VSS. As mentioned previously, to avoid deadlock and to maximize drive utilization, the tape mounts for both VSSs should be combined atomically. With these APIs, one VSS could obtain a job identifier using MountNew. This identifier could then be shared by both VSS clients in the building of a single mount request using calls to MountAdd. Once the request was built, one of the VSSs would be in charge of actually committing the combined mount. Such an algorithm allows clients to avoid a possible deadlock situation that the PVL would otherwise be unable to prevent.

In our design the APIs allowing atomic mount present an asynchronous interface. In order to notify a client that mounts have completed, the design specifies an API for client notification of mounts:

PVL mount queuing

Internally our design accepts MountAdd requests, queuing them until the job is committed. It is a job's commit time that is used to initially order mount requests (not the time of the MountAdds). Once a commit is received for a job, the PVL first verifies that the job does not require more resources than exist in the system. The PVL also verifies that the job is requesting valid volumes and that none of the volumes reside on the same cartridge. At that point the PVL returns to the client and indicates that the mount is in progress.

Asynchronously, the PVL begins allocating resources. The key to deadlock avoidance is preventing a circular wait. Our deadlock avoidance algorithm achieves this by requiring that the PVL follow a strict precedence ordering in reserving resources and that those resources be assigned to client requests in a specific order. It is our experience that drive resources are typically much more scarce than media resources. A typical site might have thousands of cartridges and fewer than 20 drives. Because drives are more scarce, our PVL first attempts to reserve the media necessary for the request before trying to reserve drives.

One aspect of our design is that cartridges, not physical volumes, are reserved by the PVL. This is important because, as mentioned previously, two or more distinct physical volumes can reside on the same cartridge. Reserving cartridges ensures that two mount requests for physical volumes on the same cartridge are not concurrently issued to a PVR. When a cartridge resource becomes available, it is given to the appropriate mount job which has the earliest commit time.

Once cartridge resources are reserved, the job is placed in a second queue to reserve drive resources. Free drive resources are allocated to requesting jobs based on their order in the queue. No preemption of a job for better resource utilization is allowed. This does have potential drawbacks including less than optimal utilization of drive resources.

Another important aspect of the design is the fact that a particular drive is not reserved for a mount request, rather a count of drives of the requested type is kept and the mount request reserves a drive by decrementing a count of available drives of the appropriate type. This allows the PVL to deal with PVRs that do not allow pre-assignment of drive resources.

An exception in this drive assignment algorithm is made when the PVR managing the cartridge in question is an operator PVR, that is, with human mounted drives. In this case the PVL does not do any reservation of drives; rather, it immediately sends the request to the PVR. An exception to this rule is discussed when we describe how we deal with out-of-order volume mounts. The benefits of this technique are two-fold. First, allowing an operator to see all tape mount requests rather than just the oldest optimizes the process of retrieving cartridges from vaults. Second, it allows for a simpler algorithm to deal with out-of-order volume mounts which we will discuss shortly.

When a volume mount has been assigned both its cartridge and drive, the PVL asks the PVR to mount the cartridge. Because the deadlock avoidance algorithm guarantees that drive assignments cannot cause deadlock, we can mount each cartridge as soon as the drive is assigned rather than waiting for drives to be reserved for the entire job. When the PVR has successfully mounted a cartridge, the PVL verifies the internal media label using a Mover. If the internal label is correct, the PVL responds to the client using the PVLNotify API.

We have shown that this PVL design very simply addresses both the problem of atomic mount induced deadlock, and that of multiple concurrent requests for the same cartridge resource. The design presented thus far has not addressed the challenge of out-of-order mounts.

Out-of-order mount handling

As we discussed earlier, allowing an out-of-order mount to be honored can lead to deadlock and indefinite postponement. For our design we chose a simple strategy that prevents deadlock due to out-of-order mounts, but allows the use of conventional stackers at the risk of indefinite postponement.

Before settling on a design, we considered the simple algorithm of allowing no preemption of mount requests. Under such an algorithm, if a volume has been mounted as requested by a PVL client, but hasn't yet been requested to be mounted by PVL, then the volume will be dismounted. Such an approach makes extremely poor use of traditional stacker devices and forces an operator to retrieve a dismounted cartridge and place it back in the stacker before the mount can be satisfied. This eliminates the primary labor-saving advantage of using a stacker.

Because of this weakness we decided to allow out-of-order mounts to be honored with some restrictions. Our design accomplishes this by treating operator mounted volumes as exceptions. As we stated previously, our deadlock avoidance algorithm sends operator mounted volume requests to a PVR as soon as a cartridges are reserved rather than first trying to reserve a drive. The one exception to this rule is that, once mount requests for a multivolume mount involving an operator PVR have begun, all subsequent mount requests for that PVR are queued in the PVL until the hand mounted volumes involved in the multivolume mount are mounted. This rule, combined with the rule that mounts are only accepted if they have been requested by a PVL, ensures that two multivolume mounts never deadlock on operator controlled drive resources.

Important to our design is the requirement that mount request displays communicate to operators which volumes are associated together as part of a multivolume mount. This information is vital to provide an operator enough information to keep him or her from stacking two or more volumes that are part of the same multivolume mount in the same stacker. Also, because we assume that most operator mount request displays will show how long a particular mount request has been outstanding, we depend on the operations staff to make sure that the number of preemptions, and hence the length of postponement, are minimized.

Possible enhancements to the design

The PVL design aspects we have presented represent an attempt to, in an uncomplicated manner, satisfy requirements imposed by a high performance storage system while preventing resource deadlock. There are a number of enhancements that might be made to this design without violating our goal of maintaining PVL simplicity.

Scheduling and preemption enhancements

Our PVL design operates fundamentally on a first-committed first-served scheduling basis. The only preemption allowed takes place when out-of-order mounts are allowed in operator PVRs. One can imagine any number of prioritization schemes that would allow jobs to be scheduled based on an assigned job weight or priority. Jobs could be assigned greater priority based on:

As long as the PVL followed the deadlock avoidance rule that no circular wait be allowed, then any weighted scheduling mechanism could easily be added to our design without adding possible deadlock scenarios.

One can also imagine allowing the preemption of jobs that the PVL has already sent to a PVR. This preemption might be the result of jobs of a greater weight as assigned by a prioritization scheme, arriving at the PVL after submission of the original mount request to the PVR. Implementing this kind of preemption would require the addition of a mechanism to back out the drive and cartridge allocation of preempted jobs, as well as a mechanism to abort or dismount PVR requests that are being preempted.

All of the priority weighting schemes listed (and many more are possible) are very site dependent. Each site will have different rules and preferences as to how a job should be weighted based on their local environment. In order to accommodate different weighting mechanisms, the priority setting portion of the PVL would best be implemented as a separate policy module that each site could modify. Inputs to the policy module would be all information known about the request, and the output would be a priority weight to be assigned to the job.

Adding client deadlock detection

One of the key features of our deadlock avoidance algorithm is that the PVL is made aware of all resources that will be used by the client to satisfy a single request. Our PVL APIs allow one or more clients to specify all of the resources a request will use; the PVL uses this information to prevent deadlock. It is possible that a poorly behaved client might deadlock itself by issuing separate codependent PVL requests.

While our PVL design does not totally protect a client from itself, it could be enhanced to detect when client-induced deadlock might have occurred. This detection would rely on watching how long a particular job has been mounted, possibly in conjunction with client provided mount duration information. Regardless of the detection method, the PVL could alarm operators and/or clients that a deadlock condition may exist, or the PVL could even be given the capability to unmount the offending jobs.

Limiting the amount of preemption due to out-of-order mounts

By honoring out-of-order mounts in our design, we have introduced the possibility of indefinite postponement. Possible approaches to minimizing the impact of indefinite postponement might involve trying to limit the extent or number of postponements allowed.

Eliminating the extent of a single postponement is difficult for a PVL because, as we have seen, a PVL typically can not unmount a volume on its own prerogative. This fact alone makes any single postponement one of indeterminate time and makes any enhancement to this aspect of our design difficult. It is only in systems where mount durations are well known, controlled, or modifiable by a PVL, that allowing a preemption does not entail some amount of postponement risk.

Managing the number of times a mount request can be postponed is an easier challenge. Setting a hard limit on the number of preemptions, weighted priority schemes and aging algorithms could all be applied to aid indefinite postponement detection and recovery with some benefit. Unfortunately such algorithms are often very dependent on specific site policies, and do not eliminate the problem of any single postponement being of unknown duration.

A case study-the HPSS PVL

The PVL design presented in this paper has been implemented as part of the National Storage Laboratory's [16] High Performance Storage System (HPSS) ([17],[18]) under development by the National Storage Laboratory. HPSS is a storage system that manages scalable, parallel storage, possibly petabytes in size, and requiring up to several gigabytes per second aggregate throughput. HPSS is designed to meet the needs of parallel computers, traditional supercomputers, and workstation clusters. HPSS is based upon the IEEE Reference Model for Open Storage Systems Interconnection and is implemented using Open Software Foundation's (OSF) Distributed Computing Environment Remote Procedure Calls (DCE RPCs) [19] and Transarc Corporation's Encina metadata management and transactional RPC software [20].

To meet performance and scalability requirements, HPSS requires that a PVL mount multiple physical volumes in parallel to service a single client request. An HPSS PVL must satisfy all of the requirements and challenges discussed in this paper. Our PVL design was implemented for HPSS in the C language in a platform independent manner, and currently runs on an IBM RS/6000 computer under the AIX operating system.

The HPSS PVL is a multitasking server built on top of DCE threads. It uses DCE RPCs to communicate with clients, PVRs, and Storage System Manager applications. Unix sockets are used to communicate with Movers. The PVL stores its metadata, internal information about configurations, volumes, requests, and so on, using Encina's Structured File System (SFS) transactional metadata storage system. The PVL maintains support interfaces allowing storage system management applications access to configuration and status.

Our PVL presents an API to its client that is a superset of the APIs presented in our design above. When each request is committed, a job is created in the PVL and placed at the end of an ordered job list. A second list of all cartridges that have been requested by existing jobs is also maintained. These two lists form a two dimensional sparse matrix (see Figure 4).

Note that in Figure 4, VOL003 and VOL004 will not be mounted even though drives are available. This is because a job must have all of its cartridges assigned before it can allocate any drives.

Each node in the matrix is a separate activity. Each activity represents a single volume that needs to be, or has been, mounted. An activity moves through a set of states first acquiring resources, then mounting a cartridge in a drive, and finally dismounting the cartridge and assigning the resources to the next waiting activity. Some of the more common activity states, transitions between these states, and some expanded implementation details are described as follows:

The HPSS implementation of our design currently supports StorageTek 4400, IBM 3494, IBM 3495, Ampex DST800, and operator mounted drives. The next release of the HPSS PVL will include enhanced device support and will support mounts requested for magnetic disk volumes as required by the IEEE Reference Model for Open Storage Systems Interconnection.

Even though HPSS does not currently support any optical disk devices, our PVL does support the concept of multisided cartridges. This is necessary for future support of optical devices, but may also be needed by tape devices. For example, Ampex DD2 cartridges can be divided into multiple partitions and it is possible to mount cartridges such that the drive firmware enforces access to only a specific partition. In this case a single DD2 cartridge could be considered to have multiple volumes.

HPSS was successfully demonstrated at Supercomputing '94. As part of that demonstration, the HPSS PVL was involved in mounting one-way, two-way, and four-way media stripes of tape and disk media. At the time this paper was written, February 1995, a preliminary release of HPSS was being installed and tested at several early deployment sites. The preliminary release contains support for striped tape. The next release of HPSS adds support for striped disk, multiple storage hierarchies, and migration and caching between hierarchies.

While the HPSS PVL was implemented to fill the need for a PVL satisfying the requirements of a high-end storage system, it also served as a proof of concept of our PVL design. It showed that expanding upon typical PVL interfaces and dealing with deadlock challenges was not only possible, but could be accomplished with a relatively simple design. With time, we are sure that some of the enhancements mentioned above will be added to the HPSS PVL. However, based upon input from the operational sites involved in the development of HPSS, we found that implementing these enhancements will have to be done carefully because of their site specific nature.

Conclusion

A PVL mounting striped, removable media can cause deadlocks three different ways: contention for drives, contention for cartridges, and mounting out-of-order volumes. There are several well known methods of eliminating deadlock when acquiring serially reusable resources; we chose deadlock avoidance for our PVL design. A PVL is ideally suited to deadlock avoidance techniques because its clients are able to specify all of the resources that will be used by a single request and because deadlock avoidance does not require the preemption of resources. The key to our deadlock avoidance algorithm is to prevent circular dependencies by requiring that the PVL follow a strict precedence ordering in reserving resources and that those resources be assigned to client requests in a specific order. This PVL design has been demonstrated through its implementation as part of the National Storage Laboratory's HPSS system.

Acknowledgments

HPSS is the product of a collaboration between Industry, IBM Government Systems; the Department of Energy, Lawrence Livermore National Laboratory, Los Alamos National Laboratory, Oak Ridge National Laboratory, Sandia National Laboratories; Higher Education, Cornell University; and NASA, Langley and Lewis Research Centers. Developers representing these organizations have worked with diligence and dedication as a single team. Hardware and software to support HPSS development and demonstration have been provided by Ampex, IBM, Kinesix, Maximum Strategy Inc., Network Systems Corp., PsiTech, Sony Precision Graphics, Storage Technology, and Zitel. Additional support in making HPSS available to high performance clients has been provided by Cray Research, Intel, IBM, and Meiko.

We would particularly like to thank Norm Samuelson who wrote most of the code for the HPSS PVL. We would also like to thank Jim Daveler for his work on both the PVL and PVR. Finally, we are grateful to Randy Burris for his helpful comments on this paper.

This work was, in part, performed by the Lawrence Livermore National Laboratory, under the auspices of a US Department of Energy Cooperative Research and Development Agreement and by IBM Government Systems under Independent Research and Development and other internal funding.

References

  1. K. Herbst, ``Vendors Seek Solutions to Growing I/O Bottleneck,'' Supercomputing Review, Vol. 4, No. 3, pp. 46-49, March 1991.

  2. L.M. Thorndyke, ``Supercomputers and Mass Storage: The Challenges and Impacts,'' Digest of Papers, 7th IEEE Symp. Mass Storage Systems, November 1985, pp. 27-30.

  3. R. Hyer, R. Ruef, and R.W. Watson, ``High Performance Direct Network Data Transfers at the National Storage Laboratory,'' Proc. 12th IEEE Symp. Mass Storage Systems, Apr. 1993, pp. 275-28.

  4. IEEE Project 1244, Reference Model for Open Storage Systems Interconnection, Lester Buck et al., eds., Sept. 1994.

  5. S. Coleman, ``Physical Volume Repository,'' Digest of Papers, 9th IEEE Symp. Mass Storage Systems, Oct. 1988, pp. 21-24.

  6. D. Kitts, S. Coleman, and B. Griffing, ``Bitfile Mover,'' Digest of Papers, 9th IEEE Symp. Mass Storage Systems, Oct. 1988, pp. 25-28.

  7. L. Golubchik, R.R. Munz, and R.W. Watson, ``Analysis of Striping Techniques in Robotic Storage Libraries,'' Proc. 14th IEEE Symp. Mass Storage Systems, Sept. 1995.

  8. A.L. Drapeau and R.H. Katz, ``Striped Tape Arrays,'' Proc. 12th IEEE Symp. Mass Storage Systems, Apr. 1993, pp. 257-265.

  9. A.L. Drapeau and R.H. Katz, ``Striping in Large Tape Libraries,'' Proc. Supercomputing '93, Nov. 1993, pp. 378-387.

  10. A. Silberschatz, and J.L. Peterson, Operating System Concepts, Addison-Wesley, 1988, pp. 187-218.

  11. J.D. Ullman, Principles of Database Systems, Computer Science Press, 1982, pp. 372-374.

  12. E.G. Coffman, M.J. Elphic, and A. Shoshani, ``System Deadlocks,'' Computing Surveys, Vol. 5, No. 2, 1971, pp. 67-78.

  13. H.M. Deitel, An Introduction to Operating Systems, Addison-Wesley, 1984, pp. 126-150.

  14. E.W. Dijkstra, ``Cooperating Sequential Processes,'' Technical Report EWD-123, Technological University, Eindhoven, The Netherlands, 1965.

  15. A.C. Shaw, The Logical Design of Operating Systems, Prentice-Hall, 1974, pp. 203-243.

  16. R.W. Watson , and R.A. Coyne, ``The National Storage Laboratory (NSL): Overview and Status,'' Proc. 13th IEEE Symp. Mass Storage Systems, June 1994, pp. 39-43.

  17. R.A. Coyne, H. Hulen and R. Watson, ``The High Performance Storage System,'' Proc. Supercomputing '93, Nov. 1993, pp. 83-92.

  18. D. Teaff, R.A. Coyne, and R.W. Watson, ``The Architecture of the High Performance Storage System,'' Proc. 4th NASA GSFC Conf. Mass Storage Systems and Technologies, College Park, MD, March 28-30.

  19. Open Software Foundation Distributed Computing Environment Version 1.0 Documentation Set, Open Software Foundation, Cambridge, Mass., 1992.

  20. Encina Documentation Set, Transarc Corporation, Pittsburgh, Pa., 1994.

Google