|Publication number||US7917539 B1|
|Application number||US 12/020,626|
|Publication date||Mar 29, 2011|
|Filing date||Jan 28, 2008|
|Priority date||Apr 25, 2003|
|Also published as||US7330862|
|Inventors||Mohan Srinivasan, Jeffrey S. Kimmel, Yinfung Fong|
|Original Assignee||Netapp, Inc.|
This application is a continuation of U.S. Ser. No. 10/423,381, filed by Mohan Srinivasan et al. on Apr. 25, 2003, titled Zero Copy Write Datapath, now issued as U.S. Pat. No. 7,330,862 on Feb. 12, 2008.
The present invention relates to storage systems and, more particularly, to storing information on storage systems.
A storage system is a computer that provides storage service relating to the organization of information on writable persistent storage devices, such as memories, tapes or disks. The storage system is commonly deployed within a network attached storage (NAS) or storage area network (SAN) environment. A SAN is a high-speed network that enables establishment of direct connections between a storage system, such as an application server, and its storage devices. The SAN may thus be viewed as an extension to a storage bus and, as such, a storage operating system of the storage system enables access to stored information using block-based access protocols over the “extended bus”. In this context, the extended bus is typically embodied as Fibre Channel (FC) or Ethernet media (i.e., network) adapted to operate with block access protocols, such as Small Computer Systems Interface (SCSI) protocol encapsulation over FC or Transmission Control Protocol/Internet Protocol (TCP/IP)/Ethernet.
SCSI is a peripheral input/output (I/O) interface with a standard, device independent protocol that allows different peripheral storage devices, such as disks, to attach to the storage system. In SCSI terminology, clients operating in a SAN environment are initiators that initiate requests and commands for data. The storage system is a target configured to respond to the requests issued by the initiators in accordance with a request/response protocol. The SAN clients typically identify and address the stored information in the form of blocks or disks by logical unit numbers (“luns”).
When used within a NAS environment, the storage system may be embodied as a file server including a storage operating system that implements a file system to logically organize the information as a hierarchical structure of directories and files on, e.g., the disks. Each “on-disk” file may be implemented as a set of data structures, e.g., disk blocks, configured to store information, such as the actual data for the file. A directory, on the other hand, may be implemented as a specially formatted file in which information about other files and directories are stored.
The file server, or filer, of a NAS system may be further configured to operate according to a client/server model of information delivery to thereby allow many client systems (clients) to access shared resources, such as files, stored on the filer. In the client/server model, the client may comprise an application executing on a computer that “connects” to the filer over a computer network, such as a point-to-point link, shared local area network, wide area network or virtual private network implemented over a public network, such as the Internet. The clients typically communicate with the filer by exchanging discrete frames or packets of data according to pre-defined protocols, such as the TCP/IP.
NAS systems generally utilize file-based protocols to access data stored on the filer. Each NAS client may therefore request the services of the filer by issuing file system protocol messages (in the form of packets) to the file system over the network. By supporting a plurality of file system protocols, such as the conventional Common Internet File System (CIFS), the Network File System (NFS) and the Direct Access File System (DAFS) protocols, the utility of the filer may be enhanced for networking clients.
A common type of file system is a “write in-place” file system, an example of which is the conventional Berkeley fast file system. In a write in-place file system, the locations of the data structures, such as inodes and data blocks, on disk are typically fixed. An inode is a data structure used to store information, such as metadata, about a file, whereas the data blocks are structures used to store the actual data for the file. The information contained in an inode may include, e.g., ownership of the file, access permission for the file, size of the file, file type and references to locations on disk of the data blocks for the file. The references to the locations of the file data are provided by pointers, which may further reference indirect blocks that, in turn, reference the data blocks, depending upon the quantity of data in the file. Changes to the inodes and data blocks are made “in-place” in accordance with the write in-place file system. If an update to a file extends the quantity of data for the file, an additional data block is allocated and the appropriate inode is updated to reference that data block.
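By way of illustration only, the relationship between an inode, its pointers and the indirect blocks may be sketched in Python as follows. The names (Inode, block_for_offset) and the constants (four direct pointers, eight pointers per indirect block) are assumptions chosen to keep the example small; they are not taken from the Berkeley fast file system or from any particular storage operating system.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4096        # assumed block size, for illustration
NUM_DIRECT = 4           # hypothetical number of direct pointers per inode
PTRS_PER_INDIRECT = 8    # hypothetical number of pointers per indirect block


@dataclass
class Inode:
    """Metadata about a file plus references to its data blocks."""
    owner: str
    permissions: int
    size: int = 0
    # Direct pointers: disk block numbers of the first few data blocks.
    direct: list = field(default_factory=lambda: [None] * NUM_DIRECT)
    # Each indirect block is modeled as a list of disk block numbers.
    indirect: list = field(default_factory=list)


def block_for_offset(inode, offset):
    """Return the disk block number holding byte `offset`, or None if unallocated."""
    fbn = offset // BLOCK_SIZE               # file block number
    if fbn < NUM_DIRECT:
        return inode.direct[fbn]
    fbn -= NUM_DIRECT
    index, slot = divmod(fbn, PTRS_PER_INDIRECT)
    if index < len(inode.indirect) and slot < len(inode.indirect[index]):
        return inode.indirect[index][slot]
    return None
```

In this sketch a small file is reached through the direct pointers alone, whereas a larger file requires one additional lookup through an indirect block, which corresponds to the dependence on the quantity of data in the file noted above.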
Another type of file system is a write-anywhere file system that does not over-write data on disks. If a data block on disk is retrieved (read) from disk into memory and “dirtied” with new data, the data block is stored (written) to a new location on disk to thereby optimize write performance. An example of a write-anywhere file system that is configured to operate on a filer is the Write Anywhere File Layout (WAFL™) file system available from Network Appliance, Inc. of Sunnyvale, Calif. The WAFL file system is implemented within a microkernel as part of the overall protocol stack of the filer and associated disk storage. This microkernel is supplied as part of Network Appliance's Data ONTAP™ storage operating system residing on the filer.
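The difference between the two write policies may be illustrated by the following minimal Python sketch. The Disk class, the block map dictionary and the simple bump allocator are hypothetical simplifications and are not intended to represent the WAFL file system's actual allocation logic.

```python
class Disk:
    """Toy disk: an array of blocks with a trivial bump allocator (assumed)."""

    def __init__(self, nblocks):
        self.blocks = [None] * nblocks
        self.next_free = 0

    def allocate(self):
        n = self.next_free
        self.next_free += 1
        return n


def write_in_place(disk, block_map, fbn, data):
    """Overwrite the disk block that already backs file block `fbn`."""
    disk.blocks[block_map[fbn]] = data


def write_anywhere(disk, block_map, fbn, data):
    """Store the dirtied block at a new location and re-point the file at it.

    Returns the previous disk block number (or None); its old contents
    remain untouched on disk.
    """
    new_block = disk.allocate()
    disk.blocks[new_block] = data
    old_block = block_map.get(fbn)
    block_map[fbn] = new_block
    return old_block
```

Note that write_anywhere never modifies the block previously referenced by the file; only the reference in the block map changes.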
Data is often received at the storage system from the network as packets of various lengths that are stored in lists of variable length input buffers. In contrast, file systems usually operate on data arranged in blocks of a predetermined size. For instance, data in the WAFL file system is stored in contiguous 4 kilobyte (kB) blocks. Therefore, data received by the storage system is typically converted from variable length input buffers to the fixed sized blocks for use by the file system. The process of converting data stored in input buffers to fixed sized blocks involves copying the contents of the input buffers into the system's memory, then having the file system reorganize the data into blocks of a predetermined size. However, the copy operation from the input buffers to the file system buffers consumes processor resources as that copy operation is performed in software. The present invention is directed to a technique that eliminates this copy operation into the file system buffers.
The invention relates to a technique for enhancing a write data path within a storage operating system executing on a storage system. As used herein, the write data path defines program logic used by a file system of the storage operating system to process write requests directed to data, e.g., files or virtual disks (vdisks), served by the file system. The inventive technique enhances the write data path of the storage system by providing a “zero copy” write data path embodied as a zero copy write function of the storage operating system that eliminates a copy operation for a write request received at the storage system. The eliminated operation is a data copy operation from a list of input buffers to buffers used by the file system.
In the illustrative embodiment, the storage system is a multi-protocol storage appliance having a memory for storing data and a non-volatile random access memory (NVRAM) capability that prevents data loss within the storage appliance. A portion of the memory is organized as a buffer cache having buffers used by the file system to store data associated with, e.g., write requests. When a block access (or a certain file access) write request directed to a vdisk (or file) is received at the storage appliance, a network adapter transfers write data associated with the request into selected buffers used by the file system via a direct memory access (DMA) operation. A Small Computer Systems Interface (SCSI) target module of the storage operating system constructs a list of these selected file system buffers for use with a write operation associated with the write request. The list of buffers is then processed by the zero copy write function of the storage operating system.
Specifically, the zero copy write function “grafts” (inserts) the selected buffers directly into a buffer tree structure of the buffer cache. The buffer tree is an internal representation of data, e.g., for a file or vdisk, stored in the buffer cache and maintained by the file system of the storage operating system. Rather than actually copying the data stored in the buffers, grafting of the file system buffers into the buffer tree entails swapping pointers that reference memory locations of buffers in the buffer tree with pointers that reference memory locations of the selected file system buffers. After the write data is grafted into the buffer tree, another DMA operation is initiated from these grafted buffers to a non-volatile log (NVlog) of the NVRAM.
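By way of illustration only, the pointer swap performed by the grafting operation may be sketched in Python as follows. The names Buffer, BufferTree and graft are hypothetical, and the buffer tree is flattened to a single level of slots for brevity; the sketch is not the zero copy write function itself.

```python
class Buffer:
    """A fixed-size file system buffer (contents held in a bytearray)."""

    def __init__(self, data=b""):
        self.data = bytearray(data)


class BufferTree:
    """Simplified, single-level stand-in for a file's buffer tree."""

    def __init__(self, nblocks):
        # Each slot holds a reference (pointer) to a Buffer.
        self.slots = [Buffer(bytes(4)) for _ in range(nblocks)]


def graft(tree, start_fbn, filled_buffers):
    """Swap references in the tree for the already-filled buffers.

    No payload bytes are copied; only references change hands. The
    displaced buffers are returned so that a caller could recycle them.
    """
    displaced = []
    for i, buf in enumerate(filled_buffers):
        displaced.append(tree.slots[start_fbn + i])
        tree.slots[start_fbn + i] = buf
    return displaced
```

After the call, each affected slot of the tree refers to the very same buffer object that received the write data, which is the sense in which the data path involves no copy.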
Advantageously, the novel zero copy write data path technique obviates a copy operation from the input buffers into the file system buffers by allowing the network adapter to copy the write data directly from the write requests into those buffers. The invention thus eliminates the data copy operation and its consumption of processor cycles.
The above and further advantages of the invention may be better understood by referring to the following description in conjunction with the accompanying drawings in which like reference numerals indicate identical or functionally similar elements:
The multi-protocol storage appliance 100 is illustratively embodied as a storage system comprising a processor 122, a memory 124, a plurality of network adapters 125, 126 and a storage adapter 128 interconnected by a system bus 123. The multi-protocol storage appliance 100 also includes a storage operating system 200 that provides a virtualization system (and, in particular, a file system) to logically organize the information as a hierarchical structure of named directory, file and virtual disk (vdisk) storage objects on the disks 130. An example of a multi-protocol storage appliance that may be advantageously used with the present invention is described in co-pending and commonly assigned U.S. Pat. No. 7,873,700, issued on Jan. 18, 2011, entitled A Multi-Protocol Storage Appliance that Provides Integrated Support for File and Block Access Protocols, which patent is hereby incorporated by reference as though fully set forth herein.
Whereas clients of a NAS-based network environment have a storage viewpoint of files, the clients of a SAN-based network environment have a storage viewpoint of blocks or disks. To that end, the multi-protocol storage appliance 100 presents (exports) disks to SAN clients through the creation of logical unit numbers (luns) or vdisk objects. A vdisk object (hereinafter “vdisk”) is a special file type that is implemented by the virtualization system and translated into an emulated disk as viewed by the SAN clients. The multi-protocol storage appliance thereafter makes these emulated disks accessible to the SAN clients through controlled exports.
In the illustrative embodiment, the memory 124 comprises storage locations that are addressable by the processor and adapters for storing software program code. A portion of the memory may be further organized as a buffer cache 300 having buffers used by the file system (hereinafter “file system buffers”) to store data associated with, e.g., write requests. The processor and adapters may, in turn, comprise processing elements and/or logic circuitry configured to execute the software code and manipulate the data structures. The storage operating system 200, portions of which are typically resident in memory and executed by the processing elements, functionally organizes the storage appliance by, inter alia, invoking storage operations in support of the storage service implemented by the appliance. It will be apparent to those skilled in the art that other processing and memory means, including various computer readable media, may be used for storing and executing program instructions pertaining to the invention described herein.
The network adapter 125 couples the storage appliance to a plurality of clients 160 a,b over point-to-point links, wide area networks, virtual private networks implemented over a public network (Internet) or a shared local area network, hereinafter referred to as an illustrative Ethernet network 165. For this NAS-based network environment, the clients are configured to access information stored on the multi-protocol appliance as files. Therefore, the network adapter 125 may comprise a network interface card (NIC) having the mechanical, electrical and signaling circuitry needed to connect the appliance to a network switch, such as a conventional Ethernet switch 170. The clients 160 communicate with the storage appliance over network 165 by exchanging discrete frames or packets of data according to pre-defined protocols, such as the Transmission Control Protocol/Internet Protocol (TCP/IP).
The clients 160 may be general-purpose computers configured to execute applications over a variety of operating systems, including the Solaris™/Unix® or Microsoft Windows® operating systems. Client systems generally utilize file-based access protocols when accessing information (in the form of files and directories) over a NAS-based network. Therefore, each client 160 may request the services of the storage appliance 100 by issuing file access protocol messages (in the form of packets) to the appliance over the network 165. For example, a client 160 a running the Windows operating system may communicate with the storage appliance 100 using the Common Internet File System (CIFS) protocol over TCP/IP. On the other hand, a client 160 b running the Solaris operating system may communicate with the multi-protocol appliance using either the Network File System (NFS) protocol over TCP/IP or the Direct Access File System (DAFS) protocol over a virtual interface (VI) transport in accordance with a remote DMA (RDMA) protocol over TCP/IP. It will be apparent to those skilled in the art that other clients running other types of operating systems may also communicate with the integrated multi-protocol storage appliance using other file access protocols.
The storage network “target” adapter 126 also couples the multi-protocol storage appliance 100 to clients 160 that may be further configured to access the stored information as blocks or disks. For this SAN-based network environment, the storage appliance is coupled to an illustrative Fibre Channel (FC) network 185. FC is a networking standard describing a suite of protocols and media that is primarily found in SAN deployments. The network target adapter 126 may comprise a FC host bus adapter (HBA) having the mechanical, electrical and signaling circuitry needed to connect the appliance 100 to a SAN network switch, such as a conventional FC switch 180. In addition to providing FC access, the FC HBA offloads FC network processing operations for the storage appliance.
The clients 160 generally utilize block-based access protocols, such as the Small Computer Systems Interface (SCSI) protocol, when accessing information (in the form of blocks, disks or vdisks) over a SAN-based network. SCSI is a peripheral input/output (I/O) interface with a standard, device independent protocol that allows different peripheral devices, such as disks 130, to attach to the storage appliance 100. In SCSI terminology, clients 160 operating in a SAN environment are initiators that initiate requests and commands for data. The multi-protocol storage appliance is thus a target configured to respond to the requests issued by the initiators in accordance with a request/response protocol. The initiators and targets have endpoint addresses that, in accordance with the FC protocol, comprise worldwide names (WWN). A WWN is a unique identifier, e.g., a node name or a port name, consisting of an 8-byte number.
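For purposes of illustration, an 8-byte WWN may be rendered and parsed as shown in the following Python sketch. The colon-separated hexadecimal notation is a common display convention rather than something required by the description above, the function names are hypothetical, and any WWN value passed in is supplied by the caller.

```python
def format_wwn(raw: bytes) -> str:
    """Render an 8-byte WWN as colon-separated hexadecimal octets."""
    if len(raw) != 8:
        raise ValueError("a WWN consists of exactly 8 bytes")
    return ":".join(f"{octet:02x}" for octet in raw)


def parse_wwn(text: str) -> bytes:
    """Parse the colon-separated form back into its 8 raw bytes."""
    raw = bytes(int(part, 16) for part in text.split(":"))
    if len(raw) != 8:
        raise ValueError("a WWN consists of exactly 8 bytes")
    return raw
```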
The multi-protocol storage appliance 100 supports various SCSI-based protocols used in SAN deployments, including SCSI encapsulated over TCP (iSCSI) and SCSI encapsulated over FC (FCP). The initiators (hereinafter clients 160) may thus request the services of the target (hereinafter storage appliance 100) by issuing iSCSI and FCP messages over the network 165, 185 to block protocol (e.g., iSCSI and FCP) interconnect adapters 125, 126 to thereby access information stored on the disks. It will be apparent to those skilled in the art that the clients may also request the services of the integrated multi-protocol storage appliance using other block access protocols. By supporting a plurality of block access protocols, the multi-protocol storage appliance provides a unified and coherent access solution to vdisks/luns in a heterogeneous SAN environment.
The storage adapter 128 cooperates with the storage operating system 200 executing on the storage appliance to access information requested by the clients. The information may be stored on the disks 130 or other similar media adapted to store information. The storage adapter includes I/O interface circuitry that couples to the disks over an I/O interconnect arrangement, such as a conventional high-performance, FC serial link topology. The information is retrieved by the storage adapter and, if necessary, processed by the processor 122 (or the adapter 128 itself) prior to being forwarded over the system bus 123 to the network adapters 125, 126, where the information is formatted into packets or messages and returned to the clients.
Storage of information on the appliance 100 is preferably implemented as one or more storage volumes (e.g., VOL1-2 150) that comprise a cluster of physical storage disks 130, defining an overall logical arrangement of disk space. The disks within a volume are typically organized as one or more groups of Redundant Array of Independent (or Inexpensive) Disks (RAID). RAID implementations enhance the reliability/integrity of data storage through the writing of data “stripes” across a given number of physical disks in the RAID group, and the appropriate storing of redundant information with respect to the striped data. The redundant information enables recovery of data lost when a storage device fails.
Specifically, each volume 150 is constructed from an array of physical disks 130 that are organized as RAID groups 140, 142, and 144. The physical disks of each RAID group include those disks configured to store striped data (D) and those configured to store parity (P) for the data, in accordance with an illustrative RAID 4 level configuration. However, other RAID level configurations (e.g., RAID 5) are also contemplated. In the illustrative embodiment, a minimum of one parity disk and one data disk may be employed. However, a typical implementation may include three data disks and one parity disk per RAID group and at least one RAID group per volume.
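The manner in which parity enables recovery of a lost block may be sketched in Python as follows, assuming the bytewise exclusive-OR parity conventionally used by RAID 4 and the typical arrangement of three data disks and one parity disk. The function names are hypothetical, and each "disk" is reduced to a single short block for brevity.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Return the data blocks followed by their parity block (RAID 4 style)."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def rebuild(stripe, lost_index):
    """Reconstruct the block at `lost_index` from the surviving blocks."""
    survivors = [blk for i, blk in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)
```

Because XOR is its own inverse, the same rebuild routine recovers either a failed data disk or the failed parity disk, provided only one disk of the group is lost.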
The storage appliance 100 also includes a non-volatile random access memory (NVRAM 190) that provides fault-tolerant backup of data, enabling the integrity of requests received at the storage appliance to survive a service interruption based upon a power failure or other fault. That is, the exemplary storage appliance may be made more reliable and stable in the event of a system shutdown or other unforeseen problem by employing a backup memory consisting of NVRAM 190. Data associated with every write request received at the storage appliance is stored in the NVRAM to protect against data loss in the event of a sudden crash or failure of the storage appliance. These write requests may apply to either NAS or SAN based client requests.
Illustratively, the NVRAM 190 is a large-volume, solid-state memory array having either a back-up battery or other built-in, last-state-retention capabilities (e.g., a FLASH memory) that hold a last state of the memory in the event of any power loss to the array. The size of the NVRAM is variable; it is typically sized sufficiently to log a certain time-based chunk of transactions/requests (for example, several seconds' worth) in accordance with an NVlog capability 195. The NVRAM 190 is filled in parallel with the buffer cache 300 after each client request is completed, but before the result of the request is returned to the requesting client.
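The ordering described above, in which a request is logged before its result is returned, may be sketched in Python as follows. The Appliance class and its methods are hypothetical, and the flush and replay steps are assumptions added to show why the log is useful; they are not drawn from the NVlog capability 195 itself.

```python
class Appliance:
    """Toy model: volatile buffer cache plus a non-volatile request log."""

    def __init__(self):
        self.buffer_cache = {}   # volatile: lost on a crash
        self.nvlog = []          # non-volatile: survives a crash
        self.disk = {}           # persistent storage

    def handle_write(self, fbn, data):
        # Fill the buffer cache and the log before acknowledging the client.
        self.buffer_cache[fbn] = data
        self.nvlog.append((fbn, data))
        return "ack"

    def flush(self):
        """Assumed step: write cached data to disk, then retire the log."""
        self.disk.update(self.buffer_cache)
        self.nvlog.clear()

    def crash_and_replay(self):
        """Assumed step: lose the cache, then rebuild it from the log."""
        self.buffer_cache = {}
        for fbn, data in self.nvlog:
            self.buffer_cache[fbn] = data
```

In this model every acknowledged write is present in the log, so a failure occurring between the acknowledgment and the next flush does not lose the data.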
To facilitate access to the disks 130, the storage operating system 200 implements a write-anywhere file system that cooperates with virtualization modules to provide a function that “virtualizes” the storage space provided by disks 130. The file system logically organizes the information as a hierarchical structure of named directory and file objects (hereinafter “directories” and “files”) on the disks. Each “on-disk” file may be implemented as a set of disk blocks configured to store information, such as data, whereas the directory may be implemented as a specially formatted file in which names and links to other files and directories are stored. The virtualization system allows the file system to further logically organize information as a hierarchical structure of named vdisks on the disks, thereby providing an integrated NAS and SAN appliance approach to storage by enabling file-based (NAS) access to the files and directories, while further enabling block-based (SAN) access to the vdisks on a file-based storage platform.
In the illustrative embodiment, the storage operating system is preferably the NetApp® Data ONTAP™ operating system available from Network Appliance, Inc., Sunnyvale, Calif. that implements a Write Anywhere File Layout (WAFL™) file system. However, it is expressly contemplated that any appropriate storage operating system, including a write in-place file system, may be enhanced for use in accordance with the inventive principles described herein. As such, where the term “WAFL” is employed, it should be taken broadly to refer to any storage operating system that is otherwise adaptable to the teachings of this invention.
As used herein, the term “storage operating system” generally refers to the computer-executable code operable on a computer that manages data access and may, in the case of a multi-protocol storage appliance, implement data access semantics, such as the Data ONTAP storage operating system, which is implemented as a microkernel. The storage operating system can also be implemented as an application program operating over a general-purpose operating system, such as the Solaris or Windows operating system, or as a general-purpose operating system with configurable functionality, which is configured for storage applications as described herein.
In addition, it will be understood by those skilled in the art that the inventive technique described herein may apply to any type of special-purpose (e.g., storage serving appliance) or general-purpose computer, including a standalone computer or portion thereof, embodied as or including a storage system. Moreover, the teachings of this invention can be adapted to a variety of storage system architectures including, but not limited to, a network-attached storage environment, a storage area network and disk assembly directly-attached to a client or host computer. The term “storage system” should therefore be taken broadly to include such arrangements in addition to any subsystems configured to perform a storage function and associated with other equipment or systems.
An iSCSI driver layer 228 provides block protocol access over the TCP/IP network protocol layers, while a FC driver layer 230 operates with the FC HBA 126 to receive and transmit block access requests and responses to and from the integrated storage appliance. The FC and iSCSI drivers provide FC-specific and iSCSI-specific access control to the luns (vdisks) and, thus, manage exports of vdisks to either iSCSI or FCP or, alternatively, to both iSCSI and FCP when accessing a single vdisk on the multi-protocol storage appliance. In addition, the storage operating system includes a disk storage layer 240 that implements a disk storage protocol, such as a RAID protocol, and a disk driver layer 250 that implements a disk access protocol such as, e.g., a SCSI protocol.
Bridging the disk software layers with the integrated network protocol stack layers is a virtualization system 260 that is implemented by a file system 280 interacting with virtualization modules illustratively embodied as, e.g., vdisk module 290 and SCSI target module 270. It should be noted that the vdisk module 290, the file system 280 and SCSI target module 270 can be implemented in software, hardware, firmware, or a combination thereof. The vdisk module 290 is layered on the file system 280 to enable access by administrative interfaces, such as a streamlined user interface, in response to a system administrator issuing commands to the multi-protocol storage appliance 100. In essence, the vdisk module 290 manages SAN deployments by, among other things, implementing a comprehensive set of vdisk (lun) commands issued through the user interface by a system administrator. These vdisk commands are converted to primitive file system operations (“primitives”) that interact with the file system 280 and the SCSI target module 270 to implement the vdisks.
The SCSI target module 270, in turn, initiates emulation of a disk or lun by providing a mapping procedure that translates luns into the special vdisk file types. The SCSI target module is illustratively disposed between the FC and iSCSI drivers 228, 230 and the file system 280 to thereby provide a translation layer of the virtualization system 260 between the SAN block (lun) space and the file system space, where luns are represented as vdisks 282. To that end, the SCSI target module has a set of application programming interfaces (APIs) that are based on the SCSI protocol and that enable a consistent interface to both the iSCSI and FCP drivers 228, 230. By “disposing” SAN virtualization over the file system 280, the multi-protocol storage appliance reverses the approaches taken by prior systems to thereby provide a single unified storage platform for essentially all storage access protocols.
The file system 280 is illustratively a message-based system; as such, the SCSI target module 270 transposes a SCSI request into one or more messages representing an operation(s) directed to the file system. For example, a message generated by the SCSI target module may include a type of operation (e.g., read, write) along with a pathname (e.g., a path descriptor) and a filename (e.g., a special filename) of the vdisk object represented in the file system. Alternatively, the generated message may include an operation type and file handle containing volume/inode information. The SCSI target module 270 passes the message into the file system layer 280 as, e.g., a function call, where the operation is performed.
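The translation of a SCSI request into a file system message may be sketched in Python as follows. The FsMessage fields, the LUN_MAP table, its example pathname and filename, and the 512-byte logical block size are all assumptions made for illustration; only the READ(10) and WRITE(10) operation codes (0x28 and 0x2A) are taken from the SCSI command set.

```python
from dataclasses import dataclass

SECTOR_SIZE = 512  # assumed logical block size of the emulated disk

# Hypothetical mapping: lun number -> (pathname, special filename) of the vdisk
LUN_MAP = {0: ("/vol/vol1/qtree0", "lun0")}

# Operation codes of the SCSI READ(10) and WRITE(10) commands
OPCODES = {0x28: "read", 0x2A: "write"}


@dataclass(frozen=True)
class FsMessage:
    """Message handed to the file system layer (field names are assumed)."""
    op: str
    pathname: str
    filename: str
    offset: int    # byte offset within the vdisk file
    length: int    # number of bytes to read or write


def translate(lun, opcode, lba, nblocks):
    """Map a block-level request on a lun to a file-level message."""
    if opcode not in OPCODES:
        raise ValueError(f"unsupported opcode {opcode:#x}")
    pathname, filename = LUN_MAP[lun]
    return FsMessage(OPCODES[opcode], pathname, filename,
                     lba * SECTOR_SIZE, nblocks * SECTOR_SIZE)
```

A variant carrying a file handle with volume/inode information, as in the alternative described above, would replace the pathname and filename fields with that handle.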
The file system provides volume management capabilities for use in block-based access to the information stored on the storage devices, such as disks. That is, in addition to providing file system semantics, such as naming of storage objects, the file system 280 provides functions normally associated with a volume manager. These functions include (i) aggregation of the disks, (ii) aggregation of storage bandwidth of the disks, and (iii) reliability guarantees, such as mirroring and/or parity (RAID), to thereby present one or more storage objects layered on the file system. A feature of the multi-protocol storage appliance is the simplicity of use associated with these volume management capabilities, particularly when used in SAN deployments.
The file system 280 illustratively implements the WAFL file system having an on-disk format representation that is block-based using, e.g., 4 kilobyte (kB) blocks and using inodes to describe the files. The WAFL file system uses files to store metadata describing the layout of its file system; these metadata files include, among others, an inode file. A file handle, i.e., an identifier that includes an inode number, is used to retrieve an inode from disk. A description of the structure of the file system, including on-disk inodes and the inode file, is provided in U.S. Pat. No. 5,819,292, titled Method for Maintaining Consistent States of a File System and for Creating User-Accessible Read-Only Copies of a File System by David Hitz et al., issued Oct. 6, 1998, which patent is hereby incorporated by reference as though fully set forth herein.
As noted, a vdisk is a special file type in a volume that derives from a plain (regular) file, but that has associated export controls and operation restrictions that support emulation of a disk. More specifically, the vdisk 282 is a multi-inode object comprising a special file inode and at least one associated stream inode that are managed as a single, encapsulated storage object within the file system 280. The vdisk 282 illustratively manifests as an embodiment of the stream inode that, in cooperation with the special file inode, creates a new type of file storage object having the capacity to encapsulate specific security, management and addressing (export) information. An example of a vdisk that may be advantageously used with the present invention is described in U.S. Pat. No. 7,107,385, issued on Sep. 12, 2006, titled Storage Virtualization by Layering Virtual Disk Objects on a File System, which patent is hereby incorporated by reference as though fully set forth herein.
The file system implements access operations to vdisks 282, as well as to files 284 and directories (dir 286) that coexist with respect to global space management of units of storage, such as volumes 150 and/or qtrees 288. A qtree 288 is a special directory that has the properties of a logical sub-volume within the namespace of a physical volume. Each file system storage object (file, directory or vdisk) is illustratively associated with one qtree, and quotas, security properties and other items can be assigned on a per-qtree basis. The vdisks and files/directories may be layered on top of qtrees 288 that, in turn, are layered on top of volumes 150 as abstracted by the file system “virtualization” layer 280.
The vdisk storage objects in the file system 280 are generally associated with SAN deployments of the multi-protocol storage appliance, whereas the file and directory storage objects are associated with NAS deployments of the appliance. The files and directories are generally not accessible via the FC or SCSI block access protocols; however, a file can be converted to a vdisk and then accessed by either the SAN or NAS protocol. The vdisks are thus accessible as luns from the SAN (FC and SCSI) protocols and as files by the NAS (NFS and CIFS) protocols.
In general, data associated with write requests issued by clients 160 in accordance with file access protocols and directed to files served by the file system 280 may be received at the storage appliance as packets of various lengths. These packets are generally stored in lists of variable length input buffers. However, file systems typically operate on data arranged in blocks of a predetermined size. For instance, data in the WAFL file system is stored in contiguous 4 kilobyte (kB) blocks. The file data received by the storage appliance is thus converted from variable length input buffers to the fixed sized blocks for use by the file system 280. This conversion is accomplished by copying data in the input buffers into the fixed size file system buffers within the buffer cache 300.
For example, when an NFS or CIFS write request (and associated write data) is received from a client 160 at the storage appliance 100, a single-source multiple-destination copy operation is performed on the write data within file system protocol layers of the storage operating system 200. That is, the data contained in a write request embodied as an NFS or CIFS request is initially stored in a collection of input buffers when the data is received at the system. The appropriate file access protocol then copies that write data into file system buffers of the buffer cache 300 and into the NVRAM 190. However, the copy operation from the input buffers to the file system buffers consumes processor resources because that copy operation is performed in software.
According to the invention, a technique is provided for enhancing a write data path within the storage operating system executing on the multi-protocol storage appliance. As used herein, the write data path defines program logic used by the file system of the storage operating system to process write requests directed to files or vdisks served by the file system. The program logic can be implemented in software, hardware, firmware, or a combination thereof. The inventive technique provides a “zero copy” write data path in the file system that eliminates data copy operations for write requests associated with a block access request (or a certain file access request, such as a DAFS request) received at the storage appliance. The data copy operations eliminated by the novel technique are illustratively copy operations from input buffers to file system buffers of a buffer tree within the buffer cache. The inventive technique thus enhances the write data path of the storage appliance by providing a zero copy write data path embodied as a zero copy write function of the storage operating system that eliminates a copy operation for a write request received at the storage appliance.
The buffer cache 300 also comprises a plurality of buffers 340 that are organized as a pool 350 of “anonymous” buffers available for use by the file system 280. These anonymous file system buffers 340 are not assigned to a file 284 or vdisk 282 associated with, e.g., a write request received at the storage appliance. Therefore, this pool of “free” (unassigned) file system buffers 340 may be acquired, as needed, by entities, such as the SCSI target module 270 to store data associated with block access or certain file access write requests directed to the file system 280.
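The pool of anonymous buffers may be pictured with the following minimal sketch. The class names `FsBuffer` and `BufferPool`, the `assigned_to` field and the method names are hypothetical and chosen only for illustration. The point shown is that acquired buffers leave the free pool but remain unassigned to any file or vdisk.

```python
class FsBuffer:
    """A fixed-size file system buffer (hypothetical model)."""

    def __init__(self, size=4096):
        self.data = bytearray(size)
        self.assigned_to = None  # None: anonymous, no file or vdisk


class BufferPool:
    """Pool of free, anonymous file system buffers (hypothetical model)."""

    def __init__(self, count, size=4096):
        self._free = [FsBuffer(size) for _ in range(count)]

    def acquire(self, n):
        """Hand n anonymous buffers to a caller such as a SCSI target."""
        if n > len(self._free):
            raise RuntimeError("not enough free buffers")
        return [self._free.pop() for _ in range(n)]

    def release(self, buf):
        """Return a buffer to the pool and make it anonymous again."""
        buf.assigned_to = None
        self._free.append(buf)

    def free_count(self):
        return len(self._free)
```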
The SCSI protocol, which forms the basis of several block access protocols, typically transports data as blocks without protocol headers. As a result, block protocol interconnect adapters, e.g., as used with FCP or iSCSI, can allow a storage operating system to control placement of write data into input buffers. In addition, certain file access protocols, such as the DAFS protocol, can exploit network adapters with direct data placement capabilities. Therefore, write data received using protocols and adapters with sufficient data placement controls and directed to vdisks or files served by the file system can be stored directly in the file system buffers 340 of the buffer cache 300, and can thus allow use of the novel zero copy write technique.
In contrast, write requests received using protocols or adapters that lack sufficient data placement controls are typically received at the network protocol layers of the storage appliance and their data is loaded into input buffers. The input buffers are typically fragmented and thus do not allow efficient conversion into 4 k WAFL file system buffers used in the buffer trees. Therefore, conventional NAS-based protocols, such as NFS or CIFS, may not be used in accordance with the novel zero copy write data path technique.
Specifically, the SCSI target module 270 constructs an input/output vector (iovec 430) using pointers to the acquired file system buffer addresses and headers associated with those buffers. The iovec 430 is thus essentially a list of file system buffers 340 that comprise a write operation associated with the write request issued by the initiator. The SCSI target module 270 passes the iovec 430 to the file system 280 as the zero copy write function 420, where an operation is performed to “graft” (insert) the selected buffers 340 directly into buffer tree 320. Rather than actually copying the data stored in the selected buffers, the zero copy write function 420 uses the iovec to graft the file system buffers into the buffer tree by swapping pointers that reference memory locations of buffers in the buffer tree with pointers that reference memory locations of the selected file system buffers. After the write data is grafted into the buffer tree, a DMA operation is initiated to transfer the write data from these grafted buffers to the NVlog 195.
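The grafting operation may be illustrated by the following minimal sketch, in which the buffer tree is reduced to a flat list of buffer references. The class `BufferTree` and its `graft` method are hypothetical names and the flat layout is an assumption; the description does not specify the tree's internal structure. The sketch shows only that grafting replaces references and leaves the write data where the adapter placed it.

```python
class BufferTree:
    """Simplified buffer tree: one slot per file block (an assumption)."""

    def __init__(self, nblocks):
        self.slots = [None] * nblocks  # each slot references a buffer

    def graft(self, start_block, iovec):
        """Insert the buffers listed in iovec by swapping references.

        Returns the buffers that were displaced, if any. No byte of
        write data is copied.
        """
        displaced = []
        for i, buf in enumerate(iovec):
            old = self.slots[start_block + i]
            self.slots[start_block + i] = buf  # pointer swap only
            if old is not None:
                displaced.append(old)
        return displaced
```

Because each slot holds the same buffer object that appears in the iovec, an identity comparison (`tree.slots[n] is iovec[0]`) confirms that no copy was made.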
Upon receiving the XFR_RDY response, the initiator transmits data associated with the write request to one or more addresses of those buffers specified by the offset within the response and equal to an amount specified by the length value in that response (Step 510). Specifically, the initiator transfers the data associated with the write request to the adapter 126, where the DMA engine 410 and FCP driver 230 transfer the data of a particular length to a particular offset (buffer address) in the buffer cache 300 of memory 124 in accordance with a DMA operation. More specifically, the SCSI target module 270 passes the addresses of the file system buffers 340 in buffer cache 300 to the FCP driver 230, which then passes those addresses to the DMA engine (firmware) on the network adapter 126. The DMA engine logic 410 uses those addresses to transfer the write data (when it arrives from the initiator) directly into the acquired file system buffers 340 in accordance with a DMA operation. Notably, the DMA engine cooperates with the driver to transfer the data into the appropriate buffers without intervention of the processor 122 of the storage appliance 100.
Once the DMA logic has transferred the write data into the appropriate buffers, the driver 230 notifies the SCSI target module 270 that the write data has arrived and is stored in the acquired file system buffers at the specified addresses/offset. In Step 512, the SCSI target module constructs the iovec 430 using pointers to the acquired file system buffer addresses and headers associated with those buffers. The SCSI target module 270 then passes the iovec 430 to the file system 280 (Step 514) where, at Step 516, the zero copy write function 420 is invoked to “graft” (insert) the acquired list of buffers associated with the iovec directly into buffer tree 320 of the buffer cache 300. Since the acquired file system buffers 340 are anonymous buffers, they are not assigned within the buffer tree 320 to a particular file 284 or vdisk 282. Therefore, the file system 280 inserts the list of buffers identified by the iovec 430 directly into the buffer tree 320 for the particular file or vdisk. By directly inserting the list of file system buffers 340 into the buffer tree at the appropriate locations, no copy operations are required.
After the file system buffers are grafted into the buffer tree of buffer cache 300, storage locations in the NVRAM 190 are allocated for these buffers (Step 518) and a DMA operation is initiated to transfer the write data from those buffers into the NVRAM (Step 520). That is, after the file system buffers are grafted into the buffer tree 320, the DMA operation is initiated to the NVlog 195 using the buffers 340 as the source of the data transfer. The NVlog capabilities interact with the DMA operation to transfer the write data stored in the file system buffers to the NVRAM 190 without the use of processor-intensive copy operations. In particular, the file system calls a driver in the NVRAM to initiate the DMA operation by providing DMA descriptors (address and length information) to the NVRAM driver. Upon completion of the DMA operation, the NVRAM driver notifies the file system of completion. The file system then sends a “callback” to the SCSI target module 270 instructing it (Step 522) to send a completion response to the initiator. The sequence then ends at Step 524.
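The ordering of Steps 510 through 522 may be summarized by the following minimal sketch. It is a software model under stated assumptions: the function name `zero_copy_write` is hypothetical, the pool is a plain list, the buffer tree is a dictionary keyed by block number, and both DMA transfers are simulated with slice assignments because Python has no DMA engine. In the appliance those two transfers are performed by hardware without processor copies.

```python
BLOCK_SIZE = 4096


def zero_copy_write(pool, tree, nvlog, start_block, payload):
    """Model the write sequence and return the step numbers in order."""
    assert len(payload) % BLOCK_SIZE == 0, "sketch handles aligned data"
    steps = []
    nblocks = len(payload) // BLOCK_SIZE
    acquired = [pool.pop() for _ in range(nblocks)]

    # Step 510: adapter places data in the acquired buffers (simulated DMA).
    for i, buf in enumerate(acquired):
        buf[:] = payload[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
    steps.append(510)

    iovec = list(acquired)          # Step 512: build the iovec
    steps.append(512)
    steps.append(514)               # Step 514: pass iovec to file system

    for i, buf in enumerate(iovec):  # Step 516: graft by reference
        tree[start_block + i] = buf
    steps.append(516)

    base = len(nvlog)               # Step 518: allocate NVRAM space
    nvlog.extend(bytes(len(iovec) * BLOCK_SIZE))
    steps.append(518)

    # Step 520: transfer from grafted buffers to NVRAM (simulated DMA).
    for i, buf in enumerate(iovec):
        lo = base + i * BLOCK_SIZE
        nvlog[lo:lo + BLOCK_SIZE] = buf
    steps.append(520)

    steps.append(522)               # Step 522: completion response
    return steps
```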
Transfer of write data to the NVlog 195 in accordance with a DMA operation essentially creates a zero-copy write operation. Zero-copy write operations require different treatment than write operations over network-based file system protocols, like NFS or CIFS. For NFS and CIFS, the write data (the source of the DMA transfers) resides in input buffers; therefore, while the DMA operation is scheduled (or in progress), this write data cannot change. For zero-copy write operations, however, the data that is transferred in accordance with the DMA operation resides in the buffer tree. This creates a potential problem with respect to overwrite operations to the data (“pages”) scheduled for DMA transfer to the NVlog. These pages cannot be modified until the DMA operation completes.
In other words, during the DMA operation, the buffers from which the operation is initiated must be protected against overwriting. As noted, DMA engines on the network adapter are programmed (initialized) to transfer the write data into the file system buffers acquired by the SCSI target module and having correct addresses to enable them to be efficiently (and easily) grafted into the buffer tree for the particular file or vdisk. The data stored in the grafted buffers is then transferred to the NVRAM in accordance with a DMA operation initiated by the file system. When the DMA operation from the buffer tree to the NVlog 195 is in progress, a subsequent write operation directed to those grafted buffers should not overwrite (destroy) the data stored in those buffers. Rather, the subsequent write operation directed to those grafted buffers generates a copy-on-write (COW) operation within the file system.
Note that there are actually two DMA operations involved in connection with the zero-copy write data path technique. The first DMA operation involves transfer of data associated with a write request from the network adapter into the file system buffers acquired by the SCSI target module from a pool of free or anonymous buffers. As noted, since the buffers acquired by the SCSI target module are anonymous, they are not assigned to any file or vdisk. Once acquired, those free buffers “belong” to the SCSI target module; a subsequent write operation therefore cannot overwrite the data contents of those buffers during this first DMA operation.
The second DMA operation involves the write data stored in the acquired buffers that is transferred into the NVRAM as initiated by the file system. These acquired buffers have been grafted in the buffer tree and are thus now assigned to a file or vdisk. While this second DMA operation is in progress, the contents of the grafted buffers must be protected against subsequent write operations issued by initiators or clients to the particular file or vdisk, as these write operations may overwrite the data stored in those grafted buffers. The contents of the grafted buffers may be protected by either “holding off” subsequent write operations directed to those buffers or allowing the write operations to complete using the COW operation.
Specifically, when the zero copy write operation is in progress, the buffers 340 are “locked” and the DMA operation to the NVlog 195 is scheduled. A COW operation is performed on each locked file system buffer 340 that is the target of a subsequent write operation and that is the same buffer involved in the DMA operation to the NVlog. The COW operation creates another file system buffer to accommodate the subsequent write operation, while the original file system buffer continues to be used for the DMA operation to the NVlog. More specifically, the COW operation involves creating a copy of a buffer that is involved with the DMA operation and directing a subsequent write operation to that created copy. The originally acquired file system buffer is then immediately detached from the buffer tree for that particular file or vdisk and returned to the free buffer pool 350. The created copy of the acquired buffer is then grafted into the buffer tree to replace the originally acquired buffer.
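The COW handling of a locked buffer may be sketched as follows. The names `Buf`, `dma_locked` and `write_block` are hypothetical, and the buffer tree is again reduced to a dictionary keyed by block number. The sketch returns the detached original to the caller rather than deciding when it re-enters the free pool; that choice is an assumption of the sketch, made so that the in-flight transfer's source is visibly left untouched.

```python
class Buf:
    """File system buffer with a DMA lock flag (hypothetical model)."""

    def __init__(self, data):
        self.data = bytearray(data)
        self.dma_locked = False


def write_block(tree, pool, index, offset, new_bytes):
    """Apply a write to block `index`, using COW if the buffer is locked.

    Returns (target, detached): the buffer that received the write and
    the original buffer detached from the tree (None if no COW occurred).
    """
    buf = tree[index]
    detached = None
    if buf.dma_locked:
        copy = pool.pop()          # another file system buffer
        copy.data[:] = buf.data    # the one copy that COW requires
        tree[index] = copy         # graft the copy in place of the original
        detached = buf             # original keeps serving the DMA
        buf = copy
    buf.data[offset:offset + len(new_bytes)] = new_bytes
    return buf, detached
```

When the buffer is not locked, the write is applied in place and no additional buffer is consumed.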
Advantageously, the novel zero copy write data path technique obviates a copy operation from the input buffers into the file system buffers by allowing the network adapter to copy the write data directly from the write requests into those buffers. The invention thus eliminates the data copy operation and its consumption of processor cycles.
While there has been shown and described an illustrative embodiment for enhancing a write data path within a file system of a storage operating system executing on a storage system, it is to be understood that various other adaptations and modifications may be made within the spirit and scope of the invention. For example, zero copy write operations may be implemented for large unaligned (non-4 k alignment) write operations. As noted, a SCSI target, such as the multi-protocol storage appliance 100, initiates write operations from a SCSI initiator using transfer ready (XFR_RDY) messages. At the time the appliance sends the XFR_RDY message, it allocates buffers to hold the write data that is expected. When sending an XFR_RDY message soliciting a large write operation starting at a non-4 k offset, the storage appliance indexes into a first queued buffer at a particular offset (to hold the write data) and that data is then transferred in accordance with a DMA operation starting at the offset. This enables performance of an entire zero-copy write operation, except for “runts” at the front and back of a large transfer that need to be copied.
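The division of a large unaligned write into runts and a zero-copy portion may be illustrated by the following arithmetic sketch. It reflects one reading of the passage above, namely that the runts are the partial 4 k blocks at the front and back of the transfer; the function name `split_unaligned_write` and the returned dictionary layout are hypothetical.

```python
def split_unaligned_write(offset, length, block_size=4096):
    """Split a byte range into head runt, aligned middle and tail runt.

    Each value is a (start, length) pair. Only the middle covers whole
    blocks and is a candidate for zero copy under this sketch's reading.
    """
    end = offset + length
    first_aligned = -(-offset // block_size) * block_size  # round up
    last_aligned = (end // block_size) * block_size        # round down
    if first_aligned > last_aligned:
        # The write lies inside a single block: all of it is a runt.
        return {"head": (offset, length), "middle": (end, 0), "tail": (end, 0)}
    return {
        "head": (offset, first_aligned - offset),
        "middle": (first_aligned, last_aligned - first_aligned),
        "tail": (last_aligned, end - last_aligned),
    }
```

For a 16384 byte write starting at byte offset 1024, the sketch yields a 3072 byte head runt, a 12288 byte (three block) aligned middle, and a 1024 byte tail runt.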
The foregoing description has been directed to specific embodiments of this invention. It will be apparent, however, that other variations and modifications may be made to the described embodiments, with the attainment of some or all of their advantages. Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the invention.
|U.S. Classification||707/802, 707/803, 707/812|
|International Classification||G06F17/30, G06F7/00|
|Cooperative Classification||Y10S707/99956, G06F17/30067, G06F12/0866|
|European Classification||G06F12/08B12, G06F17/30F|
|Jan 24, 2012||CC||Certificate of correction|
|Sep 29, 2014||FPAY||Fee payment (year of fee payment: 4)|