US 7890529 B1
A system for implementing a distributed, segmented file system includes file servers that each are configured to control separate segments of the distributed-file system, the file servers including: a memory interface configured to communicate with a memory storing at least one of the segments of the distributed file system; a communication interface coupled to at least another of the file servers; and a processor coupled to the memory interface and the communication interface and configured to control, read, and write to file system objects stored in the memory. The system further includes means for transferring permission for access to a requested file system object from an owner server currently controlling a segment where a requested object resides to an access-requesting server.
1. A system for implementing a distributed, segmented file system, the system comprising:
a plurality of file servers that each are configured to control separate segments of the distributed-file system, the file servers being configured to:
communicate with a memory storing at least one of the segments of the distributed file system; and
control, read, and write to file system objects stored in the memory;
means for transferring permission for access to a requested file system object, in response to an access request, from a first file server currently controlling a segment where the requested file system object resides to a second file server; and
means for caching the requested file system object at the second file server in response to receiving an indication, from the means for transferring, of transferred permission to access the requested file system object.
2. The system of
3. The system of
4. The system of
5. The system of
6. The system of
7. The system of
8. The system of
9. The system of
10. The system of
11. The system of
12. The system of
13. The system of
14. A computer program product for use in a file server of a distributed, segmented single file system implemented by a plurality of file servers that control metadata of separate segments of the single file system, the single file system including file system objects residing in the segments and comprising at least portions of one of files and directories, the computer program product residing on a computer-readable medium and comprising computer-readable instructions for causing a computer to:
receive a request for access to a file system object stored in a segment controlled by a first file server, the request for access being associated with a second file server;
determine a level of permission for access to the file system object currently granted to at least one other file server;
send an indication of permission to access the file system object toward the second file server, wherein a level of permission granted by the indication of permission is determined based on the level of permission currently granted to the other file server; and
modify the level of permission granted to the other file server in response to receiving the request for access.
15. The computer program product of
16. The computer program product of
17. The computer program product of
18. The computer program product of
19. The computer program product of
20. The system of
means for caching the requested file system object at the second file server in response to receiving an indication, from the means for transferring, of transferred permission to access the requested file system object.
This application is a continuation of and claims priority to U.S. application Ser. No. 10/833,923, filed Apr. 28, 2004 now abandoned, which is incorporated by reference herein in its entirety. This application claims the benefit of U.S. Provisional Application No. 60/465,894 filed Apr. 28, 2003.
The invention relates to computer storage and file systems and more specifically to techniques for delegating and caching control and locks over objects in a distributed segmented storage system.
Data generated by, and used by, computers are often stored in file systems. File system designs have evolved over approximately the last two decades from server-centric models (that can be thought of as local file systems) to storage-centric models (that can be thought of as networked file systems).
Stand-alone personal computers exemplify a server-centric model—storage has resided on the personal computer itself, initially using hard disk storage, and more recently, optical storage. As local area networks (“LANs”) became popular, networked computers could store and share data on a so-called file server on the LAN. Storage associated with a given file server is commonly referred to as server attached storage (“SAS”). Storage could be increased by adding disk space to a file server. SASs are expandable internally and there is no transparent data sharing between file servers. Further, with SASs throughput is governed by the speed of a fixed number of busses internal to the file server. Accordingly, SASs also exemplify a server-centric model.
As networks have become more common, and as network speed and reliability increased, network attached storage (“NAS”) has become popular. NASs are easy to install and each NAS, individually, is relatively easy to maintain. In a NAS, a file system on the server is accessible from a client via a network file system protocol like NFS or CIFS.
Network file systems like NFS and CIFS are layered protocols that allow a client to request a particular file from a pre-designated server. The client's operating system translates a file access request to the NFS or DFS format and forwards it to the server. The server processes the request and in turn translates it to a local file system call that accesses the information on magnetic disks or other storage media. Using this technology, a file system can expand to the limits of an NAS machine. Typically no more than a few NAS units and no more than a few file systems are administered and maintained. In this regard, NASs can be thought of as a server-centric file system model.
Storage area networks (SANs) (and clustered file systems) exemplify a storage-centric file system model. SANs provide a simple technology for managing a cluster or group of disk-storage units, effectively pooling such units. SANs use a front-end system, which can be a NAS or a traditional server. SANs (i) are easy to expand, (ii) permit centralized management and administration of the pool of disk storage units, and (iii) allow the pool of disk storage units to be shared among a set of front-end server systems. Moreover, SANs enable various data protection/availability functions, such as multi-unit mirroring with failover. SANs, however, are expensive, and while they permit space to be shared among front-end server systems, they do not permit multiple SAN environments to use the same file system. Thus, although SANs pool storage, they basically behave as a server-centric file system. That is, a SAN behaves like a fancy (e.g., with advanced data protection and availability functions) disk drive on a system. Also, various incompatible versions of SANs have emerged.
Embodiments of the invention provide techniques for producing general delegations of objects owned by given servers to one or more of a plurality of servers involved in the segmented file system. The invention provides a general service for delegating control and locks and enabling caching of a variety of objects including, but not limited to, files, byte-ranges, segments. The delegations themselves can be used to identify the objects that they control or protect or with which they are otherwise involved. Delegations are also used to recover the state of protected objects in cases such as network disconnections, and other failures. Other embodiments are within the scope and spirit of the invention.
In general, in an aspect, the invention provides a system for implementing a distributed, segmented file system, the system comprising file servers that each are configured to control separate segments of the distributed-file system, the file servers comprising a memory interface configured to communicate with a memory storing at least one of the segments of the distributed file system, a communication interface coupled to at least another of the file servers, and a processor coupled to the memory interface and the communication interface and configured to control, read, and write to file system objects stored in the memory. The system further includes means for transferring permission for access to a requested file system object from an owner server currently controlling a segment where a requested object resides to an access-requesting server.
Implementations of the invention may include one or more of the following features. The transferring means is configured to provide an indication related to an identity of the requested file system object. The servers are configured to determine from the indication a current state of access permission of the requested file system object. The current state includes a current file server that has control of the requested file system object. The file system object is one of a file and a byte range. The owner server currently controlling the segment where the requested object resides and the access-requesting server are the same server. The means for transferring permissions is configured to transfer permissions without affecting the physical file system.
Embodiments of the invention may provide one or more of the following capabilities. Cache coherence can be provided and consistency maintained. Delegation of access control and cache control can be regulated. Control over file system objects that reside on segments of a segmented file system, and locks, may be delegated and caching enabled to at least one of a plurality of servers at a logical layer above that of the structure of the physical file system. In this invention, the structure of such delegations is general and applied to such objects as byte-ranges of files, files, segments, application locks (F-locks), and so on. Permissions can be transferred without affecting a physical layer of a system.
The file server 222 is configured to perform file access, storage, and network access operations as indicated by various operations modules. The file server 222 can perform local file operations 226 a including reading and writing files, inserting and deleting directory entries, locking, etc. As part of the local file operations 226 a, the server 222 can translate given requests into input/output (“I/O”) requests that are submitted to a peripheral storage interface operations 228 a module. The peripheral storage interface operations 228 a process the I/O requests to a local storage sub-system 229 a. The storage sub-system 229 a can be used to store data such as files. The peripheral storage interface operations 228 a is configured to provide data transfer capability, error recovery and status updates. The peripheral storage interface operations 228 a may involve various types of protocols for communication with the storage sub-system 229 a, such as a network protocol. File operation requests access the local file operations 226 a, and responses to such requests are provided to the network 210, via a network interface operations module 224 a. The modules shown in
The portal 230 includes various modules for translating calls, routing, and relating file system segments and servers. A client (user) can access the portal 230 via an access point 238 a in a file system call translation operations module 232 a. One way for this entry is through a system call, which will typically be operating-system specific and file-system related. The file system call translation operations 232 a can convert a file system request to one or more atomic file operations, where an atomic file operation accesses or modifies a file system object. Such atomic file operations may be expressed as commands contained in a transaction object. If the system call includes a file identifier (e.g., an Inode number), the file system call translation operations 232 a may determine a physical part of a storage medium of the file system corresponding to the transaction (e.g., a segment number) from a (globally/file-system wide) unique file identifier (e.g., Inode number). The file system call translation operations 232 a may include a single stage or multiple stages. This translation operations 232 a may also contain local cache 233 a. This local cache 233 a preferably includes a local data cache, a cache of file locks and other information that may be frequently used by a client, or by a program servicing a client. If a request cannot be satisfied using local cache 233 a, the file system translation operations 232 a may forward the transaction object containing atomic file operation commands to the transaction routing operations 234 a. Similar functionality is provided in, and similar operations may be performed by, the combined portal and file server 250.
The transaction routing operations 234 a, 234 b use the file identifier to determine the location (e.g., the IP address) of a file server 222/250 that is in charge of the uniquely identified file/directory. This file server can be local (i.e., for the unit 250 acting as both a portal and a file server, that received the request) or remote. If this file server is local, the transaction routing operations 234 b pass the file operation to the local file operations 226 b that, in turn, pass an appropriate command to the peripheral storage interface operations 228 b for accessing the storage medium 229 b. If, on the other hand, the file server is remote, the network 210 is used to communicate this operation. The routing operations 234 may use the file identifier to derive a corresponding segment number to determine the location of the file/directory. The system is preferably independent of any particular networking hardware, protocols or software. Networking requests are handed over to a network interface operations 224 b, 236 b.
The network interface operations 224/236 service networking requests regardless of the underlying hardware or protocol, and forward the transaction toward the appropriate file server 222, 250 (i.e., that controls a particular file system segment associated with the request). The network interface operations 224/236 may provide data transfer, error recovery and status updates on the network 210.
The virtual storage 310 uses storage system segments 340 for storing data. The segment 340 is a logical portion of storage (e.g., of a disk or other storage medium). The actual sizes of segments can vary from storage medium to storage medium.
To determine what each segment contains, a superblock 330 includes a file system id, segment number, and other information identifying the file system and the file system state.
In the file system, a file or Inode stored on a disk may be addressed by (i) a segment number, and (ii) a block number within the segment. The translation of this address to a physical disk address occurs at (or by) the lowest level (the SFS (Segmented File System) Physical System in
This convention also makes it simple to distribute the file system over multiple servers as well using a map of which segments of the file system reside on which host file server. More specifically, once the segment number is derived from the FID, the appropriate file server can be determined by mapping, such as through a routing table. For example, this map may be a table that lists the file servers (on which the local agents execute) corresponding to particular segments. The file server may be identified by its IP address. Referring to
File servers may be organized in groups, such as in a hierarchy or some other logical topology, and the lookup of a server may use communication over the network 210 with a group leader or a node in a hierarchy. Such information may be cached on a leased basis with registration for notification on changes to maintain coherency. The local file operations 226 and peripheral storage operations 228 at the determined file server can determine the file to which an operation pertains. Once the request has been satisfied at the determined file server, the result is sent back to the original (portal) server (which may be the same as the determined file server). The original (portal) server may return the result to the requesting client.
Each (globally) unique FID may reside in a segment referred to as the “controlling segment” for that FID. The FID, e.g., an Inode, is associated with a file and encloses information, metadata, about the file (e.g., owner, permissions, length, type, access and modification times, location on disk, link count, etc.), but not the actual data. The data associated with an Inode may reside on another segment (i.e., outside the controlling segment of the Inode). The controlling segment of a particular Inode, however, and the segment(s) containing the data associated with the particular Inode, will be addressable and accessible by the controlling file server.
At any time, a segment is preferably under the control of at most one local agent (i.e., residing on the local file server). That agent is responsible for carrying out file system operations for any FID controlled by that segment. The controlling segment's unique identifier (“SID”) for each FID is computable from the FID by the translator using information available locally (e.g., in the superblock 330). The controlling SID may, for example, be computed via integer division of the FID by a system constant, which implies a fixed maximum number of files controlled per segment. Other techniques/algorithms may be used.
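The integer-division scheme and the segment-to-server map described above can be illustrated with a minimal sketch. The constant `FIDS_PER_SEGMENT`, the contents of `SEGMENT_MAP`, and the function names are hypothetical values chosen for illustration; the description does not specify them, and other techniques may be used.

```python
# Hypothetical system constant: maximum number of FIDs controlled by one segment.
FIDS_PER_SEGMENT = 1_000_000

# Hypothetical segment location map: controlling SID -> IP address of the
# file server that currently controls that segment.
SEGMENT_MAP = {0: "10.0.0.1", 1: "10.0.0.2", 2: "10.0.0.1"}


def controlling_sid(fid: int) -> int:
    """Compute the controlling segment ID from a FID by integer division."""
    if fid < 0:
        raise ValueError("FID must be non-negative")
    return fid // FIDS_PER_SEGMENT


def route(fid: int) -> str:
    """Return the address of the file server controlling the FID's segment."""
    sid = controlling_sid(fid)
    try:
        return SEGMENT_MAP[sid]
    except KeyError:
        raise LookupError(f"no file server known for segment {sid}") from None
```

Because the SID is derived arithmetically from the FID, a portal needs only the constant and the map to route a request; no per-file lookup table is required.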
Data from a file may be contained in a segment in the maximal segment group that is not under the control of the file server responsible for the controlling segment. In this case, adding space to or deleting space from the file in that segment may be coordinated with the file server responsible for it. Preferably no coordination is necessary for simple read accesses to the blocks of the file.
Client (user) entry and access to the file system may thus occur through any unit that has translation and routing operations, and that has access to a segment location map. Such units may be referred to as “portals.” The file system preferably has multiple simultaneous access points into the system. A portal unit may not need file system call translator operations 232 if such operations are provided on the client (end user) machines.
Any of the file servers 16 may be general computing devices, such as personal computers, workstations, etc. As such, the file servers 16 can include processors and memories that store software instructions that are executable by the processors for performing described functions. The file servers 16 may have their own local storage instead of or in addition to the storage 19 and can control/manage segments of a file system on their local storage. The file servers 16 may be clustered to work on a common issue and the clustered servers 16 may be managed/regulated in accordance with the invention.
The file servers 16 can assign FIDs and allocate memory for write requests to the segments 20 that the servers 16 control. Each of the servers 16 can pre-allocate an amount of memory for an incoming write request. The amount of pre-allocated memory can be adjusted and is preferably a fixed parameter that is allocated without regard, or even knowledge, of a quantity of data (e.g., a size of a file) to be written. If the pre-allocated memory is used up and more is desired, then the server 16 can pre-allocate another portion of memory. The server 16 that controls the segment 20 to be written to will allocate an FID (e.g., an Inode number). The controlling server 16 can supply/assign the Inode number and the Inode, complete with storage block addresses. If not all of the pre-allocated block addresses are used by the write, then the writing server 16 will notify the controlling server 16 of the unused blocks, and the controlling server 16 can de-allocate the unused blocks and reuse them for future write operations.
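The pre-allocation exchange described above might look like the following sketch. The class `ControllingServer`, the function `write`, and the batch size are assumptions made for illustration only; block addresses are modeled as plain integers.

```python
class ControllingServer:
    """Hypothetical model of the server that controls a segment's free blocks."""

    def __init__(self, free_blocks, prealloc_size=4):
        self.free = list(free_blocks)
        # Fixed batch size, chosen without regard to the size of the write.
        self.prealloc_size = prealloc_size

    def preallocate(self):
        """Hand out one fixed-size batch of block addresses."""
        if len(self.free) < self.prealloc_size:
            raise RuntimeError("segment out of space")
        batch = self.free[:self.prealloc_size]
        self.free = self.free[self.prealloc_size:]
        return batch

    def release(self, unused):
        """De-allocate blocks the writer did not use so they can be reused."""
        self.free.extend(unused)


def write(server, blocks_needed):
    """Writing server: request batches until enough blocks are held,
    then notify the controlling server of the unused remainder."""
    got = []
    while len(got) < blocks_needed:
        got.extend(server.preallocate())
    used, unused = got[:blocks_needed], got[blocks_needed:]
    server.release(unused)
    return used
```

In this sketch a write needing five blocks triggers two pre-allocations of four blocks each, and the three unused addresses are returned to the controlling server.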
A block diagram of the logical interconnects between exemplary components of a segmented file system is given in
The SFS Distribution Engine in turn includes the following major components:
In the SFS File System, each Segment, each inode, each file, each directory, etc. preferably has an owner. The Administration System assigns Segment ownership to a particular Host and resources on that Segment belong to that owner. For resource (file, directory, inode) access, the owner of a resource is called a Destination Server (DS) and an SFS Host that wants to access the resource is called an Entry Point Server (ES). In order to get access to a resource an ES obtains a lease, or delegation, to that resource. Delegations may be used in a variety of ways in, and may provide a variety of capabilities in, a segmented file system. Various resources of a DS may be delegated to an ES. SFS Connections are maintained between ESs and DSs to help keep track of Delegations.
Connections and Delegations
The descriptions below are exemplary embodiments of the implementation of Connections and Delegations and, as above, do not limit the invention, especially the claims, to require the details discussed to fall within the scope of the invention. Other embodiments should be apparent to one skilled in the art and are included as part of this disclosure. The invention provides for the possible layering of the implementation of Delegations as well as the use of Delegations in handling a variety of objects.
A Host Object is responsible for keeping the network connection between a local host and the Host represented by the Host Object. Different hosts may have different types of connections. A Host does not have to have a connection. If there are no active objects on any of the Host's Segments, the host could be disconnected. An active object is discussed below. In the case of a TCP connection there is one connection per host; in the case of a UDP connection, one connection could serve many hosts.
SFS Connection Management
The SFS Connection Management is typically the responsibility of the Host object. A Heart Beat (HB) mechanism may be used to maintain SFS Connections. In that case, the validity of Delegations is governed by the integrity of the Connection.
In at least one embodiment, any SFS Connection has State. The state of the SFS Connection can be defined by the state of the network connection, status of Heart Beats and State of resources that were delegated between the partners of this SFS Connection. On the ES side SFS Connection could be in one of the following states:
On the DS side SFS Connection could be in one of the following states:
To support this functionality, the Host object may keep:
A host's “ES timer” is responsible for sending Heart Beats to DS and receiving replies. When an ES Timer routine is called it checks the last time something has been sent to the host and decides if a heart beat should be sent. It also calculates the time interval for the next sleep. The Send method of the Host object could adjust the ES's Timer Sleep time, e.g., after every successful send. A host's “DS timer” is responsible for checking status of the SFS Connection and adjusting the state of the SFS Connection, Segments on the Host and/or Delegations. A Timer is preferably not active if the appropriate SFS Connection has not been established. In addition, the state of the SFS Connection could be adjusted, or state of every Segment on that Host (SFS Connection) or state of every Delegation could be adjusted. In some implementations, the state of every Segment should be adjusted and, when processing a Delegation, the state of the Segment should be adjusted. The list of Delegations on every Segment could be considered and the state of the Delegations adjusted so when a Delegation is used the correct state is handy.
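The ES timer decision described above (check when something was last sent, decide whether a heartbeat is due, and calculate the next sleep interval) can be sketched as follows. The function name and the policy of sleeping exactly until the next heartbeat is due are assumptions; the description leaves the calculation open.

```python
def es_timer_tick(now, last_sent, hb_interval):
    """Return (send_heartbeat, sleep_seconds) for one ES timer wake-up.

    now, last_sent: timestamps in seconds.
    hb_interval: maximum idle time allowed before a heartbeat must be sent.
    """
    idle = now - last_sent
    if idle >= hb_interval:
        # Nothing was sent recently enough: a heartbeat is due now,
        # and the next check falls one full interval later.
        return True, hb_interval
    # Recent traffic already counts as a heartbeat; sleep only for
    # the remainder of the interval.
    return False, hb_interval - idle
```

A Send method that records `last_sent` after every successful send would have the effect of pushing back the next heartbeat, which matches the adjustment of the timer's sleep time mentioned above.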
The timer is preferably not active if the SFS Connection is not established.
SFS Connection Recovery
SFS Connection Recovery happens, e.g., in cases when something happens with the DS or with ES/DS communication. Upon Connection Recovery, a process occurs on the ES side to recover segments affected by the disconnections. This Recovery could be done in parallel for various ES and segments and results in retrieving the objects associated with active Delegations, flush of dirty data, etc.
ES Reboot or Failure
When an ES reboots, there is no SFS Connection to recover. The DS detects this situation when it receives an SFS_ICONNECT request from an ES that already has an SFS Connection on the DS side. In this case the DS should release the delegations held by that ES. Even if the ES had any exclusive delegations and kept some locks/oplocks locally, NFS and SMB clients of that ES should detect this situation. For NFS Clients it is the same as an NFS Server failure; with the help of STATFS it should be detected, and NFS Clients should restore their locks. The Grace period Interval should be coordinated with that of the NFS Server. SMB clients will detect the loss of TCP connections and proceed as with a normal SMB Server failure. If there are local clients, they will not care; they will be down as well.
ES failure should be detected by IAS and its failover partner should pick up for it. The failover partner will then, at the request of NFS Clients that detected the NFS Server's failure, start requesting resources from DSs. The failover partner may or may not have established an SFS Connection with a DS. In any case the DS will receive a request for exclusive access to a resource to which the original ES had an exclusive delegation. If the HB interval has not expired, the DS will try to break the original delegation and fail. If the HB interval has expired, the DS will set the original ES's SFS Connection STALE, remove the original delegation, and grant the new request. Here the DS should or could let it be known that the new ES is a failover partner of the old ES and/or that it is processing in Grace period mode for the first ES's NFS Clients.
If there is no failover partner for an ES, DS will not know about ES failure until it receives a request from another ES or Local Host for a resource, delegated to a failed ES. When DS does receive this request, it breaks an existing delegation. If HB interval has expired on a Session, DS does not need to send a request to break a delegation, but preferably sets the SFS Session STALE and deletes the delegation.
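The DS-side handling of a conflicting request against a delegation held by a possibly failed ES can be sketched as below. The `Connection` and `Resource` classes, the state names other than STALE, and the `send_break` callback are hypothetical; only the decision on whether the HB interval has expired comes from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Connection:
    es_id: str
    last_heard: float          # time of the last exchange with this ES
    state: str = "ACTIVE"      # hypothetical state name


@dataclass
class Resource:
    # es_id -> delegation type currently issued for this resource
    delegations: dict = field(default_factory=dict)


def handle_conflict(resource, holder, now, hb_interval, send_break):
    """Resolve a request that conflicts with a delegation held by `holder`.

    Returns True if the old delegation is gone and the new request may be
    granted. `send_break(es_id)` is an assumed transport call that returns
    True only if the holder acknowledged the break.
    """
    if now - holder.last_heard > hb_interval:
        # HB interval expired: no break request is needed.
        holder.state = "STALE"
        resource.delegations.pop(holder.es_id, None)
        return True
    if send_break(holder.es_id):
        resource.delegations.pop(holder.es_id, None)
        return True
    # The holder did not answer but is not yet considered stale.
    return False
```

When no conflicting request ever arrives, `handle_conflict` is never called, so a failed ES keeps its delegations on the DS, consistent with the behavior described here.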
If an ES failed and DS does not need to communicate to that ES, the DS will not change the state of its SFS Connection and will not release delegated resources.
Segment Recovery and Segment Reassignment
The actual process of Segment Recovery or Segment Reassignment happens on a DS, while the process of restoring the state is driven by ESs. Segment Recovery and Segment Reassignment are the same process from the ES point of view. The difference is that in the case of Segment Recovery, the Segment stays on the same Host, while for Segment Reassignment the Segment moves with its delegations from one Host to another. The Segment object is the keeper of the list of Delegations, so only the Segment object may be moved from one list to another.
Segment Recovery or Segment Reassignment could happen during SFS Connection recovery.
Segment Reassignment could also happen regardless of SFS Connection recovery, when IAS commands a DS to give up ownership of a segment and give it to a different Host.
There are two ways ES could find out about Segment reassignments:
Segment reassignment/recovery could be guaranteed or non-guaranteed (forced or non-forced). Guaranteed recovery happens if the DS is within the Grace Period, and it means that the delegations and locks being restored should be granted by the DS. Non-guaranteed recovery happens if:
In any case ES should:
When a DS gives up ownership of a segment, it should send_SEG_RELOCATED notification to all the Sessions that have delegations on any Segment's resources. It should not break any delegations, because ESs will be able to restore all the delegations they have with the new DS owner.
When a DS receives ownership of a segment, it should set up a Grace period on this segment so that only the following SFS commands to the resources on this Segment will be accepted:
This is not the same as a Grace period on a Session, because a Session could have other segments, and requests to other Segments should continue to be processed. A possible way to implement this Segment Grace period is to postpone DS processing of requests other than those mentioned above to objects on the Segment. Another way is to respond to those requests with an error and make the ESs repeat them after the Grace period. The second way could be preferred if RPC is used for communication, because RPC will retransmit and could fail a request if the DS postpones it for too long. ERR_SEGGRACE could be used as a reply to conflicting requests, and ESs should retry those requests after the grace period interval.
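The second approach, in which the DS replies with ERR_SEGGRACE and the ES retries after the grace period interval, might be sketched on the ES side as follows. The `send` callback, the `max_attempts` limit, and the string representation of the error are assumptions for illustration.

```python
import time

ERR_SEGGRACE = "ERR_SEGGRACE"


def send_with_grace_retry(send, request, grace_interval,
                          max_attempts=3, sleep=time.sleep):
    """Send `request` via the assumed transport `send`, retrying while the
    DS reports that the target Segment is still in its Grace period."""
    for attempt in range(max_attempts):
        reply = send(request)
        if reply != ERR_SEGGRACE:
            return reply
        if attempt + 1 < max_attempts:
            # Wait out the Segment Grace period before repeating the request.
            sleep(grace_interval)
    return ERR_SEGGRACE
```

Returning an error immediately keeps each RPC short, so the retransmission and timeout behavior of the RPC layer is not triggered while the Segment is recovering.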
The descriptions below provide further examples of the invention, but do not exhaustively detail all possible embodiments of the invention. The descriptions below provide exemplary feature sets (definitions) of possible embodiments of the invention, including possible feature sets of Delegations. The described feature sets are not the only feature sets of Delegations, etc., that may be used in embodiments of the invention.
A Delegation is an object that defines a set of rules for leasing resources from a DS to an ES. Resources that could be leased from a DS to an ES are called Delegatable resources. The following is a partial list of Delegatable resources:
Preferably every Delegatable resource keeps a list of Delegations. A resource is preferably local or remote, but not both. On the DS side, a resource keeps a list of issued Delegations. On the ES side, a resource keeps a list of received Delegations. The ES side preferably has only one Delegation per object, although the same list could be reused, in which case on the ES the list would have only one entry.
When an ES sends to the DS a request that requires information from a Delegatable resource to be returned, the DS considers if the resource could be leased (delegated) to the ES and what type of lease could be issued. Then this resource is sent back from the DS to the ES together with its possible corresponding delegation. Possibilities include the case when delegation NONE is issued. DS keeps an issued Delegation in the list of delegations for the resource. When ES receives a delegation it also attaches this delegation to its representation of this Delegatable resource.
Delegations typically have associated with them a sharing type. Examples of kinds of delegations are Shared, Exclusive, Notification or None:
The DS decides what type of delegation should be issued (see “Delegations: Case Study” below).
At any time the DS can send a message to revoke a delegation or possibly downgrade it from Exclusive to Shared. On receiving a revoke request, the ES should sync the state for this resource to the DS, commit changed data to stable storage (in the case of write caching), and discard the read data cache for this resource.
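One way a DS could choose the delegation type from the delegations already issued for a resource is sketched below. The policy shown (downgrade an Exclusive holder when a Shared request arrives, and issue NONE when an Exclusive request conflicts with existing holders) is an assumption for illustration, not the policy of the case study; the revoke messaging and the ES-side sync that a real downgrade requires are omitted.

```python
def grant(delegations, es_id, requested):
    """Decide which delegation to issue to `es_id` for one resource.

    delegations: dict mapping es_id -> "SHARED" or "EXCLUSIVE" (issued list).
    requested:   "SHARED" or "EXCLUSIVE".
    Returns the granted type: "SHARED", "EXCLUSIVE", or "NONE".
    """
    others = {e: t for e, t in delegations.items() if e != es_id}
    if not others:
        granted = requested
    elif requested == "SHARED":
        for e, t in others.items():
            if t == "EXCLUSIVE":
                # Downgrade the current holder from Exclusive to Shared.
                delegations[e] = "SHARED"
        granted = "SHARED"
    else:
        # Exclusive access conflicts with existing holders.
        granted = "NONE"
    if granted != "NONE":
        # The DS records the issued delegation on the resource's list.
        delegations[es_id] = granted
    return granted
```

The level granted to the requester depends on the level currently granted to other servers, and the level held by those servers may be modified as a side effect of the request.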
When the DS revokes an Exclusive Delegation it should postpone any usage of the protected resource until the ES replies. There should not be any timeout on the revocation request. A revoke request should be considered failed only if SFS Connection Management has detected a break of the SFS Connection from that ES. See SFS Connection Management and recovery for the definition of the behavior in this situation. A revoke request may entail a lot of work and network traffic on the ES's part. To help avoid RPC timeouts and retransmissions, the revoke request should be implemented as a set of two separate NULL-reply RPC requests: SFS_REVOKE from DS to ES and SFS_REVOKE_OK from ES to DS.
When DS revokes a shared delegation, it waits for SFS_REVOKE_OK so that the ES finishes direct media access.
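The two-message revocation (SFS_REVOKE from DS to ES, answered later by a separate SFS_REVOKE_OK from ES to DS) can be modeled on the DS side with a small tracker. The class name `RevokeTracker`, its methods, and the `send` callback are hypothetical; handling of a broken SFS Connection is deliberately left out because it is defined by connection management.

```python
class RevokeTracker:
    """Hypothetical DS-side bookkeeping for outstanding revocations."""

    def __init__(self, send):
        # Assumed transport: send(es_id, message, resource_id), NULL reply.
        self.send = send
        # (es_id, resource_id) pairs still awaiting SFS_REVOKE_OK.
        self.pending = set()

    def revoke(self, es_id, resource_id):
        """Issue SFS_REVOKE; the call returns without waiting for the ES."""
        self.pending.add((es_id, resource_id))
        self.send(es_id, "SFS_REVOKE", resource_id)

    def on_revoke_ok(self, es_id, resource_id):
        """Handle the separate SFS_REVOKE_OK request sent by the ES."""
        self.pending.discard((es_id, resource_id))

    def usable(self, resource_id):
        """The DS postpones use of a resource while any revoke is pending."""
        return all(r != resource_id for _, r in self.pending)
```

No timer appears in the sketch: a pending entry is cleared only by SFS_REVOKE_OK, which reflects the rule that the revocation request itself has no timeout.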
A Delegation has a resource type, which is the type of the resource that the delegation protects. The resource type defines a set of Delegation operations, which are resource dependent.
Delegations may have an individual nonzero term associated with them. Such delegations are granted only for the duration of the term. The ES can renew them individually or simply re-acquire the associated resource. Delegations with the term set to 0 are renewed implicitly by the heartbeat mechanism. Any network exchange between the ES and the DS is treated as a heartbeat.
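The two renewal modes can be sketched as a validity check. The class name TermedDelegation, its fields, and in particular the heartbeat_window parameter are assumptions for illustration; the text above does not state how long a heartbeat remains good.

```python
from dataclasses import dataclass


@dataclass
class TermedDelegation:
    term: float         # seconds; 0 means renewed implicitly by the heartbeat
    granted_at: float   # time of the grant or of the last explicit renewal

    def renew(self, now: float) -> None:
        """Explicit, individual renewal by the ES."""
        self.granted_at = now

    def is_valid(self, now: float, last_exchange: float,
                 heartbeat_window: float) -> bool:
        if self.term == 0:
            # Any ES/DS network exchange counts as a heartbeat.
            return now - last_exchange <= heartbeat_window
        return now - self.granted_at <= self.term
```

For example, a delegation with a 30-second term granted at time 100 is valid at time 120 and expired at time 131 unless renew was called in between, regardless of how recently the two servers exchanged traffic.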
Delegation Object Definitions
The following provides exemplary code for implementing the feature set discussed above.
Following are typical external interfaces. Everything else can possibly be hidden inside of the delegation package.
The following describes exemplary possible usages for delegations and provides a detailed, specific example, but not an exhaustive, limiting description of the invention. As mentioned above, two Delegatable objects are considered: inode and File Byte Range. There are three different external interfaces to a File System:
All three of these interfaces communicate with a File System through the vfs set of functions, but the pattern of calls is different.
Most of the “Sys_” calls start from the path_walk routine. This routine walks through the elements of a given path, starting from the root or from the current directory. For every element, it checks the cached Dentries, validating them if found in the cache, or otherwise issues a lookup request. After the code is done using a Dentry, it issues a put request on it.
An NFS request does not have a path, so the “path_walk” routine is preferably never used. All NFS requests start by checking a received File Handle. An fh_to_dentry routine is supplied, so it is SFS code that parses the File Handle, finds the Inode number from it, and then finds the Inode itself, either from the cache or by doing read_inode.
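The walk described above can be modeled in a few lines. This is a simplified illustration and not the Linux kernel routine: Dentry, put, the dcache dictionary, and the lookup and is_valid callbacks are all hypothetical stand-ins.

```python
class Dentry:
    def __init__(self, name):
        self.name = name
        self.count = 0   # usage count, raised while a caller holds the dentry


def put(dentry):
    """Release a dentry once the caller is done with it."""
    dentry.count -= 1


def path_walk(path, root, cwd, dcache, lookup, is_valid):
    """Resolve path element by element; the caller must put() the result."""
    current = root if path.startswith("/") else cwd
    current.count += 1
    for name in (part for part in path.split("/") if part):
        child = dcache.get((current, name))
        if child is None or not is_valid(child):
            # Cache miss or stale entry: ask the file system.
            child = lookup(current, name)
            dcache[(current, name)] = child
        child.count += 1
        put(current)     # done with the parent element
        current = child
    return current
```

In a segmented file system, the lookup and is_valid callbacks are the points where a request may have to go to the DS that owns the parent directory.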
Delegations of Inodes
Operations for files or directories can be divided into four sets:
Symlinks are a special type of file in Unix-like file systems. It is very difficult to separate inode-type operations from data-type operations for symlinks because, in most cases, all of a symlink's data is stored inside the inode. So for symlinks only Inode-Read-type and Inode-Write-type operations are distinguished.
Inode-type operations for files and directories are the same.
Inode-Read-type operations are:
Inode-Write-type operations are:
The following table shows how different sharing types are applied to Inode delegation.
The DS creates an in-memory representation of an Inode when it receives an Inode-type operation at the request of the local Host or an ES. Any Data-type operation on an inode or directory is preceded by an Inode-type operation. The DS keeps track of Inode usage with the help of delegations.
When an in-memory representation is created on the DS side, a corresponding delegation is created and attached to the inode. The delegation contains a pointer to the SFS Connection to which the delegation is given. In the case of local usage, this pointer is NULL. When a delegation is created, the sharing type to give with the delegation should be decided. There could be different strategies for making this decision. The decision-making algorithm is preferably encapsulated in a special routine. A more sophisticated algorithm could be implemented that takes into consideration access pattern history, configuration factors, type of files, etc.
A simpler algorithm can be used that takes into consideration only other existing shares and type of the request.
Read-type operations are:
Write-type operations are:
Exemplary Read-type operations are:
Exclusive delegations may not be given to directories. Directory changes should occur on the DS side.
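A decision routine of the simpler kind might look like the sketch below. The function name choose_sharing and its parameters are hypothetical, and the specific policy is an assumption assembled from statements elsewhere in this description (an unused file can be delegated exclusively, an existing exclusive delegation is not broken by a read of the inode, and directories never receive Exclusive); the choice of None for write requests on an already shared inode is purely illustrative.

```python
def choose_sharing(others, write_request, is_directory, opened_locally=False):
    """Pick a sharing type for a new inode delegation.

    others: sharing types already issued to other holders for this inode.
    """
    if "exclusive" in others:
        # Leave the existing exclusive delegation intact.
        return "notification"
    if not others and not opened_locally and not is_directory:
        # Nobody else uses the inode, and it is not a directory.
        return "exclusive"
    if write_request:
        # Assumption: the DS performs the change itself.
        return "none"
    return "shared"
```

Encapsulating the policy in one routine, as the text recommends, means a more sophisticated version that looks at access history or file type can replace it without touching the callers.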
Symbolic Link Inodes
For symbolic links Inode-Read-type operations also include the following:
Write operations on Symlinks exist in the form of operations on the directory. For symbolic links, there may only be Notification delegations.
The ES creates an in-memory representation of an Inode when it receives an Inode-type operation at the request of the local Host. It knows that the Inode does not belong to a local Segment, which means that the Host (SFS Connection) represents an ES for this Inode. The ES always creates an in-memory representation for all remote Inodes. There are several vfs requests during whose processing an Inode can be created. The lookup and fh_to_dentry calls are supposed to create an Inode structure for an existing Inode. They both use the iget4 helper function in the Linux kernel, which in turn calls the read_inode vfs function in SFS. The functions create, mkdir, and symlink create a new inode. They use the Linux helper functions new_inode or get_empty_inode and fill in the inode themselves. All these functions make corresponding calls to a DS and receive a corresponding delegation in the reply.
When an in-memory representation is created on the DS side, a corresponding delegation is created and attached to the inode. The delegation contains a pointer to the SFS Connection with which the delegation is associated. In the case of local usage, this pointer is NULL. When a delegation is created, it should be decided what sharing type to give with the delegation. There could be different strategies for making this decision. The decision-making algorithm should be encapsulated in a special routine. A more sophisticated algorithm could be implemented that takes into consideration access pattern history, configuration factors, type of files, etc.
A simpler algorithm can be used that takes into consideration only other existing shares and type of the request.
Inode delegations are shared delegations. When an inode is originated on the DS side, a corresponding delegation is created:
This means: create an inode delegation, for the es (if NULL, for local use); the delegation is validated by the heartbeat.
When inode is changed on the DS side on behalf of the ES, a call to CreateDelegationDS is made again:
This time the sharing parameter is set to encs_shared revoke, which causes revocation of preferably all conflicting delegations (but typically not the delegation for the originating ES).
When the inode is no longer needed on the ES, a call to FreeDelegation is made. It triggers a free_enc( ) call on the DS.
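The life cycle above can be sketched as follows. The signatures are hypothetical and do not reproduce the specification's CreateDelegationDS, FreeDelegation, or free_enc( ) routines; the revoke flag stands in for the shared-with-revoke sharing parameter, and the sketch assumes that every delegation held by another party counts as conflicting.

```python
class Inode:
    def __init__(self, number):
        self.number = number
        # holder -> sharing type; the holder None stands for local use
        self.delegations = {}


def create_delegation_ds(inode, es, sharing, revoke=False):
    """Create or refresh the delegation for es; return the revoked holders."""
    revoked = []
    if revoke:
        for holder in list(inode.delegations):
            if holder != es:   # spare the originating ES
                del inode.delegations[holder]
                revoked.append(holder)
    inode.delegations[es] = sharing
    return revoked


def free_delegation(inode, es):
    """Stand-in for FreeDelegation on the ES and the DS-side release it triggers."""
    inode.delegations.pop(es, None)
```

A first call with revoke left False corresponds to the inode being originated; a later call with revoke set to True corresponds to the inode being changed on behalf of an ES.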
Delegations: A Case Study
Assume a scenario involving a DS and two ESs: ES1 and ES2. For simplicity, ‘open’ is used here as an example of an operation that may be delegated from the DS to an ES. In fact, the operation could be any other delegatable operation on an inode.
ES1 performs a readdir_plus (ls -l) type of operation, retrieving file names and file attributes for a directory owned by the DS. To do that, ES1 issues read_inode requests for every file in the directory. Assume a file from this directory was never requested by any ES and is not opened locally on the DS, so there is no delegation associated with it. In this case, the DS may grant an exclusive delegation to ES1.
Now when ES1 gets a request to open the file, set up a lease (oplock), set a lock or flock, it can do this without further communicating with DS. Assume that ES2 (other ES) also wants to read directory information and get file attributes. It also will issue read_inode request to the DS. However, this time the DS detects that most of the files have exclusive delegations assigned to ES1 and will grant notification delegation to the ES2. Note: exclusive delegation granted to ES1 is not broken.
Assume an application on the ES2 wants to open a file different from the file opened by the ES1. Since ES2 does not have the exclusive delegation to the inode, it will send open request to the DS. To execute this open request, DS has to revoke exclusive delegation it granted to the ES1. Since ES1 has no interest in this file it simply releases the delegation. When DS recognizes that inode is free (no delegations or local usage) it grants the exclusive delegation to the ES2. Now ES2 can re-execute open and perform other operations locally.
Finally, assume ES2 wants to open the file delegated to and used by ES1. As in the case above, ES2 does not have the exclusive delegation to the inode, so it sends an open request to the DS. To execute this open request, the DS revokes the exclusive delegation it granted to ES1. However, ES1 now has objects that are protected by this delegation, and before (or together with) releasing the delegation it sends these objects to the DS. The DS can look at the type of objects it received from ES1 and then can either grant both ES1 and ES2 shared delegations, or grant no delegations at all (or notification delegations) and start serving this file itself.
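The case study can be replayed with a toy model. All class and method names here are hypothetical. Where the text leaves the DS a choice after receiving objects from ES1, the sketch arbitrarily picks the option of granting both servers shared delegations.

```python
class CaseStudyES:
    def __init__(self, name):
        self.name = name
        self.delegations = {}    # file -> sharing type received from the DS
        self.open_files = set()

    def local_open(self, file):
        # Allowed without contacting the DS only under an exclusive delegation.
        assert self.delegations.get(file) == "exclusive"
        self.open_files.add(file)

    def release(self, file):
        """Answer a revoke: give up the delegation, return protected objects."""
        self.delegations.pop(file, None)
        return [file] if file in self.open_files else []


class CaseStudyDS:
    def __init__(self, servers):
        self.servers = {es.name: es for es in servers}
        self.issued = {}         # file -> {es name: sharing type}

    def _grant(self, es, file, sharing):
        self.issued.setdefault(file, {})[es.name] = sharing
        es.delegations[file] = sharing
        return sharing

    def read_inode(self, es, file):
        others = [s for n, s in self.issued.get(file, {}).items() if n != es.name]
        if not others:
            return self._grant(es, file, "exclusive")
        if "exclusive" in others:
            return self._grant(es, file, "notification")   # exclusive not broken
        return self._grant(es, file, "shared")

    def open(self, es, file):
        holders = self.issued.setdefault(file, {})
        for name, sharing in list(holders.items()):
            if name != es.name and sharing == "exclusive":
                owner = self.servers[name]
                if owner.release(file):           # owner still uses the file
                    self._grant(owner, file, "shared")
                    return self._grant(es, file, "shared")
                del holders[name]                 # owner had no interest
        return self._grant(es, file, "exclusive")
```

Running read_inode for ES1 and then ES2 on two files, followed by ES2 opening the file ES1 never opened and then the file ES1 has open, reproduces the exclusive, notification, exclusive, and shared outcomes of the narrative.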
Another exemplary scenario is depicted in
Assume ES-1 received a request to READ_INODE for file f (850) and sends it to DS.
The DS receives the DS_READ_INODE request for file f (820). It checks whether file f is opened or already delegated (822). Assume that it is not. In this case, the DS creates an exclusive delegation for ES-1 (824) and replies with information for the requested inode and delegation, including sharing type and delegation ID (826). ES-1 receives the reply, creates a representation for this exclusive delegation, and hooks it to the Segment object and inode (852). Some time later, ES-1 gets a write request for file f. Since it already has an exclusive delegation for this file, it does not have to send data to the DS right away and can safely cache dirty pages on its side. Write requests keep coming in, data gets cached, and a Lazy Writer efficiently pushes data to the DS side (854). Assume ES-1 also gets some lock requests (856). Since it already has an exclusive delegation for this file (858), it does not have to send these requests to the DS side and can process local locks itself (860).
Meanwhile, ES-2 also receives a READ_INODE request and sends it to the DS (880). The DS receives the DS_READ_INODE request for file f (830) and checks whether the file is delegated (832). The file now has an exclusive delegation, so the DS creates a None delegation (834) and sends data to ES-2 (836). ES-2 receives the reply, creates a representation for this None delegation, and hooks it to the Segment object and inode (882). Some time later, ES-2 receives a READ request and sends it to the DS (884). The DS receives a DS_READ request (838) and sends a break exclusive-to-shared for the delegation granted to ES-1 (840). ES-1 receives a BREAK_DELEGATION up-call (862) and flushes dirty pages and local locks to the DS (864). Now the DS knows that ES-1 is actively writing file f and that it has to maintain locks for ES-1, so it issues a Shared delegation for ES-2 and sends it back together with the requested data (844). As time passes, the DS knows whether ES-2 is active or not. If not, it can reissue an exclusive delegation to ES-1, effectively allowing it to start caching again.
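The ES-1 side of this scenario, caching writes and locks under an exclusive delegation and flushing them when the delegation is broken to shared, can be sketched as below. The classes DSStore and CachingES and their methods are hypothetical, writes are modeled as simple appends, and the Lazy Writer is reduced to an explicit lazy_write call.

```python
class DSStore:
    """Stand-in for the DS-side state of the files it owns."""

    def __init__(self):
        self.data = {}    # file -> bytes
        self.locks = {}   # file -> list of (es name, lock)


class CachingES:
    def __init__(self, name, ds):
        self.name, self.ds = name, ds
        self.sharing = {}       # file -> sharing type of the held delegation
        self.dirty = {}         # file -> bytes not yet pushed to the DS
        self.local_locks = {}   # file -> locks processed locally

    def write(self, file, data):
        if self.sharing.get(file) == "exclusive":
            self.dirty[file] = self.dirty.get(file, b"") + data
            return "cached"
        self.ds.data[file] = self.ds.data.get(file, b"") + data
        return "sent"

    def lock(self, file, lock):
        if self.sharing.get(file) == "exclusive":
            self.local_locks.setdefault(file, []).append(lock)
            return "local"
        self.ds.locks.setdefault(file, []).append((self.name, lock))
        return "sent"

    def lazy_write(self, file):
        """Push cached dirty data to the DS side."""
        if file in self.dirty:
            self.ds.data[file] = self.ds.data.get(file, b"") + self.dirty.pop(file)

    def break_delegation(self, file):
        """BREAK_DELEGATION up-call: flush dirty pages and local locks, downgrade."""
        self.lazy_write(file)
        for lock in self.local_locks.pop(file, []):
            self.ds.locks.setdefault(file, []).append((self.name, lock))
        self.sharing[file] = "shared"
```

After break_delegation, subsequent writes and lock requests go straight to the DS, which matches the point in the scenario where the DS starts maintaining locks for ES-1.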
Other embodiments are within the scope of the invention.