Publication number: US20050086192 A1
Publication type: Application
Application number: US 10/688,287
Publication date: Apr 21, 2005
Filing date: Oct 16, 2003
Priority date: Oct 16, 2003
Also published as: US20090327248
Inventors: Shoji Kodama
Original Assignee: Hitachi, Ltd.
Method and apparatus for improving the integration between a search engine and one or more file servers
US 20050086192 A1
Abstract
In one aspect of the invention, a search engine parses files stored among one or more file servers in order to create and maintain index information used by the search engine to perform searches. For a given file server, the population of files presented to the search engine is reduced in size to facilitate the process of updating the index. In another aspect of the invention, the file server limits the files presented in a directory list request made by a search engine. This reduces the number of files that need to be considered when performing an index update.
Claims(79)
1. A method for accessing data comprising:
storing a plurality of files in a file server;
monitoring operations on one or more of the files in the file server;
if a file in the file server is modified, then adding information representative of the file in an update list, wherein the update list contains information representative of files that have been modified;
providing an index, the index comprising information produced from an analysis of one or more of the files in the file server, the index being accessed by a first computer other than the file server;
obtaining information from the update list, thus identifying each file contained in the update list; and
for each file contained in the update list, updating the index with information produced from an analysis of the file, whereby the updating is performed only on those files which have been modified.
2. The method of claim 1 wherein the step of obtaining information from the update list includes communicating to the first computer first information representative of one or more files referenced in the update list.
3. The method of claim 2 wherein the update list is stored in the file server.
4. The method of claim 2 wherein the first information comprises file references contained in the update list.
5. The method of claim 2 wherein the first information comprises copies of the files referenced in the update list.
6. The method of claim 1 wherein the update list is stored in a first file and the step of obtaining information from the update list includes communicating a copy of the first file to the first computer.
7. The method of claim 6 wherein the first file is stored in the file server.
8. The method of claim 1 further comprising clearing the update list when the index is updated, wherein contents of the update list are deleted.
9. The method of claim 8 wherein the step of updating is performed by the first computer.
10. The method of claim 1 wherein the index is stored in the first computer.
11. The method of claim 10 wherein the computer is a search engine server, wherein the index facilitates performing a search of files stored in the file server.
12. A method for accessing data comprising:
storing a plurality of files;
receiving a request for a file operation to be performed on a first file;
if the file operation is a write operation, then storing a reference into an update list which identifies the first file, whereby the update list comprises references of only those files whose content have been modified;
receiving a request from a first computer for the update list and in response thereto, communicating information contained in the update list to the first computer; and
subsequent to the step of communicating information, removing the information contained in the update list.
13. The method of claim 12 wherein the one or more file operations comprises a plurality of write operations, wherein the step of storing identification information is performed only upon detecting a first of the write operations.
14. The method of claim 13 wherein the first write operation is received subsequent to receiving a clear request.
15. The method of claim 12 wherein the step of communicating information includes communicating content of the update list to the first computer.
16. The method of claim 15 wherein a copy of the update list is communicated to the first computer.
17. The method of claim 16 further comprising receiving from the first computer file operation requests for reading files identified in the update list, and in response to each such request communicating the requested file to the first computer.
18. The method of claim 17 wherein the first computer is a search engine.
19. The method of claim 12 wherein the step of communicating information includes providing a copy of each file that is referenced in the update list to the first computer.
20. A file server comprising:
storage for storing a plurality of files;
an update list; and
a file server controller,
the file server controller configured to perform the method steps of:
receiving a request for a file operation to be performed on a first file;
if the file operation is a write operation, then storing a reference to the first file into the update list, whereby the update list comprises references of only those files whose content have been modified;
receiving a request from a first computer for the update list and in response thereto, communicating information contained in the update list to the first computer; and
subsequent to the step of communicating information, removing the information contained in the update list.
21. The file server of claim 20 wherein the first computer is a search engine.
22. A method for accessing files from a file server comprising:
receiving a search request and in response thereto accessing an index using search criteria associated with the search request to obtain information which identifies any files that match the search criteria and communicating the information in the form of a search result, the index comprising information based on files stored among one or more file servers; and
updating the index comprising:
receiving file information from a first file server, the file information representative of only those files contained in the first file server that have been modified subsequent to a first point in time; and
for each file:
accessing the file;
parsing the file to produce index information; and
updating the index with the index information,
wherein only those files that have been modified since the first point in time are accessed and parsed.
23. The method of claim 22 wherein the first point in time is a time when a previous update of the index with files from the first file server was being performed.
24. The method of claim 22 further including creating the index, wherein the first point in time is a time subsequent to creating the index.
25. The method of claim 22 further comprising creating an index including:
accessing a plurality of first files from the first file server;
parsing one of the first files to produce index information; and
adding the index information into the index, thereby indexing one of the first files,
wherein the steps of parsing and adding are repeated for each of the first files,
wherein the first point in time is a time subsequent to indexing all of the first files.
26. The method of claim 25 further comprising communicating a first request to the first file server upon indexing the plurality of first files, whereby the first point in time is determined based on the file server receiving the first request.
27. The method of claim 25 wherein creating an index further comprises:
accessing a plurality of second files from a second file server;
parsing one of the second files to produce index information; and
adding the index information into the index, thereby indexing one of the second files,
wherein the steps of parsing and adding are repeated for each of the second files.
28. The method of claim 22 wherein the step of updating the index is performed for a plurality of file servers, wherein each file server is associated with its own first point in time which is a point in time subsequent to when the index was previously updated with files from the file server.
29. The method of claim 28 further including creating the index, wherein the first point in time is a time subsequent to creating the index.
30. The method of claim 28 wherein the first point in time is a time subsequent to a previous updating of the index.
31. The method of claim 22 wherein the step of updating the index is repeated for a plurality of additional file servers, wherein only those files in each additional file server which have been modified since the first point in time are accessed and parsed.
32. A computer for accessing files comprising:
a file access controller;
an index accessible by the file access controller; and
computer program code configured to control the file access controller to perform the method steps of claim 22.
33. A search engine server comprising:
a search engine controller;
an index accessible by the search engine controller; and
computer program code configured to control the search engine controller to perform the method steps of claim 22.
34. A system for data access comprising:
a first file server;
a second server configured to communicate with the first file server;
an index file accessible by the second server, the index file comprising index information obtained from files stored in the first file server; and
a first update file accessible by the first file server,
the first file server configured to add file references to the first update list for files in the first file server whose contents have changed since a first point in time, and further configured to provide first update information contained in the first update list to the second server,
the second server configured to:
receive the first update information;
access files referenced in the first update information;
analyze each of the files to produce index information; and
update the index with the index information,
whereby updating the index for files stored on the first file server includes accessing only those files which are referenced in the first update list.
35. The system of claim 34 wherein the second server is a search engine server.
36. The system of claim 34 wherein the first point in time is a time subsequent to when the index was created.
37. The system of claim 36 wherein the second server is further configured to create the index and to send a first request to the first file server after the index is created, the first point in time being a time subsequent to the first file server receiving the first request.
38. The system of claim 36 wherein the second server is further configured to send a first request to the first file server after the index is updated, the first file server further configured to clear the first update list in response to receiving the first request, the first point in time being a time subsequent to a time when the first update list is cleared.
39. The system of claim 34 further comprising a second file server and a second update list accessible by the second file server, the second server further being configured for communication with the second file server,
the index further comprising index information obtained from files stored in the second file server,
the second file server configured to add file references to the second update list for files in the second file server whose contents have changed since a second point in time and further configured to provide second update information contained in the second update list to the second server,
the second server further configured to update the index based on files referenced in the second update list.
40. The system of claim 39 wherein the second point in time is a point in time subsequent to when the index was created.
41. The system of claim 39 wherein the second server is further configured to create the index and to send a first request to the first file server and to the second file server after the index is created, wherein the first point in time is a time subsequent to the first file server receiving the first request, wherein the second point in time is a time subsequent to the second file server receiving the first request.
42. A method for accessing data comprising:
storing one or more files in a file server;
receiving a first directory list request for a first directory at the file server, the first directory list request originating from a first computer;
in response to receiving the first directory list request from the first computer, producing a first directory listing that is representative of contents of the first directory;
receiving a second directory list request for the first directory at the file server, the second directory list request originating from a second computer;
in response to receiving the second directory list request from the second computer, producing a second directory listing that is representative of contents of the first directory, files represented in the second directory listing being based on one or more criteria contained in a file filter table; and
in the second computer, updating an index based on the second directory listing.
43. The method of claim 42 wherein the second computer is a search engine server.
44. The method of claim 42 wherein the one or more criteria are based on one or more of: file types; file owner information; file creation dates; and file sizes.
45. The method of claim 42 wherein the file filtering table comprises one or more file types which indicate whether files are to be excluded from the second directory listing.
46. The method of claim 45 wherein the file filtering table further comprises one or more of file owner information, file creation dates, file sizes.
47. The method of claim 42 wherein the file filter specifies which files are to be included in the second directory listing.
48. The method of claim 42 wherein the file filter specifies which files are to be excluded from the second directory listing.
49. The method of claim 42 wherein the file filter specifies which files are to be included in the second directory listing and which files are to be excluded from the second directory listing.
50. A method for accessing data comprising:
storing one or more files in a file system on a file server;
providing a plurality of exports of the file system to a plurality of computer systems;
receiving from a first computer system a directory list request for a first directory stored on the file server;
producing a first directory listing that is representative of contents of the first directory if the first computer system has not mounted a predetermined one of the exports; and
producing a second directory listing that is representative of contents of the first directory if the first computer system has mounted a predetermined one of the exports, wherein files represented in the second directory listing are determined based on one or more criteria contained in a file filter table, wherein an index in the first computer system is updated based on information in the second directory listing.
51. The method of claim 50 wherein the first computer system is a search engine server.
52. The method of claim 50 wherein the file filter specifies which files are to be included in the second directory listing.
53. The method of claim 50 wherein the file filter specifies which files are to be excluded from the second directory listing.
54. A method for accessing data comprising:
storing one or more files in a file system on a file server;
receiving from a first computer system a directory list request for a first directory contained on the file server, the directory list request including source information comprising an identifier of the first computer system;
producing a first directory listing that is representative of contents of the first directory if the identifier of the first computer system is different from a predetermined identifier; and
producing a second directory listing that is representative of contents of the first directory if the identifier of the first computer system is the same as the predetermined identifier, wherein files represented in the second directory listing are determined based on one or more criteria contained in a file filter table, wherein an index in the first computer system is updated based on information in the second directory listing.
55. The method of claim 54 wherein the identifier is an internet protocol (IP) address.
56. In a file server, a method for providing access to files contained in the file server comprising:
organizing the files in a file system;
providing access to the file system to a plurality of computer systems;
storing information representative of one or more predetermined computer systems;
receiving from a first computer system a directory list request for a first directory stored on the file server;
producing a first directory listing that is representative of contents of the first directory if the first computer system is not one of the predetermined computer systems; and
producing a second directory listing that is representative of contents of the first directory if the first computer system is one of the predetermined computer systems, wherein files represented in the second directory listing are determined based on one or more criteria contained in a file filter table.
57. The method of claim 56 wherein the file filtering table comprises one or more file types which indicate, by file type, whether files are to be excluded from the second directory listing.
58. The method of claim 57 wherein the file filtering table further comprises one or more of file owner information, file creation dates, file sizes.
59. The method of claim 56 wherein the file filtering table comprises one or more criteria which indicate whether a file is to be excluded from the second directory listing.
60. The method of claim 56 wherein the file filtering table comprises one or more criteria which indicate whether a file is to be included in the second directory listing.
61. The method of claim 56 wherein the file filtering table comprises one or more first criteria which indicate whether a file is to be included in the second directory listing and one or more second criteria which indicate whether a file is to be included in the second directory listing.
62. The method of claim 56 further comprising providing one or more exports to the one or more computer systems, wherein the predetermined one or more computer systems are identified by the exports they have mounted, whereby the steps of producing are based on which of the one or more exports the first computer system has mounted.
63. The method of claim 56 wherein the predetermined one or more computer systems are identified by source addresses, whereby the steps of producing are based on a source address of the first computer system.
64. The method of claim 63 wherein the source address is an IP address.
65. A file server comprising:
storage for storing a plurality of files;
a file filter table; and
a file server controller,
the file server controller configured to perform the method steps of claim 56.
66. A method for accessing files comprising:
detecting a write operation to a first file in a file server;
selectively adding a representation of the first file into an update list based on one or more file filter criteria, wherein the detecting step and the selectively adding step is repeated for additional files in the file server;
communicating update information relating to content of the update list to a first computer,
subsequent to communicating the update information, clearing the update list; and
in the first computer updating a search index including accessing files contained in the file server based on the update information.
67. The method of claim 66 wherein file filter criteria specify one or more of: a file type; file ownership; file creation date; and file size.
68. The method of claim 66 wherein the update information comprises file references contained in the update list.
69. The method of claim 66 wherein the update information comprises copies of the files referenced in the update list.
70. The method of claim 66 wherein the update list is stored in a first file and the step of obtaining information from the update list includes communicating a copy of the first file to the first computer.
71. The method of claim 66 wherein the first computer is a search engine.
72. A method for accessing data comprising:
detecting write operations on first files in a file server;
for each first file, selectively adding a reference to the first file into an update list based on one or more filtering criteria;
receiving a first request from a first computer, and in response thereto communicating update information relating to the update list; and
subsequent to communicating the update information, clearing the update list.
73. The method of claim 72 wherein the filtering criteria include at least one of: a file type; file ownership; file creation date; and file size.
74. The method of claim 73 wherein the filtering criteria specify whether to add a file to the update list.
75. The method of claim 73 wherein the filtering criteria specify whether to exclude a file from the update list.
76. The method of claim 72 wherein the update information is a copy of the update list that is transferred to the first computer.
77. The method of claim 72 wherein the update information comprises copies of files referenced in the update list.
78. The method of claim 72 wherein the first computer is a search engine.
79. A file server for providing access to data comprising:
storage for storing a plurality of files;
an update list;
a file filter table; and
a file server controller,
the file server controller configured to perform the method steps of claim 72.
Description
BACKGROUND OF THE INVENTION

The present invention is related to computer file access and in particular to improving the performance of index maintenance in search engines.

The Internet is commonly associated with the world wide web (the “web”). The web has facilitated an explosive proliferation of information to the millions of users who access the web. This information is made available in the form of files by web servers. However, the Internet also provides access to files on servers which pre-date the web, such as bulletin boards, FTP sites, and so on.

An intranet, which is a private network of a company or other organization, is also used for sharing files. In this case, a file server or a NAS (Network Attached Storage) device is commonly used to store and retrieve files, and the NFS and CIFS protocols are used to access them.

Search engines have become a valuable and commonly used tool for navigating the many millions of files on the Internet and/or file servers. Typically, the search engine accepts a search request from a user and obtains a list of file names that match the search conditions.

An integral component of a search engine is its “index.” The index is a collection of information that is parsed or otherwise generated from an analysis of a file, and comprises keywords and related information used by the search engine to facilitate a file search. The specific information content and data structures of the index vary from one search engine to another, and are beyond the scope of the present invention.

However, common operations that are performed by typical search engines include the creation of the index and the subsequent maintenance or update of the index. The creation of the index typically involves the search engine checking the update date of every file, reading every updated file on the Internet and/or file servers, and parsing its contents to build up the index.

Invariably, file contents change over time. The search engine must therefore perform updates to the index in order that the index be current. This task typically involves once again crawling the web and/or file servers to access attributes of each file, and then determining whether the file has been updated since the last time the index was updated; or since the index was created, in the case of the very first index update. This determination can be made, for example, by accessing the modification date of the file and comparing it against the index. Making this check reduces the update effort and thus improves the update time; not every file will be re-indexed, only those that have changed relative to the time of the index.
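The conventional check described above can be illustrated with a minimal sketch. It assumes the files are reachable through a locally mounted directory tree and that the time of the last index update is kept as a POSIX timestamp; the function name and parameters are hypothetical and are not taken from the specification.

```python
import os


def files_needing_reindex(root, last_index_time):
    """Return paths under root whose modification time is later than last_index_time.

    Hypothetical helper: every file is examined, even though only the
    changed ones are returned for re-indexing.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # One attribute lookup per file, changed or not.
            if os.stat(path).st_mtime > last_index_time:
                changed.append(path)
    return sorted(changed)
```

Only the changed files are parsed again, but the attribute lookup is still made once for every file in the tree.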

Nevertheless, this update process remains a tedious task because the modification date of every file needs to be checked. This creates a large volume of traffic just for the purpose of checking file attributes. It is therefore very desirable to reduce the Internet traffic and/or intranet traffic attributed to the indexing function. It is also desirable to further reduce the indexing effort in order to further improve the update time of an index.

SUMMARY OF THE INVENTION

In accordance with one aspect of the invention, an update list is maintained in a file server. Update information based on the update list is communicated to a search engine. The update information comprises only those files that have been modified since a previous update operation on an index in the search engine.
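By way of illustration only, the update list aspect might be sketched as follows. The class names, the in-memory file store, and the keyword index are assumptions made for the example; the specification leaves the index structure and the transport protocol open. The sketch also clears the list in a separate step after it is read, so a write arriving between the two steps would be lost; a real server would need to guard against that.

```python
class FileServer:
    """Hypothetical in-memory file server that tracks modified files."""

    def __init__(self):
        self.files = {}        # path -> file contents
        self.update_list = []  # paths written since the list was last cleared

    def write(self, path, data):
        self.files[path] = data
        # Record each modified file once, however many writes it receives.
        if path not in self.update_list:
            self.update_list.append(path)

    def read(self, path):
        return self.files[path]

    def get_update_list(self):
        return list(self.update_list)

    def clear_update_list(self):
        self.update_list.clear()


class SearchEngine:
    """Hypothetical search engine holding a keyword -> paths index."""

    def __init__(self):
        self.index = {}

    def update_from(self, server):
        paths = server.get_update_list()
        for path in paths:
            # Drop stale entries for this file, then re-parse its content.
            for postings in self.index.values():
                postings.discard(path)
            for word in server.read(path).split():
                self.index.setdefault(word.lower(), set()).add(path)
        server.clear_update_list()
        return paths
```

Only the files named in the update list are read and parsed; unmodified files generate no traffic at all.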

In accordance with another aspect of the invention, the file server presents a restricted directory listing to a search engine, as compared to a directory listing of the same directory to a client other than a search engine. A set of one or more filtering criteria can be used to limit the number of files presented to the search engine. This reduces the number of files the search engine must examine when performing an update of its search index.
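A minimal sketch of such a restricted listing follows. It assumes that the search engine is recognized by its source address, as in the claims that recite an IP address identifier, and that the filtering criteria are file type and file size. The addresses, extensions, and size limit are made-up values for illustration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    name: str
    size: int  # bytes


# Hypothetical filter criteria and search engine address (illustrative values only).
EXCLUDED_EXTENSIONS = frozenset({".tmp", ".o"})
MAX_SIZE_BYTES = 1_000_000
SEARCH_ENGINE_ADDRESSES = frozenset({"192.0.2.10"})


def passes_filter(entry):
    """Apply the assumed file type and file size criteria to one entry."""
    extension = os.path.splitext(entry.name)[1].lower()
    return extension not in EXCLUDED_EXTENSIONS and entry.size <= MAX_SIZE_BYTES


def list_directory(entries, requester_address):
    """Return the full listing to ordinary clients and a filtered one to the search engine."""
    if requester_address in SEARCH_ENGINE_ADDRESSES:
        return [e.name for e in entries if passes_filter(e)]
    return [e.name for e in entries]
```

An ordinary client still sees every file in the directory; only the listing returned to the search engine is reduced.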

In accordance with still another aspect of the invention, an update list is maintained in the file server. Files referenced in the update list are limited depending on one or more filtering criteria.
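The combination of an update list with filtering criteria might look like the following sketch. It assumes the criteria are expressed as shell-style file name patterns to exclude; the class and method names are hypothetical.

```python
import fnmatch


class FilteredUpdateList:
    """Hypothetical update list that ignores writes to excluded files."""

    def __init__(self, exclude_patterns):
        self.exclude_patterns = list(exclude_patterns)
        self.paths = []

    def record_write(self, path):
        """Add path to the list unless a filter excludes it; return True if tracked."""
        if any(fnmatch.fnmatchcase(path, pattern) for pattern in self.exclude_patterns):
            return False
        if path not in self.paths:
            self.paths.append(path)
        return True

    def drain(self):
        """Return the current contents and clear the list in one step."""
        taken, self.paths = self.paths, []
        return taken
```

For example, with the patterns `*.tmp` and `*.log`, a write to `/docs/a.tmp` is never recorded, so the search engine is never asked to re-index it.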

BRIEF DESCRIPTION OF THE DRAWINGS

Aspects, advantages and novel features of the present invention will become apparent from the following description of the invention presented in conjunction with the accompanying drawings:

FIG. 1 is a high level generalized block diagram of an illustrative embodiment of the present invention;

FIG. 2 is a generalized flow diagram highlighting the processing for creating an index;

FIG. 3 highlights the processing of file service requests in a file server;

FIG. 4 is a high level flow diagram showing steps in the file server for processing update lists;

FIG. 5 is a flow diagram highlighting steps in the file server for processing a write request;

FIG. 6 is a flow diagram highlighting steps in the file server for processing a write request according to another embodiment of the present invention;

FIG. 7 is a generalized flow diagram highlighting steps in the file server for processing a directory listing request;

FIG. 8 illustrates an example embodiment of an update list;

FIG. 9 illustrates an example embodiment of a file filtering table; and

FIG. 10 illustrates multiple exports.

DESCRIPTION OF THE SPECIFIC EMBODIMENTS

FIG. 1 shows a high level block diagram outlining the basic architecture of an example embodiment of a search engine environment in accordance with the present invention. The figure shows at least one file server 0104 having one or more files which can be accessed by users on a network 0103. A file server controller 010403 provides the processing capability conventionally associated with a file server. This may include a central processing unit (CPU), memory, and storage for program code to control the operation of the CPU.

The files stored on a file server are organized into a system of files 010401. In one embodiment of the invention, the file server can access an update list 010402. In general, the update list can be contained in physical storage in a suitable location. In a more general sense, the file server element 0104 shown in the figure represents a plurality of file servers, each storing its own set of files. A typical protocol that file servers use is the network file system protocol (NFS). Another conventional protocol is the common internet file system (CIFS) protocol. Still other protocols such as HTTP can be used by a file server.

The architecture typically includes at least one NFS/CIFS client 0101 that communicates with the file server(s) 0104 over the network 0103 via the NFS or CIFS protocol in order to read and write files in the file server. Clients include creators of the files, and users who can access the files to read them, to modify them, or both. In a more general sense, the client element 0101 of FIG. 1 represents a plurality of users, each capable of accessing one or more of the file servers.

A search engine server 0105 communicates via the network 0103. A search engine controller 010502 provides the processing capability conventionally associated with a search engine. This may include a central processing unit (CPU), memory, and storage for program code to control the operation of the CPU. Although this embodiment of the invention is described using a search engine, it will be appreciated from the following description that aspects of the invention can be incorporated into any machine in a networked environment that tracks files and updates to the files. The search engine is merely a convenient example to use because search engines are well understood and familiar to most people who interact on a computer network.

A typical function of most search engines is the creation and maintenance of an index. The specific content and structure of the information comprising an index, and the specifics of the parsing function are beyond the scope of the present invention. For the purposes of discussion, it can be appreciated by those of ordinary skill that one can refer to an index on a particular file system, or an index associated with a file system. The index information can be represented generically as an index database 010501, without loss of generality.

The index is created and subsequently updated and otherwise maintained by the search engine. This activity includes parsing or otherwise generating information from files in the file server(s) 0104 in order to create the index database. It can be appreciated that the search engine can use the same NFS or CIFS protocol to access files in the file server(s).

The architecture shows at least one file search client 0102. These are the users who access the search engine to submit file search requests. It can be appreciated that a “user” can be a human user or a machine user. An interface is understood to be provided by the search engine that is suitable to the kind of user being serviced. In a generalized sense, the file search client element 0102 shown in FIG. 1 represents a plurality of search clients.

The network 0103 is generally any suitable communication network that allows for communication among the various servers and clients mentioned above. The figure shows a local area network (LAN), but it can be appreciated that other communication networks are equally suitable. Connectivity to a LAN is typically provided by the Ethernet standard, using the TCP/IP protocol.

The file server 0104 and the search engine server 0105 each can be embodied in conventional computer hardware (e.g., comprising a suitable CPU, memory, storage devices, and so on). Conventional software platforms can be used to support the servers; e.g., Unix or other UNIX-based OSs, Macintosh OS, various Microsoft OSs, and so on. It is also possible that the file server and the search engine server can run on the same hardware and software platform. For example, an NFS server and search engine software can run on the same Linux OS.

Referring to FIG. 2, processing in the search engine includes creating an index database. The “index” is used by the search engine when processing a search request. The index is consulted to identify those files, if any, which satisfy a search client's request. It can be appreciated that the term “index” is a very generalized reference to the specific data that a particular search engine may use. It is understood that the specific data structures and storage formats which comprise an “index” are likely to vary from one search engine to another. However, a search engine's index is likely to contain information about a file and its content (e.g., keywords).

In a particular embodiment, the index may be one large database or some other single organization of data representing all file servers. However, logically, one can refer to each file server as having its own associated index; it being understood that reference is being made with respect to that portion of index structure associated with a file server.

Thus, when a search engine first comes online, an index is created for all of the files that can be accessed from a file server; this is done for every file server that is made known to the search engine. Also, if a search engine which is already online learns of a new file server, an index needs to be created for the accessible files contained in that file server. This is represented in FIG. 2 at decision step 0201, where a determination is made whether the index is to be created for a particular file server.

For a new file server, the search engine sends an initialization operation (see FIG. 4) to the file server in a step 0202. This causes the file server to clear its associated update list 010402. This embodiment assumes, without loss of generality, the use of one file server as an example. Thus, the step 0201 is for creating the index for the first time. In the case of multiple file servers, a table can be provided to track the file servers for which the search engine has made an index and the time at which each index was made. See FIG. 2A as an example. In this example, one file server can have multiple export points.

Referring to the decision step 0201, if the index for a file server was previously created, then the search engine accesses update information contained in an update list 010402 associated with that file server (step 0203). Then in a step 0204, files referenced in the update information are accessed by the search engine (see FIG. 4). For each file, the search engine will parse through (or otherwise analyze) the contents of the file to produce index information that is suitable for the index. The search engine can access each file one at a time and perform the parsing. Alternatively, the search engine can access groups of files at a time and perform the parsing operation on the group.
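The flow of FIG. 2 described above can be sketched as follows. This is only an illustrative sketch under stated assumptions: the class names `FakeFileServer` and `SearchEngine`, the in-memory file store, and the word-set "parser" are all hypothetical and are not taken from the specification.

```python
from typing import Dict, List, Set


class FakeFileServer:
    """Hypothetical in-memory stand-in for a file server with an update list."""

    def __init__(self, name: str, files: Dict[str, str]) -> None:
        self.name = name
        self.files = dict(files)
        self.update_list: List[str] = []

    def initialize_update_list(self) -> None:
        # Initialization operation: clear (or create) the update list.
        self.update_list = []

    def get_update_list(self) -> List[str]:
        # Retrieval operation: return the current entries and clear the list.
        entries, self.update_list = self.update_list, []
        return entries

    def write(self, path: str, text: str) -> None:
        # Record the modified file once, then store the new content.
        if path not in self.update_list:
            self.update_list.append(path)
        self.files[path] = text


def parse(text: str) -> Set[str]:
    # Toy "parsing": the index information is the set of lowercase words.
    return set(text.lower().split())


class SearchEngine:
    def __init__(self) -> None:
        # Assumed index layout: server name -> file path -> words.
        self.index: Dict[str, Dict[str, Set[str]]] = {}

    def refresh(self, server: FakeFileServer) -> List[str]:
        """Create or update the index for one server; return the paths parsed."""
        if server.name not in self.index:        # decision step 0201
            server.initialize_update_list()      # step 0202
            self.index[server.name] = {}
            paths = sorted(server.files)         # first time: every accessible file
        else:
            paths = server.get_update_list()     # step 0203
        for path in paths:                       # step 0204
            self.index[server.name][path] = parse(server.files[path])
        return paths
```

In this sketch the first call to `refresh` for a server parses every file, while later calls parse only the files that were written since the previous call.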

In one implementation, the update list can be accessed by the search engine just like any other file. Thus, the file server creates a special file that contains a list of updated files, and the search engine retrieves a copy of the file from the file server and stores it as a local copy. The search engine also deletes the contents of the special file. The search engine can then operate on the local copy; e.g., reading through the file to identify the files to parse. Alternatively, a protocol can be defined between the search engine and the file server to obtain the information contained in the update list. For example, the file server can communicate to the search engine the file names in the update list one at a time, or a single list of every file name in the update list, to be processed in the search engine. In accordance with another implementation, the search engine can receive the actual files referenced in the update list instead of a list of file names from the file servers.
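The special-file variant described above might look like the following from the search engine's side. This is a minimal sketch, assuming the update list is a plain text file with one path per line; the function name `fetch_update_file` and the file name `updated_files.txt` are hypothetical, and a local temporary directory stands in for a mounted NFS/CIFS export.

```python
import tempfile
from pathlib import Path
from typing import List


def fetch_update_file(special_file: Path) -> List[str]:
    """Read the list of updated files, then delete the special file's contents."""
    text = special_file.read_text(encoding="utf-8")   # retrieve a local copy
    special_file.write_text("", encoding="utf-8")     # clear the special file
    return [line for line in text.splitlines() if line.strip()]


def demo() -> List[str]:
    # A temporary directory stands in for the mounted export (assumption).
    with tempfile.TemporaryDirectory() as tmp:
        special = Path(tmp) / "updated_files.txt"     # hypothetical file name
        special.write_text("/docs/a.doc\n/docs/b.ppt\n", encoding="utf-8")
        names = fetch_update_file(special)
        assert special.read_text(encoding="utf-8") == ""
        return names
```

The sketch ignores concurrency: an entry appended by the file server between the read and the clear would be lost, so a real implementation would need some form of coordination.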

Referring to FIG. 3, a file server receives many requests for file operations. Typical operations include, for example, file creation, file open, file read, file write, directory listings, and so on. The specific file operations provided vary depending on the file system and the protocols for communicating with the file server; e.g., NFS, CIFS, etc.

Thus, in a step 0301, the file server receives a file operation request from a client. In a determination step 0302, the request is handed off to an appropriate handler. For example, a file open request is handled by a file open handler 0303. A file read request is handled by a file read handler 0304. A file write request is handled by a file write handler 0305 in accordance with an embodiment of the present invention. This aspect of the invention will be discussed below. A directory listing request is handled by a directory listing handler 0306 in accordance with another aspect of the present invention. The directory listing request will be discussed further below. A “get update list” request is handled by the handler 0307. This function is provided in accordance with an embodiment of the present invention and is discussed below.
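The hand-off in determination step 0302 can be pictured as a lookup from operation type to handler. The sketch below is illustrative only; the request shape (a dictionary with `op` and `path` keys), the operation names, and the placeholder handlers are assumptions and do not correspond to any real NFS or CIFS message format.

```python
from typing import Callable, Dict

Request = Dict[str, str]


def handle_open(request: Request) -> str:
    return f"open {request['path']}"


def handle_read(request: Request) -> str:
    return f"read {request['path']}"


def handle_write(request: Request) -> str:
    return f"write {request['path']}"


def handle_list_directory(request: Request) -> str:
    return f"list {request['path']}"


def handle_get_update_list(request: Request) -> str:
    return "update list"


# Determination step 0302: map each kind of request to its handler.
HANDLERS: Dict[str, Callable[[Request], str]] = {
    "open": handle_open,                        # handler 0303
    "read": handle_read,                        # handler 0304
    "write": handle_write,                      # handler 0305
    "list_directory": handle_list_directory,    # handler 0306
    "get_update_list": handle_get_update_list,  # handler 0307
}


def dispatch(request: Request) -> str:
    """Step 0301: receive a request and hand it to the appropriate handler."""
    handler = HANDLERS.get(request["op"])
    if handler is None:
        raise ValueError(f"unsupported operation: {request['op']}")
    return handler(request)
```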

Referring to FIG. 5, processing of a file write operation in the file server in accordance with the present invention will be discussed. A file write operation changes (modifies) the content of the specified file. The file server makes a determination in a step 0501 whether this is the first write operation on the file since it was opened. If it is the first write operation since the file was opened, then in a step 0502 a reference to the file is placed in the update list 010402 associated with the file server. If it is a write operation subsequent to the first write operation after the file was opened, then processing proceeds to the next step. Typically, the next step is to effect the requested write operation (step 0503), the details of which depend on the specific file server.

The purpose of checking for the first write operation in step 0501 is to avoid having multiple entries in the update list 010402 for the same file. One way to achieve this is as disclosed in step 0502. Alternatively, the update list can be inspected each time to determine whether the file is already in the list or not.
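A minimal sketch of the write handling of FIG. 5 follows, assuming a per-file flag that is reset on each open. The class name `UpdateTrackingFileServer` and its in-memory storage are hypothetical.

```python
from typing import Dict, List


class UpdateTrackingFileServer:
    """Hypothetical file server that records first writes in an update list."""

    def __init__(self) -> None:
        self.contents: Dict[str, bytes] = {}
        self.update_list: List[str] = []
        self._written_since_open: Dict[str, bool] = {}

    def open(self, path: str) -> None:
        self.contents.setdefault(path, b"")
        self._written_since_open[path] = False   # no write yet since this open

    def write(self, path: str, data: bytes) -> None:
        if path not in self._written_since_open:
            raise ValueError(f"file is not open: {path}")
        if not self._written_since_open[path]:   # step 0501: first write?
            self.update_list.append(path)        # step 0502: add a reference
            self._written_since_open[path] = True
        self.contents[path] += data              # step 0503: perform the write
```

In this sketch a file that is reopened and written again before the list is cleared would be appended a second time; the alternative mentioned above, inspecting the list for an existing entry, would avoid that.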

In the case of file creation, a created file initially contains no data. Therefore, it is not necessary that the file server make an entry in the update list to refer to a newly created file. When content is placed in the file, this will occur via a file write operation. However, in some file systems, the file create operation may leave the file in a state where subsequent write operations can be performed; thus obviating the need for a separate file open function call. Therefore with reference to the decision step 0501 in FIG. 5, it can be appreciated that the test can be modified to include testing for the first write operation following a file open operation or a file create operation.

Referring to FIG. 8 for a moment, the information contained in the update list identifies the file that is the object of the write operation. For example, in a hierarchical directory organization, a complete path name of the file should suffice. Other naming conventions might be more suitable. The specific information will depend on the specifics of the file server, the file system, and the like. Thus in FIG. 8, a typical exemplary implementation of the update list 010402 is shown. The implementation shown comprises a list of file names. Each file that is referenced in the update list has been modified. Each entry 080101 comprises a file name, including a full path name.

Referring to FIG. 4, the “get update list” request comprises two kinds of operations. When the file server receives a get update list request in a step 0401, it determines in a decision step 0402 whether the request is for an initialization operation or for a retrieval operation of the update list. If the request is an initialization operation, then in a step 0403, the file server simply clears the update list, if one previously existed. If an update list did not already exist, then the file server will create an update list. This aspect of the invention is discussed further below.

The particular implementation shown in FIG. 4 uses a special protocol between a file server and a search engine to communicate an updated file list. It can be appreciated that in accordance with another implementation, the search engine can use standard NFS/CIFS protocols to get an updated file list from a file server. In such an implementation, the updated file list is stored on the file server as a file. The search engine then reads the file via standard NFS/CIFS protocols and thereby learns which files have been updated. The content in the special file must be cleared after the read by the search engine.

Continuing with the figure, if the request is a get_file_list operation, then the file server will communicate the update list to the search engine (step 0404). A copy of the file can be communicated to the search engine, just like any other file. Alternatively, the file server can communicate the actual files to the search engine; either one at a time, or in groups, or in some other suitable manner. For each file in the update list, the search engine will analyze the file and update the index with information produced by the analysis.

When the update list is communicated to the search engine, the update list is cleared, in a step 0405. Thus, if the update list is communicated to the search engine as a single file, the update list can be cleared after the communication is complete. If the file server communicates files to the search engine instead, then each file that is referenced in the update list can be deleted from the update list after it is communicated to the search engine.
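The two kinds of “get update list” operation in FIG. 4 can be sketched as below. This is a hedged illustration: the `UpdateList` class, the operation strings `"initialize"` and `"retrieve"`, and the use of `None` to mean "no list exists yet" are assumptions made for the example.

```python
from typing import List, Optional


class UpdateList:
    """Hypothetical update list; `entries` is None until a list is created."""

    def __init__(self) -> None:
        self.entries: Optional[List[str]] = None

    def record(self, path: str) -> None:
        if self.entries is None:
            self.entries = []
        if path not in self.entries:
            self.entries.append(path)


def handle_get_update_list(update_list: UpdateList, operation: str) -> List[str]:
    """Steps 0401-0405: initialize the list, or return its entries and clear it."""
    if operation == "initialize":              # decision step 0402 -> step 0403
        update_list.entries = []               # create or clear the list
        return []
    if operation == "retrieve":                # decision step 0402 -> step 0404
        snapshot = list(update_list.entries or [])
        update_list.entries = []               # step 0405: clear after sending
        return snapshot
    raise ValueError(f"unknown operation: {operation}")
```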

After the update list is cleared, the list is once again filled with references to files that are modified. The files referenced in the update list therefore represent those files that have been modified subsequent to a point in time when the update list was last cleared. Stated from a different point of view, the update list contains references to files that have been modified since the last time the update list was retrieved by the search engine.

From the point of view of the search engine, files referenced in the update list represent those files that have been modified subsequent to a point in time when the index was being updated. It can be appreciated that updating the index can be a time consuming operation. Thus, in practice, the clearing of the update list by the file server (by virtue of a get_file_list request) may very well occur before the completion of updating the index by the search engine.

The next time the search engine retrieves the update list to perform an update of the index, it will only need to parse through those files which were modified since the previous update operation on the index. The update list therefore avoids the search engine having to perform the brute force task of accessing and parsing every file on a given file server in order to update the index.

An index can be created for a file system that does not have one. This situation may arise because the search engine was not previously aware of the file system, or for some reason it was decided to delete a previously existing index for the file system. When the search engine has completed the process of creating the index, it will send a get_file_list request for an initialization operation. This has the effect of creating the update list or of clearing an existing update list. If the file system was not previously known, then the file system is not likely to have an update list. In that case, an update list is created. If the file system already had an update list, then the initialization operation will serve to clear the list.

Based on the foregoing discussion, it can be appreciated that each file server has its own associated update list. However, as an alternative implementation, it is conceivable that an update list can be implemented that is accessible by two or more file servers and that contains references to modified files from the two or more file servers. In the most general case, a global update list can be provided. However, this type of update list may or may not be preferable, depending on performance considerations, implementation considerations, and so on. In another alternative, one file server maintains multiple update lists, with one update list associated with each export point of the file server.

Referring to FIG. 10, in accordance with another embodiment of the present invention, a file server can be configured to provide different exports of a file system to different clients. Under NFS and CIFS conventions, a client “mounts” an export of the file system. Mounting is a process involving a series of communications between NFS/CIFS clients and the file server in order to make the export accessible by the NFS/CIFS clients. An export is a name of a file system to be shared or a name of a directory to be shared by NFS/CIFS clients.

As illustrated in FIG. 10, the file server 0104 provides a first export 1001 that can be mounted by clients other than a search engine. A second export 1002 is provided by the file server to be mounted by the search engine. Both exports are on the same file system or directory 010401. The file server knows which export the search engine has mounted; for example, a mapping relationship can be described in a special file in the file server.

In accordance with this embodiment, the search engine performs conventional processing to either create an index on the files on the file server, or to update the index. The search engine mounts the export that has been made available by the file server. An administrator of a file server creates an export for a search engine. An administrator of the search engine specifies a list of exports that the search engine needs in order to make an index. This can be done, for example, by editing a special file in the search engine. By using a directory service, this configuration can be done systematically. The search engine then makes one or more requests for directory listing(s) of files on the file server; for example, using the standard requests provided in the NFS and CIFS protocols.

In the case where the index is being created for the file system, each file identified in the directory listing(s) is parsed and indexed. In the case where the index is being updated, the search engine determines whether the file should be parsed for indexing based on the modification date (or some other similar information) of the file. If the file was modified since the last time the index for this file system was updated, then the file is parsed and indexed; otherwise it is not parsed.
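The decision described above reduces to a comparison between each file's modification time and the time of the last index update. The following is a small sketch under assumed data shapes: the directory listing is modeled as a mapping from path to modification timestamp, and `files_to_parse` is a hypothetical helper name.

```python
from typing import Dict, List, Optional


def files_to_parse(
    listing: Dict[str, float], last_index_time: Optional[float]
) -> List[str]:
    """Select the files from a directory listing that need parsing.

    `listing` maps a path to its modification timestamp. A value of None for
    `last_index_time` means the index is being created for the first time.
    """
    if last_index_time is None:
        return sorted(listing)                 # index creation: parse everything
    return sorted(
        path for path, mtime in listing.items() if mtime > last_index_time
    )
```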

In accordance with this aspect of the invention, the list of files made available via a directory listing by the file server to the search engine is smaller than the list of files available in a directory listing to other clients. This is made possible because the search engine mounts an export that is different from the export that is mounted by clients other than the search engine. As will be discussed now, the file server is configured to perform differently depending on the export on which a file service request (e.g., a directory listing request) is made.

Referring to FIG. 9, a file server configured according to this aspect of the invention includes a file filtering table 0901. The table contains conditions (criteria) 090101 that describe what kinds of files will be made available to an export that is mounted by the search engine. For example, users of the search engine may want to restrict files to be searched based on file type. Types of files can be determined by a file extension such as .ppt, .doc, .xls, and so on. In this case, files having certain file extensions may be determined to be candidates for searching. Another criterion for determining which files can be searched might be based on file ownership, file creation time, file size, and so on.

The file filter table embodiment shown in FIG. 9 is an inclusive table. This means that the file filter table specifies those files which should be included in the directory listing. For example, all “.doc” files will be included in the directory listing for a given directory. However, “.exe” files will not be included; i.e., they are excluded from the list. It can be appreciated that the file filter table can instead be an “exclusionary” table. In that case, the table specifies those files which will be excluded from the directory list. Thus, for example, an exclusionary table might contain the criterion of “.exe”, meaning that all files in a directory will be included in the directory list except for files of type “.exe”. Still another variation of the file filter table is to be able to specify both files to be included and files to be excluded.
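The inclusive, exclusionary, and combined variants can be illustrated with a single structure keyed on file extension. This is a sketch only: the class `FileFilterTable`, its two fields, and the rule that exclusions take precedence over inclusions are assumptions for the example, and criteria such as ownership or size are not modeled.

```python
import os
from dataclasses import dataclass, field
from typing import Set


@dataclass
class FileFilterTable:
    """Hypothetical filter table keyed on lowercase file extensions."""

    include_extensions: Set[str] = field(default_factory=set)
    exclude_extensions: Set[str] = field(default_factory=set)

    def allows(self, filename: str) -> bool:
        ext = os.path.splitext(filename)[1].lower()
        if ext in self.exclude_extensions:     # exclusionary criteria win (assumed)
            return False
        if self.include_extensions:            # inclusive table: must match
            return ext in self.include_extensions
        return True                            # purely exclusionary table


inclusive = FileFilterTable(include_extensions={".doc", ".ppt", ".xls"})
exclusionary = FileFilterTable(exclude_extensions={".exe"})
```

With these two tables, `inclusive.allows("setup.exe")` and `exclusionary.allows("setup.exe")` are both false, while `exclusionary.allows("notes.txt")` is true because nothing excludes it.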

Typically, files that are indexed are those that contain text. Some search engines will also index files that have graphics or some kind of image data, if there is corresponding text in the file. The file filter table can reduce the set of files that the search engine must consider by filtering out executable files or other files which do not contain data that can be searched.

FIG. 7 illustrates an example of the processing for a directory request that is made on an export that a search engine has mounted. In a step 0701, the file server determines whether the directory listing request was issued from the search engine. The directory listing request includes information as to which export the request was issued on. Since the file server knows which export the search engine has mounted, the file server can make this determination. If the request did not come from a search engine, then in a step 0707, a conventional directory listing is produced and communicated to the requesting client.

If the request originated from a search engine, then in a step 0702, the file server consults the file filtering table 0901 to determine (step 0703) for each file in that directory whether it will be contained in the directory listing information. If the file meets the criteria set forth in the file filtering table, then a reference to the file is added to a temporary list (step 0704). The file server can determine whether the request came from a search engine or from a client by looking at which export the request has been issued on, by looking at an IP address of the requester, or by some other suitable identification technique. Also, the file server can maintain a suitable list that identifies one or more computer systems (e.g., search engines) for which the file filtering table will be used to satisfy a directory request.

If the file does not match any of the criteria in the file filtering table, then it will not be added to the temporary list. In a step 0705, a check is made to determine whether all of the files have been checked against the file filtering table. If more files need to be checked, then processing continues with step 0702. Otherwise, the temporary list is further processed in a step 0706 to produce a suitable directory listing that can then be communicated back to the search engine. This might include adding a listing of the subdirectories to the temporary list. File attributes of the files contained in the temporary list may need to be supplied. This might include information such as file size, creation date, modify date, permission information, and so on. The directory information is then communicated to the search engine as a response to the directory listing request.
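The processing of FIG. 7 can be sketched as one function. The example is illustrative and rests on assumptions: directory entries are modeled as dictionaries with `name` and `is_dir` keys, the requester is identified by export name only, and `matches_filter` stands in for a lookup in the file filtering table.

```python
from typing import Any, Callable, Dict, List

Entry = Dict[str, Any]


def list_directory(
    entries: List[Entry],
    export: str,
    search_engine_export: str,
    matches_filter: Callable[[Entry], bool],
) -> List[Entry]:
    """Produce a directory listing, filtered only for the search engine's export."""
    if export != search_engine_export:          # step 0701: not the search engine
        return list(entries)                    # step 0707: conventional listing

    temporary: List[Entry] = []
    for entry in entries:                       # loop closed by step 0705
        if entry["is_dir"]:
            continue                            # subdirectories are added in 0706
        if matches_filter(entry):               # steps 0702-0703: consult table
            temporary.append(entry)             # step 0704: add to temporary list

    # Step 0706: add the subdirectories so the search engine can still descend.
    subdirectories = [entry for entry in entries if entry["is_dir"]]
    return subdirectories + temporary
```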

It can be appreciated that the directory listing that the search engine receives is filtered by the file filtering table, and thus can contain a subset of the files that a non-search engine client might receive. By virtue of this reduced file list, processing in the search engine to create an index for the file system or to update its index can be reduced, as compared to conventional processing where an unfiltered directory listing might include many more files.

Referring to FIG. 6, still another aspect of the present invention is directed to the processing in the file server of write requests. When the file server receives a write request, a determination is made in a step 0601 whether the write request is the first write request since the specified file was last opened. If the write request is not a first write request, then the write request is processed in a conventional manner (step 0604), according to the specifics of the file server.

If the write request is the first write request since the last file open operation, then processing proceeds to a decision step 0602. There, a file filtering table 0901 is consulted. This table is used in the same manner as discussed above. If the file that is the object of the write operation satisfies any of the criteria in the file filtering table, then a reference to the file is added to an update list 010402, in a step 0603. If no criteria are satisfied, then the write operation is completed in a conventional manner in step 0604.
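The combination of the first-write test and the file filtering table in FIG. 6 can be sketched as follows. The function name `record_write` and the representation of criteria as predicates on the path are hypothetical; the actual write (step 0604) is left to the caller.

```python
from typing import Callable, Iterable, List

Criterion = Callable[[str], bool]


def record_write(
    path: str,
    first_write_since_open: bool,
    criteria: Iterable[Criterion],
    update_list: List[str],
) -> bool:
    """Decide whether a write adds `path` to the update list; return True if so.

    The caller performs the write itself (step 0604) in every case.
    """
    if not first_write_since_open:              # step 0601: not the first write
        return False
    if any(criterion(path) for criterion in criteria):   # step 0602
        update_list.append(path)                # step 0603: add a reference
        return True
    return False
```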

It was noted above in connection with FIG. 5 that in the case of file creation, a created file initially contains no data. Therefore, it is not necessary that the file server make an entry in the update list to refer to a newly created file. When content is placed in the file, this will occur via a file write operation. However, in some file systems, the file create operation may leave the file in a state where subsequent write operations can be performed; thus obviating the need for a separate file open function call. Therefore with reference to the decision step 0601 in FIG. 6, it can be appreciated that the test can be modified to include testing for the first write operation following a file open operation or a file create operation.

It can be appreciated that this aspect of the invention is similar to the aspect of the invention discussed in connection with update lists. The search engine will consult the update list associated with the file system when it is ready to perform an update of its index for that file system, as discussed above. Thus, the search engine need only access and parse those files referenced in the update list when performing an index update. However, with the use of the file filter table, the size of the update list can be reduced somewhat. This has the desired effect of potentially reducing the index update time.

Classifications
U.S. Classification: 1/1, 707/E17.01, 707/999.001
International Classification: G06F12/00, G06F17/30
Cooperative Classification: G06F17/30067
European Classification: G06F17/30F
Legal Events
Date: Oct 16, 2003
Code: AS
Event: Assignment
Description: Owner name: HITACHI, LTD., JAPAN. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNOR: KODAMA, SHOJI; REEL/FRAME: 014626/0287. Effective date: 20031013