Information retrieval system and method
US RE26429 E
Description (OCR text may contain errors)
Aug. 6, 1968 S. KAUFMAN ET AL Re. 26,429
INFORMATION RETRIEVAL SYSTEM AND METHOD
Original Filed Dec. 8, 1964 16 Sheets-Sheet 1
INVENTORS: SAMUEL KAUFMAN, JOSEPH J. MAGNINO, JR.
[Drawing sheets 1 through 16 (including FIG. 2A, FIG. 2D, FIG. 3C, FIG. 4A, FIG. 6A, FIG. 6B and FIG. 6D): drawings not reproduced.]
United States Patent Office Re. 26,429
INFORMATION RETRIEVAL SYSTEM AND METHOD
Samuel Kaufman and Joseph J. Magnino, Jr., Yorktown Heights, N.Y., assignors to International Business Machines Corporation, Armonk, N.Y., a corporation of
ABSTRACT OF THE DISCLOSURE
An information retrieval system is disclosed wherein the information is initially input to the system in normal English language text form and questions are posed to the system in the same normal text form where appropriate. The data base or body of information to be searched is organized in essentially two separate formats in system memory, i.e., first, an alphabetized portion wherein the alphabetization is accomplished according to word length and, secondly, an unalphabetized portion wherein the individual words of the data base are accessible in their normal order. Means are provided for searching for individual words in the data base and also for word strings which comprise two or more words in their normal sequential order. Allowable questioning techniques include means for searching the data base with groups of question words wherein conventional and, or, not, etc. logic possibilities exist.
The present invention relates to a method and apparatus for automatically searching extremely large quantities of raw data and examining same for content based on questions asked about said data. More particularly, it relates to such an apparatus and method for searching a full normal text data base utilizing standard English text question words.
In recent years a phenomenon often referred to as the information explosion has occurred in most civilized countries. In many fields of endeavor the volume of published material on various subjects in these fields has increased on the order of a hundredfold. Technical and trade publications containing many articles and much information of great value to practitioners in the fields to which these publications refer often lie useless in various libraries purely for lack of availability or accessibility of such articles. In the scientific area, for example, there are hundreds of different recognized technical publications, each of which may contain up to fifty articles on various scientific subjects based in many cases upon studies and experiments performed by outstanding scientists in the field. It is obviously wasteful of both time and energy for subsequent experimenters in such fields to reproduce experiments which have been exhaustively studied previously. However, due to the aforementioned lack of availability or accessibility of many published articles, subsequent experimenters assume that work in their particular field has never been done before, thus needlessly duplicating experiments and using time which could otherwise be valuably spent elsewhere.
The field of legal research presents a similarly pressing problem: for a practicing attorney to know adequately how to prepare his case for trial, he must of necessity search many thousands of prior cases to determine, or attempt to determine, the fact situations, legal precedents, etc., which apply to the particular case at hand. As is well known, legal libraries have been compiling volumes of printed cases practically since the beginning of our Government, and every year the volume of these cases increases, thus presenting an ever increasing Information Retrieval problem.
Accordingly, many people are beginning to turn serious attention to the problems of Information Retrieval and, in particular, people in the electronic data processing industry are seeking ways to utilize what is essentially electronic data processing machinery to perform Information Retrieval tasks. A number of different Information Retrieval systems have been developed in the past, among them systems utilizing key wording, auto-abstracting, complete concordance matching and many others. The aforementioned key wording concept requires a human being having rather broad knowledge in an area to read certain articles or text material to be made part of the Information Retrieval base and to key word this information; thus, for a given paragraph, four or five words might be listed which would, in the reviewer's mind, indicate the general context of the paragraph or article. Obviously, the accuracy of such key wording requires great imagination on the part of the reviewer, and subsequent imagination and commonness of thought on the part of a person asking questions of this key worded list as to which key words to use in order to obtain a reasonably accurate retrieval of information. Thus, although the key wording concept greatly reduces the data base, it severely limits the flexibility of the system and automatically introduces great subjectivity due to the high degree of human intervention necessary both in preparing a data base and in preparing questions.
Another similar concept requiring considerable human intervention is abstracting which, as implied, requires a human operator to review an article, greatly reduce the quantity of words in the original, and from an article of many pages produce a highly condensed descriptive paragraph. As with the key wording concept, this introduces great subjectivity into the resulting data base and severely limits the retrieval of information, since a subsequent questioner must be thinking along very similar lines to the person who prepared the abstract of the particular article.
A third Information Retrieval system currently in use involves a partial text, i.e., some common words removed; however, the entire text is alphabetized and a complicated address indication of the alphabetized word in the original text is carried with the word in the alphabetized format. This is done so that subsequent searching and word adjacency tests may be made to determine the existence of words and word STRINGS, as will be set forth subsequently in the description of the present invention. Further, with this latter system the entire data base is completely alphabetized and, in addition to the relative address of a word in the data base, an index or reference of some sort to the particular batch or piece of data, publication, etc., from which the particular word was taken must be included in the data base. Subsequent to all the matching operations with question words, a very large amount of bookkeeping and interrogation of answers must follow to see which word matches come from single data sources, etc. The handling of word STRINGS and word adjacency situations is especially difficult with the above system.
The key wording and abstracting systems outlined previously normally use an inverted file system very much like the full alphabetization scheme outlined previously. Thus, it will be seen that Information Retrieval systems utilizing human condensation or reduction of the data base, together with current outmoded Information Retrieval searching schemes, suffer from the considerable possibility of human error as well as very cumbersome searching techniques.
A further technique utilizing the concept of data reduction is referred to as auto-abstracting, wherein a computer scans data and discards irrelevant words. A very simplified example of this would be the discarding of articles and perhaps very common verbs whose location and meanings would be clearly implied. However, it is, of course, to be understood that most auto-abstracting techniques go well beyond this very obvious method of reducing the data base and often will condense a given segment of data by well over 50 percent. It will be obvious that interrogation of said reduced data will require considerable knowledge of the manner in which the data was reduced. Further, any shortcomings insofar as loss of information due to data reduction in certain schemes in certain instances will obviously cause the results of any search made on such a machine reduced data base to suffer accordingly.
From the above discussion it will be apparent that the optimum Information Retrieval system, insofar as obtaining a maximum amount of information and avoiding errors due to loss of data through any sort of data reduction scheme are concerned, is one utilizing the complete data base for interrogation purposes. Further, utilizing such a data base allows for maximum flexibility of questions; any untrained person would be capable of asking questions of such a data base and would in all probability be able to phrase questions which would provide a read out at least comparable to that which he would obtain by manually going through the data in a printed format.
It has now been found that an Information Retrieval system is possible utilizing a full normal text English data base format and questions may be asked of this data base using very straightforward questioning techniques. Further, this system provides for very powerful logic capabilities and the searching of long word STRINGS and word adjacency pairs in a far more efficient manner than has heretofore been available in the art.
It is accordingly a primary object of the present invention to provide a vastly improved Information Retrieval system using electronic data processing techniques and apparatus.
It is a further object to provide such a system which is designed to work with a data base in normal text form whether English or a foreign language.
It is a further object of the invention to provide a method for pre-processing a data base for optimum utilization in an Information Retrieval system.
It is yet another object of the invention to provide a method and apparatus for searching alphanumeric data and making word comparisons based on word lengths as well as alphabetical matching.
It is another object of the invention to provide such method and apparatus including broad logic capabilities in performing search operations.
It is still another object of the invention to provide a method and apparatus for efficiently performing a word STRING search in an Information Retrieval system.
The foregoing and other objects, features and advantages of the invention will be apparent from the following more particular description of preferred embodiments of the invention, as illustrated in the accompanying drawings.
In the drawings:
FIGURE 1 is a functional block diagram of the embodiment of the system disclosed in FIGURE 2.
FIGURES 2 through 2E comprise a composite logical schematic diagram of a possible embodiment of an Information Retrieval system constructed in accordance with the general teachings of the present invention.
FIGURES 3 through 3C comprise a composite logical schematic diagram of the logical Decoder shown in FIGURE 2A.
FIGURES 4 through 4B comprise a composite logical schematic diagram of the System Clock utilized to perform all of the timing and control functions of the Information Retrieval system embodiment illustrated in FIGURES 2A through 2E.
FIGURE 5 is a functional block diagram of a typical random access magnetic memory such as would specifically be used as the Fixed Length Memory illustrated in FIGURE 2C; and
FIGURES 6 through 6D are a composite flow diagram of the operation of the Information Retrieval system embodied in FIGURES 2A through 2E.
The objects of the present invention are accomplished in general by a method of performing normal text Information Retrieval operations, which method comprises first preparing the data base by determining the relative address of every word within a given data base, said data base being arranged in normal text format; storing the normal text format data base in a first machine storage location, every word of said data base being separately addressable; alphabetizing the data base words together with their relative addresses; discarding all but the relative addresses of the words; and storing same in sequential order in a second machine storage location.
The questions are prepared by first preparing a list of question words including their relative addresses and a special logic operation indicator and storing the question words in their normal text format in a third machine storage location. Next, the question words are alphabetized together with the logic operation indicators associated with each word; subsequently, the word is discarded and the alphabetized list of relative addresses together with the appropriate operation indicator is stored in a fourth machine storage location.
Next, the searching operation is performed utilizing the alphabetized lists of relative addresses of both the question word list and the data base word list, said addresses being utilized to access the actual words stored in memory. Whenever a match is found for a question word, the logic operation indicator for that question word is examined and an indication of the match is stored at a machine storage location directly related to said operation indicator. The search is continued until all question words have been accessed and compared against the data base.
At the conclusion of a search, all of the match found indications for each complete question are examined and a determination made as to whether the desired number of logical matches for a given question has been satisfied by this Search.
According to a further aspect of the invention, word STRINGS in a question may be very conveniently searched by transferring the data base accessing control from the alphabetized relative address list of data base words to an indexing counter so that, beginning with the first word of a desired STRING located in the normal text data base portion of said machine storage, consecutive data base words may be gated out of said machine storage and compared against the question STRING, and a very rapid determination of whether such STRING exists in the data base may be made. In this case, the first word of the STRING being sought is alphabetized in the question word list, the special logic operation indicator will indicate that a word STRING is being sought, and control is suitably shifted to accomplish this search operation.
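The STRING comparison just described can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the disclosed circuitry: the normal text portion of the data base is modeled as a list whose index plays the role of the relative address, the indexing counter is an ordinary integer, and the name string_at and the sample text are hypothetical.

```python
def string_at(words, start, question_string):
    """Return True if question_string occurs in words beginning at address start."""
    counter = start  # hypothetical stand-in for the indexing counter
    for question_word in question_string:
        # Stop as soon as the data base runs out or a word disagrees.
        if counter >= len(words) or words[counter] != question_word:
            return False
        counter += 1  # step to the next consecutive data base word
    return True


# Normal text portion of the data base; list index = relative address (assumption).
words = "to be or not to be that is the question".split()
question_string = ["to", "be", "that"]

# Addresses where the first word of the STRING matched. In the described system
# these would come from the alphabetized search; a simple scan stands in here.
first_word_addresses = [a for a, w in enumerate(words) if w == question_string[0]]

hits = [a for a in first_word_addresses if string_at(words, a, question_string)]
print(hits)
```

Only the first word of the STRING needs to be located through the alphabetized list; the remaining words are checked by stepping through consecutive addresses of the normal text list.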
Other question logic operations are suitably indicated by the special logical operation indicators or numbers so that, for example, ANDs, ORs, ABSOLUTE YESs, NOTs, etc., may readily be searched for and the success or failure of said logical operation in the search suitably noted in memory, said results being obtainable at the end of a search.
An additional feature of the present Information Retrieval system is that both question words and data base words are characterized by word length indicators. That is, a special word length symbol or number is carried at a predetermined location with respect to each word which indicates the length of said word. Thus, as the alphabetizing is performed, words are first grouped into groups of ascending length and then alphabetized; that is, all single character words are alphabetized, then all words having two characters, then all words having three characters, etc. Special recognition and control circuitry is then utilized in the Word Comparison Unit of the system so that when a given word is being looked for, if a data base word is brought out which is too short, the system control will be told that this word is of a different length than the one being looked for and thus could not possibly result in a successful match. The system provides for automatically continuing access of data base words until words of the proper length are found. Conversely, if the first question word is shorter than the first data word encountered, the subsequent question words will be accessed until a question word of equal or greater length is found.
Once in the proper alphabet, i.e., the proper word length, searches for the proper alphabetic matches continue in a similar manner. Thus, assuming that the first three characters of a particular data base word and a question word match, when the fourth character is analyzed it may be found that the letter, for example, "M," in the question word is further along in the alphabet than, for example, an "H," in the data base word. Thus, the next data base word will automatically be accessed on the occurrence of the mismatch. The converse is also true, so that if the letter in the data base word is further along in the alphabet than the letter in the question word, the next question word would be accessed.
This type of word length alphabetizing greatly reduces searching time and thus, the cost per search which as is apparent is of paramount importance in such systems.
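The word length alphabetizing and the advance-on-mismatch rule described above amount to stepping through two sorted lists in parallel. The following Python sketch illustrates that idea under stated assumptions: the ordering is modeled as a (length, word) tuple, Python's sorted stands in for the pre-processing step, duplicate question words are removed for simplicity, and the names order_key and match_sorted are hypothetical.

```python
def order_key(word):
    """Order words by length first, then alphabetically within a length."""
    return (len(word), word)


def match_sorted(question_words, data_words):
    """Return the question words found in the data base, in search order."""
    questions = sorted(set(question_words), key=order_key)
    data = sorted(data_words, key=order_key)
    found = []
    q = d = 0
    while q < len(questions) and d < len(data):
        if order_key(data[d]) < order_key(questions[q]):
            # Data word is too short, or too early in the alphabet: next data word.
            d += 1
        elif order_key(data[d]) > order_key(questions[q]):
            # Data word is already past the question word: next question word.
            q += 1
        else:
            found.append(questions[q])
            q += 1
    return found


data_words = "to be or not to be that is the question".split()
print(match_sorted(["car", "truck", "be"], data_words))
print(match_sorted(["question", "not"], data_words))
```

Because neither list is ever rewound, each word of either list is examined at most once, which is the source of the reduced searching time noted above.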
From the above very general description of the present system it may be seen that the complete Information Retrieval process occurs in three distinct steps. The first is the preparation of the data base itself which, as stated previously, comprises assigning relative addresses to each word in the data base, said data base being organized in its original or normal text format. Secondly, the data base is alphabetized, the relative address for each word being carried with that word during the alphabetizing routine. Next, the actual word itself is deleted and only the alphabetized list of relative addresses is kept. Thus, using the alphabetized list of relative addresses, the data base words stored in normal text form somewhere in machine memory may be accessed in alphabetical order. It may therefore be seen that subsequent to the preparation of the data base, there will be two distinct batches of information for each data base: first, the list of words in their normal text sequence and, second, the list of alphabetized relative addresses. As will be explained more fully subsequently, these two batches or segments of the data base are stored in the machine memory at two distinct locations. In the embodiment of the invention set forth in FIGURES 2A through 2E, the two batches of the data base are actually stored in different memories in order to achieve maximum memory utilization.
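The preparation of the two batches can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the disclosed embodiment: the relative address of a word is taken to be its position in a list, words are split on white space and lower-cased, and the function name prepare_data_base is hypothetical.

```python
def prepare_data_base(text):
    """Return (words, address_list) for a normal text data base."""
    # Batch 1: the words in normal text order; list index = relative address.
    words = text.lower().split()
    # Batch 2: relative addresses ordered by word length, then alphabetically.
    # The words themselves are not kept in this list, only their addresses.
    address_list = sorted(
        range(len(words)), key=lambda a: (len(words[a]), words[a], a)
    )
    return words, address_list


words, address_list = prepare_data_base("To be or not to be that is the question")
print(address_list)
# Accessing batch 1 through batch 2 yields the words in alphabetized order.
print([words[a] for a in address_list])
```

Keeping only addresses in the second batch means the text is stored once, yet can be read either in its normal sequence or in alphabetized order.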
The second distinct operation is the preparation of questions to be asked of a given data base or plurality of data bases since each question set may be continuously repeated against a plurality of different and distinct data bases as will be also clearly described subsequently.
The first step in preparing the question list comprises assembling the question. ANDs and ORs which are equivalent will normally be grouped together, NOTs and ABSOLUTE YESs could also be grouped together, and single words listed consecutively. The only area wherein the normal text arrangement of the question must be maintained is in the word STRING, wherein it is desired to find a particular STRING of two or more words such as "to be or not to be." There must be provided an operation indicator for each of the words in the question to indicate whether the word is part of an AND, OR, NOT, WORD STRING, etc. In the present system a special number is utilized to indicate a particular logical operation which is to be performed in connection with a particular question word. In the embodiment of FIGURES 2A through 2E this number also happens to be the address of a particular storage location in a machine storage area which is to be utilized to compile the results of successful matches on the word associated with said operation indicator number. The precise manner in which this number is utilized in conducting the search and controlling subsequent entry of results in memory will, of course, be explained specifically subsequently in the specification. It is also necessary to provide some indication of word separations to be carried with each question word in that section of memory wherein the question words are stored in their original format. This could be either a special symbol or a blank. Thus, each question word prior to alphabetization will have associated therewith a relative address, a word length indicator and a question word separator. The next operation is the alphabetization of the question words. As indicated previously with respect to alphabetization of the data base, relative addresses and all other associated information are carried along with each question word.
Subsequent to the alphabetization, the relative addresses together with the respective special logic operation indicators are retained in the alphabetized list, which is stored in an appropriate machine storage location and is utilized to extract the question words from memory in alphabetical order; the normal text question words together with their word length indicators and word separators are stored in an additional machine storage location. The manner in which the logic operation indicators are utilized together with the alphabetized list of relative addresses to access question words from memory will likewise be clearly explained subsequently with respect to the description of the specific embodiment of the invention disclosed in FIGURES 2A through 2E.
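The question preparation described above can be sketched in Python along the same lines as the data base preparation. The sketch rests on several assumptions: the operation indicator values 100 and 101 are invented for illustration, word length indicators and separators are omitted because Python strings carry their own length, and the name prepare_questions is hypothetical.

```python
def prepare_questions(entries):
    """entries: list of (question_word, operation_indicator) in question order.

    Returns the normal text question words and the alphabetized list of
    (relative_address, operation_indicator) pairs.
    """
    # Normal text question words; list index = relative address (assumption).
    words = [word for word, _ in entries]
    # Alphabetize by word length first, then alphabetically within a length.
    order = sorted(range(len(words)), key=lambda a: (len(words[a]), words[a], a))
    # Discard the word; keep its address and its operation indicator.
    address_list = [(a, entries[a][1]) for a in order]
    return words, address_list


# Hypothetical indicators: 100 marks one OR group, 101 a single word.
entries = [("truck", 100), ("car", 100), ("engine", 101)]
question_words, question_addresses = prepare_questions(entries)
print(question_addresses)
print([question_words[a] for a, _ in question_addresses])
```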
Subsequent to the above alphabetizing operations for both the data base and the question words, this information is appropriately stored in four different predetermined sections of the machine storage. In the disclosed embodiment the normal text form for both the data base and the questions is stored in the Variable Length Memory while the alphabetized list for both data base and questions is stored in the Fixed Length Memory.
The specific content of the memories as anticipated by the present embodiment is clearly shown by the examples and tables which follow subsequently in the description. In these examples the structure and content of the various sections of memory will be readily apparent.
As stated previously, these four separate segments or batches of information are stored in the machine at the four different storage locations indicated, at predetermined addresses therein, and are thus ready for accessing during the actual searching operations. The searching actually comprises withdrawing the question words from the memory in a sequential fashion and comparing same with the data base. As indicated before, the actual comparison or matching follows certain prescribed lines until it is determined that a particular question word is or is not contained in the data base; if not, the search will proceed to the next desired question word until a question word is located with a successful match. Each time a match is found, an indication of such match is stored at a fifth location in main memory, such location being directly ascertainable from the operation indicator stored with that particular question word. Thus, as the search proceeds through a list of question words and matches are found, a compilation of the results of said search is built up in memory at the special logic operation addresses. After the search is complete, the results of the search are determined by accessing the storage locations where such result indications have been placed, and the results of the search are compared with the results desired as stated in the question. The answers provided by this system may either be print outs of the text or data base material satisfying the question criteria or may alternatively be a mere print out of an identification of the particular portion of the data base in which a successful match was found.
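The compilation of results at locations named by the operation indicators can be illustrated with the short Python sketch below. It makes simplifying assumptions: a dictionary keyed by operation indicator stands in for the result storage locations, a set membership test stands in for the word comparison, and the indicator values, the required counts and the name run_search are hypothetical.

```python
def run_search(questions, data_words, required):
    """Tally matches per operation indicator and test the question criteria.

    questions: list of (question_word, operation_indicator)
    required:  {operation_indicator: number of matches the question asks for}
    """
    present = set(data_words)  # stand-in for the word comparison
    results = {}               # stand-in for the result storage locations
    for word, indicator in questions:
        if word in present:
            # Record the match at the location named by the indicator.
            results[indicator] = results.get(indicator, 0) + 1
    # After the search, compare what was found with what was asked for.
    satisfied = all(
        results.get(indicator, 0) >= needed
        for indicator, needed in required.items()
    )
    return results, satisfied


questions = [("car", 100), ("truck", 100), ("engine", 101)]
data_words = "the engine of the car failed".split()
results, satisfied = run_search(questions, data_words, {100: 1, 101: 1})
print(results, satisfied)
```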
It should be noted that the present embodiment as disclosed in FIGURES 2A through 2E provides means for concurrently processing a plurality of questions; however, each question is completed before the next is begun, and the results are indicated in a special series of result storage devices which may be interrogated at will. The exact manner in which the results are kept separate will be apparent from the subsequent description of the disclosed embodiment.
It will be apparent from the above very general description of the present Information Retrieval system that, since machine memory must be used in processing the questions and the data base, there will be some finite limit placed on the size of the data base and/or the number of questions which may be concurrently processed. Since the data base is normally many orders of magnitude larger than the questions to be asked of same, it is anticipated by the present invention that the data base may be broken up into convenient segments capable of storage in the machine memory and the very same questions processed against these various segments of the data base. Thus, each segment may be completely pre-processed and may be run against any set of questions desired. Further, a given set of questions may be run against all of the segments in the data base or any desired portion thereof. Thus, the over-all flexibility of the system is readily apparent.
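Running one question set against several data base segments can be sketched as a simple loop. In this Python illustration the segment identifiers, the sample texts and the name search_segments are hypothetical, and a set membership test again stands in for the word comparison.

```python
def search_segments(segments, question_words):
    """Run the same question words against each data base segment in turn."""
    results = {}
    for segment_id, words in segments.items():
        present = set(words)  # stand-in for one pre-processed segment
        results[segment_id] = [w for w in question_words if w in present]
    return results


segments = {
    "segment-1": "the engine of the car failed".split(),
    "segment-2": "the train was late".split(),
}
segment_results = search_segments(segments, ["car", "engine", "train"])
print(segment_results)
```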
In summation, the Information Retrieval system of the present invention offers simplicity, flexibility and efficiency in operation in that it bypasses the usual coding, pre-indexing, classification, and thesauri problems often associated with currently used Information Retrieval systems. Three primary concepts are interrelated and provide the above enumerated advantages. The first is the provision of the distinct two section data base, i.e., the normal text form and the alphabetized list of relative addresses relating thereto. The second is the utilization of the word length alphabetizing scheme for very rapid matching. The third is the utilization of the word STRING matching techniques, which latter feature is very closely related to the setting up of the two part data base. The above three techniques all contribute to the over-all efficiency of the system in terms of greatly reducing machine time for search, especially where it is desired to search for adjacent word groups or word STRINGS.
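The two section data base may be pictured, purely as an illustration, by the sketch below. It assumes that the second section groups words by length and alphabetizes them within each group, each entry carrying the relative address of the word in the text section; the exact layout used by the disclosed embodiment is not given in this passage, and the names `build_index` and `lookup` are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict


def build_index(text_words):
    """Return {word_length: alphabetized list of (word, relative_address)}."""
    by_length = defaultdict(list)
    for address, word in enumerate(text_words):
        by_length[len(word)].append((word, address))
    return {length: sorted(entries) for length, entries in by_length.items()}


def lookup(index, word):
    """Return the relative addresses at which `word` occurs in the text."""
    entries = index.get(len(word), [])  # only words of equal length are examined
    i = bisect_left(entries, (word, -1))  # binary search in the alphabetized list
    addresses = []
    while i < len(entries) and entries[i][0] == word:
        addresses.append(entries[i][1])
        i += 1
    return addresses


text = "to be or not to be".split()
index = build_index(text)
be_addresses = lookup(index, "be")
```

Restricting the comparison to words of the same length is what makes the matching rapid in this sketch, and the relative addresses returned are what a word STRING search would then test for adjacency.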
Before proceeding further with a description of the particular embodiment of the invention disclosed herein, a discussion of the more important varieties of question logic will be set forth. While there are obviously a great many logical possibilities for doing any Information Retrieval problem, only the more important logic operations will be set forth and described in the present invention since it is believed that a description of these will be sufficient to allow a person skilled in the art to expand into other more complicated logic configurations. The simplest and most direct type of match is, of course, the individual or single word match. By this is meant a mere match of a single word which it is desired to find in a data base. In many instances a compilation of a list of salient words specified in the question will result in a successful match against a data base if a sufficient number of such words is given and found in the data base.
A second logic operation is the OR logic. As the name implies, one would desire to phrase a question in terms of OR logic where any one of a number of different words would satisfy the question if found in the data base; for example, if one were interested in finding a four wheeled, self-powered conveyance, the OR logic possibility could set up the words automobile, or car, or trucks, or vehicle, etc. Thus, if any of these words were found in a particular data base, a satisfactory match of the desired OR logic would have been obtained.
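As a minimal sketch of the OR logic, assuming the data base is available as a plain list of words (the function name `or_match` is hypothetical):

```python
def or_match(alternatives, data_base_words):
    """True if any one of the alternative words occurs in the data base."""
    vocabulary = set(data_base_words)
    return any(word in vocabulary for word in alternatives)


conveyance = ["automobile", "car", "trucks", "vehicle"]
satisfied = or_match(conveyance, ["the", "car", "stopped"])
```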
Another common logic operator is the AND logic. This logic operator would be used where for a particular question it is desired to find a plurality of words, all of which are deemed necessary by the questioner in order to satisfy a question. For example, if one were studying citrus fruits in general, an AND STRING might be oranges, lemons, grapefruits, and limes. Thus, for this logic operation to be satisfied, all four of these words would have to be found in the data base. It should be noted that the AND differs from the word STRING in that for the AND, the words requested may occur at any location in the data base and need not be contiguous whereas in a word STRING they must both be contiguous and in a particular order.
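The AND logic may be sketched in the same hedged manner, again assuming the data base is a plain list of words and using the hypothetical name `and_match`. Position and order are deliberately ignored, which is the distinction from the word STRING drawn above.

```python
def and_match(required, data_base_words):
    """True only if every required word occurs somewhere in the data base."""
    vocabulary = set(data_base_words)
    return all(word in vocabulary for word in required)


citrus = ["oranges", "lemons", "grapefruits", "limes"]
# The words are scattered and out of order, yet the AND is satisfied.
satisfied = and_match(
    citrus, ["limes", "and", "oranges", "then", "grapefruits", "or", "lemons"]
)
```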
Yet another logic possibility is the ABSOLUTE YES logic. In the situation where a questioner desires to see all references, i.e., data base or examples, where a specific item or name is used regardless of the other search logic or matching criteria, the questioner would use the ABSOLUTE YES operator to find these cases. This instruction is essentially an override and will cause a correct answer indication regardless of whether or not the remainder of the question criteria is satisfactorily located in the data base. For example, where it is desired to search for all references or examples of aluminum submarines, the words aluminum and submarines might be single match words; however, if it is desired to find all references using the particular term aluminaut regardless of any other criteria, the ABSOLUTE YES operator would be used with the term aluminaut. Thus, if the word aluminaut were found in any data base segment, a positive answer for this segment against this question will automatically be given whether the words aluminum and submarine are found or not.
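The override behavior of the ABSOLUTE YES may be sketched as follows. For the sake of a concrete example the remainder of the question criteria is assumed to be "all of the ordinary words are present"; that assumption, and the name `segment_qualifies`, are illustrative only, since the actual match criteria are stated separately by the questioner.

```python
def segment_qualifies(segment_words, ordinary_words, absolute_yes_words):
    """Decide whether a data base segment answers the question."""
    vocabulary = set(segment_words)
    # An ABSOLUTE YES word overrides everything else in the question.
    if any(word in vocabulary for word in absolute_yes_words):
        return True
    # Assumed stand-in for the remaining criteria: every ordinary word present.
    return all(word in vocabulary for word in ordinary_words)


ordinary = ["aluminum", "submarine"]
override = ["aluminaut"]
forced = segment_qualifies(["the", "aluminaut", "dived"], ordinary, override)
```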
The CONDITIONAL AND is a logical operation combining the ABSOLUTE YES within an AND group wherein a plurality of words are ANDed together. The occurrence of a particular word of the AND forces a match for the entire AND. Thus, if the words aluminum, submarine and aluminaut were part of the group and aluminaut the conditional member, the occurrence of this word would force the satisfaction of the entire AND group.
The last and perhaps most important logical operation which will be dealt with is the word STRING. This logical operation is probably the most powerful search requirement that can be made as it not only requires particular words but also a particular order. The previous example of "to be or not to be" is a typical one for such a word STRING. Obviously, if a data base consisting of a plurality of literary references were searched, very, very few would have the above expression therein; thus, it may be seen that such a logic operator will automatically exclude a great quantity of the data base. It will also be apparent that the questioner must have very specific knowledge of the information desired or perhaps valuable reference sources may be lost. In any event, the ability of the present system to handle such word STRING searches in a very efficient manner lends great power to the Information Retrieval capabilities of this system.
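The requirement that a word STRING be both contiguous and in order may be illustrated by the sketch below. It uses a naive linear scan over the text purely to show the matching condition; it is not the address-list technique of the disclosed embodiment, and the name `string_match` is hypothetical.

```python
def string_match(phrase, data_base_words):
    """Return every relative address at which `phrase` occurs as a
    contiguous, ordered run of words in the data base."""
    n = len(phrase)
    if n == 0:
        return []
    return [
        start
        for start in range(len(data_base_words) - n + 1)
        if data_base_words[start:start + n] == phrase
    ]


text = "whether to be or not to be is asked".split()
positions = string_match("to be or not to be".split(), text)
```

Reversing the order of the words, as in `string_match(["be", "to"], text)`, finds nothing, whereas an AND over the same two words would be satisfied.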
The final logic operator, although not an operator as such, is the match criteria which states the results desired of the search based on a particular set of question words for a particular question. In other words, if sixteen single word matches, two AND sets, one OR set and a word STRING were asked for, any match found in a particular data base exceeding the number seventeen might be acceptable to the questioner and the actual data would merit