CA2245913C - A system and method for finding information in a distributed information system using query learning and meta search - Google Patents


Info

Publication number
CA2245913C
Authority
CA
Canada
Prior art keywords
documents
query
learning
print
new
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime
Application number
CA002245913A
Other languages
French (fr)
Other versions
CA2245913A1 (en
Inventor
William W. Cohen
Yoram Singer
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AT&T Corp
Original Assignee
AT&T Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by AT&T Corp
Publication of CA2245913A1
Application granted
Publication of CA2245913C
Anticipated expiration
Expired - Lifetime

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9538 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10 TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10S TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00 Data processing: database and file management or data structures
    • Y10S707/99931 Database or file accessing
    • Y10S707/99933 Query processing, i.e. searching
    • Y10S707/99935 Query augmenting and refining, e.g. inexact access

Abstract

An information retrieval system finds information in a Distributed Information System (DIS), e.g., the Internet, using query learning and meta search (figure 2) for adding documents to resource directories contained in the DIS.
A selection means (figure 4; yes/no link) generates training data characterized as positive and negative examples of a particular class of data residing in the DIS. A learning means (figure 4; learn link) generates from the training data at least one query that can be submitted to any one of a plurality of search engines for searching the DIS to find "new" items of the particular class. An evaluation means (figure 4; review previous link) determines and verifies that the new item(s) is a new subset of the particular class and adds or updates the particular class in the resource directory.

Description

A SYSTEM AND METHOD FOR FINDING INFORMATION
IN A DISTRIBUTED INFORMATION SYSTEM USING QUERY
LEARNING AND META SEARCH
Notice
This document discloses source code for implementing the invention. No license is granted directly, indirectly or by implication to the source code for any purpose by disclosure in this document except copying for informational purposes only or as authorized in writing by the assignee under suitable terms and conditions.
Background of the Invention
(1) Field of the Invention
This invention relates to information retrieval systems. More particularly, the invention relates to information retrieval in a distributed information system, e.g., the Internet, using query learning and meta search.
(2) Description of the Prior Art
The World Wide Web (WWW) is currently filled with documents that collect together links to all known documents on a topic; henceforth, we will refer to documents of this sort as resource directories. While resource directories are often valuable, they can be difficult to create and maintain. Maintenance is especially problematic because the rapid growth in on-line documents makes it difficult to keep a resource directory up-to-date.
WO 97/38377 PCT/US97/05355 COHEN 6-1
This invention describes machine learning methods that address the resource directory maintenance problem. In particular, we propose to treat a resource directory as an extensional definition of an unknown concept, i.e., documents pointed to by the resource list will be considered positive examples of the unknown concept, and all other documents will be considered negative examples of the concept. Machine learning methods can then be used to construct from these examples an intensional definition of the concept. If an appropriate learning method is used, this definition can be translated into a query for a WWW search engine, such as AltaVista, Infoseek or Lycos. If the query is accurate, then re-submitting the query at a later date will detect any new instances of the concept that have been added. We will present experimental results on this problem with two implemented systems. One is an interactive system: an augmented WWW browser that allows the user to label any document, and to learn a search query from previously labeled examples. This system is useful in locating documents similar to those in a resource directory, thus making it more comprehensive. The other is a batch system which repeatedly learns queries from examples, and then collects and labels pages using these queries. In labeling examples, this system assumes that the original resource directory is complete, and hence can only be used with a nearly exhaustive initial resource directory; however, it can operate without human intervention.
Prior art related to machine learning methods includes the following:
USP 5278980 issued January 11, 1994 discloses an information retrieval system and method in which an operator inputs one or more query words which are used to determine a search key for searching through a corpus of documents, and which returns any matches between the search key and the corpus of documents as a phrase containing the word data matching the query word(s), a non-stop (content) word next adjacent to the matching word data, and all intervening stop-words between the matching word data and the next adjacent non-stop word. The operator, after reviewing one or more of the returned phrases, can then use one or more of the next adjacent non-stop words as new query words to reformulate the search key and perform a subsequent search through the document corpus. This process can be conducted iteratively, until the appropriate documents of interest are located. The additional non-stop words for each phrase are preferably aligned with each other (e.g., columnation) to ease viewing of the "new" content words.

Other prior art related to machine learning methods is disclosed in the references attached to the specification as Appendix 1.
None of the prior art discloses a system and method of adding documents to a resource directory in a distributed information system by using a learning means to generate from training data a plurality of items as positive and/or negative examples of a particular class and using a learning means to generate at least one query that can be submitted to any of a plurality of methods for searching the system for a new item, after which the new item is evaluated by learning means with the aim of verifying that the new item is a new subset of the class.
Summary of the Invention
An information retrieval system finds information in a Distributed Information System (DIS), e.g., the Internet, using query learning and meta search for adding documents to resource directories contained in the DIS. A selection means generates training data characterized as positive and negative examples of a particular class of data residing in the DIS. A learning means generates from the training data at least one query that can be submitted to any one of a plurality of search engines for searching the DIS to find "new" items of the particular class. An evaluation means determines and verifies that the new item(s) is a new subset of the particular class and adds or updates the particular class in the resource directory.
In a preferred embodiment the invention is directed to a method of adding new documents to a resource list of existing documents, comprising the steps of: learning selection information which selects the documents on the resource list; making a persistent association between the selection information and the resource list; using the selection information to select a set of documents which the information specifies; and adding new documents to the resource list, the new documents being added belonging to a subset of the selected set of documents which contains documents which are not already on the resource list.
In another embodiment there is provided apparatus for making a resource list of documents which have contents belonging to a class, the apparatus comprising: a first list of documents, all of which have contents belonging to the class; a second list of documents, none of which have contents belonging to the class; learning means responsive to the first list of documents and the second list of documents for learning selection information which specifies documents whose contents belong to the class; means responsive to the selection information for finding the documents whose contents belong to the class, using the documents to make the resource list, and making a persistent association between the selection information and the resource list.
In a still further embodiment there is provided in an information system which stores related data and information as items for a plurality of interconnected computers accessible by a plurality of users, a method for finding items of a particular class residing in the information system comprising the steps of: a) identifying as training data a plurality of items characterized as positive and/or negative examples of the class; b) using a learning technique to generate from the training data at least one query that can be submitted to any of a plurality of methods for searching the information system; c) submitting said query to at least one search method and collecting any new item(s) as a response to the query; d) evaluating the new item(s) by a learned model with the aim of verifying that the new item(s) is indeed a new subset of the particular class; and e) presenting the new subset of the new item(s) to a user of the system.
In another embodiment there is provided an article of manufacture comprising: a computer useable medium having computer readable program code means embodied therein for finding items of a particular class residing in an information system which stores related data and information as items for a plurality of interconnected computers accessible by a plurality of users, the computer readable program code means in said article of manufacture comprising: a) program code means for identifying as training data a plurality of items characterized as positive and/or negative examples of the class; b) program code means for using a learning technique to generate from the training data at least one query that can be submitted to any of a plurality of methods for searching the information system; c) program code means for submitting said query to at least one search method and collecting any new item(s) as a response to the query; d) program code means for evaluating the new item(s) by the at least one search method with the aim of verifying that the new item(s) is indeed a new subset of the particular class; and e) program code means for presenting the new subset of new item(s) to a user of the system.
Description of the Drawings
Fig. 1 is a representation of a prior art distributed information system which implements the principles of the present invention.
Fig. 2 is a listing of pseudo code for a batch query-learner incorporating the principles of the present invention in the system of Fig. 1.
Fig. 3 is a representation of an interactive query-learning system incorporating the principles of the present invention.
Fig. 4 is a user interface to the query learning system of the present invention.
Fig. 5 is a listing of pseudo code for an on line prediction algorithm incorporating the principles of the present invention.
Fig. 6 is a Table summarizing experiments with the learning system of Fig. 3.
Fig. 7 is a Table summarizing experiments with the learning system of Fig. 2.
Figs. 8A, 8B, 8C and 8D are graphs of data showing the results of precision-recall tradeoff for the three problems studied with the batch query learning system of Fig. 2.
Fig. 9 is a Table of results of a generalization error study for the learning systems of Fig. 2 and Fig. 5.

Description of Preferred Embodiments
The problem addressed by the present invention is a variant of the problem of relevance feedback, which is well-studied in information retrieval. One novel aspect of the present invention (other than the WWW-based setting) is that we will focus, as much as is practical, on learning methods that are independent of the search engine used to answer a query. This emphasis seems natural in a WWW setting, as there are currently a number of general-purpose WWW search engines, all under constant development, none clearly superior to the others, and none directly supporting relevance feedback (at the time of this application); hence it seems inappropriate to rely too heavily on a single search engine. The current implementation can use any of several search engines. A second motivation for investigating search engine independent learning methods is that there are many search engines accessible from the WWW that index databases partially or entirely separate from the WWW. As WWW browsers and the Common Gateway Interface (CGI) now provide a nearly uniform interface to many search engines, it seems reasonable to consider the problem of designing general-purpose relevance feedback mechanisms that require few assumptions to be made about the search engine.
A distributed information system 10, e.g., the Internet, to which the invention is applicable is shown in Fig. 1. The Internet is further described in the text "How The Internet Works" by Joshua Eddings, published by Ziff Davis, 1994. The system includes a plurality of processors 12 and related databases 14 coupled together through routers (not shown) for directing messages among the processors in accordance with network protocols. Each processor and related database is coupled to a plurality of users through servers (not shown). The users may originate messages for purposes of communication with other users and/or search the system for information using search engines.
The initial research goal was to implement a WWW-based query-learning system in the system of Fig. 1 and support meaningful experimentation to provide a qualitative evaluation of the difficulty of the task.

CA 02245913 2001-03-27

To conduct this initial evaluation two different systems were implemented: one designed for batch use, and the other designed for interactive use, as will be described hereinafter.
A Batch System
The first implementation is a Perl script that runs as a "batch" system; it requires no user intervention. The input of the batch system is a list of Uniform Resource Locators (URLs) that correspond to the positive examples of an unknown concept. The batch system has two outputs: an intensional representation of the unknown concept, and a set of example documents that include all of the positive examples plus a sample of negative examples.
The procedure used to accomplish this is shown in Fig. 2. Three subroutines are used. The first, Learn, learns a concept from a sample. The only assumption made by the query-learning system about the learning system is that the hypothesis of the learning system is in disjunctive normal form (DNF), where the primitive conditions test for the presence of words.
For example, a DNF hypothesis learned from a resource list on college basketball might be:
$(\text{college} \wedge \text{basketball}) \vee (\text{college} \wedge \text{hoops}) \vee (\text{NCAA} \wedge \text{basketball})$
Henceforth we will call each term (conjunction) in this DNF a "rule".
A set of k rules can be easily converted to k search queries, each of which consists of a conjunction of words---a query format that is supported by practically every search engine. The restriction, therefore, makes the system largely independent of the search engine used.
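The conversion from rules to queries can be sketched as follows. This is a minimal Python illustration, not the Perl implementation the specification refers to; the names `rule_to_query`, `dnf_to_queries` and `dnf_matches`, and the ` AND ` joiner, are assumptions made for the example, since each real search engine has its own query syntax.

```python
from typing import FrozenSet, Iterable, List, Set

Rule = FrozenSet[str]  # one conjunction: every word must be present


def rule_to_query(rule: Rule, joiner: str = " AND ") -> str:
    """Encode one rule as a conjunctive query (hypothetical syntax)."""
    return joiner.join(sorted(rule))


def dnf_to_queries(rules: Iterable[Rule]) -> List[str]:
    """A DNF with k rules yields k independent conjunctive queries."""
    return [rule_to_query(rule) for rule in rules]


def dnf_matches(rules: Iterable[Rule], tokens: Set[str]) -> bool:
    """A document satisfies the DNF if any single rule is fully contained in it."""
    return any(rule <= tokens for rule in rules)


# The college basketball hypothesis used as the running example.
hypothesis = [
    frozenset({"college", "basketball"}),
    frozenset({"college", "hoops"}),
    frozenset({"NCAA", "basketball"}),
]
queries = dnf_to_queries(hypothesis)
```

Because each rule becomes its own query, the union of the k result lists plays the role of the disjunction, so the search engine itself never needs to support OR.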
The second subroutine used by the query-learning system, Corresponding Query, converts a single rule to a query for the search engine being used. Some knowledge about the search engine is clearly needed to appropriately encode the query; however, because most search engines use similar formats for queries, adding the knowledge needed to support a new search engine is usually straightforward.
Some search engines can handle more expressive queries: queries that require terms to appear near each other, or queries that contain word stems like "comput*". Most advanced queries are not currently supported by the existing Corresponding Query routine.
One exception is queries containing conditions that check for the absence (rather than the presence) of words, such as $(\text{basketball} \wedge \neg\text{college} \wedge \neg\text{NCAA})$. These can be used if both the learning system and the query system allow it, but were not used in any of the experiments of this invention.
The final subroutine, Top-k-Documents, submits a query to a search engine and collects the top k documents returned. Again, some knowledge about the search engine is needed to perform this task.
The basic procedure followed by the batch query-learner is to repeatedly learn a set of rules, convert these rules to queries, and then use incorrect responses to these queries as negative examples. The premise behind this approach is that the responses to learned queries will be more useful than randomly selected documents in determining the boundary of the concept. Although this simple method works reasonably well, and can be easily implemented with existing search engines, we suspect that other strategies for collecting examples may be competitive or superior; for instance, promising results have been obtained with "uncertainty sampling", see Lewis and Gale (15), and with query-learning by committee, see Seung et al. (25). Also see Dagan and Engelson (10).
A few final details require some discussion.
Constraining the initial query: To construct the first query, a large set of documents was used as default negative examples. A "default negative example" is treated as an ordinary negative example unless it has already been labeled as a positive example, in which case the example is ignored. We used 363 documents collected from a cache used by our labs' HTTP proxy server as default negative examples.
Termination: In the current implementation, the process of learning rules and then collecting negative examples is repeated until some resource limit set by the user is exceeded. Currently the user can limit the number of negative examples collected, and the number of times the learning system is called.
Avoiding looping: It may be that on a particular iteration, no new documents are collected. If this occurs, then the training data on the next iteration will be the same as the training data on the previous iteration, and the system will loop. To avoid this problem, if no new documents are collected on a cycle, heuristics are used to vary the parameters of the learning system for the next cycle. In the current implementation, two heuristics are followed: if the hypothesis of the learning system is an empty rule set, then the cost of a false negative is raised; otherwise, the cost of a false positive is raised. The proper application of these heuristics, of course, depends on the learning system being used.
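The batch procedure, together with the termination and loop-avoidance details above, might be sketched as below. This is an illustrative Python outline rather than the pseudo code of Fig. 2: the function name `batch_query_learner`, the three callables passed in, the doubling and halving steps, and the use of a single loss ratio (cost of a false negative divided by cost of a false positive) to stand for "the parameters of the learning system" are all assumptions.

```python
from typing import Callable, FrozenSet, Iterable, List, Set, Tuple

Rule = FrozenSet[str]


def batch_query_learner(
    positives: Set[str],
    default_negatives: Set[str],
    learn: Callable[[Set[str], Set[str], float], List[Rule]],
    corresponding_query: Callable[[Rule], str],
    top_k_documents: Callable[[str, int], Iterable[str]],
    k: int = 10,
    max_negatives: int = 50,
    max_learn_calls: int = 5,
) -> Tuple[List[Rule], Set[str]]:
    """Return the last learned rules and the negative examples collected."""
    # A default negative that is also labeled positive is ignored.
    defaults = set(default_negatives) - positives
    collected: Set[str] = set()
    rules: List[Rule] = []
    loss_ratio = 1.0  # assumed starting point: both error types cost the same
    for _ in range(max_learn_calls):  # resource limit on learner invocations
        rules = learn(positives, defaults | collected, loss_ratio)
        new_docs: Set[str] = set()
        for rule in rules:
            for url in top_k_documents(corresponding_query(rule), k):
                # Any response not on the resource list is an incorrect response.
                if url not in positives and url not in defaults and url not in collected:
                    new_docs.add(url)
        if new_docs:
            collected |= new_docs
        elif not rules:
            loss_ratio *= 2.0  # empty rule set: raise the cost of a false negative
        else:
            loss_ratio /= 2.0  # otherwise: raise the cost of a false positive
        if len(collected) >= max_negatives:  # resource limit on negatives
            break
    return rules, collected
```

Passing the learner and the two search-engine routines in as callables mirrors the stated goal of keeping the procedure independent of any particular learner or search engine.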
An Interactive System
The batch system assumes that every document not on the resource list is a negative example. This means that it cannot be successfully used unless one is confident that the initial set of documents is reasonably complete. Our experience so far is that this is seldom the case. For this reason, we also implemented an interactive query-learning system, which does not assume completeness of an initial set of positive examples; instead, it relies on the user to provide appropriate labels.
The interactive system does not force any particular fixed sequence for collecting documents and labeling; instead it is simply an augmented WWW browser, which allows the user to label the document being browsed, to invoke the learning system, or to conduct a search using previously learned rules.
The architecture of the interactive system is shown in Fig. 3. The user's interface to the query-learning system is implemented as a separate module that is interposed between a WWW browser and an HTTP proxy server. This module performs two main jobs. First, every HTML document that is transmitted from the proxy server to the browser is augmented, before being sent to the browser, by adding a small amount of text and a small number of special links at the beginning of the document. Second, while most HTTP requests generated by the browser are passed along unmodified to the proxy server, the HTTP requests that are generated by clicking on the special inserted links are trapped out and treated specially.
This implementation has the advantage of being browser-independent. Following current practice, the system has been assigned an acronym: Surfing While Inducing Methods to Search for URLs, or SWIMSUIT.
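The two jobs of the interposed module can be illustrated with a short Python sketch. It is not the patented implementation: the URL prefix `/__query_learner__/`, the link labels, and the function names `control_bar`, `augment_html` and `route_request` are hypothetical, and a real module would also have to relay traffic to and from the proxy server.

```python
import re
from typing import Optional, Tuple
from urllib.parse import parse_qs, quote, urlsplit

SPECIAL_PREFIX = "/__query_learner__/"  # hypothetical marker for trapped links
ACTIONS = ("yes", "no", "learn", "search", "set-options", "review-previous", "help")


def control_bar(page_url: str) -> str:
    """Build the small block of text and special links for one page."""
    target = quote(page_url, safe="")
    links = " | ".join(
        f'<a href="{SPECIAL_PREFIX}{action}?url={target}">{action}</a>'
        for action in ACTIONS
    )
    return f"<p>[query learner] {links}</p>"


def augment_html(html: str, page_url: str) -> str:
    """Job 1: insert the control bar at the beginning of the document body."""
    bar = control_bar(page_url)
    match = re.search(r"<body[^>]*>", html, flags=re.IGNORECASE)
    if match is None:  # no <body> tag: prepend to the whole document
        return bar + html
    return html[: match.end()] + bar + html[match.end():]


def route_request(path: str) -> Optional[Tuple[str, str]]:
    """Job 2: trap special links; return None to pass a request through unmodified."""
    parts = urlsplit(path)
    if not parts.path.startswith(SPECIAL_PREFIX):
        return None
    action = parts.path[len(SPECIAL_PREFIX):]
    target = parse_qs(parts.query).get("url", [""])[0]
    return action, target
```

Because the links are ordinary HTML anchors and the trapping happens between browser and proxy, nothing in this scheme depends on a particular browser, which is the advantage noted above.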
The user's view of the query-learning system is a set of special links that appear at the top of each HTML page. Clicking on these links allows the user to perform operations such as classifying a document or invoking the learning system.
Functionally, the special links inserted by the query-learning interface act as additional "control buttons" for the browser, similar to the buttons labeled "Back" and "Net Search" on the Netscape browser. By clicking on special links, the user can classify pages, invoke the learning system, and so on. The user's view of the interactive system is shown in Fig. 4.
The special links are:
Document labeling: The yes link and no link allow the user to classify the current page as a positive (respectively negative) example of the current class.
Invoking the learner: The learn link returns a form that allows the user to set options for the actual learning system and/or invoke the learner on the current class. The behavior of this link can be easily changed, so that different learning systems can be used in experiments. As in the batch system, learning is normally constrained by using default negative examples. This means that reasonable rules can often be found even if only a few positive examples are marked.
Searching: The search link returns a list of previously learned rules. Clicking on any rule will submit the corresponding query to the currently selected search engine, and return the result.
Configuration and help: The set options link returns a form that allows the user to change the current class (or to name a new class), or to change the current search engine; the review previous link returns an HTML page that lists all previously marked examples of the current class; and the help link returns a help page.
Learning Systems
Two learning systems have been integrated with the system: RIPPER, a propositional rule learner that is related to FOIL, see Quinlan (21), and a rule-learning version of "sleeping experts". Sleeping experts is a new prediction algorithm that combines ideas from algorithms used for online prediction, see Freund (11), with the infinite attribute model of Blum (3). These algorithms have different strengths and weaknesses. RIPPER implicitly assumes that examples are i.i.d., which is not the case for samples collected via browsing or by the batch query-learning system. However, formal results suggest that sleeping experts will perform well even on data sets that are selected in a non-random manner. The sleeping experts algorithm is also largely incremental, which is potentially an advantage in this setting. On the other hand, sleeping experts uses a more restricted hypothesis space, and cannot learn large rules, whereas RIPPER can (at least in principle).
RIPPER
Briefly, RIPPER builds a set of rules by repeatedly adding rules to an empty ruleset until all positive examples are covered. Rules are formed by greedily adding conditions to the antecedent of a rule with an empty antecedent until no negative examples are covered. After a ruleset is constructed, an optimization postpass massages the ruleset so as to reduce its size and improve its fit to the training data. A combination of cross-validation and minimum-description length techniques is used to prevent overfitting. In previous experiments, RIPPER was shown to be comparable to C4.5rules, Quinlan (22), in terms of generalization accuracy, but much faster for large noisy datasets. For more detail, see Cohen (8).
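The covering loop described above can be illustrated with a deliberately simplified Python sketch. It is not RIPPER: it omits the optimization postpass, cross-validation and minimum-description-length pruning, and it scores candidate words by precision on the currently covered examples, which is an assumption made for brevity. The names `grow_rule` and `learn_ruleset` are hypothetical.

```python
from typing import FrozenSet, List, Set


def grow_rule(pos: List[Set[str]], neg: List[Set[str]]) -> FrozenSet[str]:
    """Greedily add word-presence conditions until no negative example is covered."""
    rule: Set[str] = set()
    cov_pos, cov_neg = list(pos), list(neg)
    while cov_neg:
        candidates = set().union(*cov_pos) - rule
        best, best_score = None, None
        for word in sorted(candidates):  # sorted for deterministic tie-breaking
            p = sum(word in doc for doc in cov_pos)
            n = sum(word in doc for doc in cov_neg)
            score = (p / (p + n), p)  # precision first, then coverage
            if best_score is None or score > best_score:
                best, best_score = word, score
        if best is None:  # no condition left to add
            break
        rule.add(best)
        cov_pos = [doc for doc in cov_pos if best in doc]
        cov_neg = [doc for doc in cov_neg if best in doc]
    return frozenset(rule)


def learn_ruleset(pos: List[Set[str]], neg: List[Set[str]]) -> List[FrozenSet[str]]:
    """Add rules to an empty ruleset until all positives are covered (or no clean rule exists)."""
    rules: List[FrozenSet[str]] = []
    remaining = list(pos)
    while remaining:
        rule = grow_rule(remaining, neg)
        if not rule or any(rule <= doc for doc in neg):
            break  # could not exclude every negative; stop rather than loop
        rules.append(rule)
        remaining = [doc for doc in remaining if not rule <= doc]
    return rules


pos_docs = [{"college", "basketball", "ncaa"}, {"college", "hoops"}, {"ncaa", "basketball", "scores"}]
neg_docs = [{"college", "tuition"}, {"basketball", "shoes"}, {"stocks"}]
ruleset = learn_ruleset(pos_docs, neg_docs)
```

Each learned rule is a set of words, so the resulting ruleset is already in the DNF form that the query-learning system requires.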
The version of RIPPER used here was extended to handle "set-valued features", as described in Cohen (9). In this implementation of RIPPER, the value of a feature can be a set of symbols, rather than (say) a number or a single symbol. The primitive conditions that are allowed for a set-valued feature $F$ are of the form $c \in F$, where $c$ is any constant value that appears as a value of $F$ in the dataset. This leads to a natural way of representing documents: a document is represented by a single feature, the value of which is the set of all tokens appearing in the document. In the experiments, documents were tokenized by deleting e-mail addresses, HTML special characters, and HTML markup commands; converting punctuation to spaces; converting upper to lower case; removing words from a standard stoplist, Lewis (17); and finally treating every remaining sequence of alphanumeric characters as a token. To keep performance from being degraded by very large documents, we only used tokens from the first 100 lines of a file. This also approximates the behavior of some search engines, which typically index only the initial section of a document.
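The tokenization steps listed above could look roughly like this in Python. The regular expressions and the tiny `STOPLIST` are stand-ins chosen for the example (the stoplist actually used is the one cited, which is not reproduced here), and `tokenize` is a hypothetical name.

```python
import re
from typing import FrozenSet, Set

# Tiny illustrative stoplist; an assumption, not the stoplist cited in the text.
STOPLIST: FrozenSet[str] = frozenset({"a", "an", "and", "the", "of", "to", "in", "is", "for"})


def tokenize(text: str, max_lines: int = 100, stoplist: FrozenSet[str] = STOPLIST) -> Set[str]:
    """Return the set of tokens that represents a document as one set-valued feature."""
    text = "\n".join(text.splitlines()[:max_lines])   # keep only the first lines
    text = re.sub(r"<[^>]*>", " ", text)              # delete HTML markup commands
    text = re.sub(r"\S+@\S+", " ", text)              # delete e-mail addresses
    text = re.sub(r"&#?\w+;", " ", text)              # delete HTML special characters
    text = re.sub(r"[^0-9A-Za-z\s]", " ", text)       # convert punctuation to spaces
    return {tok for tok in text.lower().split() if tok not in stoplist}


page = (
    "<html><body><h1>College Hoops &amp; NCAA</h1>\n"
    "<p>Mail coach@example.edu for the schedule.</p></body></html>"
)
tokens = tokenize(page)
```

Returning a set rather than a list matches the set-valued feature representation: a condition of the form $c \in F$ is then a plain membership test such as `"hoops" in tokens`.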
A second extension to RIPPER allows the user to specify a loss ratio, see Lewis and Catlett (14). A loss ratio indicates the ratio of the cost of a false negative error to the cost of a false positive error; the goal of learning is to minimize total misclassification cost, rather than simply the number of errors, on unseen data. Loss ratios in RIPPER are implemented by changing the weights given to false positive errors and false negative errors in the pruning and optimization stages of the learning algorithm.
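As a small worked illustration of the loss ratio, the Python function below computes total misclassification cost when the cost of a false positive is normalized to 1, so that a false negative costs `loss_ratio`. The normalization and the name `misclassification_cost` are assumptions for the example; how RIPPER applies the weights internally during pruning and optimization is not shown.

```python
from typing import Sequence


def misclassification_cost(y_true: Sequence[bool], y_pred: Sequence[bool], loss_ratio: float = 1.0) -> float:
    """Total cost = loss_ratio * (false negatives) + 1 * (false positives)."""
    false_neg = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    false_pos = sum(1 for t, p in zip(y_true, y_pred) if p and not t)
    return loss_ratio * false_neg + false_pos


labels = [True, True, False, False]
guesses = [True, False, True, True]  # one false negative, two false positives
cost_equal = misclassification_cost(labels, guesses)             # plain error count
cost_weighted = misclassification_cost(labels, guesses, 4.0)     # misses cost 4x
```

With a loss ratio of 1 the cost reduces to the ordinary error count; raising the ratio makes a missed positive more expensive than a spurious one.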
One additional modification to RIPPER was also made specifically to improve performance on the query-learning task. The basic RIPPER algorithm is heavily biased toward producing simple, and hence general, conjunctions; for example, for RIPPER, when a conjunction of conditions is specific enough to cover no negative examples, no further conditions will be added. This bias appears to be inappropriate in learning queries, where the concepts to be learned are typically extremely specific. Thus, we added a postpass to RIPPER that adds to each rule all conditions that are true for every positive example covered by the rule. Actually, the number of conditions added was limited to a constant $k$; in the experiments below, $k = 20$. Without this restriction, a rule that covers a group of documents that are nearly identical could be nearly as long as the documents themselves; many search engines do not gracefully handle very long queries. We note that a similar scheme has been investigated in the context of the "small disjunct problem", see Holte (14). The postpass implements a bias towards specific rules rather than general rules.
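The specializing postpass can be sketched in a few lines of Python. The text does not say which conditions are kept when more than k qualify, so taking them in alphabetical order is purely an assumption of this example, as is the name `specialize`.

```python
from typing import FrozenSet, List, Set


def specialize(rule: FrozenSet[str], positives: List[Set[str]], k: int = 20) -> FrozenSet[str]:
    """Add up to k words that occur in every positive example the rule covers."""
    covered = [doc for doc in positives if rule <= doc]
    if not covered:
        return frozenset(rule)
    common = set.intersection(*(set(doc) for doc in covered)) - rule
    extra = sorted(common)[:k]  # assumed selection order: alphabetical
    return frozenset(rule) | frozenset(extra)


examples = [
    {"ncaa", "basketball", "college", "scores"},
    {"ncaa", "basketball", "college"},
    {"hoops"},  # not covered by the rule below, so it does not constrain it
]
general = frozenset({"ncaa"})
specific = specialize(general, examples)
capped = specialize(general, examples, k=1)
```

The specialized rule still covers exactly the same positive examples as before; it only becomes a longer, more specific query.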
Sleeping Experts
In the past years there has been a growing interest in online prediction algorithms. The vast majority of the prediction algorithms are given a pool of fixed "experts", each of which is a simple, fixed classifier, and build a master algorithm, which combines the classifications of the experts in some manner. Typically, the master algorithm classifies an example by using a weighted combination of the predictions of the experts. Building a good master algorithm is thus a matter of finding an appropriate weight for each of the experts. Formal results show that by using a multiplicative weight update, see Littlestone (18), the master algorithm is able to maintain a set of weights such that the predictions of the master algorithm are almost as good as the best expert in the pool, even for a sequence of prediction problems that is chosen by an adversary.
The sleeping experts algorithm is a procedure of this type. It is based on two recent advances in multiplicative update algorithms. The first is a weight allocation algorithm called Hedge, due to Freund and Schapire, see Freund (11), which is applicable to a broad class of learning problems and loss functions. The second is the infinite attribute model of Blum (3).
In this setting, there may be any number of experts, but only a few actually post predictions on any given example; the remainder are said to be "sleeping" on that example. A multiplicative update algorithm for the infinite attribute model (based on Winnow, Littlestone (19)) has also been implemented, see Blum (4).
Below we summarize the sleeping experts procedure, which combines the Hedge algorithm with the infinite attribute model to efficiently maintain an arbitrarily large pool of experts with an arbitrary loss function.
The Master Alaorithm Pseudo-code for the algorithm is shown in Fig. 5. The master algorithm maintains a pool, which is a set recording which experts have been active on any previous example, and a set of weights, denoted by p, for every expert in the pool. At all times, all weights in p will be non-negative. However, the weights need not sum to one. At each time step t, the learner is given a new instance Xt to classify; the master algorithm is then given a set Wf of integer indices, which represent the experts that are active (i.e., not "sleeping") on Xt. The prediction of expert i on x, is denoted by r Ya Based on the experts in W;, the master algorithm must make a prediction for the class of X;, and then update the pool and the weight set p.
To make a prediction, the master algorithm decides on a distribution $\tilde{p}$ over the active experts, which is determined by restricting the set of weights p to the set of active experts $W_t$ and normalizing the weights. We denote the vector of normalized weights by $\tilde{p}$, where $\tilde{p}_i^t = p_i^t / \sum_{j \in W_t} p_j^t$.
The prediction of the master algorithm is $y^t = F_\beta\left(\sum_{i \in W_t} \tilde{p}_i^t y_i^t\right)$. We use $F_\beta(r) = \ln(1 - r + r\beta) / (\ln(1 - r + r\beta) + \ln((1 - r)\beta + r))$, the function used by Vovk (26) for predicting binary sequences.
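The prediction step above can be illustrated with a minimal Python sketch. It is not the pseudo-code of Fig. 5; the function names and the use of dictionaries keyed by expert are assumptions made for illustration, and the sketch assumes 0 < beta < 1.

```python
import math

def vovk_f(r, beta):
    """F_beta(r) as given in the text; assumes 0 < beta < 1 and 0 <= r <= 1."""
    a = math.log(1.0 - r + r * beta)
    b = math.log((1.0 - r) * beta + r)
    return a / (a + b)

def master_prediction(weights, active_predictions, beta=0.5):
    """Combine the predictions of the active experts.

    weights: dict mapping every expert in the pool to its weight (hypothetical layout).
    active_predictions: dict mapping each active expert to its 0/1 prediction.
    Sleeping experts appear in `weights` but not in `active_predictions`.
    """
    # Restrict the weights to the active experts and normalize them.
    total = sum(weights[i] for i in active_predictions)
    r = sum(weights[i] / total * y for i, y in active_predictions.items())
    return vovk_f(r, beta)
```

With two active experts of weights 1 and 3 predicting 1 and 0 respectively, the weighted vote is r = 0.25, and the master prediction falls below 0.5 whatever weight the sleeping experts carry.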

Each active expert then suffers some "loss" $l_i^t$.
In the implementation described here, this loss is 0 if the expert's prediction is correct and 1 otherwise.
Next, the master algorithm updates the weights of the active experts based on the losses. (The weight of the experts who are asleep remains the same, hence we implicitly set $\forall i \notin W_t: p_i^{t+1} = p_i^t$.) When an expert is first encountered its weight is initialized to 1. At each time step t, the master algorithm updates the weights of the active experts as follows: $\forall i \in W_t: p_i^{t+1} = \frac{1}{Z} p_i^t U_\beta(l_i^t)$, where Z is chosen such that $\sum_{i \in W_t} p_i^t = \sum_{i \in W_t} p_i^{t+1}$.
The "update function" $U_\beta$ is any function satisfying [Cesa-Bianchi et al., (5)] $\beta^r \le U_\beta(r) \le 1 - (1 - \beta) r$. In our implementation, we used the linear update, $U_\beta(r) = 1 - (1 - \beta) r$, which is simple to implement and avoids expensive exponentiations.
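The weight update with the linear update function can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the implementation of Fig. 5: the function name and the dictionary layout are hypothetical, and a newly seen expert is added to the pool with weight 1 just before the update.

```python
def update_weights(weights, active_losses, beta=0.5):
    """Update the pool weights in place after one example.

    weights: dict mapping every expert in the pool to its weight (hypothetical layout).
    active_losses: dict mapping each active expert to its loss in [0, 1].
    Experts that are asleep (absent from `active_losses`) keep their weight.
    """
    for i in active_losses:
        weights.setdefault(i, 1.0)          # first encounter: weight 1
    before = sum(weights[i] for i in active_losses)
    for i, loss in active_losses.items():
        weights[i] *= 1.0 - (1.0 - beta) * loss   # linear update U_beta(loss)
    after = sum(weights[i] for i in active_losses)
    z = after / before                      # Z keeps the active weight mass constant
    for i in active_losses:
        weights[i] /= z
    return weights
```

Because the linear update is at least beta for losses in [0, 1], the total active weight after the update is strictly positive, so the division by Z is safe whenever 0 < beta < 1.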
Briefly, if one defines the loss of the master algorithm to be the average loss with respect to the distribution $\{\tilde{p}_i^t\}$, the cumulative loss of the master algorithm over all t can be bounded relative to the loss suffered by the best possible fixed weight vector. These bounds hold for any sequence of examples $(x_1, y_1), \ldots, (x_T, y_T)$; in particular, the bounds hold for sequences whose instances are not statistically independent.

The Pool of Experts It remains to describe the experts used for WWW page classification. In our experiments each expert corresponds to a phrase that appears in a document. That is, if $w_i$ is the i-th token appearing in the document, each expert is of the form $w_{i_1} w_{i_2} \ldots w_{i_k}$, where $1 \le i_1 < i_2 < \ldots < i_{k-1} < i_k$ and $i_k - i_1 < n$. This is a generalization of the ngram model. Note that our goal is to classify WWW documents; hence each ngram expert is used to predict the classification of the document in which it appears, rather than the next token (word). For each ngram we construct two mini-experts, one which always predicts 0 (not in the class), and one that always predicts 1. The loss of each mini-expert is either 0 or 1 depending on the actual classification of the document.
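The set of experts that are active on a document can be enumerated directly from this definition. The following Python sketch is illustrative only: the function name, the default value of n, and the representation of a phrase as a tuple of tokens are assumptions.

```python
from itertools import combinations

def sparse_phrases(tokens, n=4):
    """Return every phrase w_{i1} ... w_{ik} with i1 < ... < ik and ik - i1 < n.

    tokens: list of the document's tokens in order.
    n: maximum span of a phrase (the default is a hypothetical choice).
    """
    phrases = set()
    for start in range(len(tokens)):
        # Positions that may follow `start` while keeping the span below n.
        window = range(start + 1, min(start + n, len(tokens)))
        for k in range(len(window) + 1):
            for rest in combinations(window, k):
                phrases.add(tuple(tokens[j] for j in (start,) + rest))
    return phrases
```

For the tokens "machine learning course" and n = 3 this yields seven phrases, including the gapped phrase ("machine", "course"). Each phrase would then give rise to two mini-experts, one always predicting 0 and one always predicting 1.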
Extracting Rules From Experts Finally, heuristics are used to construct rules based on the weights constructed by the sleeping experts algorithm. We constructed a rule for each expert that predicts 1 and that has a large weight. This is done by scanning the weights of the combined experts (each combined expert containing two mini-experts) and selecting those which have large weight. More formally, an expert i is used to construct a rule if $p_i^T / \sum_{j \in \mathrm{Pool}} p_j^T \ge w_{min}$, where T is the number of training examples, and $w_{min}$ is a weight threshold for extracting experts. In practice, we have found that most of the weight is often concentrated on few experts, and hence the number of experts extracted is not too sensitive to particular choices of $w_{min}$. We used $w_{min} = 0.0625$ and set the learning rate $\beta$ to be 0.5 in the experiments described below.
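The weight threshold test can be written in a few lines. This Python sketch assumes that the final weights of the combined experts are available as a dictionary; the function name and that layout are hypothetical.

```python
def heavy_experts(pool_weights, w_min=0.0625):
    """Return the experts whose share of the total pool weight is at least w_min.

    pool_weights: dict mapping each combined expert to its final weight
    (hypothetical layout; how the two mini-expert weights are combined
    into one weight is not specified here).
    """
    total = sum(pool_weights.values())
    return [e for e, p in pool_weights.items() if p / total >= w_min]
```

Since the test compares a share of the total weight against the threshold, at most 1/w_min experts (16 for the default value) can ever be extracted.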
Typically, the "heavy" experts correspond to phrases that frequently appear in documents labeled as positive examples; however, they may also appear in many of the negative labeled documents. We therefore examined the mini-experts of each extracted expert and selected those experts which are statistically correlated only with the positive examples. We define the average prediction $p_i$ of expert i, based on its two mini-experts (i,0) and (i,1), to be $p_i = F_\beta(p_{i,0} / (p_{i,0} + p_{i,1}))$. An expert is finally chosen to be used as a rule if its average prediction is larger than $p_{min}$. In the experiments we used $p_{min} = 0.95$ as the default value, and increased or decreased this threshold to encourage proportionally more or fewer positive predictions.
Finally, as was done with RIPPER, we add to each rule the list of all tokens that appear in all positive documents covered by a rule. We also remove all rules that have strictly fewer conditions than another rule in the set. The result is a rule set where each rule is of the form $w_{i_1} \wedge w_{i_2} \wedge \ldots \wedge w_{i_k}$.

Although the sleeping experts algorithm treats this as an ngram, we currently treat it simply as a conjunction of features: clearly, this is suboptimal for search engines which support proximity queries.
Experimental Results We have evaluated the system with three resource directories.
ML courses is part of an excellent machine learning resource maintained by David Aha (http://www.aic.nrl.navy.mil/~aha/research/machine-learning.html). This list contained (at the time the experiments were conducted) pointers to 15 on-line descriptions of courses.
AI societies is a WWW page jointly maintained by SIGART, IJCAI, and CSCSI. It contains pointers to nine AI societies.
Jogging strollers. This is a list of pointers to discussions of, evaluations of, and advertisements for jogging and racing strollers.
Our initial goal was to find resource directories that were exhaustive (or nearly so), containing virtually all positive examples of some narrow category. Our hope was that systematic experiments could then be carried out easily with the batch system. However, finding such a resource turned out to be much harder than we expected.
We began with the ML courses problem, which as a narrow section of a frequently-used resource we expected to be comprehensive; however, preliminary experiments showed that it was not. (The first query constructed by the batch system using RIPPER retrieved (from Altavista) 17 machine learning course descriptions in the first 20 documents; however, only 5 of these were from the original list. For those interested in details, this query was (course $\wedge$ machine $\wedge$ instructor $\wedge$ learning).)
Our next try at finding a comprehensive resource directory was the AI societies problem; this directory had the advantage (not shared by the ML courses directory) that it explicitly stated a goal of being complete. However, similar experiments showed it to be quite incomplete. We then made an effort to construct a comprehensive list with the jogging strollers problem. This effort was again unsuccessful, in spite of spending about two hours with existing browsers and search engines on a topic deliberately chosen to be rather narrow.
We thus adopted the following strategy. With each problem we began by using the interactive system to expand an initial resource list. After the list was expanded, we invoked the batch system to collect additional negative examples and thus improve the learned rules.
Experiments With The Interactive System We used the interactive system primarily to emulate the batch system; the difference, of course, being that positive and negative labels were assigned to new documents by hand, rather than assuming all documents not in the original directory are negative. In particular, we did not attempt to uncover any more documents by browsing, or hand-constructed searches.
However, we occasionally departed from the script by varying the parameters of the learning system (in particular, the loss ratio), changing search engines, or examining varying numbers of documents returned by the search engines. We repeated the cycle of learning, searching, and labeling the results, until we were fairly sure that no new positive examples would be discovered by this procedure.
Fig. 6 summarizes our usage of the interactive system. We show the number of entries in each initial directory, the recall of the initial directory relative to the final list that was generated, as well as the number of times a learner was invoked, the number of searches conducted, and the total number of pages labeled. (The term recall is the fraction of the time that an actual positive example is predicted to be positive by the classifier, and the term precision is the fraction of the time that an example predicted to be positive is actually positive. For convenience, we will define the precision of a classifier that always prefers the class negative as 1.00.) We count submitting a query for each rule as a single search, and do not count the time required to label the initial positive examples. Also, we typically did not attempt to label every negative example encountered in the search.
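The two measures, including the convention for a classifier that never predicts positive, can be stated as a short Python sketch. The function name is hypothetical, and the value returned for recall when there are no actual positives is an assumption, since the text does not cover that case.

```python
def recall_precision(predicted_positive, actual_positive):
    """Return (recall, precision) for two collections of document identifiers."""
    predicted, actual = set(predicted_positive), set(actual_positive)
    hits = len(predicted & actual)
    # Assumption: recall is taken as 1.0 when there are no actual positives.
    recall = hits / len(actual) if actual else 1.0
    # Convention from the text: precision is 1.00 if nothing is predicted positive.
    precision = hits / len(predicted) if predicted else 1.0
    return recall, precision
```

Under this definition an initial directory, viewed as a classifier, has perfect precision, and its recall is the fraction of the final list that it already contained.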
To summarize, the interactive system appears to be very useful in the task of locating additional relevant documents from a specific class; in each case the number of known relevant documents was at least quadrupled. The effort involved was modest: our use of the interactive system generally involved labeling a few dozen pages, waiting for the results of a handful of searches, and invoking the learner a handful of times. In these experiments the time required by the learner is typically well under 30 seconds on a Sun 20/60.
Experiments With The Batch System In the next round of experiments, we invoked the batch system for each of these problems.
Fig. 7 shows the resource limit set for each of these problems (the column "#Iterations Allowed" indicates how many times the learning system could be called), the number of documents k that were collected for each query, and the total number of documents collected by the batch system (not including the initial set of 363 default negative examples). The resource limits used do not reflect any systematic attempt to find optimal limits. However, for the last two problems, the learner seemed to "converge" after a few iterations, and output a single hypothesis (or in one case alternate between two variants of a hypothesis) on all subsequent iterations. In each case, RIPPER was used as the learning system.
We then carried out a number of other experiments using the datasets collected by the batch system. One goal was simply to measure how successful the learning systems are in constructing an accurate intensional definition of the resource directories. To do this we re-ran the learning systems on the datasets constructed by the batch system, executed the corresponding queries, and recorded the recall and precision of these queries relative to the resource directory used in training. To obtain an idea of the tradeoffs that are possible, we varied the number of documents k retrieved from a query and parameters of the learning systems (for RIPPER, the loss ratio, and for sleeping experts, the threshold $p_{min}$). Altavista was used as the search engine.
The results of this experiment are shown in the graphs of Figure 8. The first three graphs show the results for the individual classes and the fourth graph shows the results for all three classes together.
Generally, sleeping experts generates the best high-precision classifiers. However, its rulesets are almost always larger than those produced by RIPPER; occasionally they are much larger. This makes them more expensive to use in searching, and is the primary reason that RIPPER was used in the experiments with the batch and interactive systems.
The constructed rulesets are far from perfect, but this is to be expected. One difficulty is that neither of the learners perfectly fits the training data; another is that the search engine itself is incomplete. However, it seems quite likely that even this level of performance is enough to be useful. It is instructive to compare these hypotheses to the original resource directories that were used as input for the interactive system. The original directories all have perfect precision, but relatively poor recall.
For the jogging strollers problem, both the learners are able to obtain nearly twice the recall (48% vs 25%) at 91% precision. For the AI societies problem, both learners obtain more than three times the recall at 94% precision or better. (RIPPER obtains 57% vs 15% recall with 94% precision.)
We also conducted a generalization error experiment on the datasets. In each trial, a random 80% of the dataset was used for training and the remainder for testing. A total of 50 trials were run for each dataset, and the average error rate, precision and recall on the test set (using the default parameters of the learners) were recorded. The results are shown in Fig. 9.
However, since the original sample is non-random, these numbers should be interpreted with great caution.
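The repeated random-split protocol described above can be sketched in Python as follows. The function name, the `evaluate` callback, and the fixed random seed are assumptions made for illustration; `evaluate` stands in for training a learner on the first argument and measuring a score (for example the error rate) on the second.

```python
import random

def split_trials(examples, evaluate, trials=50, train_fraction=0.8, seed=0):
    """Average `evaluate(train, test)` over repeated random train/test splits.

    examples: list of labeled examples.
    evaluate: hypothetical callback returning a number for one split.
    """
    rng = random.Random(seed)            # fixed seed only for repeatability
    scores = []
    for _ in range(trials):
        shuffled = examples[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        scores.append(evaluate(shuffled[:cut], shuffled[cut:]))
    return sum(scores) / len(scores)
```

Precision and recall on the test portion can be averaged in the same way by passing a different callback.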
Although the results suggest that significant generalization is taking place, they do not demonstrate that the learned queries can fulfill their true goal of facilitating maintenance by alerting the maintainer to new examples of a concept. This would require a study spanning a reasonable period of time.
Summary The World Wide Web (WWW) is currently filled with resource directories---documents that collect together links to all known documents on a specific topic. Keeping resource directories up-to-date is difficult because of the rapid growth in on-line documents. This invention describes the use of machine learning methods as an aid in maintaining resource directories. A resource directory is treated as an exhaustive list of all positive examples of an unknown concept, thus yielding an extensional definition of the concept. Machine learning methods can then be used to construct from these examples an intensional definition of the concept. The learned definition is in DNF form, where the primitive conditions test the presence (or even the absence) of particular words. This representation can be easily converted to a series of queries that can be used to search for the original documents---as well as new, similar documents that have been added recently to the WWW.
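The conversion from a DNF definition to a series of queries can be illustrated with a small Python sketch, with one query per conjunction. The function name and the rule representation are hypothetical, and the "+word" / "-word" notation for required and excluded words is an assumption about the query syntax of the target search engine.

```python
def rules_to_queries(rules):
    """Turn a DNF rule set into one query string per conjunction.

    rules: list of conjunctions; each conjunction is a list of
    (word, present) pairs, where present=False tests for absence.
    """
    return [
        " ".join(("+" if present else "-") + word for word, present in rule)
        for rule in rules
    ]
```

For example, the rule (course AND machine AND instructor AND learning) becomes the single query "+course +machine +instructor +learning"; the union of the documents returned for all queries approximates the documents matched by the whole rule set.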

Two systems were implemented to test these ideas, both of which make minimal assumptions about the search engine. One is a batch system which repeatedly learns a concept, generates an appropriate set of search queries, and uses the queries to collect more negative examples. An advantage of this procedure is that it can collect hundreds of examples with no human intervention; however, it can only be used if the initial resource list is complete (or nearly so). The second is an interactive system. This system augments an arbitrary WWW browser with the ability to label WWW documents and then learn search-engine queries from the labeled documents. It can be used to perform the same sorts of sequences of actions as the batch system, but is far more flexible. In particular, keeping a human user "in the loop" means that positive examples not on the original resource list can be detected. These examples can be added to the resource list, both extending the list and improving the quality of the dataset used for learning. In experiments, these systems produce usefully accurate intensional descriptions of concepts. In two of three test problems, the concepts produced had substantially higher recall than manually-constructed lists, while attaining precision of greater than 90%.
In support of the invention, and in particular the description of the Preferred Embodiment, the following Appendices are included in the application:

Appendix 1. A list of references cited in the application by reference numeral.
Appendix 2. A copy of a README file which describes the source code implementing the presently-preferred embodiment of the invention.
Appendix 3. Commented source code written in perl for the presently-preferred embodiment of the invention.
Appendix 4. A copy of the documentation for the OreO shell tool which was used in the implementation of the presently-preferred embodiment.

References
1. (Apte et al., 1994) Chidanand Apte, Fred Damerau, and Sholom M. Weiss. Automated learning of decision rules for text categorization. ACM Transactions on Information Systems, 12(3):233-251, 1994.
2. (Armstrong et al., 1995) R. Armstrong, D. Freitag, T. Joachims, and T.M. Mitchell. WebWatcher: a learning apprentice for the world wide web. In Proceedings of the 1995 AAAI Spring Symposium on Information Gathering from Heterogeneous, Distributed Environments, Stanford, CA, 1995. AAAI Press.
3. (Blum, 1990) Avrim Blum. Learning boolean functions in an infinite attribute space. In 22nd Annual Symposium on the Theory of Computing. ACM Press, 1990.
4. (Blum, 1995) Avrim Blum. Empirical support for WINNOW and weighted majority algorithms: results on a calendar scheduling domain. In Machine Learning: Proceedings of the Twelfth International Conference, Lake Tahoe, California, 1995. Morgan Kaufmann.
5. (Cesa-Bianchi et al., 1993) Nicolo Cesa-Bianchi, Yoav Freund, David P. Helmbold, David Haussler, Robert E. Schapire, and Manfred K. Warmuth. How to use expert advice. In Proceedings of the Twenty-Fifth Annual ACM Symposium on the Theory of Computing, pages 382-391, May 1993. Submitted to the Journal of the ACM.
6. (Cohen, 1995a) William W. Cohen. Fast effective rule induction. In Machine Learning: Proceedings of the Twelfth International Conference, Lake Tahoe, California, 1995. Morgan Kaufmann.
7. (Cohen, 1995b) William W. Cohen. Learning to classify English text with ILP methods. In Luc De Raedt, editor, Advances in ILP. IOS Press, 1995.
8. (Cohen, 1995c) William W. Cohen. Text categorization and relational learning. In Machine Learning: Proceedings of the Twelfth International Conference, Lake Tahoe, California, 1995. Morgan Kaufmann.
9. (Cohen, 1996) William W. Cohen. Learning with set-valued features. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, Portland, Oregon, 1996.
10. (Dagan and Engelson, 1995) Ido Dagan and Shaun Engelson. Committee-based sampling for training probabilistic classifiers. In Machine Learning: Proceedings of the Twelfth International Conference, Lake Tahoe, California, 1995. Morgan Kaufmann.
11. (Freund and Schapire, 1995) Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In Proceedings of the Second European Conference on Computational Learning Theory, pages 23-27. Springer-Verlag, 1995. A long version will appear in JCSS.
12. (Freund et al., 1992) Y. Freund, H.S. Seung, E. Shamir, and N. Tishby. Information, prediction, and query by committee. In Advances in Neural Information Processing Systems 5, pages 483-490, San Mateo, CA, 1992. Morgan Kaufmann.
13. (Harman, 1995) Donna Harman. Overview of the second text retrieval conference (TREC-2). Information Processing and Management, 3:271-289, 1995.
14. (Holte et al., 1989) Robert Holte, Liane Acker, and Bruce Porter. Concept learning and the problem of small disjuncts. In Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, Detroit, Michigan, 1989. Morgan Kaufmann.
15. (Lewis and Catlett, 1994) David Lewis and Jason Catlett. Heterogeneous uncertainty sampling for supervised learning. In Machine Learning: Proceedings of the Eleventh Annual Conference, New Brunswick, New Jersey, 1994. Morgan Kaufmann.
16. (Lewis and Gale, 1994) David Lewis and William Gale. Training text classifiers by uncertainty sampling. In Seventeenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1994.
17. (Lewis, 1992) David Lewis. Representation and learning in information retrieval. Technical Report 91-93, Computer Science Dept., University of Massachusetts at Amherst, 1992. PhD Thesis.
18. (Littlestone and Warmuth, 1994) Nick Littlestone and Manfred Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212-261, 1994.
19. (Littlestone, 1988) Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4), 1988.
20. (Pazzani et al., 1995) M. Pazzani, L. Nguyen, and S. Mantik. Learning from hotlists and coldlists: towards a WWW information filtering and seeking agent. In Proceedings of AI Tools Conference, Washington, DC, 1995.
21. (Quinlan, 1990) J. Ross Quinlan. Learning logical definitions from relations. Machine Learning, 5(3), 1990.
22. (Quinlan, 1994) J. Ross Quinlan. C4.5: programs for machine learning. Morgan Kaufmann, 1994.
23. (Salton et al., 1983) G. Salton, C. Buckley, and E.A. Fox. Automatic query formulations in information retrieval. Journal of the American Society for Information Science, 34(4):262-280, 1983.
24. (Salton et al., 1985) G. Salton, E.A. Fox, and E. Voorhees. Advanced feedback methods in information retrieval. Journal of the American Society for Information Science, 36(3):200-210, 1985.
25. (Seung et al., 1992) H.S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Proceedings of the Fifth Workshop on Computational Learning Theory, pages 287-294, San Mateo, CA, 1992. Morgan Kaufmann.
26. (Vovk, 1990) V. Vovk. Aggregating strategies. In Proceedings of the Third Annual Workshop on Computational Learning Theory, pages 371-383. Morgan Kaufmann, 1990.

Appendix 2. A copy of a README file which describes the source code implementing the presently-preferred embodiment of the invention.

======================================================================
utilities
======================================================================
oreo.pl - some utilities for handling http requests with oreosh
    general routines for use with oreosh
    my hope is that these can be used for other purposes as well....
connect.pl - simple two-way connection using oreosh.
    mostly there as an example.
trap-request.pl - traps http requests that match a given regexp and handles them specially--specifically a given program is invoked and its output (which should be html) is returned to the client.
markup.pl - allows you to insert arbitrary stuff at the top of html documents, before the client gets a hold of them.
cache.pl, cache-util.pl - implements a simple local cache for .html pages
track.pl - buggy routine to track state of the browser
======================================================================
the form-labeling program
======================================================================
launch.csh - invokes this cascade of oreosh creatures
    client <-> trap-request <-> markup <-> cache <-> proxy
    The client should connect to port 8111 of radish.
autosurf.pl - loads in cache and cycles through it randomly, filtering by the current class definition.
ss-marker.pl - invoked by the markup.pl daemon, this inserts an appropriate header into html documents
ss-main.pl - this traps the requests included in ss-marker.pl and handles them specially. The request handling is done in the files below. All local data is stored in ~user/.ss/cache or ~user/.ss/data.

Appendix 3. Commented source code written in perl for the presently-preferred embodiment of the invention.
APPE~IX 3 s:;bia/csh 3GC r'"/hote~/~cohen/coce/ss"
icil_ali oreosh sere.~.:- SS_LE:.RVER "Sd/ss-darn-ripper. pi"
sate~-: ORLO_?oOXY radish.=esearch.att.com:e113 oreosh -? 8111 -b "Sd/trap-request.pl "GET.~~QXYZQQ-SS~S~ Sd/ss-main.pl' ~et~_.~.:~ ORgO PPOXY radish._xstarch.att.com:8113 oreosa -? 8112 -b "Sd/mazYip.pl -~ Sd/ss-marker.pl"
setea~: OREO_PPOXY radis::.research.att.com:8000 oreosh -~ 8113 -b "Sd/cac~s.pl"

#!/usr/local/bin/perl
##############################################################################
#
# markup.pl -- insert some text into the top of each HTML page
#
# syntax: markup.pl <text-to-insert>
#         markup.pl -f <file-containing-text-to-insert>
#         markup.pl -e <file-containing-text-to-insert>
#
# -f inserts contents of a file
# -e executes a file and inserts result
# in text, %U is replaced with current document's URL
##############################################################################

$ssdir="/home/wcohen/code/ss/";
require "$ssdir/oreo.pl";
$debug=0;

##################################################
# establish connections: the client is on stdin, the proxy on stdout

print STDERR "$0: connecting...\n" if $debug;
open(CLIENT, "+<&0");
close(STDIN);
select(CLIENT); $| = 1;
open(PROXY, "+>&1");
select(PROXY); $| = 1;
close(STDOUT);
print STDERR "$0: connections established...\n" if $debug;

##################################################
# get request, forward to proxy

print STDERR "$0: getting request...\n" if $debug;
$request = &get_request(CLIENT);
print STDERR "$0: sending request:\n" if $debug;
print STDERR "===== begin request =====\n" if $debug;
print STDERR $request if $debug;
print STDERR "====== end request ======\n" if $debug;
print PROXY $request;

##################################################
# figure what type of file this is...

$URL = &requested_URL($request);
$content = &URL_type($URL) if $URL;
print STDERR "$0: URL=$URL\n" if $debug;
print STDERR "$0: type is $content\n" if $debug;

##################################################
# construct insertion string

SWITCH: {
    ($ARGV[0] =~ /^-f(.+)$/) && do { $insert = `cat $1`; last SWITCH; };
    ($ARGV[0] eq "-f") && do { $insert = `cat $ARGV[1]`; last SWITCH; };
    ($ARGV[0] =~ /^-e(.+)$/) && do { $insert = `$1`; last SWITCH; };
    ($ARGV[0] eq "-e") && do { $insert = `$ARGV[1]`; last SWITCH; };
    $insert = $ARGV[0];
}
$insert =~ s/\%U/$URL/g;

##################################################
# send proxy's message to client, with $insert
# inserted after the end of the title

while (<PROXY>) {
    if ($content eq "html" && /(.*<\/title>)(.*$)/i) {
        print CLIENT "$1\n$insert\n$2\n";
        $content = "augmented_html"; # never insert twice
    } else {
        print CLIENT;
    }
}

#!/usr/local/bin/perl
##############################################################################
#
# trap-request.pl --- invoke a special process for certain requests
#
# syntax: trap-request.pl <form-re> <generator-program>
#
# traps out requests matching the regexp <form-re>,
# then invokes <generator-program>. First argument
# of generator-program is stuff that matches the form-re,
# and second argument is data part of request (if any).
# Generator program writes html to output.
##############################################################################

$ssdir="/home/wcohen/code/ss/";
require "$ssdir/oreo.pl";
$debug=0;

##################################################
# figure out what to trap and how to handle it

$special_re = $ARGV[0];
$generator = $ARGV[1];
print STDERR "$0: handle requests matching '$special_re'\n" if $debug;
print STDERR "$0: response generator is '$generator'\n" if $debug;

##################################################
# set up connections: the client is on stdin, the proxy on stdout

print STDERR "$0: connecting...\n" if $debug;
open(CLIENT, "+<&0");
close(STDIN);
select(CLIENT); $| = 1;
open(PROXY, "+>&1");
select(PROXY); $| = 1;
close(STDOUT);
print STDERR "$0: connections established...\n" if $debug;

##################################################
# get request and echo it

print STDERR "$0: getting request...\n" if $debug;
$request = &get_request(CLIENT);
print STDERR "$0: handling request:\n" if $debug;
print STDERR "===== begin request =====\n" if $debug;
print STDERR $request if $debug;
print STDERR "====== end request ======\n" if $debug;

##################################################
# process request

if ($request =~ /($special_re)/) {
    # trapped: run the generator and return its output to the client
    local($save);
    $match_string = $1;
    $request =~ /\n\r?\n(.*)$/;
    $form_arguments = $1;
    &send_header(CLIENT);
    $save = $|; $| = 1;
    open(RESPONSE,"$generator '$match_string' $form_arguments|")
        || die("can't invoke generator");
    while (<RESPONSE>) {
        print CLIENT;
    }
    close(RESPONSE);
    $| = $save;
} else {
    # not trapped: pass the request through to the proxy unchanged
    print PROXY $request;
    while (<PROXY>) {
        print CLIENT $_;
    }
}

#!/usr/local/bin/perl
##############################################################################
#
# cache.pl --- implement a simple cache
#
# for now -- restricted to html documents. must be invoked with
# -b option to avoid contention for cache
- cache.pl --- implement a simple cache _ :or ..~.o:: - - res tric _ed to htnl doc::ments . mus : be :.a~'OKed with -_ . cotion to a~:oid coast=::==on for cache xxsss**sssx**x**xx*s*s*rx*rsx*rxssssxsxxx*xss==asses*as****x*x*ssa*xxsa*xssss*
Sssdir~"/home/:rcohen/code/ss/";
require "Sssdir/oreo.pl":
require "Sssdir/cache-util.pl":
Sdebug~0:
:sszs**ss:**xae***s*asasss**axirss:s**x:***issss:a s initialize cache ~:ar:ables, etc (5cachedir.5maxcaches~=el~sinit_~_ache:
print STD~RR "SD: max~Smafcachesize, cachdir=-=achedir''a" if Sdebug:
!s:rssssxa*xrrse*x*ssa:s;:sxxxxxxsss*r*x*xxx*s=ms**
* set up connections open(CLIENT, "+<i0");
close(STDIN);
select(CLIENT) , SI ~ 1 open(PROXY, "i» 1~):
select(PROXY): SI ~ 1 ;
cloSe(STDOUT);
r»r**xxxrxxxxrxxxxxr**xsx*rrxxrxxxsxxrxxrxxrsss*x*
* get request and echo it Srequest ~ iget_request(CLIENT):
print STDERR "S0: handling request:\n" if Sdebug:
print STDERR ~~~~~ begin request ~~~~~\n" if Sdebug:
print STD>rRR Srequest if Sdebug: ' print STDERR ~~~~~~ end request '~~~~~\n" if Sdebug;
xrwrxxxxxxxxxxxxxrxrxxxxxrrrx»xxxxxxxxxxrxxxa.s*rx r process request SURL ' crequested URL(Srequest);
Stype - i~URL_type(SURL) if SURL:
print STD>rRR ~S0: type-Stype URL~SURL\n" if 5debug:
if ( SURL) acaehe_listing ' sload cache listing:
Sf filename ' surl2file(slookup url(SURL._cache_listing)):
if (Sfilenasae ~Z -a Sfilename si !(Srequest =- /Pragma: no-cache/)) print STDERR "S0: cache has URL in Sfilename\n" if Sdebug:
rreplay the response found in cache open(R~SPO,tSE."<5cachedir/Sfilename") '! die("can't open cached file"):
SO

.2~- _~ ...,. -_.. =3C~~ ~- ?3Qe ~n_::r<RESPONSF>) ?tint CLILvT:
__-~e~°E5P0'S~):
i eae -- ' S f filename ~ ( print STD~RR "S0: removing file Sfi'_e~ame. .\n" =f Sdebug;
'request gas that cache not be used =so expunge the old cache enter snlink("Scachedir/Sfilename");~
=s~rstem "~ -f Sfilename" t:, :complain:
or (Si=0: Si<-Sxcache_listing: Si+-) ( Scache_1-sting(Siy = "" if Scache lati.~.g(5iy eq "SL'RL~,.~.'':
y 'get an answer 'tom the pro:s7, record in te:npfi'_e pr:.~.~.. STDERR "S0: cache has no CRL StRL\n" .f Sdebug;
p'- ' ~
jpe~'-~~XY'Srecuestr/T~P..) ~~ di°~"ean . open cache temp file-';
'.:h~_er<PROXY>) ( pz:nt CLIE.~JT:
print Tip:
t clo s a ( T~Ip ) ;
y ~updnte_cache(SL'RL,:cache_listing);
y else ( print STDERR "S0: not a cacheable response\n" if Sdebug;
~tnot a request--so handle it normally print PROXY Srequest;
awhile(<pROXY>) print CLIENT:
y close(TEMP);
y sub updnte_cache x(Surl,~listing) local(Surl.?liating) ~ ,a_;
local(Sfilename);
print sTDERR ~S0: caching Surl\n" if Sdebug;
tdel~_te url from cache and append to the end for (Si~O: Si<.Srlisting; Si++) ( y Slisting(Siy . "" if Slisting(Siy '- /SurlS/;
Slisting(Srlistingrl) ~ "Surl\n"~
x print STDERR "S0: cache t Surl:\n" if Sdebug:
sshoW_cache(?listing) if Sdebug;
k truncate cache to appropriate size print STDERR "S0: new cache has Sxlisting, limit a Smaxcachesize\n"
if Sdebug:
if (S~listing >- Smaxcachesize) .°~_ .. ~.:.i: ._.. ~3CA.D1 ~3Qe Sndei = S~listing-smaxcac'.~.esize:
for '.,~0: Si.Sndel: Si-w __i=ename = :url2file(Slisting(5i1):
_ :l:nk( "Scachedir/S::.iename" ) =s:rstem "::a -f Scachedir/Sfilename" ~: :comalain:
s_isting(Si~) _ ~";
) save t::e cache open( LIS:=::G, ">Scachedir/LISTIaG" ) ~ ; die ( "can' . arae cache listing f i=e'' ) ;
print LIS=I'G join("".prep(/.-/,.'-_listing)):
close(L:S::aG);
x move tie response file to appropriate place if (-e "SCachedir/TEMP" s: -s "Scachedir/TEMP") ( s f ile.~.ame - surl2: ile ( 5 url ) print STDERR "S0: url->f~le Sfilename\n" i: Sdebug:
re~ame~"Scachedir/TEMP"."Scachedir/Sfilename') ; ~comala:n:

sub complain print STDEDR "S0: cocsaaand fails!\n":
) sub show_cache x(cache) I
local(:listing) ~ =_;
~ print STDERR ~ ~~......~....~~.~.~~\n"
print S'fDEPR join("".grep(/.y/,~listiag)):
print STDERR "~~............._..~.\n":

#!/usr/local/bin/perl
##############################################################################
#
# oreo.pl -- perl routines to be used with the oreosh
#
##############################################################################

# return the URL named in an HTTP/1.0 GET request, or "" if there is none
sub requested_URL # (request)
{
    local($request) = @_;
    ($request =~ /^GET (http:.*) HTTP\/1\.0/) ? $1 : "";
}

# return "html" for URLs that look like html pages or cgi output
sub URL_type # (url)
{
    local($URL) = @_;
    local($type);
    $type="html" if ($URL =~ /\/$/ || $URL =~ /html?$/);
    $type="html" if ($URL =~ /cgi-bin/);
    $type;
}

# read the request headers, then any body announced by Content-length
sub get_request # (client)
{
    local($client) = @_;
    local($request,$content_length);
    while (<$client>) {
        $request .= $_;
        $content_length = $1 if (/^Content-length: (.*)$/);
        last if /^\r?\n$/;
    }
    while ($content_length--) {
        $request .= (getc $client);
    }
    $request;
}

# write a minimal HTTP/1.0 response header for an html document
sub send_header # (client)
{
    local($client) = @_;
    print $client "HTTP/1.0 200 OK\r\n";
    print $client "MIME-version: 1.0\r\n";
    print $client "Content-type: text/html\r\n";
    print $client "\r\n";
}

# return a true value
1;

#!/usr/local/bin/perl
##############################################################################
#
# cache-util.pl -- perl utilities for cache.pl
#
##############################################################################

sub init_cache
{
    local($ss,$dir,$max);
    $max = 40;
    $ss="$ENV{'HOME'}/.ss";
    if (!(-e "$ss")) {
        mkdir("$ss",0777);
    }
    $dir="$ss/cache";
    if (!(-e "$dir")) {
        mkdir("$dir",0777);
    }
    ($dir,$max);
}

sub load_cache_listing
{
    local(@listing);
    open(LISTING,"<$cachedir/LISTING");
    @listing = <LISTING>;
    close(LISTING);
    @listing;
}

sub lookup_url #(url,listing)
{
    local($url,@listing) = @_;
    foreach $cached_url (@listing) {
        return $url if ($cached_url eq "$url\n");
    }
    return 0;
}

sub url2file #(url)
{
    local($url) = @_;
    $url =~ s|\/|~S|g;
    $url =~ s|\?|~Q|g;
    $url =~ s|\&|~A|g;
    $url =~ s|\n|~N|g;
    $url;
}
1;

#!/usr/local/bin/perl
##############################################################################
#
# ss-help.pl -- help routines
#
##############################################################################
require "$ssdir/ss-util.pl";
$debug=0;
##################################################
# invoke ripper on these examples
sub help_command #(datadir,class,url)
{
    local($datadir,$class,$url) = @_;
    print <<END_OF_TEXT;
<html>
<title>Help screen</title>
<h2>Help</h2>
No on-line help is available yet. If you want an explanation of this system contact <a href="mailto:wcohen@research.att.com">
William Cohen </a>
END_OF_TEXT
    &send_foot($url);
}
1;

#!/usr/local/bin/perl
##############################################################################
#
# ss-label.pl -- routines to label an html document
#
##############################################################################
require "$ssdir/cache-util.pl";
require "$ssdir/ss-util.pl";
##################################################
# label a URL, getting the actual document from
# the cache
sub label_command #(datadir,class,url,label)
{
    local($datadir,$class,$url,$label) = @_;
    local($filename,$title);
    ($cachedir,$maxcachesize)=&init_cache;
    print STDERR "$0: max=$maxcachesize, cachdir=$cachedir\n" if $debug;
    # try to find the url in the cache
    @cache_listing = &load_cache_listing;
    $filename = &url2file(&lookup_url($url,@cache_listing));
    if (!$filename) {
        &send_error("can't find anything in cache for URL $url. Try reloading, then re-la.
    } else {
        # set up subdirectories of labeled URL's
        mkdir("$datadir/classes/$class/Y",0777) unless (-e "$datadir/classes/$class/Y");
        mkdir("$datadir/classes/$class/N",0777) unless (-e "$datadir/classes/$class/N");
        # copy cache file to new label file
        open(CACHE,"<$cachedir/$filename") || die("can't open cached file");
        open(LABEL,">$datadir/classes/$class/$label/$filename") || die("can't open label file for write");
        while (<CACHE>) {
            print LABEL;
        }
        close(CACHE);
        close(LABEL);
        # record that URL was labeled
        open(EXAMPLELISTING,">>$datadir/classes/$class/LISTING.$label") || die("can't append to listing file");
        print EXAMPLELISTING $url,"\n";
        close(EXAMPLELISTING);
        if (-e "$datadir/classes/$class/COUNTER") {
            open(COUNTER,"<$datadir/classes/$class/COUNTER") || die("can't read counter");
            $count = <COUNTER>; chop($count);
            close(COUNTER);
        } else {
            $count = 0;
        }
        open(COUNTER,">$datadir/classes/$class/COUNTER") || die("can't write counter");
        print COUNTER ++$count,"\n";
        close(COUNTER);
        # send acknowledgement
        $title = &html_title("$datadir/classes/$class/$label/$filename");
        print <<END_OF_TEXT;
<html>
<title> Label acknowledgement</title>
<body>
<h2>Label acknowledgement</h2>
<p><strong>Received</strong>: label of &quot<strong>$label</strong>&quot (class &quot<strong>$class</strong>&quot) for the document entitled &quot<strong>$title</strong>&quot and located at http:$url.
</p>
END_OF_TEXT
        &send_foot($url);
    } # else filename was found in cache
}
1;

#!/usr/local/bin/perl
##############################################################################
#
# ss-learn-ripper.pl -- routines to allow user to learn search commands
# for SS using Ripper
#
# this file should define these routines:
#
# &learn_command(datadir,class,url) -- generates the HTML page
#   that the user sees when he clicks on the "learn" option.
#   The generated page is printed to STDOUT. generally it will
#   contain, somewhere, this special anchor:
#       http://QQXYZQQ-SS-$url-invoke-learner
#   when accessed this causes &invoke_learner (see below) to be run.
#   it might also contain a form with the special action
#       http://QQXYZQQ-SS-$url-learner-options-form
#   which, when submitted, causes the process_learner_form to be
#   invoked on its submitted arguments.
#
# &invoke_learner(datadir,class,url) -- runs the learner and
#   generates an HTML page indicating status of the run command.
#
# &process_learner_form(datadir,class,url,options) -- handles the
#   setting of run-time options for the learning system.
#
##############################################################################
require "$ssdir/ss-util.pl";
$debug=0;
##################################################
# send an options form for the learner, and a link that invokes the learner
sub learn_command #(datadir,class,url)
{
    local($datadir,$class,$url) = @_;
    local($nchk,$ychk,$ropts,$nwt,$pwt,$wts);
    print "<html><title>Invoke Rule Learner</title>\n";
    print "<body><h2>Invoke Rule Learner</h2>\n";
    print "Click <a href=\"http://QQXYZQQ-SS-$url-invoke-learner\"> here\n";
    print "</a> to invoke the rule learner on class &quot$class&quot.\n";
    print "This may take a minute or two.\n";
    print "<hr>\n";
    @options = &read_options($datadir,$class);
    # options form
    print "<h2>Set learning options</h2>\n";
    print "<form action=\"http://QQXYZQQ-SS-$url-learner-options-form\">\n";
    print "<p>The options below modify RIPPER's behavior in learning\n";
    print "the class &quot$class&quot.\n";
    $nchk = $ychk = "";
    $nchk = "checked" if (grep(/rare No/,@options));
    $ychk = "checked" unless $nchk;
    print "<p><strong> Assume the class is rare: </strong>\n";
    print "  <input name=\"rare\" type=\"radio\" $ychk value=\"Yes\">Yes</input>\n";
    print "  <input name=\"rare\" type=\"radio\" $nchk value=\"No\">No</input>\n";
    if (($wts) = grep(/^weights/,@options)) {
        ($pwt,$nwt) = ($wts =~ /^weights (\S+) (\S+)/);
    } else {
        ($pwt,$nwt) = (1,1);
    }
    print "<p><strong> Weight for positive examples: </strong>\n";
    print "  <input name=\"pwt\" type=\"text\" value=\"$pwt\"></input>\n";
    print "<br>\n";
    print "<p><strong> Weight for negative examples: </strong>\n";
    print "  <input name=\"nwt\" type=\"text\" value=\"$nwt\"></input>\n";
    print "<p>\n";
    ($ropts) = grep(/^command_line .*$/,@options);
    $ropts =~ s/^command_line //;
    $ropts = " " unless $ropts;
    print STDERR "ropts: $ropts\n" if $debug;
    print "<p><strong>RIPPER command-line options: </strong>\n";
    print "  <input name=\"ripper\" type=\"text\" size=50 value=\"$ropts\"></input>\n";
    print "</p>\n";
    print "<input type=\"submit\" value=\"Set options\"> </input>\n";
    print "<input type=\"reset\" value=\"Reset\"> </input>\n";
    print "</form>\n";
    &send_foot($url);
}
##################################################
# invoke ripper on these examples
sub invoke_learner #(datadir,class,url)
{
    local($datadir,$class,$url) = @_;
    local(@options,$defneg,$ripopts,$catcom,$wd,$save,$pwt,$nwt,$wts);
    # autoflush buffers
    $save = $|; $| = 1;
    $wd = "$datadir/classes/$class";
    print "<html>\n";
    print "<title>Output of RIPPER rule learner</title>\n";
    print "<body>\n";
    print "<h3>Output of RIPPER rule learner</h3>\n";
    print "<p>Preparing data for RIPPER...\n";
    # read in options
    @options = &read_options;
    unless (grep(/rare No/,@options)) {
        $defneg = "$datadir/defneg.data";
        if (!(-e $defneg) || !(-s $defneg)) {
            # create a data file for default negative examples
            local($findcom) = "find /usr/local/www/cache/http \\( -name '*.html' -o -name '*.htm' \\)";
            print "Extracting negative examples from proxy cache...\n";
            print STDERR "findcom: $findcom\n";
            open(DEFNEG,">$defneg") || die("can't create defneg");
            open(CACHELIST,"$findcom -print|") || die("can't list proxy cache");
            while(<CACHELIST>) {
                chop;
                print "<dd> <code>$_</code>\n";
                &file2example($datadir,"",$_,"N",DEFNEG) if (-e $_);
            }
            close(CACHELIST);
            close(DEFNEG);
        }
    }
    if (($wts) = grep(/^weights/,@options)) {
        ($pwt,$nwt) = ($wts =~ /^weights (\S+) (\S+)/);
    } else {
        ($pwt,$nwt) = (1,1);
    }
    open(DATA,">$wd/web.data") || die("can't write to web.data");
    &prepare_examples($datadir,$class,DATA,"Y"," $pwt");
    &prepare_examples($datadir,$class,DATA,"N"," $nwt");
    close(DATA);
    # create names file
    open(NAMES,">$wd/web.names") || die("can't write to web.names");
    print NAMES "Y,N.\n";
    print NAMES "WORDS: set.\n";
    close(NAMES);
    print "prepared.\n</p>";
    print "<p>Invoking RIPPER...</p>\n";
    print "<pre>\n";
    # invoke ripper--read output from a pipe
    ($ripopts) = (join("",@options) =~ /command_line (.*)$/);
    print STDERR "use defneg: $defneg\n" if $debug;
    print STDERR "ripper options: $ripopts\n" if $debug;
    $catcom = "cat $wd/web.data $defneg | clean-data -c Y -s $wd/web";
    open(RIPOUT,"$catcom | ripper -a given -s $ripopts $wd/web|") || die("can't execute ripper");
    while(<RIPOUT>) {
        print;
    }
    close(RIPOUT);
    print "</pre>\n";
    # update counter
    open(COUNTER,">$datadir/classes/$class/COUNTER") || die("can't write to counter");
    print COUNTER "0\n";
    close(COUNTER);
    # finish off html file
    &send_foot($url);
    $| = $save;
}
sub prepare_examples #($datadir,$class,$out,$label,$wt)
{
    local($datadir,$class,$out,$label,$wt) = @_;
    local($file);
    opendir(LABELED,"$datadir/classes/$class/$label") || die("can't list labelled files");
    while ($file = readdir(LABELED)) {
        if ($file =~ /[^.]/) {
            print STDERR "$0: will label file $file as $label\n" if $debug;
            &file2example($datadir,
                          "$datadir/classes/$class/stopwords",
                          "$datadir/classes/$class/$label/$file",
                          $label,$out,$wt);
        }
    }
    closedir(LABELED);
}
##################################################
sub process_learner_form #(datadir,class,url,options)
{
    local($datadir,$class,$url,$options) = @_;
    local($rare,$pwt,$nwt,$command_line);
    ($rare) = ($options =~ /[\?\&]rare=([^\&\/]*)[\&\/]/);
    ($pwt) = ($options =~ /[\?\&]pwt=([^\&\/]*)[\&\/]/);
    ($nwt) = ($options =~ /[\?\&]nwt=([^\&\/]*)[\&\/]/);
    ($command_line) = ($options =~ /[\?\&]ripper=(.*)$/);
    $command_line =~ tr/+/ /;
    $command_line =~ s/%21/!/g;
    print STDERR "$0: rare=$rare\n" if $debug;
    print STDERR "$0: command_line=$command_line\n" if $debug;
    &write_options($datadir,$class,
                   "rare $rare",
                   "weights $pwt $nwt",
                   "command_line $command_line");
    print "<html>\n";
    print "<title>New option acknowledgement</title>\n";
    print "<body>\n";
    print "<h2>New option acknowledgement</h2>\n";
    print "<p><strong>Received</strong>: the following options\n";
    print "have been set for class &quot<strong>$class</strong>&quot:\n";
    print "</p>\n";
    print "<p> <dd> Assume the class is rare: <strong>$rare</strong> </p>\n";
    print "<p> <dd> Example weights: <strong>$pwt for pos, $nwt for neg</strong> </p>\n";
    print "<p> <dd> Options to RIPPER: <strong>$command_line</strong> </p>\n";
    print "<p> Click <a href=\"http://QQXYZQQ-SS-$url-invoke-learner\"> here\n";
    print "</a> to invoke the rule learner with these options.\n";
    print "This may take a minute or two.\n";
    &send_foot($url);
}

sub read_options
{
    local($datadir,$class) = @_;
    open(OPTIONS,"<$datadir/classes/$class/RIPPER_OPTS");
    @options = <OPTIONS>;
    close(OPTIONS);
    print STDERR "options: @options\n" if $debug;
    @options;
}

sub write_options
{
    local($datadir,$class,@options) = @_;
    open(OPTIONS,">$datadir/classes/$class/RIPPER_OPTS") || die("can't write to options file");
    print OPTIONS join("\n",@options),"\n";
    close(OPTIONS);
}
1;

#!/usr/local/bin/perl
##############################################################################
#
# ss-main.pl -- main routines for handling labeling commands
#
# invoked by trap-request.pl, so arguments are <match-string>
# and <other-arguments>....only match-string is used
#
##############################################################################
$ssdir="/home/wcohen/code/ss";
require "$ssdir/ss-util.pl";
require "$ssdir/ss-label.pl";
require "$ssdir/ss-review.pl";
require "$ssdir/ss-options.pl";
require "$ssdir/ss-search.pl";
require "$ssdir/ss-help.pl";
# load the user-defined learning program
require "$ENV{'SS_LEARNER'}";
$debug=0;
##################################################
# figure out class
($datadir,$class) = &init_ss;
##################################################
# figure out intended command, and perform it
$match = $ARGV[0];
print STDERR "ss-main.pl was invoked for match '$match'\n" if $debug;
if ($match =~ /GET.*QQXYZQQ-SS-(.*)-label-([NY])$/) {
    &label_command($datadir,$class,$1,$2);
} elsif ($match =~ /GET.*QQXYZQQ-SS-(.*)-review$/) {
    &review_command($datadir,$class,$1);
} elsif ($match =~ /GET.*QQXYZQQ-SS-(.*)-learn$/) {
    &learn_command($datadir,$class,$1);
} elsif ($match =~ /GET.*QQXYZQQ-SS-(.*)-invoke-learner$/) {
    &invoke_learner($datadir,$class,$1);
} elsif ($match =~ /GET.*QQXYZQQ-SS-(.*)-learner-options-form(\?.*)$/) {
    &process_learner_form($datadir,$class,$1,$2);
} elsif ($match =~ /GET.*QQXYZQQ-SS-(.*)-search$/) {
    &search_command($datadir,$class,$1);
} elsif ($match =~ /GET.*QQXYZQQ-SS-(.*)-ss-class-form(\?.*)$/) {
    &process_ss_class_form($datadir,$class,$1,$2);
} elsif ($match =~ /GET.*QQXYZQQ-SS-(.*)-set-ss-options$/) {
    &set_ss_options_command($datadir,$class,$1);
} elsif ($match =~ /GET.*QQXYZQQ-SS-(.*)-help$/) {
    &help_command($datadir,$class,$1);
} else {
    &send_error("Unknown command -- match was $match");
}

#!/usr/local/bin/perl
##############################################################################
#
# ss-marker.pl -- generates the text used to mark up html forms by markup.pl
#
# -- invoked by oreosh -b markup.pl
#
##############################################################################
$debug=0;
$ssdir="/home/wcohen/code/ss";
require "$ssdir/ss-util.pl";
print STDERR "$0: writing marker...\n" if $debug;
($datadir,$class) = &init_ss;
print STDERR "$0: class = $class...\n" if $debug;
print <<END_OF_INSERT;
<i>
<strong>Swimsuit:</strong>
Is this in the class '$class'?
[ <a href="http://QQXYZQQ-SS-%U-label-Y"> yes </a>
| <a href="http://QQXYZQQ-SS-%U-label-N"> no </a>
] <br>
[ <a href="http://QQXYZQQ-SS-%U-review"> Review Previous </a>
| <a href="http://QQXYZQQ-SS-%U-learn"> Learn </a>
| <a href="http://QQXYZQQ-SS-%U-search"> Search </a>
| <a href="http://QQXYZQQ-SS-%U-set-ss-options"> Set options </a>
| <a href="http://QQXYZQQ-SS-%U-help"> Help </a> ] </i>
<br>
END_OF_INSERT
print STDERR "$0: marker for class $class written\n" if $debug;
#!/usr/local/bin/perl
##############################################################################
#
# ss-options.pl -- routines to allow user to set options for the labeling
# system
#
##############################################################################
require "$ssdir/ss-util.pl";
$debug=1;
##################################################
# process a [Set Options] command from the
# inserted html page
sub set_ss_options_command #(datadir,class,url)
{
    local($datadir,$class,$url) = @_;
    local($old_class,$engine,$alt_engine);
    # send head of control panel
    print "<html><title>Control Panel</title>\n";
    print "<body><h2>Control Panel</h2><hr>\n";
    print "<form action=\"http://QQXYZQQ-SS-$url-ss-class-form\">\n";
    # the change classes section of the form
    # print a menu of old classes
    print "  <strong> Change the current class</strong> to an old class:\n";
    print "  <select name=\"oldclass\" size=1>\n";
    print "    <option selected> $class\n";
    opendir(CLASSDIR,"$datadir/classes") || die("can't list classes");
    while($old_class = readdir(CLASSDIR)) {
        next if ($old_class =~ /^\.\.?$/);
        next if ($old_class eq $class);
        print "    <option>$old_class\n";
    }
    closedir(CLASSDIR);
    print "  </select>\n";
    # print an option to enter a new class
    print "  or a new class: <input name=\"newclass\"> </input>\n";
    print "  <p>For now, please do not include punctuation or spaces\n";
    print "  in class names.</p>\n";
    # a change search-procedure form
    $engine = &load_engine_name($datadir);
    print "<hr>\n";
    print "<form action=\"http://QQXYZQQ-SS-$url-ss-class-form\">\n";
    # print a menu of known engines
    print "  <strong> Change the current search engine</strong>: \n";
    print "  <select name=\"engine\" size=1>\n";
    print "    <option selected> $engine\n";
    opendir(ENGINEDIR,"$datadir/engines") || die("can't list engines");
    while($alt_engine = readdir(ENGINEDIR)) {
        next if ($alt_engine =~ /^\.\.?$/);
        next if ($alt_engine =~ /~$/); # skip backups
        next if ($alt_engine eq $engine);
        print "    <option>$alt_engine\n";
    }
    closedir(ENGINEDIR);
    print "  </select>\n";
    # submit buttons
    print "  <br>\n";
    print "  <input type=\"submit\" value=\"Change values\"></input>\n";
    print "  as indicated above, or\n";
    print "  <input type=\"reset\" value=\"reset\"> </input> the form.\n";
    print "</form>\n";
    &send_foot($url);
}
##################################################
# process the form associated with the
# [Set options] page
sub process_ss_class_form #(datadir,class,options)
{
    local($datadir,$class,$url,$options) = @_;
    local($oldclass,$newclass,$oldengine,$newengine);
    print STDERR "$0: options = $options\n" if $debug;
    ($oldclass) = ($options =~ /[\?\&]oldclass=([^\&\/]*)[\&\/]/);
    ($newclass) = ($options =~ /[\?\&]newclass=([^\&\/]*)[\&\/]/);
    ($newengine) = ($options =~ /[\?\&]engine=(.*)$/);
    print STDERR "$0: oldcl=$oldclass\n" if $debug;
    print STDERR "$0: newcl=$newclass\n" if $debug;
    print STDERR "$0: engine=$engine\n" if $debug;
    $next_class = $newclass || $oldclass;
    unless (($next_class eq $class)) {
        print STDERR "$0: changing class: $class->$newclass\n" if $debug;
        open(CLASS,">$datadir/CLASS") || die("can't write current class");
        print CLASS "$next_class\n";
        close(CLASS);
        $class = $next_class;
    }
    $oldengine = &load_engine_name($datadir);
    unless (($newengine eq $oldengine)) {
        print STDERR "$0: changing engine to $newengine\n" if $debug;
        open(ENGINE,">$datadir/ENGINE") || die("can't write current engine");
        print ENGINE "$newengine\n";
        close(ENGINE);
    }
    print "<html><title>New option acknowledgement</title>\n";
    print "<body><h2>New option acknowledgement</h2>\n";
    print "<p><strong>Changed</strong>: the current class\n";
    print "is now &quot$class&quot</p>\n";
    print "<p><strong>Changed</strong>: the current search engine\n";
    print "is now &quot$newengine&quot</p>\n";
    &send_foot($url);
}

#!/usr/local/bin/perl
##############################################################################
#
# ss-review.pl -- routines to review previously labeled documents
#
##############################################################################
require "$ssdir/ss-util.pl";
require "$ssdir/cache-util.pl";

sub review_command #(datadir,class,$url)
{
    local($datadir,$class,$url) = @_;
    print "<html><title>Listing for class $class</title><body>\n";
    print "<h2>Previously marked examples of &quot<strong>$class</strong>&quot</h2>\n";
    &review_examples("Y","Positive");
    &review_examples("N","Negative");
    &send_foot($url);
}

sub review_examples #(short,long)
{
    local($lab,$labelname) = @_;
    local($title,$url,$file,$found);
    print "<h3>$labelname Examples</h3>\n<ul>\n";
    if (-e "$datadir/classes/$class/LISTING.$lab") {
        $found = open(LIST,"sort $datadir/classes/$class/LISTING.$lab | uniq|") || die("can't read/sort/uniq classes");
    }
    if (!$found) {
        print "<li> <i>No examples marked</i></li>\n";
    } else {
        while($url = <LIST>) {
            chop $url;
            $file = &url2file($url);
            $title = &html_title("$datadir/classes/$class/$lab/$file");
            print "<li><a href=\"$url\"> $title </a> </li>\n";
        }
        close(LIST);
    }
    print "</ul>\n";
}
1;

#!/usr/local/bin/perl
##############################################################################
#
# ss-search.pl -- routines to allow user to learn search commands
#
##############################################################################
$ssdir="/home/wcohen/code/ss";
require "$ssdir/ss-util.pl";
$debug=1;
##################################################
# process the form associated with the
# [Set options] page
sub search_command #(datadir,class,url)
{
    local($datadir,$class,$url) = @_;
    local(@fields,@rule);
    local($engine) = &load_engine_name($datadir);
    print "<html>\n";
    print "<title>Ways to search for class &quot$class&quot</title>\n";
    print "<body>\n";
    print "<h3>Ways to search for the class &quot$class&quot</h3>\n";
    if (!(-e "$datadir/classes/$class/web.hyp")) {
        print "<li><i>No rules have been learned</i>\n";
    } else {
        # access counter
        open(COUNTER,"<$datadir/classes/$class/COUNTER") || die("can't read counter");
        $count = <COUNTER>; chop($count);
        close(COUNTER);
        $count = "No" if (!$count);
        $verb = ($count==1) ? "example has" : "examples have";
        open(RULES,"$datadir/classes/$class/web.hyp") || die("can't read rules file");
        print "<p>$count $verb been labeled since these rules were learned.\n</p>";
        print "<p>The current search engine is $engine.</p>\n";
        print "<ul>\n";
        while (<RULES>) {
            @fields = split;
            @rule = ();
            if ($fields[0] eq "Y") {
                # collect terms from rule
                for ($i=6; $i<=$#fields; $i+=3) {
                    if ($fields[$i-1] eq '~') {
                        push(@rule,("+" . $fields[$i]));
                    } else {
                        push(@rule,("-" . $fields[$i]));
                    }
                }
                print "<li> ($fields[1] right/$fields[2] wrong) ";
                print STDERR "$datadir/engines/$engine @rule\n" if $debug;
                print `$datadir/engines/$engine @rule`;
                print "</li>\n";
            } elsif ($fields[0] eq "N") {
                print "<li> ($fields[1] right/$fields[2] wrong) ";
                print "Not covered by the rules above.\n</li>\n";
            }
        }
        close(RULES);
        print "</ul>\n";
    }
    &send_foot($url);
}
1;

#!/usr/local/bin/perl
##############################################################################
#
# ss-util.pl -- misc utilities
#
##############################################################################
$debug=0;
$ssroot="$ENV{'HOME'}/.ss";
##################################################
# initialize directories, return directory & class
sub init_ss
{
    local($ss,$data,$cl);
    $ss = $ssroot;
    if (!(-e "$ss")) {
        mkdir("$ss",0777) || die("can't create '$ss' directory");
    }
    $data = "$ss/data";
    if (!(-e $data)) {
        mkdir($data,0777) || die("can't create data directory");
    }
    if (!(-e "$data/engines")) {
        print STDERR "$0: need an $data/engines directory\n" if $debug;
        mkdir("$data/engines",0777) || die("can't create engine directory $data/engines");
    }
    if (!(-e "$data/ENGINE")) {
        open(ENGINE,">$data/ENGINE") || die("can't create $data/ENGINE");
        print ENGINE "webcrawler\n";
        close(ENGINE);
    }
    if (open(CLASS,"<$data/CLASS")) {
        $cl = <CLASS>; chop($cl);
        close(CLASS);
    } else {
        $cl = "cool";
    }
    if (!(-e "$data/classes")) {
        mkdir("$data/classes",0777);
    }
    if (!(-e "$data/classes/$cl")) {
        mkdir("$data/classes/$cl",0777);
    }
    ($data,$cl);
}
##################################################
# figure out the title of an html document
sub html_title #(filename)
{
    local($file) = @_;
    local($n,$start,$title);
    # get up to first N lines
    $n = 50;
    open(HTML,"<$file") || die("can't open HTML file $file");
    while (($_ = <HTML>) && $n--) {
        chop;
        $start .= ($_ . " ");
    }
    close(HTML);
    $title = $1 if ($start =~ /<title>(.*)<\/title>/i);
    $title = "Apparently untitled" if (!$title);
    # print STDERR "title is $title\n";
    $title;
}
##################################################
# send an error message
sub send_error #(message)
{
    print "<html>\n";
    print "<title>Error message</title>\n";
    print "<body>\n";
    print "<p>Error: $_[0]\n";
    print "</body>\n";
    print "</html>\n";
}
##################################################
# print a file footer
sub send_foot #(url)
{
    local($url) = @_;
    print <<END_OF_TEXT;
<hr>
<p><i>
[ <a href="http:$url"> Resume Browsing </a>
| <a href="http://QQXYZQQ-SS-$url-review"> Review Previous </a>
| <a href="http://QQXYZQQ-SS-$url-learn"> Learn </a>
| <a href="http://QQXYZQQ-SS-$url-search"> Search </a>
| <a href="http://QQXYZQQ-SS-$url-set-ss-options"> Set options </a>
| <a href="http://QQXYZQQ-SS-$url-help"> Help </a> ] </i></p>
</body>
</html>
END_OF_TEXT
}
##################################################
# convert a file to an example for ripper
# if wt is present it should be an argument " wt"
sub file2example #(datadir,stopwordfile,file,label,outhandle[,wt])
{
    local($datadir,$stopwordfile,$file,$label,$out,$wt) = @_;
    local($lines,@stopwords,@extra_stopwords,$extra_stopword_pattern);
    unless ($Stopword_pattern) {
        open(STOPWORDS,"$datadir/stopwords") || die("can't find stopwords file");
        @stopwords = <STOPWORDS>;
        close(STOPWORDS);
        chop(@stopwords);
        $Stopword_pattern = "\\b(" . join("|",@stopwords) . ")\\b";
    }
    if ($stopwordfile && -e $stopwordfile) {
        # class-specific stopwords
        open(STOPWORDS,"<$stopwordfile") || die("can't find stopwords file $stopwordfile");
        @extra_stopwords = <STOPWORDS>;
        close(STOPWORDS);
        chop(@extra_stopwords);
        $extra_stopword_pattern = "\\b(" . join("|",@extra_stopwords) . ")\\b";
    } else {
        $extra_stopword_pattern = "";
    }
    open(FILE,$file) || die("can't open html file $file");
    # skip header
    while (<FILE>) {
        last if /^\r?\n$/;
    }
    $lines=0;
    while (<FILE>) {
        last if $lines++ > 100;
        # delete e-mail addresses and stopwords
        s/$Stopword_pattern//gio;
        s/$extra_stopword_pattern//gio if $extra_stopword_pattern;
        s/\w+@[\w\.]+//g;
        # delete HTML special characters
        s/&.*;//g;
        # convert to lowercase
        tr/A-Z/a-z/;
        # delete complete HTML commands of the form <...>
        s/<[^>]*>//g;
        # now print what's left, again
        # deleting stuff between <..>'s
        if (s/<[^>]*$//) {
            # open without close
            # remove non-alphanumerics
            tr/a-z0-9\n/ /c;
            print $out $_;
            $open_bracket = 1;
        } elsif ($open_bracket && s/^[^<]*>//) {
            # close without open
            # remove non-alphanumerics
            tr/a-z0-9\n/ /c;
            print $out $_;
            $open_bracket = 0;
        } elsif (!$open_bracket) {
            # remove non-alphanumerics
            tr/a-z0-9\n/ /c;
            print $out $_;
        }
    }
    print $out ",\n$label$wt.\n";
    close(FILE);
}

sub load_engine_name #(datadir)
{
    local($datadir) = @_;
    local($engine);
    open(ENGINE,"<$datadir/ENGINE") || die("can't load search engine");
    chop($engine = <ENGINE>);
    close(ENGINE);
    $engine;
}
1;
Appendix 4. A copy of the documentation for the OreO shell tool which was used in the implementation of the presently-preferred embodiment.

Developing an OreO Agent

Table of Contents
• Types of OreO Agents
• Library Routines
• OreO Shell API
• Future Directions

OreO Agents

By some measures, OreO agents function as servers: they support connections from multiple clients and provide services to these clients. In this respect, designing an OreO agent (agent) uses the same techniques as designing any other network-based server.
We define a connection as consisting of two sockets: one to the "upstream" client and one to the "downstream" server. The OreO shell (oreosh) is responsible for setting up these connections and making them available to the actual processing code. An OreO agent is thus a combination of the OreO shell and some processing code; in the simplest case, a plain OreO shell acts as a simple pass-through mechanism. The agent receives HTTP request data on the client socket, and HTTP response data on the server socket.
OreO agents expect to see HTTP proxy requests. The HTTP proxy protocol simply specifies that URLs presented in the various HTTP methods will be absolute; normally, an HTTP server does not see the scheme and host address/port number portion of the URL. These proxy requests are then forwarded to the host specified by the OREO_PROXY environment variable (specified as a <host address>":"<port number> tuple).
Since the agent "speaks" the HTTP proxy protocol on both its upstream and downstream sides, OreO agents may be nested in a manner similar to Unix pipelines.
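As an illustration of the host-and-port form described for the proxy environment variable, the following is a minimal sketch in C of how such a string might be split. The function name parse_proxy_spec, its return convention, and the range check on the port are assumptions made for this example; they are not part of the OreO shell or liboreo.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of liboreo): split "<host>:<port>".
 * Copies the host into `host` and stores the port in `*port`.
 * Returns 0 on success, -1 if the string is malformed. */
int parse_proxy_spec(const char *spec, char *host, size_t hostlen, int *port)
{
    const char *colon;
    char *end;
    long p;
    size_t n;

    if (spec == NULL)
        return -1;
    colon = strrchr(spec, ':');
    if (colon == NULL || colon == spec)
        return -1;                  /* no colon, or empty host part */
    n = (size_t)(colon - spec);
    if (n >= hostlen)
        return -1;                  /* host does not fit in the buffer */
    p = strtol(colon + 1, &end, 10);
    if (end == colon + 1 || *end != '\0' || p < 1 || p > 65535)
        return -1;                  /* port missing, non-numeric, or out of range */
    memcpy(host, spec, n);
    host[n] = '\0';
    *port = (int)p;
    return 0;
}
```

An agent could call this with the value returned by getenv() for the proxy variable and then connect to the resulting host and port.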

When designing an agent, we can choose among several designs. The choices are:
• whether connections should be processed serially or in parallel
• whether a new process is generated for each connection.
These two choices yield four different agent models, which we discuss below. Our use of the term process is influenced by the canonical POSIX process model, which supports (at present) a single thread of control per process. The design of an agent will change dramatically for those systems (like Windows NT) that provide multiple threads of control per process.
Serial connections, multiple processes

In this model, a new process is generated for each connection and the shell waits for this process to finish before accept()ing another connection. This model is useful when the agent code requires sequential access to a shared resource, and no mechanism exists to synchronize shared access to that resource. This form of processing is enabled by specifying the -1 switch to the OreO shell.

Parallel connections, multiple processes

In this model, the shell guarantees a new process for each connection, but the shell immediately returns to accept() another incoming connection. This provides maximum parallelism, but not necessarily optimum throughput. The application must synchronize access to shared, writeable resources such as databases, files, etc.
In both instances, the shell supports different ways to process the HTTP request and response. The agent author can choose to filter the HTTP request, the HTTP response, or both. If only the request or response stream is filtered, the shell takes responsibility for forwarding the other.
The shell supports this via the following command line arguments.
-i  process the HTTP request stream
-o  process the HTTP response stream
-b  process the request and response streams
If no arguments are specified, the OreO shell simply copies its input and output from the client to the server, and vice versa.
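The mapping from these switches to a filtering mode can be sketched in C as follows. This is only an illustration: the enum, the function name parse_filter_mode, and the choice to treat -i together with -o as equivalent to -b are assumptions for the example, not a description of how oreosh parses its arguments.

```c
#include <string.h>

/* Which streams the processing code filters; the shell forwards the rest. */
enum filter_mode {
    PASS_THROUGH    = 0,   /* no switch: copy both directions unchanged */
    FILTER_REQUEST  = 1,   /* -i */
    FILTER_RESPONSE = 2,   /* -o */
    FILTER_BOTH     = 3    /* -b (assumed equal to -i plus -o) */
};

/* Hypothetical helper: scan argv for -i, -o and -b.
 * Any other argument is ignored in this sketch. */
enum filter_mode parse_filter_mode(int argc, char *const argv[])
{
    int mode = PASS_THROUGH;
    int i;

    for (i = 1; i < argc; i++) {
        if (strcmp(argv[i], "-i") == 0)
            mode |= FILTER_REQUEST;
        else if (strcmp(argv[i], "-o") == 0)
            mode |= FILTER_RESPONSE;
        else if (strcmp(argv[i], "-b") == 0)
            mode |= FILTER_BOTH;
    }
    return (enum filter_mode)mode;
}
```

Representing the modes as bit flags keeps the pass-through case as the natural default when no switch is given.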
When filtering the request stream, the shell arranges to connect the client socket to the standard input (stdin) of the child process, and the server socket to the standard output (stdout). This is reversed for processing the response stream.
Connecting the sockets in this way permits the use of pre-existing Unix style filters as processing code, or creating processing code in various script languages that easily read and write stdin and stdout.
Processes that read from the client side normally will never see EOF, since the client is waiting on that channel to receive the HTTP response.
Therefore, the shell intervenes on the processes behalf, and sends a SIGTERM when EOF is seen on the HTTP response stream. Processes that read the response stream will see EOF when the server closes the connection; at this point, the socket to the client can be closed after the data has been written.
If only one of -i or -o is specified, the shell takes responsibility for processing the other side of the connection.
Single process for all connections, serial processing
In this model, a single process (the co-process) is generated by the shell upon startup; the shell still generates the connections, but passes these connections to the co-process via an IPC mechanism.
The shell does not wait for the IPC to be acknowledged, but rather passes an identifier that uniquely identifies the particular pair of sockets corresponding to this connection. Once the co-process has taken control of these connections, the co-process acknowledges this to the shell, and the shell closes its copy of the sockets (this is necessary since the client side will never see an EOF on its socket if multiple processes have this connection open).
Single process for all connections, parallel processing
This implementation works exactly as described above under the serial processing case, but the co-process manipulates each connection in parallel instead of sequentially. Note that it is the responsibility of the co-process to implement sequential vs. parallel processing; the shell is always asynchronous with respect to transferring connections to the co-process.
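The text does not say which IPC mechanism carries a connection to the co-process. One mechanism commonly used for this on POSIX systems is descriptor passing over a Unix-domain socket with SCM_RIGHTS ancillary data; the sketch below illustrates that technique only as an assumption, and send_fd and recv_fd are hypothetical names.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send descriptor fd over the Unix-domain socket chan. Returns 0 or -1. */
int send_fd(int chan, int fd)
{
    char byte = 'F';                         /* one data byte carries the message */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } ctl;
    struct msghdr msg;
    struct cmsghdr *c;

    memset(&ctl, 0, sizeof ctl);
    memset(&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof ctl.buf;

    c = CMSG_FIRSTHDR(&msg);
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type = SCM_RIGHTS;               /* payload is a file descriptor */
    c->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(c), &fd, sizeof(int));

    return sendmsg(chan, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a descriptor from chan. Returns the new descriptor, or -1. */
int recv_fd(int chan)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } ctl;
    struct msghdr msg;
    struct cmsghdr *c;
    int fd = -1;

    memset(&ctl, 0, sizeof ctl);
    memset(&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof ctl.buf;

    if (recvmsg(chan, &msg, 0) != 1)
        return -1;
    c = CMSG_FIRSTHDR(&msg);
    if (c == NULL || c->cmsg_level != SOL_SOCKET || c->cmsg_type != SCM_RIGHTS)
        return -1;
    memcpy(&fd, CMSG_DATA(c), sizeof(int));
    return fd;
}
```

In the handshake described above, the shell would close its own copy of each socket only after the co-process acknowledges that it has received the descriptors, so that the client can eventually see EOF.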

Library Routines
This version of the OreO shell packages several functions into a library (liboreo.a). These routines are used both by the OreO shell and by the OreO Shell API functions. They are documented here as an aid to those who wish to program OreO agents at a low-level interface.
typedef int Socket;
int makeArgv(char * string, char * av[], int maxargs) Takes as input a text string, and returns a vector of pointers that now point to individual tokens in that string, where a token is defined to be a series of non-white-space characters separated by a series of white-space characters. White-space characters are spaces and tabs. Returns the number of tokens, which will be <= the maximum number of strings allowed. The caller must allocate space for the vector of pointers.
int readn(Socket s, void * buffer, unsigned int size) Like read(), but guarantees that size bytes are read before returning.
int written(Socket, void * buffer, unsigned int size) Like write(), but guarantees that the specified number of bytes will be written before the call returns. This is important because of protocol buffering and flow control: a plain write() call may well return fewer than the number of bytes requested.
int RecvRights(Socket IPSock, Socket client, Socket * server) This call returns sockets corresponding to the connections to the client and the downstream server. This call hides the mechanisms used to retrieve these sockets; such mechanisms are guaranteed to be different across operating systems, and may change from release to release.
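The behavior described for the read, write, and tokenizing routines can be illustrated with the usual loops. The functions below are hypothetical stand-ins (read_fully, write_fully, split_tokens), not the liboreo.a source; in particular, splitting the string in place by overwriting separators with NUL bytes is an assumption.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Read until size bytes have arrived, EOF, or an error.
 * Returns bytes read (less than size only at EOF), or -1 on error. */
ssize_t read_fully(int fd, void *buffer, size_t size)
{
    char *p = buffer;
    size_t left = size;
    while (left > 0) {
        ssize_t n = read(fd, p, left);
        if (n < 0) {
            if (errno == EINTR)
                continue;                /* interrupted: retry */
            return -1;
        }
        if (n == 0)
            break;                       /* EOF */
        p += n;
        left -= (size_t)n;
    }
    return (ssize_t)(size - left);
}

/* Write all size bytes, looping over short writes.
 * Returns size on success, or -1 on error. */
ssize_t write_fully(int fd, const void *buffer, size_t size)
{
    const char *p = buffer;
    size_t left = size;
    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        left -= (size_t)n;
    }
    return (ssize_t)size;
}

/* Split string in place on spaces and tabs, storing at most maxargs
 * token pointers in av. Returns the number of tokens stored. */
int split_tokens(char *string, char *av[], int maxargs)
{
    int n = 0;
    char *p = string;
    while (n < maxargs) {
        while (*p == ' ' || *p == '\t')
            p++;                         /* skip leading white space */
        if (*p == '\0')
            break;
        av[n++] = p;                     /* start of a token */
        while (*p != '\0' && *p != ' ' && *p != '\t')
            p++;
        if (*p == '\0')
            break;
        *p++ = '\0';                     /* terminate the token */
    }
    return n;
}
```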
Signal Handling
Agent writers should not have to worry about signal handling; in fact, a correct implementation relies on the default signal handling behavior as specified by the POSIX signal handling mechanisms.

OreO Shell API
In order to facilitate the creation of OreO agents, we have defined a higher-level API than that presented by the Winsock API. We call this the OreO shell API. This API presents the notion of a connection that can be created and deleted. Each connection contains two Sockets and a variable indicating the state of the connection: uninitialized, processing the request, processing the response, or terminating. This API supports either agents written using the co-process model or agents that receive their sockets on stdin and stdout.
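As an illustration of the connection notion, a record holding two sockets and a state variable might look like the sketch below. The layout, the type and constant names, and the strict ordering of states are all assumptions made for this example; only the Socket typedef and the four state descriptions come from the text.

```c
typedef int Socket;

/* The four connection states named in the description. */
enum conn_state {
    CONN_UNINITIALIZED,
    CONN_REQUEST,        /* processing the request */
    CONN_RESPONSE,       /* processing the response */
    CONN_TERMINATING
};

/* Hypothetical connection record: two sockets plus a state variable. */
struct osh_connection {
    Socket browser;      /* client side */
    Socket proxy;        /* downstream server side */
    enum conn_state state;
};

/* Move the connection to its next phase; terminating is final. */
enum conn_state conn_advance(struct osh_connection *c)
{
    switch (c->state) {
    case CONN_UNINITIALIZED: c->state = CONN_REQUEST;     break;
    case CONN_REQUEST:       c->state = CONN_RESPONSE;    break;
    case CONN_RESPONSE:      c->state = CONN_TERMINATING; break;
    case CONN_TERMINATING:   break;
    }
    return c->state;
}
```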
The following is a rudimentary example of using the Shell API to implement an agent that could be invoked via the -b switch.
ConnectionPtr cp = newOSHConnection(StdioConnection);
// process the request
while ((n = OSHConnectionRead(cp->browser, buffer, sizeof buffer)) > 0)
    (void)OSHConnectionWrite(cp->proxy, buffer, n);
// process the response
while ((n = OSHConnectionRead(cp->proxy, buffer, sizeof buffer)) > 0)
    (void)OSHConnectionWrite(cp->browser, buffer, n);
deleteOSHConnection(&cp);
This same code would be suitable for generating a program to be used as a co-process; in this case, the connection would be created by a call to newOSHConnection(IPCConnection).
Future Directions
This is the first version of a UNIX (POSIX) release. Future releases will differ in implementation details; however, the interfaces defined above will not change, nor will the implementation defined by the OreO Shell API.
One notion is to re-implement the OreO shell as an agent analogue of the Internet inetd. In this version, the shell would initialize via a configuration mechanism that would indicate a specific port number, a process to run, and how that process should be started.
The shell would accept connections on all such port numbers, and generate the appropriate sequence of commands to start the appropriate agents.
An alternative would be to re-implement the shell as a "location broker" for agents, in the style of the DEC RPC daemon. Processes would connect to the Agent daemon, and request services; if available, the daemon would redirect these requests to the appropriate agent. This would probably require a change to the HTTP proxy protocol model.

While the invention has been shown and described with respect to preferred embodiments, various modifications can be made therein without departing from the spirit and scope of the invention, as described in the specification and defined in the claims, as follows:
I claim:

Claims (33)

1. A method of adding new documents to a resource list of existing documents, comprising the steps of:
learning selection information which selects the documents on the resource list;
making a persistent association between the selection information and the resource list;
using the selection information to select a set of documents which the information specifies; and adding new documents to the resource list, the new documents being added belonging to a subset of the selected set of documents which contains documents which are not already on the resource list.
2. The method set forth in claim 1 wherein the step of adding documents comprises the steps of:
interactively determining whether a document in the subset should be added to the resource list; and adding the document only if it has been determined that the document should be added.
3. The method set forth in claim 2 further comprising the steps of:

using a document for which it has been determined that the document should not be added together with documents on the resource list to learn new selection information; and associating the new selection information with the resource list.
4. The method set forth in claim 1 wherein the step of learning the selection information comprises the steps of:
learning a rule for which the documents on the resource list are positive examples;
translating the rule into a query; and in the step of using the selection information, using the query to select the set of documents.
5. The method set forth in any of claims 1 through 4 wherein:
a system in which the method is practiced has access to a plurality of searching means;
the step of learning the selection information learns a plurality of queries as required by the plurality of searching means; and the step of using the selection information to select a set of documents uses the plurality of queries in the plurality of searching means.
6. The method set forth in claim 5 wherein:
a system in which the method is practiced has access to the world wide web; and the searching means are searching means in the world wide web.
7. An improved web page of a type which contains a list of documents, the improvement comprising:
selection information associated with the web page which selects documents having content which is similar to the documents on the list, whereby the list of documents on the web page may be updated using the selection information.
8. Apparatus for making a resource list of documents which have contents belonging to a class, the apparatus comprising:

a first list of documents, all of which have contents belonging to the class;
a second list of documents, none of which have contents belonging to the class;
learning means responsive to the first list of documents and the second list of documents for learning selection information which specifies documents whose contents belong to the class;
means responsive to the selection information for finding the documents whose contents belong to the class, using the documents to make the resource list, and making a persistent association between the selection information and the resource list.
9. The apparatus set forth in claim 8 further comprising:
first interactive means for indicating whether a given document is to be added to the first list or the second list.
10. The apparatus set forth in claim 9 further comprising:
second interactive means for activating the learning means.
11. The apparatus set forth in claim 10 further comprising:
third interactive means for activating the means for finding the documents.
12. The apparatus set forth in any of claims 9 through 11 wherein:
the apparatus is used in a system which includes a document browser; and the interactive means are implemented in the document browser.
13. In an information system which stores related data and information as items for a plurality of interconnected computers accessible by a plurality of users, a method for finding items of a particular class residing in the information system comprising the steps of:
a) identifying as training data a plurality of items characterized as positive and/or negative examples of the class;

b) using a learning technique to generate from the training data at least one query that can be submitted to any of a plurality of methods for searching the information system;

c) submitting said query to at least one search method and collecting any new item(s) as a response to the query;

d) evaluating the new item(s) by a learned model with the aim of verifying that the new item(s) is indeed a new subset of the particular class; and e) presenting the new subset of the new item(s) to a user of the system.
14. The method of claim 13 wherein the information system is a distributed information system (DIS) and the items are documents collected in resource directories in the DIS.
15. The method of claim 14 wherein step a) the positive examples are a set of documents in the resource directories and the negative examples are a selection of documents obtained by using the process of steps a-d.
16. The method of claim 15 wherein step b) the query is (i) a conjunction of terms which must appear in a document as a positive example; (ii) contains all the terms appearing in the training data covered by the query, and (iii) learned by the system using a propositional rule-learning or prediction algorithm method.
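17. The method of claim 16 wherein step d) a learning technique generates from the training data a learned model that computes a score for the new item(s), such that the new item(s) which has a high probability of being classified within the particular class will be assigned a higher score than the new item(s) which has a low probability of being classified within the particular class.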
18. The method of claim 17 further comprising the step of providing a user on the system an ordered list of the new item(s) according to the score assigned by the learned model.
19. The method of claim 17 further comprising the step of providing a user by electronic mail or facsimile an ordered list of the new item(s) having a score exceeding a threshold probability.
20. The method of claim 17 further comprising the step of using a batch process to identify documents as positive or negative examples of a search concept.
21. The method of claim 17 further comprising the step of using an interactive process to identify documents as positive examples of a search concept by browsing the distributed information system.
22. The method of claim 17 further comprising the step of resubmitting a query to the system to detect any new item added to the system and related to the query.
23. An information system which stores related data and information as items for a plurality of interconnected computers accessible by a plurality of users for finding items of a particular class residing in the information system using query learning and meta search, comprising:
a) means for identifying as training data in the system a plurality of items characterized as positive and/or negative examples of the class;
b) means for using a learning technique to generate from the training data at least one query that can be submitted to any of a plurality of search engines for searching the information system;
c) means for submitting said query to at least one search engine and collecting any new item(s) as a response to the query;
d) means for evaluating the new item(s) by the at least one search engine with the aim of verifying that the new item(s) is indeed a new subset of the particular class; and e) means for presenting the new subset of the new item(s) to a user of the system.
24. The system of claim 23 wherein the information system is a distributed information system (DIS) and the items are documents stored in resource directories in the DIS.
25. The system of claim 24 wherein the positive examples are a set of items in the resource directories and the negative examples are a selection of documents obtained by the search engine in responding to the query.
26. The system of claim 25 wherein the query is (i) a conjunction of terms which must appear in a document as a positive example; (ii) contains all the terms appearing in the training data covered by the query, and (iii) learned by the system using a propositional rule-learning or prediction algorithm method.
27. The system of claim 26 wherein the learning technique generates from the training data a learned model that computes a score for the new item(s), such that the new item(s) which has a high probability of being classified within the particular class will be assigned a higher score than the new item(s) which has a low probability of being classified within the particular class.
28. The system of claim 27 further comprising means for providing a user on the system an ordered list of the new item(s) according to the score assigned by the learned model.
29. The system of claim 27 further comprising means for providing a user by electronic mail or facsimile an ordered list of the new item(s) having a score exceeding a threshold probability.
30. The system of claim 27 further comprising means for using a batch process to select documents as positive examples of a search concept.
31. The system of claim 27 further comprising means for using an interactive process to identify documents as positive examples of the query by browsing the distributed information system.
32. The system of claim 27 further comprising means for resubmitting a query to the system to detect any new item added to the system and related to the query.
33. An article of manufacture comprising:
a computer useable medium having computer readable program code means embodied therein for finding items of a particular class residing in an information system which stores related data and information as items for a plurality of interconnected computers accessible by a plurality of users, the computer readable program code means in said article of manufacture comprising:
a) program code means for identifying as training data a plurality of items characterized as positive and/or negative examples of the class;
b) program code means for using a learning technique to generate from the training data at least one query that can be submitted to any of a plurality of methods for searching the information system;
c) program code means for submitting said query to at least one search method and collecting any new item(s) as a response to the query;
d) program code means for evaluating the new item(s) by the at least one search method with the aim of verifying that the new item(s) is indeed a new subset of the particular class; and e) program code means for presenting the new subset of new item(s) to a user of the system.
CA002245913A 1996-04-10 1997-04-09 A system and method for finding information in a distributed information system using query learning and meta search Expired - Lifetime CA2245913C (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US1523196P 1996-04-10 1996-04-10
US60/015,231 1996-04-10
PCT/US1997/005355 WO1997038377A1 (en) 1996-04-10 1997-04-09 A system and method for finding information in a distributed information system using query learning and meta search

Publications (2)

Publication Number Publication Date
CA2245913A1 CA2245913A1 (en) 1997-10-16
CA2245913C true CA2245913C (en) 2002-06-11

Family

ID=21770237

Family Applications (1)

Application Number Title Priority Date Filing Date
CA002245913A Expired - Lifetime CA2245913C (en) 1996-04-10 1997-04-09 A system and method for finding information in a distributed information system using query learning and meta search

Country Status (3)

Country Link
US (1) US6418432B1 (en)
CA (1) CA2245913C (en)
WO (1) WO1997038377A1 (en)

Also Published As

Publication number Publication date
US6418432B1 (en) 2002-07-09
CA2245913A1 (en) 1997-10-16
WO1997038377A1 (en) 1997-10-16

Legal Events

Date Code Title Description
EEER Examination request
MKEX Expiry

Effective date: 20170410