CA2235868A1 - Method for converting formatted documents to ordered word lists - Google Patents

Publication number
CA2235868A1
Authority
CA
Canada
Prior art keywords
words
word
fragments
ordered
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
CA002235868A
Other languages
French (fr)
Inventor
Jeremy Dion
Robert A. Eustace
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Digital Equipment Corp
Original Assignee
Digital Equipment Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Digital Equipment Corp

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/131 Fragmentation of text files, e.g. creating reusable text-blocks; Linking to fragments, e.g. using XInclude; Namespaces
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Abstract

A computer implemented method is applied to convert a formatted document or text to an ordered list of words.
The formatted document is first partitioned into first and second data structures stored in a memory of a computer.
The first data structure stores text fragments, and the second data structure stores code fragments of the formatted document. Adjacent text fragments are concatenated to form possible ordered word lists. Possible words are matched against a dictionary of representative words. A best ordered word list having the fewest number of words is selected from the possible ordered word lists.

Description

CA 02235868 1998-04-23
METHOD FOR CONVERTING FORMATTED DOCUMENTS TO ORDERED WORD LISTS

FIELD OF THE INVENTION
This invention relates generally to converting formatted documents, and more particularly to converting documents that are formatted with a mark-up language.

BACKGROUND OF THE INVENTION
PostScript and its variant Portable Document Format (PDF) are standard mark-up languages for formatting documents produced by word processing software programs. With a mark-up language, it is possible to exactly reproduce text, graphics, and bit maps (generally "text") on a printed page or display screen. As an advantage, formatted documents are easily communicated and processed by many different types of output devices.

In formatted document files, text fragments and formatting commands for rendering the text are interleaved. The formatted documents are processed by interpreters. An interpreter reads the formatted file to "execute" the commands so that the location of the dots of ink on the page or the pixels on a screen can be determined exactly.
The interpreter does not deal with words or sentences as such, but with more fundamental document components such as characters, lines, and graphics.

An excerpt from a PostScript formatted document may include the following commands and text:
"%!PS-Adobe-2.0 ...
16b(Re)o(ad)f(b)q(et)o(we)o(en)I(the lines!)."

In PostScript, text fragments are enclosed in parentheses, and the commands are interspersed among the text. A text fragment can be a single character, a sequence of characters, a word, or parts of multiple words delimited by, perhaps, blanks and punctuation marks. As shown in the example above, words may often be split over several fragments so that the beginnings and ends of the words themselves are difficult to discern.

The commands between the text fragments move the cursor to new positions on the page or new coordinates on the display, usually to modify the spacing between the letters and lines. Word separators, such as space characters visible in plain text, are usually not indicated in the formatted text; instead, explicit cursor movement commands are used. Hence, word separators only become apparent as additional white space when the text is rendered.

The general problem of determining where words start and end, i.e., word ordering, is difficult. PostScript does not require that characters be rendered in a left-to-right order on lines, and a top-to-bottom order on the page or display. Indeed, the characters may be rendered in any order and at arbitrary positions.

Therefore, the only completely reliable way to identify words in a formatted document is to interpret the commands down to the character level, and to record the position and orientation of the characters as they are rendered. Then, characters that are close enough together on the same line, according to some threshold, and taking the character's font and size into consideration, are assumed to be in the same word. Those characters which are farther apart than the threshold are assigned to different words.
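
The threshold test described above can be sketched as follows. This is only an illustration of the interpreter-based approach, not code from the source: the `Glyph` record, its fields, and the fixed `threshold` value are assumptions, and a real interpreter would derive the threshold from the font and size.

```python
from dataclasses import dataclass

@dataclass
class Glyph:
    char: str
    x: float       # left edge of the rendered character (hypothetical units)
    width: float   # advance width at the current font size

def group_into_words(glyphs, threshold):
    """Split one rendered line into words wherever the gap exceeds threshold."""
    words, current, prev = [], "", None
    for g in sorted(glyphs, key=lambda g: g.x):
        # Gap between the right edge of the previous glyph and this one.
        if prev is not None and g.x - (prev.x + prev.width) > threshold:
            words.append(current)
            current = ""
        current += g.char
        prev = g
    if current:
        words.append(current)
    return words

line = [Glyph("R", 0, 6), Glyph("e", 6, 5), Glyph("a", 11.2, 5), Glyph("d", 16.2, 5),
        Glyph("b", 25, 5), Glyph("e", 30, 5), Glyph("t", 35, 3)]
words = group_into_words(line, threshold=2.0)
```

Note that every glyph position must be known before any word can be emitted, which is why this approach requires full interpretation of the commands.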

Finding the correct position of each character is particularly useful when rendering text for reading, since tabs, line spacing, centering, and other visual formatting attributes facilitate comprehension of the text. As is evident, exactly locating words in formatted text can be computationally more expensive than simply rendering the text for reading.

This becomes a problem if it is desired to automatically process formatted documents in order to create, for example, an index of the words. On the World Wide Web (the "Web"), many documents are available in PostScript (or PDF) formats. This allows users of the Web to exactly reproduce graphically rich documents as they were originally authored.

In order to locate documents of interest on the Web, it is common to use a search engine such as AltaVista (tm) or Lycos (tm). With a search engine, the user specifies one or more key words. The search engine then attempts to locate all documents that include the specified key words.
For this purpose, the exact location of the words on the page is of minimal interest; only their relative order matters.

Some known techniques for indexing formatted documents, such as by using the PostScript interpreter Ghostscript, perform a total interpretation of the formatting commands, and apply some other heuristic to recover word delineations. This takes time.

A simple sampling of the Web would seem to indicate that the Web contains hundreds of thousands of formatted documents having a minimum total projected size of some 40 Gigabytes. With traditional formatted document parsing techniques, which can process about 400 bytes per second, it would take about 1200 days to index the bulk of the current PostScript formatted Web documents. Given the rapid growth of the Web, indexing the Web using known techniques would be a formidable task.
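
The estimate can be checked with a few lines of arithmetic. The sketch below assumes decimal gigabytes (10^9 bytes), which the text does not specify; under that assumption the figure comes out slightly below the rounded "about 1200 days".

```python
SECONDS_PER_DAY = 24 * 60 * 60

corpus_bytes = 40 * 10**9      # "some 40 Gigabytes" (decimal gigabytes assumed)
bytes_per_second = 400         # quoted rate for traditional parsing

days = corpus_bytes / bytes_per_second / SECONDS_PER_DAY
print(f"{days:.0f} days")      # roughly 1157 days, i.e. a little over three years
```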

SUMMARY OF THE INVENTION
Provided herein is a high-speed computer implemented method for converting a formatted document to an ordered list of words. This method can, on an average, convert formatted Web documents about fifty times faster than known methods.

The invention, in its broad form, resides in a computer-implemented method for converting a formatted text to an ordered list of words, as recited in claim 1.
According to the method described hereinafter, the formatted document is first partitioned into first and second data structures stored in a memory of a computer by separately identifying text and code fragments of the formatted document. The first data structure stores the text fragments, and the second data structure stores the code fragments of the formatted document.

Adjacent text fragments are locally concatenated and matched against a word dictionary to form possible ordered word lists. This list contains every possible word that could be made from the text fragments contained in the document. A best ordered word list is formed by choosing a set of words that includes all of the text fragments and contains the fewest number of words.

As described hereinafter, the text and code fragments are organized as arcs and nodes of a graph. The nodes represent the code fragments, or equivalently the gaps between text fragments. In addition, the nodes define all places where a word might begin or end. An arc between two nodes represents the possibility of concatenating the intervening text fragments into a single word. The best possible word list is the one which can be graphically represented by the smallest chain of arcs starting at the first node and ending at the last node, and where each arc ends at a node where the next arc begins. This corresponds to a covering of the text fragments with the smallest number of words, each word defined by one arc. In the case where there are multiple best ordered lists, we select the one with the highest minimum weight. The weight of an arc is determined by the number of times the word defined by the arc is used in a large corpus of documents.

Advantageously, the best possible word list is used to annotate the code fragments to show whether they represent a word break or not. Because code fragments reoccur frequently in documents, this accumulation of local information allows for a global determination to be made whether a particular code fragment is more likely to bind adjacent text fragments into a word, or to separate them.
The global determination is used to correct occasional errors in the local matching.

BRIEF DESCRIPTION OF THE DRAWINGS
A more detailed understanding of the invention may be had from the following description of a preferred embodiment, given by way of example, and to be understood with reference to the accompanying drawing wherein:

Figure 1 is a graph showing text and code fragments of a formatted document represented as arcs and nodes;
Figure 2 is an augmented graph showing possible choices of words in the formatted document;
Figure 3 is the augmented graph including weighted orderings of the words;
Figure 4 is a graph showing code fragments used as disjunctions and conjunctions of the text fragments;
Figure 5 is a table showing unique code fragment entries; and
Figure 6 is a flow diagram of a process for converting formatted documents according to a preferred embodiment of the invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

Described hereinafter is a method for converting a document to an ordered word list without performing a full interpretation of the commands that format the document.
The ordered word list can be indexed, or simply printed or displayed as text for perusal. We take advantage of the observation that in order to just produce an ordered list of words from the formatted document, it is not necessary to generate a perfectly formatted output.

In theory, the characters which compose a document can be rendered in any order; however in practice, document formatting systems invariably render text fragments in the same order as they would be read. Thus, the order of the parenthetical text fragments in, for example, a PostScript formatted document is their correct order in the output text.

This means a full interpretation of the commands, for the purpose of indexing words, is not necessary. We propose that a formatted document first be partitioned and organized in a memory as two separate data structures. A first data structure stores text fragments in their correct order. The fragments can be located by sequentially reading the document and identifying parenthetically enclosed strings of characters. The second data structure simply stores the remaining fragments, e.g., the commands that are interspersed among the text. We call these code fragments.
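
The partitioning step might be sketched as below, applied to the PostScript excerpt quoted earlier. The function name `partition` and the regular expression are illustrative assumptions; in particular, the sketch assumes text fragments contain no nested or escaped parentheses, which real PostScript permits.

```python
import re

# Assumption: text fragments are simple "(...)" strings with no nested or
# escaped parentheses; real PostScript allows both.
TEXT_FRAGMENT = re.compile(r"\(([^()]*)\)")

def partition(formatted):
    """Return (text_fragments, code_fragments) in document order."""
    # Splitting on a pattern with one capture group interleaves the pieces:
    # code, text, code, text, ..., code.
    pieces = TEXT_FRAGMENT.split(formatted)
    text_fragments = pieces[1::2]   # captured contents of each "(...)"
    code_fragments = pieces[0::2]   # everything between them, both ends included
    return text_fragments, code_fragments

texts, codes = partition("16b(Re)o(ad)f(b)q(et)o(we)o(en)I(the lines!).")
# texts -> ['Re', 'ad', 'b', 'et', 'we', 'en', 'the lines!']
# codes -> ['16b', 'o', 'f', 'q', 'o', 'o', 'I', '.']
```

There is always one more code fragment than text fragments, which matches the view of code fragments as the nodes that bound each text-fragment arc.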

Figure 1 shows how we represent the organization of the partitioned documents as a graph 100. In Figure 1, the first data structure is represented by arcs 102, i.e., the text fragments. The code fragments that separate arcs 102 are represented by nodes 101. Each node 101 represents a position where a break between words may occur.

At this point in order to recover the words, a number of different concatenations of the text fragments are possible, for example:
Re ad b et we en the lines!
Re ad bet we en the lines!
Re ad b et ween the lines!
Re ad bet ween the lines!
Re ad between the lines!
Read b et we en the lines!
Read bet we en the lines!
Read b et ween the lines!
Read bet ween the lines!
Read between the lines!

We now make a second observation. A reader when faced with a string such as: "Readbetweenthe lines" tries to find the "best fit" of recognizable words in the string. We suggest a computer implemented method that mimics this behavior.

Potential "words" are looked up (or matched) in a frequency-weighted dictionary of words. For example, the dictionary that is maintained by the AltaVista search engine can be used. This dictionary reflects word usage in a large corpus of Web documents and newsletters.
Associated with each word is a frequency count of how often each word occurs in the corpus as a whole.

As shown in Figure 2, by looking up possible text fragment combinations in the dictionary, an augmented graph 200 can be generated with additional arcs 103 which represent local concatenations of adjacent fragments to form possible words. The concatenations can readily be represented in the first data structure using, for example, pointers or special delimiters.
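
One way to enumerate the additional arcs is sketched below. The function `candidate_arcs`, the `max_span` limit, the case-insensitive lookup, and the toy dictionary are all assumptions made for illustration; the source only states that concatenations of adjacent fragments are looked up in a dictionary.

```python
def candidate_arcs(fragments, dictionary, max_span=8):
    """Return arcs (start_node, end_node, word) for an augmented graph.

    Node i sits before fragments[i]; node len(fragments) follows the last one.
    """
    arcs = []
    n = len(fragments)
    for i in range(n):
        # Each single fragment is always an arc (the unaugmented graph).
        arcs.append((i, i + 1, fragments[i]))
        # Longer arcs exist only where the concatenation is a known word.
        for j in range(i + 2, min(n, i + max_span) + 1):
            word = "".join(fragments[i:j])
            if word.lower() in dictionary:
                arcs.append((i, j, word))
    return arcs

fragments = ["Re", "ad", "b", "et", "we", "en", "the lines!"]
dictionary = {"read", "bet", "ween", "between"}   # toy stand-in dictionary
arcs = candidate_arcs(fragments, dictionary)
```

With this toy dictionary the seven single-fragment arcs are joined by four added arcs: "Read", "bet", "ween", and "between".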

There are many possible pathways (orderings) through the graph 200. In one embodiment of the invention, a best ordering of words is along the path which has the fewest number of arcs, e.g., "Read between the lines!" This path corresponds to a possible ordering which has the fewest number of words. Documents partitioned and organized in this manner can form the basis for how the words of the document are ordered and indexed without any time consuming interpretation or processing of the formatting commands themselves.
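
Because every arc counts as one word, the path with the fewest arcs can be found with an ordinary breadth-first search. The sketch below is one possible realization, not necessarily the one used by the inventors; the arc list is written out by hand for the running example, and ties between equally short paths are resolved arbitrarily here.

```python
from collections import deque

def fewest_arcs_path(arcs, first, last):
    """Breadth-first search for a shortest chain of arcs from first to last node."""
    outgoing = {}
    for start, end, word in arcs:
        outgoing.setdefault(start, []).append((end, word))
    came_from = {first: None}          # node -> (previous node, word on the arc)
    queue = deque([first])
    while queue:
        node = queue.popleft()
        if node == last:
            break
        for end, word in outgoing.get(node, []):
            if end not in came_from:   # first visit is via a shortest chain
                came_from[end] = (node, word)
                queue.append(end)
    if last not in came_from:
        return None
    words, node = [], last
    while came_from[node] is not None:
        node, word = came_from[node]
        words.append(word)
    return words[::-1]

arcs = [(0, 1, "Re"), (1, 2, "ad"), (2, 3, "b"), (3, 4, "et"), (4, 5, "we"),
        (5, 6, "en"), (6, 7, "the lines!"),
        (0, 2, "Read"), (2, 4, "bet"), (4, 6, "ween"), (2, 6, "between")]
best = fewest_arcs_path(arcs, first=0, last=7)
# best -> ['Read', 'between', 'the lines!']
```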

As shown in the graph 300 of Figure 3 for one variation on our embodiment, we can also weight each arc by the word's frequency 104 as determined from a large corpus of documents. Then, our best path algorithm can factor in these weights during the best path determination. For example, if there are multiple best paths with the identical number of fewest arcs, i.e., a tie, then the weighted arcs can be sorted in ascending order, and the path with the largest minimally weighted arc is selected.

For example, consider the case where there are two possible paths of three arcs each. The sorted weights in the first path are {300, 800, 1300} and the sorted weights in the second path are {300, 600, 1200}. The best path is found by comparing the two lists element by element, from lowest to highest weight, and choosing the path that first shows a higher weight. In this case, the first path is selected, since the first element in both lists is the same (300), and the second element in the first path is larger (800 > 600).
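
This tie-break amounts to a lexicographic comparison of the ascending-sorted weight lists. A minimal sketch, assuming each tied path is given simply as a list of its arc weights (the function name `best_by_weight` is illustrative):

```python
def best_by_weight(paths):
    """Pick the path whose ascending-sorted weights compare highest.

    Python compares lists element by element, so the first differing
    position (scanning from the lowest weight upward) decides the winner.
    """
    return max(paths, key=lambda weights: sorted(weights))

first_path = [800, 300, 1300]     # weights in path order, not yet sorted
second_path = [1200, 300, 600]
chosen = best_by_weight([first_path, second_path])
# chosen -> [800, 300, 1300], because 300 == 300 and then 800 > 600
```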

Intuitively, this algorithm avoids paths that contain very uncommon words; the strategy therefore tends to penalize sentences that use rare, or "low" weight, words.

It is to be realized that in some cases the best path chosen as described above is the wrong path. For example, if any of the words of the original document are not found in the dictionary, then we cannot completely augment the graph. There may also be other rare cases where the path with the fewest number of arcs does not exactly reflect the word separation as intended by the formatting commands.

Therefore, a method is proposed which can be applied to the second data structure, i.e., the nodes 101, to improve the accuracy of the output sequential word list. This is described with reference to graph 400 of Figure 4. Here, we rely on a third observation which can be made about formatted documents.

When a code fragment is used repeatedly in a document, such as the code fragment "o" of nodes 401 separating the fragments "(Re)o(ad)," "(et)o(we)," and "(we)o(en)," the code fragment is almost always used in a consistent way throughout the document. For example, the code fragment either adds a small spacing within the word, or a large spacing between words, but almost never both.

This means that it is possible to use the local word matching technique described above to accumulate information about how the code fragments are used to bind the text fragments without actually interpreting the commands. If word matching suggests that almost all uses of the code fragment "o" are within a word, then in the few cases in which the local matching technique might have indicated a use between words, those guesses are probably wrong. Thus, we can use global information about how fragments are concatenated to find local errors in the matches.

In Figure 4, the best path according to word length and word frequency is highlighted in dark lines. Thus, we would consider the command "f" 402 to be a word separator. Likewise, the command "o" 401 is used as a "conjunction" between text fragments.

As shown in Figure 5, we maintain a table 500 of all code fragments as they are detected. There is one entry 510 in the table for each unique code fragment. Associated with each entry is a "disjunction" field 520 and a "conjunction" field 530.

In order to minimize the amount of memory required, and to accelerate matching, we use a fingerprinting technique to convert the variable-length code fragments to fixed-length bit strings. Fingerprinting is a well known technique that can convert character strings of arbitrary length to, for example, 64 bit words. As an advantage of fingerprinting, there is only a minute probability that two different character strings will have the same fingerprint. This means that the fingerprints are substantially as unique as the code fragments they represent.
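
The source does not say which fingerprinting function is used. As a stand-in, the sketch below truncates a BLAKE2b hash to 64 bits using Python's standard `hashlib`; this is a substituted technique chosen for illustration, and the function name `fingerprint` is an assumption.

```python
import hashlib

def fingerprint(code_fragment: str) -> int:
    """Map a variable-length code fragment to a 64-bit integer.

    Stand-in only: a BLAKE2b digest truncated to 8 bytes, not the
    fingerprinting scheme of the original method, which is unspecified.
    """
    digest = hashlib.blake2b(code_fragment.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

fp_o = fingerprint("o")
fp_long = fingerprint("0 -12.5 rmoveto /F1 10 selectfont")   # made-up long fragment
```

Both values fit in a single 64-bit word regardless of the fragment length, so table entries have a fixed size and can be compared with one integer comparison.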

Note that, as described, no attempt is made to determine the true meanings or functions of the commands of the code fragments, since this would consume time. We "learn" their meanings from the local matching. This means, as an advantage, that our method does not need a complete and detailed grammatical specification of the many variants of formatting language that can be used.

During word matching, the fields 520 or 530 are appropriately incremented depending on how fragments are locally matched up into words. After matching up all of the text fragments into words, we use the table 500 to make a global decision for each code fragment to determine whether our matching guesses were correct. For example, if the code fragment is used more often as a disjunction than as a conjunction, then, using a simple majority rule, the code fragment can be globally characterized as a word break. Similarly, conjunctions of text fragments can be confirmed or corrected.

During a final pass, we can produce an ordered word list suitable for indexing the document. Alternatively, the words can be printed, displayed, or written to a file for further processing by a text editor.
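
The counting and the majority rule might look like the sketch below. For readability the table is keyed by the raw code fragment rather than by its fingerprint, and the names `tally` and `is_word_break`, as well as the handling of ties, are assumptions not taken from the source.

```python
from collections import defaultdict

def tally(code_fragments, word_arcs):
    """Count, per code fragment, uses as a word break vs. a within-word join.

    code_fragments[k] is the code at node k; word_arcs are (start, end) node
    pairs of the locally chosen best path.
    """
    breaks = {start for start, _ in word_arcs} | {end for _, end in word_arcs}
    table = defaultdict(lambda: {"disjunction": 0, "conjunction": 0})
    # Skip the first and last nodes: they bound the text, not a gap inside it.
    for k in range(1, len(code_fragments) - 1):
        field = "disjunction" if k in breaks else "conjunction"
        table[code_fragments[k]][field] += 1
    return table

def is_word_break(entry):
    """Simple majority rule; ties are treated as conjunctions (an assumption)."""
    return entry["disjunction"] > entry["conjunction"]

codes = ["16b", "o", "f", "q", "o", "o", "I", "."]
table = tally(codes, [(0, 2), (2, 6), (6, 7)])   # Read | between | the lines!
```

For the running example, "o" is counted three times as a conjunction, while "f" and "I" are each counted once as a disjunction, so the majority rule marks "f" and "I" as word breaks.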

Figure 6 shows the data structures and process steps 600 of our preferred embodiment of the invention. A formatted document 601 is partitioned into text fragments and code fragments in step 610. In step 620, the text and code fragments are organized into arcs and nodes of a graph 630. The text fragments are locally matched in step 640 against a word dictionary 650, and optionally weighted (step 660) to generate an augmented graph 670. The code fragments are fingerprinted in step 690 to globally correct the augmented graph 670 and produce an ordered best word list 680.

Our processing of the code fragments can be further enhanced to take care of special formatting commands.
PostScript often contains special characters that represent other characters. For example, the code fragment "a\213" might be used instead of "fi" in some character sets, and for "ffi" in others. Our technique can be modified to try both sequences and determine which is more appropriate for forming words.
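
Trying both sequences can be sketched as a dictionary lookup over the candidate substitutions. The helper `resolve_special`, the sample words, and the toy dictionary are hypothetical; the escape is treated here as the literal four characters `\213` found in the file.

```python
def resolve_special(word, escape, candidates, dictionary):
    """Substitute each candidate for the escape; keep the first dictionary hit."""
    for replacement in candidates:
        attempt = word.replace(escape, replacement)
        if attempt.lower() in dictionary:
            return attempt
    return None   # no candidate produced a known word

dictionary = {"find", "office"}    # toy stand-in dictionary
ESCAPE = "\\213"                   # the four characters \213 as they appear in the file
first = resolve_special("\\213nd", ESCAPE, ["fi", "ffi"], dictionary)
second = resolve_special("o\\213ce", ESCAPE, ["fi", "ffi"], dictionary)
# first -> 'find' (the "fi" reading), second -> 'office' (the "ffi" reading)
```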

Furthermore, the technique can be used to recover ordered words from a formatted document in any language which is well represented in the dictionary. Our implementation works well on all major European languages, which may include accented characters. For example, characters expressed using the ISO Latin character set can be handled as a variant of the special characters mentioned above.

Our method can also be applied to documents formatted with other languages such as PDF. With PDF, a formatted document is also compressed within a file. Therefore in this case, we first decompress the file prior to local and global matching.

Described above is a method for converting a formatted document to an ordered list of words. The words can be indexed, printed, displayed, or put in a file.
Depending on the complexity of the formatted document, our technique, in the worst case, is at least twice as fast as known methods. Some documents can be processed several hundred times faster than with prior art methods. We estimate that we can index all of the formatted documents accessible via the Web in about 26 days, instead of the estimated three to four years that would be required using competing prior art techniques, an improvement by a factor of 50.

The foregoing description has been directed to specific embodiments of this invention. It will be apparent, however, that variations and modifications may be made to the described embodiments that come within the scope of the invention.

Claims (5)

1. A computer-implemented method for converting a formatted document to an ordered list of words, comprising the steps of:
partitioning the formatted document into a first data structure and a second data structure in a memory of a computer, the first data structure storing text fragments and the second data structure storing code fragments of the formatted document;
concatenating adjacent text fragments into possible ordered word lists using a dictionary of representative words; and
selecting a best ordered word list from the possible ordered word lists, the best word list having the fewest number of words.
2. The method of claim 1 further comprising:
weighting each representative word of the dictionary by the frequency at which the word appears in text of a representative set of documents; and
selecting the best ordered word list as the possible word list also having a highest minimum word frequency.
3. The method of claim 1 further comprising:
storing each unique code fragment as an entry in a table, each entry also including a disjunction field and a conjunction field; and
incrementing the disjunction field when the associated code fragment is used to separate words, and incrementing the conjunction field when the code fragment is used to concatenate adjacent text fragments into a word.
4. A computer-implemented method for converting a formatted document to an ordered list of words, comprising the steps of:
partitioning the document into text fragments and code fragments;
organizing the text fragments as arcs and the code fragments as nodes in a graph;
matching concatenations of adjacent text fragments against a word dictionary to determine possible paths through the graph; and
selecting a particular path having the fewest number of arcs as representing a best ordering of the words of the formatted document.
5. A computer-implemented method for converting a formatted document to an ordered list of words, comprising the steps of:
locating text fragments in the formatted document;
concatenating adjacent text fragments into possible ordered word lists using a dictionary of representative words; and
selecting a best ordered word list from the possible ordered word lists, the best word list including all of the text fragments and having the fewest number of words.
CA002235868A 1997-05-16 1998-04-23 Method for converting formatted documents to ordered word lists Abandoned CA2235868A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US08/857,458 US6470362B1 (en) 1997-05-16 1997-05-16 Extracting ordered list of words from documents comprising text and code fragments, without interpreting the code fragments
US08/857,458 1997-05-16

Publications (1)

Publication Number Publication Date
CA2235868A1 1998-11-16

Family

ID=25326034

Family Applications (1)

Application Number Title Priority Date Filing Date
CA002235868A Abandoned CA2235868A1 (en) 1997-05-16 1998-04-23 Method for converting formatted documents to ordered word lists

Country Status (4)

Country Link
US (1) US6470362B1 (en)
EP (1) EP0878766A2 (en)
JP (1) JPH1139315A (en)
CA (1) CA2235868A1 (en)

Also Published As

Publication number Publication date
JPH1139315A (en) 1999-02-12
US6470362B1 (en) 2002-10-22
EP0878766A2 (en) 1998-11-18

Legal Events

Date Code Title Description
EEER Examination request
FZDE Dead