US20150363446A1 - System and Method for Indexing Streams Containing Unstructured Text Data - Google Patents

System and Method for Indexing Streams Containing Unstructured Text Data Download PDF

Info

Publication number
US20150363446A1
US20150363446A1 US14/836,400 US201514836400A US2015363446A1 US 20150363446 A1 US20150363446 A1 US 20150363446A1 US 201514836400 A US201514836400 A US 201514836400A US 2015363446 A1 US2015363446 A1 US 2015363446A1
Authority
US
United States
Prior art keywords
data
elements
block
allocated
processors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US14/836,400
Inventor
Adam Leko
Robert Bird
Matthew Whitlock
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Red Lambda Inc
Original Assignee
Red Lambda Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Red Lambda Inc filed Critical Red Lambda Inc
Priority to US14/836,400 priority Critical patent/US20150363446A1/en
Assigned to RED LAMBDA, INC. reassignment RED LAMBDA, INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: BIRD, ROBERT, LEKO, ADAM, WHITLOCK, MATTHEW
Publication of US20150363446A1 publication Critical patent/US20150363446A1/en
Abandoned legal-status Critical Current

Links

Images

Classifications

    • G06F17/30327
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/316Indexing structures
    • G06F16/322Trees
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2228Indexing structures
    • G06F16/2246Trees, e.g. B+trees
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • G06F16/24568Data stream processing; Continuous queries
    • G06F17/30516

Definitions

  • the presently disclosed invention relate in general to the field of indexing and retrieving data, and in particular to a system and method for indexing streaming text data in distributed systems.
  • indexing systems are problematic for full-text indexing on large volumes of streaming data.
  • the presently disclosed invention addresses such limitations, inter alias, by providing an improved text indexing system and method with acceptable worst-case insertion performance to enable real-time querying of streams of data.
  • the presently disclosed invention may be embodied in various forms, including a system, computer readable medium or a method for indexing data.
  • the system may comprise block-stores adapted to store data-elements of data-streams.
  • the system may comprise one or more data-blocks of the block-stores.
  • the stored data-elements may be allocated to the one or more data-blocks.
  • the system may comprise events of the one or more data-blocks.
  • the block-allocated data-elements may be further allocated to the events of the data-blocks.
  • Each of the data-blocks may comprise one or more events.
  • Each of the events may comprise the block-allocated data-elements of a corresponding data-block.
  • the system may further comprise terms generated by splitting the event-allocated data-elements, and term frequencies calculated based on the frequency of each term in each of the event.
  • the system may also comprise block-level term frequency data calculated for the event-allocated data-elements that are stored in a corresponding data-block.
  • the block-level term frequency data may be based on the term frequencies.
  • the system may comprise tree index structures generated for the event-allocated data-elements based on the block-level term frequency data.
  • the tree index structures may comprise Y-tree index structures.
  • the terms may be used in the Y-tree index structures as keys.
  • the block-stores may be stored on a plurality of distributed devices.
  • the data-streams may be received from a plurality of distributed devices.
  • the plurality of distributed devices may be connected via a network.
  • a computer readable medium for the presently disclosed invention comprising computer readable instructions stored thereon for execution by a processor.
  • the instructions on the computer-usable medium may be adapted to enable a computing device to receive data-streams, wherein the data-streams may comprise data-elements, and store the data-elements of the received data-streams, wherein the stored data-elements may be stored in block-stores.
  • the instructions may enable a computing device to allocate the stored data-elements to data-blocks of the block-stores and further allocate the block-allocated data-elements to events of the data-blocks.
  • Each of the data-blocks may comprise one or more events.
  • Each of the events may comprise the block-allocated data-elements of the corresponding data-block.
  • the instructions may enable a computing device to split the event-allocated data-elements into terms, calculate a frequency of each term in each of the event, and calculate block-level term frequency data for the event-allocated data-elements stored in the corresponding data-block based on the calculated term frequencies.
  • the instructions may also enable a computing device to generate tree index structures for the event-allocated data-elements based on the block-level term frequency data.
  • the tree index structures may comprise Y-tree index structures. The terms may be used in the Y-tree index structures as keys.
  • an embodiment of a method for the presently disclosed invention may include the step of receiving data-streams.
  • the data-streams may be received from a single distributed device or from a plurality of distributed devices.
  • the distributed devices may be connected via a network.
  • Each of the data-streams may comprise data-elements.
  • the method may include the step of storing the data-elements of the received data-streams.
  • the stored data-elements may be stored in block-stores.
  • the block-stores may be stored on a single distributed device or across a plurality of distributed devices. Such distributed devices may be connected via a network.
  • the method may include the step of allocating the stored data-elements to data-blocks of the block-stores.
  • Each of the block-stores may comprise one or more data-blocks.
  • each data-block may comprise the stored data-elements of only one of the received data-streams.
  • the data-blocks of a single data-stream may be logically grouped together.
  • Each of the data-blocks may be read and written as a single unit.
  • the method may include allocating the block-allocated data-elements to events of the data-blocks.
  • Each of the data-blocks may comprise one or more events.
  • Each of the events may comprise the block-allocated data-elements of the corresponding data-block.
  • the method may include the step of splitting the event-allocated data-elements into terms. Further, the method may include the step of calculating a frequency of each term within each of the event, and the step of calculating block-level term frequency statics or data for the event-allocated data-elements that are stored in the corresponding data-block based on the calculated term frequencies. The method may also include the step of generating tree index structures for the event-allocated data-elements based on the block-level term frequency data.
  • the tree index structures may comprise Y-tree index structures.
  • the Y-tree index structures may use the terms as keys.
  • pointers to the data-blocks may be generated.
  • Such pointers may comprise values stored in the Y-tree index structures that identify, or point to, the corresponding data-blocks and the corresponding block-level term frequency data.
  • event-allocated data-elements may comprise text data.
  • Event-allocated data-elements may also comprise a text representation of non-text data, as data-streams may comprise non-text data that is converted into a text representation.
  • Event-allocated data-elements may comprise unstructured data, which may be split in accordance with processes outlined in an Unicode Standard Annex #29 published the Unicode Consortium. Multiple writers to the data-blocks may have an independent tree structure.
  • a search query of the tree index structures may be performed.
  • the search query for the data-elements may be performed over all of the tree index structures.
  • Term statistics may be extracted from query text of the search query.
  • a list of candidate data-blocks may be generated that satisfy the search query based on the Y-tree index structure.
  • the search query may be evaluated against each of the data-blocks to generate a list of matching records.
  • a term-proximity search query of the tree index structures may be performed.
  • the term-proximity search query may be a wildcard suffix matches search query, wherein a minimum key in the Y-tree index structures satisfies a pattern requirement. The keys may be iterated through until a key is reached that is different from the pattern requirement.
  • the term-proximity search query may be a fuzzy matches search.
  • the term-proximity search query may be based on a Soundex algorithm. In some embodiments, a list of the terms that are present in a master Y-tree index structure may be maintained.
  • individual pages within the Y-tree index structure may be compressed.
  • the individual pages within the Y-tree index may be compressed utilizing compression algorithms.
  • the individual pages within the Y-tree index may be compressed by storing data-block in the Y-tree index structures via gap-compressed encodings, such as ⁇ or ⁇ gap-compressed encodings.
  • search queries may be multicasted to a set of dedicated search nodes that perform searches.
  • Search queries may be transmitted to a group of destination computing devices simultaneously in a single transmission from the requesting computing device, which may have minimal resources.
  • query results that are gathered from the search nodes may be combined.
  • Separate Y-tree index structures may be utilized per stream writers.
  • an index data page cache for the search nodes may be generated.
  • a pre-determined amount of time that an index data page may reside in the cache before being refreshed with a new page from the backing storage may be adjusted.
  • a block-identifier may be assigned to each of the data-blocks. Such a block-identifier may be globally unique.
  • Each of the block-stores may comprise one or more data-blocks.
  • Each of the data-blocks may be read and written as a single unit.
  • the data-blocks of a single data-stream may be logically grouped.
  • FIG. 1 is a graph illustrating the results of a scale test performed with a text-indexing system, in accordance with certain embodiments of the invention.
  • FIG. 2 is a graph illustrating the results of a scale test performed with a text-indexing system, in accordance with certain embodiments of the invention.
  • FIG. 3 is a block diagram illustrating components of an embodiment of a data indexing system, in accordance with certain embodiments of the invention.
  • FIG. 4 is a flowchart illustrating steps of an embodiment of a data indexing method, in accordance with certain embodiments of the invention.
  • One of the objects of the present system and method may be an application in which full-text indexing on large volumes of streaming data 1 is performed with acceptable worst-case insertion.
  • the object for certain embodiments may concern such an application which enables real-time querying of data 2 streaming over a network 3 .
  • Such data-elements 2 of streaming data 1 may be received and stored in block-stores 4 .
  • These block-stores 4 may be stored on a distributed device 5 or across a plurality of distributed devices 5 .
  • the block-stores 4 may comprise data-blocks 6 , which may be assigned with block-identifiers 7 .
  • the data-blocks 6 may comprise events 8 .
  • the block-allocated data-elements 5 may be further allocated to the events 8 of the data-blocks 6 .
  • Embodiments may not be limited, however, to any one application, example or object.
  • the embodiments may be applicable in virtually any application in which text data is indexed for later searching, retrieval, updating, and deletion.
  • the embodiments of the present system and method are well suited for implementation in a distributed environment in which streams 1 of text data are persisted across multiple data storage nodes connected in a network 3 .
  • Text indexing approaches have generally fallen into two categories: inverted indices and suffix arrays.
  • Inverted indices are generally more space efficient, but require preprocessing text into individual word tokens and restricting queries to matches on whole word tokens.
  • Suffix arrays, and their variants, allow for searching arbitrary substrings on large bodies of text, but require more up-front computation to generate the index data structures and generally have a much higher space penalty as compared to inverted indices.
  • an embodiment of the present invention may provide efficient incremental updates that work in a distributed streaming environment.
  • Prior solutions for inverted indices include performing incremental merges of separate index segments using techniques such as a log-structured merge or a multi-way merge, which is the approach that is taken by the Apache Lucene Core project referenced above. More recent solutions include processes based on “cache-oblivious” data structures developed by researchers at Massachusetts Institute of Technology (MIT). While such methods have good amortized complexity for insertions, those approaches are inefficient as they require expensive periodic reorganizations of data within the index.
  • MIT Massachusetts Institute of Technology
  • An object of an embodiment of the present invention may include a data structure designed for efficient batch insertions with the traditional inverted indices working on top of a compression layer.
  • a benefit of such an embodiment is that it may work with block-based stream storage mechanisms and techniques, such as those disclosed in U.S. patent application Ser. No. 13/600,853, entitled “System and Method for Storing Data Streams in a Distributed Environment,” which has been incorporated by reference above.
  • FIGS. 1 and 2 An advantage of an indexing embodiment for the presently disclosed invention which does not require utilization of traditional storage mechanisms may be appreciated when comparing the results illustrated in FIGS. 1 and 2 .
  • the results obtained from an indexing implementation of an embodiment of the presently disclosed invention are labeled “New Index,” while the results obtained from the Apache Lucene implementation are labeled “Lucene.”
  • the Apache Lucene implementation is a full-text index implementation that uses traditional techniques known in the art.
  • FIG. 1 compares the processing time results obtained by the Lucene implementation versus the New Index implementation.
  • FIG. 2 compares the space overhead results obtained by the Lucene implementation versus the New Index implementation. As illustrated in these figures, the New Index implementation results in significantly shorter processing time and lower space overhead than the Lucene implementation as the amount of indexed data increases.
  • FIGS. 1 and 2 do not illustrate background input/output (I/O) operations performed by the Apache Lucene implementation, the disadvantages in performance due to such additional operations can still be realized.
  • the amount of background I/O operations that the Apache Lucene implementation uses for merge operations grows as the amount of indexed data grows. Once the index size is large enough, these periodic rebuild operations dominate system resources.
  • the novel indexing implementation of an embodiment of the presently disclosed invention requires no background I/O operations, and thus performance does not degrade as significantly as the amount of indexed data increases.
  • An embodiment of the invention may be performed in two phases: 1) front-end processing and 2) back-end storage.
  • the front-end processing phase may utilize unstructured data-elements 2 as input. If such data 2 is not in text form, the data 2 may be first converted to a text representation using any appropriate method known in the art. Once the data 2 is in text form, the text data-elements 2 may be split into individual terms 9 , such as words. The text data-elements 2 may be split using the process outlined in Unicode Standard Annex #29 such that language-specific word boundaries are respected. The annex, entitled “Unicode Text Segmentation” and published on the Internet by the Unicode Consortium, describes guidelines for determining default boundaries between certain significant text elements. After the input text data-elements 2 have been split into individual terms 9 , the front-end processor calculates the frequency 10 of each term 9 in each event 8 .
  • the term frequencies 10 may be reduced down to a set of frequency statistics 11 for the records or data-elements 2 contained in a single data-block 6 .
  • the block-level term frequency data 11 may be flushed to the tree index structure 12 described below.
  • Parallelism in the front-end may be achieved as in the aforementioned patent application.
  • a pool of data-blocks 6 may be made available during the insertion process, and threads performing insertions may update the term frequency data structures 10 for each pooled block after each insertion.
  • Other storage mechanisms may introduce parallelism in the front-end phase as appropriate based on the underlying storage for each event 8 . Such an embodiment which utilizes such a block-based storage mechanism is further described below.
  • This tree structure 12 preferably uses a modified B + -tree variant known as a Y-tree 12 .
  • the tree structure 12 may use other indexes that support fast bulk insertions while still providing efficient query performance, as accomplished by a B+-tree variant.
  • the Y-tree paper entitled “A Novel Index Supporting High Volume Data Warehouse Insertions,” which has been incorporated by reference above, provides sufficient information for a competent developer familiar with B + -trees to implement a fully functional Y-tree 12 .
  • Records/data-elements 2 may be represented within such a Y-tree structure 12 by using individual terms 9 as keys 13 and lists 14 of record entries as values 15 .
  • Each value entry 15 may contain a pointer 16 to a data-block 6 along with overall statistics 11 for the records/data-elements 2 contained within that data-block 6 .
  • Parallelism in the back-end may be achieved by allowing multiple writers to each have an independent tree structure 12 . In an embodiment, searches must be performed over all tree instances. Read-write locks or atomic lock-free schemes that work over tree structures 12 may also be used to allow parallel updates to a master tree structure 12 .
  • explicitly caching frequently used data pages within the Y-tree 12 is an effective way to reduce the number of I/O operations performed when records are inserted into the index structure 12 .
  • the search query is processed in a manner similar to insertions.
  • the front-end is used to extract term statistics 10 from the query text.
  • the Y-tree structure 12 is used to generate a list of candidate data-blocks 6 that satisfy the search query, and the search query is evaluated against each data-block 6 to generate a list of matching records.
  • Efficient ranked retrieval may be provided by using the aggregated block-level term statistics 11 contained in the master tree structure 12 .
  • Ranking functions based on cosine similarity may be used with a straightforward application of the procedure described in “Filtered Document Retrieval with Frequency-Sorted Indexes” authored by M. Persin et al. and published in the October, 1996 edition of the Journal of the American Society for Information Science.
  • Extended query operations such as term proximity searches may also be performed using such an embodiment, as described above.
  • Wildcard suffix matches may be performed by finding the minimum key 13 in the Y-tree structure 12 that satisfies the pattern requirement and iterating through the keys 13 until reaching a key 13 that does not satisfy the pattern.
  • individual pages within the Y-tree 12 may also be compressed using general-purpose, compression algorithms or by storing block or record lists in the Y-tree 12 via ⁇ or ⁇ gap-compressed encodings.
  • searches are to be performed by clients with minimal resources, these clients may multicast their queries to a set of dedicated nodes that perform searches on behalf of those clients. If using separate Y-tree indices 12 per stream writer, results gathered from each search node may be combined with a straightforward application of techniques used in parallel map-reduce systems. These search nodes may also benefit from the addition of an index data page cache to ensure queries containing popular terms 9 are quickly serviced. Additionally, the amount of time that pages are allowed to reside in the cache before being refreshed with newer pages from backing storage may be tuned or adjusted to balance freshness of results versus the amount of I/O operations performed in order to keep those results up-to-date.
  • An object of an embodiment of the present invention may be to balance fast insertion speeds with ensuring that data is available as soon as possible, all while efficiently mapping to file system operations available in distributed file systems.
  • Selection of an indexing structure 12 that accomplishes these objectives was an important part of the design process used to generate the techniques embodied by the presently disclosed invention.
  • an embodiment may require augmenting inverted indices with data structures that support efficient batch insertions.
  • a specific object for an embodiment may be to achieve a sustained insertion rate of over 50,000 events 8 per second for an input data set containing 5 billion events 8 .
  • searches against the generated index 12 may require only one block of index data in memory at a time.
  • Conventional search structures often require holding significant portions of the index structure 12 in memory in order to perform searches against the indexed data.
  • the storage space used by the index 12 and the data stored by the index 12 may be approximately one-quarter of the size of the raw data set in its uncompressed form.
  • data element 2 shall mean a set of binary data containing a unit of information.
  • Examples of data-elements 2 include, without limitation, a packet of data flowing across a network 3 ; a row returned from a database query; a line in a digital file such as a text file, document file, or log file; an email message; a message system message; a text message; a binary large object; a digitally stored file; an object capable of storage in an object-oriented database; and an image file, music file, or video file.
  • Data-elements 2 often, but do not always, represent physical objects such as sections of a DNA molecule, a physical document, or any other binary representation of a real world object.
  • instructions shall mean a set of digital data containing steps to be performed by a computing device.
  • examples of “instructions” include, without limitation, a computer program, macro, or remote procedure call that is executed when an event occurs (such as detection of an input data-element 2 that has a high probability of falling within a particular category).
  • “instructions” can include an indication that no operation is to take place, which can be useful when an event that is expected, and has a high likelihood of being harmless, has been detected, as it indicates that such event can be ignored.
  • “instructions” may implement state machines.
  • machine readable storage shall mean a medium containing random access or read-only memory that is adapted to be read from and/or written to by a computing device having a processor.
  • machine readable storage shall include, without limitation, random access memory in a computer; random access memory or read only memory in a network device such as a router switch, gateway, network storage device, network security device, or other network device; a CD or DVD formatted to be readable by a hardware device; a thumb drive or memory card formatted to be readable by a hardware device; a computer hard drive; a tape adapted to be readable by a computer tape drive; or other media adapted to store data that can be read by a computer having appropriate hardware and software.
  • network 3 or “computer network” shall mean an electronic communications network adapted to enable one or more computing devices to communicate by wired or wireless signals.
  • networks 3 include, but are not limited to, local area networks (LANs), wide area networks (WANs) such as the Internet, wired TCP and similar networks, wireless networks (including without limitation wireless networks conforming to IEEE 802.11 and the Bluetooth standards), and any other combination of hardware, software, and communications capabilities adapted to allow digital communication between computing devices.
  • operably connected shall mean connected either directly or indirectly by one or more cable, wired network, or wireless network connections in such a way that the operably connected components are able to communicate digital data from one to another.
  • output shall mean to render (or cause to be rendered) to a human-readable display such as a computer or handheld device screen, to write to (or cause to be written to) a digital file or database, to print (or cause to be printed), or to otherwise generate (or cause to be generated) a copy of information in a non-transient form.
  • output shall include creation and storage of digital, visual and sound-based representations of information.
  • server shall mean a computing device adapted to be operably connected to a network 3 such that it can receive and/or send data to other devices operably connected to the same network, or service requests from such devices.
  • a server has at least one processor and at least one machine-readable storage medium operably connected to that processor, such that the processor can read data from that machine-readable storage.
  • system shall mean a plurality of components adapted and arranged as indicated.
  • meanings and definitions of other terms used herein shall be apparent to those of ordinary skill in the art based upon the following disclosure.
  • FIG. 3 is a block diagram illustrating components of an embodiment of a data indexing system, in accordance with certain embodiments of the invention.
  • an embodiment may comprise block-stores 4 stored on distributed devices 5 .
  • the distributed devices 5 may be adapted to communicate via a network 3 .
  • a distributed device 5 may comprise a computing device, such as a computer or a smart phone, which can share data with other distributed devices 5 via a network 3 .
  • Data-streams 1 may be transmitted and received from the distributed devices 5 .
  • Each of the data-streams 1 may comprise data-elements 2 , which may be transmitted via the network 3 .
  • the block-stores 4 may store the data-elements 2 of the received data-streams 1 .
  • Such stored data-elements 2 may comprise digital copies of the transmitted data-elements 2 .
  • the data-elements 2 may comprise unstructured data.
  • the stored data-elements 2 may be allocated to data-blocks 6 of a block-store 1 , as illustrated in FIG. 3 . Such block-allocated data-elements 2 may be logically grouped.
  • Each of the block-stores 4 may comprise one or more data-blocks 6 .
  • each data-block 6 may comprise the stored data-elements 2 of only one of the received data-streams 1 .
  • the data-blocks 6 of a single data-stream 1 may be logically grouped.
  • a block-identifier 7 may be assigned to each of the data-blocks 6 . Such a block-identifier 7 may be globally unique.
  • the data-blocks 6 may comprise events 8 .
  • the block-allocated data-elements 2 may be further allocated to the events 8 of the data-blocks 6 . Such event-allocated data-elements 2 may be logically grouped. Each of the data-blocks 6 may comprise one or more events 8 . Each of the events 8 may comprise the block-allocated data-elements 2 of the corresponding data-block 6 . In an embodiment, these event-allocated data-elements 2 may comprise the stored data-elements 2 of only one of the received data-streams 1 .
  • FIG. 4 is a flowchart illustrating steps of an embodiment of a data indexing method, in accordance with certain embodiments of the invention.
  • such an embodiment may comprise the step of receiving 401 data-streams 1 .
  • the data-streams 1 may be received from a single distributed device 5 or from a plurality of distributed devices 5 .
  • the distributed devices 5 may be connected via a network 3 .
  • Each of the data-streams 1 may comprise data-elements 2 .
  • the method may include the step of storing 402 the data-elements 2 of the received data-streams 1 .
  • the stored data-elements 2 may be stored in block-stores 4 .
  • the block-stores 4 may be stored on a single distributed device 5 or across a plurality of distributed devices 5 .
  • Such distributed devices 5 may be connected via a network 3 .
  • the method may include the step of allocating 403 the stored data-elements 2 to data-blocks 6 of the block-stores 4 .
  • Each of the block-stores 4 may comprise one or more data-blocks 6 .
  • each data-block 6 may comprise the stored data-elements 2 of only one of the received data-streams 1 .
  • the data-blocks 6 of a single data-stream 1 may be logically grouped together.
  • Each of the data-blocks 6 may be read and written as a single unit.
  • the method may include allocating 404 the block-allocated data-elements 2 to events 8 of the data-blocks 6 .
  • Each of the data-blocks 6 may comprise one or more events 8 .
  • Each of the events 8 may comprise the block-allocated data-elements 2 of the corresponding data-block 6 .
  • the method may include the step of splitting 405 the event-allocated data-elements 2 into terms 9 . Further, the method may include the step of calculating 406 a frequency 10 of each term 9 within each of the event 8 , and the step of calculating 407 block-level term frequency statics or data 11 for the event-allocated data-elements 2 that are stored in the corresponding data-block 6 based on the calculated term frequencies 10 . The method may also include the step of generating 408 tree index structures 12 for the event-allocated data-elements 2 based on the block-level term frequency data 11 . The tree index structures 12 may comprise Y-tree index structures. The Y-tree index structures 12 may use the terms 9 as keys 13 . The method may further include the step of performing 409 a search query for the data-elements 2 over all of the tree index structures 12 .

Abstract

A system, method and computer readable medium for indexing streaming data. Data may be received from distributed devices connected via a network. Data elements may be stored and allocated to data blocks and events of the block stores. Non-text data may be converted into a text representation. The data may be split into terms, and term frequencies of each term within each of the event may be calculated. Block-level term frequency statics may be calculated based on the term frequencies. Tree index structures, such as the Y-tree index, may be generated based on the block-level term frequency data. The Y-tree index structures may use the terms as keys and pointers to the corresponding data blocks and block-level term frequency data. A search query may be performed over the tree index structures.

Description

  • This non-provisional patent application claims priority to, and incorporates herein by reference, U.S. Provisional Patent Application No. 61/677,171 which was filed Jul. 30, 2012 and further incorporates herein by reference, U.S. patent application Ser. No. 13/600,853 which was filed Aug. 31, 2012.
  • This application includes material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent disclosure, as it appears in the Patent and Trademark Office files or records, but otherwise reserves all copyright rights whatsoever.
  • FIELD OF THE INVENTION
  • The presently disclosed invention relate in general to the field of indexing and retrieving data, and in particular to a system and method for indexing streaming text data in distributed systems.
  • BACKGROUND OF THE INVENTION
  • Systems for indexing text data are known in the art. Basic data indexing and information retrieval techniques have been described in a book entitled “Introduction to Information Retrieval”, ISBN 0521865719. Technology for applications that require full-text searches is well known in the Apache community for the Apache Lucene Core open-source software project, which is supported over the Internet by the Apache Software Foundation. In addition, the paper entitled “A Novel Index Supporting High Volume Data Warehouse Insertions” which is authored by C. Jermaine et al., while failing to address text indexing, describes certain indexing techniques and Y-tree index structures for processing fast insertions of telephone Call Detail Records (CDR). All of these publications are incorporated herein by reference. Such indexing systems, however, are problematic for full-text indexing on large volumes of streaming data. The presently disclosed invention addresses such limitations, inter alias, by providing an improved text indexing system and method with acceptable worst-case insertion performance to enable real-time querying of streams of data.
  • SUMMARY OF THE INVENTION
  • The presently disclosed invention may be embodied in various forms, including a system, computer readable medium or a method for indexing data.
  • In an embodiment of a data indexing system, the system may comprise block-stores adapted to store data-elements of data-streams. The system may comprise one or more data-blocks of the block-stores. The stored data-elements may be allocated to the one or more data-blocks. In addition, the system may comprise events of the one or more data-blocks. The block-allocated data-elements may be further allocated to the events of the data-blocks. Each of the data-blocks may comprise one or more events. Each of the events may comprise the block-allocated data-elements of a corresponding data-block.
  • The system may further comprise terms generated by splitting the event-allocated data-elements, and term frequencies calculated based on the frequency of each term in each of the event. The system may also comprise block-level term frequency data calculated for the event-allocated data-elements that are stored in a corresponding data-block. The block-level term frequency data may be based on the term frequencies. Further, the system may comprise tree index structures generated for the event-allocated data-elements based on the block-level term frequency data. The tree index structures may comprise Y-tree index structures. The terms may be used in the Y-tree index structures as keys. In an embodiment, the block-stores may be stored on a plurality of distributed devices. In certain embodiments, the data-streams may be received from a plurality of distributed devices. The plurality of distributed devices may be connected via a network.
  • Further disclosed is an embodiment of a computer readable medium for the presently disclosed invention comprising computer readable instructions stored thereon for execution by a processor. The instructions on the computer-usable medium may be adapted to enable a computing device to receive data-streams, wherein the data-streams may comprise data-elements, and store the data-elements of the received data-streams, wherein the stored data-elements may be stored in block-stores. In addition, the instructions may enable a computing device to allocate the stored data-elements to data-blocks of the block-stores and further allocate the block-allocated data-elements to events of the data-blocks. Each of the data-blocks may comprise one or more events. Each of the events may comprise the block-allocated data-elements of the corresponding data-block. Further, the instructions may enable a computing device to split the event-allocated data-elements into terms, calculate a frequency of each term in each of the event, and calculate block-level term frequency data for the event-allocated data-elements stored in the corresponding data-block based on the calculated term frequencies. The instructions may also enable a computing device to generate tree index structures for the event-allocated data-elements based on the block-level term frequency data. The tree index structures may comprise Y-tree index structures. The terms may be used in the Y-tree index structures as keys.
  • Similarly, an embodiment of a method for the presently disclosed invention may include the step of receiving data-streams. The data-streams may be received from a single distributed device or from a plurality of distributed devices. The distributed devices may be connected via a network. Each of the data-streams may comprise data-elements. The method may include the step of storing the data-elements of the received data-streams. The stored data-elements may be stored in block-stores. The block-stores may be stored on a single distributed device or across a plurality of distributed devices. Such distributed devices may be connected via a network.
  • Further, the method may include the step of allocating the stored data-elements to data-blocks of the block-stores. Each of the block-stores may comprise one or more data-blocks. In an embodiment, each data-block may comprise the stored data-elements of only one of the received data-streams. The data-blocks of a single data-stream may be logically grouped together. Each of the data-blocks may be read and written as a single unit. In addition, the method may include allocating the block-allocated data-elements to events of the data-blocks. Each of the data-blocks may comprise one or more events. Each of the events may comprise the block-allocated data-elements of the corresponding data-block.
  • In addition, the method may include the step of splitting the event-allocated data-elements into terms. Further, the method may include the step of calculating a frequency of each term within each of the event, and the step of calculating block-level term frequency statics or data for the event-allocated data-elements that are stored in the corresponding data-block based on the calculated term frequencies. The method may also include the step of generating tree index structures for the event-allocated data-elements based on the block-level term frequency data. The tree index structures may comprise Y-tree index structures. The Y-tree index structures may use the terms as keys.
  • In embodiments of the above-disclosed system, computer readable medium, and method, pointers to the data-blocks may be generated. Such pointers may comprise values stored in the Y-tree index structures that identify, or point to, the corresponding data-blocks and the corresponding block-level term frequency data.
  • In some embodiments of the above-disclosed system, computer readable medium, and method, event-allocated data-elements may comprise text data. Event-allocated data-elements may also comprise a text representation of non-text data, as data-streams may comprise non-text data that is converted into a text representation. Event-allocated data-elements may comprise unstructured data, which may be split in accordance with processes outlined in an Unicode Standard Annex #29 published the Unicode Consortium. Multiple writers to the data-blocks may have an independent tree structure.
  • In certain embodiments of the above-disclosed system, computer readable medium, and method, a search query of the tree index structures may be performed. The search query for the data-elements may be performed over all of the tree index structures. Term statistics may be extracted from query text of the search query. In some embodiments, a list of candidate data-blocks may be generated that satisfy the search query based on the Y-tree index structure. The search query may be evaluated against each of the data-blocks to generate a list of matching records.
  • In some embodiments of the above-disclosed system, computer readable medium, and method, a term-proximity search query of the tree index structures may be performed. In an embodiment, the term-proximity search query may be a wildcard suffix matches search query, wherein a minimum key in the Y-tree index structures satisfies a pattern requirement. The keys may be iterated through until a key is reached that is different from the pattern requirement. In certain embodiments, the term-proximity search query may be a fuzzy matches search. In an embodiment, the term-proximity search query may be based on a Soundex algorithm. In some embodiments, a list of the terms that are present in a master Y-tree index structure may be maintained.
  • In an embodiment of the above-disclosed system, computer readable medium, and method, individual pages within the Y-tree index structure may be compressed. The individual pages within the Y-tree index may be compressed utilizing compression algorithms. The individual pages within the Y-tree index may be compressed by storing data-block in the Y-tree index structures via gap-compressed encodings, such as γ or δ gap-compressed encodings.
  • In an embodiment of the above-disclosed system, computer readable medium, and method, search queries may be multicasted to a set of dedicated search nodes that perform searches. Search queries may be transmitted to a group of destination computing devices simultaneously in a single transmission from the requesting computing device, which may have minimal resources. In some embodiments, query results that are gathered from the search nodes may be combined. Separate Y-tree index structures may be utilized per stream writers. In certain embodiments, an index data page cache for the search nodes may be generated. In an embodiment, a pre-determined amount of time that an index data page may reside in the cache before being refreshed with a new page from the backing storage may be adjusted.
  • In an embodiment of the above-disclosed system, computer readable medium, and method, a block-identifier may be assigned to each of the data-blocks. Such a block-identifier may be globally unique. Each of the block-stores may comprise one or more data-blocks. Each of the data-blocks may be read and written as a single unit. The data-blocks of a single data-stream may be logically grouped.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The foregoing and other objects, features, and advantages of the invention will be apparent from the following more particular description of embodiments as illustrated in the accompanying drawings, in which reference characters refer to the same parts throughout the various views. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating principles of the invention.
  • FIG. 1 is a graph illustrating the results of a scale test performed with a text-indexing system, in accordance with certain embodiments of the invention.
  • FIG. 2 is a graph illustrating the results of a scale test performed with a text-indexing system, in accordance with certain embodiments of the invention.
  • FIG. 3 is a block diagram illustrating components of an embodiment of a data indexing system, in accordance with certain embodiments of the invention.
  • FIG. 4 is a flowchart illustrating steps of an embodiment of a data indexing method, in accordance with certain embodiments of the invention.
  • DETAILED DESCRIPTION OF THE EMBODIMENTS
  • Reference will now be made in detail to the embodiments of the presently disclosed invention, examples of which are illustrated in the accompanying drawings.
  • One of the objects of the present system and method may be an application in which full-text indexing on large volumes of streaming data 1 is performed with acceptable worst-case insertion. The object for certain embodiments may concern such an application which enables real-time querying of data 2 streaming over a network 3. Such data-elements 2 of streaming data 1 may be received and stored in block-stores 4. These block-stores 4 may be stored on a distributed device 5 or across a plurality of distributed devices 5. The block-stores 4 may comprise data-blocks 6, which may be assigned with block-identifiers 7. The data-blocks 6 may comprise events 8. The block-allocated data-elements 5 may be further allocated to the events 8 of the data-blocks 6. Embodiments may not be limited, however, to any one application, example or object. The embodiments may be applicable in virtually any application in which text data is indexed for later searching, retrieval, updating, and deletion. The embodiments of the present system and method are well suited for implementation in a distributed environment in which streams 1 of text data are persisted across multiple data storage nodes connected in a network 3.
  • Existing indexing solutions for text data fail to work with unbounded streams of incoming data. Prior approaches rely on periodically rebuilding index structures after the data set reaches a certain threshold. This may work well for batch-processed index updates, such as those used by web search engines. However, when applied to streaming data, such techniques result in long periods of time where the index is not available or where the index state reflects stale data.
  • Text indexing approaches have generally fallen into two categories: inverted indices and suffix arrays. Inverted indices are generally more space efficient, but require preprocessing text into individual word tokens and restricting queries to matches on whole word tokens. Suffix arrays, and their variants, allow for searching arbitrary substrings on large bodies of text, but require more up-front computation to generate the index data structures and generally have a much higher space penalty as compared to inverted indices.
  • As practical approaches for efficient incremental updates to suffix array-based text indices are lacking, recent applications which utilize such indices may leverage advanced compression techniques such as those used by the FM-index (developed by Ferragina and Manzini) and the Wavelet tree (developed by Navarro et al.). While these variants greatly reduce the space overhead required for suffix arrays, they fail to support efficient incremental updates at the rates required to handle incoming streams of data. In contrast, an embodiment of the present invention may provide efficient incremental updates that work in a distributed streaming environment.
  • Prior solutions for inverted indices include performing incremental merges of separate index segments using techniques such as a log-structured merge or a multi-way merge, which is the approach that is taken by the Apache Lucene Core project referenced above. More recent solutions include processes based on “cache-oblivious” data structures developed by researchers at Massachusetts Institute of Technology (MIT). While such methods have good amortized complexity for insertions, those approaches are inefficient as they require expensive periodic reorganizations of data within the index.
  • An object of an embodiment of the present invention may include a data structure designed for efficient batch insertions with the traditional inverted indices working on top of a compression layer. A benefit of such an embodiment is that it may work with block-based stream storage mechanisms and techniques, such as those disclosed in U.S. patent application Ser. No. 13/600,853, entitled “System and Method for Storing Data Streams in a Distributed Environment,” which has been incorporated by reference above.
  • An advantage of an indexing embodiment for the presently disclosed invention which does not require utilization of traditional storage mechanisms may be appreciated when comparing the results illustrated in FIGS. 1 and 2. The results obtained from an indexing implementation of an embodiment of the presently disclosed invention are labeled “New Index,” while the results obtained from the Apache Lucene implementation are labeled “Lucene.” The Apache Lucene implementation is a full-text index implementation that uses traditional techniques known in the art. FIG. 1 compares the processing time results obtained by the Lucene implementation versus the New Index implementation. FIG. 2 compares the space overhead results obtained by the Lucene implementation versus the New Index implementation. As illustrated in these figures, the New Index implementation results in significantly shorter processing time and lower space overhead than the Lucene implementation as the amount of indexed data increases.
  • While FIGS. 1 and 2 do not illustrate background input/output (I/O) operations performed by the Apache Lucene implementation, the disadvantages in performance due to such additional operations can still be realized. The amount of background I/O operations that the Apache Lucene implementation uses for merge operations grows as the amount of indexed data grows. Once the index size is large enough, these periodic rebuild operations dominate system resources. In contrast, the novel indexing implementation of an embodiment of the presently disclosed invention requires no background I/O operations, and thus performance does not degrade as significantly as the amount of indexed data increases.
  • An embodiment of the invention may be performed in two phases: 1) front-end processing and 2) back-end storage. The front-end processing phase may utilize unstructured data-elements 2 as input. If such data 2 is not in text form, the data 2 may be first converted to a text representation using any appropriate method known in the art. Once the data 2 is in text form, the text data-elements 2 may be split into individual terms 9, such as words. The text data-elements 2 may be split using the process outlined in Unicode Standard Annex #29 such that language-specific word boundaries are respected. The annex, entitled “Unicode Text Segmentation” and published on the Internet by the Unicode Consortium, describes guidelines for determining default boundaries between certain significant text elements. After the input text data-elements 2 have been split into individual terms 9, the front-end processor calculates the frequency 10 of each term 9 in each event 8.
  • If using a block-based storage mechanism such as the one described in U.S. patent application Ser. No. 13/600,853, the term frequencies 10 may be reduced down to a set of frequency statistics 11 for the records or data-elements 2 contained in a single data-block 6. Once the data-block 6 is full and is flushed to disk, the block-level term frequency data 11 may be flushed to the tree index structure 12 described below. Parallelism in the front-end may be achieved as in the aforementioned patent application. A pool of data-blocks 6 may be made available during the insertion process, and threads performing insertions may update the term frequency data structures 10 for each pooled block after each insertion. Other storage mechanisms may introduce parallelism in the front-end phase as appropriate based on the underlying storage for each event 8. Such an embodiment which utilizes such a block-based storage mechanism is further described below.
  • Once frequency statistics/data 11 have been aggregated for each group of inserted records/data-elements 2, pointers 16 to each data-block 6, or to each individual record/data-element 2, may be inserted into the back-end storage tree structure 12. This tree structure 12 preferably uses a modified B+-tree variant known as a Y-tree 12. The tree structure 12 may use other indexes that support fast bulk insertions while still providing efficient query performance, as accomplished by a B+-tree variant. The Y-tree paper, entitled “A Novel Index Supporting High Volume Data Warehouse Insertions,” which has been incorporated by reference above, provides sufficient information for a competent developer familiar with B+-trees to implement a fully functional Y-tree 12.
  • Records/data-elements 2 may be represented within such a Y-tree structure 12 by using individual terms 9 as keys 13 and lists 14 of record entries as values 15. Each value entry 15 may contain a pointer 16 to a data-block 6 along with overall statistics 11 for the records/data-elements 2 contained within that data-block 6. Parallelism in the back-end may be achieved by allowing multiple writers to each have an independent tree structure 12. In an embodiment, searches must be performed over all tree instances. Read-write locks or atomic lock-free schemes that work over tree structures 12 may also be used to allow parallel updates to a master tree structure 12.
  • As with traditional B+-trees, explicitly caching frequently used data pages within the Y-tree 12 is an effective way to reduce the number of I/O operations performed when records are inserted into the index structure 12.
  • To perform searches against the index structure 12, the search query is processed in a manner similar to insertions. The front-end is used to extract term statistics 10 from the query text. For each referenced term, the Y-tree structure 12 is used to generate a list of candidate data-blocks 6 that satisfy the search query, and the search query is evaluated against each data-block 6 to generate a list of matching records. Efficient ranked retrieval may be provided by using the aggregated block-level term statistics 11 contained in the master tree structure 12. Ranking functions based on cosine similarity may be used with a straightforward application of the procedure described in “Filtered Document Retrieval with Frequency-Sorted Indexes” authored by M. Persin et al. and published in the October, 1996 edition of the Journal of the American Society for Information Science.
  • Extended query operations such as term proximity searches may also be performed using such an embodiment, as described above. Wildcard suffix matches may be performed by finding the minimum key 13 in the Y-tree structure 12 that satisfies the pattern requirement and iterating through the keys 13 until reaching a key 13 that does not satisfy the pattern. Some types of searches—such as fuzzy matches or searches using the Soundex algorithm—may require a global term list. Soundex is a phonetic algorithm for indexing words by sound, as pronounced in English. This may be accomplished by iterating over the keys 13 in the Y-tree or, more efficiently, by maintaining a separate smaller list containing only terms 9 that are present in the Y-tree master index 12.
  • As with the aforementioned block-based storage system, individual pages within the Y-tree 12 may also be compressed using general-purpose, compression algorithms or by storing block or record lists in the Y-tree 12 via γ or δ gap-compressed encodings.
  • If searches are to be performed by clients with minimal resources, these clients may multicast their queries to a set of dedicated nodes that perform searches on behalf of those clients. If using separate Y-tree indices 12 per stream writer, results gathered from each search node may be combined with a straightforward application of techniques used in parallel map-reduce systems. These search nodes may also benefit from the addition of an index data page cache to ensure queries containing popular terms 9 are quickly serviced. Additionally, the amount of time that pages are allowed to reside in the cache before being refreshed with newer pages from backing storage may be tuned or adjusted to balance freshness of results versus the amount of I/O operations performed in order to keep those results up-to-date.
  • An object of an embodiment of the present invention may be to balance fast insertion speeds with ensuring that data is available as soon as possible, all while efficiently mapping to file system operations available in distributed file systems. Selection of an indexing structure 12 that accomplishes these objectives was an important part of the design process used to generate the techniques embodied by the presently disclosed invention. In addition to selecting such a data structure 12, an embodiment may require augmenting inverted indices with data structures that support efficient batch insertions. These techniques, together with the extended Y-tree index structure 12 disclosed in this present application, solve the technical problems associated with satisfying the requirements presented by the above objectives.
  • A specific object for an embodiment may be to achieve a sustained insertion rate of over 50,000 events 8 per second for an input data set containing 5 billion events 8. Due to the paged nature of certain approaches, searches against the generated index 12 may require only one block of index data in memory at a time. Conventional search structures often require holding significant portions of the index structure 12 in memory in order to perform searches against the indexed data. Due to the compression mechanism used by the index structure 12, the storage space used by the index 12 and the data stored by the index 12 may be approximately one-quarter of the size of the raw data set in its uncompressed form.
  • The term “data element” 2 shall mean a set of binary data containing a unit of information. Examples of data-elements 2 include, without limitation, a packet of data flowing across a network 3; a row returned from a database query; a line in a digital file such as a text file, document file, or log file; an email message; a message system message; a text message; a binary large object; a digitally stored file; an object capable of storage in an object-oriented database; and an image file, music file, or video file. Data-elements 2 often, but do not always, represent physical objects such as sections of a DNA molecule, a physical document, or any other binary representation of a real world object.
  • The term “instructions” shall mean a set of digital data containing steps to be performed by a computing device. Examples of “instructions” include, without limitation, a computer program, macro, or remote procedure call that is executed when an event occurs (such as detection of an input data-element 2 that has a high probability of falling within a particular category). For the purposes of this disclosure, “instructions” can include an indication that no operation is to take place, which can be useful when an event that is expected, and has a high likelihood of being harmless, has been detected, as it indicates that such event can be ignored. In certain preferred embodiments, “instructions” may implement state machines.
  • The term “machine readable storage” shall mean a medium containing random access or read-only memory that is adapted to be read from and/or written to by a computing device having a processor. Examples of machine readable storage shall include, without limitation, random access memory in a computer; random access memory or read only memory in a network device such as a router switch, gateway, network storage device, network security device, or other network device; a CD or DVD formatted to be readable by a hardware device; a thumb drive or memory card formatted to be readable by a hardware device; a computer hard drive; a tape adapted to be readable by a computer tape drive; or other media adapted to store data that can be read by a computer having appropriate hardware and software.
  • The term “network” 3 or “computer network” shall mean an electronic communications network adapted to enable one or more computing devices to communicate by wired or wireless signals. Examples of networks 3 include, but are not limited to, local area networks (LANs), wide area networks (WANs) such as the Internet, wired TCP and similar networks, wireless networks (including without limitation wireless networks conforming to IEEE 802.11 and the Bluetooth standards), and any other combination of hardware, software, and communications capabilities adapted to allow digital communication between computing devices.
  • The term “operably connected” shall mean connected either directly or indirectly by one or more cable, wired network, or wireless network connections in such a way that the operably connected components are able to communicate digital data from one to another.
  • The term “output” shall mean to render (or cause to be rendered) to a human-readable display such as a computer or handheld device screen, to write to (or cause to be written to) a digital file or database, to print (or cause to be printed), or to otherwise generate (or cause to be generated) a copy of information in a non-transient form. The term “output” shall include creation and storage of digital, visual and sound-based representations of information.
  • The term “server” shall mean a computing device adapted to be operably connected to a network 3 such that it can receive and/or send data to other devices operably connected to the same network, or service requests from such devices. A server has at least one processor and at least one machine-readable storage medium operably connected to that processor, such that the processor can read data from that machine-readable storage.
  • The term “system” shall mean a plurality of components adapted and arranged as indicated. The meanings and definitions of other terms used herein shall be apparent to those of ordinary skill in the art based upon the following disclosure.
  • FIG. 3 is a block diagram illustrating components of an embodiment of a data indexing system, in accordance with certain embodiments of the invention. As shown, such an embodiment may comprise block-stores 4 stored on distributed devices 5. The distributed devices 5 may be adapted to communicate via a network 3. A distributed device 5 may comprise a computing device, such as a computer or a smart phone, which can share data with other distributed devices 5 via a network 3. Data-streams 1 may be transmitted and received from the distributed devices 5. Each of the data-streams 1 may comprise data-elements 2, which may be transmitted via the network 3. The block-stores 4 may store the data-elements 2 of the received data-streams 1. Such stored data-elements 2 may comprise digital copies of the transmitted data-elements 2. The data-elements 2 may comprise unstructured data.
  • The stored data-elements 2 may be allocated to data-blocks 6 of a block-store 1, as illustrated in FIG. 3. Such block-allocated data-elements 2 may be logically grouped. Each of the block-stores 4 may comprise one or more data-blocks 6. In an embodiment, each data-block 6 may comprise the stored data-elements 2 of only one of the received data-streams 1. In addition, the data-blocks 6 of a single data-stream 1 may be logically grouped. In an embodiment, a block-identifier 7 may be assigned to each of the data-blocks 6. Such a block-identifier 7 may be globally unique. The data-blocks 6 may comprise events 8. The block-allocated data-elements 2 may be further allocated to the events 8 of the data-blocks 6. Such event-allocated data-elements 2 may be logically grouped. Each of the data-blocks 6 may comprise one or more events 8. Each of the events 8 may comprise the block-allocated data-elements 2 of the corresponding data-block 6. In an embodiment, these event-allocated data-elements 2 may comprise the stored data-elements 2 of only one of the received data-streams 1.
  • FIG. 4 is a flowchart illustrating steps of an embodiment of a data indexing method, in accordance with certain embodiments of the invention. As shown, such an embodiment may comprise the step of receiving 401 data-streams 1. The data-streams 1 may be received from a single distributed device 5 or from a plurality of distributed devices 5. The distributed devices 5 may be connected via a network 3. Each of the data-streams 1 may comprise data-elements 2. The method may include the step of storing 402 the data-elements 2 of the received data-streams 1. The stored data-elements 2 may be stored in block-stores 4. The block-stores 4 may be stored on a single distributed device 5 or across a plurality of distributed devices 5. Such distributed devices 5 may be connected via a network 3.
  • Further, the method may include the step of allocating 403 the stored data-elements 2 to data-blocks 6 of the block-stores 4. Each of the block-stores 4 may comprise one or more data-blocks 6. In an embodiment, each data-block 6 may comprise the stored data-elements 2 of only one of the received data-streams 1. The data-blocks 6 of a single data-stream 1 may be logically grouped together. Each of the data-blocks 6 may be read and written as a single unit. In addition, the method may include allocating 404 the block-allocated data-elements 2 to events 8 of the data-blocks 6. Each of the data-blocks 6 may comprise one or more events 8. Each of the events 8 may comprise the block-allocated data-elements 2 of the corresponding data-block 6.
  • In addition, the method may include the step of splitting 405 the event-allocated data-elements 2 into terms 9. Further, the method may include the step of calculating 406 a frequency 10 of each term 9 within each of the event 8, and the step of calculating 407 block-level term frequency statics or data 11 for the event-allocated data-elements 2 that are stored in the corresponding data-block 6 based on the calculated term frequencies 10. The method may also include the step of generating 408 tree index structures 12 for the event-allocated data-elements 2 based on the block-level term frequency data 11. The tree index structures 12 may comprise Y-tree index structures. The Y-tree index structures 12 may use the terms 9 as keys 13. The method may further include the step of performing 409 a search query for the data-elements 2 over all of the tree index structures 12.
  • While the invention has been particularly shown and described with reference to an embodiment thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention.

Claims (10)

1-29. (canceled)
30. A method for indexing data, comprising the steps of:
allocating stored data-elements to data-blocks of block-stores, wherein the stored data-elements are stored in the block-stores, wherein the stored data-elements are allocated via one or more processors to the data-blocks;
further allocating the block-allocated data-elements to events of the data-blocks, wherein each of the data-blocks comprise one or more events, wherein each of the events comprise the block-allocated data-elements of the corresponding data-block, wherein the block-allocated data-elements are allocated via the one or more processors to the events; and,
splitting the event-allocated data-elements into terms, wherein the event-allocated data-elements are split via the one or more processors into the terms, wherein the terms are adapted to be used in tree index structures as keys, wherein the tree index structures are adapted to be generated for the event-allocated data-elements, wherein the tree index structures are generated via the one or more processors.
31. The method of claim 30, further comprising the step of:
receiving data-streams, wherein the data-streams comprise streamed data-elements, wherein the streamed data-elements are stored via one or more processors in block-stores, wherein the streamed data-elements are said stored data-elements.
32. The method of claim 30, further comprising the step of:
calculating a term frequencies of each term in each of the events, wherein the term frequencies are calculated via the one or more processors;
calculating block-level term frequency data for the event-allocated data-elements stored in the corresponding data-block based on the term frequencies, wherein the block-level term frequency data is calculated via the one or more processors; and,
generating the tree index structures for the event-allocated data-elements based on the block-level term frequency data, wherein the terms are used in the tree index structures as keys, wherein the tree index structures are calculated via the one or more processors.
33. A system for indexing data, comprising:
block-stores adapted to store data-elements;
data-blocks of the block-stores, the stored data-elements being allocated via one or more processors to the data-blocks;
events of the data-blocks, the block-allocated data-elements being further allocated via the one or more processors to the events of the data-blocks, each of the data-blocks comprising one or more events, each of the events comprising the block-allocated data-elements of a corresponding data-block; and,
terms generated via the one or more processors by splitting the event-allocated data-elements, wherein the terms are adapted to be used in tree index structures as keys, wherein the tree index structures are adapted to be generated for the event-allocated data-elements, wherein the tree index structures are generated via the one or more processors.
34. The system of claim 33, wherein the data-elements comprise streamed data-elements, wherein the data-elements are received via data-streams.
35. The system of claim 33, further comprising:
term frequencies calculated via the one or more processors based on the frequency of each term in each of the event;
block-level term frequency data calculated via the one or more processors for the event-allocated data-elements that are stored in a corresponding data-block, the block-level term frequency data being based on the term frequencies; and,
the tree index structures generated via the one or more processors for the event-allocated data-elements based on the block-level term frequency data, the terms being used in the tree index structures as keys.
36. A non-transitory computer readable medium having computer readable instructions stored thereon for execution by a processor, wherein the instructions on the non-transitory computer readable medium are adapted to enable a computing device to:
allocate stored data-elements to data-blocks of block-stores;
further allocate the block-allocated data-elements to events of the data-blocks, wherein each of the data-blocks comprise one or more events, wherein each of the events comprise the block-allocated data-elements of the corresponding data-block; and,
split the event-allocated data-elements into terms, wherein the terms are adapted to be used in tree index structures as keys, wherein the tree index structures are adapted to be generated for the event-allocated data-elements, wherein the tree index structures are generated via the one or more processors.
37. The non-transitory computer readable medium of claim 36, wherein the instructions on the non-transitory computer readable medium are further adapted to enable a computing device to:
receive data-streams, wherein the data-streams comprise streamed data-elements, wherein the streamed data-elements are stored via one or more processors in block-stores, wherein the streamed data-elements are said stored data-elements.
38. The non-transitory computer readable medium of claim 36, wherein the instructions on the non-transitory computer readable medium are further adapted to enable a computing device to:
calculate a term frequencies of each term in each of the events, wherein the term frequencies are calculated via the one or more processors;
calculate block-level term frequency data for the event-allocated data-elements stored in the corresponding data-block based on the term frequencies, wherein the block-level term frequency data is calculated via the one or more processors; and,
generate the tree index structures for the event-allocated data-elements based on the block-level term frequency data, wherein the terms are used in the tree index structures as keys, wherein the tree index structures are calculated via the one or more processors.
US14/836,400 2012-07-30 2015-08-26 System and Method for Indexing Streams Containing Unstructured Text Data Abandoned US20150363446A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US14/836,400 US20150363446A1 (en) 2012-07-30 2015-08-26 System and Method for Indexing Streams Containing Unstructured Text Data

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US201261677171P 2012-07-30 2012-07-30
US13/832,767 US9262511B2 (en) 2012-07-30 2013-03-15 System and method for indexing streams containing unstructured text data
US14/836,400 US20150363446A1 (en) 2012-07-30 2015-08-26 System and Method for Indexing Streams Containing Unstructured Text Data

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US13/832,767 Continuation US9262511B2 (en) 2012-07-30 2013-03-15 System and method for indexing streams containing unstructured text data

Publications (1)

Publication Number Publication Date
US20150363446A1 true US20150363446A1 (en) 2015-12-17

Family

ID=49995921

Family Applications (2)

Application Number Title Priority Date Filing Date
US13/832,767 Active 2033-03-28 US9262511B2 (en) 2012-07-30 2013-03-15 System and method for indexing streams containing unstructured text data
US14/836,400 Abandoned US20150363446A1 (en) 2012-07-30 2015-08-26 System and Method for Indexing Streams Containing Unstructured Text Data

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US13/832,767 Active 2033-03-28 US9262511B2 (en) 2012-07-30 2013-03-15 System and method for indexing streams containing unstructured text data

Country Status (1)

Country Link
US (2) US9262511B2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108062691A (en) * 2017-12-26 2018-05-22 广州酷狗计算机科技有限公司 Resource allocation methods, device, server and storage medium

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10019510B2 (en) * 2014-07-29 2018-07-10 Ca, Inc. Indexing and searching log records using templates index and attributes index
US20160306810A1 (en) * 2015-04-15 2016-10-20 Futurewei Technologies, Inc. Big data statistics at data-block level
US10296655B2 (en) 2016-06-24 2019-05-21 International Business Machines Corporation Unbounded list processing
US10339014B2 (en) 2016-09-28 2019-07-02 Mcafee, Llc Query optimized distributed ledger system
US10838931B1 (en) * 2017-04-28 2020-11-17 EMC IP Holding Company LLC Use of stream-oriented log data structure for full-text search oriented inverted index metadata
US10326732B1 (en) * 2018-10-08 2019-06-18 Quest Automated Services, LLC Automation system with address generation
CN111125120B (en) * 2019-12-30 2023-08-18 广州数锐智能科技有限公司 Stream data-oriented rapid indexing method, device, equipment and storage medium
CN112203122B (en) * 2020-10-10 2024-01-26 腾讯科技(深圳)有限公司 Similar video processing method and device based on artificial intelligence and electronic equipment

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6681295B1 (en) * 2000-08-31 2004-01-20 Hewlett-Packard Development Company, L.P. Fast lane prefetching
US20040181544A1 (en) * 2002-12-18 2004-09-16 Schemalogic Schema server object model
US7133400B1 (en) * 1998-08-07 2006-11-07 Intel Corporation System and method for filtering data
US20070214172A1 (en) * 2005-11-18 2007-09-13 University Of Kentucky Research Foundation Scalable object recognition using hierarchical quantization with a vocabulary tree
US20080205770A1 (en) * 2007-02-26 2008-08-28 Microsoft Corporation Generating a Multi-Use Vocabulary based on Image Data
US20090319365A1 (en) * 2006-09-13 2009-12-24 James Hallowell Waggoner System and method for assessing marketing data
US20120221572A1 (en) * 2011-02-24 2012-08-30 Nec Laboratories America, Inc. Contextual weighting and efficient re-ranking for vocabulary tree based image retrieval
US20150142842A1 (en) * 2005-07-25 2015-05-21 Splunk Inc. Uniform storage and search of events derived from machine data from different sources

Family Cites Families (44)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5668897A (en) * 1994-03-15 1997-09-16 Stolfo; Salvatore J. Method and apparatus for imaging, image processing and data compression merge/purge techniques for document image databases
US6072830A (en) * 1996-08-09 2000-06-06 U.S. Robotics Access Corp. Method for generating a compressed video signal
JP3213697B2 (en) * 1997-01-14 2001-10-02 株式会社ディジタル・ビジョン・ラボラトリーズ Relay node system and relay control method in the relay node system
US6623529B1 (en) * 1998-02-23 2003-09-23 David Lakritz Multilingual electronic document translation, management, and delivery system
US6100906A (en) * 1998-04-22 2000-08-08 Ati Technologies, Inc. Method and apparatus for improved double buffering
US6122640A (en) * 1998-09-22 2000-09-19 Platinum Technology Ip, Inc. Method and apparatus for reorganizing an active DBMS table
US7089530B1 (en) * 1999-05-17 2006-08-08 Invensys Systems, Inc. Process control configuration system with connection validation and configuration
US6349310B1 (en) * 1999-07-06 2002-02-19 Compaq Computer Corporation Database management system and method for accessing rows in a partitioned table
JP2003536329A (en) * 2000-06-02 2003-12-02 コーニンクレッカ フィリップス エレクトロニクス エヌ ヴィ Method and system for reading a block from a storage medium
US8396859B2 (en) * 2000-06-26 2013-03-12 Oracle International Corporation Subject matter context search engine
US20020083183A1 (en) * 2000-11-06 2002-06-27 Sanjay Pujare Conventionally coded application conversion system for streamed delivery and execution
US7043637B2 (en) 2001-03-21 2006-05-09 Microsoft Corporation On-disk file format for a serverless distributed file system
CA2451208A1 (en) * 2001-06-21 2003-01-03 Paul P. Vagnozzi Database indexing method and apparatus
US20040111677A1 (en) * 2002-12-04 2004-06-10 International Business Machines Corporation Efficient means for creating MPEG-4 intermedia format from MPEG-4 textual representation
US6912646B1 (en) 2003-01-06 2005-06-28 Xilinx, Inc. Storing and selecting multiple data streams in distributed memory devices
US20040167906A1 (en) * 2003-02-25 2004-08-26 Smith Randolph C. System consolidation tool and method for patching multiple servers
US7146361B2 (en) * 2003-05-30 2006-12-05 International Business Machines Corporation System, method and computer program product for performing unstructured information management and automatic text analysis, including a search operator functioning as a Weighted AND (WAND)
US7346220B2 (en) * 2003-07-23 2008-03-18 Seiko Epson Corporation Method and apparatus for reducing the bandwidth required to transmit image data
WO2005114484A1 (en) * 2004-05-19 2005-12-01 Metacarta, Inc. Systems and methods of geographical text indexing
US20070214133A1 (en) * 2004-06-23 2007-09-13 Edo Liberty Methods for filtering data and filling in missing data using nonlinear inference
US20060101045A1 (en) * 2004-11-05 2006-05-11 International Business Machines Corporation Methods and apparatus for interval query indexing
US7831908B2 (en) * 2005-05-20 2010-11-09 Alexander Vincent Danilo Method and apparatus for layout of text and image documents
WO2007033338A2 (en) * 2005-09-14 2007-03-22 O-Ya!, Inc. Networked information indexing and search apparatus and method
JP2007115293A (en) * 2005-10-17 2007-05-10 Toshiba Corp Information storage medium, program, information reproducing method, information reproducing apparatus, data transfer method, and data processing method
JP2007207328A (en) * 2006-01-31 2007-08-16 Toshiba Corp Information storage medium, program, information reproducing method, information reproducing device, data transfer method, and data processing method
US7756805B2 (en) * 2006-03-29 2010-07-13 Alcatel-Lucent Usa Inc. Method for distributed tracking of approximate join size and related summaries
US8937911B2 (en) 2006-08-31 2015-01-20 Futurewei Technologies, Inc. Method and system for sharing resources in a wireless communication network
US8370732B2 (en) 2006-10-20 2013-02-05 Mixpo Portfolio Broadcasting, Inc. Peer-to-portal media broadcasting
US8655939B2 (en) * 2007-01-05 2014-02-18 Digital Doors, Inc. Electromagnetic pulse (EMP) hardened information infrastructure with extractor, cloud dispersal, secure storage, content analysis and classification and method therefor
US20080189429A1 (en) 2007-02-02 2008-08-07 Sony Corporation Apparatus and method for peer-to-peer streaming
US7443321B1 (en) 2007-02-13 2008-10-28 Packeteer, Inc. Compression of stream data using a hierarchically-indexed database
US8504553B2 (en) * 2007-04-19 2013-08-06 Barnesandnoble.Com Llc Unstructured and semistructured document processing and searching
US8234282B2 (en) * 2007-05-21 2012-07-31 Amazon Technologies, Inc. Managing status of search index generation
US20110035662A1 (en) * 2009-02-18 2011-02-10 King Martin T Interacting with rendered documents using a multi-function mobile device, such as a mobile phone
US20090226872A1 (en) * 2008-01-16 2009-09-10 Nicholas Langdon Gunther Electronic grading system
US8046222B2 (en) * 2008-04-16 2011-10-25 Google Inc. Segmenting words using scaled probabilities
US8078655B2 (en) * 2008-06-04 2011-12-13 Microsoft Corporation Generation of database deltas and restoration
US8352519B2 (en) * 2008-07-31 2013-01-08 Microsoft Corporation Maintaining large random sample with semi-random append-only operations
US8229916B2 (en) * 2008-10-09 2012-07-24 International Business Machines Corporation Method for massively parallel multi-core text indexing
US8620884B2 (en) * 2008-10-24 2013-12-31 Microsoft Corporation Scalable blob storage integrated with scalable structured storage
US8325724B2 (en) * 2009-03-31 2012-12-04 Emc Corporation Data redistribution in data replication systems
US8560530B2 (en) 2010-05-17 2013-10-15 Buzzmetrics, Ltd. Methods, apparatus, and articles of manufacture to rank web site influence
US8473778B2 (en) 2010-09-08 2013-06-25 Microsoft Corporation Erasure coding immutable data
US20130246431A1 (en) * 2011-12-27 2013-09-19 Mcafee, Inc. System and method for providing data protection workflows in a network environment

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7133400B1 (en) * 1998-08-07 2006-11-07 Intel Corporation System and method for filtering data
US6681295B1 (en) * 2000-08-31 2004-01-20 Hewlett-Packard Development Company, L.P. Fast lane prefetching
US20040181544A1 (en) * 2002-12-18 2004-09-16 Schemalogic Schema server object model
US20150142842A1 (en) * 2005-07-25 2015-05-21 Splunk Inc. Uniform storage and search of events derived from machine data from different sources
US20070214172A1 (en) * 2005-11-18 2007-09-13 University Of Kentucky Research Foundation Scalable object recognition using hierarchical quantization with a vocabulary tree
US20090319365A1 (en) * 2006-09-13 2009-12-24 James Hallowell Waggoner System and method for assessing marketing data
US20080205770A1 (en) * 2007-02-26 2008-08-28 Microsoft Corporation Generating a Multi-Use Vocabulary based on Image Data
US20120221572A1 (en) * 2011-02-24 2012-08-30 Nec Laboratories America, Inc. Contextual weighting and efficient re-ranking for vocabulary tree based image retrieval

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Christopher Jermaine, Anindya Datta and Edward Omiecinski - "A Novel Index Supporting High Volume Data Warehouse Insertions" Proceedings of the 25th VLDB Conference, Edinburgh, Scotland, 1999. - Pages: 235-246 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108062691A (en) * 2017-12-26 2018-05-22 广州酷狗计算机科技有限公司 Resource allocation methods, device, server and storage medium

Also Published As

Publication number Publication date
US9262511B2 (en) 2016-02-16
US20140032568A1 (en) 2014-01-30

Similar Documents

Publication Publication Date Title
US9262511B2 (en) System and method for indexing streams containing unstructured text data
US20200349205A1 (en) Search infrastructure
US8738572B2 (en) System and method for storing data streams in a distributed environment
US9959347B2 (en) Multi-layer search-engine index
KR101792168B1 (en) Managing storage of individually accessible data units
US8706710B2 (en) Methods for storing data streams in a distributed environment
US9639542B2 (en) Dynamic mapping of extensible datasets to relational database schemas
US11294920B2 (en) Method and apparatus for accessing time series data in memory
US20120173510A1 (en) Priority hash index
CN107368527B (en) Multi-attribute index method based on data stream
US20080082554A1 (en) Systems and methods for providing a dynamic document index
CN109766318B (en) File reading method and device
KR20160067289A (en) Cache Management System for Enhancing the Accessibility of Small Files in Distributed File System
CN112262379A (en) Storing data items and identifying stored data items
Asadi et al. Fast candidate generation for two-phase document ranking: Postings list intersection with Bloom filters
Zhang et al. Efficient search in large textual collections with redundancy
CN109213760A (en) The storage of high load business and search method of non-relation data storage
Zhu et al. Towards update-efficient and parallel-friendly content-based indexing scheme in cloud computing
Shi et al. Research and optimization of massive small file processing performance based on Ceph
Shen et al. A Distributed Caching Scheme for Improving Read-write Performance of HBase
CN113889199A (en) Search engine and search method based on compound
CN116680300A (en) Reverse index query optimization method and device based on Es
Wei et al. An Effective Approach to Temporally Anchored Information Retrieval

Legal Events

Date Code Title Description
AS Assignment

Owner name: RED LAMBDA, INC., FLORIDA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:LEKO, ADAM;BIRD, ROBERT;WHITLOCK, MATTHEW;SIGNING DATES FROM 20130312 TO 20130313;REEL/FRAME:036428/0737

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION