US20060101060A1 - Similarity search system with compact data structures - Google Patents

Publication number
US20060101060A1
Authority
US
United States
Prior art keywords
data
image
region
images
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
US11/219,822
Other versions
US7966327B2 (en)
Inventor
Kai Li
Qin Lv
Moses Charikar
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Princeton University
Original Assignee
Kai Li
Qin Lv
Moses Charikar
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kai Li, Qin Lv, Moses Charikar
Priority to US11/219,822 (patent US7966327B2)
Publication of US20060101060A1
Application granted
Publication of US7966327B2
Assigned to THE TRUSTEES OF PRINCETON UNIVERSITY reassignment THE TRUSTEES OF PRINCETON UNIVERSITY ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: LI, KAI, CHARIKAR, MOSES, LV, Qin
Assigned to ENERGY, UNITED STATES DEPARTMENT OF reassignment ENERGY, UNITED STATES DEPARTMENT OF CONFIRMATORY LICENSE (SEE DOCUMENT FOR DETAILS). Assignors: PRINCETON UNIVERSITY
Expired - Fee Related
Adjusted expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/90335Query processing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content

Definitions

  • the present invention relates to a content-addressable and searchable storage system to provide effective capabilities to access, search, explore and manage massive amounts of diverse feature-rich data.
  • Feature-rich data are typically sensor data such as audio, images, video, genomics, or scientific data; they are noisy and high-dimensional.
  • Current file systems are designed for named text files, and they do not have mechanisms to manage feature-rich data.
  • directories emulate the management of paper files and have been helpful in managing paper-like documents.
  • Some recent file systems attempt to provide content-based search tools, but they are limited to exact searches for text and annotations of non-text data. Manual annotation, however, is not practical for feature-rich data because such data are massive, noisy and high dimensional.
  • Pattern matching tools are already integral components of modern operating systems. However, such tools are limited to exploring text documents or viewing simple images; they are not useful for exploring noisy, high-dimensional data.
  • search engines such as Google index documents by building an inverted index.
  • a number of data structures have been devised for nearest neighbor searching such as R-Trees, k-d trees, ss-trees, and SR-trees. These are capable of supporting similarity queries, but they do not scale satisfactorily to large high-dimensional data sets.
  • Several constructions of nearest neighbor search data structures have recently been devised in the theory community, but practical implementations of those theoretical ideas for high dimensional data do not exist yet.
  • a distance function on pairs of data items can be estimated by only examining the sketches of the data items.
  • the existence of a sketch depends crucially on the function one desires to estimate.
  • the successful construction of a small sketch as the metadata to estimate the distance between two points in high-dimensional space has significant implications on solving the efficient similarity search problem because it can provide significant savings in space and running time.
  • RBIR: region-based image retrieval.
  • One-to-one match systems like Windsurf and WALRUS consider matching one set of regions to another set of regions and require that each region can only be matched once.
  • Windsurf uses the Hungarian Algorithm to assign regions based on region distance. Region size is then used to adjust the similarity of two matching regions. Image similarity is defined as the sum of the adjusted region similarities.
  • One-to-one match assumes good image segmentation, so that there is good correspondence between the regions of two similar images. But current segmentation techniques are not perfect and regions do not always correspond to objects. Moreover, it is hard to define an optimal segmentation, as one image may need different segmentations when compared to different images.
  • EMD match systems use similarity measures based on the Earth Mover's Distance (EMD). Although EMD is a good measure for region matching, its effectiveness is closely linked to the underlying distance function used for pairs of regions as well as the weight given to each region. Since these systems directly use the region distance function as the ground distance for EMD and use normalized region size as the region weight, this creates problems such as regions being weighted inappropriately. As a result, these systems do not use EMD very well.
  • the present invention disclosed and claimed herein is a system and method for a content-addressable and searchable storage system for managing and exploring massive amounts of feature-rich data such as images, audio or scientific data.
  • the system comprises a segmentation and feature extraction unit for segmenting data corresponding to an object into a plurality of data segments and generating a feature vector for each data segment; a sketch construction component for converting a feature vector into a compact bit-vector corresponding to the object; a similarity index comprising a plurality of compact bit-vectors corresponding to a plurality of objects; and an index insertion component for inserting a compact bit-vector corresponding to an object into the similarity index.
  • the system may further comprise an indexing unit for identifying a candidate set of objects from said similarity index based upon a compact bit-vector corresponding to a query object. Still further, the system may additionally comprise a similarity ranking component for ranking objects in said candidate set by estimating their distances to the query object.
  • a method of comparing a search image to a first plurality of stored images comprises the steps of segmenting the search image into a plurality of search image regions; extracting a region feature vector from each of the search image regions; converting each of the region feature vectors into a region bit vector; storing the region bit vectors; calculating a region weight for each of the search image regions; embedding all of the region bit vectors and region weights into a composite search image feature vector; storing the composite search image feature vector; and selecting a second plurality of images from the database using the composite search image feature vector, wherein the second plurality of images comprises a subset of the first plurality of images.
  • a region's weight comprises a normalized square root of the region's size.
  • a method in accordance with a preferred embodiment of the invention may further comprise the steps of calculating an image dissimilarity match between the search image and each of the second plurality of images using the region bit vectors of the search image; and selecting a third plurality of images based upon the image dissimilarity matches, wherein the third plurality of images comprises a subset of the second plurality of images.
  • the image dissimilarity match may comprise an Earth Mover's Distance using a square root region size as a region weight.
  • the image dissimilarity match also may comprise an Earth Mover's Distance using a thresholded region distance.
  • a method in accordance with another preferred embodiment of the invention further comprises the steps of calculating a distance between two of the plurality of regions by XOR-ing their region bit vectors; comparing the distance to a threshold; selecting the distance as a region ground distance function if the distance is less than the threshold; selecting the threshold as the region ground distance function if the distance is greater than the threshold; calculating an image dissimilarity match between the search image and each of the second plurality of images using the region bit vectors and the ground distance function; and selecting a third plurality of images based upon the image dissimilarity matches, wherein the third plurality of images comprises a subset of the second plurality of images.
  • FIG. 1 is a block diagram of a content-addressable and searchable storage system architecture in accordance with an embodiment of the present invention.
  • FIG. 2 is a block diagram of a similarity search engine in accordance with an embodiment of the present invention.
  • FIG. 3 is a block diagram of the main components and method of inserting or querying an image in accordance with an embodiment of the present invention.
  • the present invention provides a content-addressable and searchable storage system.
  • the data in the system preferably is organized dynamically based on demand.
  • the system has a similarity search engine for searching massive amounts of feature-rich, noisy, high-dimensional data.
  • the general system architecture of a preferred embodiment of the invention will be described with reference to FIG. 1 .
  • the general system architecture has four main data paths: objects 152 ; file I/O's 144 ; annotations 142 , and search queries 102 .
  • Objects can be addressed directly through a content-addressed store 150 that implements a content-addressed abstraction.
  • a client can store a data object into the content addressed store 150 and get its fingerprint or hash back as its “id.”
  • the content addressed store allows clients to access segments of an object.
  • the storage system uses fingerprints as the uniform mechanism to address both data objects and segments.
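  • The content-addressed abstraction described above can be illustrated with a minimal sketch. It is not the system's implementation: the class name `ContentAddressedStore`, the choice of SHA-256 as the fingerprint, the in-memory dictionary, and fixed-size segmentation are all assumptions made for illustration.

```python
import hashlib


class ContentAddressedStore:
    """Toy in-memory store keyed by the fingerprint of the stored bytes."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        # The fingerprint of the content doubles as the object's id.
        # SHA-256 is an assumed choice; the source does not name a hash.
        fingerprint = hashlib.sha256(data).hexdigest()
        self._blobs[fingerprint] = data
        return fingerprint

    def get(self, fingerprint: str) -> bytes:
        return self._blobs[fingerprint]

    def put_segments(self, data: bytes, segment_size: int) -> list:
        # Segments are addressed by the same mechanism as whole objects.
        # Fixed-size segments are an assumption for this sketch.
        return [self.put(data[i:i + segment_size])
                for i in range(0, len(data), segment_size)]


store = ContentAddressedStore()
object_id = store.put(b"example sensor data")
segment_ids = store.put_segments(b"example sensor data", 8)
```

  • Because the id is derived from the content, storing the same bytes twice yields the same id, and both objects and segments are retrieved through one uniform lookup.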
  • Clients can perform regular file I/Os via a standard file system interface implemented by the Enhanced File System 140 .
  • the standard file system interface allows existing applications to perform traditional file I/Os to input and output data.
  • As the data enters the system it will be delivered to a Segmentation and Feature Extraction Unit 130 (data type dependent) which will segment the data into multiple segments, perform feature extraction, and then pass the extracted multi-dimensional vectors to the Similarity Search Engine 120 to construct compact metadata.
  • Attributes and user-defined annotations can be associated with individual files and directories.
  • the Enhanced File System 140 implements the mechanism for the associations and also relational searches for desired attributes and annotations. Attributes are generated automatically by the input data processing units as the data enters into the system, while annotations are provided by the users.
  • Query Processing and Interface 110 implements a query interface for users to perform searches and data exploration in the system. It allows users to provide sophisticated similarity search queries that include query data and a search range. It also implements a user interface for users to browse and explore result data objects.
  • the query data will be processed and converted into multi-dimensional feature vectors and then passed to the similarity search engine.
  • a corresponding Segmentation and Feature Extraction 130 unit performs the conversion, while the Similarity Search Engine 120 implements the core similarity search capability and returns a list of data objects ranked by similarity.
  • the proposed system encourages users to use the content-based similarity search capability to search and manage massive amounts of feature-rich data instead of using the traditional file system interface. It may be useful to combine the similarity search capability with a search range constrained by attributes (such as time, size, data type, owner, and so on) and user-defined annotations. Users can find their data conveniently and quickly by meaningful contents instead of meaningless file names (such as those created by today's digital cameras). The following three examples illustrate the usefulness of such a system.
  • the first example is managing digital images.
  • Consider a user who has millions of digital photos and would like to find as many photos with waterfalls as possible from the past three years.
  • the user's digital photo collections are stored under meaningless file names automatically generated by a digital camera.
  • the user first looks through all her directories created during the past three years and scans through each in image thumbnail mode. This process is extremely inefficient and very time consuming.
  • With the proposed system, the user could upload her photos without needing to know file names and directory names.
  • When the user needs to find photos with waterfalls, she would first find one photo with a waterfall and then issue a query to find all images similar to the query data, specifying a search range from 2001 to 2004. The system would then present her desired photos.
  • the second example is managing audio data.
  • a search query of this type could be done either by typing “mobile storage” into a text field, or saying the words “mobile storage” into a microphone. The system then would bring up all of the audio files it could find that contain the relevant speech recognized words, and also all regular text files (lecture notes, meeting minutes, emails, etc.) that contain the words either in the filename or in the content.
  • Another example query could be of the form “Display all audio segments that contain simple conversations (one or two clear speakers) over a train station background.” The system would search for audio textures matching the query, and present them in a rank ordered listing with dates and origin information.
  • the third example is data exploration of genomic data.
  • Consider a biologist who has just identified a new gene that seems to be involved in cancer progression.
  • the biologist ran a microarray experiment in which she produced a pattern of expression for this gene (and others in the genome) over a large set of conditions.
  • She wants to identify any known genes that may have the same pattern of expression, so she queries the system with her experimental data and the name of the gene of interest.
  • the biologist will then see all genes with similar expression patterns to the gene of interest over any subset of experiments. This may give her clues about the function of her gene of interest in carcinogenesis, and provide her leads for design of further experiments.
  • a preferred embodiment of the invention constructs sketches of the data. These sketches are tiny data structures that can be used to estimate properties of the original data. For example, a distance function on pairs of data items could be estimated by only examining the sketches of the data items.
  • Sketch constructions have been developed for a number of purposes, including estimating similarity of sets, estimating distinct elements and vector norms, and estimating string edit distance. Sketch constructions can be derived from rounding techniques used in approximation algorithms. Many sketch constructions for estimating similarity and distances can be viewed as embeddings (approximate distance preserving mappings) from the data points to points in a normed space, usually L1 or L2. Once such a mapping is obtained, sketching techniques for L1 or L2 can be applied.
  • the present invention speeds up similarity searches and maintains the similarity search quality while substantially reducing the metadata size.
  • a sketch-based indexing system also may be used for efficient similarity searches.
  • a similarity search engine 120 in accordance with a preferred embodiment of the invention is shown in FIG. 2.
  • the similarity search engine 120 works with feature vectors and client-defined distance functions.
  • the similarity search engine has two main operations: data input and similarity searching.
  • Input data 202 enters a segmentation and feature extraction unit 210 , depending on its data type.
  • the segmentation and feature extraction unit 210 segments the input data 202 and generates a feature vector for each segment.
  • Each piece of input data is then represented by a group of feature vectors 212 .
  • the feature vectors 212 have a client-defined distance function.
  • the sketch construction component 222 converts a feature vector into a compact bit-vector or sketch.
  • the sketches are then passed to the index insertion component 224 which inserts them into a similarity index 226 .
  • the query data 204 is first passed to a specific segmentation and feature extraction unit 240 , depending on its data type.
  • the segmentation and feature extraction unit 240 may be the same unit as segmentation and feature extraction unit 210 or may be a different unit.
  • the segmentation and feature extraction unit 240 will segment the query data 204 and generate a set of feature vectors 242.
  • the feature vectors 242 are passed to the sketch construction component 230 to convert them into a group of sketches.
  • the sketch construction component 230 may be the same component as sketch construction component 222 or may be a different sketch construction component.
  • the indexing unit 228 looks up the similarity index 226 to find a candidate set of objects.
  • the candidate set may include objects that are not similar to the query object, but it misses very few objects that are similar.
  • the similarity ranking component 232 will rank the objects in the candidate set by estimating their distances to the query object. It will filter out the objects whose distances to the query object are beyond a certain threshold.
  • the use of sketches in the similarity search engine achieves high-speed similarity searches and reduces the metadata space requirement.
  • the sketch construction unit 222 , 230 converts a multi-dimensional feature vector into a sketch, a very small bit vector that can be used to estimate the distance function of the original data. Such a sketch can typically be 1/10 the feature vector size without losing similarity search quality.
  • the segmentation and feature extraction units 210, 240 are data dependent. Thus, the system of a preferred embodiment provides a convenient interface such that users can easily "plug in" new segmentation and feature extraction units. Examples of segmentation and feature extraction units will be described for image data, audio data and genomic data. Those skilled in the art will recognize that segmentation and feature extraction units for other types of data may be used with the invention.
  • FIG. 3 shows the main components of a preferred embodiment of an image similarity search method of the present invention and illustrates the steps an image goes through when it is inserted into the system, or is submitted as a query image.
  • the preferred embodiment incorporates a new region feature representation with a weighted L1 distance function and an improved Earth Mover's Distance ("EMD") match that will be referred to herein as "EMD*."
  • segmentation component 310 segments an input image into several homogeneous regions 312.
  • For each region, feature extraction component 320 extracts a 14-dimensional feature vector 322.
  • Each region preferably, but not necessarily, is represented by a simple feature-vector that includes two kinds of information about a region: color moments and bounding box information. Color moments are compact representations that have been shown to be only slightly worse in performance than high-dimensional color histograms.
  • the first three moments from each channel in the HSV color space are extracted, resulting in a nine-dimensional color vector.
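  • As an illustration of the nine-dimensional color vector, the following sketch computes three moments per HSV channel for the pixels of one region. The exact moment definitions are an assumption (mean, standard deviation, and signed cube root of the third central moment, a common convention); the function name `color_moments` and the pixel format are hypothetical. Hue is treated as an ordinary number here, ignoring its circular nature.

```python
import colorsys
import math


def color_moments(rgb_pixels):
    """Return a 9-dimensional vector: three moments for each HSV channel.

    rgb_pixels is a non-empty list of (r, g, b) tuples with components in [0, 1].
    """
    hsv = [colorsys.rgb_to_hsv(r, g, b) for r, g, b in rgb_pixels]
    n = len(hsv)
    features = []
    for channel in range(3):
        values = [p[channel] for p in hsv]
        mean = sum(values) / n
        variance = sum((v - mean) ** 2 for v in values) / n
        third = sum((v - mean) ** 3 for v in values) / n
        std = math.sqrt(variance)
        # Signed cube root keeps the third moment on the same scale as the others.
        skew = math.copysign(abs(third) ** (1.0 / 3.0), third)
        features.extend([mean, std, skew])
    return features
```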
  • a bounding box is the minimum rectangle covering a region.
  • a 5-dimensional vector is used to represent a region's bounding box information: (ln(r), ln(s), a, c_x, c_y).
  • more dimensions may be added to the feature vector.
  • shape information may be extracted from a region using known or new shape recognition methods and be added into the feature vector.
  • another level of segmentation by segmenting the regions into sub-regions also may be added to the system to provide more detailed information. This will change the feature representation from a group of feature vectors to a three-level tree of feature vectors. The two-level segmentation allows the implementation of the capability to query an object in addition to a whole image.
  • The bit vector conversion component then converts the 14-dimensional feature vector 322 into a region bit vector 332 using a thresholding and transformation algorithm. This results in a very compact representation of each region.
  • the thresholding and transformation algorithm preferably approximates weighted (and thresholded) L1 distance of real-valued feature vectors with Hamming distance of bit vectors.
  • the bit vector representation is much more compact than the real-valued feature vector representation, and it is also much faster to calculate Hamming distance of bit vectors (XORing bits) than weighted (and thresholded) L1 distance of feature vectors (floating point operations).
  • Bit vectors are generated from d-dimensional vectors such that the expected Hamming distance between two bit vectors produced is proportional to the weighted L1 distance between the corresponding vectors. To do this, a single bit is produced from each d-dimensional vector such that the probability that the bit differs for two vectors is proportional to their weighted L1 distance.
  • the required bit vectors are produced by repeating this process to produce several bits and concatenating them together. For example, suppose one wants to compute weighted L1 distance for d-dimensional vectors, where the i-th coordinate is in the range [l_i, h_i] and has weight w_i.
  • an (i, t) pair determines the value of one bit for each vector. To make the transformation consistent across all vectors, for each bit we generate, the same (i, t) pair must be applied to each vector. The process of generating (i, t) pairs is described in Algorithm 1. Here, N × K such pairs are generated, where N is the size of the final bit vector desired (after thresholding) and K is a parameter which will be determined later.
  • Algorithm 1 generates N × K (i, t) pairs, which give rise to N groups of K bits each. A single bit is produced from each group of K bits by applying a hash function to them.
  • the hash function could be XOR, or some other random hash function. This achieves the desired thresholding.
  • An implementation of the algorithm is shown here.
  • Algorithm 1 is the initializing process, where N × K random (i, t) pairs are generated. Then, for each feature vector, Algorithm 2 is called to convert the feature vector to an N-bit vector.
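  • The two-step process can be sketched as follows. This is a hedged reconstruction, not the patent's listing: the function names are hypothetical, and three details are assumptions chosen so that the probability of two vectors differing in a raw bit is proportional to their weighted L1 distance: dimension i is drawn with probability proportional to w_i × (h_i − l_i), t is drawn uniformly from [l_i, h_i], and the raw bit is 1 when the i-th coordinate is at least t. XOR is used as the hash over each group of K raw bits.

```python
import random


def generate_pairs(n_bits, k, lows, highs, weights, seed=0):
    """Initialization (in the spirit of Algorithm 1): draw n_bits * k (i, t) pairs."""
    rng = random.Random(seed)
    dims = list(range(len(lows)))
    # Assumption: dimension i is drawn with probability proportional to
    # w_i * (h_i - l_i), and t is uniform in [l_i, h_i].
    probs = [weights[i] * (highs[i] - lows[i]) for i in dims]
    pairs = []
    for _ in range(n_bits * k):
        i = rng.choices(dims, weights=probs)[0]
        pairs.append((i, rng.uniform(lows[i], highs[i])))
    return pairs


def to_bit_vector(vector, pairs, n_bits, k):
    """Conversion (in the spirit of Algorithm 2): map a feature vector to n_bits bits."""
    bits = 0
    for n in range(n_bits):
        group_bit = 0
        for i, t in pairs[n * k:(n + 1) * k]:
            # Each (i, t) pair yields one raw bit; XOR hashes the K raw bits into one.
            group_bit ^= 1 if vector[i] >= t else 0
        bits = (bits << 1) | group_bit
    return bits
```

  • The same `pairs` list must be reused for every vector so that the resulting bit vectors are comparable. With XOR over K raw bits, the chance that two vectors disagree on a hashed bit levels off toward one half as their distance grows, which is one way to see how hashing a group of bits produces a thresholding effect.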
  • the distance between two regions can be calculated efficiently by XORing their region bit vectors.
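  • A minimal sketch of this XOR-based distance, assuming each region bit vector is stored as a Python integer (the function name is hypothetical):

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two bit vectors (stored as ints) differ."""
    # XOR leaves a 1 exactly where the two vectors disagree.
    return bin(a ^ b).count("1")
```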
  • all the n region bit vectors along with their weights are embedded at embedding component 340 into a single image feature vector 342, such that the L1 distance between two images' embedded feature vectors approximates the EMD* between these two images.
  • the preferred image similarity measure EMD* is based on Earth Mover's Distance (EMD) or Transportation Metric, which is a flexible similarity measure between multidimensional distributions. Given two distributions represented by sets of weighted features and a distance function between pairs of features, EMD reflects the minimal amount of work needed to transform one distribution into another by moving distribution “mass” (weights) around.
  • EMD can be computed via (weighted) bipartite matching, but this is a relatively expensive operation.
  • Prior RBIR systems have used “EMD match”-based image similarity measures where the region distance function is used as the ground distance of EMD and normalized region size is used as region weight. However, these prior “EMD match”-based image similarity measures do not use EMD appropriately. In particular, the distance function and region weight information that are inputs to EMD are inappropriate.
  • a region's importance in an image is not proportional to that region's size. For example, a large region (e.g. front door) usually should not be considered much more important than a small region (e.g. a baby). Accordingly, the preferred embodiment uses the normalized square root of region size as each region's weight, which reduces the difference between small and large regions, and assigns suitable weights in most segmentation scenarios.
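  • The effect of square-root weighting can be seen in a small sketch. Normalizing by the sum of the square roots, so that the weights of one image sum to 1, is an assumption about what "normalized" means here; the function name is hypothetical.

```python
import math


def region_weights(region_sizes):
    """Normalized square-root weights: w_i = sqrt(s_i) / sum_j sqrt(s_j)."""
    roots = [math.sqrt(s) for s in region_sizes]
    total = sum(roots)
    return [r / total for r in roots]


weights = region_weights([9000, 1000])
```

  • For regions of 9000 and 1000 pixels, size-proportional weights would be 0.9 and 0.1, while the square-root weights are 0.75 and 0.25, narrowing the gap between the large and small region.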
  • similar images may still have very different regions (e.g. the same baby with a different toy). If one simply uses the region distance function, two similar images may be considered different only because they have two very different regions.
  • distance thresholding preferably is used after calculating the distance between two regions. Roughly speaking, if the distance between two regions is larger than a threshold, we use the threshold as the region distance. By setting an upper bound on region distance, we reduce the effect that an individual region can have on the whole image, making our image similarity measure more robust.
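  • A minimal sketch of the thresholded region distance, assuming region bit vectors are stored as integers and the distance is their Hamming distance (the function and parameter names are hypothetical):

```python
def thresholded_region_distance(bits_a: int, bits_b: int, threshold: int) -> int:
    """Hamming distance between two region bit vectors, capped at threshold."""
    distance = bin(bits_a ^ bits_b).count("1")
    # Any pair of regions farther apart than the threshold counts as "threshold" apart.
    return min(distance, threshold)
```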
  • the preferred embodiment defines image dissimilarity as the EMD using square root region size as region weight, and thresholded region distance as the ground distance function. This measure is referred to as “EMD* match”-based image similarity measure.
  • the image feature vector 342 is also converted into a bit vector 352 by bit vector conversion component 350 .
  • Both the image bit vector 352 and the individual region bit vectors 332 (with region weights) are stored into a database 360 , 390 for future image retrieval.
  • a query image goes through the same process of segmentation, feature extraction, region bit vector conversion, embedding, and image bit vector conversion. Then the query image bit vector is used to do filtering 370 in the image database and obtain the top K images 372 that are closest to the query image's bit vector. The exact EMD* match 380 between the query image and each of the K images 372 is calculated using their region bit vectors. Finally, the top k images 382 with the smallest EMD* match to the query image are returned.
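  • The filter-then-rank flow can be sketched as below. The function `search`, its parameters `big_k` and `small_k` (standing for K and k), and the dictionary layout of the database are hypothetical; `exact_distance` is a caller-supplied stand-in for the exact EMD* match, which is not implemented here.

```python
def search(query_bits, query_regions, database, exact_distance, big_k, small_k):
    """Two-stage search: cheap Hamming filter, then exact re-ranking.

    database maps image id -> (image_bits, regions); exact_distance is a
    caller-supplied function standing in for the EMD* match.
    """
    def hamming(a, b):
        return bin(a ^ b).count("1")

    # Stage 1: keep the big_k images whose image bit vectors are closest.
    candidates = sorted(
        database, key=lambda i: hamming(query_bits, database[i][0])
    )[:big_k]
    # Stage 2: re-rank only those candidates with the expensive measure.
    reranked = sorted(
        candidates, key=lambda i: exact_distance(query_regions, database[i][1])
    )
    return reranked[:small_k]
```

  • The expensive measure is evaluated only `big_k` times regardless of database size. The trade-off is that an image removed in stage 1 cannot be recovered in stage 2, so `big_k` has to be large enough for the filter to retain most truly similar images.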
  • the system preferably uses a filtering method 370 via approximate EMD embedding.
  • the goal is to find a small candidate image set 372 for the EMD* match by filtering out most of the images which are very different from the query image.
  • the challenge is to quickly find a candidate image set that contains most of the similar images.
  • the first kind of filtering is to index individual regions and combine the filtering results of all the regions to form the candidate image set. This approach is not effective, because it loses the information of image-level similarity.
  • the second kind is to use a technique to embed EMD into L1 distance and then use Locality Sensitive Hashing (LSH) to find the nearest neighbor(s) in the latter space. This method has interesting provable properties, but it does not work well with compact data structures, nor does it consider distance thresholding on real-valued vectors.
  • the present invention uses a new EMD embedding technique that converts a set of region bit vectors 332 into a single image feature vector 342, such that the L1 distance between embedded image feature vectors 342 approximates the EMD on the original region bit vectors 332.
  • r_i is the bit vector for the i-th region and w_i is its weight.
  • r_{i,p_j} denotes the p_j-th bit of vector r_i.
  • each random pattern picks out the regions in the two images that are similar, in effect matching similar regions to each other. If two images are similar, their matched weights with respect to any random pattern should be close to each other.
  • a vector is obtained for every image by listing the matched weights for a number of randomly chosen patterns, and distances between images will be computed by L1 distances between these image vectors. When sufficiently many random patterns are used to generate the image vectors, the L1 distance between image vectors should be able to distinguish between similar and dissimilar images.
  • the idea of computing the matched weight for a random pattern is analogous to computing the weight that falls into a cube.
  • the prior embedding, such as that of Indyk and Thaper, uses different levels of granularity, and the weights assigned to them are exponentially decreasing. This creates problems when sampling coordinates to estimate weighted L1 distance by Hamming distance of compact bit vectors; the problem is that the random variables involved have high variance.
  • the scheme of the preferred embodiment of the present invention can be thought of as using only one level of granularity and this is designed to get around this problem with using many different levels.
  • The first piece is Algorithm 3, which generates M sets of random positions and picks a random bit pattern for each set.
  • the second piece is Algorithm 4 which, given an image represented by a list of region bit vectors and their corresponding weights, computes its EMD embedding using the random patterns generated by Algorithm 3.
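  • The two pieces can be sketched as follows. This is an illustrative reconstruction rather than the patent's listing: the function names, the number of positions per set, and the rule that a region "matches" a pattern only when it agrees at every chosen position are assumptions. Region bit vectors are stored as integers.

```python
import random


def generate_patterns(m, positions_per_set, n_bits, seed=0):
    """In the spirit of Algorithm 3: m sets of random positions, each with a random bit pattern."""
    rng = random.Random(seed)
    patterns = []
    for _ in range(m):
        positions = rng.sample(range(n_bits), positions_per_set)
        pattern = [rng.randrange(2) for _ in range(positions_per_set)]
        patterns.append((positions, pattern))
    return patterns


def embed_image(region_bits, region_weights, patterns):
    """In the spirit of Algorithm 4: one matched weight per random pattern."""
    embedding = []
    for positions, pattern in patterns:
        matched = 0.0
        for bits, weight in zip(region_bits, region_weights):
            # Assumption: a region matches when it agrees with the pattern
            # at every chosen position.
            if all((bits >> p) & 1 == b for p, b in zip(positions, pattern)):
                matched += weight
        embedding.append(matched)
    return embedding


def l1_distance(u, v):
    """L1 distance between two embedded image vectors."""
    return sum(abs(x - y) for x, y in zip(u, v))
```

  • Two images with the same regions and weights, listed in any order, receive identical embeddings, so their L1 distance is zero; images whose regions differ in many bits will tend to match different patterns and end up farther apart.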
  • each image is represented by an M-dimensional real-valued vector. It is further converted to a bit vector using the same algorithm used for converting region feature vectors to region bit vectors. As a result, each image is now represented by a compact bit vector, and the Hamming distance between two images can be efficiently computed by XORing their bit vectors.
  • the filtering algorithm ranks images based on the Hamming distance of their embedded image bit vectors to the query image's bit vector and returns the top K images for exact EMD computation.
  • Audio segmentation is the process of breaking up an audio stream into time sections that are perceptually different from adjacent sections.
  • the audio “texture” within a given segment is relatively stable. Examples of segment boundaries could be a transition from background sound texture to the beginning of speech over that background. Another segment boundary might occur when the scene changes, such as leaving an office building lobby and going outside onto a busy street.
  • Audio segmentation can be accomplished in two primary ways: blind segmentation, based on sudden changes in extracted audio features, and classification-based segmentation, based on comparing audio features to a set of trained target feature values.
  • the blind method works well and is preferred when the segment textures are varied and unpredictable, but requires the setting of thresholds to yield the best results.
  • the classification-based method works well on a corpus of pre-labeled data, such as databases of speech/music, musical genres, or indoor/outdoor scenes. Either method requires the extraction of audio features.
  • Audio feature extraction is the process of computing a compact numerical representation of a sound segment.
  • a variety of audio features have been used in systems for speech recognition, music/speech discrimination, musical genre (rock, pop, country, classical, jazz, etc.) labeling, and other audio classification tasks.
  • Most features are extracted from short moving windows (5-100 milliseconds in length, moving along at a rate of 5-20 windows per second) by using the Short Time Fourier Transform. Wavelets and compressed data have also been used.
  • Features can be computed at different time resolutions, and the value of each feature, along with the mean and variance of the features, can be used as features themselves.
  • Common audio features include power, spectral centroid and rolloff (measures of the relative brightness of sound), spectral flux (a measure of the frame-to-frame variance in spectral shape), zero crossing rate (noisiness), and Mel-Frequency Cepstral Coefficients (MFCCs), which are a compact representation of spectral shape.
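  • Two of the features named above can be computed for a single analysis window as sketched below. The sketch uses a naive discrete Fourier transform for clarity rather than a fast transform, applies no window function, and uses hypothetical function names; it is an illustration under those assumptions, not the feature extractor of any particular system.

```python
import cmath
import math


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ.

    Assumes the frame holds at least two samples.
    """
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def spectral_centroid(frame, sample_rate):
    """Magnitude-weighted mean frequency (in Hz) of one frame."""
    n = len(frame)
    magnitudes = []
    for k in range(n // 2 + 1):  # non-negative frequency bins only
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)
        )
        magnitudes.append(abs(coeff))
    total = sum(magnitudes)
    if total == 0:
        return 0.0  # silent frame: centroid is undefined, report 0 by convention
    return sum(k * sample_rate / n * m for k, m in enumerate(magnitudes)) / total
```

  • A brighter sound concentrates more energy in high-frequency bins and therefore yields a higher centroid, while a noisier sound changes sign more often and yields a higher zero crossing rate.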
  • Bi-clustering approximation algorithms may be used to identify small, incomplete bi-clusters of 2-5 genes, which can be used to define feature vectors. This limit on bi-cluster size will make bi-clustering algorithms tractable, and similarity search will allow these feature vectors to identify complete bi-clusters.
  • a preferred embodiment of the invention provides support for a default similarity distance function that is general-purpose.
  • the distance measure may be based on Earth Mover's Distance, which has been used successfully in both image and audio similarity searches.
  • EMD is a flexible metric between multidimensional distributions, represented by sets of weighted features and a distance function between pairs of features. This calculates the minimal amount of work needed to transform one distribution into another by moving distribution “mass”. This is a natural distance measure for weighted sets of features and is applicable to image, audio and scientific data. For example, two sound files that exhibit similar sub-segments, but in different order, would be judged similar by the EMD method.
  • the present invention improves upon the standard EMD measure in two ways.
  • the standard EMD uses the region size as its weight and our first improvement is to use the normalized square root of region size as each region's weight to prevent large regions from dominating the distance calculation.
  • the second improvement comes from the observation that using the raw distance function between regions may allow a pair of different regions to have a disproportionate effect on the overall distance calculation. This issue is addressed by thresholding the raw region distance function, thus making EMD more robust.
  • the similarity search engine interface may be designed to allow each data type to define its own weight function and threshold for its EMD measure.
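  • The first improvement can be illustrated with a short sketch. It assumes that "normalized" means the square roots are scaled to sum to 1; the function name `region_weights` is hypothetical.

```python
import math


def region_weights(sizes):
    """Weight each region by the square root of its size, scaled to sum to 1."""
    roots = [math.sqrt(size) for size in sizes]
    total = sum(roots)
    return [root / total for root in roots]
```

  • For region sizes 81, 9, 9 and 1, size-proportional weighting would give the largest region 81% of the total weight, whereas `region_weights([81, 9, 9, 1])` gives it 9/16, or about 56%, so the smaller regions retain more influence on the distance.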

Abstract

A content-addressable and searchable storage system for managing and exploring massive amounts of feature-rich data such as images, audio or scientific data, is shown. The system comprises a segmentation and feature extraction unit for segmenting data corresponding to an object into a plurality of data segments and generating a feature vector for each data segment; a sketch construction component for converting a feature vector into a compact bit-vector corresponding to the object; a similarity index comprising a plurality of compact bit-vectors corresponding to a plurality of objects; and an index insertion component for inserting a compact bit-vector corresponding to an object into the similarity index. The system may further comprise an indexing unit for identifying a candidate set of objects from said similarity index based upon a compact bit-vector corresponding to a query object. Still further, the system may additionally comprise a similarity ranking component for ranking objects in said candidate set by estimating their distances to the query object.

Description

    CROSS-REFERENCE TO RELATED APPLICATIONS
  • The present application claims the benefit of the filing date of U.S. Provisional Patent Application Ser. No. 60/625,828, entitled “Image Similarity Search with Compact Data Structures” and filed on Nov. 9, 2004 by inventors Kai Li, Qin Lv and Moses Charikar.
  • The above cross-referenced related application is hereby incorporated by reference herein in its entirety.
  • STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT
  • Not applicable.
  • BACKGROUND OF THE INVENTION
  • 1. Field of the Invention
  • The present invention relates to a content-addressable and searchable storage system to provide effective capabilities to access, search, explore and manage massive amounts of diverse feature-rich data.
  • 2. Description of the Related Art
  • The world is moving into the age where all information is digitized and where the world is interconnected by digital means. Recent studies suggest that the volume of digital data on magnetic disks as well as the capacity of a disk have been doubling every year in the past decade. If this trend continues, the capacity of a single disk will reach 1 terabyte in 2007 and 1 petabyte by 2022. As data volume and storage capacity continue to increase exponentially, storage systems, as part of the operating system, must provide new abilities to access, search, explore, and manage massive amounts of data.
  • A key challenge in building next-generation storage systems is to manage massive amounts of feature-rich (non-text) data, which has dominated the increasing volume of digital information. Feature-rich data are typically sensor data such as audio, images, video, genomics, or scientific data; they are noisy and high-dimensional. Current file systems are designed for named text files, and they do not have mechanisms to manage feature-rich data.
  • In current systems, the user must name each file and find a place to store it, and then she must know the name in order to access it later. For example, today's digital cameras automatically generate meaningless file names for their images. These file names are difficult to remember, they often are duplicative of names of files previously downloaded from the camera, and they have no correlation with the image content. To find a specific image file, the user has to look through the image thumbnails instead of the file names.
  • Further, current file systems use directories to organize files. Directories emulate the management of paper files and have been helpful in managing paper-like documents. Some recent file systems attempt to provide content-based search tools, but they are limited to exact searches for text and annotations of non-text data. Manual annotation, however, is not practical for feature-rich data because such data are massive, noisy and high dimensional.
  • Pattern matching tools, document viewers, image thumbnail generators, and directory browsers are already integral components of a modern operating system. However, such tools are limited to exploring text documents or viewing simple images; they are not useful for exploring noisy, high-dimensional data.
  • The management of digital data calls for a fundamentally different paradigm. A disk in the future will store significantly more data than the amount of paper data one can handle in one's lifetime; in fact, much more data than the entire Library of Congress. A paper document is inherently tied to a physical location, but this is not true for digital data. Paper management systems force users to put a file into a fixed category, and current file systems follow a similar paradigm. In contrast, feature-rich data can be organized in multiple ways and thus have many attributes, most of which are unknown at the time the data is created.
  • Since searching in high dimensional spaces is a challenging problem, practical proposed search solutions such as the Google search engine have been limited to searching for exact matches—they tend to work only for text documents and text annotations. Search engines such as Google index documents by building an inverted index. A number of data structures have been devised for nearest neighbor searching such as R-Trees, k-d trees, ss-trees, and SR-trees. These are capable of supporting similarity queries, but they do not scale satisfactorily to large high-dimensional data sets. Several constructions of nearest neighbor search data structures have recently been devised in the theory community, but practical implementations of those theoretical ideas for high dimensional data do not exist yet.
  • Similarity searching on time series or sequence data has been investigated recently. Range searches and nearest neighbor searches in whole matching and subsequence matching have been the principal queries of interest for time series data. For whole matching, several techniques have been proposed to transform the time sequence to the frequency domain by using DFT (Discrete Fourier Transform) and wavelets to reduce dimensions. For subsequence matching, solutions include the I-adaptive index to solve the matching problem for searches of pre-specified lengths, the PAA (Piecewise Aggregate Approximation) technique to average values of equal-size windows of the time sequence, the APCA (Adaptive Piecewise Constant Approximation) technique to average values of variable-size windows of the time sequence, and a multi-resolution index data structure. These techniques focus on the specifics of time series and do not provide a general-purpose similarity search engine.
  • Thus, to date, there is no practical file system with the ability to do similarity searches for noisy, high-dimensional data and there is no index engine designed for efficient similarity searches.
  • Recently, the theory research community has made advances in areas such as compact data structures (sketches) and dimension reduction techniques. For example, a distance function on pairs of data items can be estimated by only examining the sketches of the data items. The existence of a sketch depends crucially on the function one desires to estimate. The successful construction of a small sketch as the metadata to estimate the distance between two points in high-dimensional space has significant implications on solving the efficient similarity search problem because it can provide significant savings in space and running time.
  • Sketching techniques for documents (represented as sets) have been developed. The construction, based on min-wise independent permutations, was used to compute compact sketches for eliminating near-duplicate documents in the Altavista search engine. Other research introduced the notion of locality-sensitive hashing, which is a family of hash functions where the collision probability is higher for objects that are closer. Such hash functions are very useful in the construction of data structures for nearest neighbor search. A variant of locality-sensitive hashing, called similarity-preserving hashing, was investigated by a co-inventor of the present invention, Moses Charikar. He developed a sketch construction for the earth mover's distance (EMD), which had been investigated and used before in the context of determining image similarity and navigating image databases. A closely related idea for sketching EMD was devised and used for image retrieval and was evaluated using exact EMD as ground truth, i.e. its authors were not concerned with how well their method performed compared to perceptual similarity of images.
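  • The min-wise sketching idea mentioned above can be illustrated with a small example. This is a generic MinHash sketch, not the construction used in any cited system: it assumes set elements are non-negative integers (for instance, hashed word sequences) and approximates min-wise independent permutations with random linear hash functions modulo a prime. All names are hypothetical.

```python
import random

PRIME = (1 << 61) - 1  # a Mersenne prime larger than any element used here


def make_hashes(count, seed=0):
    """Draw `count` random linear hash functions x -> (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(count)]


def minhash_sketch(items, hashes):
    """Sketch a non-empty set of integers as its minimum value under each hash."""
    return [min((a * x + b) % PRIME for x in items) for a, b in hashes]


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of agreeing coordinates, an estimate of |A & B| / |A | B|."""
    matches = sum(1 for u, v in zip(sketch_a, sketch_b) if u == v)
    return matches / len(sketch_a)
```

  • Two sets that share most of their elements agree on most sketch coordinates, so near-duplicates can be detected by comparing the short sketches instead of the full sets.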
  • Many other techniques have been proposed for image similarity search. One technique may be referred to as region based image retrieval (RBIR). Most RBIR systems use a combination of color, texture, shape, and spatial information to represent a region.
  • In C. Carson, M. Thomas, S. Belongie, J. M. Hellerstein, and J. Malik, “Blobworld: A system for region-based image indexing and retrieval,” In Proc. of 3rd Intl. Conf. on Visual Information and Information Systems, pages 509-516 (1999), the authors describe a technique in which each region is represented by a 218-bin color histogram, mean texture contrast and anisotropy, centroid, area, eccentricity and orientation, which is a very complicated representation.
  • In W. Ma and B. S. Manjunath, “NETRA: A toolbox for navigating large image databases,” Multimedia Systems, 7(3):184-198 (1999), the authors describe another technique that uses a complicated region representation. It quantizes the RGB color space into 256 colors, and each region's color is represented by {(c1, p1), . . . , (cn, pn)}, where ci is the color code and pi is the fraction of that color in the region. Texture is represented by normalized mean and standard deviation of a set of Gabor wavelet transformations with different scales and directions.
  • In J. R. Smith and S. F. Chang, “VisualSEEk: A fully automated content-based image query system,” In Proc. of ACM Multimedia'96, pages 87-98 (1996), the authors describe a technique that extracts salient color regions using a back-projection technique and supports joint color-spatial queries. A selection of 166 colors in the HSV color space is used. Each region is represented by a color set, region centroid, area, and the width and height of the minimum bounding rectangle.
  • In A. Natsev, R. Rastogi, and K. Shim, “WALRUS: A similarity retrieval algorithm for image databases,” In Proc. of ACM SIGMOD'99, pages 395-406 (1999), the authors describe a technique that segments each image by computing wavelet based signatures for sliding windows of various sizes and then clusters them based on the proximity of their signatures. Each region is then represented by the average signature.
  • In S. Ardizzoni, I. Bartolini, and M. Patella, “Windsurf: Region-based image retrieval using wavelets,” In DEXA Workshop, pages 167-173 (1999) and I. Bartolini, P. Ciaccia, and M. Patella, “A sound algorithm for region-based image retrieval using an index,” In DEXA Workshop, pages 930-934 (2000), the authors describe a technique that performs 3-level Haar wavelet transformation in the HSV color space and the wavelet coefficients of the 3rd level LL subband are used for clustering. Each region is represented by its size, centroid and corresponding covariance matrices.
  • In J. Z. Wang, J. Li, and G. Wiederhold, “SIMPLIcity: Semantics-sensitive integrated matching for picture libraries,” IEEE Trans. on Pattern Analysis and Machine Intelligence, 23(9):947-963 (2001), the authors describe a system that partitions an image into 4×4 blocks and computes average color and wavelet coefficients in high frequency bands.
  • Current region-based image similarity measures can be roughly divided into three categories: (1) independent best match; (2) one-to-one match; and (3) EMD match. Independent best match systems such as Blobworld and NETRA find the best matched region for each query region and calculate the overall similarity score using fuzzy-logic operations or a weighted sum. Since each query region is matched independently, multiple regions in the query image might be matched to the same region in a target image, which is undesirable in many cases. As an extreme example, consider an image A full of red balloons and a very different image B with a red ball in it. Since each red balloon in A matches the red ball in B very well, these two images will be considered very similar by independent best match.
  • One-to-one match systems like Windsurf and WALRUS consider matching one set of regions to another set of regions and require that each region can only be matched once. For example, Windsurf uses the Hungarian Algorithm to assign regions based on region distance. Region size is then used to adjust two matching regions' similarity. Image similarity is defined as the sum of the adjusted region similarities. One-to-one match assumes good image segmentation, so that there is good correspondence between two similar images' regions. But current segmentation techniques are not perfect, and regions do not always correspond to objects. Moreover, it is hard to define an optimal segmentation, as one image may need different segmentations when compared to different images.
  • EMD match systems use similarity measures based on the Earth Mover's Distance (EMD). Although EMD is a good measure for region matching, its effectiveness is closely linked to the underlying distance function used for pairs of regions as well as the weight given to each region. Since these systems directly use the region distance function as the ground distance for EMD and use normalized region size as the region weight, this creates problems such as regions being weighted inappropriately. As a result, these systems do not use EMD very well.
  • There are no commercial systems for automatic audio query with the complexity or capabilities desired for a general purpose search engine. Websites such as Findsounds.com rely on text-based searching of sound file names. The technology of Comparisonics Inc. (the developer of Findsounds.com) allows the colorized display of sound feature data once the sound is found by name, but the features are not used for the indexing/query. Other music websites such as Moodlogic.com combine filenames with user preference rankings to generate similarities for music recommendation. The largest and most popular available research system for audio segmentation, classification, and query is MARSYAS, developed by George Tzanetakis and Co-PI Perry Cook at Princeton University. This software is publicly available, and recent conferences such as the International Symposium on Music Information Retrieval, the Conference on Digital Audio Effects, and the International Computer Music Conference revealed that MARSYAS is now the basis of approximately 80% of the current research in music information retrieval.
  • Most research in audio query has focused on the music domain. Some recent research projects include identifying the passages within a song when a singing voice is present and identifying the singer in a complex recorded song. Another recent project is the WinPitch Corpus, which automatically aligns speech recordings with text files.
  • The closest related work to similarity searches for genomic data is work in clustering of gene expression matrices to identify related patterns. Many different clustering algorithms have been proposed for microarray analysis. The general goal of such algorithms is to find biologically relevant groupings of genes and/or experiments from microarray data. Hierarchical clustering using average or complete linkage is probably most widely applied. Self organizing maps (SOM) are another commonly used technique.
  • Other authors have suggested using mutual information relevance networks, clustering by simulated annealing, model-based clustering, graph-theoretic approaches, as well as other methods. A recent promising trend in clustering algorithms has been the emergence of methods that are probabilistic in nature, thus allowing one gene to be a member of more than one cluster. However, all these algorithms have one common and serious limitation: they define similarity over the whole gene expression vector, thus making it impossible to successfully apply these techniques to large diverse databases of expression information that cover thousands of experiments, with different sets of genes coexpressed in different subsets of experiments. This problem can be addressed by bi-clustering algorithms, but an exact solution to this problem for microarray data is NP-complete. Some approximation methods have been developed recently. These include two-sided clustering algorithms called plaid models, a biclustering method in which low-variance submatrices of the complete data matrix are found, and a bi-graph based biclustering method. However, all these algorithms are very slow and have various limitations on bicluster size and quality. They cannot be realistically applied to databases of thousands of microarray experiments.
  • SUMMARY OF THE INVENTION
  • The present invention disclosed and claimed herein is a system and method for a content-addressable and searchable storage system for managing and exploring massive amounts of feature-rich data such as images, audio or scientific data. In a preferred embodiment of the invention, the system comprises a segmentation and feature extraction unit for segmenting data corresponding to an object into a plurality of data segments and generating a feature vector for each data segment; a sketch construction component for converting a feature vector into a compact bit-vector corresponding to the object; a similarity index comprising a plurality of compact bit-vectors corresponding to a plurality of objects; and an index insertion component for inserting a compact bit-vector corresponding to an object into the similarity index. The system may further comprise an indexing unit for identifying a candidate set of objects from said similarity index based upon a compact bit-vector corresponding to a query object. Still further, the system may additionally comprise a similarity ranking component for ranking objects in said candidate set by estimating their distances to the query object.
  • A method of comparing a search image to a first plurality of stored images in accordance with an embodiment of the invention comprises the steps of segmenting the search image into a plurality of search image regions; extracting a region feature vector from each of the search image regions; converting each of the region feature vectors into a region bit vector; storing the region bit vectors; calculating a region weight for each of the search image regions; embedding all of the region bit vectors and region weights into a composite search image feature vector; storing the composite search image feature vector; and selecting a second plurality of images from the database using the composite search image feature vector, wherein the second plurality of images comprises a subset of the first plurality of images. A region's weight comprises a normalized square root of the region's size.
  • A method in accordance with a preferred embodiment of the invention may further comprise the steps of calculating an image dissimilarity match between the search image and each of the second plurality of images using the region bit vectors of the search image; and selecting a third plurality of images based upon the image dissimilarity matches, wherein the third plurality of images comprises a subset of the second plurality of images. The image dissimilarity match may comprise an Earth Mover's Distance using a square root region size as a region weight. The image dissimilarity match also may comprise an Earth Mover's Distance using a thresholded region distance.
  • A method in accordance with another preferred embodiment of the invention further comprises the steps of calculating a distance between two of the plurality of regions by XOR-ing their region bit vectors; comparing the distance to a threshold; selecting the distance as a region ground distance function if the distance is less than the threshold; selecting the threshold as the region ground distance function if the distance is greater than the threshold; calculating an image dissimilarity match between the search image and each of the second plurality of images using the region bit vectors and the ground distance function; and selecting a third plurality of images based upon the image dissimilarity matches, wherein the third plurality of images comprises a subset of the second plurality of images.
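  • The thresholded region ground distance described in the steps above can be sketched as follows, assuming region bit vectors are stored as Python integers. The function name is hypothetical, and the threshold value is left to the caller because the text does not fix one.

```python
def region_ground_distance(bits_a: int, bits_b: int, threshold: int) -> int:
    """Hamming distance between two region bit vectors, capped at a threshold.

    The raw distance is the number of set bits in the XOR of the two vectors.
    It is used as-is when below the threshold; otherwise the threshold is used.
    When the two are equal, either choice gives the same value.
    """
    distance = bin(bits_a ^ bits_b).count("1")
    return min(distance, threshold)
```

  • With a threshold of 3, two regions that differ in every one of four bits contribute a ground distance of 3 rather than 4, which limits the effect any single pair of very different regions can have on the overall image distance.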
  • Still other aspects, features, and advantages of the present invention are readily apparent from the following detailed description, simply by illustrating preferable embodiments and implementations. The present invention is also capable of other and different embodiments, and its several details can be modified in various respects, all without departing from the spirit and scope of the present invention. Accordingly, the drawings and descriptions are to be regarded as illustrative in nature, and not as restrictive.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate some embodiments of the invention and, together with the description, serve to explain some aspects, advantages, and principles of the invention. In the drawings,
  • FIG. 1 is a block diagram of a content-addressable and searchable storage system architecture in accordance with an embodiment of the present invention;
  • FIG. 2 is a block diagram of a similarity search engine in accordance with an embodiment of the present invention; and
  • FIG. 3 is a block diagram of the main components and method of inserting or querying an image in accordance with an embodiment of the present invention.
  • DETAILED DESCRIPTION OF THE INVENTION
  • The present invention provides a content-addressable and searchable storage system. The data in the system preferably is organized dynamically based on demand. The system has a similarity search engine for searching massive amounts of feature-rich, noisy, high-dimensional data.
  • The general system architecture of a preferred embodiment of the invention will be described with reference to FIG. 1. The general system architecture has four main data paths: objects 152, file I/Os 144, annotations 142, and search queries 102. Objects can be addressed directly through a content-addressed store 150 that implements a content-addressed abstraction. A client can store a data object into the content-addressed store 150 and get its fingerprint or hash back as its “id.” The content-addressed store allows clients to access segments of an object. The storage system uses fingerprints as the uniform mechanism to address both data objects and segments.
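  • The content-addressed abstraction can be illustrated with a small in-memory sketch. The choice of SHA-256 as the fingerprint, the fixed-size segmentation, and all class and method names are assumptions made for illustration; the text does not specify a hash function, and segmentation in the described system depends on the data type.

```python
import hashlib


class ContentAddressedStore:
    """Toy in-memory store that addresses data by its fingerprint."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        """Store data and return its fingerprint, which serves as its id."""
        fingerprint = hashlib.sha256(data).hexdigest()  # assumed hash choice
        self._blobs[fingerprint] = data
        return fingerprint

    def get(self, fingerprint: str) -> bytes:
        """Fetch an object or a segment by fingerprint."""
        return self._blobs[fingerprint]

    def put_object(self, data: bytes, segment_size: int):
        """Store an object and its segments; return (object id, segment ids).

        Fixed-size segments are a placeholder for type-specific segmentation.
        """
        segment_ids = [
            self.put(data[i:i + segment_size])
            for i in range(0, len(data), segment_size)
        ]
        return self.put(data), segment_ids
```

  • Because objects and segments share one id scheme, the same `get` call retrieves either, and storing identical content twice yields the same id.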
  • Clients can perform regular file I/Os via a standard file system interface implemented by the Enhanced File System 140. The standard file system interface allows existing applications to perform traditional file I/Os to input and output data. As the data enters the system, it will be delivered to a Segmentation and Feature Extraction Unit 130 (data type dependent) which will segment the data into multiple segments, perform feature extraction, and then pass the extracted multi-dimensional vectors to the Similarity Search Engine 120 to construct compact metadata.
  • Attributes and user-defined annotations can be associated with individual files and directories. The Enhanced File System 140 implements the mechanism for the associations and also relational searches for desired attributes and annotations. Attributes are generated automatically by the input data processing units as the data enters into the system, while annotations are provided by the users.
  • Query Processing and Interface 110 implements a query interface for users to perform searches and data exploration in the system. It allows users to provide sophisticated similarity search queries that include query data and a search range. It also implements a user interface for users to browse and explore result data objects. The query data will be processed and converted into multi-dimensional feature vectors and then passed to the similarity search engine. A corresponding Segmentation and Feature Extraction Unit 130 performs the conversion, while the Similarity Search Engine 120 implements the core similarity search capability and returns a list of data objects ranked by similarity.
  • The proposed system encourages users to use the content-based similarity search capability to search and manage massive amounts of feature-rich data instead of using the traditional file system interface. It may be useful to combine the similarity search capability with a search range constrained by attributes (such as time, size, data type, owner, and so on) and user-defined annotations. Users can find their data conveniently and quickly by meaningful contents instead of meaningless file names (such as those created by today's digital cameras). The following three examples illustrate the usefulness of such a system.
  • The first example is managing digital images. Consider a user who has millions of digital photos and would like to find as many photos with waterfalls as possible taken during the past three years. With traditional file systems, the user's digital photo collections are stored by meaningless file names automatically generated by a digital camera. To find the desired digital photos, the user first looks through all her directories created during the past three years and scans through each in image thumbnail mode. This process will be extremely inefficient and very time consuming. With the proposed system, the user could upload her photos without needing to know file names and directory names. When the user needs to find photos with waterfalls, she would first find one photo with waterfalls and then issue a query to find all images similar to the query data and specify a search range from 2001 to 2004. The system would present her desired photos.
  • The second example is managing audio data. Consider a research professor who regularly records her lectures, as well as her research group meetings. Years later she might want to make a summary based on what was said every time mobile storage was mentioned. A search query of this type could be done either by typing “mobile storage” into a text field, or saying the words “mobile storage” into a microphone. The system then would bring up all of the audio files it could find that contain the relevant speech-recognized words, and also all regular text files (lecture notes, meeting minutes, emails, etc.) that contain the words either in the filename or in the content. Another example query could be of the form “Display all audio segments that contain simple conversations (one or two clear speakers) over a train station background.” The system would search for audio textures matching the query, and present them in a rank-ordered listing with dates and origin information.
  • The third example is data exploration of genomic data. Consider a biologist who just identified a new gene that seems to be involved in cancer progression. The biologist ran a microarray experiment in which she produced a pattern of expression for this gene (and others in the genome) over a large set of conditions. She wants to identify any known genes that may have the same pattern of expression, so she queries the system with her experimental data and the name of the gene of interest. The biologist will then see all genes with similar expression patterns to the gene of interest over any subset of experiments. This may give her clues about the function of her gene of interest in carcinogenesis, and provide her leads for design of further experiments.
  • A preferred embodiment of the invention constructs sketches of the data. These sketches are tiny data structures that can be used to estimate properties of the original data. For example, a distance function on pairs of data items could be estimated by only examining the sketches of the data items. Sketch constructions have been developed for a number of purposes, including estimating similarity of sets, estimating distinct elements and vector norms, and estimating string edit distance. Sketch constructions can be derived from rounding techniques used in approximation algorithms. Many sketch constructions for estimating similarity and distances can be viewed as embeddings (approximate distance preserving mappings) from the data points to points in a normed space, usually L1 or L2. Once such a mapping is obtained, sketching techniques for L1 or L2 can be applied.
  • By using sketches constructed from feature vectors in the similarity search engine, the present invention speeds up similarity searches and maintains the similarity search quality while substantially reducing the metadata size. A sketch-based indexing system also may be used for efficient similarity searches.
  • A similarity search engine 120 in accordance with a preferred embodiment of the invention is shown in FIG. 2. The similarity search engine 120 works with feature vectors and client-defined distance functions. The similarity search engine has two main operations: data input and similarity searching. Input data 202 enters a segmentation and feature extraction unit 210, depending on its data type. The segmentation and feature extraction unit 210 segments the input data 202 and generates a feature vector for each segment. Each piece of input data is then represented by a group of feature vectors 212. The feature vectors 212 have a client-defined distance function. The sketch construction component 222 converts a feature vector into a compact bit-vector or sketch. The sketches are then passed to the index insertion component 224, which inserts them into a similarity index 226.
  • When a query is presented to the similarity search engine 120, the query data 204 is first passed to a specific segmentation and feature extraction unit 240, depending on its data type. The segmentation and feature extraction unit 240 may be the same unit as segmentation and feature extraction unit 210 or may be a different unit. The segmentation and feature extraction unit 240 will segment the query data 204 and generate a set of feature vectors 242. The feature vectors 242 are passed to the sketch construction component 230 to convert them into a group of sketches. The sketch construction component 230 may be the same component as sketch construction component 222 or may be a different sketch construction component. The indexing unit 228 looks up the similarity index 226 to find a candidate set of objects. The candidate set may include objects that are not similar to the query object, but it misses very few objects that are similar. The similarity ranking component 232 will rank the objects in the candidate set by estimating their distances to the query object. It will filter out the objects whose distances to the query object are beyond a certain threshold.
  • The use of sketches in the similarity search engine achieves high-speed similarity searches and reduces the metadata space requirement. The sketch construction unit 222, 230 converts a multi-dimensional feature vector into a sketch, a very small bit vector that can be used to estimate the distance function of the original data. Such a sketch can typically be 1/10 the feature vector size without losing similarity search quality.
  • The segmentation and feature extraction units 210, 240 are data dependent. Thus, the system of a preferred embodiment provides a convenient interface such that users can “plug in” new segmentation and feature extraction units easily. Examples of segmentation and feature extraction units will be described for image data, audio data and genomic data. Those skilled in the art will recognize that segmentation and feature extraction units for other types of data may be used with the invention.
  • FIG. 3 shows the main components of a preferred embodiment of an image similarity search method of the present invention and illustrates the steps an image goes through when it is inserted into the system, or is submitted as a query image. The preferred embodiment incorporates a new region feature representation with weighted L1 distance function and improved Earth Mover's Distance (“EMD”) match that will be referred to herein as “EMD*.”
  • When an image 302 is inserted into the system, segmentation component 310 segments it into several homogeneous regions 312. For each region 314, feature extraction component 320 extracts a 14-dimensional feature vector 322. Each region preferably, but not necessarily, is represented by a simple feature-vector that includes two kinds of information about a region: color moments and bounding box information. Color moments are compact representations that have been shown to be only slightly worse in performance than high-dimensional color histograms. In the preferred embodiment, the first three moments from each channel in the HSV color space are extracted, resulting in a nine-dimensional color vector. A bounding box is the minimum rectangle covering a region. Each region's bounding box is calculated, thereby obtaining the following information:
    Symbol      Description
    x           bounding box width
    y           bounding box height
    p           number of pixels in the region
    r = x/y     aspect ratio
    s = x·y     bounding box size
    a = p/s     area ratio
    (cx, cy)    region centroid

    A 5-dimensional vector is used to represent a region's bounding box information: (ln(r), ln(s), a, cx, cy).
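  • As an illustration of the bounding box quantities listed above, the following Python sketch computes the 5-dimensional vector for one region. The function name, the pixel-list input format, and the inclusive pixel-count convention for width and height are assumptions made for this example; the text does not specify them.

```python
import math

def bounding_box_features(pixels):
    """pixels: list of (col, row) integer coordinates belonging to one region
    (hypothetical input format)."""
    xs = [c for c, _ in pixels]
    ys = [r for _, r in pixels]
    x = max(xs) - min(xs) + 1   # bounding box width (inclusive pixel count, assumed)
    y = max(ys) - min(ys) + 1   # bounding box height
    p = len(pixels)             # number of pixels in the region
    r = x / y                   # aspect ratio
    s = x * y                   # bounding box size
    a = p / s                   # area ratio
    cx = sum(xs) / p            # region centroid
    cy = sum(ys) / p
    return (math.log(r), math.log(s), a, cx, cy)
```

    Concatenating this vector with the nine color moments would give the 14-dimensional region feature vector described above.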
  • In other embodiments, more dimensions may be added to the feature vector. For example, shape information may be extracted from a region using known or new shape recognition methods and be added into the feature vector. Additionally, another level of segmentation by segmenting the regions into sub-regions also may be added to the system to provide more detailed information. This will change the feature representation from a group of feature vectors to a three-level tree of feature vectors. The two-level segmentation allows the implementation of the capability to query an object in addition to a whole image.
  • The bit vector conversion component then converts the 14-dimensional feature vector 322 into a region bit vector 332 using a thresholding and transformation algorithm. This results in a very compact representation of each region.
  • The thresholding and transformation algorithm preferably approximates weighted (and thresholded) L1 distance of real-valued feature vectors with Hamming distance of bit vectors. The bit vector representation is much more compact than the real-valued feature vector representation; and it is also much faster to calculate Hamming distance of bit vectors (XORing bits) than weighted (and thresholded) L1 distance of feature vectors (floating point operations).
    Algorithm 1: Generate N × K Random (i, t) Pairs
    input: N, K, d, l[d], u[d], w[d]
    output: p[d]; rnd_i[N][K]; rnd_t[N][K]
    p[i] = w[i] × (u[i] − l[i]), for i = 0, ..., d − 1
    normalize p[i] s.t. $\sum_{i=0}^{d-1} p_i = 1.0$
    for (n = 0; n < N; n++) do
      for (k = 0; k < K; k++) do
        pick random number r ∈ [0, 1)
        find i s.t. $\sum_{j=0}^{i-1} p_j \le r < \sum_{j=0}^{i} p_j$
        rnd_i[n][k] = i
        pick random number t ∈ [l[i], u[i]]
        rnd_t[n][k] = t
      end for
    end for
  • Bit vectors are generated from d-dimensional vectors such that the expected Hamming distance between two bit vectors is proportional to the weighted L1 distance between the corresponding vectors. To do this, a single bit is generated from each d-dimensional vector such that the probability that the bit differs for two vectors is proportional to their weighted L1 distance. The required bit vectors are produced by repeating this process to produce several bits and concatenating them. For example, suppose one wants to compute weighted L1 distance for d-dimensional vectors, where the ith coordinate is in the range [l_i, h_i] (h_i is the upper bound, written u[i] in Algorithm 1) and has weight w_i. Let $T = \sum_i w_i (h_i - l_i)$ and $p_i = w_i (h_i - l_i) / T$. Note that $\sum_i p_i = 1$. To generate a single bit, pick i ∈ [0, d−1] with probability p_i, then pick a uniform random number t ∈ [l_i, h_i]. For each vector v = (v_0, ..., v_{d−1}), the bit is
    $\mathrm{bit} = \begin{cases} 0 & \text{if } v_i < t \\ 1 & \text{if } v_i \ge t \end{cases}$
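  • The proportionality claim can be checked with a short derivation. The symbols a and b for two feature vectors are introduced here for illustration only. For a chosen coordinate i, the generated bit differs between a and b exactly when t falls between a_i and b_i, which for a uniform t on [l_i, h_i] happens with probability |a_i − b_i| / (h_i − l_i). Averaging over the choice of i gives:

```latex
\Pr[\mathrm{bit}(a) \neq \mathrm{bit}(b)]
  = \sum_{i=0}^{d-1} p_i \cdot \frac{|a_i - b_i|}{h_i - l_i}
  = \sum_{i=0}^{d-1} \frac{w_i (h_i - l_i)}{T} \cdot \frac{|a_i - b_i|}{h_i - l_i}
  = \frac{1}{T} \sum_{i=0}^{d-1} w_i \, |a_i - b_i|
```

    The right-hand side is the weighted L1 distance scaled by the constant 1/T, so the expected Hamming distance over many independently generated bits is proportional to the weighted L1 distance.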
    Algorithm 2: Convert Feature Vector to N-Bit Vector
    input: v[d], N, K, rnd_i[N][K], rnd_t[N][K]
    output: b[N]
    for (n = 0; n < N; n++) do
      x = 0
      for (k = 0; k < K; k++) do
        i = rnd_i[n][k]
        t = rnd_t[n][k]
        y = (v[i] < t ? 0 : 1)
        x = x ⊕ y
      end for
      b[n] = x
    end for
  • Note that an (i, t) pair determines the value of one bit for each vector. To make the transformation consistent across all vectors, for each bit we generate, the same (i, t) pair must be applied to each vector. The process of generating (i, t) pairs is described in Algorithm 1. Here, N × K such pairs are generated, where N is the size of the final bit vector desired (after thresholding) and K is a parameter which will be determined later.
  • Next, the distance function is transformed so the distance is thresholded at a given threshold δ. Algorithm 1 generates N × K (i, t) pairs, which give rise to N groups of K bits each. A single bit is produced from each group of K bits by applying a hash function to them. The hash function could be XOR, or some other random hash function. This achieves the desired thresholding. An implementation of the algorithm is shown here. Algorithm 1 is the initializing process, where N × K random (i, t) pairs are generated. Then for each feature vector, Algorithm 2 is called to convert the feature vector to an N-bit vector.
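  • Algorithms 1 and 2 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names, the list-of-groups layout for the (i, t) pairs, and the use of an explicit random generator argument are assumptions made for the example. XOR is used as the hash function, as in Algorithm 2.

```python
import random

def generate_pairs(N, K, lo, hi, w, rng):
    """Algorithm 1: draw N groups of K (i, t) pairs.
    Coordinate i is chosen with probability proportional to w[i] * (hi[i] - lo[i])."""
    d = len(w)
    p = [w[i] * (hi[i] - lo[i]) for i in range(d)]
    pairs = []
    for _ in range(N):
        group = []
        for _ in range(K):
            # choices() normalizes the weights internally
            i = rng.choices(range(d), weights=p)[0]
            t = rng.uniform(lo[i], hi[i])
            group.append((i, t))
        pairs.append(group)
    return pairs

def to_bit_vector(v, pairs):
    """Algorithm 2: one output bit per group, the XOR of its K threshold bits."""
    bits = []
    for group in pairs:
        x = 0
        for i, t in group:
            x ^= 0 if v[i] < t else 1
        bits.append(x)
    return bits
```

    The same `pairs` object must be reused for every vector so that the resulting bit vectors are comparable, which mirrors the consistency requirement stated above.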
  • The distance between two regions can be calculated efficiently by XORing their region bit vectors. Next, all the n region bit vectors along with their weights are embedded at embedding component 340 into a single image feature vector 342, such that the L1 distance on two images' embedded feature vectors approximates the EMD* between these two images.
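  • The XOR-based region distance can be illustrated in a few lines of Python. Packing each bit vector into a Python integer is an assumption made for this sketch; the text does not prescribe a storage layout.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two packed bit vectors differ."""
    return bin(a ^ b).count("1")
```

    For example, the 8-bit vectors 10011010 and 00110110 differ in four positions, so `hamming_distance(0b10011010, 0b00110110)` evaluates to 4.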
  • The preferred image similarity measure EMD* is based on Earth Mover's Distance (EMD) or Transportation Metric, which is a flexible similarity measure between multidimensional distributions. Given two distributions represented by sets of weighted features and a distance function between pairs of features, EMD reflects the minimal amount of work needed to transform one distribution into another by moving distribution “mass” (weights) around.
  • EMD can be computed via (weighted) bipartite matching, but this is a relatively expensive operation. Prior RBIR systems have used “EMD match”-based image similarity measures where the region distance function is used as the ground distance of EMD and normalized region size is used as region weight. However, these prior “EMD match”-based image similarity measures do not use EMD appropriately. In particular, the distance function and region weight information that are inputs to EMD are inappropriate.
  • First, a region's importance in an image is not proportional to that region's size. For example, a large region (e.g. front door) usually should not be considered much more important than a small region (e.g. a baby). Accordingly, the preferred embodiment uses the normalized square root of region size as each region's weight, which reduces the difference between small and large regions, and assigns suitable weights in most segmentation scenarios. Second, similar images may still have very different regions (e.g. the same baby with a different toy). If one simply uses the region distance function, two similar images may be considered different only because they have two very different regions. To address this problem, distance thresholding preferably is used after calculating the distance between two regions. Roughly speaking, if the distance between two regions is larger than a threshold δ, we use δ as the region distance. By setting an upper bound on region distance, we reduce the effect that an individual region can have on the whole image, making our image similarity measure more robust.
  • Thus, the preferred embodiment defines image dissimilarity as the EMD using square root region size as region weight, and thresholded region distance as the ground distance function. This measure is referred to as “EMD* match”-based image similarity measure.
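  • The two inputs to EMD* described above, square-root region weights and a thresholded ground distance, can be sketched as follows. The function names are hypothetical, and the EMD computation itself (a transportation problem) is deliberately left out of this example.

```python
import math

def region_weights(sizes):
    """Normalized square root of region size, so the weights sum to 1."""
    roots = [math.sqrt(s) for s in sizes]
    total = sum(roots)
    return [r / total for r in roots]

def thresholded_distance(dist, delta):
    """Ground distance capped at the threshold delta."""
    return min(dist, delta)
```

    With region sizes 100 and 1, size-proportional weights would be about 0.99 and 0.01, whereas the square-root weights are 10/11 and 1/11, which illustrates how the difference between large and small regions is reduced.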
  • For compactness and efficiency in distance calculation, the image feature vector 342 is also converted into a bit vector 352 by bit vector conversion component 350. Both the image bit vector 352 and the individual region bit vectors 332 (with region weights) are stored into a database 360, 390 for future image retrieval.
  • A query image goes through the same process of segmentation, feature extraction, bit vector conversion, embedding, and bit vector conversion. Then the query image bit vector is used to do filtering 370 in the image database and obtain the top K images 372 that are closest to the query image's bit vector. The exact EMD* match 380 between the query image and each of the K images 372 is calculated using their region bit vectors. Finally the top k images 382 with smallest EMD* match to the query image are returned.
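  • The filter-then-rank query flow can be sketched in Python as below. The record layout, the function name, and the caller-supplied `exact_distance` function (standing in for the exact EMD* match, which is not implemented here) are assumptions made for illustration.

```python
def search(query_bits, query_regions, database, exact_distance, K, k):
    """database: list of (image_id, image_bits, regions) tuples (hypothetical layout).
    exact_distance: caller-supplied stand-in for the exact EMD* match."""
    # Stage 1: cheap filter by Hamming distance on the image bit vectors.
    by_hamming = sorted(
        database, key=lambda rec: bin(query_bits ^ rec[1]).count("1"))
    candidates = by_hamming[:K]
    # Stage 2: rerank only the K candidates with the expensive distance.
    reranked = sorted(
        candidates, key=lambda rec: exact_distance(query_regions, rec[2]))
    return [rec[0] for rec in reranked[:k]]
```

    The expensive distance is evaluated K times per query rather than once per stored image, which is the point of the filtering stage.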
  • In order to perform similarity searches on a large image database, the system preferably uses a filtering method 370 via approximate EMD embedding. The goal is to find a small candidate image set 372 for the EMD* match by filtering out most of the images which are very different from the query image. The challenge is to quickly find a candidate image set that contains most of the similar images.
  • Previous filtering methods do not work well. The first kind of filtering is to index individual regions and combine the filtering results of all the regions to form the candidate image set. This approach is not effective, because it loses the information of image-level similarity. The second kind is to use a technique to embed EMD into L1 distance and then use Locality Sensitive Hashing (LSH) to find the nearest neighbor(s) in the latter space. This method has interesting provable properties, but it does not work well with compact data structures nor does it consider distance thresholding on real-valued vectors.
  • The present invention uses a new EMD embedding technique that converts a set of region bit vectors 332 into a single image feature vector 342, such that the L1 distance between embedded image feature vectors 342 approximates the EMD on the original region bit vectors 332. The basic step involves picking several random positions (p_1, ..., p_h) and checking for a particular bit pattern (b_1, ..., b_h) at these positions. Given an image
    $I = \{(r_1, w_1), \ldots, (r_k, w_k)\}$
    where r_i is the bit vector for the ith region and w_i is its weight, and a random pattern
    $P = \{(p_1, b_1), \ldots, (p_h, b_h)\}$
    where $p_j \in [0, N-1]$ and $b_j \in \{0, 1\}$, we say region r_i fits pattern P if
    $r_{i,p_j} = b_j$ for j = 1, 2, ..., h.
    Here $r_{i,p_j}$ denotes the p_j-th bit of vector r_i. The matched weight of image I with respect to pattern P preferably is defined as the sum of the weights of the regions in image I that fit pattern P:
    $MW(I, P) = \sum_{i \,:\, r_i \text{ fits } P} w_i$
  • In the example below, if random positions 3, 5 and 7 and the random bit pattern “011” were picked, both r1 and r3 fit the pattern, so the matched weight is 0.1+0.3=0.4.
         1  2  3  4  5  6  7  8   wi
    r1   1  0  0  1  1  0  1  0   0.1
    r2   0  0  1  1  0  1  1  0   0.6
    r3   0  1  0  0  1  0  1  1   0.3
    MW                            0.4
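  • The worked example above can be reproduced with a short Python sketch. The function name and the use of bit strings with 1-indexed positions (to match the table) are choices made for this illustration.

```python
def matched_weight(regions, pattern):
    """regions: list of (bit_string, weight) pairs.
    pattern: list of (position, bit) pairs, positions 1-indexed as in the table."""
    total = 0.0
    for bits, weight in regions:
        if all(bits[pos - 1] == bit for pos, bit in pattern):
            total += weight
    return total

image = [("10011010", 0.1), ("00110110", 0.6), ("01001011", 0.3)]
pattern = [(3, "0"), (5, "1"), (7, "1")]
mw = matched_weight(image, pattern)  # r1 and r3 fit, so mw is about 0.4
```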
  • Intuitively, if two region vectors are similar, they have more bits in common than other regions. So there is a higher chance that two similar regions both fit (or both fail to fit) a random pattern. Given two similar images, each random pattern picks out the regions in the two images that are similar, in effect matching similar regions to each other. If two images are similar, their matched weights with respect to any random pattern should be close to each other. A vector is obtained for every image by listing the matched weights for a number of randomly chosen patterns, and distances between images are computed as L1 distances between these image vectors. When sufficiently many random patterns are used to generate the image vectors, the L1 distance between image vectors should be able to distinguish between similar and dissimilar images.
  • These techniques are designed for distributions on high-dimensional bit vectors, while prior methods, such as that disclosed in P. Indyk and N. Thaper, “Fast Image Retrieval via Embeddings,” 3rd Int'l Workshop on Statistical and Computational Theories of Vision, 2003, are described for distributions of points in R^d, where d is small. Roughly, they decompose the space into collections of disjoint d-dimensional cubes. In fact they have a hierarchy of decompositions for different granularities. For each cube in this decomposition, they calculate the weight of the distribution that falls into this cube and build a vector by listing these counts (suitably weighted). In the technique of the preferred embodiment, the idea of computing the matched weight for a random pattern is analogous to computing the weight that falls into a cube. The prior embedding, such as in Indyk and Thaper, uses different levels of granularity, and the weights assigned to them are exponentially decreasing. This creates problems when sampling coordinates to estimate weighted L1 distance by Hamming distance of compact bit vectors, because the random variables involved have high variance. The scheme of the preferred embodiment of the present invention can be thought of as using only one level of granularity, which is designed to avoid the problem that arises from using many different levels.
  • The implementation of the embedding algorithm is divided into two pieces. The first is Algorithm 3 which generates M sets of random positions and picks a random bit pattern for each set.
    Algorithm 3: Generate M H-bit Random Patterns
    input: M, H, N (region bit vector length)
    output: P[M][H], B[M][H]
    for (i = 0; i < M; i++) do
      for (j = 0; j < H; j++) do
        pick a random position p ∈ [0, N − 1]
        pick a random bit b ∈ {0, 1}
        P[i][j] = p
        B[i][j] = b
      end for
    end for
  • The second piece is Algorithm 4 which, given an image represented by a list of region bit vectors and their corresponding weights, computes its EMD embedding using the random patterns generated by Algorithm 3.
    Algorithm 4: Image EMD Embedding
    input: k, r[k][N], w[k], M, H, P[M][H], B[M][H]
    output: MW[M]
    for (i = 0; i < M; i++) do
      mw = 0.0
      for (j = 0; j < k; j++) do
        h = 0
        while (h < H) && (r[j][P[i][h]] == B[i][h]) do
          h++
        end while
        if h == H then
          mw = mw + w[j]
        end if
      end for
      MW[i] = mw
    end for
  • After the embedding, each image is represented by an M-dimensional real-valued vector. It is further converted to a bit vector using the same algorithm used for converting region feature vectors to region bit vectors. As a result, each image is now represented by a compact bit vector, and the Hamming distance between two images can be efficiently computed by XORing their bit vectors. The filtering algorithm ranks images based on the Hamming distance of their embedded image bit vectors to the query image's bit vector and returns the top K images for exact EMD computation.
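  • Algorithms 3 and 4 can be sketched in Python as follows, together with the L1 distance used to compare two embedded images. The function names and the representation of each pattern as a list of (position, bit) pairs are assumptions made for this example; positions are 0-indexed as in the pseudocode.

```python
import random

def generate_patterns(M, H, N, rng):
    """Algorithm 3: M patterns, each with H (position, bit) pairs over N-bit vectors."""
    return [[(rng.randrange(N), rng.randrange(2)) for _ in range(H)]
            for _ in range(M)]

def embed_image(regions, weights, patterns):
    """Algorithm 4: one matched weight per pattern.
    regions: list of bit lists; weights: the corresponding region weights."""
    embedding = []
    for pattern in patterns:
        mw = 0.0
        for r, w in zip(regions, weights):
            # A region fits when every sampled position holds the required bit.
            if all(r[p] == b for p, b in pattern):
                mw += w
        embedding.append(mw)
    return embedding

def l1_distance(u, v):
    """L1 distance between two embedded image vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(u, v))
```

    As with the (i, t) pairs of Algorithm 1, the same patterns must be applied to every image, including the query image, for the embedded vectors to be comparable.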
  • Unlike two-dimensional images, audio takes place in time, so audio segmentation is the process of breaking up an audio stream into time sections that are perceptually different from adjacent sections. The audio “texture” within a given segment is relatively stable. An example of a segment boundary is a transition from a background sound texture to the beginning of speech over that background. Another segment boundary might occur when the scene changes, such as leaving an office building lobby and going outside onto a busy street. Audio segmentation can be accomplished in two primary ways: blind segmentation, based on sudden changes in extracted audio features, and classification-based segmentation, based on comparing audio features to a set of trained target feature values. The blind method works well and is preferred when the segment textures are varied and unpredictable, but requires the setting of thresholds to yield the best results. The classification-based method works well on a corpus of pre-labeled data, such as databases of speech/music, musical genres, or indoor/outdoor scenes. Either method requires the extraction of audio features.
  • Audio feature extraction is the process of computing a compact numerical representation of a sound segment. A variety of audio features have been used in systems for speech recognition, music/speech discrimination, musical genre (rock, pop, country, classical, jazz, etc.) labeling, and other audio classification tasks. Most features are extracted from short moving windows (5-100 milliseconds in length, moving along at a rate of 5-20 windows per second) by using the Short Time Fourier Transform. Wavelets and compressed data have also been used. Features can be computed at different time resolutions, and the value of each feature, along with the mean and variance of the features can be used as features themselves.
  • Common audio features include power, spectral centroid and rolloff (measures of the relative brightness of sound), spectral flux (a measure of the frame-to-frame variance in spectral shape), zero crossing rate (noisiness), and Mel-Frequency Cepstral Coefficients (MFCCs), which are a compact representation of spectral shape. For domain-specific tasks such as music query/recognition, features such as the Parametric Pitch Histogram and Beat/Periodicity Histogram can be calculated and used. These might be of limited use in certain real-world situations as well. Selection of the correct feature set for a given task has proven to be an important part of building successful systems for machine “audio understanding.” For a fixed corpus, computing many features (40 dimensions or more), then using Principal Components Analysis has proven successful for reducing the dimensionality of the feature/search space.
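  • Two of the simpler time-domain features named above, power and zero crossing rate, can be sketched in Python over moving windows. The function names, the plain-list sample format, and the exact definitions (mean squared amplitude; fraction of adjacent sample pairs with a sign change) are common conventions assumed for this example and are not specified in the text. Spectral features such as the centroid or MFCCs would additionally require a Fourier transform and are not shown.

```python
def frame_power(frame):
    """Mean squared amplitude of one analysis window."""
    return sum(x * x for x in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def windows(samples, size, hop):
    """Split a sample list into frames of `size` samples, advancing by `hop`."""
    return [samples[i:i + size]
            for i in range(0, len(samples) - size + 1, hop)]
```

    A per-segment feature vector could then be formed from the mean and variance of these per-window values, in line with the preceding discussion.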
  • Large numbers of gene expression microarray experiments are represented as matrices of real-valued measurements, where a value in row i and column j is the expression level of gene i in experiment j. Thus, these data are already represented in terms of real-valued vectors, but they do require pre-processing for effective search. The goal of similarity search on genomic data is to identify genes that share patterns of expression. A simple way to do this is to identify closest expression vectors over all experiments, but this is not biologically relevant. In cells, genes act in varying ways under different conditions, and thus two genes co-expressed under one set of conditions may not be co-expressed under another set of conditions. Thus, it is necessary to identify groups of experiments under which sets of genes are potentially co-regulated.
  • This means that to solve the problem of search for gene expression data exactly, one needs to solve the bi-clustering problem for large matrices of gene expression data, which is an NP-complete problem. As no efficient exact solution is known, similarity search is essential in this domain. Bi-clustering approximation algorithms may be used to identify small, incomplete bi-clusters of 2-5 genes, which can be used to define feature vectors. This limit on bi-cluster size will make bi-clustering algorithms tractable, and similarity search will allow these feature vectors to identify complete bi-clusters.
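  • The point that co-expression depends on the chosen subset of experiments can be illustrated with a small Python sketch. Pearson correlation is used here only as an assumed example of a similarity measure; the text does not name one, and the function names and data are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def coexpression(gene_a, gene_b, experiments):
    """Correlation of two expression rows restricted to a subset of experiment columns."""
    return pearson([gene_a[j] for j in experiments],
                   [gene_b[j] for j in experiments])

gene_a = [1, 2, 3, 5, 1, 9]   # made-up expression levels for illustration
gene_b = [2, 4, 6, 0, 7, 1]
```

    In this made-up data the two genes are strongly correlated over experiments 0 to 2 but negatively correlated over all six, so a comparison over all experiments would miss the shared pattern.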
  • Although the system allows users to define a similarity distance function for a specific data type, a preferred embodiment of the invention provides support for a default similarity distance function that is general-purpose. The distance measure may be based on Earth Mover's Distance, which has been used successfully in both image and audio similarity searches. EMD is a flexible metric between multidimensional distributions, represented by sets of weighted features and a distance function between pairs of features. This calculates the minimal amount of work needed to transform one distribution into another by moving distribution “mass”. This is a natural distance measure for weighted sets of features and is applicable to image, audio and scientific data. For example, two sound files that exhibit similar sub-segments, but in different order, would be judged similar by the EMD method.
  • For image feature vectors, the present invention improves upon the standard EMD measure in two ways. The standard EMD uses the region size as its weight, and our first improvement is to use the normalized square root of region size as each region's weight to prevent large regions from dominating the distance calculation. The second improvement comes from the observation that using the raw distance function between regions may allow a pair of different regions to have a disproportionate effect on the overall distance calculation. This issue is addressed by thresholding the raw region distance function, thus making EMD more robust.
  • Although these improvements are described with respect to applications in the image domain, the underlying ideas are generally applicable to other domains as well. To make the similarity search engine general, the similarity search engine interface may be designed to allow each data type to define its own weight function and threshold for its EMD measure.
  • The foregoing description of the preferred embodiment of the invention has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed, and modifications and variations are possible in light of the above teachings or may be acquired from practice of the invention. The embodiments were chosen and described in order to explain the principles of the invention and its practical application to enable one skilled in the art to utilize the invention in various embodiments as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the claims appended hereto, and their equivalents. The entirety of each of the aforementioned documents is incorporated by reference herein.

Claims (35)

1. A method of searching a plurality of stored objects comprising the steps of:
generating a collection of multi-dimensional vectors representing each said object, each of said multi-dimensional vectors having an associated weight;
defining a similarity distance between said objects using a distance function; and
finding objects closest to a query object based upon said distance function.
2. A method of searching a plurality of stored objects according to claim 1, wherein said distance function comprises a monotone function of matched distances.
3. A method of searching a plurality of stored objects according to claim 1 wherein said distance function comprises an Earth Mover's Distance.
4. A method of searching a plurality of stored objects according to claim 1, wherein said step of finding objects closest to a query object comprises the step of using sketches to filter objects to form a candidate set.
5. A method of searching a plurality of stored objects according to claim 4, further comprising the step of using a distance calculation to rank objects in said candidate set.
6. A method of searching a plurality of stored objects according to claim 1 further comprising the step of applying a transformation to each multi-dimensional vector to threshold the distances between pairs of vectors.
7. A search system comprising:
means for inputting data;
a segmentation and feature extraction unit for segmenting data and generating feature vectors representing segmented data; and
a similarity search engine comprising:
a sketch construction unit for converting feature vectors into sketches;
a similarity index;
an indexing unit for identifying a candidate set of objects in said similarity index; and
a similarity ranking component for ranking objects in the candidate set.
8. A search system according to claim 7 wherein said feature vectors have a client-defined distance function.
9. A search system according to claim 7 further comprising an index insertion unit for inserting data into said similarity index.
10. A search system according to claim 7 wherein said sketch construction unit converts a feature vector into a compact representation.
11. A search system according to claim 10 wherein said compact representation comprises a compact bit-vector.
12. A search system according to claim 7 wherein said sketch construction unit maps said feature vectors to a lower-dimensional vector such that said mapping approximates an ordering of objects in an original high dimensional space.
13. A method of processing data comprising the steps of:
segmenting said data into a plurality of segments;
extracting a feature vector from each of said plurality of segments;
converting each of said feature vectors into a segment sketch;
calculating a segment weight for each of said segments; and
embedding a plurality of said segment sketches and weights into a composite data feature vector.
14. A method of processing data according to claim 13 wherein said data comprises at least one of image data, audio data, and genomic data.
15. A method of comparing a search image to a first plurality of stored images comprising the steps of:
segmenting the search image into a plurality of search image regions;
extracting a region feature vector from each of said search image regions;
converting each of said region feature vectors into a region sketch;
storing said region sketches;
calculating a region weight for each of said search image regions;
embedding all of said region sketches and region weights into a composite search image feature vector;
storing said composite search image feature vector; and
selecting a second plurality of images from said database using said composite search image feature vector, wherein said second plurality of images comprises a subset of said first plurality of images.
16. A method of comparing a search image to a database of images according to claim 15, wherein a region's weight is a function of the region's size.
17. A method of comparing a search image to a database of images according to claim 16, wherein said function of a region's size comprises a normalized square root of said region's size.
18. A method of comparing a search image to a database of images according to claim 15, further comprising the steps of:
calculating an image dissimilarity match between said search image and each of said second plurality of images using said region bit vectors of said search image; and
selecting a third plurality of images based upon said image dissimilarity matches, wherein said third plurality of images comprises a subset of said second plurality of images.
19. A method of comparing a search image to a database of images according to claim 18, wherein said image dissimilarity match comprises a distance function.
20. A method of comparing a search image to a database of images according to claim 19, wherein said distance function comprises an Earth Mover's Distance.
21. A method of comparing a search image to a database of images according to claim 19, wherein said distance function uses a function of a region size as a region weight.
22. A method of comparing a search image to a database of images according to claim 21 wherein said function of a region size comprises a square root of a region size.
23. A method of comparing a search image to a database of images according to claim 19, wherein said distance function uses a thresholded region distance.
24. A method of comparing a search image to a database of images according to claim 15 further comprising the steps of:
calculating a distance between two of said plurality of regions by XOR-ing their region bit vectors;
comparing said distance to a threshold;
selecting said distance as a region ground distance function if said distance is less than said threshold;
selecting said threshold as said region ground distance function if said distance is greater than said threshold;
calculating an image dissimilarity match between said search image and each of said second plurality of images using said region bit vectors and said ground distance function; and
selecting a third plurality of images based upon said image dissimilarity matches, wherein said third plurality of images comprises a subset of said second plurality of images.
25. A method of processing an image comprising the steps of:
segmenting said image into a plurality of regions;
extracting a feature vector from each of said regions;
converting each of said feature vectors into a region bit vector;
storing each of said region bit vectors;
embedding all of said region bit vectors into a composite image feature vector;
converting said composite image feature vector into an image bit vector; and
storing said image bit vector.
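Claim 25 does not say how a feature vector is converted into a bit vector. Purely as an illustration, the sketch below uses random-hyperplane hashing, which is a stand-in technique chosen for this example and is not asserted to be the conversion used in the patent. Each bit records on which side of a random hyperplane the feature vector falls. The names `make_hyperplanes` and `to_bit_vector`, the seed, and the bit count are all hypothetical.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplane normals of dimension dim (assumed setup)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def to_bit_vector(feature, hyperplanes):
    """Pack one bit per hyperplane into an int: 1 if the dot product is >= 0."""
    bits = 0
    for plane in hyperplanes:
        dot = sum(f * p for f, p in zip(feature, plane))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits


if __name__ == "__main__":
    planes = make_hyperplanes(dim=4, n_bits=16)
    region_feature = [0.5, -1.0, 2.0, 0.25]  # hypothetical region feature vector
    print(format(to_bit_vector(region_feature, planes), "016b"))
```

The same conversion could be applied once per region and once more to the composite image feature vector, giving the region bit vectors and the image bit vector that the claim stores.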
26. A method for performing a similarity search comprising the steps of:
segmenting input data;
extracting input data feature vectors from said segmented input data;
constructing an input data sketch from said feature vectors;
indexing said input data based upon said sketch;
segmenting query data;
extracting query data feature vectors from said segmented query data;
constructing a query data sketch from said query data feature vectors; and
comparing said query data sketch to a plurality of input data sketches.
27. A system for performing similarity searches on data comprising:
a segmentation and feature extraction unit for segmenting data corresponding to an object into a plurality of data segments and generating a feature vector for each data segment;
a sketch construction component for converting a feature vector into a compact bit-vector corresponding to said object;
a similarity index comprising a plurality of compact bit-vectors corresponding to a plurality of objects; and
an index insertion component for inserting a compact bit-vector corresponding to an object into said similarity index.
28. A system for performing similarity searches on data according to claim 27, further comprising:
an indexing unit for identifying a candidate set of objects from said similarity index based upon a compact bit-vector corresponding to a query object.
29. A system for performing similarity searches on data according to claim 28, further comprising:
a similarity ranking component for ranking objects in said candidate set by estimating their distances to the query object.
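The components recited in claims 27 to 29 can be pictured with a small in-memory sketch. It assumes that compact bit-vectors are Python integers, that candidates are found by a linear scan with a Hamming-distance cutoff, and that ranking orders candidates by the same distance. The class name `SimilarityIndex`, its method names, and the cutoff parameter are hypothetical and are not taken from the patent.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two compact bit-vectors."""
    return bin(a ^ b).count("1")


class SimilarityIndex:
    """Toy similarity index mapping object ids to compact bit-vectors."""

    def __init__(self):
        self._sketches = {}

    def insert(self, object_id, sketch):
        """Index insertion: store the bit-vector for an object."""
        self._sketches[object_id] = sketch

    def candidates(self, query_sketch, max_distance):
        """Indexing unit: ids whose sketch lies within max_distance of the query."""
        return [oid for oid, s in self._sketches.items()
                if hamming(query_sketch, s) <= max_distance]

    def rank(self, query_sketch, candidate_ids):
        """Similarity ranking: order candidates by estimated distance to the query."""
        return sorted(candidate_ids,
                      key=lambda oid: (hamming(query_sketch, self._sketches[oid]), oid))


if __name__ == "__main__":
    index = SimilarityIndex()
    index.insert("a", 0b0011)
    index.insert("b", 0b0001)
    index.insert("c", 0b1111)
    found = index.candidates(0b0000, max_distance=2)
    print(index.rank(0b0000, found))
```

A linear scan is used here only for brevity; an index built for large collections would avoid comparing the query against every stored bit-vector.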
30. A system for performing similarity searches on data comprising:
a first segmentation and feature extraction unit for segmenting data corresponding to a first type of object into a plurality of data segments and generating a feature vector for each data segment;
a second segmentation and feature extraction unit for segmenting data corresponding to a second type of object into a plurality of data segments and generating a feature vector for each data segment;
a sketch construction component for converting a feature vector into a compact bit-vector corresponding to an object;
a similarity index comprising a plurality of compact bit-vectors corresponding to a plurality of objects; and
an index insertion component for inserting a compact bit-vector corresponding to an object into said similarity index.
31. A system for performing similarity searches on data according to claim 30, wherein said first data type comprises image data.
32. A system for performing similarity searches on data according to claim 31, wherein said second data type comprises audio data.
33. A system for performing similarity searches on data according to claim 30, wherein said first and second data types each comprise a different data type selected from the group of image data, audio data, and genomic data.
34. A system for performing similarity searches on data according to claim 30, further comprising:
an indexing unit for identifying a candidate set of objects from said similarity index based upon a compact bit-vector corresponding to a query object.
35. A system for performing similarity searches on data according to claim 31, further comprising:
a similarity ranking component for ranking objects in said candidate set by estimating their distances to the query object.
US11/219,822 2004-11-08 2005-09-07 Similarity search system with compact data structures Expired - Fee Related US7966327B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US11/219,822 US7966327B2 (en) 2004-11-08 2005-09-07 Similarity search system with compact data structures

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US62582804P 2004-11-08 2004-11-08
US11/219,822 US7966327B2 (en) 2004-11-08 2005-09-07 Similarity search system with compact data structures

Publications (2)

Publication Number Publication Date
US20060101060A1 (en) 2006-05-11
US7966327B2 US7966327B2 (en) 2011-06-21

Family

ID=36317590

Family Applications (1)

Application Number Title Priority Date Filing Date
US11/219,822 Expired - Fee Related US7966327B2 (en) 2004-11-08 2005-09-07 Similarity search system with compact data structures

Country Status (1)

Country Link
US (1) US7966327B2 (en)

Cited By (96)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060277178A1 (en) * 2005-06-02 2006-12-07 Wang Ting Z Table look-up method with adaptive hashing
US20070085716A1 (en) * 2005-09-30 2007-04-19 International Business Machines Corporation System and method for detecting matches of small edit distance
US20070150486A1 (en) * 2005-12-14 2007-06-28 Microsoft Corporation Two-dimensional conditional random fields for web extraction
US20070156652A1 (en) * 2005-12-29 2007-07-05 Microsoft Corporation Displaying key differentiators based on standard deviations within a distance metric
US20070299865A1 (en) * 2006-06-27 2007-12-27 Nahava Inc. Method and Apparatus for fast similarity-based query, self-join, and join for massive, high-dimension datasets
US20080027910A1 (en) * 2006-07-25 2008-01-31 Microsoft Corporation Web object retrieval based on a language model
US20080027969A1 (en) * 2006-07-31 2008-01-31 Microsoft Corporation Hierarchical conditional random fields for web extraction
US20080033915A1 (en) * 2006-08-03 2008-02-07 Microsoft Corporation Group-by attribute value in search results
US20080044016A1 (en) * 2006-08-04 2008-02-21 Henzinger Monika H Detecting duplicate and near-duplicate files
US20080133496A1 (en) * 2006-12-01 2008-06-05 International Business Machines Corporation Method, computer program product, and device for conducting a multi-criteria similarity search
US20080133441A1 (en) * 2006-12-01 2008-06-05 Sun Microsystems, Inc. Method and system for recommending music
US20080162478A1 (en) * 2001-01-24 2008-07-03 William Pugh Detecting duplicate and near-duplicate files
US20080198844A1 (en) * 2007-02-20 2008-08-21 Searete, Llc Cross-media communication coordination
US20080201651A1 (en) * 2007-02-16 2008-08-21 Palo Alto Research Center Incorporated System and method for annotating documents using a viewer
US20080201389A1 (en) * 2007-02-20 2008-08-21 Searete, Llc Cross-media storage coordination
WO2008105962A2 (en) * 2006-10-16 2008-09-04 The Penn State Research Foundation Real-time computerized annotation of pictures
US20080263010A1 (en) * 2006-12-12 2008-10-23 Microsoft Corporation Techniques to selectively access meeting content
US20090024580A1 (en) * 2007-07-20 2009-01-22 Pere Obrador Compositional balance and color driven content retrieval
US20090192997A1 (en) * 2008-01-25 2009-07-30 International Business Machines Corporation Service search system, method, and program
US20090259606A1 (en) * 2008-04-11 2009-10-15 Seah Vincent Pei-Wen Diversified, self-organizing map system and method
US20100010973A1 (en) * 2008-07-09 2010-01-14 International Business Machines Corporation Vector Space Lightweight Directory Access Protocol Data Search
US20100057804A1 (en) * 2008-07-24 2010-03-04 Nahava Inc. Method and Apparatus for partitioning high-dimension vectors for use in a massive index tree
US20100070509A1 (en) * 2008-08-15 2010-03-18 Kai Li System And Method For High-Dimensional Similarity Search
US20100076671A1 (en) * 2008-03-19 2010-03-25 Harman Becker Automotive Systems Gmbh Method for providing a traffic pattern for navigation map data and navigation map data
US20100125553A1 (en) * 2008-11-14 2010-05-20 Data Domain, Inc. Delta compression after identity deduplication
US20100135527A1 (en) * 2008-12-02 2010-06-03 Yi Wu Image recognition algorithm, method of identifying a target image using same, and method of selecting data for transmission to a portable electronic device
US7761466B1 (en) 2007-07-30 2010-07-20 Hewlett-Packard Development Company, L.P. Hash-based image identification
US20100313036A1 (en) * 2009-06-09 2010-12-09 Data Domain, Inc. Segment deduplication system with encryption of segments
US20100312800A1 (en) * 2009-06-09 2010-12-09 Data Domain, Inc. Segment deduplication system with compression of segments
US20100313040A1 (en) * 2009-06-09 2010-12-09 Data Domain, Inc. Segment deduplication system with encryption and compression of segments
US20110037766A1 (en) * 2009-08-17 2011-02-17 Nexidia Inc. Cluster map display
US20110077998A1 (en) * 2009-09-29 2011-03-31 Microsoft Corporation Categorizing online user behavior data
US7958101B1 (en) * 2006-01-03 2011-06-07 Emc Corporation Methods and apparatus for mounting a file system
US20110153677A1 (en) * 2009-12-18 2011-06-23 Electronics And Telecommunications Research Institute Apparatus and method for managing index information of high-dimensional data
US20110317009A1 (en) * 2010-06-23 2011-12-29 MindTree Limited Capturing Events Of Interest By Spatio-temporal Video Analysis
US8095542B1 (en) * 2006-01-03 2012-01-10 Emc Corporation Methods and apparatus for allowing access to content
WO2012054399A1 (en) * 2010-10-17 2012-04-26 Canon Kabushiki Kaisha Systems and methods for cluster comparison
US8184953B1 (en) * 2008-02-22 2012-05-22 Google Inc. Selection of hash lookup keys for efficient retrieval
US20120141019A1 (en) * 2010-12-07 2012-06-07 Sony Corporation Region description and modeling for image subscene recognition
US20130039584A1 (en) * 2011-08-11 2013-02-14 Oztan Harmanci Method and apparatus for detecting near-duplicate images using content adaptive hash lookups
US8447740B1 (en) 2008-11-14 2013-05-21 Emc Corporation Stream locality delta compression
EP2608078A1 (en) * 2011-12-23 2013-06-26 Thomson Licensing Method of automatic management of images in a collection of images and corresponding device
US8553981B2 (en) 2011-05-17 2013-10-08 Microsoft Corporation Gesture-based visual search
US8782077B1 (en) * 2011-06-10 2014-07-15 Google Inc. Query image search
US8849772B1 (en) * 2008-11-14 2014-09-30 Emc Corporation Data replication with delta compression
US20140365463A1 (en) * 2013-06-05 2014-12-11 Digitalglobe, Inc. Modular image mining and search
CN104462217A (en) * 2014-11-09 2015-03-25 浙江大学 Time-series similarity measurement method based on segmented statistical approximate representation
US20150120760A1 (en) * 2013-10-31 2015-04-30 Adobe Systems Incorporated Image tagging
US9026536B2 (en) 2010-10-17 2015-05-05 Canon Kabushiki Kaisha Systems and methods for cluster comparison
US20150170068A1 (en) * 2013-12-17 2015-06-18 International Business Machines Corporation Determining analysis recommendations based on data analysis context
US20160098613A1 (en) * 2005-09-30 2016-04-07 Facebook, Inc. Apparatus, method and program for image search
US9336302B1 (en) 2012-07-20 2016-05-10 Zuci Realty Llc Insight and algorithmic clustering for automated synthesis
US20160224636A1 (en) * 2015-01-30 2016-08-04 Nec Europe Ltd. Scalable system and method for weighted similarity estimation in massive datasets revealed in a streaming fashion
CN105930873A (en) * 2016-04-27 2016-09-07 天津中科智能识别产业技术研究院有限公司 Self-paced cross-modal matching method based on subspace
CN106095893A (en) * 2016-06-06 2016-11-09 北京大学深圳研究生院 A kind of cross-media retrieval method
WO2016187417A1 (en) * 2015-05-20 2016-11-24 Ebay Inc. Multi-faceted entity identification in search
WO2017115218A1 (en) * 2015-12-30 2017-07-06 International Business Machines Corporation Predicting target characteristic data
US20180157681A1 (en) * 2016-12-06 2018-06-07 Ebay Inc. Anchored search
WO2018125932A1 (en) * 2016-12-29 2018-07-05 Shutterstock, Inc. Clustering search results based on image composition
US20180322195A1 (en) * 2017-05-04 2018-11-08 Buzzmuisq Inc. Method for recommending musing in playlist and apparatus using the same
US10229143B2 (en) 2015-06-23 2019-03-12 Microsoft Technology Licensing, Llc Storage and retrieval of data from a bit vector search index
US10242071B2 (en) 2015-06-23 2019-03-26 Microsoft Technology Licensing, Llc Preliminary ranker for scoring matching documents
US20190197134A1 (en) * 2017-12-22 2019-06-27 Oracle International Corporation Computerized geo-referencing for images
CN110209663A (en) * 2018-02-14 2019-09-06 阿里巴巴集团控股有限公司 The method, apparatus and storage medium that search range determines
US10437878B2 (en) * 2016-12-28 2019-10-08 Shutterstock, Inc. Identification of a salient portion of an image
CN110377778A (en) * 2019-07-11 2019-10-25 北京字节跳动网络技术有限公司 Figure sort method, device and electronic equipment based on title figure correlation
US10467215B2 (en) 2015-06-23 2019-11-05 Microsoft Technology Licensing, Llc Matching documents using a bit vector search index
US10503775B1 (en) * 2016-12-28 2019-12-10 Shutterstock, Inc. Composition aware image querying
CN110765356A (en) * 2019-10-23 2020-02-07 绍兴柯桥浙工大创新研究院发展有限公司 Industrial design man-machine data query system for retrieving and sorting according to user habits
US10565198B2 (en) 2015-06-23 2020-02-18 Microsoft Technology Licensing, Llc Bit vector search index using shards
US10595054B2 (en) 2016-05-10 2020-03-17 Google Llc Method and apparatus for a virtual online video channel
US10733164B2 (en) 2015-06-23 2020-08-04 Microsoft Technology Licensing, Llc Updating a bit vector search index
US10750248B1 (en) 2016-05-10 2020-08-18 Google Llc Method and apparatus for server-side content delivery network switching
US10750216B1 (en) 2016-05-10 2020-08-18 Google Llc Method and apparatus for providing peer-to-peer content delivery
US10771824B1 (en) 2016-05-10 2020-09-08 Google Llc System for managing video playback using a server generated manifest/playlist
US10785508B2 (en) 2016-05-10 2020-09-22 Google Llc System for measuring video playback events using a server generated manifest/playlist
CN111767419A (en) * 2019-05-22 2020-10-13 北京京东尚科信息技术有限公司 Picture searching method, device, equipment and computer readable storage medium
WO2020243437A1 (en) * 2019-05-31 2020-12-03 Q2 Software, Inc. System and method for information retrieval for noisy data
US10877948B1 (en) * 2020-07-01 2020-12-29 Tamr, Inc. Method and computer program product for geospatial binning
US20210026821A1 (en) * 2019-07-26 2021-01-28 Introhive Services Inc. Data cleansing system and method
US10949467B2 (en) * 2018-03-01 2021-03-16 Huawei Technologies Canada Co., Ltd. Random draw forest index structure for searching large scale unstructured data
US11032588B2 (en) 2016-05-16 2021-06-08 Google Llc Method and apparatus for spatial enhanced adaptive bitrate live streaming for 360 degree video playback
US11039181B1 (en) 2016-05-09 2021-06-15 Google Llc Method and apparatus for secure video manifest/playlist generation and playback
WO2021121129A1 (en) * 2020-06-30 2021-06-24 平安科技(深圳)有限公司 Method and apparatus for similar case detection, device, and storage medium
US20210200768A1 (en) * 2018-09-11 2021-07-01 Intuit Inc. Responding to similarity queries using vector dimensionality reduction
US11069378B1 (en) 2016-05-10 2021-07-20 Google Llc Method and apparatus for frame accurate high resolution video editing in cloud using live video streams
CN113239222A (en) * 2021-01-19 2021-08-10 佳木斯大学 Image retrieval method based on image information extraction and EMD distance improvement
US11106708B2 (en) * 2018-03-01 2021-08-31 Huawei Technologies Canada Co., Ltd. Layered locality sensitive hashing (LSH) partition indexing for big data applications
US11205103B2 (en) 2016-12-09 2021-12-21 The Research Foundation for the State University Semisupervised autoencoder for sentiment analysis
US11281639B2 (en) 2015-06-23 2022-03-22 Microsoft Technology Licensing, Llc Match fix-up to remove matching documents
US11321486B2 (en) * 2020-07-21 2022-05-03 Alipay (Hangzhou) Information Technology Co., Ltd. Method, apparatus, device, and readable medium for identifying private data
US11386262B1 (en) 2016-04-27 2022-07-12 Google Llc Systems and methods for a knowledge-based form creation platform
US11392568B2 (en) 2015-06-23 2022-07-19 Microsoft Technology Licensing, Llc Reducing matching documents for a search query
EP4071627A1 (en) * 2021-04-09 2022-10-12 INTEL Corporation Technologies for tuning performance and/or accuracy of similarity search using stochastic associative memories
US20220414108A1 (en) * 2021-06-29 2022-12-29 United States Of America As Represented By The Secretary Of The Army Classification engineering using regional locality-sensitive hashing (lsh) searches
US11797531B2 (en) * 2020-08-04 2023-10-24 Micron Technology, Inc. Acceleration of data queries in memory

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7404150B2 (en) * 2005-11-14 2008-07-22 Red Hat, Inc. Searching desktop objects based on time comparison
US8417037B2 (en) * 2007-07-16 2013-04-09 Alexander Bronstein Methods and systems for representation and matching of video content
JP4956452B2 (en) * 2008-01-25 2012-06-20 富士重工業株式会社 Vehicle environment recognition device
JP4876080B2 (en) * 2008-01-25 2012-02-15 富士重工業株式会社 Environment recognition device
KR101266358B1 (en) * 2008-12-22 2013-05-22 한국전자통신연구원 A distributed index system based on multi-length signature files and method thereof
US8889976B2 (en) * 2009-08-14 2014-11-18 Honda Motor Co., Ltd. Musical score position estimating device, musical score position estimating method, and musical score position estimating robot
US9449024B2 (en) 2010-11-19 2016-09-20 Microsoft Technology Licensing, Llc File kinship for multimedia data tracking
US8370338B2 (en) * 2010-12-03 2013-02-05 Xerox Corporation Large-scale asymmetric comparison computation for binary embeddings
US9043326B2 (en) 2011-01-28 2015-05-26 The Curators Of The University Of Missouri Methods and systems for biclustering algorithm
US9087395B1 (en) * 2011-04-28 2015-07-21 A9.Com, Inc. Techniques for providing content animation
US9042648B2 (en) 2012-02-23 2015-05-26 Microsoft Technology Licensing, Llc Salient object segmentation
US8705870B2 (en) 2012-03-02 2014-04-22 Microsoft Corporation Image searching by approximate κ-NN graph
US9710493B2 (en) 2013-03-08 2017-07-18 Microsoft Technology Licensing, Llc Approximate K-means via cluster closures
US9361329B2 (en) 2013-12-13 2016-06-07 International Business Machines Corporation Managing time series databases
US20190332619A1 (en) * 2014-08-07 2019-10-31 Cortical.Io Ag Methods and systems for mapping data items to sparse distributed representations
CN104391866B (en) * 2014-10-24 2017-07-28 宁波大学 A kind of approximate member's querying method based on high dimensional data filter
US10516782B2 (en) 2015-02-03 2019-12-24 Dolby Laboratories Licensing Corporation Conference searching and playback of search results
JP6638484B2 (en) * 2016-03-10 2020-01-29 富士通株式会社 Information processing apparatus, similarity search program, and similarity search method
WO2017204819A1 (en) * 2016-05-27 2017-11-30 Hewlett Packard Enterprise Development Lp Similarity analyses in analytics workflows

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4683496A (en) * 1985-08-23 1987-07-28 The Analytic Sciences Corporation System for and method of enhancing images using multiband information
US5220441A (en) * 1990-09-28 1993-06-15 Eastman Kodak Company Mechanism for determining parallax between digital images
US5432895A (en) * 1992-10-01 1995-07-11 University Corporation For Atmospheric Research Virtual reality imaging system
US20020128997A1 (en) * 2001-03-07 2002-09-12 Rockwell Technologies, Llc System and method for estimating the point of diminishing returns in data mining processing
US20040133526A1 (en) * 2001-03-20 2004-07-08 Oded Shmueli Negotiating platform
US7158961B1 (en) * 2001-12-31 2007-01-02 Google, Inc. Methods and apparatus for estimating similarity
US20070118432A1 (en) * 2005-11-21 2007-05-24 Vijay Vazirani Systems and methods for optimizing revenue in search engine auctions

Cited By (169)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9275143B2 (en) 2001-01-24 2016-03-01 Google Inc. Detecting duplicate and near-duplicate files
US20080162478A1 (en) * 2001-01-24 2008-07-03 William Pugh Detecting duplicate and near-duplicate files
US7941009B2 (en) 2003-04-08 2011-05-10 The Penn State Research Foundation Real-time computerized annotation of pictures
US20090204637A1 (en) * 2003-04-08 2009-08-13 The Penn State Research Foundation Real-time computerized annotation of pictures
US20060277178A1 (en) * 2005-06-02 2006-12-07 Wang Ting Z Table look-up method with adaptive hashing
US7539661B2 (en) * 2005-06-02 2009-05-26 Delphi Technologies, Inc. Table look-up method with adaptive hashing
US9881229B2 (en) * 2005-09-30 2018-01-30 Facebook, Inc. Apparatus, method and program for image search
US10810454B2 (en) * 2005-09-30 2020-10-20 Facebook, Inc. Apparatus, method and program for image search
US20180129898A1 (en) * 2005-09-30 2018-05-10 Facebook, Inc. Apparatus, method and program for image search
US20070085716A1 (en) * 2005-09-30 2007-04-19 International Business Machines Corporation System and method for detecting matches of small edit distance
US20160098613A1 (en) * 2005-09-30 2016-04-07 Facebook, Inc. Apparatus, method and program for image search
US20070150486A1 (en) * 2005-12-14 2007-06-28 Microsoft Corporation Two-dimensional conditional random fields for web extraction
US7529761B2 (en) 2005-12-14 2009-05-05 Microsoft Corporation Two-dimensional conditional random fields for web extraction
US20100318524A1 (en) * 2005-12-29 2010-12-16 Microsoft Corporation Displaying Key Differentiators Based On Standard Deviations Within A Distance Metric
US20070156652A1 (en) * 2005-12-29 2007-07-05 Microsoft Corporation Displaying key differentiators based on standard deviations within a distance metric
US7774344B2 (en) * 2005-12-29 2010-08-10 Microsoft Corporation Displaying key differentiators based on standard deviations within a distance metric
US8095542B1 (en) * 2006-01-03 2012-01-10 Emc Corporation Methods and apparatus for allowing access to content
US7958101B1 (en) * 2006-01-03 2011-06-07 Emc Corporation Methods and apparatus for mounting a file system
US8117213B1 (en) 2006-06-27 2012-02-14 Nahava Inc. Method and apparatus for fast similarity-based query, self-join, and join for massive, high-dimension datasets
US20070299865A1 (en) * 2006-06-27 2007-12-27 Nahava Inc. Method and Apparatus for fast similarity-based query, self-join, and join for massive, high-dimension datasets
US7644090B2 (en) * 2006-06-27 2010-01-05 Nahava Inc. Method and apparatus for fast similarity-based query, self-join, and join for massive, high-dimension datasets
US20080027910A1 (en) * 2006-07-25 2008-01-31 Microsoft Corporation Web object retrieval based on a language model
US8001130B2 (en) 2006-07-25 2011-08-16 Microsoft Corporation Web object retrieval based on a language model
US20100281009A1 (en) * 2006-07-31 2010-11-04 Microsoft Corporation Hierarchical conditional random fields for web extraction
US20080027969A1 (en) * 2006-07-31 2008-01-31 Microsoft Corporation Hierarchical conditional random fields for web extraction
US7720830B2 (en) * 2006-07-31 2010-05-18 Microsoft Corporation Hierarchical conditional random fields for web extraction
US20080033915A1 (en) * 2006-08-03 2008-02-07 Microsoft Corporation Group-by attribute value in search results
US7921106B2 (en) 2006-08-03 2011-04-05 Microsoft Corporation Group-by attribute value in search results
US8015162B2 (en) 2006-08-04 2011-09-06 Google Inc. Detecting duplicate and near-duplicate files
US20080044016A1 (en) * 2006-08-04 2008-02-21 Henzinger Monika H Detecting duplicate and near-duplicate files
WO2008105962A3 (en) * 2006-10-16 2008-11-13 Penn State Res Found Real-time computerized annotation of pictures
WO2008105962A2 (en) * 2006-10-16 2008-09-04 The Penn State Research Foundation Real-time computerized annotation of pictures
US7696427B2 (en) * 2006-12-01 2010-04-13 Oracle America, Inc. Method and system for recommending music
US20080133496A1 (en) * 2006-12-01 2008-06-05 International Business Machines Corporation Method, computer program product, and device for conducting a multi-criteria similarity search
US20080133441A1 (en) * 2006-12-01 2008-06-05 Sun Microsystems, Inc. Method and system for recommending music
US20080263010A1 (en) * 2006-12-12 2008-10-23 Microsoft Corporation Techniques to selectively access meeting content
US8276060B2 (en) * 2007-02-16 2012-09-25 Palo Alto Research Center Incorporated System and method for annotating documents using a viewer
US20080201651A1 (en) * 2007-02-16 2008-08-21 Palo Alto Research Center Incorporated System and method for annotating documents using a viewer
US7860887B2 (en) 2007-02-20 2010-12-28 The Invention Science Fund I, Llc Cross-media storage coordination
US9760588B2 (en) 2007-02-20 2017-09-12 Invention Science Fund I, Llc Cross-media storage coordination
US9008117B2 (en) 2007-02-20 2015-04-14 The Invention Science Fund I, Llc Cross-media storage coordination
US9008116B2 (en) 2007-02-20 2015-04-14 The Invention Science Fund I, Llc Cross-media communication coordination
US20080198844A1 (en) * 2007-02-20 2008-08-21 Searete, Llc Cross-media communication coordination
US20080201389A1 (en) * 2007-02-20 2008-08-21 Searete, Llc Cross-media storage coordination
US7917518B2 (en) * 2007-07-20 2011-03-29 Hewlett-Packard Development Company, L.P. Compositional balance and color driven content retrieval
US20090024580A1 (en) * 2007-07-20 2009-01-22 Pere Obrador Compositional balance and color driven content retrieval
US7761466B1 (en) 2007-07-30 2010-07-20 Hewlett-Packard Development Company, L.P. Hash-based image identification
US20090192997A1 (en) * 2008-01-25 2009-07-30 International Business Machines Corporation Service search system, method, and program
US8121995B2 (en) * 2008-01-25 2012-02-21 International Business Machines Corporation Service search system, method, and program
US8712216B1 (en) * 2008-02-22 2014-04-29 Google Inc. Selection of hash lookup keys for efficient retrieval
US8184953B1 (en) * 2008-02-22 2012-05-22 Google Inc. Selection of hash lookup keys for efficient retrieval
US20100076671A1 (en) * 2008-03-19 2010-03-25 Harman Becker Automotive Systems Gmbh Method for providing a traffic pattern for navigation map data and navigation map data
US20090259606A1 (en) * 2008-04-11 2009-10-15 Seah Vincent Pei-Wen Diversified, self-organizing map system and method
US8918383B2 (en) * 2008-07-09 2014-12-23 International Business Machines Corporation Vector space lightweight directory access protocol data search
US20100010973A1 (en) * 2008-07-09 2010-01-14 International Business Machines Corporation Vector Space Lightweight Directory Access Protocol Data Search
US20100057804A1 (en) * 2008-07-24 2010-03-04 Nahava Inc. Method and Apparatus for partitioning high-dimension vectors for use in a massive index tree
US20100070509A1 (en) * 2008-08-15 2010-03-18 Kai Li System And Method For High-Dimensional Similarity Search
US20100125553A1 (en) * 2008-11-14 2010-05-20 Data Domain, Inc. Delta compression after identity deduplication
US20150052103A1 (en) * 2008-11-14 2015-02-19 Emc Corporation Data replication with delta compression
US8447740B1 (en) 2008-11-14 2013-05-21 Emc Corporation Stream locality delta compression
US9418133B2 (en) * 2008-11-14 2016-08-16 Emc Corporation Data replication with delta compression
US8849772B1 (en) * 2008-11-14 2014-09-30 Emc Corporation Data replication with delta compression
US8751462B2 (en) * 2008-11-14 2014-06-10 Emc Corporation Delta compression after identity deduplication
US20100135527A1 (en) * 2008-12-02 2010-06-03 Yi Wu Image recognition algorithm, method of identifying a target image using same, and method of selecting data for transmission to a portable electronic device
CN101950351A (en) * 2008-12-02 2011-01-19 英特尔公司 Method of identifying target image using image recognition algorithm
US8391615B2 (en) 2008-12-02 2013-03-05 Intel Corporation Image recognition algorithm, method of identifying a target image using same, and method of selecting data for transmission to a portable electronic device
US20100313040A1 (en) * 2009-06-09 2010-12-09 Data Domain, Inc. Segment deduplication system with encryption and compression of segments
US20100313036A1 (en) * 2009-06-09 2010-12-09 Data Domain, Inc. Segment deduplication system with encryption of segments
US8731190B2 (en) 2009-06-09 2014-05-20 Emc Corporation Segment deduplication system with encryption and compression of segments
US20100312800A1 (en) * 2009-06-09 2010-12-09 Data Domain, Inc. Segment deduplication system with compression of segments
US8762348B2 (en) * 2009-06-09 2014-06-24 Emc Corporation Segment deduplication system with compression of segments
US8401181B2 (en) 2009-06-09 2013-03-19 Emc Corporation Segment deduplication system with encryption of segments
US20110037766A1 (en) * 2009-08-17 2011-02-17 Nexidia Inc. Cluster map display
US20110077998A1 (en) * 2009-09-29 2011-03-31 Microsoft Corporation Categorizing online user behavior data
US20110153677A1 (en) * 2009-12-18 2011-06-23 Electronics And Telecommunications Research Institute Apparatus and method for managing index information of high-dimensional data
US20110317009A1 (en) * 2010-06-23 2011-12-29 MindTree Limited Capturing Events Of Interest By Spatio-temporal Video Analysis
US8730396B2 (en) * 2010-06-23 2014-05-20 MindTree Limited Capturing events of interest by spatio-temporal video analysis
WO2012054399A1 (en) * 2010-10-17 2012-04-26 Canon Kabushiki Kaisha Systems and methods for cluster comparison
US9026536B2 (en) 2010-10-17 2015-05-05 Canon Kabushiki Kaisha Systems and methods for cluster comparison
US20120141019A1 (en) * 2010-12-07 2012-06-07 Sony Corporation Region description and modeling for image subscene recognition
US8705866B2 (en) * 2010-12-07 2014-04-22 Sony Corporation Region description and modeling for image subscene recognition
US8831349B2 (en) 2011-05-17 2014-09-09 Microsoft Corporation Gesture-based visual search
US8553981B2 (en) 2011-05-17 2013-10-08 Microsoft Corporation Gesture-based visual search
US9031960B1 (en) 2011-06-10 2015-05-12 Google Inc. Query image search
US8782077B1 (en) * 2011-06-10 2014-07-15 Google Inc. Query image search
US9002831B1 (en) 2011-06-10 2015-04-07 Google Inc. Query image search
US8983939B1 (en) 2011-06-10 2015-03-17 Google Inc. Query image search
US9047534B2 (en) * 2011-08-11 2015-06-02 Anvato, Inc. Method and apparatus for detecting near-duplicate images using content adaptive hash lookups
US20130039584A1 (en) * 2011-08-11 2013-02-14 Oztan Harmanci Method and apparatus for detecting near-duplicate images using content adaptive hash lookups
US9141884B2 (en) 2011-12-23 2015-09-22 Thomson Licensing Method of automatic management of images in a collection of images and corresponding device
EP2608062A1 (en) * 2011-12-23 2013-06-26 Thomson Licensing Method of automatic management of images in a collection of images and corresponding device
EP2608078A1 (en) * 2011-12-23 2013-06-26 Thomson Licensing Method of automatic management of images in a collection of images and corresponding device
US11216428B1 (en) 2012-07-20 2022-01-04 Ool Llc Insight and algorithmic clustering for automated synthesis
US9336302B1 (en) 2012-07-20 2016-05-10 Zuci Realty Llc Insight and algorithmic clustering for automated synthesis
US10318503B1 (en) 2012-07-20 2019-06-11 Ool Llc Insight and algorithmic clustering for automated synthesis
US9607023B1 (en) 2012-07-20 2017-03-28 Ool Llc Insight and algorithmic clustering for automated synthesis
US10482122B2 (en) * 2013-06-05 2019-11-19 Digitalglobe, Inc. System and method for multiresolution and multitemporal image search
US20140365463A1 (en) * 2013-06-05 2014-12-11 Digitalglobe, Inc. Modular image mining and search
US20170235767A1 (en) * 2013-06-05 2017-08-17 Digitalglobe, Inc. System and method for multiresolution and multitemporal image search
US9529824B2 (en) * 2013-06-05 2016-12-27 Digitalglobe, Inc. System and method for multi resolution and multi temporal image search
US9607014B2 (en) * 2013-10-31 2017-03-28 Adobe Systems Incorporated Image tagging
US20150120760A1 (en) * 2013-10-31 2015-04-30 Adobe Systems Incorporated Image tagging
US20150170068A1 (en) * 2013-12-17 2015-06-18 International Business Machines Corporation Determining analysis recommendations based on data analysis context
CN104462217A (en) * 2014-11-09 2015-03-25 Zhejiang University Time-series similarity measurement method based on segmented statistical approximate representation
US20160224636A1 (en) * 2015-01-30 2016-08-04 Nec Europe Ltd. Scalable system and method for weighted similarity estimation in massive datasets revealed in a streaming fashion
US10970296B2 (en) * 2015-01-30 2021-04-06 Nec Corporation System and method for data mining and similarity estimation
US10402414B2 (en) * 2015-01-30 2019-09-03 Nec Corporation Scalable system and method for weighted similarity estimation in massive datasets revealed in a streaming fashion
CN108140026A (en) * 2015-05-20 2018-06-08 eBay Inc. Multi-panel Entity recognition in search
WO2016187417A1 (en) * 2015-05-20 2016-11-24 Ebay Inc. Multi-faceted entity identification in search
US10360621B2 (en) 2015-05-20 2019-07-23 Ebay Inc. Near-identical multi-faceted entity identification in search
US10733164B2 (en) 2015-06-23 2020-08-04 Microsoft Technology Licensing, Llc Updating a bit vector search index
US11281639B2 (en) 2015-06-23 2022-03-22 Microsoft Technology Licensing, Llc Match fix-up to remove matching documents
US10229143B2 (en) 2015-06-23 2019-03-12 Microsoft Technology Licensing, Llc Storage and retrieval of data from a bit vector search index
US10242071B2 (en) 2015-06-23 2019-03-26 Microsoft Technology Licensing, Llc Preliminary ranker for scoring matching documents
US11392568B2 (en) 2015-06-23 2022-07-19 Microsoft Technology Licensing, Llc Reducing matching documents for a search query
US10565198B2 (en) 2015-06-23 2020-02-18 Microsoft Technology Licensing, Llc Bit vector search index using shards
US10467215B2 (en) 2015-06-23 2019-11-05 Microsoft Technology Licensing, Llc Matching documents using a bit vector search index
US10599788B2 (en) 2015-12-30 2020-03-24 International Business Machines Corporation Predicting target characteristic data
WO2017115218A1 (en) * 2015-12-30 2017-07-06 International Business Machines Corporation Predicting target characteristic data
US11200357B2 (en) 2015-12-30 2021-12-14 International Business Machines Corporation Predicting target characteristic data
CN105930873A (en) * 2016-04-27 2016-09-07 Tianjin Zhongke Intelligent Identification Industry Technology Research Institute Co., Ltd. Self-paced cross-modal matching method based on subspace
US11386262B1 (en) 2016-04-27 2022-07-12 Google Llc Systems and methods for a knowledge-based form creation platform
US11039181B1 (en) 2016-05-09 2021-06-15 Google Llc Method and apparatus for secure video manifest/playlist generation and playback
US11647237B1 (en) 2016-05-09 2023-05-09 Google Llc Method and apparatus for secure video manifest/playlist generation and playback
US10785508B2 (en) 2016-05-10 2020-09-22 Google Llc System for measuring video playback events using a server generated manifest/playlist
US11069378B1 (en) 2016-05-10 2021-07-20 Google Llc Method and apparatus for frame accurate high resolution video editing in cloud using live video streams
US11589085B2 (en) 2016-05-10 2023-02-21 Google Llc Method and apparatus for a virtual online video channel
US10750248B1 (en) 2016-05-10 2020-08-18 Google Llc Method and apparatus for server-side content delivery network switching
US10750216B1 (en) 2016-05-10 2020-08-18 Google Llc Method and apparatus for providing peer-to-peer content delivery
US10771824B1 (en) 2016-05-10 2020-09-08 Google Llc System for managing video playback using a server generated manifest/playlist
US11545185B1 (en) 2016-05-10 2023-01-03 Google Llc Method and apparatus for frame accurate high resolution video editing in cloud using live video streams
US10595054B2 (en) 2016-05-10 2020-03-17 Google Llc Method and apparatus for a virtual online video channel
US11785268B1 (en) 2016-05-10 2023-10-10 Google Llc System for managing video playback using a server generated manifest/playlist
US11877017B2 (en) 2016-05-10 2024-01-16 Google Llc System for measuring video playback events using a server generated manifest/playlist
US11683540B2 (en) 2016-05-16 2023-06-20 Google Llc Method and apparatus for spatial enhanced adaptive bitrate live streaming for 360 degree video playback
US11032588B2 (en) 2016-05-16 2021-06-08 Google Llc Method and apparatus for spatial enhanced adaptive bitrate live streaming for 360 degree video playback
CN106095893A (en) * 2016-06-06 2016-11-09 Peking University Shenzhen Graduate School A kind of cross-media retrieval method
WO2018106663A1 (en) * 2016-12-06 2018-06-14 Ebay Inc. Anchored search
US20180157681A1 (en) * 2016-12-06 2018-06-07 Ebay Inc. Anchored search
US11205103B2 (en) 2016-12-09 2021-12-21 The Research Foundation for the State University Semisupervised autoencoder for sentiment analysis
US10437878B2 (en) * 2016-12-28 2019-10-08 Shutterstock, Inc. Identification of a salient portion of an image
US10503775B1 (en) * 2016-12-28 2019-12-10 Shutterstock, Inc. Composition aware image querying
US11042586B2 (en) 2016-12-29 2021-06-22 Shutterstock, Inc. Clustering search results based on image composition
WO2018125932A1 (en) * 2016-12-29 2018-07-05 Shutterstock, Inc. Clustering search results based on image composition
US20180322195A1 (en) * 2017-05-04 2018-11-08 Buzzmuisq Inc. Method for recommending musing in playlist and apparatus using the same
US20190197134A1 (en) * 2017-12-22 2019-06-27 Oracle International Corporation Computerized geo-referencing for images
US10896218B2 (en) * 2017-12-22 2021-01-19 Oracle International Corporation Computerized geo-referencing for images
CN110209663A (en) * 2018-02-14 2019-09-06 Alibaba Group Holding Ltd. The method, apparatus and storage medium that search range determines
US10949467B2 (en) * 2018-03-01 2021-03-16 Huawei Technologies Canada Co., Ltd. Random draw forest index structure for searching large scale unstructured data
US11106708B2 (en) * 2018-03-01 2021-08-31 Huawei Technologies Canada Co., Ltd. Layered locality sensitive hashing (LSH) partition indexing for big data applications
US20210200768A1 (en) * 2018-09-11 2021-07-01 Intuit Inc. Responding to similarity queries using vector dimensionality reduction
CN111767419A (en) * 2019-05-22 2020-10-13 Beijing Jingdong Shangke Information Technology Co., Ltd. Picture searching method, device, equipment and computer readable storage medium
WO2020243437A1 (en) * 2019-05-31 2020-12-03 Q2 Software, Inc. System and method for information retrieval for noisy data
US11640417B2 (en) * 2019-05-31 2023-05-02 Q2 Software, Inc. System and method for information retrieval for noisy data
US20220114201A1 (en) * 2019-05-31 2022-04-14 Q2 Software, Inc. System and method for information retrieval for noisy data
US11226998B2 (en) 2019-05-31 2022-01-18 Q2 Software, Inc. System and method for information retrieval for noisy data
CN110377778A (en) * 2019-07-11 2019-10-25 Beijing ByteDance Network Technology Co., Ltd. Figure sort method, device and electronic equipment based on title figure correlation
US20210026821A1 (en) * 2019-07-26 2021-01-28 Introhive Services Inc. Data cleansing system and method
US11675753B2 (en) * 2019-07-26 2023-06-13 Introhive Services Inc. Data cleansing system and method
CN110765356A (en) * 2019-10-23 2020-02-07 Shaoxing Keqiao Zhejiang University of Technology Innovation Research Institute Development Co., Ltd. Industrial design man-machine data query system for retrieving and sorting according to user habits
WO2021121129A1 (en) * 2020-06-30 2021-06-24 Ping An Technology (Shenzhen) Co., Ltd. Method and apparatus for similar case detection, device, and storage medium
US10877948B1 (en) * 2020-07-01 2020-12-29 Tamr, Inc. Method and computer program product for geospatial binning
US11321486B2 (en) * 2020-07-21 2022-05-03 Alipay (Hangzhou) Information Technology Co., Ltd. Method, apparatus, device, and readable medium for identifying private data
US11797531B2 (en) * 2020-08-04 2023-10-24 Micron Technology, Inc. Acceleration of data queries in memory
CN113239222A (en) * 2021-01-19 2021-08-10 Jiamusi University Image retrieval method based on image information extraction and EMD distance improvement
US11500887B2 (en) 2021-04-09 2022-11-15 Intel Corporation Technologies for tuning performance and/or accuracy of similarity search using stochastic associative memories
EP4071627A1 (en) * 2021-04-09 2022-10-12 INTEL Corporation Technologies for tuning performance and/or accuracy of similarity search using stochastic associative memories
US20220414108A1 (en) * 2021-06-29 2022-12-29 United States Of America As Represented By The Secretary Of The Army Classification engineering using regional locality-sensitive hashing (lsh) searches
US11886445B2 (en) * 2021-06-29 2024-01-30 United States Of America As Represented By The Secretary Of The Army Classification engineering using regional locality-sensitive hashing (LSH) searches

Also Published As

Publication number Publication date
US7966327B2 (en) 2011-06-21

Similar Documents

Publication Publication Date Title
US7966327B2 (en) Similarity search system with compact data structures
CN111198959B (en) Two-stage image retrieval method based on convolutional neural network
Chen et al. A region-based fuzzy feature matching approach to content-based image retrieval
Liu et al. A survey of content-based image retrieval with high-level semantics
Liu et al. An investigation of practical approximate nearest neighbor algorithms
US8908997B2 (en) Methods and apparatus for automated true object-based image analysis and retrieval
Deselaers et al. Features for image retrieval: an experimental comparison
Patel et al. Content based video retrieval systems
US6181817B1 (en) Method and system for comparing data objects using joint histograms
Zachary et al. Content based image retrieval systems
US7577684B2 (en) Fast generalized 2-Dimensional heap for Hausdorff and earth mover's distance
Stan et al. eID: A system for exploration of image databases
Ahmad et al. Indexing and retrieval of images by spatial constraints
Lee et al. Cluster-driven refinement for content-based digital image retrieval
Huang et al. Improved AdaBoost-based image retrieval with relevance feedback via paired feature learning
Tirilly et al. A review of weighting schemes for bag of visual words image retrieval
Shabbir et al. Tetragonal Local Octa-Pattern (T-LOP) based image retrieval using genetically optimized support vector machines
Qamra et al. Scalable landmark recognition using EXTENT
Seth et al. A review on content based image retrieval
Salamah Efficient content based image retrieval
Vijayashanthi et al. Survey on recent advances in content based image retrieval techniques
Zhu et al. Using keyblock statistics to model image retrieval
Natsev et al. CAMEL: concept annotated image libraries
Tsai et al. Color-texture-based image retrieval system using Gaussian Markov random field model
Goncharov et al. Pseudometric approach to content based image retrieval and near duplicates detection

Legal Events

Date Code Title Description
AS Assignment

Owner name: THE TRUSTEES OF PRINCETON UNIVERSITY, NEW JERSEY

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:LI, KAI;LV, QIN;CHARIKAR, MOSES;SIGNING DATES FROM 20110509 TO 20110614;REEL/FRAME:026613/0024

CC Certificate of correction
AS Assignment

Owner name: ENERGY, UNITED STATES DEPARTMENT OF, DISTRICT OF C

Free format text: CONFIRMATORY LICENSE;ASSIGNOR:PRINCETON UNIVERSITY;REEL/FRAME:030584/0170

Effective date: 20121220

REMI Maintenance fee reminder mailed
LAPS Lapse for failure to pay maintenance fees

Free format text: PATENT EXPIRED FOR FAILURE TO PAY MAINTENANCE FEES (ORIGINAL EVENT CODE: EXP.)

FP Lapsed due to failure to pay maintenance fee

Effective date: 20150621

STCH Information on status: patent discontinuation

Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362