
Patents

The present invention provides a technique for rapidly and accurately determining whether two audio samples match, in a manner that is immune to various kinds of transformations, such as playback speed variation. The relationship between the two audio samples is characterized by first matching certain fingerprint objects derived from the respective samples. A set (230) of fingerprint objects (231, 232), each occurring at a particular location (242), is generated for each audio sample (210). Each location (242) is determined in dependence upon the content of the respective audio sample (210), and each fingerprint object (232) characterizes one or more local features (222) at or near the respective particular location (242). A relative value is next determined for each pair of matched fingerprint objects. A histogram of the relative values is then generated. If a statistically significant peak is found in the histogram, the two audio samples can be characterized as substantially matching.

Referenced by

Citing Patent | Filing date | Issue date | Original Assignee | Title
US8140331 | Jul 4, 2008 | Mar 20, 2012 | Xia Lou | Feature extraction for identification and classification of audio signals

Claims

1. A method of characterizing a relationship between a first and a second audio samples, comprising the steps of:

generating a first set of fingerprint objects for the first audio sample, each fingerprint object occurring at a respective location within the first audio sample, the respective location being determined in dependence upon the content of the first audio sample, and each fingerprint object characterising one or more features of the first audio sample at or near each respective location;

generating a second set of fingerprint objects for the second audio sample, each fingerprint object occurring at a respective location within the second audio sample, the respective location being determined in dependence upon the content of the second audio sample, and each fingerprint object characterising one or more features of the second audio sample at or near each respective location;

pairing fingerprint objects by matching a first fingerprint object from the first audio sample with a second fingerprint object from the second audio sample that is substantially similar to the first fingerprint object;

generating, based on the pairing step, a list of pairs of matched fingerprint objects;
determining a relative value for each pair of matched fingerprint objects;
generating a histogram of the relative values; and
searching for a statistically significant peak in the histogram, the peak characterizing the relationship between the first and second audio samples.
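
The steps of claim 1 can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: fingerprints are modelled as hypothetical (key, variant) tuples, "substantially similar" is reduced to equal keys, the relative value is taken as the ratio of the first sample's variant to the second's, and "statistically significant" is approximated by a simple minimum bin count. The claim itself fixes none of these choices.

```python
import math
from collections import Counter, defaultdict

def pair_fingerprints(first, second):
    """Pair fingerprints from two samples whose keys are equal.

    Each fingerprint is a hypothetical (key, variant) tuple.
    Returns a list of (variant_first, variant_second) pairs.
    """
    by_key = defaultdict(list)
    for key, variant in second:
        by_key[key].append(variant)
    return [(v1, v2) for key, v1 in first for v2 in by_key.get(key, [])]

def find_peak(pairs, bin_width=0.02, min_count=3):
    """Histogram log(v1 / v2) over all pairs.

    Returns (ratio_at_bin_centre, count) for the tallest bin, or None
    if there are no pairs or the tallest bin holds fewer than min_count
    pairs (an assumed stand-in for a significance test).
    """
    bins = Counter(round(math.log(v1 / v2) / bin_width) for v1, v2 in pairs)
    if not bins:
        return None
    best_bin, count = bins.most_common(1)[0]
    if count < min_count:
        return None
    return math.exp(best_bin * bin_width), count

# Three fingerprints agree on a ratio of about 1.2; one is a spurious match.
first = [("a", 440.0), ("b", 660.0), ("c", 880.0), ("d", 300.0)]
second = [("a", 440.0 / 1.2), ("b", 660.0 / 1.2), ("c", 880.0 / 1.2), ("d", 900.0)]
peak = find_peak(pair_fingerprints(first, second))
```

Spurious pairs scatter across the histogram, while pairs from genuinely matching audio pile up in one bin, which is why a single tall bin is treated as evidence of a match.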

2. The method according to claim 1 in which the relationship between the first and second audio samples is characterized as substantially matching if a statistically significant peak is found.

3. The method according to claim 1 or 2, further comprising the step of estimating a global relative value with a location of the peak on an axis of the histogram, the global relative value further characterizing the relationship between the first and second audio samples.

4. The method according to claim 3, further comprising the step of determining a hyperfine estimate of the global relative value, wherein the step of determining comprises:

selecting a neighbourhood around the peak, and

calculating an average of the relative values in the neighbourhood.
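
As a rough sketch of the hyperfine estimate in claim 4, the function below averages the relative values that fall within a neighbourhood of the histogram peak. The function name, the symmetric neighbourhood, and the `half_width` parameter are assumptions for illustration; the claim does not say how the neighbourhood is chosen.

```python
def hyperfine_estimate(relative_values, peak, half_width):
    """Average the relative values lying within half_width of the peak.

    Falls back to the coarse peak location if the neighbourhood is empty.
    """
    neighbours = [v for v in relative_values if abs(v - peak) <= half_width]
    if not neighbours:
        return peak
    return sum(neighbours) / len(neighbours)

# Values clustered near 1.2, plus two outliers that the neighbourhood excludes.
values = [1.19, 1.20, 1.21, 0.5, 2.0]
refined = hyperfine_estimate(values, peak=1.2, half_width=0.05)
```

Averaging inside the neighbourhood gives a finer estimate than the bin centre alone, since the result is no longer limited by the bin width.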

5. The method according to claim 1 in which each fingerprint object has an invariant component, and the first and second fingerprint objects in each pair of matched fingerprint objects have matching invariant components.

6. The method according to claim 5 in which the invariant component is generated using at least one of:

(i) a ratio between a first and a second frequency values, each frequency value being respectively determined from a first and a second local features near the respective location of each fingerprint object;

(ii) a product between a frequency value and a delta time value, the frequency value being determined from a first local feature, and the delta time value being determined between the first local feature and a second local feature near the respective location of each fingerprint object; and

(iii) a ratio between a first and a second delta time values, the first delta time value being determined from a first and a second local features, the second delta time value being determined from the first and a third local features, each local feature being near the respective location of each fingerprint object.
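
The three quantities listed in claim 6 can be checked numerically. In the sketch below, local features are modelled as hypothetical (time, frequency) tuples, and a linear stretch by a factor `s` is assumed to scale every frequency by `s` and every time by `1/s`. Under that assumption each of the three quantities is unchanged, which is what makes them usable as invariant components.

```python
import math

def frequency_ratio(p1, p2):
    """(i) Ratio of the frequencies of two local features."""
    return p1[1] / p2[1]

def frequency_time_product(p1, p2):
    """(ii) Frequency of the first feature times the time gap to the second."""
    return p1[1] * (p2[0] - p1[0])

def delta_time_ratio(p1, p2, p3):
    """(iii) Ratio of the time gaps first-to-second and first-to-third."""
    return (p2[0] - p1[0]) / (p3[0] - p1[0])

def stretch(peak, s):
    """Assumed model of linear stretch: times shrink by s, frequencies grow by s."""
    t, f = peak
    return (t / s, f * s)

peaks = [(1.0, 440.0), (1.5, 660.0), (2.25, 550.0)]
fast = [stretch(p, 1.25) for p in peaks]

original = (
    frequency_ratio(peaks[0], peaks[1]),
    frequency_time_product(peaks[0], peaks[1]),
    delta_time_ratio(*peaks),
)
stretched = (
    frequency_ratio(fast[0], fast[1]),
    frequency_time_product(fast[0], fast[1]),
    delta_time_ratio(*fast),
)
```

In practice such values would likely need to be quantized before being compared for equality; that step is not shown here and is not specified in the claim.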

7. The method according to claim 6 in which each local feature is a spectrogram peak and each frequency value is determined from a frequency coordinate of a corresponding spectrogram peak.

8. The method according to claim 1 or 5 in which each fingerprint object has a variant component, and the relative value of each pair of matched fingerprint objects is determined using respective variant components of the first and second fingerprint objects.

9. The method according to claim 8 in which the variant component is a frequency value determined from a local feature near the respective location of each fingerprint object, such that the relative value of a pair of matched fingerprint objects is characterized as a ratio of respective frequency values of the first and second fingerprint objects, and the peak in the histogram characterizing the relationship between the first and second audio samples is characterized as a relative pitch, or, in case of linear stretch, a relative playback speed.

10. The method according to claim 9, wherein the ratio of respective frequency values is characterized as being either a division or a difference of logarithms.
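
The two forms of the ratio named in claim 10 carry the same information, as the short sketch below shows. The function names and the first-over-second ordering are assumptions for illustration.

```python
import math

def relative_pitch(f_first, f_second):
    """Relative value as a plain division of the two frequency values."""
    return f_first / f_second

def relative_pitch_log(f_first, f_second):
    """Relative value as a difference of logarithms."""
    return math.log(f_first) - math.log(f_second)

ratio = relative_pitch(528.0, 440.0)
log_ratio = relative_pitch_log(528.0, 440.0)
```

One practical reason to prefer the logarithmic form is that equal-width histogram bins then correspond to equal multiplicative steps in the ratio.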

11. The method according to claim 9, in which each local feature is a spectrogram peak and each frequency value is determined from a frequency coordinate of a corresponding spectrogram peak.

12. The method according to claim 8, in which the variant component is a delta time value determined from a first and a second local features near the respective location of each fingerprint object, such that the relative value of a pair of matched fingerprint objects is characterized as a ratio of respective variant delta time values, and the peak in the histogram characterizing the relationship between the first and second audio samples is characterized as a relative playback speed, or, in case of linear stretch, a relative pitch.

13. The method according to claim 12, wherein the ratio of respective variant delta time values is characterized as being either a division or a difference of logarithms.

14. The method according to claim 12, in which each local feature is a spectrogram peak and each frequency value is determined from a frequency coordinate of a corresponding spectrogram peak.

15. The method according to claim 8, further comprising the steps of:

determining a relative pitch for the first and second audio samples using the respective variant components, wherein each variant component is a frequency value determined from a local feature near the respective location of each fingerprint object;

determining a relative playback speed for the first and second audio samples using the respective variant components, wherein each variant component is a delta time value determined from a first and a second local features near the respective location of each fingerprint object; and

detecting if the relative pitch and a reciprocal of the relative playback speed are substantially different, in which case the relationship between the first and second audio samples is characterized as nonlinear.
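
The check in claim 15 can be sketched as a single comparison. The sketch assumes both ratios are taken in the same direction (first sample over second), so that a purely linear stretch gives a relative pitch equal to the reciprocal of the relative playback speed. The function name and the 2% tolerance standing in for "substantially different" are assumptions, not values from the claim.

```python
import math

def is_nonlinear(relative_pitch, relative_speed, rel_tol=0.02):
    """Return True if pitch and 1/speed differ by more than rel_tol.

    relative_pitch: ratio of frequency values (first / second).
    relative_speed: ratio of delta time values (first / second).
    """
    return not math.isclose(relative_pitch, 1.0 / relative_speed, rel_tol=rel_tol)

# Linear stretch: pitch rises by 1.25 while time intervals shrink to 0.8.
linear_case = is_nonlinear(1.25, 0.8)

# Tempo changed without a pitch change: the two estimates disagree.
nonlinear_case = is_nonlinear(1.0, 0.8)
```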

16. The method according to claim 1, wherein R is a relative playback speed value determined from the peak of the histogram of the relative values, further comprising the steps of:

for each pair of matched fingerprint objects in the list, determining a compensated relative time offset value, t−R*t′, where t and t′ are locations in time with respect to the first and second fingerprint objects;

generating a second histogram of the compensated relative time offset values; and

searching for a statistically significant peak in the second histogram of the compensated relative time offset values, the peak further characterizing the relationship between the first and second audio samples.
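
A minimal sketch of the second histogram in claim 16 follows. It takes R as already known and bins the compensated offsets t − R*t′ for each matched pair. The function name, the bin width, and the use of the tallest bin as the peak are assumptions for illustration.

```python
from collections import Counter

def offset_peak(time_pairs, R, bin_width=0.1):
    """Histogram the compensated offsets t - R * t_prime.

    time_pairs: iterable of (t, t_prime) locations of matched fingerprints.
    Returns (offset_at_bin_centre, count) for the tallest bin, or None
    if there are no pairs.
    """
    bins = Counter(round((t - R * tp) / bin_width) for t, tp in time_pairs)
    if not bins:
        return None
    best_bin, count = bins.most_common(1)[0]
    return best_bin * bin_width, count

# Five pairs satisfy t = 0.5 * t_prime + 10; one spurious pair does not.
R = 0.5
pairs = [(R * tp + 10.0, tp) for tp in (0.0, 1.0, 2.0, 3.0, 4.0)]
pairs.append((3.0, 7.0))
result = offset_peak(pairs, R)
```

Without the factor R the offsets of a speed-shifted recording would drift over time and smear across many bins; compensating by R collapses them into one bin.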

17. A computer program product for performing a method according to any preceding claim.

18. A computer system for performing a method according to any one of claims 1 to 16, the computer system comprising a client for sending information necessary for the characterization of the relationship between the first and second audio samples to a server that performs the characterization.