WO2009103221A1 - Effective relating theme model data processing method and system thereof - Google Patents


Publication number
WO2009103221A1
Authority
WO
WIPO (PCT)
Prior art keywords
node
document
computing
subset
model
Application number
PCT/CN2009/000174
Other languages
French (fr)
Chinese (zh)
Inventor
李文波
孙乐
Original Assignee
中国科学院软件研究所
Application filed by 中国科学院软件研究所 filed Critical 中国科学院软件研究所
Publication of WO2009103221A1 publication Critical patent/WO2009103221A1/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5061 Partitioning or combining of resources
    • G06F 9/5066 Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2209/00 Indexing scheme relating to G06F 9/00
    • G06F 2209/50 Indexing scheme relating to G06F 9/50
    • G06F 2209/5017 Task decomposition

Definitions

  • The invention relates to a text representation method and a system thereof, and in particular to an efficient data processing method and system based on implicit-topic text representation, belonging to the field of computer information retrieval. Background Art
  • Computer information retrieval is one of the important infrastructures of the information society.
  • The services provided range from basic network information search, through information filtering and classification, to various advanced forms of data mining.
  • In computer information retrieval, the representation of text is a problem of fundamental importance:
  • first, the objects processed in computer information retrieval are mainly text; other types of information generally also depend on text or on attached text for their existence;
  • second, a text representation method is a prerequisite for any computer information retrieval service, because the basic means of information retrieval is to pose queries to, and receive answers from, a search engine in natural-language text, so text must first be converted from its unstructured raw form into a structured form the computer
  • can understand before it can be analyzed and processed; finally, the text representation method is closely tied to the processing algorithms of computer information retrieval, so it largely determines the design of those algorithms.
  • The Correlated Topic Model is a probabilistic text representation method based on implicit topics (reference: Blei, D., Lafferty, J. Correlated Topic Models [J]. Advances in Neural Information Processing Systems, 2006, 18: 147-154.), and because its output can easily be embedded into vector-space and language models, it adapts widely to the analysis and processing algorithms of computer information retrieval.
  • The main function of this method is, by statistically analyzing a certain amount of text, not only to uncover the topics discussed in the text collection and the distribution of each topic in each text, but also, importantly, to measure the degree of association between those topics. This frees text information processing from earlier low-level approaches that relied entirely on vocabulary, allowing it to operate at the higher level of topics.
  • Although the correlated topic model functionally provides an ideal means of high-level text representation, it is currently limited mainly to small amounts of data, and the root cause is a serious bottleneck in its solution method. Its classic implementation is based on conventional serial computation: each step of the computing task must be performed in strict sequence, the result of one step being the start of the next. At any point in time the entire computation therefore runs on a single hardware computing unit, so even placing it on a high-performance computer with multiple hardware computing units (multi-core, multi-processor) cannot speed up the task.
  • The object of the present invention is to provide an efficient correlated topic model data processing method and system that can fully exploit the multi-processor/multi-core parallel architecture of a single machine and the massive parallelism of a computer cluster, thereby achieving high-speed processing of large-scale document collections, that is, pushing the correlated-topic-model text representation method toward practical use.
  • On each node computer, a computing service with a corresponding number of worker threads is automatically generated according to the hardware concurrency of the node;
  • the task document corpus is divided into equal computing-node document subsets and assigned one by one to the corresponding compute nodes;
  • execution of the task (denote the current round as the i-th iteration and use k for the number of a compute node): 2.1.
  • on each compute node, the node document subset is divided into several work blocks; the worker threads perform local parallel computation, first obtaining the processing result D(k,i) of the node document subset for this iteration, i.e. the topic distribution of each document in the subset, and then using these topic distributions to obtain
  • the model statistics for the node document subset; 2.2.
  • on each compute node, the processing result D(k,i) of its node document subset, the model statistics, and the document computation time are transmitted to the master node;
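The per-node iteration in steps 2.1 and 2.2 can be sketched as follows. This is an illustrative Python sketch, not the patent's implementation: `infer_topics` is a hypothetical stand-in for the actual correlated-topic-model inference on a single document, and the "model statistic" is simplified to the per-topic sum of the document topic distributions.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def infer_topics(doc, model):
    # Hypothetical stand-in for CTM inference on one document: returns a
    # uniform distribution over the model's topics.
    k = model["num_topics"]
    return [1.0 / k] * k

def run_iteration(node_docs, model, n_threads, block_size=100):
    """One iteration on a compute node (steps 2.1-2.2): split the node's
    document subset into work blocks, process the blocks with a pool of
    worker threads, and return the per-document topic distributions D(k, i),
    a local model statistic, and the document computation time."""
    start = time.time()
    blocks = [node_docs[i:i + block_size]
              for i in range(0, len(node_docs), block_size)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        per_block = pool.map(
            lambda block: [infer_topics(d, model) for d in block], blocks)
        topic_dists = [dist for block in per_block for dist in block]
    # Simplified model statistic for this subset: per-topic sums of the
    # document topic distributions.
    stats = [sum(col) for col in zip(*topic_dists)]
    return topic_dists, stats, time.time() - start
```

The three returned values correspond to what step 2.2 transmits to the master node: the processing result, the model statistics, and the computation time used for load balancing.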
  • The invention involves the following key elements:
  • The present invention adopts a hierarchical high-performance solving architecture: distributed computing across the cluster, parallel computing within each machine.
  • The cluster level consists of two basic components: one master node and several compute nodes.
  • There is only one master node; it can be an ordinary PC and is mainly responsible for interface interaction, data distribution, result aggregation, model parameter estimation, and related functions.
  • There can be multiple compute nodes (in principle without limit on their number), and different types of computers can be used.
  • The compute nodes carry the main computational workload of the solving task.
  • The master node and the compute nodes are connected over a network; data need only be transmitted directly between the master node and a compute node, and there is no communication among the compute nodes.
  • The node level uses in-machine parallel computing, i.e. computation across threads. Different compute nodes have different degrees of parallelism: a high-performance server with multiple processors can effectively support a number of parallel threads proportional to its processor count, a dual-core workstation can effectively support two-thread parallel computing, while a single-core PC generally supports only single-threaded computing.
  • The present invention adopts hierarchical load-balancing techniques: adaptive allocation of the working set at the cluster level and automatic allocation of the working set at the node level. This differs from the single load-balancing mode used in typical high-performance computing tasks.
  • The adaptive allocation method at the cluster level is as follows. Because the computing power of the compute nodes is not uniform, the master node evaluates each compute node after each iteration and adjusts the strategy in time, so that the working set is distributed in proportion to the nodes' computing power; each node then finishes in approximately the same time, idle waiting on some nodes is avoided, and the computational efficiency of the whole cluster is maximized.
  • Let T(i) denote the computation time used by the i-th compute node,
  • let S denote the size of the working-set corpus, and let S(i) denote the size of the working set processed by the i-th compute node (i.e. the number of documents it processes);
  • the corresponding numbers of documents are then extracted in sequence from the whole collection according to each node's allotted document share.
  • The automatic allocation method at the compute-node level is as follows. Since the worker threads on one node have equal computing power, each thread automatically requests an approximately equal amount of work blocks, so that all threads finish in approximately the same time; idle threads are avoided and the computing power of the whole compute node is fully used.
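The cluster-level reallocation can be illustrated with a short sketch (an assumed reading of the scheme above, not the patent's exact formula): each node's throughput is estimated as documents processed per unit time, and the next round's shares are made proportional to throughput so that all nodes finish at roughly the same time.

```python
def rebalance(sizes, times, total_docs):
    """Adaptive working-set reallocation at the cluster level: sizes[i] is
    the number of documents node i processed this round (S(i)), times[i] the
    time it took (T(i)). Next-round shares are proportional to the observed
    throughput S(i)/T(i); rounding leftovers go to the last node."""
    throughput = [s / t for s, t in zip(sizes, times)]
    total = sum(throughput)
    shares = [int(total_docs * tp / total) for tp in throughput]
    shares[-1] += total_docs - sum(shares)  # hand the rounding remainder to one node
    return shares
```

Under this scheme, a node that took three times as long as a peer on an equal share receives roughly a third as many documents in the next round.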
  • Documents in memory are stored in a distributed fashion, and the addresses of the documents are stored together in a contiguous index array.
  • The key to using this index method to improve concurrent access is: first, set the size of a work block (100 documents by default);
  • the top pointer initially points at the first element of the array;
  • a thread accesses the corresponding documents through the addresses in its work block and processes them; at this point all threads run fully in parallel.
  • This method only requires a thread to take an exclusive lock on a single integer (the top pointer of the index array); there is no mutual exclusion on the index itself, and no lock is needed when scanning the document set itself. This yields maximum concurrency efficiency and avoids the overhead of locking while scanning large data structures.
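A minimal sketch of this dispatch scheme (illustrative Python; the patent does not specify an implementation language): the only mutual exclusion is a lock around one integer, the top pointer into the contiguous index array, and each thread pulls its next block of documents by advancing that pointer.

```python
import threading

class BlockDispatcher:
    """Hands out work blocks over a contiguous index array of document
    addresses; the lock protects only the integer top pointer, never the
    index or the documents themselves."""
    def __init__(self, index, block_size=100):
        self.index = index
        self.block_size = block_size
        self.top = 0                      # initially at the first element
        self.lock = threading.Lock()

    def next_block(self):
        with self.lock:                   # exclusive access to one integer
            if self.top >= len(self.index):
                return None               # working set exhausted
            start, self.top = self.top, self.top + self.block_size
        return self.index[start:start + self.block_size]

def worker(dispatcher, processed):
    # Each thread repeatedly pulls a block and processes its documents.
    while (block := dispatcher.next_block()) is not None:
        processed.extend(block)
```

Because the critical section covers only the pointer bump, threads spend virtually all their time processing documents in parallel, which is the concurrency property the text claims.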
  • The present invention adopts a hierarchical working-set delivery mode: a "push" delivery mode for the cluster working set and a "pull" delivery mode for the working sets of a node's concurrent threads.
  • The working set is divided at two levels. First, at the cluster level it is divided into compute-node document subsets; this task is completed by the master node.
  • The master node divides the working set according to the computing power of each compute node and copies each part to the corresponding compute node; this is the "push" delivery mode. On a compute node, each worker thread actively requests a work block from the node's working subset for computation; this is the "pull" delivery mode.
  • A model-statistic summarization technique is used in the estimation of the correlated topic model.
  • The correlated topic model is mainly defined by three parameter matrices: the topic mean parameter matrix μ_p, the topic covariance parameter matrix C_p, and the topic word-distribution (feature-distribution)
  • parameter matrix β_p. The key step of model estimation is to compute model statistics over the documents (corresponding to three statistic matrices: the topic mean statistic matrix μ_s, the topic covariance statistic matrix C_s, and the topic word-distribution (feature-distribution) statistic
  • matrix W_s); the model parameters are then computed from the model statistics, and the process iterates until convergence.
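The summation structure can be sketched as follows. The matrix names follow the text, but the parameter-update formulas below are simplified illustrations, not the actual CTM variational updates:

```python
import numpy as np

def estimate_parameters(node_stats, n_docs):
    """Master-side step: each compute node k reports its local statistic
    matrices (mu_s, C_s, W_s); the master sums them across nodes and derives
    the parameter matrices. The derivations are illustrative stand-ins for
    the real correlated-topic-model update rules."""
    mu_s = sum(s[0] for s in node_stats)           # topic mean statistic
    C_s = sum(s[1] for s in node_stats)            # topic covariance statistic
    W_s = sum(s[2] for s in node_stats)            # topic-word statistic
    mu_p = mu_s / n_docs                           # topic mean parameter
    C_p = C_s / n_docs - np.outer(mu_p, mu_p)      # topic covariance parameter
    beta_p = W_s / W_s.sum(axis=1, keepdims=True)  # per-topic word distributions
    return mu_p, C_p, beta_p
```

The point illustrated is structural: the statistics are additive across node subsets, so each node can compute them locally and the master only has to sum matrices, which is what makes the distributed decomposition possible.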
  • By mining the internal structure of the correlated-topic-model solving method, the present invention divides the entire computing task into subtasks of different scales; each subtask executes independently and only needs to handle its own related data, so the storage pressure of the computing task as a whole is broken down to the resolution of single computing units. Implemented this way, the method can use multi-processor, multi-core single-machine high-performance hardware to provide computing power and exploit advanced architectures such as large-scale clusters for the solution, thereby increasing the computation speed and expanding the computable scale.
  • FIG. 1 is a schematic diagram of the network structure of the present invention;
  • FIG. 2 is a schematic diagram of the processing flow of the present invention; FIG. 3 is a schematic diagram of the dynamic execution structure of the present invention. Detailed Description
  • The network topology of the present invention is a computer cluster. As shown in Figure 1, it consists of two basic components: one master node and several compute nodes. There is only one master node; it can be an ordinary PC and is mainly responsible for interface interaction, data distribution, and result aggregation. There can be multiple compute nodes (in principle without limit on their number), different types of computers can be used, and the compute nodes carry the main computational workload of the solving task.
  • The master node and the compute nodes are connected over a network. Data need only be transmitted directly between the master node and a compute node; there is no communication among the compute nodes.
  • The processing flow of the present invention is illustrated in Figure 2: the vertical direction represents the sequential steps, and the horizontal direction represents the components that can run in parallel within each step.
  • The sequential steps fall into two major phases, initialization and iterative execution; the iterative execution is further divided into the execution steps of the compute nodes (comprising the two sub-steps of computation and transmission) and the execution steps of the master node (likewise comprising computation and transmission sub-steps).
  • The parallel components explicitly indicated in the figure are mainly: (1) in initialization, the two parallel components of model initialization and document-set division; (2) the parallel components computed independently by the multiple compute nodes; (3) on the master node, the independently executing parallel components of model estimation and adjustment of the working-set division.
  • The dynamic execution structure of the present invention is shown in Figure 3: it is a two-layer architecture of macroscopic distributed computing and microscopic parallel computing.
  • The macroscopic distributed computing is cross-machine: under the coordination of the master node, computing tasks are assigned to the different compute nodes. Because the computing power of the compute nodes differs, the master node must manage load balancing among them; the present invention automatically adjusts the size of each node's working set through an adaptive method, without manual intervention.
  • The microscopic parallel computing is cross-thread. Different compute nodes have different degrees of parallelism: a high-performance server with multiple processors can effectively support as many parallel threads as it has processors, and a dual-core workstation can effectively support two-thread parallel computing,
  • while a single-core PC generally supports only single-threaded computing. Different compute nodes should therefore run different numbers of threads; either too many or too few prevents the node's computing power from being fully exploited.
  • The invention automatically determines the number of supportable threads by detecting the system hardware, without manual specification.
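The patent describes probing the hardware with assembly instructions on Windows and hardware-abstraction-layer calls on Linux; a portable approximation (an assumption, not the patent's code) simply asks the operating system for the logical processor count and sizes the worker pool from it:

```python
import os

def worker_thread_count():
    """Automatically determine how many worker threads this node should run,
    based on the logical processors (cores / hyper-threads) the OS reports."""
    return os.cpu_count() or 1  # cpu_count() may return None on some platforms
```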
  • Document clustering means grouping the documents of a collection so that documents within a group have highly similar content while documents in different groups differ greatly. After such processing the collection has a reasonable grouping structure, which makes it easier to manage; more importantly, by subdividing a large document collection, the user's workload in finding a specific document is greatly reduced and the efficiency of using the documents is improved.
  • Document clustering has important applications in information retrieval. The most typical is grouping search results by topic so that users can concentrate on the web pages of the topics they care about, i.e. a large number of unrelated query results is filtered out automatically; document clustering can thus further improve the usability of general-purpose search engines.
  • The correlated topic model can be used to perform text clustering, implementing the search-engine function of grouping search results by topic.
  • The specific implementation is as follows: a) the search results of the search engine are organized into a complete document set, in which each document corresponds to the title and abstract of one search result;
  • the task document corpus is divided into equal computing-node document subsets and assigned one by one to the corresponding compute nodes;
  • each compute node divides its node document subset into several work blocks, and each worker thread performs local parallel computation to obtain the processing result D(k,i) of the node document subset for this iteration,
  • i.e. the topic distribution of each document in the subset; these topic distributions are then used to obtain the model statistics of the node document subset, and the time each node spends computing over its document subset is recorded.
  • E-mail is one of the most basic network services and an indispensable tool of work and life. While fully enjoying the convenience, immediacy, and low cost of e-mail, people in the Internet age also suffer from spam: almost everyone's mailbox is filled with a large amount of spam of unknown origin. According to statistics, 95% of e-mail is spam, which seriously pollutes the network environment and disturbs normal network communication. Spam filtering is therefore an essential function of an e-mail system. To cope with increasingly fine camouflage, besides the traditional techniques based on identity authentication and sensitive-word filtering, various filtering techniques based on intelligent analysis of e-mail content have gradually developed into the main means of dealing with spam.
  • The correlated topic model can be used to perform topic analysis of e-mail content, implementing filtering based on the subject matter of the mail. The specific implementation is:
  • The product recommendation function is very important in e-commerce: it helps customers find products of genuine interest, thereby improving the shopping experience and increasing the dealer's profits. For this reason, almost all large e-commerce systems use recommendation systems of various forms and to varying degrees.
  • The basic principle of product recommendation is: from a large amount of purchase-record data, analyze customers' buying behavior and summarize the buying patterns of customer groups; when a new customer's purchase information arrives, match it against the learned buying patterns, predict items the customer may also need, and recommend them.
  • The correlated topic model can be used to analyze customers' buying patterns from historical purchase records, thereby supporting product recommendations to new customers.
  • The specific implementation is as follows: a) all historical purchase records are organized into a collection of texts, each purchase record being treated as a "text" and each purchased item as a "word" in that text;
  • the correlated topic model of the present invention is then used to compute the customer group to which a new customer belongs, and finally a product recommendation can be made according to the buying pattern of that group.
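The preprocessing in step a) can be sketched as follows (illustrative; the item names are invented examples): each purchase record becomes one "text" whose "words" are the purchased items, so the result can be handed to the topic model like any document collection.

```python
from collections import Counter

def records_to_corpus(purchase_records):
    """Turn purchase records into bag-of-words 'documents': each record is a
    text, each purchased item a word, with counts for repeated purchases."""
    return [Counter(items) for items in purchase_records]

corpus = records_to_corpus([
    ["milk", "bread", "milk"],        # one customer's purchase record
    ["camera", "tripod", "sd-card"],  # another customer's record
])
```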

Abstract

An efficient correlated topic model data processing method and system are disclosed. In the task-initialization phase, the master node provides an initial model M0 and first synchronizes it to all compute nodes; the task set is then divided and distributed to multiple compute nodes for computation. In the task-execution phase, several rounds of data processing are performed: in each round, the worker threads on each compute node carry out local parallel computation to obtain the topic distribution and the model statistics of the node document subset, which are then transmitted to the master node to be summarized, after which the processing result is tested for convergence. The system comprises a master node and multiple compute nodes, which constitute a cluster computer system executing the parallel computation.

Description

An efficient correlated topic model data processing method and system thereof. Technical Field
The invention relates to a text representation method and a system thereof, and in particular to an efficient data processing method and system based on implicit-topic text representation, belonging to the field of computer information retrieval. Background Art
Computer information retrieval is one of the important infrastructures of the information society; the services provided range from basic network information search, through information filtering and classification, to various advanced forms of data mining. In computer information retrieval, the representation of text is a problem of fundamental importance. First, the objects processed in computer information retrieval are mainly text; other types of information generally also depend on text or on attached text for their existence. Second, a text representation method is a prerequisite for any computer information retrieval service, because the basic means of information retrieval is to pose queries to, and receive answers from, a search engine in natural-language text, so text must first be converted from its unstructured raw form into a structured form the computer can understand before it can be analyzed and processed. Finally, the text representation method is closely tied to the processing algorithms of computer information retrieval, so it largely determines the design of those algorithms.
Common text representation methods fall mainly into three classes: the vector space method (Vector Space Model) (reference: Salton, G. The SMART Retrieval System. Englewood Cliffs: Prentice-Hall, 1971.), the probabilistic method (Probability Model) (reference: Van Rijsbergen, C.J. A new theoretical framework for information retrieval. In Proceedings of SIGIR'86, pp. 194-200, 1986.), and the language model method (Language Model) (reference: Ponte, J., Croft, W.B. A Language Modeling Approach to Information Retrieval. In Proceedings of SIGIR'98, pp. 257-281, 1998.). The Correlated Topic Model is a probabilistic text representation method based on implicit topics (reference: Blei, D., Lafferty, J. Correlated Topic Models [J]. Advances in Neural Information Processing Systems, 2006, 18: 147-154.); moreover, because its output can easily be embedded into vector-space and language models, it adapts widely to the analysis and processing algorithms of computer information retrieval. The main function of the method is, by statistically analyzing a certain amount of text, not only to uncover the topics discussed in the text collection and the distribution of each topic in each text, but also, importantly, to measure the degree of association between those topics. This frees text information processing from earlier low-level processing that relied entirely on vocabulary and allows it to proceed at the higher level of topics.
Although the correlated topic model functionally provides an ideal means of high-level text representation, it is currently limited mainly to small amounts of data and is difficult to use on the large-scale data of real environments. The root cause is that its solving method suffers serious bottlenecks. First, its classic implementation is based on conventional serial computation: each step of the computing task must be carried out in strict sequence, the result of one step being the start of the next. At any point in time the whole computing task can therefore run on only one hardware computing unit, so even placing it on a high-performance computer with multiple hardware computing units (multi-core, multi-processor) cannot speed up the solution. Second, since the computing process itself cannot be split in the serial mode, the processed data must be kept together for the computing process to access at any time, which increases the storage load of the system, such as hard disk and especially memory; the effect of memory is particularly obvious, since excessive memory occupation causes the computing speed to drop sharply or even makes the system refuse to execute the computing task. Summary of the Invention
The object of the present invention is to provide an efficient correlated topic model data processing method and system that can fully exploit the multi-processor/multi-core parallel architecture of a single machine and the massive parallelism of a computer cluster, thereby achieving high-speed processing of large-scale document collections, i.e. pushing the correlated-topic-model text representation method toward practical use.
The technical solution of the present invention is as follows:
1. Task initialization
1.1. On each node computer (including the master node and the compute nodes), a computing service with a corresponding number of worker threads is automatically generated according to the node's hardware concurrency;
1.2. On the master node, an initial model M0 is produced by a random process, and M0 is copied to all compute nodes;
1.3. On the master node, the task document corpus is divided into equal computing-node document subsets, which are assigned one by one to the corresponding compute nodes;
2. Task execution (denote the current round as the i-th iteration and use k for the number of a compute node)
2.1. On each compute node, the node document subset is divided into several work blocks; the worker threads perform local parallel computation, first obtaining the processing result D(k,i) of the node document subset for this iteration, i.e. the topic distribution of each document in the subset, and then using these topic distributions to compute the model statistics for the node document subset;
2.2. On each compute node, the processing result D(k,i) of its node document subset, the model statistics, and the document computation time are transmitted to the master node;
2.3. On the master node, the document computation times are used to judge whether the division of the compute-node document subsets is balanced; if necessary, the division is readjusted and the subsets are reassigned to the corresponding compute nodes;
2.4. On the master node, the model statistics of all compute-node document subsets are first summarized, and the model M for this iteration is then estimated (i.e. model parameter estimation is performed to solve the correlated topic model). If the model has not converged, M is copied to all compute nodes for the next round of computation and model iteration; otherwise the data processing ends, at which point each compute node holds its final data processing result D(k, last). These are summarized into the final data processing result D_last of the whole document corpus, i.e. the topic distribution of every document in the corpus; the final converged model M_last is also obtained.
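The overall control flow of steps 1 and 2 can be condensed into a sketch (illustrative Python with toy stand-ins: `infer` plays the role of a compute node's step 2.1, `estimate` the master's step 2.4, and the "model" is reduced to a single number so convergence can be tested with a scalar difference):

```python
def solve(node_subsets, model0, infer, estimate, tol=1e-6, max_iter=100):
    """Iterate until the model converges: every round, each node subset is
    processed with the current model, the per-subset statistics are summed
    on the master, and a new model is estimated from the summary."""
    model = model0
    results = []
    for _ in range(max_iter):
        results = [infer(subset, model) for subset in node_subsets]  # step 2.1 (parallel in reality)
        summary = sum(r["stat"] for r in results)                    # step 2.4: sum the statistics
        new_model = estimate(summary)
        converged = abs(new_model - model) < tol                     # convergence test
        model = new_model
        if converged:
            break
    # M_last is the final model; D_last is the union of per-node results.
    return model, [r["dist"] for r in results]
```

As a toy usage, estimating the mean of a split data set has exactly this structure: each "node" reports the sum over its subset, and the master divides by the total document count.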
The invention involves the following key elements:
1) The present invention adopts a hierarchical high-performance solving architecture: distributed computing across the cluster and parallel computing within each machine. The cluster level consists of two basic components: one master node and several compute nodes. There is only one master node; it can be an ordinary PC and is mainly responsible for interface interaction, data distribution, result aggregation, model parameter estimation, and related functions. There can be multiple compute nodes (in principle without limit on their number) and different types of computers can be used; the compute nodes carry the main computational workload of the solving task. The master node and the compute nodes are connected over a network; data need only be transmitted directly between the master node and a compute node, and there is no communication among the compute nodes. The node level uses in-machine parallel computing, i.e. computation across threads; different compute nodes have different degrees of parallelism: a high-performance server with multiple processors can effectively support a number of parallel threads proportional to its processor count, a dual-core workstation can effectively support two-thread parallel computing, while a single-core PC generally supports only single-threaded computing.
2) Autonomous determination of the number of concurrent threads per node: on every node (both the master node and the compute nodes), the number of effective threads is determined automatically from the number of processors in that node's computer and the number of cores (or supported hyper-threads) per processor. On the Windows platform the processor information of the hardware system is obtained directly via assembly instructions; on the Linux platform it is obtained through function calls to the hardware abstraction layer (HAL). In a cluster distributed computing environment this avoids the tedium of manually configuring the number of worker threads on each node.
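The platform-specific probes described above (CPUID assembly on Windows, HAL calls on Linux) can be approximated portably; a minimal sketch using Python's standard library — not the patent's own mechanism — might look like:

```python
import os

def effective_thread_count() -> int:
    """Return the number of worker threads to spawn on this node.

    The patent queries CPUID (Windows) or the HAL (Linux) directly;
    this sketch relies on the portable stdlib call instead, which
    reports the number of logical processors (cores x hyper-threads).
    """
    n = os.cpu_count()  # may return None on exotic platforms
    return n if n and n > 0 else 1
```

Each node would then create a computing service with this many worker threads, with no manual configuration.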
3) The invention adopts hierarchical load-balancing techniques: adaptive allocation of the working set at the cluster level and automatic allocation of the working set at the compute-node level. This differs from the single load-balancing mode used in typical high-performance computing tasks. The cluster-level adaptive allocation works as follows: because the computing power of the compute nodes is not uniform, the master node evaluates each compute node after every iteration and adjusts the allocation in time, distributing the working set in proportion to each node's computing power so that all compute nodes finish at approximately the same time. This avoids having some nodes sit idle waiting, and thus maximizes the computational efficiency of the whole cluster.
The concrete method for evaluating and adjusting the compute-node working sets is as follows.

Evaluation method:

First, collect the computation times of all compute nodes into a list Time.

Second, find the longest computation time Max(Time) and the shortest computation time Min(Time), and compute the time difference TimeSpan = Max(Time) - Min(Time).

Third, compare TimeSpan against a predetermined threshold Threshold (5 seconds by default). If TimeSpan > Threshold, the partitioning of the working set must be adjusted; otherwise the previous partitioning is kept.

Adjustment method:

Let Time(i) denote the computation time used by the i-th compute node, let Size denote the size of the full working set, and let Size(i) denote the size of the working set handled by the i-th compute node (i.e., the number of documents it processes). Then:
First, compute each node's document processing speed:

    speed(i) = Size(i) / Time(i)

Second, compute each node's document allocation proportion:

    proportion(i) = speed(i) / Σ_j speed(j)

Third, compute each node's document allocation share:

    share(i) = proportion(i) × Size
Fourth, according to each node's document allocation share, take the corresponding number of documents from the full collection in turn and distribute them.
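The four evaluation and adjustment steps above can be sketched as follows. The function name and the integer-rounding rule for shares (remainder given to the last node) are illustrative assumptions; the 5-second default threshold is the one stated in the text.

```python
def rebalance(sizes, times, total_docs, threshold=5.0):
    """Recompute per-node document shares from the last iteration.

    sizes[i] -- documents node i processed last iteration (Size(i))
    times[i] -- seconds node i spent on them (Time(i))
    Returns the new shares, or None if the split is already balanced.
    """
    span = max(times) - min(times)        # TimeSpan = Max(Time) - Min(Time)
    if span <= threshold:                 # keep the previous partitioning
        return None
    speeds = [s / t for s, t in zip(sizes, times)]    # speed(i)
    total_speed = sum(speeds)
    props = [v / total_speed for v in speeds]         # proportion(i)
    shares = [int(p * total_docs) for p in props]     # share(i), truncated
    shares[-1] += total_docs - sum(shares)            # hand remainder to last node
    return shares
```

For example, two nodes that each processed 100 documents in 10 s and 2 s respectively have speeds 10 and 50 docs/s, so a 200-document corpus is re-split roughly 1:5.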
The automatic allocation of the working set at the compute-node level works as follows: since the worker threads on a single node have identical computing power, each thread automatically claims work blocks of approximately equal size, so that all threads finish at approximately the same time. This avoids idle threads and maximizes the computational efficiency of the whole compute node.
4) High-concurrency access to the working set at the compute-node level: once a compute node's working set (i.e., the received document subset) is loaded into memory, the concurrent threads use an index structure to divide the text objects among themselves. After the division, all threads access the working set simultaneously during computation without locking it, so multiple worker threads achieve full parallelism while executing the computing task. The indexing method is described in detail below:
In memory the documents are stored in scattered locations, and their addresses are gathered into one contiguous index array. The keys to improving concurrent access with this indexing method are:

First, set the size of a work block (100 documents by default).

Second, set up a top pointer into the index array and protect it with a lock (a critical-section mutex); the pointer initially points at the first element of the array.

Third, under the protection of the lock, each thread accesses the top pointer of the index array in mutual exclusion to obtain the addresses of the documents it will process (i.e., a contiguous segment of the index array).

Fourth, each thread accesses the corresponding documents through the addresses in its work block and processes them; at this point all threads run fully in parallel.

Therefore the method only requires threads to perform locked, mutually exclusive access to a single integer (the top pointer of the index array). No locked scan of the index itself, let alone of the document set itself, is needed. This yields maximal concurrency and avoids the overhead of taking locks while scanning large data structures.
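A minimal sketch of this single-lock index scheme, with Python threads standing in for the worker threads (class and function names are illustrative assumptions):

```python
import threading

BLOCK_SIZE = 100  # work-block size in documents (the patent's default)

class WorkQueue:
    """Hand out contiguous index-array segments under a single lock."""

    def __init__(self, doc_index):
        self.doc_index = doc_index      # contiguous array of document addresses
        self.top = 0                    # top pointer into the index array
        self.lock = threading.Lock()    # the only mutex in the scheme

    def claim_block(self):
        """Claim the next work block; only the pointer bump is locked."""
        with self.lock:
            start = self.top
            end = min(start + BLOCK_SIZE, len(self.doc_index))
            self.top = end
        return self.doc_index[start:end]  # empty once the subset is exhausted

def worker(queue, process):
    """Worker loop: pull blocks, then process documents lock-free."""
    while True:
        block = queue.claim_block()
        if not block:
            break
        for doc in block:   # fully parallel: no lock held here
            process(doc)
```

Each thread contends only for the brief pointer update; document processing itself runs without any synchronization.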
5) The invention adopts a hierarchical working-set delivery mode: a "push" delivery mode for the cluster working set and a "pull" delivery mode for the working sets of the concurrent threads on a node. The full working set is partitioned hierarchically. First, at the cluster level, the working set is divided into compute-node document subsets; this task is performed by the master node, which partitions the full working set according to the computing power of each compute node and copies each part to the corresponding node. This is the "push" delivery mode. On each compute node, each worker thread actively requests work blocks from the node's working subset for computation. This is the "pull" delivery mode.
6) Synchronization between the master node and the compute nodes: computation and transmission are separated. Computing tasks do not use remote data access but a local read/write mode; the transmission task is handled by an out-of-process file transfer service (FTP) or a network file system service (NFS). This improves the scalability and maintainability of the system. In addition, numerical data is transmitted in a textual representation format, which avoids the differences in binary representation caused by different hardware platforms, operating systems, and development tool chains, so the system can be developed and run in a mixed-platform environment.
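A sketch of the textual numeric format idea. The patent does not specify the exact layout, so the one-row-per-line, space-separated format below is an assumption:

```python
def matrix_to_text(matrix):
    """Serialize a numeric matrix to a plain-text format.

    A text (rather than binary) representation sidesteps endianness
    and float-layout differences across mixed hardware/OS platforms,
    at the cost of a larger payload. Layout assumed here: one row per
    line, values space-separated.
    """
    return "\n".join(" ".join(repr(v) for v in row) for row in matrix)

def text_to_matrix(text):
    """Inverse of matrix_to_text."""
    return [[float(v) for v in line.split()] for line in text.splitlines()]
```

A round trip through text recovers the original values, regardless of which platform wrote the file and which reads it.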
7) Model statistics aggregation in correlated-topic-model estimation: the correlated topic model is defined mainly by three parameter matrices, namely the topic mean parameter matrix Λ_p, the topic variance parameter matrix C_p, and the topic word-distribution (feature-distribution) parameter matrix W_p. The key step of model estimation is to compute the model statistics from the documents (three corresponding statistic matrices: the topic mean statistic matrix Λ_s, the topic variance statistic matrix C_s, and the topic word-distribution (feature-distribution) statistic matrix W_s), and then to compute the model parameters from these statistics; this process iterates until convergence.
The difference between this process in the serial and distributed modes is the following: under serial data processing all data resides on a single computer, so the model statistics are stored centrally; under distributed data processing each computer separately computes the model statistics for its own portion of the data, so the statistics must be aggregated. Concretely, for each statistic matrix

    S = Σ_k S^(k),

where S^(k) denotes the model statistic computed on compute node k, and the summation is applied to each of the three statistic matrices Λ_s, C_s, and W_s.
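The per-node summation S = Σ_k S^(k) is an element-wise sum of matrices; as a sketch (the dict-of-matrices representation and the statistic names are assumptions for illustration):

```python
def aggregate_statistics(node_stats):
    """Element-wise sum of per-node statistic matrices.

    node_stats -- list of dicts, one per compute node, each mapping a
    statistic name (e.g. "mean", "var", "word") to a matrix given as
    a list of rows. Returns one dict of summed matrices, as the master
    node would build before re-estimating the model parameters.
    """
    total = {}
    for stats in node_stats:
        for name, matrix in stats.items():
            if name not in total:
                total[name] = [row[:] for row in matrix]  # copy, don't alias
            else:
                for i, row in enumerate(matrix):
                    for j, v in enumerate(row):
                        total[name][i][j] += v
    return total
```

Because the sum is associative, the master node can fold in each node's contribution as it arrives rather than waiting for all of them.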
Positive effects of the invention:

Compared with the prior art, the invention exploits the internal structure of the correlated-topic-model solving procedure and adopts a divide-and-conquer strategy, splitting the whole computing task into subtasks of different scales. Each subtask executes independently and needs to process only its own data, so overall the storage pressure of the computing task is dissolved and the limits of a single computing unit are overcome. In implementation, the method solves the model using the computing power offered by multi-processor and multi-core single-machine high-performance hardware, together with advanced architectures such as large-scale cluster parallelism, thereby achieving the goals of increasing computation speed and enlarging the problem scale.

Description of the Drawings
Figure 1 is a schematic diagram of the network structure of the invention;

Figure 2 is a schematic flow chart of the method of the invention;

Figure 3 is a schematic diagram of the dynamic execution structure of the invention.

Detailed Description
Embodiments of the method of the invention are described below with reference to the accompanying drawings.
The network topology of the invention is a computer cluster, as shown in Figure 1. It consists of two basic components: one master node and a number of compute nodes. There is exactly one master node; it can be an ordinary PC and is mainly responsible for interface interaction, data distribution, and result aggregation. There can be many compute nodes (in principle no limit on their number), different types of computers may be used, and the compute nodes carry the main computational workload of the solving task. The master node and the compute nodes are connected by a network; data is transmitted only between the master node and the compute nodes, and there is no communication among the compute nodes.
The method flow of the invention is shown in Figure 2: the vertical direction shows the sequential steps, while the horizontal direction shows the components that can run in parallel within each step. The sequential steps fall into two major phases, initialization and iterative execution; iterative execution in turn splits into the execution steps of the compute nodes (a computation sub-step and a transmission sub-step) and the execution steps of the master node (likewise computation and transmission sub-steps). The parallel components shown explicitly in the figure are: (1) in initialization, the two parallel components of model initialization and document-collection partitioning; (2) the independent parallel computation of the multiple compute nodes; (3) on the master node, the independently executed parallel components of model estimation and working-set repartitioning. Beyond the parallel components shown explicitly in the figure, there is one more important kind of parallelism: the parallel execution threads within a single compute node, which is shown in the schematic of the dynamic execution structure in Figure 3.
The dynamic execution structure of the invention is shown in Figure 3: it is a two-level architecture with macro-level distributed computing and micro-level parallel computing. Macro-level distributed computing spans computers: under the coordination of the master node, computing tasks are assigned to different compute nodes. Because the computing power of the compute nodes differs, the master node must manage the load balance among them; the invention automatically adjusts the size of each node's working set through an adaptive method, without manual intervention. Micro-level parallel computing spans threads: different compute nodes have different degrees of parallelism; for example, a high-performance server with multiple processors can effectively support as many parallel threads as it has processors, a dual-core workstation can effectively support two-thread parallel computing, and a single-core PC generally supports only single-threaded computing. Compute nodes with different degrees of parallelism should therefore run different numbers of threads; either too many or too few prevents a node from delivering its full computing power. The invention computes the number of supportable threads by automatically detecting the system hardware, with no need for manual specification.
Applications of the invention are described below for specific application domains.
1. Document clustering

Document clustering groups the documents of a collection so that documents within the same group have highly similar content while documents in different groups differ considerably. After such processing the collection has a sensibly grouped structure, which makes it easier to manage; more importantly, subdividing a large document collection greatly reduces the effort a user needs to find a particular document and improves the efficiency of using the collection. Document clustering has important uses in information retrieval, most typically grouping search results by topic so that users can focus on the pages about the topics they care about, automatically filtering out large numbers of irrelevant results. Document clustering can therefore further improve the usability of general-purpose search engines.
The correlated topic model can be used for text clustering to give a search engine the ability to group its results by topic. A concrete embodiment is:

1) Organize the search engine's results into a document collection, where each document corresponds to the title and snippet of one search result.
2) Process this document collection with the efficient correlated-topic-model data processing method and system of the invention to obtain the topic of each text. The procedure is as follows:
1. Task initialization

1.1. On each node computer (both the master node and the compute nodes), automatically create a computing service with the appropriate number of worker threads according to the node's hardware concurrency;

1.2. On the master node, generate an initial model M_0 by a random process and copy M_0 to all compute nodes;

1.3. On the master node, divide the full task document collection into equal compute-node document subsets and assign them one by one to the corresponding compute nodes.

2. Task execution (denote the current round as the i-th iteration, and let k denote the index of a compute node)

2.1. On each compute node, divide the node's document subset into work blocks, and let the worker threads compute locally in parallel. This first yields the processing result D(k,i) of the node's document subset for this iteration, i.e., the topic distribution of every document in the subset; from these topic distributions, the model statistics for the node's document subset are then obtained. At the same time, each node records the document computation time used for its subset.

2.2. On each compute node, transmit the processing result D(k,i) of its document subset, the model statistics, and the document computation time to the master node.

2.3. On the master node, use the document computation times to evaluate the balance of the partition into compute-node document subsets; if necessary, repartition the subsets and assign them to the corresponding compute nodes.

2.4. On the master node, first aggregate the model statistics of all compute nodes, then estimate this iteration's model M (i.e., perform model parameter estimation to solve the correlated topic model). If the model has not converged, copy M to all compute nodes for the next round of computation and model iteration; otherwise terminate the data processing. At that point each compute node holds its final data processing result D(k,last); aggregating these yields the final data processing result D_last for the full document collection, i.e., the topic distribution of every document in the collection, along with the final converged model M_last.

3) From each document's topic distribution one obtains the document's dominant topic (i.e., the topic the document is most concentrated on), and the document is then assigned to the group for that topic. This yields the grouping of the search engine's results by topic.
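Assigning each document to its dominant topic from the final topic distributions is an argmax over each distribution; a minimal sketch (variable names are assumptions):

```python
def group_by_dominant_topic(topic_distributions):
    """Group document ids by their dominant (argmax) topic.

    topic_distributions -- list of per-document topic probability
    lists (the final result D_last in the text). Returns a dict
    mapping topic id to the list of document ids assigned to it.
    """
    groups = {}
    for doc_id, dist in enumerate(topic_distributions):
        dominant = max(range(len(dist)), key=lambda t: dist[t])
        groups.setdefault(dominant, []).append(doc_id)
    return groups
```

Each resulting group is one topic cluster of search results.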
2. Mail filtering

E-mail is one of the most basic network services and an indispensable tool in people's work and life. While fully enjoying the convenience, immediacy, and low cost of e-mail, people in the Internet era also suffer the nuisance of spam. Almost everyone's mailbox is flooded with spam of unknown origin; by some statistics 95% of mail is spam, which seriously pollutes the network environment and interferes with normal communication. Spam filtering is therefore an essential function of an e-mail system. Beyond the traditional techniques based on identity authentication and sensitive-word filtering, filtering techniques that intelligently analyze message content have gradually developed into the main means of dealing with finely disguised spam.

The correlated topic model can be used to perform topic analysis on the content of e-mail and thereby filter messages by topic. A concrete embodiment is:
1) Divide all existing e-mail into two opposing collections: a normal-mail collection and a spam collection.

2) Compute a correlated topic model for the normal-mail collection and for the spam collection separately with the invention, obtaining two correlated topic models.

3) For a newly received e-mail, compute its similarity to the two correlated topic models; this yields the decision of whether the message is spam.
3. Product recommendation

Product recommendation is very important in e-commerce: it helps customers discover products they are genuinely interested in, improving the customers' shopping experience while increasing the merchants' profits. Almost all large e-commerce systems therefore use recommendation systems of various forms to some degree. The basic principle of product recommendation is: from a large volume of purchase records, analyze customers' purchasing behavior and summarize the purchase patterns of customer groups; when a new customer submits purchase information, match it against the previously learned purchase patterns to predict which products the customer may also need, and recommend them to the customer.
The correlated topic model can be used to analyze customers' purchase patterns from historical purchase records, thereby supporting product recommendation for new customers. A concrete embodiment is:

1) Organize all historical purchase records into a text collection, treating each purchase record as a "text" and each purchased product as a "word" in the text.

2) Compute the correlated topic model of the invention over this text collection to discover customer groups with different purchase patterns.

3) For a new purchase record, use the correlated topic model of the invention to compute the customer group it belongs to; product recommendations can then be made according to that group's purchase pattern.

Claims

1. An efficient correlated-topic-model data processing method, whose steps are:

Initialization phase:

1) On each node computer, automatically create a computing service with the appropriate number of worker threads according to the node's hardware concurrency;

2) The master node produces an initial model and copies it to all compute nodes;

3) The master node divides the full task document collection into compute-node document subsets and assigns them to the corresponding compute nodes;

Iteration phase:

1) Each compute node processes its received node document subset, obtaining the topic distribution of every document in the subset and the model statistics of the subset;

2) Each compute node returns its results to the master node for aggregation, yielding the topic distribution of the full task document collection;

3) From the aggregated model statistics, the master node iterates the model and tests its convergence: if it has not converged, the iteration phase is repeated; otherwise the data processing ends.
2. The method of claim 1, wherein the hardware concurrency of a node computer is obtained as follows:

1) On the Windows platform the processor information of the hardware system is obtained directly via assembly instructions; on the Linux platform it is obtained through function calls to the hardware abstraction layer (HAL): first obtain the number of processors of each node computer, then obtain the number of cores in each processor;

2) Sum the core counts across all processors of the node computer to automatically determine the number of effective threads the compute node supports.
3. The method of claim 1, wherein the master node evaluates the balance of the partition into compute-node document subsets as follows:

1) Collect the computation times of all compute nodes into a list Time;

2) Find the longest node computation time Max(Time) and the shortest node computation time Min(Time), and compute the time difference TimeSpan = Max(Time) - Min(Time);

3) Compare TimeSpan against a predetermined threshold Threshold: if TimeSpan > Threshold, the partition into node document subsets must be adjusted; otherwise the previous partition is kept.
4. The method of claim 3, wherein the partition into compute-node document subsets is adjusted as follows:

1) While processing its received node document subset, each compute node records the time it takes to process the subset;

2) Each compute node transmits this processing time back to the master node;

3) The master node computes each node's document processing speed from the document computation times;

4) The master node computes each node's document allocation share from the nodes' document processing speeds;

5) According to the nodes' document allocation shares, the master node takes the corresponding numbers of documents from the full collection in turn and distributes them.
5. The method of claim 4, wherein a compute node processes data as follows:

1) Each compute node obtains its own number of processors and the number of cores per processor, thereby obtaining the number of effective threads it supports;

2) The compute node divides the received document subset into work blocks according to its number of effective threads;

3) Each worker thread of the compute node actively claims work blocks through an index structure and processes them.
6. The method of claim 5, wherein work blocks are obtained through the index structure as follows:

1) Set the size of the work blocks into which the node document subset is divided;

2) Set up a top pointer into an index array and protect it with a lock;

3) Under the protection of the lock, each thread accesses the top pointer of the index array in mutual exclusion to obtain the addresses of the documents it will process;

4) Each thread accesses the corresponding documents through the addresses in its work block and processes them.
7. An efficient correlated-topic-model data processing system, comprising a master node and a number of compute nodes, wherein:

the master node is responsible for interface interaction, data distribution, result aggregation, and model estimation;

the compute nodes carry the main computational workload of the solving task;

the master node and the compute nodes establish communication connections for data transmission.
8. The system of claim 7, wherein the master node and the compute nodes are hardware platforms with single-core processors, multi-core processors, or multiple processors.
9. 如权利要求 7所述的系统, 其特征在于所述主控节点和计算节点通过网络进行数据 传输, 所述数据的数值格式采用文本表示格式。 9. The system of claim 7 wherein said master node and compute node perform data over a network Transmission, the numerical format of the data is in a text representation format.
10. 如权利要求 7所述的系统, 其特征在于计算和传输分离, 即所述计算节点进行数据 处理时不考虑数据的远程访问而是釆用本地读写的模式, 所述计算节点和主控节点 的数据传输任务由基于进程外的文件传输服务(FTP) 或集群系统提供的网络文件 系统服务 (NFS ) 承担。  10. The system according to claim 7, characterized in that the calculation and the transmission are separated, that is, the computing node performs data processing without considering remote access of data but adopts a mode of local reading and writing, the computing node and the main The data transfer task of the control node is undertaken by an out-of-process file transfer service (FTP) or a network file system service (NFS) provided by the cluster system.
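The compute-transfer separation of claims 7 and 10 can be sketched as follows: the master node distributes document subsets and aggregates results, while computing nodes only read and write local files, with a shared directory standing in for the claim's NFS/FTP service. All names and the toy "result" (a document count) are hypothetical, not from the patent.

```python
import json
import os
import tempfile

def master_distribute(documents, num_nodes, shared_dir):
    # Claim 7: the master node performs data distribution, one subset per node.
    subsets = [documents[i::num_nodes] for i in range(num_nodes)]
    for node_id, subset in enumerate(subsets):
        with open(os.path.join(shared_dir, f"subset_{node_id}.json"), "w") as f:
            json.dump(subset, f)  # claim 9: text representation format

def compute_node(node_id, shared_dir):
    # Claim 10: the node only does local reads/writes; the shared directory
    # (standing in for NFS/FTP) is what actually moves the data.
    with open(os.path.join(shared_dir, f"subset_{node_id}.json")) as f:
        subset = json.load(f)
    result = {"node": node_id, "count": len(subset)}  # stand-in for real work
    with open(os.path.join(shared_dir, f"result_{node_id}.json"), "w") as f:
        json.dump(result, f)

def master_collect(num_nodes, shared_dir):
    # Claim 7: result aggregation back on the master node.
    total = 0
    for node_id in range(num_nodes):
        with open(os.path.join(shared_dir, f"result_{node_id}.json")) as f:
            total += json.load(f)["count"]
    return total

shared = tempfile.mkdtemp()
docs = [f"doc{i}" for i in range(10)]
master_distribute(docs, 3, shared)
for nid in range(3):
    compute_node(nid, shared)
total = master_collect(3, shared)
print(total)  # all 10 documents accounted for across the 3 nodes
```

Decoupling the nodes through files rather than direct connections is what lets claim 10 hand transmission off to an out-of-process service: the computation never blocks on the network.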
PCT/CN2009/000174 2008-02-22 2009-02-20 Effective relating theme model data processing method and system thereof WO2009103221A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN 200810057989 CN101226557B (en) 2008-02-22 2008-02-22 Method for processing efficient relating subject model data
CN200810057989.4 2008-02-22

Publications (1)

Publication Number Publication Date
WO2009103221A1 true WO2009103221A1 (en) 2009-08-27

Family

ID=39858552

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2009/000174 WO2009103221A1 (en) 2008-02-22 2009-02-20 Effective relating theme model data processing method and system thereof

Country Status (2)

Country Link
CN (1) CN101226557B (en)
WO (1) WO2009103221A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102339283A (en) * 2010-07-20 2012-02-01 中兴通讯股份有限公司 Access control method for cluster file system and cluster node
CN105260477A (en) * 2015-11-06 2016-01-20 北京金山安全软件有限公司 Information pushing method and device

Families Citing this family (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101226557B (en) * 2008-02-22 2010-07-14 中国科学院软件研究所 Method for processing efficient relating subject model data
KR101537078B1 (en) * 2008-11-05 2015-07-15 구글 인코포레이티드 Custom language models
CN101799809B (en) * 2009-02-10 2011-12-14 中国移动通信集团公司 Data mining method and system
CN101909069A (en) * 2009-06-04 2010-12-08 鸿富锦精密工业(深圳)有限公司 Data-processing system
CN102118261B (en) * 2009-12-30 2014-11-26 上海中兴软件有限责任公司 Method and device for data acquisition, and network management equipment
CN102137125A (en) * 2010-01-26 2011-07-27 复旦大学 Method for processing cross task data in distributive network system
CN102567396A (en) * 2010-12-30 2012-07-11 中国移动通信集团公司 Method, system and device for data mining on basis of cloud computing
CN103164261B (en) * 2011-12-15 2016-04-27 中国移动通信集团公司 Multicenter data task disposal route, Apparatus and system
CN102769662A (en) * 2012-05-23 2012-11-07 上海引跑信息科技有限公司 Method for simultaneously distributing data of a type of entities into cluster nodes containing various types of entities related to a type of entities
CN102799486B (en) * 2012-06-18 2014-11-26 北京大学 Data sampling and partitioning method for MapReduce system
CN103970738B (en) * 2013-01-24 2017-08-29 华为技术有限公司 A kind of method and apparatus for producing data
CN103116636B (en) * 2013-02-07 2016-06-08 中国科学院软件研究所 The big Data subject method for digging of the text of feature based spatial decomposition and device
CN105187465B (en) * 2014-06-20 2019-03-01 中国科学院深圳先进技术研究院 A kind of sharing method of file, apparatus and system
CN106034145B (en) * 2015-03-12 2019-08-09 阿里巴巴集团控股有限公司 The method and system of data processing
CN106844654A (en) * 2017-01-23 2017-06-13 公安部第三研究所 Towards the massive video distributed search method of police service practical
US10447765B2 (en) * 2017-07-13 2019-10-15 International Business Machines Corporation Shared memory device
CN109919699B (en) * 2017-12-12 2022-03-04 北京京东尚科信息技术有限公司 Item recommendation method, item recommendation system, and computer-readable medium
CN108763258B (en) * 2018-04-03 2023-01-10 平安科技(深圳)有限公司 Document theme parameter extraction method, product recommendation method, device and storage medium
CN108647244B (en) * 2018-04-13 2021-08-24 广东技术师范学院 Theme teaching resource integration method in form of thinking guide graph and network storage system
CN108616590B (en) * 2018-04-26 2020-07-31 清华大学 Billion-scale network embedded iterative random projection algorithm and device
CN109684094B (en) * 2018-12-25 2020-07-24 人和未来生物科技(长沙)有限公司 Load distribution method and system for parallel mining of massive documents in cloud platform environment
CN110874271B (en) * 2019-11-20 2022-03-11 山东省国土测绘院 Method and system for rapidly calculating mass building pattern spot characteristics
CN111898546B (en) * 2020-07-31 2022-02-18 深圳市商汤科技有限公司 Data processing method and device, electronic equipment and storage medium
CN112183668B (en) * 2020-11-03 2022-07-22 支付宝(杭州)信息技术有限公司 Method and device for training service models in parallel
CN112529720A (en) * 2020-12-28 2021-03-19 深轻(上海)科技有限公司 Method for summarizing calculation results of life insurance actuarial model

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006020039A1 (en) * 2004-07-16 2006-02-23 Cassatt Corporation Distributed parallel file system for a distributed processing system
US20070088703A1 (en) * 2005-10-17 2007-04-19 Microsoft Corporation Peer-to-peer auction based data distribution
CN101004743A (en) * 2006-01-21 2007-07-25 鸿富锦精密工业(深圳)有限公司 Distribution type file conversion system and method
CN101226557A (en) * 2008-02-22 2008-07-23 中国科学院软件研究所 Method and system for processing efficient relating subject model data

Also Published As

Publication number Publication date
CN101226557B (en) 2010-07-14
CN101226557A (en) 2008-07-23

Similar Documents

Publication Publication Date Title
WO2009103221A1 (en) Effective relating theme model data processing method and system thereof
Wang et al. Performance prediction for apache spark platform
CN105593818B (en) Apparatus and method for scheduling distributed workflow tasks
Bautista Villalpando et al. Performance analysis model for big data applications in cloud computing
Zhang et al. Automated profiling and resource management of pig programs for meeting service level objectives
CN104050042B (en) The resource allocation methods and device of ETL operations
Emara et al. Distributed data strategies to support large-scale data analysis across geo-distributed data centers
Chao et al. A gray-box performance model for apache spark
Clemente-Castelló et al. Performance model of mapreduce iterative applications for hybrid cloud bursting
Salloum et al. An asymptotic ensemble learning framework for big data analysis
Tao et al. Collaborative filtering recommendation algorithm based on spark
Kumar et al. Replication-Based Query Management for Resource Allocation Using Hadoop and MapReduce over Big Data
Lee et al. Design and implementation of a data-driven simulation service system
Cafaro et al. Frequent itemset mining
Bonifacio et al. Hadoop MapReduce configuration parameters and system performance: A systematic review
Zu Hadoop-based painting resource storage and retrieval platform construction and testing
Khan et al. Computational performance analysis of cluster-based technologies for big data analytics
Orhean et al. Evaluation of a scientific data search infrastructure
Hirchoua et al. A new knowledge capitalization framework in big data context
Liang et al. Accelerating parallel ALS for collaborative filtering on hadoop
Vamosi et al. Data allocation based on evolutionary data popularity clustering
Watanabe et al. Improving Parallelism in Data-Intensive Workflows with Distributed Databases
Khader et al. Big Data Clustering Using MapReduce Framework: A Review
Fu An improved parallel collaborative filtering algorithm based on Hadoop
Zhang et al. A distributed PCM clustering algorithm based on spark

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 09712447

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 09712447

Country of ref document: EP

Kind code of ref document: A1