WO2009103221A1 - Effective relating theme model data processing method and system thereof - Google Patents


Publication number
WO2009103221A1
Authority
WO
WIPO (PCT)
Prior art keywords
node
document
computing
subset
model
Application number
PCT/CN2009/000174
Other languages
French (fr)
Chinese (zh)
Inventor
李文波
孙乐
Original Assignee
中国科学院软件研究所
Application filed by 中国科学院软件研究所 filed Critical 中国科学院软件研究所
Publication of WO2009103221A1 publication Critical patent/WO2009103221A1/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5061 Partitioning or combining of resources
    • G06F 9/5066 Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2209/00 Indexing scheme relating to G06F 9/00
    • G06F 2209/50 Indexing scheme relating to G06F 9/50
    • G06F 2209/5017 Task decomposition

Definitions

  • The invention relates to a text representation method and a system thereof, and in particular to an efficient data processing method and system based on implicit-topic text representation, belonging to the field of computer information retrieval. Background Art
  • Computer information retrieval is one of the important infrastructures of the information society.
  • The services provided range from basic network information search, through information filtering and classification, to various advanced forms of data mining.
  • In computer information retrieval, the representation of text is a problem of fundamental importance:
  • first, the objects processed in computer information retrieval are mainly text; other types of information generally also depend on text or on attached text for their existence;
  • second, a text representation method is a prerequisite for any computer information retrieval service, because the basic means of information retrieval is to pose queries to, and receive answers from, a search engine in natural-language text, so text must first be converted from its unstructured raw form into a structured form the computer
  • can understand before it can be analyzed and processed; finally, the text representation method is closely tied to the processing algorithms of computer information retrieval, so it largely determines the design of those algorithms.
  • The Correlated Topic Model is a probabilistic text representation method based on implicit topics (reference: Blei, D., Lafferty, J. Correlated Topic Models [J]. Advances in Neural Information Processing Systems, 2006, 18: 147-154.), and because its output can easily be embedded into vector-space and language models, it adapts widely to the analysis and processing algorithms of computer information retrieval.
  • The main function of this method is, by statistically analyzing a certain amount of text, not only to uncover the topics discussed in the text collection and the distribution of each topic in each text, but also, importantly, to measure the degree of association between those topics. This frees text information processing from earlier low-level approaches that relied entirely on vocabulary, allowing it to operate at the higher level of topics.
  • Although the correlated topic model functionally provides an ideal means of high-level text representation, it is currently limited mainly to small amounts of data, and the root cause is a serious bottleneck in its solution method. Its classic implementation is based on conventional serial computation: each step of the computing task must be performed in strict sequence, the result of one step being the start of the next. At any point in time the entire computation therefore runs on a single hardware computing unit, so even placing it on a high-performance computer with multiple hardware computing units (multi-core, multi-processor) cannot speed up the task.
  • The object of the present invention is to provide an efficient correlated topic model data processing method and system that can fully exploit the multi-processor/multi-core parallel architecture of a single machine and the massive parallelism of a computer cluster, thereby achieving high-speed processing of large-scale document collections, that is, pushing the correlated-topic-model text representation method toward practical use.
  • On each node computer, a computing service with a corresponding number of worker threads is automatically generated according to the hardware concurrency of the node;
  • the task document corpus is divided into equal computing-node document subsets and assigned one by one to the corresponding compute nodes;
  • execution of the task (denote the current round as the i-th iteration and use k for the number of a compute node): 2.1.
  • on each compute node, the node document subset is divided into several work blocks; the worker threads perform local parallel computation, first obtaining the processing result D(k,i) of the node document subset for this iteration, i.e. the topic distribution of each document in the subset, and then using these topic distributions to obtain
  • the model statistics for the node document subset; 2.2.
  • on each compute node, the processing result D(k,i) of its node document subset, the model statistics, and the document computation time are transmitted to the master node;
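The per-node iteration in steps 2.1 and 2.2 can be sketched as follows. This is an illustrative Python sketch, not the patent's implementation: `infer_topics` is a hypothetical stand-in for the actual correlated-topic-model inference on a single document, and the "model statistic" is simplified to the per-topic sum of the document topic distributions.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def infer_topics(doc, model):
    # Hypothetical stand-in for CTM inference on one document: returns a
    # uniform distribution over the model's topics.
    k = model["num_topics"]
    return [1.0 / k] * k

def run_iteration(node_docs, model, n_threads, block_size=100):
    """One iteration on a compute node (steps 2.1-2.2): split the node's
    document subset into work blocks, process the blocks with a pool of
    worker threads, and return the per-document topic distributions D(k, i),
    a local model statistic, and the document computation time."""
    start = time.time()
    blocks = [node_docs[i:i + block_size]
              for i in range(0, len(node_docs), block_size)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        per_block = pool.map(
            lambda block: [infer_topics(d, model) for d in block], blocks)
        topic_dists = [dist for block in per_block for dist in block]
    # Simplified model statistic for this subset: per-topic sums of the
    # document topic distributions.
    stats = [sum(col) for col in zip(*topic_dists)]
    return topic_dists, stats, time.time() - start
```

The three returned values correspond to what step 2.2 transmits to the master node: the processing result, the model statistics, and the computation time used for load balancing.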
  • The invention involves the following key elements:
  • The present invention adopts a hierarchical high-performance solving architecture: distributed computing across the cluster, parallel computing within each machine.
  • The cluster level consists of two basic components: one master node and several compute nodes.
  • There is only one master node; it can be an ordinary PC and is mainly responsible for interface interaction, data distribution, result aggregation, model parameter estimation, and related functions.
  • There can be multiple compute nodes (in principle without limit on their number), and different types of computers can be used.
  • The compute nodes carry the main computational workload of the solving task.
  • The master node and the compute nodes are connected over a network; data need only be transmitted directly between the master node and a compute node, and there is no communication among the compute nodes.
  • The node level uses in-machine parallel computing, i.e. computation across threads. Different compute nodes have different degrees of parallelism: a high-performance server with multiple processors can effectively support a number of parallel threads proportional to its processor count, a dual-core workstation can effectively support two-thread parallel computing, while a single-core PC generally supports only single-threaded computing.
  • The present invention adopts hierarchical load-balancing techniques: adaptive allocation of the working set at the cluster level and automatic allocation of the working set at the node level. This differs from the single load-balancing mode used in typical high-performance computing tasks.
  • The adaptive allocation method at the cluster level is as follows. Because the computing power of the compute nodes is not uniform, the master node evaluates each compute node after each iteration and adjusts the strategy in time, so that the working set is distributed in proportion to the nodes' computing power; each node then finishes in approximately the same time, idle waiting on some nodes is avoided, and the computational efficiency of the whole cluster is maximized.
  • Let T(i) denote the computation time used by the i-th compute node,
  • let S denote the size of the working-set corpus, and let S(i) denote the size of the working set processed by the i-th compute node (i.e. the number of documents it processes);
  • the corresponding numbers of documents are then extracted in sequence from the whole collection according to each node's allotted document share.
  • The automatic allocation method at the compute-node level is as follows. Since the worker threads on one node have equal computing power, each thread automatically requests an approximately equal amount of work blocks, so that all threads finish in approximately the same time; idle threads are avoided and the computing power of the whole compute node is fully used.
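The cluster-level reallocation can be illustrated with a short sketch (an assumed reading of the scheme above, not the patent's exact formula): each node's throughput is estimated as documents processed per unit time, and the next round's shares are made proportional to throughput so that all nodes finish at roughly the same time.

```python
def rebalance(sizes, times, total_docs):
    """Adaptive working-set reallocation at the cluster level: sizes[i] is
    the number of documents node i processed this round (S(i)), times[i] the
    time it took (T(i)). Next-round shares are proportional to the observed
    throughput S(i)/T(i); rounding leftovers go to the last node."""
    throughput = [s / t for s, t in zip(sizes, times)]
    total = sum(throughput)
    shares = [int(total_docs * tp / total) for tp in throughput]
    shares[-1] += total_docs - sum(shares)  # hand the rounding remainder to one node
    return shares
```

Under this scheme, a node that took three times as long as a peer on an equal share receives roughly a third as many documents in the next round.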
  • Documents in memory are stored in a distributed fashion, and the addresses of the documents are stored together in a contiguous index array.
  • The key to using this index method to improve concurrent access is: first, set the size of a work block (100 documents by default);
  • the top pointer initially points at the first element of the array;
  • a thread accesses the corresponding documents through the addresses in its work block and processes them; at this point all threads run fully in parallel.
  • This method only requires a thread to take an exclusive lock on a single integer (the top pointer of the index array); there is no mutual exclusion on the index itself, and no lock is needed when scanning the document set itself. This yields maximum concurrency efficiency and avoids the overhead of locking while scanning large data structures.
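A minimal sketch of this dispatch scheme (illustrative Python; the patent does not specify an implementation language): the only mutual exclusion is a lock around one integer, the top pointer into the contiguous index array, and each thread pulls its next block of documents by advancing that pointer.

```python
import threading

class BlockDispatcher:
    """Hands out work blocks over a contiguous index array of document
    addresses; the lock protects only the integer top pointer, never the
    index or the documents themselves."""
    def __init__(self, index, block_size=100):
        self.index = index
        self.block_size = block_size
        self.top = 0                      # initially at the first element
        self.lock = threading.Lock()

    def next_block(self):
        with self.lock:                   # exclusive access to one integer
            if self.top >= len(self.index):
                return None               # working set exhausted
            start, self.top = self.top, self.top + self.block_size
        return self.index[start:start + self.block_size]

def worker(dispatcher, processed):
    # Each thread repeatedly pulls a block and processes its documents.
    while (block := dispatcher.next_block()) is not None:
        processed.extend(block)
```

Because the critical section covers only the pointer bump, threads spend virtually all their time processing documents in parallel, which is the concurrency property the text claims.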
  • The present invention adopts a hierarchical working-set delivery mode: a "push" delivery mode for the cluster working set and a "pull" delivery mode for the working sets of a node's concurrent threads.
  • The working set is divided at two levels. First, at the cluster level it is divided into compute-node document subsets; this task is completed by the master node.
  • The master node divides the working set according to the computing power of each compute node and copies each part to the corresponding compute node; this is the "push" delivery mode. On a compute node, each worker thread actively requests a work block from the node's working subset for computation; this is the "pull" delivery mode.
  • A model-statistic summarization technique is used in the estimation of the correlated topic model.
  • The correlated topic model is mainly defined by three parameter matrices: the topic mean parameter matrix μ_p, the topic covariance parameter matrix C_p, and the topic word-distribution (feature-distribution)
  • parameter matrix β_p. The key step of model estimation is to compute model statistics over the documents (corresponding to three statistic matrices: the topic mean statistic matrix μ_s, the topic covariance statistic matrix C_s, and the topic word-distribution (feature-distribution) statistic
  • matrix W_s); the model parameters are then computed from the model statistics, and the process iterates until convergence.
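The summation structure can be sketched as follows. The matrix names follow the text, but the parameter-update formulas below are simplified illustrations, not the actual CTM variational updates:

```python
import numpy as np

def estimate_parameters(node_stats, n_docs):
    """Master-side step: each compute node k reports its local statistic
    matrices (mu_s, C_s, W_s); the master sums them across nodes and derives
    the parameter matrices. The derivations are illustrative stand-ins for
    the real correlated-topic-model update rules."""
    mu_s = sum(s[0] for s in node_stats)           # topic mean statistic
    C_s = sum(s[1] for s in node_stats)            # topic covariance statistic
    W_s = sum(s[2] for s in node_stats)            # topic-word statistic
    mu_p = mu_s / n_docs                           # topic mean parameter
    C_p = C_s / n_docs - np.outer(mu_p, mu_p)      # topic covariance parameter
    beta_p = W_s / W_s.sum(axis=1, keepdims=True)  # per-topic word distributions
    return mu_p, C_p, beta_p
```

The point illustrated is structural: the statistics are additive across node subsets, so each node can compute them locally and the master only has to sum matrices, which is what makes the distributed decomposition possible.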
  • By mining the internal structure of the correlated-topic-model solving method, the present invention divides the entire computing task into subtasks of different scales; each subtask executes independently and only needs to handle its own related data, so the storage pressure of the computing task as a whole is broken down to the resolution of single computing units. Implemented this way, the method can use multi-processor, multi-core single-machine high-performance hardware to provide computing power and exploit advanced architectures such as large-scale clusters for the solution, thereby increasing the computation speed and expanding the computable scale.
  • FIG. 1 is a schematic diagram of the network structure of the present invention;
  • FIG. 2 is a schematic diagram of the processing flow of the present invention; FIG. 3 is a schematic diagram of the dynamic execution structure of the present invention. Detailed Description
  • The network topology of the present invention is a computer cluster. As shown in Figure 1, it consists of two basic components: one master node and several compute nodes. There is only one master node; it can be an ordinary PC and is mainly responsible for interface interaction, data distribution, and result aggregation. There can be multiple compute nodes (in principle without limit on their number), different types of computers can be used, and the compute nodes carry the main computational workload of the solving task.
  • The master node and the compute nodes are connected over a network. Data need only be transmitted directly between the master node and a compute node; there is no communication among the compute nodes.
  • The processing flow of the present invention is illustrated in Figure 2: the vertical direction represents the sequential steps, and the horizontal direction represents the components that can run in parallel within each step.
  • The sequential steps fall into two major phases, initialization and iterative execution; the iterative execution is further divided into the execution steps of the compute nodes (comprising the two sub-steps of computation and transmission) and the execution steps of the master node (likewise comprising computation and transmission sub-steps).
  • The parallel components explicitly indicated in the figure are mainly: (1) in initialization, the two parallel components of model initialization and document-set division; (2) the parallel components computed independently by the multiple compute nodes; (3) on the master node, the independently executing parallel components of model estimation and adjustment of the working-set division.
  • The dynamic execution structure of the present invention is shown in Figure 3: it is a two-layer architecture of macroscopic distributed computing and microscopic parallel computing.
  • The macroscopic distributed computing is cross-machine: under the coordination of the master node, computing tasks are assigned to the different compute nodes. Because the computing power of the compute nodes differs, the master node must manage load balancing among them; the present invention automatically adjusts the size of each node's working set through an adaptive method, without manual intervention.
  • The microscopic parallel computing is cross-thread. Different compute nodes have different degrees of parallelism: a high-performance server with multiple processors can effectively support as many parallel threads as it has processors, and a dual-core workstation can effectively support two-thread parallel computing,
  • while a single-core PC generally supports only single-threaded computing. Different compute nodes should therefore run different numbers of threads; either too many or too few prevents the node's computing power from being fully exploited.
  • The invention automatically determines the number of supportable threads by detecting the system hardware, without manual specification.
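The patent describes probing the hardware with assembly instructions on Windows and hardware-abstraction-layer calls on Linux; a portable approximation (an assumption, not the patent's code) simply asks the operating system for the logical processor count and sizes the worker pool from it:

```python
import os

def worker_thread_count():
    """Automatically determine how many worker threads this node should run,
    based on the logical processors (cores / hyper-threads) the OS reports."""
    return os.cpu_count() or 1  # cpu_count() may return None on some platforms
```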
  • Document clustering means grouping the documents of a collection so that documents within a group have highly similar content while documents in different groups differ greatly. After such processing the collection has a reasonable grouping structure, which makes it easier to manage; more importantly, by subdividing a large document collection, the user's workload in finding a specific document is greatly reduced and the efficiency of using the documents is improved.
  • Document clustering has important applications in information retrieval. The most typical is grouping search results by topic so that users can concentrate on the web pages of the topics they care about, i.e. a large number of unrelated query results is filtered out automatically; document clustering can thus further improve the usability of general-purpose search engines.
  • The correlated topic model can be used to perform text clustering, implementing the search-engine function of grouping search results by topic.
  • The specific implementation is as follows: a) the search results of the search engine are organized into a complete document set, in which each document corresponds to the title and abstract of one search result;
  • the task document corpus is divided into equal computing-node document subsets and assigned one by one to the corresponding compute nodes;
  • each compute node divides its node document subset into several work blocks, and each worker thread performs local parallel computation to obtain the processing result D(k,i) of the node document subset for this iteration,
  • i.e. the topic distribution of each document in the subset; these topic distributions are then used to obtain the model statistics of the node document subset, and the time each node spends computing over its document subset is recorded.
  • E-mail is one of the most basic network services and an indispensable tool of work and life. While fully enjoying the convenience, immediacy, and low cost of e-mail, people in the Internet age also suffer from spam: almost everyone's mailbox is filled with a large amount of spam of unknown origin. According to statistics, 95% of e-mail is spam, which seriously pollutes the network environment and disturbs normal network communication. Spam filtering is therefore an essential function of an e-mail system. To cope with increasingly fine camouflage, besides the traditional techniques based on identity authentication and sensitive-word filtering, various filtering techniques based on intelligent analysis of e-mail content have gradually developed into the main means of dealing with spam.
  • The correlated topic model can be used to perform topic analysis of e-mail content, implementing filtering based on the subject matter of the mail. The specific implementation is:
  • The product recommendation function is very important in e-commerce: it helps customers find products of genuine interest, thereby improving the shopping experience and increasing the dealer's profits. For this reason, almost all large e-commerce systems use recommendation systems of various forms and to varying degrees.
  • The basic principle of product recommendation is: from a large amount of purchase-record data, analyze customers' buying behavior and summarize the buying patterns of customer groups; when a new customer's purchase information arrives, match it against the learned buying patterns, predict items the customer may also need, and recommend them.
  • The correlated topic model can be used to analyze customers' buying patterns from historical purchase records, thereby supporting product recommendations to new customers.
  • The specific implementation is as follows: a) all historical purchase records are organized into a collection of texts, each purchase record being treated as a "text" and each purchased item as a "word" in that text;
  • the correlated topic model of the present invention is then used to compute the customer group to which a new customer belongs, and finally a product recommendation can be made according to the buying pattern of that group.
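The preprocessing in step a) can be sketched as follows (illustrative; the item names are invented examples): each purchase record becomes one "text" whose "words" are the purchased items, so the result can be handed to the topic model like any document collection.

```python
from collections import Counter

def records_to_corpus(purchase_records):
    """Turn purchase records into bag-of-words 'documents': each record is a
    text, each purchased item a word, with counts for repeated purchases."""
    return [Counter(items) for items in purchase_records]

corpus = records_to_corpus([
    ["milk", "bread", "milk"],        # one customer's purchase record
    ["camera", "tripod", "sd-card"],  # another customer's record
])
```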

Abstract

An efficient correlated topic model data processing method and system are disclosed. In the task-initialization phase, the master node provides an initial model M0 and first synchronizes it to all compute nodes; the task set is then divided and distributed to multiple compute nodes for computation. In the task-execution phase, several rounds of data processing are performed: in each round, the worker threads on each compute node carry out local parallel computation to obtain the topic distribution and the model statistics of the node document subset, which are then transmitted to the master node to be summarized, after which the processing result is tested for convergence. The system comprises a master node and multiple compute nodes, which constitute a cluster computer system executing the parallel computation.

Description

An efficient correlated topic model data processing method and system thereof. Technical Field
The invention relates to a text representation method and a system thereof, and in particular to an efficient data processing method and system based on implicit-topic text representation, belonging to the field of computer information retrieval. Background Art
Computer information retrieval is one of the important infrastructures of the information society; the services provided range from basic network information search, through information filtering and classification, to various advanced forms of data mining. In computer information retrieval, the representation of text is a problem of fundamental importance. First, the objects processed in computer information retrieval are mainly text; other types of information generally also depend on text or on attached text for their existence. Second, a text representation method is a prerequisite for any computer information retrieval service, because the basic means of information retrieval is to pose queries to, and receive answers from, a search engine in natural-language text, so text must first be converted from its unstructured raw form into a structured form the computer can understand before it can be analyzed and processed. Finally, the text representation method is closely tied to the processing algorithms of computer information retrieval, so it largely determines the design of those algorithms.
Common text representation methods fall mainly into three classes: the vector space method (Vector Space Model) (reference: Salton, G. The SMART Retrieval System. Englewood Cliffs: Prentice-Hall, 1971.), the probabilistic method (Probability Model) (reference: Van Rijsbergen, C.J. A new theoretical framework for information retrieval. In Proceedings of SIGIR'86, pp. 194-200, 1986.), and the language model method (Language Model) (reference: Ponte, J., Croft, W.B. A Language Modeling Approach to Information Retrieval. In Proceedings of SIGIR'98, pp. 257-281, 1998.). The Correlated Topic Model is a probabilistic text representation method based on implicit topics (reference: Blei, D., Lafferty, J. Correlated Topic Models [J]. Advances in Neural Information Processing Systems, 2006, 18: 147-154.); moreover, because its output can easily be embedded into vector-space and language models, it adapts widely to the analysis and processing algorithms of computer information retrieval. The main function of the method is, by statistically analyzing a certain amount of text, not only to uncover the topics discussed in the text collection and the distribution of each topic in each text, but also, importantly, to measure the degree of association between those topics. This frees text information processing from earlier low-level processing that relied entirely on vocabulary and allows it to proceed at the higher level of topics.
Although the correlated topic model functionally provides an ideal means of high-level text representation, it is currently limited mainly to small amounts of data and is difficult to use on the large-scale data of real environments. The root cause is that its solving method suffers serious bottlenecks. First, its classic implementation is based on conventional serial computation: each step of the computing task must be carried out in strict sequence, the result of one step being the start of the next. At any point in time the whole computing task can therefore run on only one hardware computing unit, so even placing it on a high-performance computer with multiple hardware computing units (multi-core, multi-processor) cannot speed up the solution. Second, since the computing process itself cannot be split in the serial mode, the processed data must be kept together for the computing process to access at any time, which increases the storage load of the system, such as hard disk and especially memory; the effect of memory is particularly obvious, since excessive memory occupation causes the computing speed to drop sharply or even makes the system refuse to execute the computing task. Summary of the Invention
The object of the present invention is to provide an efficient correlated topic model data processing method and system that can fully exploit the multi-processor/multi-core parallel architecture of a single machine and the massive parallelism of a computer cluster, thereby achieving high-speed processing of large-scale document collections, i.e. pushing the correlated-topic-model text representation method toward practical use.
The technical solution of the present invention is as follows:
1. Task initialization
1.1. On each node computer (including the master node and the compute nodes), a computing service with a corresponding number of worker threads is automatically generated according to the node's hardware concurrency;
1.2. On the master node, an initial model M0 is produced by a random process, and M0 is copied to all compute nodes;
1.3. On the master node, the task document corpus is divided into equal computing-node document subsets, which are assigned one by one to the corresponding compute nodes;
2. Task execution (denote the current round as the i-th iteration and use k for the number of a compute node)
2.1. On each compute node, the node document subset is divided into several work blocks; the worker threads perform local parallel computation, first obtaining the processing result D(k,i) of the node document subset for this iteration, i.e. the topic distribution of each document in the subset, and then using these topic distributions to compute the model statistics for the node document subset;
2.2. On each compute node, the processing result D(k,i) of its node document subset, the model statistics, and the document computation time are transmitted to the master node;
2.3. On the master node, the document computation times are used to judge whether the division of the compute-node document subsets is balanced; if necessary, the division is readjusted and the subsets are reassigned to the corresponding compute nodes;
2.4. On the master node, the model statistics of all compute-node document subsets are first summarized, and the model M for this iteration is then estimated (i.e. model parameter estimation is performed to solve the correlated topic model). If the model has not converged, M is copied to all compute nodes for the next round of computation and model iteration; otherwise the data processing ends, at which point each compute node holds its final data processing result D(k, last). These are summarized into the final data processing result D_last of the whole document corpus, i.e. the topic distribution of every document in the corpus; the final converged model M_last is also obtained.
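The overall control flow of steps 1 and 2 can be condensed into a sketch (illustrative Python with toy stand-ins: `infer` plays the role of a compute node's step 2.1, `estimate` the master's step 2.4, and the "model" is reduced to a single number so convergence can be tested with a scalar difference):

```python
def solve(node_subsets, model0, infer, estimate, tol=1e-6, max_iter=100):
    """Iterate until the model converges: every round, each node subset is
    processed with the current model, the per-subset statistics are summed
    on the master, and a new model is estimated from the summary."""
    model = model0
    results = []
    for _ in range(max_iter):
        results = [infer(subset, model) for subset in node_subsets]  # step 2.1 (parallel in reality)
        summary = sum(r["stat"] for r in results)                    # step 2.4: sum the statistics
        new_model = estimate(summary)
        converged = abs(new_model - model) < tol                     # convergence test
        model = new_model
        if converged:
            break
    # M_last is the final model; D_last is the union of per-node results.
    return model, [r["dist"] for r in results]
```

As a toy usage, estimating the mean of a split data set has exactly this structure: each "node" reports the sum over its subset, and the master divides by the total document count.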
The invention involves the following key elements:
1) The present invention adopts a hierarchical high-performance solving architecture: distributed computing across the cluster and parallel computing within each machine. The cluster level consists of two basic components: one master node and several compute nodes. There is only one master node; it can be an ordinary PC and is mainly responsible for interface interaction, data distribution, result aggregation, model parameter estimation, and related functions. There can be multiple compute nodes (in principle without limit on their number) and different types of computers can be used; the compute nodes carry the main computational workload of the solving task. The master node and the compute nodes are connected over a network; data need only be transmitted directly between the master node and a compute node, and there is no communication among the compute nodes. The node level uses in-machine parallel computing, i.e. computation across threads; different compute nodes have different degrees of parallelism: a high-performance server with multiple processors can effectively support a number of parallel threads proportional to its processor count, a dual-core workstation can effectively support two-thread parallel computing, while a single-core PC generally supports only single-threaded computing.
2) Autonomous determination of the number of concurrent threads per node: on every node (both the master node and the compute nodes), the number of effective threads is determined automatically from the number of processors in that node's computer and the number of cores (or supported hyper-threads) per processor. On the Windows platform the processor information of the hardware system is obtained directly via assembly instructions; on the Linux platform it is obtained through function calls to the hardware abstraction layer (HAL). In a cluster distributed computing environment this avoids the tedium of manually configuring the number of worker threads on each node.
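The platform-specific probes described above (CPUID assembly on Windows, HAL calls on Linux) can be approximated portably; a minimal sketch using Python's standard library — not the patent's own mechanism — might look like:

```python
import os

def effective_thread_count() -> int:
    """Return the number of worker threads to spawn on this node.

    The patent queries CPUID (Windows) or the HAL (Linux) directly;
    this sketch relies on the portable stdlib call instead, which
    reports the number of logical processors (cores x hyper-threads).
    """
    n = os.cpu_count()  # may return None on exotic platforms
    return n if n and n > 0 else 1
```

Each node would then create a computing service with this many worker threads, with no manual configuration.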
3) The invention adopts hierarchical load-balancing techniques: adaptive allocation of the working set at the cluster level and automatic allocation of the working set at the compute-node level. This differs from the single load-balancing mode used in typical high-performance computing tasks. The cluster-level adaptive allocation works as follows: because the computing power of the compute nodes is not uniform, the master node evaluates each compute node after every iteration and adjusts the allocation in time, distributing the working set in proportion to each node's computing power so that all compute nodes finish at approximately the same time. This avoids having some nodes sit idle waiting, and thus maximizes the computational efficiency of the whole cluster.
The concrete method for evaluating and adjusting the compute-node working sets is as follows.

Evaluation method:

First, collect the computation times of all compute nodes into a list Time.

Second, find the longest computation time Max(Time) and the shortest computation time Min(Time), and compute the time difference TimeSpan = Max(Time) - Min(Time).

Third, compare TimeSpan against a predetermined threshold Threshold (5 seconds by default). If TimeSpan > Threshold, the partitioning of the working set must be adjusted; otherwise the previous partitioning is kept.

Adjustment method:

Let Time(i) denote the computation time used by the i-th compute node, let Size denote the size of the full working set, and let Size(i) denote the size of the working set handled by the i-th compute node (i.e., the number of documents it processes). Then:
First, compute each node's document processing speed:

    speed(i) = Size(i) / Time(i)

Second, compute each node's document allocation proportion:

    proportion(i) = speed(i) / Σ_j speed(j)

Third, compute each node's document allocation share:

    share(i) = proportion(i) × Size
Fourth, according to each node's document allocation share, take the corresponding number of documents from the full collection in turn and distribute them.
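The four evaluation and adjustment steps above can be sketched as follows. The function name and the integer-rounding rule for shares (remainder given to the last node) are illustrative assumptions; the 5-second default threshold is the one stated in the text.

```python
def rebalance(sizes, times, total_docs, threshold=5.0):
    """Recompute per-node document shares from the last iteration.

    sizes[i] -- documents node i processed last iteration (Size(i))
    times[i] -- seconds node i spent on them (Time(i))
    Returns the new shares, or None if the split is already balanced.
    """
    span = max(times) - min(times)        # TimeSpan = Max(Time) - Min(Time)
    if span <= threshold:                 # keep the previous partitioning
        return None
    speeds = [s / t for s, t in zip(sizes, times)]    # speed(i)
    total_speed = sum(speeds)
    props = [v / total_speed for v in speeds]         # proportion(i)
    shares = [int(p * total_docs) for p in props]     # share(i), truncated
    shares[-1] += total_docs - sum(shares)            # hand remainder to last node
    return shares
```

For example, two nodes that each processed 100 documents in 10 s and 2 s respectively have speeds 10 and 50 docs/s, so a 200-document corpus is re-split roughly 1:5.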
The automatic allocation of the working set at the compute-node level works as follows: since the worker threads on a single node have identical computing power, each thread automatically claims work blocks of approximately equal size, so that all threads finish at approximately the same time. This avoids idle threads and maximizes the computational efficiency of the whole compute node.
4) High-concurrency access to the working set at the compute-node level: once a compute node's working set (i.e., the received document subset) is loaded into memory, the concurrent threads use an index structure to divide the text objects among themselves. After the division, all threads access the working set simultaneously during computation without locking it, so multiple worker threads achieve full parallelism while executing the computing task. The indexing method is described in detail below:
In memory the documents are stored in scattered locations, and their addresses are gathered into one contiguous index array. The keys to improving concurrent access with this indexing method are:

First, set the size of a work block (100 documents by default).

Second, set up a top pointer into the index array and protect it with a lock (a critical-section mutex); the pointer initially points at the first element of the array.

Third, under the protection of the lock, each thread accesses the top pointer of the index array in mutual exclusion to obtain the addresses of the documents it will process (i.e., a contiguous segment of the index array).

Fourth, each thread accesses the corresponding documents through the addresses in its work block and processes them; at this point all threads run fully in parallel.

Therefore the method only requires threads to perform locked, mutually exclusive access to a single integer (the top pointer of the index array). No locked scan of the index itself, let alone of the document set itself, is needed. This yields maximal concurrency and avoids the overhead of taking locks while scanning large data structures.
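A minimal sketch of this single-lock index scheme, with Python threads standing in for the worker threads (class and function names are illustrative assumptions):

```python
import threading

BLOCK_SIZE = 100  # work-block size in documents (the patent's default)

class WorkQueue:
    """Hand out contiguous index-array segments under a single lock."""

    def __init__(self, doc_index):
        self.doc_index = doc_index      # contiguous array of document addresses
        self.top = 0                    # top pointer into the index array
        self.lock = threading.Lock()    # the only mutex in the scheme

    def claim_block(self):
        """Claim the next work block; only the pointer bump is locked."""
        with self.lock:
            start = self.top
            end = min(start + BLOCK_SIZE, len(self.doc_index))
            self.top = end
        return self.doc_index[start:end]  # empty once the subset is exhausted

def worker(queue, process):
    """Worker loop: pull blocks, then process documents lock-free."""
    while True:
        block = queue.claim_block()
        if not block:
            break
        for doc in block:   # fully parallel: no lock held here
            process(doc)
```

Each thread contends only for the brief pointer update; document processing itself runs without any synchronization.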
5) The invention adopts a hierarchical working-set delivery mode: a "push" delivery mode for the cluster working set and a "pull" delivery mode for the working sets of the concurrent threads on a node. The full working set is partitioned hierarchically. First, at the cluster level, the working set is divided into compute-node document subsets; this task is performed by the master node, which partitions the full working set according to the computing power of each compute node and copies each part to the corresponding node. This is the "push" delivery mode. On each compute node, each worker thread actively requests work blocks from the node's working subset for computation. This is the "pull" delivery mode.
6) Synchronization between the master node and the compute nodes: computation and transmission are separated. Computing tasks do not use remote data access but a local read/write mode; the transmission task is handled by an out-of-process file transfer service (FTP) or a network file system service (NFS). This improves the scalability and maintainability of the system. In addition, numerical data is transmitted in a textual representation format, which avoids the differences in binary representation caused by different hardware platforms, operating systems, and development tool chains, so the system can be developed and run in a mixed-platform environment.
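A sketch of the textual numeric format idea. The patent does not specify the exact layout, so the one-row-per-line, space-separated format below is an assumption:

```python
def matrix_to_text(matrix):
    """Serialize a numeric matrix to a plain-text format.

    A text (rather than binary) representation sidesteps endianness
    and float-layout differences across mixed hardware/OS platforms,
    at the cost of a larger payload. Layout assumed here: one row per
    line, values space-separated.
    """
    return "\n".join(" ".join(repr(v) for v in row) for row in matrix)

def text_to_matrix(text):
    """Inverse of matrix_to_text."""
    return [[float(v) for v in line.split()] for line in text.splitlines()]
```

A round trip through text recovers the original values, regardless of which platform wrote the file and which reads it.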
7) Model statistics aggregation in correlated-topic-model estimation: the correlated topic model is defined mainly by three parameter matrices, namely the topic mean parameter matrix Λ_p, the topic variance parameter matrix C_p, and the topic word-distribution (feature-distribution) parameter matrix W_p. The key step of model estimation is to compute the model statistics from the documents (three corresponding statistic matrices: the topic mean statistic matrix Λ_s, the topic variance statistic matrix C_s, and the topic word-distribution (feature-distribution) statistic matrix W_s), and then to compute the model parameters from these statistics; this process iterates until convergence.
The difference between this process in the serial and distributed modes is the following: under serial data processing all data resides on a single computer, so the model statistics are stored centrally; under distributed data processing each computer separately computes the model statistics for its own portion of the data, so the statistics must be aggregated. Concretely, for each statistic matrix

    S = Σ_k S^(k),

where S^(k) denotes the model statistic computed on compute node k, and the summation is applied to each of the three statistic matrices Λ_s, C_s, and W_s.
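The per-node summation S = Σ_k S^(k) is an element-wise sum of matrices; as a sketch (the dict-of-matrices representation and the statistic names are assumptions for illustration):

```python
def aggregate_statistics(node_stats):
    """Element-wise sum of per-node statistic matrices.

    node_stats -- list of dicts, one per compute node, each mapping a
    statistic name (e.g. "mean", "var", "word") to a matrix given as
    a list of rows. Returns one dict of summed matrices, as the master
    node would build before re-estimating the model parameters.
    """
    total = {}
    for stats in node_stats:
        for name, matrix in stats.items():
            if name not in total:
                total[name] = [row[:] for row in matrix]  # copy, don't alias
            else:
                for i, row in enumerate(matrix):
                    for j, v in enumerate(row):
                        total[name][i][j] += v
    return total
```

Because the sum is associative, the master node can fold in each node's contribution as it arrives rather than waiting for all of them.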
Positive effects of the invention:

Compared with the prior art, the invention exploits the internal structure of the correlated-topic-model solving procedure and adopts a divide-and-conquer strategy, splitting the whole computing task into subtasks of different scales. Each subtask executes independently and needs to process only its own data, so overall the storage pressure of the computing task is dissolved and the limits of a single computing unit are overcome. In implementation, the method solves the model using the computing power offered by multi-processor and multi-core single-machine high-performance hardware, together with advanced architectures such as large-scale cluster parallelism, thereby achieving the goals of increasing computation speed and enlarging the problem scale.

Description of the Drawings
Figure 1 is a schematic diagram of the network structure of the invention;

Figure 2 is a schematic flow chart of the method of the invention;

Figure 3 is a schematic diagram of the dynamic execution structure of the invention.

Detailed Description
Embodiments of the method of the invention are described below with reference to the accompanying drawings.
The network topology of the invention is a computer cluster, as shown in Figure 1. It consists of two basic components: one master node and a number of compute nodes. There is exactly one master node; it can be an ordinary PC and is mainly responsible for interface interaction, data distribution, and result aggregation. There can be many compute nodes (in principle no limit on their number), different types of computers may be used, and the compute nodes carry the main computational workload of the solving task. The master node and the compute nodes are connected by a network; data is transmitted only between the master node and the compute nodes, and there is no communication among the compute nodes.
The method flow of the invention is shown in Figure 2: the vertical direction shows the sequential steps, while the horizontal direction shows the components that can run in parallel within each step. The sequential steps fall into two major phases, initialization and iterative execution; iterative execution in turn splits into the execution steps of the compute nodes (a computation sub-step and a transmission sub-step) and the execution steps of the master node (likewise computation and transmission sub-steps). The parallel components shown explicitly in the figure are: (1) in initialization, the two parallel components of model initialization and document-collection partitioning; (2) the independent parallel computation of the multiple compute nodes; (3) on the master node, the independently executed parallel components of model estimation and working-set repartitioning. Beyond the parallel components shown explicitly in the figure, there is one more important kind of parallelism: the parallel execution threads within a single compute node, which is shown in the schematic of the dynamic execution structure in Figure 3.
The dynamic execution structure of the invention is shown in Figure 3: it is a two-level architecture with macro-level distributed computing and micro-level parallel computing. Macro-level distributed computing spans computers: under the coordination of the master node, computing tasks are assigned to different compute nodes. Because the computing power of the compute nodes differs, the master node must manage the load balance among them; the invention automatically adjusts the size of each node's working set through an adaptive method, without manual intervention. Micro-level parallel computing spans threads: different compute nodes have different degrees of parallelism; for example, a high-performance server with multiple processors can effectively support as many parallel threads as it has processors, a dual-core workstation can effectively support two-thread parallel computing, and a single-core PC generally supports only single-threaded computing. Compute nodes with different degrees of parallelism should therefore run different numbers of threads; either too many or too few prevents a node from delivering its full computing power. The invention computes the number of supportable threads by automatically detecting the system hardware, with no need for manual specification.
Applications of the invention are described below for specific application domains.
1. Document clustering

Document clustering groups the documents of a collection so that documents within the same group have highly similar content while documents in different groups differ considerably. After such processing the collection has a sensibly grouped structure, which makes it easier to manage; more importantly, subdividing a large document collection greatly reduces the effort a user needs to find a particular document and improves the efficiency of using the collection. Document clustering has important uses in information retrieval, most typically grouping search results by topic so that users can focus on the pages about the topics they care about, automatically filtering out large numbers of irrelevant results. Document clustering can therefore further improve the usability of general-purpose search engines.
The correlated topic model can be used for text clustering to give a search engine the ability to group its results by topic. A concrete embodiment is:

1) Organize the search engine's results into a document collection, where each document corresponds to the title and snippet of one search result.
2) Process this document collection with the efficient correlated-topic-model data processing method and system of the invention to obtain the topic of each text. The procedure is as follows:
1. Task initialization

1.1. On each node computer (both the master node and the compute nodes), automatically create a computing service with the appropriate number of worker threads according to the node's hardware concurrency;

1.2. On the master node, generate an initial model M_0 by a random process and copy M_0 to all compute nodes;

1.3. On the master node, divide the full task document collection into equal compute-node document subsets and assign them one by one to the corresponding compute nodes.

2. Task execution (denote the current round as the i-th iteration, and let k denote the index of a compute node)

2.1. On each compute node, divide the node's document subset into work blocks, and let the worker threads compute locally in parallel. This first yields the processing result D(k,i) of the node's document subset for this iteration, i.e., the topic distribution of every document in the subset; from these topic distributions, the model statistics for the node's document subset are then obtained. At the same time, each node records the document computation time used for its subset.

2.2. On each compute node, transmit the processing result D(k,i) of its document subset, the model statistics, and the document computation time to the master node.

2.3. On the master node, use the document computation times to evaluate the balance of the partition into compute-node document subsets; if necessary, repartition the subsets and assign them to the corresponding compute nodes.

2.4. On the master node, first aggregate the model statistics of all compute nodes, then estimate this iteration's model M (i.e., perform model parameter estimation to solve the correlated topic model). If the model has not converged, copy M to all compute nodes for the next round of computation and model iteration; otherwise terminate the data processing. At that point each compute node holds its final data processing result D(k,last); aggregating these yields the final data processing result D_last for the full document collection, i.e., the topic distribution of every document in the collection, along with the final converged model M_last.

3) From each document's topic distribution one obtains the document's dominant topic (i.e., the topic the document is most concentrated on), and the document is then assigned to the group for that topic. This yields the grouping of the search engine's results by topic.
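Assigning each document to its dominant topic from the final topic distributions is an argmax over each distribution; a minimal sketch (variable names are assumptions):

```python
def group_by_dominant_topic(topic_distributions):
    """Group document ids by their dominant (argmax) topic.

    topic_distributions -- list of per-document topic probability
    lists (the final result D_last in the text). Returns a dict
    mapping topic id to the list of document ids assigned to it.
    """
    groups = {}
    for doc_id, dist in enumerate(topic_distributions):
        dominant = max(range(len(dist)), key=lambda t: dist[t])
        groups.setdefault(dominant, []).append(doc_id)
    return groups
```

Each resulting group is one topic cluster of search results.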
2. Mail filtering

E-mail is one of the most basic network services and an indispensable tool in people's work and life. While fully enjoying the convenience, immediacy, and low cost of e-mail, people in the Internet era also suffer the nuisance of spam. Almost everyone's mailbox is flooded with spam of unknown origin; by some statistics 95% of mail is spam, which seriously pollutes the network environment and interferes with normal communication. Spam filtering is therefore an essential function of an e-mail system. Beyond the traditional techniques based on identity authentication and sensitive-word filtering, filtering techniques that intelligently analyze message content have gradually developed into the main means of dealing with finely disguised spam.

The correlated topic model can be used to perform topic analysis on the content of e-mail and thereby filter messages by topic. A concrete embodiment is:
1) Divide all existing e-mail into two opposing collections: a normal-mail collection and a spam collection.

2) Compute a correlated topic model for the normal-mail collection and for the spam collection separately with the invention, obtaining two correlated topic models.

3) For a newly received e-mail, compute its similarity to the two correlated topic models; this yields the decision of whether the message is spam.
3. Product recommendation

Product recommendation is very important in e-commerce: it helps customers discover products they are genuinely interested in, improving the customers' shopping experience while increasing the merchants' profits. Almost all large e-commerce systems therefore use recommendation systems of various forms to some degree. The basic principle of product recommendation is: from a large volume of purchase records, analyze customers' purchasing behavior and summarize the purchase patterns of customer groups; when a new customer submits purchase information, match it against the previously learned purchase patterns to predict which products the customer may also need, and recommend them to the customer.
The correlated topic model can be used to analyze customers' purchase patterns from historical purchase records, thereby supporting product recommendation for new customers. A concrete embodiment is:

1) Organize all historical purchase records into a text collection, treating each purchase record as a "text" and each purchased product as a "word" in the text.

2) Compute the correlated topic model of the invention over this text collection to discover customer groups with different purchase patterns.

3) For a new purchase record, use the correlated topic model of the invention to compute the customer group it belongs to; product recommendations can then be made according to that group's purchase pattern.

Claims

1. An efficient correlated-topic-model data processing method, whose steps are:

Initialization phase:

1) On each node computer, automatically create a computing service with the appropriate number of worker threads according to the node's hardware concurrency;

2) The master node produces an initial model and copies it to all compute nodes;

3) The master node divides the full task document collection into compute-node document subsets and assigns them to the corresponding compute nodes;

Iteration phase:

1) Each compute node processes its received node document subset, obtaining the topic distribution of every document in the subset and the model statistics of the subset;

2) Each compute node returns its results to the master node for aggregation, yielding the topic distribution of the full task document collection;

3) From the aggregated model statistics, the master node iterates the model and tests its convergence: if it has not converged, the iteration phase is repeated; otherwise the data processing ends.
2. The method of claim 1, wherein the hardware concurrency of a node computer is obtained as follows:

1) On the Windows platform the processor information of the hardware system is obtained directly via assembly instructions; on the Linux platform it is obtained through function calls to the hardware abstraction layer (HAL): first obtain the number of processors of each node computer, then obtain the number of cores in each processor;

2) Sum the core counts across all processors of the node computer to automatically determine the number of effective threads the compute node supports.
3. The method of claim 1, wherein the master node evaluates the balance of the partition into compute-node document subsets as follows:

1) Collect the computation times of all compute nodes into a list Time;

2) Find the longest node computation time Max(Time) and the shortest node computation time Min(Time), and compute the time difference TimeSpan = Max(Time) - Min(Time);

3) Compare TimeSpan against a predetermined threshold Threshold: if TimeSpan > Threshold, the partition into node document subsets must be adjusted; otherwise the previous partition is kept.
4. The method of claim 3, wherein the partition into compute-node document subsets is adjusted as follows:

1) While processing its received node document subset, each compute node records the time it takes to process the subset;

2) Each compute node transmits this processing time back to the master node;

3) The master node computes each node's document processing speed from the document computation times;

4) The master node computes each node's document allocation share from the nodes' document processing speeds;

5) According to the nodes' document allocation shares, the master node takes the corresponding numbers of documents from the full collection in turn and distributes them.
5. The method of claim 4, wherein a compute node processes data as follows:

1) Each compute node obtains its own number of processors and the number of cores per processor, thereby obtaining the number of effective threads it supports;

2) The compute node divides the received document subset into work blocks according to its number of effective threads;

3) Each worker thread of the compute node actively claims work blocks through an index structure and processes them.
6. The method of claim 5, wherein work blocks are obtained through the index structure as follows:

1) Set the size of the work blocks into which the node document subset is divided;

2) Set up a top pointer into an index array and protect it with a lock;

3) Under the protection of the lock, each thread accesses the top pointer of the index array in mutual exclusion to obtain the addresses of the documents it will process;

4) Each thread accesses the corresponding documents through the addresses in its work block and processes them.
7. An efficient correlated-topic-model data processing system, comprising a master node and a number of compute nodes, wherein:

the master node is responsible for interface interaction, data distribution, result aggregation, and model estimation;

the compute nodes carry the main computational workload of the solving task;

the master node and the compute nodes establish communication connections for data transmission.
8. The system of claim 7, wherein the master node and the compute nodes are hardware platforms with single-core processors, multi-core processors, or multiple processors.
9. 如权利要求 7所述的系统, 其特征在于所述主控节点和计算节点通过网络进行数据 传输, 所述数据的数值格式采用文本表示格式。 9. The system of claim 7 wherein said master node and compute node perform data over a network Transmission, the numerical format of the data is in a text representation format.
10. 如权利要求 7所述的系统, 其特征在于计算和传输分离, 即所述计算节点进行数据 处理时不考虑数据的远程访问而是釆用本地读写的模式, 所述计算节点和主控节点 的数据传输任务由基于进程外的文件传输服务(FTP) 或集群系统提供的网络文件 系统服务 (NFS ) 承担。  10. The system according to claim 7, characterized in that the calculation and the transmission are separated, that is, the computing node performs data processing without considering remote access of data but adopts a mode of local reading and writing, the computing node and the main The data transfer task of the control node is undertaken by an out-of-process file transfer service (FTP) or a network file system service (NFS) provided by the cluster system.
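The compute-transfer separation of claims 7 and 10 can be sketched as follows: the master node distributes document subsets and aggregates results, while computing nodes only read and write local files, with a shared directory standing in for the claim's NFS/FTP service. All names and the toy "result" (a document count) are hypothetical, not from the patent.

```python
import json
import os
import tempfile

def master_distribute(documents, num_nodes, shared_dir):
    # Claim 7: the master node performs data distribution, one subset per node.
    subsets = [documents[i::num_nodes] for i in range(num_nodes)]
    for node_id, subset in enumerate(subsets):
        with open(os.path.join(shared_dir, f"subset_{node_id}.json"), "w") as f:
            json.dump(subset, f)  # claim 9: text representation format

def compute_node(node_id, shared_dir):
    # Claim 10: the node only does local reads/writes; the shared directory
    # (standing in for NFS/FTP) is what actually moves the data.
    with open(os.path.join(shared_dir, f"subset_{node_id}.json")) as f:
        subset = json.load(f)
    result = {"node": node_id, "count": len(subset)}  # stand-in for real work
    with open(os.path.join(shared_dir, f"result_{node_id}.json"), "w") as f:
        json.dump(result, f)

def master_collect(num_nodes, shared_dir):
    # Claim 7: result aggregation back on the master node.
    total = 0
    for node_id in range(num_nodes):
        with open(os.path.join(shared_dir, f"result_{node_id}.json")) as f:
            total += json.load(f)["count"]
    return total

shared = tempfile.mkdtemp()
docs = [f"doc{i}" for i in range(10)]
master_distribute(docs, 3, shared)
for nid in range(3):
    compute_node(nid, shared)
total = master_collect(3, shared)
print(total)  # all 10 documents accounted for across the 3 nodes
```

Decoupling the nodes through files rather than direct connections is what lets claim 10 hand transmission off to an out-of-process service: the computation never blocks on the network.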
PCT/CN2009/000174 2008-02-22 2009-02-20 Effective relating theme model data processing method and system thereof WO2009103221A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN 200810057989 CN101226557B (en) 2008-02-22 2008-02-22 Method for processing efficient relating subject model data
CN200810057989.4 2008-02-22

Publications (1)

Publication Number Publication Date
WO2009103221A1 true WO2009103221A1 (en) 2009-08-27

Family

ID=39858552

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2009/000174 WO2009103221A1 (en) 2008-02-22 2009-02-20 Effective relating theme model data processing method and system thereof

Country Status (2)

Country Link
CN (1) CN101226557B (en)
WO (1) WO2009103221A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102339283A (en) * 2010-07-20 2012-02-01 中兴通讯股份有限公司 Access control method for cluster file system and cluster node
CN105260477A (en) * 2015-11-06 2016-01-20 北京金山安全软件有限公司 Information pushing method and device

Families Citing this family (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101226557B (en) * 2008-02-22 2010-07-14 中国科学院软件研究所 Method for processing efficient relating subject model data
KR101537078B1 (en) * 2008-11-05 2015-07-15 구글 인코포레이티드 Custom language models
CN101799809B (en) * 2009-02-10 2011-12-14 中国移动通信集团公司 Data mining method and system
CN101909069A (en) * 2009-06-04 2010-12-08 鸿富锦精密工业(深圳)有限公司 Data-processing system
CN102118261B (en) * 2009-12-30 2014-11-26 上海中兴软件有限责任公司 Method and device for data acquisition, and network management equipment
CN102137125A (en) * 2010-01-26 2011-07-27 复旦大学 Method for processing cross task data in distributive network system
CN102567396A (en) * 2010-12-30 2012-07-11 中国移动通信集团公司 Method, system and device for data mining on basis of cloud computing
CN103164261B (en) * 2011-12-15 2016-04-27 中国移动通信集团公司 Multicenter data task disposal route, Apparatus and system
CN102769662A (en) * 2012-05-23 2012-11-07 上海引跑信息科技有限公司 Method for simultaneously distributing data of a type of entities into cluster nodes containing various types of entities related to a type of entities
CN102799486B (en) * 2012-06-18 2014-11-26 北京大学 Data sampling and partitioning method for MapReduce system
CN103970738B (en) * 2013-01-24 2017-08-29 华为技术有限公司 A kind of method and apparatus for producing data
CN103116636B (en) * 2013-02-07 2016-06-08 中国科学院软件研究所 The big Data subject method for digging of the text of feature based spatial decomposition and device
CN105187465B (en) * 2014-06-20 2019-03-01 中国科学院深圳先进技术研究院 A kind of sharing method of file, apparatus and system
CN106034145B (en) * 2015-03-12 2019-08-09 阿里巴巴集团控股有限公司 The method and system of data processing
CN106844654A (en) * 2017-01-23 2017-06-13 公安部第三研究所 Towards the massive video distributed search method of police service practical
US10447765B2 (en) * 2017-07-13 2019-10-15 International Business Machines Corporation Shared memory device
CN109919699B (en) * 2017-12-12 2022-03-04 北京京东尚科信息技术有限公司 Item recommendation method, item recommendation system, and computer-readable medium
CN108763258B (en) * 2018-04-03 2023-01-10 平安科技(深圳)有限公司 Document theme parameter extraction method, product recommendation method, device and storage medium
CN108647244B (en) * 2018-04-13 2021-08-24 广东技术师范学院 Theme teaching resource integration method in form of thinking guide graph and network storage system
CN108616590B (en) * 2018-04-26 2020-07-31 清华大学 Billion-scale network embedded iterative random projection algorithm and device
CN109684094B (en) * 2018-12-25 2020-07-24 人和未来生物科技(长沙)有限公司 Load distribution method and system for parallel mining of massive documents in cloud platform environment
CN110874271B (en) * 2019-11-20 2022-03-11 山东省国土测绘院 Method and system for rapidly calculating mass building pattern spot characteristics
CN111898546B (en) * 2020-07-31 2022-02-18 深圳市商汤科技有限公司 Data processing method and device, electronic equipment and storage medium
CN112183668B (en) * 2020-11-03 2022-07-22 支付宝(杭州)信息技术有限公司 Method and device for training service models in parallel
CN112529720A (en) * 2020-12-28 2021-03-19 深轻(上海)科技有限公司 Method for summarizing calculation results of life insurance actuarial model

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006020039A1 (en) * 2004-07-16 2006-02-23 Cassatt Corporation Distributed parallel file system for a distributed processing system
US20070088703A1 (en) * 2005-10-17 2007-04-19 Microsoft Corporation Peer-to-peer auction based data distribution
CN101004743A (en) * 2006-01-21 2007-07-25 鸿富锦精密工业(深圳)有限公司 Distribution type file conversion system and method
CN101226557A (en) * 2008-02-22 2008-07-23 中国科学院软件研究所 Method and system for processing efficient relating subject model data

Also Published As

Publication number Publication date
CN101226557B (en) 2010-07-14
CN101226557A (en) 2008-07-23

Similar Documents

Publication Publication Date Title
WO2009103221A1 (en) Effective relating theme model data processing method and system thereof
Wang et al. Performance prediction for apache spark platform
CN105593818B (en) Apparatus and method for scheduling distributed workflow tasks
Bautista Villalpando et al. Performance analysis model for big data applications in cloud computing
Zhang et al. Automated profiling and resource management of pig programs for meeting service level objectives
CN104050042B (en) The resource allocation methods and device of ETL operations
Emara et al. Distributed data strategies to support large-scale data analysis across geo-distributed data centers
Chao et al. A gray-box performance model for apache spark
Clemente-Castelló et al. Performance model of mapreduce iterative applications for hybrid cloud bursting
Salloum et al. An asymptotic ensemble learning framework for big data analysis
Tao et al. Collaborative filtering recommendation algorithm based on spark
Kumar et al. Replication-Based Query Management for Resource Allocation Using Hadoop and MapReduce over Big Data
Lee et al. Design and implementation of a data-driven simulation service system
Cafaro et al. Frequent itemset mining
Bonifacio et al. Hadoop MapReduce configuration parameters and system performance: A systematic review
Zu Hadoop-based painting resource storage and retrieval platform construction and testing
Khan et al. Computational performance analysis of cluster-based technologies for big data analytics
Orhean et al. Evaluation of a scientific data search infrastructure
Hirchoua et al. A new knowledge capitalization framework in big data context
Liang et al. Accelerating parallel ALS for collaborative filtering on hadoop
Vamosi et al. Data allocation based on evolutionary data popularity clustering
Watanabe et al. Improving Parallelism in Data-Intensive Workflows with Distributed Databases
Khader et al. Big Data Clustering Using MapReduce Framework: A Review
Fu An improved parallel collaborative filtering algorithm based on Hadoop
Zhang et al. A distributed PCM clustering algorithm based on spark

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 09712447

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 09712447

Country of ref document: EP

Kind code of ref document: A1