XQuery/Latent Semantic Indexing

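Motivation

You have a collection of documents and, for any given document, you want to find out which other documents are most similar to it.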

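Method

We will use a text-mining technique called "Latent Semantic Indexing". We first create a matrix of all concept words (terms) against all documents. Each cell holds the frequency of one term in one document. We then send this word-document matrix to a service that performs a standard Singular Value Decomposition (SVD). SVD is computationally intensive: with a large number of words and documents it can take hours or days to compute. The SVD service returns a set of "concept vectors" that can be used to group related documents.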

Sample Data

To keep the example simple, we will use only document titles rather than full documents.

Here are some document titles:

XQuery Tutorial and Cookbook 
XForms Tutorial and Cookbook 
Auto-generation of XForms with XQuery 
Building RESTful Web Applications with XRX 
XRX Tutorial and Cookbook 
XRX Architectural Overview 
The Return on Investment of XRX

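Our first step is to build a word-document matrix. This matrix has one row for each word in the documents and one column for each document.

We will do this in several steps.

1. Get all the words in all the documents and put them into a single sequence
2. Create a list of the unique words that are not "stop words"
3. For each word, and for each document, count how often the word occurs in that document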
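
The steps above can be tried without a database. The following is a minimal sketch, not the article's program: it assumes three of the sample titles inlined as strings, a shortened stop-word list, and hypothetical `matrix`, `row` and `cell` element names for the output.

```xquery
xquery version "1.0";

(: Assumption: titles are inlined so the sketch runs without a collection. :)
let $titles := (
    'XQuery Tutorial and Cookbook',
    'XForms Tutorial and Cookbook',
    'Auto-generation of XForms with XQuery'
)

(: Assumption: a shortened stop-word list, enough for these three titles. :)
let $stopwords := ('a', 'and', 'the', 'of', 'with')

(: Step 1: every word of every title, in one sequence :)
let $all-words :=
    for $title in $titles
    return tokenize(normalize-space($title), '\s+')

(: Step 2: the unique words that are not stop words, sorted :)
let $concept-words :=
    for $word in distinct-values($all-words)
    where not(lower-case($word) = $stopwords)
    order by $word
    return $word

(: Step 3: for each word, count its occurrences in each document :)
return
    <matrix>
    {
        for $word in $concept-words
        return
            <row word="{$word}">
            {
                for $title at $pos in $titles
                return
                    <cell doc="{$pos}">{
                        count(tokenize(normalize-space($title), '\s+')[. = $word])
                    }</cell>
            }
            </row>
    }
    </matrix>
```

With these three titles the `XQuery` row should hold the counts 1, 0 and 1. The sketch emits raw counts; the program further down divides by the total number of words instead.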

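Sample Word-Document Matrix

Word	1	2	3	4	5	6	7
Applications				0.03125
Architectural						0.03125
Auto-generation	0.03125
Building				0.03125
Cookbook		0.03125	0.03125		0.03125
Investment							0.03125
Overview						0.03125
RESTful				0.03125
Return							0.03125
Tutorial		0.03125	0.03125		0.03125
Web				0.03125
XForms	0.03125		0.03125
XQuery	0.03125	0.03125
XRX				0.03125	0.03125	0.03125	0.03125

The column numbers appear to follow the order in which the documents were returned from the collection, which differs from the list above: column 1 is "Auto-generation of XForms with XQuery", column 2 is "XQuery Tutorial and Cookbook" and column 3 is "XForms Tutorial and Cookbook".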
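
The value 0.03125 in every filled cell is 1 divided by the total number of words in all titles, stop words included. The sketch below reproduces that arithmetic; it assumes the seven sample titles inlined as strings, and the `cell-value` element name is hypothetical.

```xquery
xquery version "1.0";

(: Assumption: the seven sample titles are inlined instead of read from the database. :)
let $titles := (
    'XQuery Tutorial and Cookbook',
    'XForms Tutorial and Cookbook',
    'Auto-generation of XForms with XQuery',
    'Building RESTful Web Applications with XRX',
    'XRX Tutorial and Cookbook',
    'XRX Architectural Overview',
    'The Return on Investment of XRX'
)

(: every whitespace-separated word, stop words included :)
let $total-word-count :=
    count(
        for $title in $titles
        return tokenize($title, '\s+')
    )

(: one occurrence divided by the total word count :)
return
    <cell-value total-words="{$total-word-count}">{1 div $total-word-count}</cell-value>
```

The seven titles contain 4 + 4 + 5 + 6 + 4 + 3 + 6 = 32 words, so the result is 1 div 32 = 0.03125.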

Sample Program Source Code

xquery version "1.0";

declare option exist:serialize "method=xhtml media-type=text/html indent=yes";

(: this is where we get our data :)
let $app-collection := '/db/apps/latent-semantic-analysis'
let $data-collection := concat($app-collection , '/data')

(: get all the titles where $titles is a sequence of titles :)
let  $titles := collection($data-collection)/html/head/title/text()
let $doc-count := count($titles)

(: stop words: common words that are left out of the matrix :)
let $stopwords :=
<words>
   <word>a</word>
   <word>and</word>
   <word>in</word>
   <word>the</word>
   <word>of</word>
   <word>or</word>
   <word>on</word>
   <word>over</word>
   <word>with</word>
</words>

(: a sequence of words in all the document titles :)
(: \s+ matches a run of whitespace; normalize-space() trims each title first so no empty tokens are produced :)
let $all-words :=
   for $title in $titles
      return
         tokenize(normalize-space($title), '\s+')

(: just get a distinct list of the sorted words that are not stop words :)
let $concept-words :=
   for $word in distinct-values($all-words)
   order by $word
      return
         if ($stopwords/word = lower-case($word))
            then ()
            else $word

let $total-word-count := count($all-words)
return
<html>
    <head>
        <title>All Document Words</title>
    </head>
    <body>
        <p>Doc count =<b>{$doc-count}</b> Word count = <b>{$total-word-count}</b></p>
        
        <h2>Documents</h2>
        <ol>
        {for $title in $titles
           return
               <li>{$title}</li>
         }
         </ol>
         
         <h2>Word-Document Matrix</h2>
         <table border="1">
            <thead>
               <tr>
               <th>Word</th>
               {for $doc at $count in $titles
                       return
                          <th>{$count}</th>
                    }
               </tr>
            </thead>
             {for $word in $concept-words
             return
                 <tr>
                    <td>{$word}</td>
                    {for $title in $titles
                       return
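                          <td>{(: count whole-word matches; contains() would also match substrings :)
                               let $occurrences := count(tokenize(normalize-space($title), '\s+')[. = $word])
                               return
                                  if ($occurrences gt 0)
                                  then ($occurrences div $total-word-count)
                                  else (' ')}</td>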
                    }
                 </tr>
             }
          </table>
    </body>
</html>

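Creating the Sigma Values

The SVD factors the word-document matrix into three matrices. Sigma is the middle one: a diagonal matrix of singular values that is multiplied between the word vectors and the document vectors.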

[Word Document Matrix] = [Word Vectors] X [Sigma Values] X [Document Vectors]
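
In conventional matrix notation the same factorization is usually written as follows. The symbols are not in the original text and are assumed here: A is the word-document matrix with m words and n documents (14 and 7 in the sample), r is its rank, the columns of U are the word vectors, and the rows of V^T are the document vectors.

```latex
A_{m \times n} = U_{m \times r} \, \Sigma_{r \times r} \, V^{T}_{r \times n},
\qquad
\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_r),
\qquad
\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0
```

Latent semantic indexing typically keeps only the k largest singular values and the matching columns of U and rows of V^T, and then compares documents in that reduced space.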