XQuery/Tag Cloud

Counting Words

A tag cloud (called a weighted list in visual design) is a visual depiction of user-generated tags, or simply of the words that occur on a site. It is typically used to summarize the content of a web site.

One method of creating a tag cloud is to create a list of the words in a document, count the number of occurrences of each word, and depict the more frequently occurring words with a larger font size than the words that occur less frequently.

Counting the total number of words in a text object

To get a feeling for one of the basic techniques, let's examine Jon Robie's code. It takes all of the text nodes in a document, joins them into a single string, splits that string into a sequence of "words" (tokenizing on runs of whitespace, the punctuation characters , . ! : ; or the literal text nbsp;), and counts the resulting words:

let $txt := string-join( $doc//text() , " ")
return
  count(tokenize($txt,'(\s|[,.!:;]|[n][b][s][p][;])+'))

Note that the string-join() function here takes an input sequence and returns a single string in which the items are separated by a single space (the second argument of string-join()).
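To see why the separator matters, here is a minimal sketch that runs string-join() over a small, made-up document standing in for $doc (the element names are illustrative only). With an empty separator, the last word of the title and the first word of the paragraph would fuse into "cloudsCount"; the single space keeps them apart so that tokenize() can split them later.

```xquery
(: A small, made-up document standing in for $doc :)
let $doc :=
  <page>
    <title>Tag clouds</title>
    <p>Count the <b>words</b>, then scale them.</p>
  </page>
return
  <joined>
    <with-space>{string-join($doc//text(), " ")}</with-space>
    <without>{string-join($doc//text(), "")}</without>
  </joined>
```

The joined string may contain doubled spaces where a text node already ends in whitespace. That is harmless here, because the tokenizing pattern ends in + and treats a run of separators as one.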

If you want to see what this routine treats as a "word" in your document, use the following variation.

let $txt := string-join( $doc//text() , " ")
let $words := tokenize($txt,'(\s|[,.!:;]|[n][b][s][p][;])+')
return   
  <words count="{count($words)}">
     { for $word in $words return <word>{$word}</word> }
  </words>

Another variation is the word-count() function found at xqueryfunctions.com:

declare function local:word-count( $arg as xs:string? )  as xs:integer {       
   count(tokenize($arg, '\W+')[. != ''])
 } ;

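This version splits on the \W+ regular expression, which matches runs of one or more non-word characters (punctuation, whitespace and other separators, and control characters), so letters and digits stay inside the tokens. The [. != ''] predicate discards the empty tokens that tokenize() returns when the string begins or ends with a separator.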
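The effect of the [. != ''] predicate can be checked with a short sketch. The sample string is made up for illustration; it starts with spaces and ends with punctuation, so tokenize() yields an empty token at each end.

```xquery
declare function local:word-count($arg as xs:string?) as xs:integer {
  count(tokenize($arg, '\W+')[. != ''])
};

(: Made-up sample with a leading and a trailing separator :)
let $sample := "  Hello, world -- hello again!"
return
  <counts>
    <!-- empty first and last tokens are counted: 6 -->
    <raw>{count(tokenize($sample, '\W+'))}</raw>
    <!-- empty tokens are filtered out: 4 -->
    <filtered>{local:word-count($sample)}</filtered>
  </counts>
```

Note that the count is case-sensitive in the sense that "Hello" and "hello" are separate tokens; lower-case() could be applied before tokenizing if they should be treated as the same word.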

Counting Keywords

Kurt Cagle suggested the following XQuery for counting keywords:

declare namespace xqwb="http://xquery.wikibooks.org";

declare function xqwb:word-count($wordlist as element() ) as element() {
<terms>
   {for $term in distinct-values($wordlist/term)
    let $term-count := count($wordlist/term[.  = $term])
    return 
     <term count="{$term-count}">{$term}</term>
   }
</terms>
};

let $keywords := 
<keywords>
   <term>red</term>
   <term>green</term>
   <term>red</term>
   <term>blue</term>
   <term>violet</term>
   <term>red</term>
   <term>blue</term>
   <term>blue</term>
   <term>red</term>
   <term>orange</term>
   <term>green</term>
   <term>yellow</term>
   <term>indigo</term>
   <term>red</term>
</keywords>

let $result := xqwb:word-count($keywords)
return $result


This Returns the Following

<terms>
    <term count="5">red</term>
    <term count="2">green</term>
    <term count="3">blue</term>
    <term count="1">violet</term>
    <term count="1">orange</term>
    <term count="1">yellow</term>
    <term count="1">indigo</term>
</terms>
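The function above rescans the whole term list once for every distinct value. As an alternative sketch, and assuming a processor that supports the group by clause introduced in XQuery 3.0 (older processors may not), the same counts can be produced in a single pass. The shortened keyword list is for illustration only, and the ordering by descending count is an added choice, not part of the original query.

```xquery
let $keywords :=
  <keywords>
    <term>red</term>
    <term>green</term>
    <term>red</term>
    <term>blue</term>
  </keywords>
return
  <terms>
  {
    for $t in $keywords/term
    let $name := string($t)
    (: after grouping, $t holds every term element in the group :)
    group by $name
    order by count($t) descending, $name
    return <term count="{count($t)}">{$name}</term>
  }
  </terms>
```

For this input the result lists red with a count of 2, followed by blue and green with a count of 1 each.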

Creating a Tag Cloud

From this you can create a tag cloud or word density map, such as the "Popular Tags" page on the Flickr web site:

declare namespace xqwb="http://xquery.wikibooks.org";
declare option exist:serialize "method=xhtml media-type=text/html indent=yes";

declare function xqwb:word-count($wordlist as element() ) as element() {
<terms>
   {for $term in distinct-values($wordlist/term)
    let $term-count := count($wordlist/term[.  = $term])
    return 
       <term count="{$term-count}">{$term}</term>
   }
 </terms>
};


let $keywords := 
<keywords>
   <term>red</term>
   <term>green</term>
   <term>red</term>
   <term>blue</term>
   <term>violet</term>
   <term>red</term>
   <term>blue</term>
   <term>blue</term>
   <term>red</term>
   <term>orange</term>
   <term>green</term>
   <term>yellow</term>
   <term>indigo</term>
   <term>red</term>
</keywords>

let $result := xqwb:word-count($keywords)
let $total := count($keywords/term)
let $scale := 20

return 
 <div>
  {
  for $term in $result/term
  let $fontSize := round( $term/@count div $total * 100 * $scale)
  order by $term
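  (: concat() keeps the trailing space; a bare space before the end tag is boundary whitespace and is stripped by default :)
  return <span style="font-size:{$fontSize}%">{concat($term, ' ')}</span>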
  }
 </div>
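With the formula above, the font size depends on each term's share of the total, so a dominant term can become very large (red is 5 of 14 terms, which gives 714%). A common variation, sketched here under assumed limits, maps the counts linearly onto a fixed range instead. The names $min-size and $max-size and their values are hypothetical, and the input reuses part of the word-count result shown earlier.

```xquery
let $terms :=
  <terms>
    <term count="5">red</term>
    <term count="2">green</term>
    <term count="3">blue</term>
    <term count="1">violet</term>
  </terms>
let $min-size := 80    (: hypothetical smallest font size, in percent :)
let $max-size := 300   (: hypothetical largest font size, in percent :)
let $counts := for $c in $terms/term/@count return xs:integer($c)
let $lo := min($counts)
let $hi := max($counts)
return
  <div>
  {
    for $term in $terms/term
    let $count := xs:integer($term/@count)
    let $fontSize :=
      (: avoid dividing by zero when every term has the same count :)
      if ($hi = $lo) then $min-size
      else round($min-size + ($count - $lo) div ($hi - $lo) * ($max-size - $min-size))
    order by string($term)
    (: concat() adds a trailing space that survives boundary-whitespace stripping :)
    return <span style="font-size:{$fontSize}%">{concat($term, ' ')}</span>
  }
  </div>
```

Here the least frequent term (violet) is rendered at 80% and the most frequent (red) at 300%, with blue at 190% and green at 135% in between.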


Last modified on 21 February 2011, at 15:12