XQuery/Histogram of File Sizes

Motivation edit

You have a large number of files in one or more collections and you want to generate a histogram chart that shows the relative distribution of the number of files in each size range.

Method edit

We will use the xmldb:size() function to generate a list of all the file sizes in a given collection. We can then transform this list into a series of strings that can be passed to the Google Charts line chart function.

The format of the size function is the following:

 xmldb:size($collection, $document-name)

This function returns the number of bytes in a file.


Our first step is to create a sequence of numbers that represents all of the sizes of resources in a collection:

 let $sizes :=
  for $file in xmldb:get-child-resources($collection)
     let $size := xmldb:size($collection, $file)
     return
        $size

This can also be done with a combination of the collection() function and the util:document-name() function:

  let $sizes :=
     for $file in collection($collection)/*
        let $name := util:document-name($file)
        let $size := xmldb:size($collection, $name)
        return
           $size

Sample Program edit

xquery version "1.0";

declare option exist:serialize "method=xhtml media-type=text/html";

(: Put the collection you want to analyze here :)
let $collection := "/db/test"

(: How many bytes in each section :)
let $increment := 10000

(: How many divisions :)
let $divisions := 20

(: Color for the lines or the bars in RRGGBB :)
let $color := '0000FF'

(: For vertical bar chart use 'bvs', for line chart use 'lc', for spark line (no axis) use 'ls' :)
let $chart-type := 'bvs'

(: this is the max size of a google chart - 30K pixels.  The first number is the width, the second is the height.  :)
let $chart-size := '600x500'

let $uriapi := 'http://chart.apis.google.com/chart?'

let $sizes :=
   for $file in xmldb:get-child-resources($collection)
      let $size := xmldb:size($collection, $file)
      return
         $size

(: the raw data counts for each range.  The 't' is just a marker that it is true that we are in this range. :)
let $raw-data :=
    for $range in (0 to $divisions)
       let $min := $range * $increment
       let $max := ($range + 1) * $increment
       return
          count(
               for $number in $sizes
               return
                 if ($number gt $min and $number lt $max)
                 then ('t') else ()
              )

let $max-value := max($raw-data)

(: scale to the max height :)
let $scaled-data := 
   for $num in $raw-data
   return string(floor($num div ($max-value div 500)))

(: join the strings with commas to get a comma separated list :)
let $data-csv := string-join($scaled-data, ',')

(: construct the URL :)
let $chart-uri := concat($uriapi, 'cht=', $chart-type, '&chs=', $chart-size, '&chco=', $color, '&chd=t:', $data-csv)

(: return the results in an HTML page :)
<html>
<head><title>Google Chart Histogram View of {$collection}</title></head>
<body>
<h1>Google Chart Histogram View of {$collection}</h1>
<p><img src="{request:encode-url(xs:anyURI($chart-uri))}"/></p>
</body>
</html>

Sample Result edit

http://chart.apis.google.com/chart?cht=ls&chs=500x500&chco=0000FF&chd=t:83,6,13,37,85,414,500,87,41,31,11,16,9,12,5,7,4,4,3,1,1

Discussion edit

To run the query you will need to customized the name of the collection that you are analyzing. After you run the query you can check to make sure the results are what you expect and then copy the results into a browser URL.

Note that if there are files over the max size indicated in the top range, an additional count of these file sizes should be added.

 let $top-range := $increment * ($divisions + 1)
 let $top-count := count(
           for $num in $sizes
           return
              if ($num > $overflow)
                then ('t')
                else ()
       )

This query could also be parametrized using the get-parameter() function so that many of the parameters that are passed to the Google chart can also be set as a parameter in the XQuery of the URL.