Working with XML

Converting XML into lists

XML files are widely used these days, and you might have noticed that the highly organized tree structure of an XML file is similar to the nested list structures that we've met in newLISP. So wouldn't it be good if you could work with XML files as easily as you can work with lists?

You've met the two main XML-handling functions already. (See ref and ref-all.) The xml-parse and xml-type-tags functions are all you need to convert XML into a newLISP list. (xml-error is used for diagnosing errors.) xml-type-tags determines how the XML tags are processed by xml-parse, which does the actual processing of the XML file, turning it into a list.

To illustrate the use of these functions, we'll use the RSS newsfeed from the newLISP forum:

(set 'xml (get-url "http://newlispfanclub.alh.net/forum/feed.php"))

and store the retrieved XML in a file, to save hitting the server repeatedly:

(save {/Users/me/Desktop/newlisp.xml} 'xml)  ; save symbol in file
(load {/Users/me/Desktop/newlisp.xml})       ; load symbol from file

The XML starts like this:

<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en-gb">
<link rel="self" type="application/atom+xml" href="http://newlispfanclub.alh.net/forum/feed.php" />

<title>newlispfanclub.alh.net</title>
<subtitle>Friends and Fans of NewLISP</subtitle>
<link href="http://newlispfanclub.alh.net/forum/index.php" />
<updated>2010-01-11T09:51:39+00:00</updated>

<author><name><![CDATA[newlispfanclub.alh.net]]></name></author>
<id>http://newlispfanclub.alh.net/forum/feed.php</id>
<entry>
<author><name><![CDATA[kosh]]></name></author>
<updated>2010-01-10T12:17:53+00:00</updated>
...

If you use xml-parse to parse the XML, but don't use xml-type-tags first, the output looks like this:

(xml-parse xml)
(("ELEMENT" "feed" (("xmlns" "http://www.w3.org/2005/Atom") 
  ("xml:lang" "en-gb")) 
  (("TEXT" "\n") ("ELEMENT" "link" (("rel" "self") 
   ("type" "application/atom+xml") 
     ("href" "http://newlispfanclub.alh.net/forum/feed.php")) 
    ()) 
   ("TEXT" "\n\n") 
   ("ELEMENT" "title" () (("TEXT" "newlispfanclub.alh.net"))) 
   ("TEXT" "\n") 
   ("ELEMENT" "subtitle" () (("TEXT" "Friends and Fans of NewLISP"))) 
   ("TEXT" "\n") 
   ("ELEMENT" "link" (("href" "http://newlispfanclub.alh.net/forum/index.php")) ()) 
; ...

Although it looks a bit LISP-y already, you can see that the elements have been labelled as "ELEMENT", "TEXT". For now, we could do without all these labels, and that's basically the job that xml-type-tags does. It lets you determine the labels for four types of XML tag: TEXT, CDATA, COMMENTS, and ELEMENTS. We'll hide them altogether with four nils. We'll also use some options for xml-parse to tidy up the output even more.

(xml-type-tags nil nil nil nil)
(set 'sxml (xml-parse xml 15))        ; options: 15 (see below)
;->
((feed ((xmlns "http://www.w3.org/2005/Atom") (xml:lang "en-gb")) (link ((rel "self") 
    (type "application/atom+xml") 
    (href "http://newlispfanclub.alh.net/forum/feed.php"))) 
  (title "newlispfanclub.alh.net") 
  (subtitle "Friends and Fans of NewLISP") 
  (link ((href "http://newlispfanclub.alh.net/forum/index.php"))) 
  (updated "2010-01-11T09:51:39+00:00") 
  (author (name "newlispfanclub.alh.net")) 
  (id "http://newlispfanclub.alh.net/forum/feed.php") 
  (entry (author (name "kosh")) 
   (updated "2010-01-10T12:17:53+00:00") 
   (id "http://newlispfanclub.alh.net/forum/viewtopic.php?t=3447&amp;p=17471#p17471") 
   (link ((href "http://newlispfanclub.alh.net/forum/viewtopic.php?t=3447&amp;p=17471#p17471"))) 
   (title ((type "html")) "newLISP in the real world \226\128\162 Re: suggesting the newLISP manual as CHM") 
   (category ((term "newLISP in the real world") 
    (scheme "http://newlispfanclub.alh.net/forum/viewforum.php?f=16") 
     (label "newLISP in the real world"))) 
   (content ((type "html") 
   (xml:base "http://newlispfanclub.alh.net/forum/viewtopic.php?t=3447&amp;p=17471#p17471")) 
    "\nhello kukma.<br />\nI tried to make the newLISP.chm, and this is it.<br />\n
  <br />\n<!-- m --><a class=\"postlink\" 
; ...

This is now a useful newLISP list, albeit quite a complicated one, stored in a symbol called sxml. (This way of representing XML is called S-XML.)

If you're wondering what that 15 was doing in the xml-parse expression, it's just a way of controlling how much of the auxiliary XML information is translated: the options are as follows:

1 - suppress whitespace text tags

2 - suppress empty attribute lists

4 - suppress comment tags

8 - translate string tags into symbols

16 - add SXML (S-expression XML) attribute tags

You add them up to get the options code number - so 15 (+ 1 2 4 8) uses the first four of these options: suppress unwanted stuff, and translate strings tags to symbols. As a result of this, new symbols have been added to newLISP's symbol table:

(author category content entry feed href id label link rel scheme subtitle term 
 title type updated xml:base xml:lang xmlns)

These correspond to the string tags in the XML file, and they'll be useful almost immediately.

Now what?

The story so far is basically this:

(set 'xml (get-url "http://newlispfanclub.alh.net/forum/feed.php"))
; we stored this in a temporary file while exploring
(xml-type-tags nil nil nil nil)
(set 'sxml (xml-parse xml 15))

which has given us a list version of the news feed stored in the sxml symbol.

Because this list has a complicated nested structure, it's best to use ref and ref-all rather than find to look for things. ref finds the first occurrence of an expression in a list and returns the address:

(ref 'entry sxml)
;-> (0 9 0)

These numbers are the address in the list of the first occurrence of the symbol item: (0 9 0) means start at item 0 of the whole list, then go to item 9 of that item, then go to item 0 of that one. (0-based indexing, of course!)

To find the higher-level or enclosing item, use chop to remove the last level of the address:

(chop (ref 'entry sxml))
;-> (0 9)

This now points to the level that contains the first item. It's like chopping the house number off an address, leaving the street name.

Now you can use this address with other expressions that accept a list of indices. The most convenient and compact form is probably the implicit address, which is just the name of the source list followed by the set of indices in a list:

(sxml (chop (ref 'entry sxml)))            ; a (0 9) slice of sxml

(entry (author (name "kosh")) (updated "2010-01-10T12:17:53+00:00") (id "http://newlispfanclub.alh.net/forum/viewtopic.php?t=3447&amp;p=17471#p17471") 
 (link ((href "http://newlispfanclub.alh.net/forum/viewtopic.php?t=3447&amp;p=17471#p17471"))) 
 (title ((type "html")) "newLISP in the real world \226\128\162 Re: suggesting the newLISP manual as CHM") 
 (category ((term "newLISP in the real world") (scheme "http://newlispfanclub.alh.net/forum/viewforum.php?f=16") 
   (label "newLISP in the real world"))) 
 (content ((type "html") (xml:base "http://newlispfanclub.alh.net/forum/viewtopic.php?t=3447&amp;p=17471#p17471")) 
...

That found the first occurrence of entry, and returned the enclosing portion of the SXML.

Another technique available to you is to treat sections of the list as association lists:

(lookup 'title (sxml (chop (ref 'entry sxml))))
;-> 
newLISP in the real world • Re: suggesting the newLISP manual as CHM

Here we've found the first item, as before, and then looked up the first occurrence of title using lookup.

Use ref-all to find all occurrences of a symbol in a list. It returns a list of addresses:

(ref-all 'title sxml)
;-> 
((0 3 0) 
(0 9 5 0) 
(0 10 5 0) 
(0 11 5 0) 
(0 12 5 0) 
(0 13 5 0) 
(0 14 5 0) 
(0 15 5  0) 
 (0 16 5 0) 
 (0 17 5 0) 
 (0 18 5 0))

With a simple list traversal, you can quickly show all the titles in the file, at whatever level they may be:

(dolist (el (ref-all 'title sxml)) 
    (println (rest (rest (sxml (chop el))))))

;->
()
("newLISP in the real world \226\128\162 Re: suggesting the newLISP manual as CHM")
("newLISP newS \226\128\162 Re: newLISP Advocacy")
("newLISP in the real world \226\128\162 Re: newLISP-gs opens only splash picture")
("So, what can you actually DO with newLISP? \226\128\162 Re: Conception of Adaptive Programming Languages")
("Dragonfly \226\128\162 Re: Dragonfly 0.60 Released!")
("So, what can you actually DO with newLISP? \226\128\162 Takuya Mannami, Gauche Newlisp Library")
; ...

Without the two rests in there, you would have seen this:

(title "newlispfanclub.alh.net")
(title ((type "html")) "newLISP in the real world \226\128\162 Re: suggesting the newLISP manual as CHM")
(title ((type "html")) "newLISP newS \226\128\162 Re: newLISP Advocacy")
; ...

As you can see, there are many different ways to access the information in the SXML data. To produce a concise summary of the news in the XML file, one approach is to go through all the items, and extract the title and description entries. Because the description elements are a mass of escaped entities, we'll write a quick and dirty tidying routine as well:

(define (cleanup str)
 (let (replacements 
  '(({&amp;amp;}  {&amp;})
    ({&amp;gt;}   {>})
    ({&amp;lt;}   {<})
    ({&amp;nbsp;} { })
    ({&amp;apos;} {'})
    ({&amp;quot;} {"})
    ({&amp;#40;}  {(})
    ({&amp;#41;}  {)})
    ({&amp;#58;}  {:})
    ("\n"      "")))
 (and
  (!= str "")
  (map 
   (fn (f) (replace (first f) str (last f))) 
   replacements)
  (join (parse str {<.*?>} 4) " "))))

(set 'entries (sxml (chop (chop (ref 'title sxml)))))

(dolist (e (ref-all 'entry entries))
   (set 'entry (entries (chop e)))
   (set 'author (lookup 'author entry))
   (println "Author: " (last author))
   (set 'content (lookup 'content entry))
   (println "Post: " (0 60 (cleanup content)) {...}))

Author: kosh
Post: hello kukma. I tried to make the newLISP.chm, and this is it...
Author: Lutz
Post: ... also, there was a sign-extension error in the newLISP co...
Author: kukma
Post: Thank you Lutz and welcome home again, the principle has bec...
Author: Kazimir Majorinc
Post: Apparently,  Aparecido Valdemir de Freitas  completed his Dr...
Author: cormullion
Post: Upgrade seemed to go well - I think I found most of the file...
Author: Kazimir Majorinc
Post:   http://github.com/mtakuya/gauche-nl-lib   Statistics: Post...
Author: itistoday
Post: As part of my work on Dragonfly, I've updated newLISP's SMTP...
Author: Tim Johnson
Post:   itistoday wrote:     Tim Johnson wrote:  Have you done any...
; ...

Changing SXML

You can use similar techniques to modify data in XML format. For example, suppose you keep the periodic table of elements in an XML file, and you wanted to change the data about elements' melting points, currently stored in degrees Kelvin, to degrees Celsius. The XML data looks like this:

<?xml version="1.0"?>
<PERIODIC_TABLE>
  <ATOM>
  ...
  </ATOM>
  <ATOM>
    <NAME>Mercury</NAME>
    <ATOMIC_WEIGHT>200.59</ATOMIC_WEIGHT>
    <ATOMIC_NUMBER>80</ATOMIC_NUMBER>
    <OXIDATION_STATES>2, 1</OXIDATION_STATES>
    <BOILING_POINT UNITS="Kelvin">629.88</BOILING_POINT>
    <MELTING_POINT UNITS="Kelvin">234.31</MELTING_POINT>
    <SYMBOL>Hg</SYMBOL>
...

When the table has been loaded into the symbol sxml, using (set 'sxml (xml-parse xml 15)) (where xml contains the XML source), we want to change each sublist that has the following form:

(MELTING_POINT ((UNITS "Kelvin")) "629.88")

You can use the set-ref-all function to find and replace elements in one expression. First, here's a function to convert a temperature from Kelvin to Celsius:

(define (convert-K-to-C n)
 (sub n 273.15))

Now the set-ref-all function can be called just once to find all references and modify them in place, so that every melting-point is converted to Celsius. The form is:

(set-ref-all key list replacement function)

where the function is the method used to find list elements given the key.

(set-ref-all 
  '(MELTING_POINT ((UNITS "Kelvin")) *) 
  sxml 
  (list 
    (first $0) 
    '((UNITS "Celsius")) 
    (string (convert-K-to-C (float (last $0)))))
  match)

Here the match function searches the SXML list using a wild-card construction (MELTING_POINT ( (UNITS "Kelvin") ) *) to find every occurrence. The replacement expression builds a replacement sublist from the matched expression stored in $0. After this has been evaluated, the SXML changes from this:

; ...
(ATOM
    (NAME "Mercury")
    (ATOMIC_WEIGHT "200.59")
    (ATOMIC_NUMBER "80")
    (OXIDATION_STATES "2, 1")
    (BOILING_POINT
        ((UNITS "Kelvin")) "629.88")
    (MELTING_POINT
        ((UNITS "Kelvin")) "234.31")
;  ...

to this:

; ...
(ATOM
    (NAME "Mercury")
    (ATOMIC_WEIGHT "200.59")
    (ATOMIC_NUMBER "80")
    (OXIDATION_STATES "2, 1")
    (BOILING_POINT
        ((UNITS "Kelvin")) "629.88")
    (MELTING_POINT
        ((UNITS "Celsius")) "-38.84")
; ...

XML isn't always as easy to manipulate as this - there are attributes, CDATA sections, and so on.

Outputting SXML to XML

If you want to go the other way and convert a newLISP list to XML, the following function suggests one possible approach. It recursively works through the list:

(define (expr2xml expr (level 0))
 (cond 
   ((or (atom? expr) (quote? expr))
      (print (dup " " level))
      (println expr))
   ((list? (first expr))
      (expr2xml (first expr) (+ level 1))
      (dolist (s (rest expr)) (expr2xml s (+ level 1))))
   ((symbol? (first expr))
      (print (dup " " level))
      (println "<" (first expr) ">")
      (dolist (s (rest expr)) (expr2xml s (+ level 1)))
      (print (dup " " level))
      (println "</" (first expr) ">"))
   (true
    (print (dup " " level)) 
    (println "<error>" (string expr) "<error>"))))

(expr2xml sxml)

 <rss>
   <version>
   0.92
   </version>
  <channel>
   <docs>
   http://backend.userland.com/rss092
   </docs>
   <title>
   newLISP Fan Club
   </title>
   <link>
   http://www.alh.net/newlisp/phpbb/
   </link>
   <description>
     Friends and Fans of newLISP                                                         
   </description>
   <managingEditor>
   newlispfanclub-at-excite.com
   </managingEditor>
...

Which is - almost - where we started from!

A simple practical example

The following example was originally set in the shipping department of a small business. I've changed the items to be pieces of fruit. The XML data file contains entries for all items sold, and the charge for each. We want to produce a report that lists how many at each price were sold, and the total value.

Here's an extract from the XML data:

<FRUIT>
  <NAME>orange</NAME>
  <charge>0</charge>
  <COLOR>orange</COLOR>
</FRUIT>
<FRUIT>
  <NAME>banana</NAME>
  <COLOR>yellow</COLOR>
  <charge>12.99</charge>
</FRUIT>
<FRUIT>
  <NAME>banana</NAME>
  <COLOR>yellow</COLOR>
  <charge>0</charge>
</FRUIT>
<FRUIT>
  <NAME>banana</NAME>
  <COLOR>yellow</COLOR>
  <charge>No Charge</charge>
</FRUIT>

This is the main function that defines and organizes the tasks:

(define (work-through-files file-list)
 (dolist (fl file-list)
   (set 'table '())
   (scan-file fl)
   (write-report fl)))

Two functions are called: scan-file, which scans an XML file and stores the required information in a table, which is going to be some sort of newLISP list, and write-report, which scans this table and outputs a report.

The scan-file function receives a pathname, converts the file to SXML, finds all the charge items (using ref-all), and keeps a count of each value. We allow for the fact that some of the free items are marked variously as No Charge or no charge or nocharge:

(define (scan-file f)
  (xml-type-tags nil nil nil nil)
  (set 'sxml (xml-parse (read-file f) 15))
  (set 'r-list (ref-all 'charge sxml)) 
  (dolist (r r-list)
    (set 'charge-text (last (sxml (chop r))))
    (if (= (lower-case (replace " " charge-text "")) "nocharge")
        (set 'charge (lower-case charge-text))
        (set 'charge (float charge-text 0 10)))
    (if (set 'result (lookup charge table 1))
        ; if this price already exists in table, increment it
        (setf (assoc charge table) (list charge (inc result)))
        ; or create an entry for it
        (push (list charge 1) table -1))))

The write-report function sorts and analyses the table, keeping running totals as it goes:

(define (write-report fl)
 (set 'total-items 0 'running-total 0 'total-priced-items 0)
 (println "sorting")
 (sort table (fn (x y) (< (float (x 0)) (float (y 0)))))
 (println "sorted ")
 (println "File: " fl)
 (println " Charge           Quantity     Subtotal")
 (dolist (c table)
  (set 'price (float (first c)))
  (set 'quantity (int (last c)))
  (inc total-items quantity)
  (cond 
   ; do the No Charge items:
   ((= price nil)    (println (format " No charge  %12d" quantity)))
   ; do 0.00 items
   ((= price 0)      (println (format "    0.00    %12d" quantity)))
   ; do priced items:
   (true 
    (begin 
     (set 'subtotal (mul price quantity))
     (inc running-total subtotal)
     (if (> price 0) (inc total-priced-items quantity))
     (println (format "%8.2f        %8d  %12.2f" price quantity subtotal))))))
 ; totals
 (println (format "Total charged   %8d  %12.2f" total-priced-items  running-total))            
 (println (format "Grand Total     %8d  %12.2f"  total-items  running-total)))

The report requires a bit more fiddling about than the scan-file function, particularly as the user wanted - for some reason - the 0 and no charge items to be kept separate.

 Charge           Quantity     Subtotal
 No charge           138
    0.00             145
    0.11               1          0.11
    0.29               1          0.29
    1.89              72        136.08
    1.99              17         33.83
    2.99              18         53.82
   12.99              55        714.45
   17.99               1         17.99
Total charged        165        956.57
Grand Total          448        956.57

Introduction to newLISP/Working with XML

Contents