Converting MS Word documents into instances of a specified SGML or XML DTD is a complex task. You will need the following:
- An SGML or XML document type definition (DTD) that serves as the structure model for the output. The output SGML document is said to be valid against the specified DTD, or to be an instance of this DTD.
- A Word style sheet that holds paragraph and character styles matching the structures in the DTD. For example, if the DTD defines a structure for Author, you have to find corresponding expressions in Word:
- paragraph styles: author
- character styles (to be used only within an author paragraph): firstname, surname, title
- You will need some kind of a configuration file that allows the mapping of the DTD elements into Word elements and vice versa.
- You will need an SGML or XML parser to check the output SGML/XML document against the DTD.
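The idea of a mapping between Word styles and DTD elements can be sketched as follows. This is only an illustration, not the format of any real configuration file: the style names come from the Author example above, while the element names, the `paragraph_to_xml` function, and the sample data are assumptions.

```python
from xml.sax.saxutils import escape

# Hypothetical mapping tables: Word style name -> DTD element name.
# The style names follow the Author example; the element names are assumed.
PARAGRAPH_STYLES = {"author": "author"}
CHARACTER_STYLES = {"firstname": "firstname", "surname": "surname", "title": "title"}


def paragraph_to_xml(paragraph_style, runs):
    """Render one styled paragraph as a single element.

    runs is a list of (character_style_or_None, text) pairs.
    """
    element = PARAGRAPH_STYLES[paragraph_style]
    parts = []
    for char_style, text in runs:
        if char_style is None:
            # Text without a character style stays plain content.
            parts.append(escape(text))
        else:
            child = CHARACTER_STYLES[char_style]
            parts.append(f"<{child}>{escape(text)}</{child}>")
    return f"<{element}>{''.join(parts)}</{element}>"


if __name__ == "__main__":
    sample = [("title", "Dr."), (None, " "), ("firstname", "Ada"),
              (None, " "), ("surname", "Example")]
    print(paragraph_to_xml("author", sample))
```

A paragraph or character style that is missing from the tables raises a `KeyError`, which mirrors the requirement that every style used in the document must have a counterpart in the DTD.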
Often the conversion is done with a plug-in that works directly inside MS Word. Other tools use RTF (Rich Text Format), Microsoft's exchange format, as the basis for conversion. These tools interpret the RTF file, in which the MS Word styles are still encoded, and export it as an SGML document. This process mostly runs in batch mode, with little use of graphical user interfaces.
Within the following paragraphs we describe several approaches:
1. Approach of the Université de Montréal, Université de Lyon 2, Universidad de Chile
2. Humboldt-University Berlin and the Germany-wide Dissertation Online project
Other approaches are in development as well, especially in Scandinavia and at the University of Oslo, Norway. We do not cover their solutions yet.
Conversion method of the CyberThèses project
The processing chain for converting Word files into SGML documents developed within the CyberThèses project uses scripts written in the Omnimark language.
The input of the process line is an RTF file with a "structuring style sheet" and the output is an SGML document encoded according to the TEI Lite DTD (see the TEI web site at http://etext.virginia.edu/TEI.html).
The conversion process consists of three main steps:
- The first step converts the RTF file into a flat XML file encoded according to a DTD of RTF. The resulting file is a linear sequence of paragraph elements, each with an explicit "style name" attribute corresponding to the RTF style name.
- The second step regenerates the hierarchical and logical structure of the document by analysing the style name attributes.
- Finally, an SGML parser validates the conformity of the resulting SGML document with the TEI Lite DTD.
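The second step, rebuilding a hierarchy from a flat sequence of styled paragraphs, can be sketched as below. This is not the CyberThèses Omnimark code: the style names ("Heading 1", "Heading 2", "Normal"), the element names (`body`, `div`, `head`, `p`), and the function names are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET


def heading_level(style):
    """Return the level for assumed styles like 'Heading 2', else None."""
    prefix = "Heading "
    suffix = style[len(prefix):]
    if style.startswith(prefix) and suffix.isdigit() and int(suffix) >= 1:
        return int(suffix)
    return None


def build_tree(flat_paragraphs):
    """Turn a list of (style_name, text) pairs into a nested element tree."""
    body = ET.Element("body")
    stack = [(0, body)]  # (heading level, currently open element)
    for style, text in flat_paragraphs:
        level = heading_level(style)
        if level is None:
            # Ordinary paragraph: attach to the innermost open division.
            ET.SubElement(stack[-1][1], "p").text = text
            continue
        # A heading closes every open division at the same or a deeper level.
        while stack[-1][0] >= level:
            stack.pop()
        div = ET.SubElement(stack[-1][1], "div")
        ET.SubElement(div, "head").text = text
        stack.append((level, div))
    return body


if __name__ == "__main__":
    flat = [("Heading 1", "Intro"), ("Normal", "Text a"),
            ("Heading 2", "Details"), ("Normal", "Text b"),
            ("Heading 1", "Results"), ("Normal", "Text c")]
    print(ET.tostring(build_tree(flat), encoding="unicode"))
```

The stack keeps track of the divisions that are still open, so the nesting depth follows the heading levels rather than the order of the paragraphs alone.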
Some supplementary scripts then allow the export of the SGML document towards other formats (HTML, XML).
Most of the scripts are available from the CyberThèses web site: http://www.cybertheses.org
This system is tailored to one particular DTD, but generalizing it to other document models should not pose major difficulties.
Using SGML Author for Word (Humboldt-University Berlin)
Why did we use the SGML Author for Word?
The "Dissertation Online" project implemented and refined a conversion strategy for converting documents written in MS Word with a special style sheet (dissertation.dot) into SGML instances of the DiM.dtd.
We used this Microsoft product, SGML Author for Word, for several reasons:
- SGML Author is quite easy to configure
- It is easy to use.
- It is less expensive than other software producing SGML files with the same quality.
- It supports an international standard for tables: CALS.
- As it is a Word add-on, it handles documents in the MS Word doc format better than other tools.
- When we started using this technology in 1997, it already supported Word 97, the current version of Word at that time.
Unfortunately, Microsoft did not continue the development of this tool, so there are no new versions available for Office 2000, Office XP, or Office 2007. However, for the purposes of conversion into SGML, the internal document formats of MS Word 97, MS Word 2000, Office XP, and Office 2007 are the same. This means that documents written in Word 2000, Office XP, or Office 2007 can be imported into Word 97, and the conversion can then be done there.
For a successful conversion from a Word document into a DiML document you will need:
- The DiML-document type definition (diml20.dtd, calstb.dtd)
- The SGML-Author for Word97 (it may no longer be available from Microsoft shops, but NDLTD, especially Prof. Dr. Edward Fox, may provide English versions of it that work with English Word)
- The Association file for the Microsoft SGML-Author for Word (diml20.dta)
- The converter style sheet, which consists of several macros programmed to make the preconversion process easier.
- The perl programming language (free Software)
- The nsgmls-Parser (free Software)
- Several perl scripts to correct the transformation of tables.
You must have the following software installed on your computer:
- SP (NSGMLS) (Parser for SGML files by James Clark). (Newer versions are available at http://openjade.sourceforge.net/doc-1.4/index.htm, but we haven’t tested them.)
- Run SP (A WYSIWYG tool for SP by Richard Light). http://www.light.demon.co.uk/runsp/index.htm
- Perl (a scripting language for using the perl scripts).
The converter style sheet and the author’s style sheet can be obtained from the following website: http://dochost.rz.hu-berlin.de/epdiss/vorlage.html
Converter scripts and Perl scripts can be obtained from http://www.educat.hu-berlin.de/diss_online/software/tools.exe (Perl scripts, DTD and converter file for MS SGML-Author for Word - KonverterDiML2_0.dta)
The conversion from a Microsoft Word document into an SGML document that is an instance of the DiML.dtd used at Humboldt-University takes several steps:
1. Step: Preparing the conversion without using the converter Microsoft SGML Author for Word directly
Check that the styles have been used correctly.
Load the style sheet for conversion (NOT the one for the authors); see figure below.
Page numbers can be taken from the Word document by means of certain Word-specific text anchors. These have to be converted into hard-coded information using a page number style sheet.
Formatting that the author applied without using style sheets has to be replaced by the correct style sheets.
So that tables display correctly later when CSS style sheets are used in common browsers, empty table cells have to be filled with a single space character.
Soft line breaks have to be preserved for the conversion. This is done by inserting the special character sequence #BR# in their place, which is later replaced by a special SGML tag for soft line breaks.
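The last two preparation tasks can be illustrated with a small sketch. It is not the macro code of the converter style sheet: it assumes that a soft line break appears as a vertical tab character in the extracted text, and the tag `<br>` is a hypothetical stand-in for whatever element the DiML DTD actually defines. Only the marker #BR# is taken from the description above.

```python
SOFT_BREAK = "\x0b"   # assumption: soft line break exported as a vertical tab
PLACEHOLDER = "#BR#"  # marker named in the description above
BREAK_TAG = "<br>"    # hypothetical tag; the real DiML element may differ


def protect_soft_breaks(text):
    """Before conversion: replace each soft line break by the marker."""
    return text.replace(SOFT_BREAK, PLACEHOLDER)


def restore_soft_breaks(sgml):
    """After conversion: replace each marker by the line-break tag."""
    return sgml.replace(PLACEHOLDER, BREAK_TAG)


def fill_empty_cells(rows):
    """Give every empty table cell a single space so it is not dropped."""
    return [[cell if cell != "" else " " for cell in row] for row in rows]


if __name__ == "__main__":
    marked = protect_soft_breaks("first line\x0bsecond line")
    print(marked)
    print(restore_soft_breaks(marked))
    print(fill_empty_cells([["a", ""], ["", "d"]]))
```

Using a plain-text marker means the line break survives a converter that would otherwise discard or normalise the control character.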
2. Step: Converting with Microsoft SGML Author for Word
Press the button "Save as SGML" within the FILE menu.
Load the converter file KonverterDiML2_0.DTA
Check the XML/SGML output using the feedback file (fbk) see figure below.
3. Step: Work through the output file (output according to the DiML.dtd) automatically.
Run the Perl scripts using the batch file preprocessor.bat
Parse the DiML file
Errors have to be corrected manually
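The parsing step can be scripted so that the parser messages are collected in one place before the manual correction. The following sketch assumes that nsgmls is on the search path and that its messages follow the layout program:file:line:column:type: text; both the layout and the helper names `parse_messages` and `validate` are assumptions, not part of the project's tools. The `-s` option, which suppresses the normal output so that only messages are reported, is the only parser option used.

```python
import re
import subprocess

# Assumed message layout: program:file:line:column:type: text
# The type letter (for example E for error) may be missing on follow-up lines.
MESSAGE = re.compile(
    r"^(?P<prog>[^:]+):(?P<file>.+?):(?P<line>\d+):(?P<col>\d+):"
    r"(?:(?P<type>[A-Z]):)?\s*(?P<text>.*)$"
)


def parse_messages(stderr_text):
    """Turn the parser's message lines into a list of dictionaries."""
    messages = []
    for raw in stderr_text.splitlines():
        match = MESSAGE.match(raw)
        if match is None:
            continue  # ignore lines that do not follow the assumed layout
        messages.append({
            "file": match.group("file"),
            "line": int(match.group("line")),
            "column": int(match.group("col")),
            "type": match.group("type"),
            "text": match.group("text"),
        })
    return messages


def validate(path, executable="nsgmls"):
    """Run the parser on one file and return its messages."""
    result = subprocess.run([executable, "-s", path],
                            capture_output=True, text=True)
    return parse_messages(result.stderr)


if __name__ == "__main__":
    sample = 'nsgmls:thesis.sgm:12:5:E: element "FOO" undefined\n'
    for message in parse_messages(sample):
        print(message["line"], message["column"], message["text"])
```

Sorting the collected messages by line number gives a simple work list for correcting the file by hand.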
4. Step: Transforming the DiML file into an HTML file
Run the Perl scripts using the batch file did2html.bat
Check the HTML Output.
Correct possible errors manually within the SGML file and repeat the transformation.
A demonstration QuickTime video may be found at the ETD-Guide server as well (see http://www.educat.hu-berlin.de/diss_online/software/didi.mov).
Text editors, Desktop Publishing Systems that can export SGML/XML documents
Tools that export using a user-specified DTD:
- WordPerfect since Version 7.0 (Corel http://www.corel.com )
- FrameMaker+SGML6.0 (Adobe) (http://www.adobe.com )
Tools that export using their own native DTD:
- Openoffice (SUN/open source ) (http://www.openoffice.org )
- AbiWord (AbiWord/ open source) (http://www.abisource.com )
- Kword (KOffice, KDE Project/ open source) (http://www.kde.org )
Omnimark (Omnimark) (http://www.omnimark.com )
MarkupKit (Schema) (http://www.schema.de )
Majix (Tetrasix) (http://www.tetrasix.com )
TuSTEP (RZ Uni Tübingen) (http://www.uni-tuebingen.de/zdv/tustep/index.html)