CAT-Tools/DéjàVu X/Import from Microsoft Word

4. Import from Microsoft Word

Added: --Loek van Kooten 15:54, 6 February 2007 (UTC)

Version: 7.0.284

System: Windows XP SP2

Description: After importing Japanese Word documents, the text in DVX is infested with rogue codes. Japanese text often mixes Chinese ideograms (kanji) with the Roman alphabet and Arabic numerals. Since the Roman letters and Arabic numerals in Japanese fonts are quite ugly, most Japanese texts in Word use two different fonts (one for kanji and one for Roman letters and Arabic numerals), which alternate constantly. Each alternation results in a rogue code.
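To give a feel for how quickly these alternations add up, here is a minimal Python sketch that splits a string into the runs a two-font Word document would format separately. The helper names `font_runs` and `count_alternations` are made up for illustration, and the rule "ASCII uses the Roman font, everything else uses the Japanese font" is a simplifying assumption, not how Word or DVX actually decides.

```python
from itertools import groupby


def font_runs(text):
    """Split text into consecutive runs that would share one font.

    Assumption for illustration only: ASCII characters get the Roman font,
    all other characters get the Japanese font.
    """
    return ["".join(group) for _, group in groupby(text, key=lambda ch: ch.isascii())]


def count_alternations(text):
    """Number of font switches, i.e. boundaries between adjacent runs."""
    return max(len(font_runs(text)) - 1, 0)


if __name__ == "__main__":
    sample = "平成18年12月19日"
    print(font_runs(sample))
    print(count_alternations(sample))
```

Under that assumption, a short date such as 平成18年12月19日 already consists of seven runs, so a single date can carry six font switches.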

The problem is that if the target text is in a Western language, these rogue codes make no sense at all: the target text contains no kanji and is therefore written in a single font (the one used for Roman letters and Arabic numerals). In other words, these rogue codes are completely superfluous in the target text, slow down the translation process and pollute your databases.

When translating from Japanese to a Western language, DVX should therefore ignore all font changes and use a single standard font instead (such as Arial, or perhaps a user-defined one). Automatically recognizing the fonts used in the Japanese source text seems very complicated, as "font sets" (consisting of a Japanese and a Roman font) might be used and "paired" inconsistently.

It seems that many rogue codes are actually caused not by alternating fonts but by alternating code sets, as even Word documents that have been forced to use one font still produce inconsistent rogue codes in DVX:

{\loch\af25\hich\af25\dbch\af25 \loch\af25\hich\af25\dbch\f25 平成}{\loch\af25\hich\af25\dbch\af25 \hich\af25\dbch\af25\loch\f25 18年12月19日}

instead of merely 平成18年12月19日
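As a rough illustration of the cleanup being asked for, the Python sketch below strips the control words and group braces from the fragment quoted above. It is not a DVX feature and not a full RTF parser: the function name `strip_rtf_codes` is hypothetical, and the pattern only covers plain control words like those in the example (it would mishandle escaped braces, `\'hh` hex escapes and other RTF constructs).

```python
import re

# A control word is a backslash, letters, an optional (possibly negative)
# number, and an optional single space that acts as its delimiter.
CONTROL_WORD = re.compile(r"\\[a-zA-Z]+-?\d* ?")

SAMPLE = (
    r"{\loch\af25\hich\af25\dbch\af25 \loch\af25\hich\af25\dbch\f25 平成}"
    r"{\loch\af25\hich\af25\dbch\af25 \hich\af25\dbch\af25\loch\f25 18年12月19日}"
)


def strip_rtf_codes(fragment):
    """Remove simple control words and group braces (illustrative only)."""
    without_controls = CONTROL_WORD.sub("", fragment)
    return re.sub(r"[{}]", "", without_controls)


if __name__ == "__main__":
    print(strip_rtf_codes(SAMPLE))
```

The delimiter space is removed together with its control word, so no stray spaces are left between 平成 and 18.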

The rogue codes are inconsistent: if a code appears after 平成 (where the text changes from kanji to Arabic numerals), you'd expect a similar code after 年, 月 or 日, but this is not the case.

Reproduction:

Workaround:


Status:

Not read by Atril yet.


Comments from other users: