Web Application Security Guide/XML and internal data escaping

XML and internal data escaping

Escaping is required in internal data representations, too. For example, incorrectly escaped strings in XML could allow the attackers to close their including tag and inject arbitrary XML.

XML is a very complex format which can bear many unpleasant surprises.

To prevent this type of attack

  • Avoid XML if possible.
  • For XML, use well-tested, high-quality libraries, and pay close attention to the documentation. Know your library – some libraries have functions that allow you to bypass escaping without knowing it.
  • If you parse (read) XML, ensure your parser does not attempt to load external references (e.g. entities and DTDs).
  • For other internal representations of data, make sure correct escaping or filtering is applied. Try to use well-tested, high-quality libraries if available, even if it seems to be more difficult.
  • If escaping is done manually, ensure that it handles null bytes, unexpected charsets, invalid UTF-8 characters etc. in a secure manner.

Rationale

XML is a highly complex format with many surprising features - did you know that XML can load other content via HTTP? If you just want to store/pass a few structured values, the powerful features of XML are often unnecessary. JSON is a less complex alternative, but requires its own safety measures (like avoiding arrays at top level and hex-encoding special characters that may be interpreted by broken browsers).

XML is too complex to “just quickly” write code that handles all possibilities correctly and safely. Do not rely on the security of “home-made” minimal libraries. Even some “official” XML libraries are known to have escaping issues in some functions or to explicitly allow content to be passed into the XML without escaping. (Notably the addChild method in PHP’s SimpleXML does partial escaping, see comments for PHP bug 36795) Libraries can contain critical issues, too. Read the documentation of your library carefully and consider searching the internet for known issues. If you are not sure, quickly test at least some basic cases.

XML has features that allow loading of external data like entities and DTDs. Some parsers enable this by default. If you parse untrusted XML files (remember, everything that comes from a user is untrusted), this may be used to read local files, make requests to internal systems not accessible from outside the firewall, and in some cases, even execute code. See OWASP article for details.

Doing escaping manually is very difficult to do correctly, as all problematic cases (e.g. partial UTF8 characters or different charsets) need to be considered. Writing a solution that works correctly with regular input may be fast and easy, but writing a solution that works correctly with any intentionally malformed input is difficult.