Web Application Security Guide/Cross-site scripting (XSS)

XSS vulnerabilities occur if user input included in the output of a web application is not escaped correctly. This type of vulnerability allows attackers to inject content into the web application output. This can be used to inject a false login form (reporting the input to an attacker) or malicious JavaScript code which can steal cookies and information or execute actions using the user’s permissions. XSS vulnerabilities are separated into two main categories, reflected (non-persistent) and persistent vulnerabilities.

Reflected XSS vulnerabilities include the user input only in the output directly following the request. Thus, the attacker needs the user to follow a malicious link or make a malicious POST request. The former can be done by including the link as an IFRAME; the latter can be done using JavaScript. Both variants require that the user visits a malicious or compromised site, but they do not necessarily require user interaction.

Persistent XSS vulnerabilities store the user input and include it in later output (e.g. a posting in a forum). This means that the users do not need to visit a malicious or compromised site.

To prevent this type of attack:

  • Escape anything that is not a constant before including it in a response as close to the output as possible (i.e. right in the line containing the “echo” or “print” call)
  • If not possible (e.g. when building a larger HTML block), escape when building and indicate the fact that the variable content is pre-escaped and the expected context in the name
  • Consider the context when escaping: Escaping text inside HTML is different from escaping HTML attribute values, and very different from escaping values inside CSS or JavaScript, or inside HTTP headers.
    • This may mean that you need to escape for multiple contexts and/or multiple times. For example, when passing an HTML fragment as a JS constant for later inclusion in the document, you need to escape for a JS string inside HTML when writing the constant to the JavaScript source, then escape again for HTML when your script writes the fragment to the document. (See rationale for examples)
    • The attacker must not be able to put anything where it is not supposed to be, even if you think it is not exploitable (e.g. because attempts to exploit it result in broken JavaScript).
  • Explicitly set the correct character set at the beginning of the document (i.e. as early as possible) and/or in the header.
  • Ensure that URLs provided by the user start with an allowed scheme (whitelisting) to avoid dangerous schemes (e.g. javascript: URLs)
  • Don’t forget URLs in redirector scripts
  • A Content Security Policy may be used as an additional security measure, but is not sufficient by itself to prevent attacks.

Rationale

Escaping data directly at the output location makes it easier to check that all outputs are escaped – each and every variable used as a parameter for an output method must either be marked as pre-escaped or be wrapped in a corresponding escape command.
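
As a minimal sketch of escaping right at the output location, the following JavaScript escapes every variable in the same expression that builds the output. The names escapeHtml and renderComment are hypothetical and serve only as an illustration; in real code, the escaping functions of your framework or template engine should be preferred over hand-written ones. The helper covers HTML text and quoted attribute values only, not CSS, JavaScript or URL contexts.

```javascript
// Illustration only: prefer the escaping functions of your framework.
// escapeHtml is a hypothetical helper for HTML text and quoted attribute values.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Each variable is wrapped in the escape call on the line that outputs it,
// so a reviewer can check coverage without tracing the data flow.
function renderComment(author, body) {
  return '<p title="' + escapeHtml(author) + '">' + escapeHtml(body) + "</p>";
}

console.log(renderComment('Eve" onmouseover="alert(1)', "<script>alert(1)</script>"));
```

With this layout, an unescaped variable stands out immediately because it is the only one not wrapped in an escape call.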

Different contexts require completely different escaping rules. A “)” character with no dangerous meaning in HTML and HTML attributes can signify the end of a URL path in CSS. See the example at the bottom for a complex but common case where HTML and JavaScript are used together and create countless opportunities for XSS. Note that many simple XSS attempts are "accidentally" blocked even by the wrong escaping (e.g. HTML escaping mangles quotes required for a JavaScript string injection, or newlines creating invalid JavaScript in case of injection attempts). Do NOT rely on this. The attacker may know a trick you are not thinking about. If it is possible to place anything in a place of the document structure where it is not supposed to go (e.g. outside a JavaScript string literal), it is a security issue that must be fixed. It might not be exploitable - or you may simply not be seeing the way to exploit it. Don't take that risk!

Not setting the character set may lead to guessing by the browser. Such guessing can be exploited to pass a string that seems harmless in your intended encoding, but is interpreted as a script tag in the encoding assumed by the browser. For HTML5, use <meta charset="utf-8" /> as the first element in the head section.
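
A small sketch of declaring the character set in both places, the HTTP header and the start of the head section, might look like this. The function buildHtmlResponse is a hypothetical helper that only assembles the header and body; how they are actually sent depends on your server framework, and the title and body are assumed to be escaped already.

```javascript
// Hypothetical helper: returns the headers and body of an HTML response.
// The character set is declared twice: in the Content-Type header and as
// the first element inside <head>, so the browser never has to guess.
function buildHtmlResponse(escapedTitle, escapedBody) {
  const headers = { "Content-Type": "text/html; charset=utf-8" };
  const body =
    "<!DOCTYPE html>\n" +
    "<html>\n<head>\n" +
    '<meta charset="utf-8" />\n' +
    "<title>" + escapedTitle + "</title>\n" +
    "</head>\n<body>" + escapedBody + "</body>\n</html>\n";
  return { headers, body };
}

const response = buildHtmlResponse("Example", "<p>Hello</p>");
console.log(response.headers["Content-Type"]);
```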

URLs can be dangerous, too. User-provided links should be checked against a scheme whitelist, as the javascript scheme is not the only dangerous one. Other schemes can trigger possibly unwanted action. If only web links are to be allowed, require the URLs to start with “http://” or “https://”.
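
The scheme check could be sketched as follows, using the standard URL class available in browsers and Node.js. The function name normalizeAllowedUrl and the set ALLOWED_SCHEMES are assumptions made for this example. Parsing the URL instead of comparing string prefixes means that variations in case or leading whitespace do not slip past the check.

```javascript
// Schemes accepted for user-provided links (example whitelist).
const ALLOWED_SCHEMES = new Set(["http:", "https:"]);

// Hypothetical helper: returns the normalized URL if its scheme is
// whitelisted, or null if the input is relative, malformed or uses
// any other scheme.
function normalizeAllowedUrl(input) {
  let parsed;
  try {
    parsed = new URL(input); // throws on relative or malformed URLs
  } catch (e) {
    return null;
  }
  // The parser lowercases the scheme, so "JaVaScRiPt:" becomes "javascript:".
  return ALLOWED_SCHEMES.has(parsed.protocol) ? parsed.href : null;
}

console.log(normalizeAllowedUrl("https://example.org/page")); // https://example.org/page
console.log(normalizeAllowedUrl("javascript:alert(1)"));      // null
console.log(normalizeAllowedUrl(" JaVaScRiPt:alert(1)"));     // null
console.log(normalizeAllowedUrl("/relative/path"));           // null
```

The returned URL still has to be escaped for the context it is written into (e.g. an HTML attribute value); the whitelist only decides whether the URL may be used at all.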

A Content Security Policy can prevent certain kinds of injection. Only some browsers support it; others simply ignore it. It is a powerful secondary defense to limit the impact of security issues, but cannot be used as the primary way to prevent XSS - the primary way to prevent XSS is correct escaping, which will not only prevent XSS, but also ensure that your page displays correctly even in the presence of uncommon input. Implementing a CSP may require significant changes to your code. Notably, you cannot include any inline JavaScript (unless you explicitly allow inline JS in your CSP - which removes most of the protection CSPs provide).
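
As an illustration, a restrictive policy could be assembled like this. The directive values are an example chosen for this sketch, not a recommendation for every site, and buildCspHeader is a hypothetical helper. Because 'unsafe-inline' is not listed, inline script blocks would be blocked under this policy.

```javascript
// Example policy (an assumption for this sketch): resources only from the
// site's own origin, no plugins, and no inline scripts.
const cspDirectives = {
  "default-src": ["'self'"],
  "script-src": ["'self'"],
  "object-src": ["'none'"],
  "base-uri": ["'self'"],
};

// Hypothetical helper: joins the directives into the header value.
function buildCspHeader(directives) {
  return Object.entries(directives)
    .map(([name, sources]) => name + " " + sources.join(" "))
    .join("; ");
}

console.log("Content-Security-Policy: " + buildCspHeader(cspDirectives));
```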

Complex XSS example with JS inside HTML

Often overlooked issues include the complex interaction between HTML and JavaScript. An often-used construct is something like this:

<script>
  var CURRENT_VALUE = 'test';
  document.getElementById("valueBox").innerHTML = CURRENT_VALUE; // INSECURE CODE - DO NOT USE.
</script>

The content of CURRENT_VALUE (in this example, the word test) is inserted into the page source dynamically by the server according to e.g. user input or a value from a database. The second line, which actually writes the data to the document, is often part of a script included from a file. There are many different ways to perform XSS attacks against such a construct, unless proper escaping is used in every step. In our examples, the attacker wants to execute the code alert(1);.

First, if proper escaping for JavaScript is missing, the attacker can simply provide the appropriate quote symbol to terminate the string, a semicolon, his code, and then comment out the rest of the line. For example, the attacker could provide the value ';alert(1);//, resulting in the following HTML code, executing his code:

<script>
  var CURRENT_VALUE = '';alert(1);//';
  document.getElementById("valueBox").innerHTML = CURRENT_VALUE;
</script>

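Note that this will work even if the value is escaped using an HTML-escaping function like htmlspecialchars() if that function doesn't touch the single quote used in this example (before PHP 8.1, htmlspecialchars() left single quotes untouched unless the ENT_QUOTES flag was passed).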

Assuming the attacker cannot use the appropriate quote, because it is filtered, he can use the value </script><script>alert(1);</script>. Inside a regular JavaScript file, the resulting line would not immediately cause a problem (though assigning it to innerHTML would), since the following is a perfectly safe variable assignment:

var CURRENT_VALUE = '</script><script>alert(1);</script>';

Since, however, this appears in an inline script block, the HTML parser will interpret the "script-end" tag, resulting in a broken piece of JavaScript, followed by a second script block containing the attacker's code, some text, and a spurious script-end tag:

<script>
  var CURRENT_VALUE = '</script><script>alert(1);</script>';
  document.getElementById("valueBox").innerHTML = CURRENT_VALUE;
</script>

Or, reindented for clarity:

<script>
  var CURRENT_VALUE = '
</script>
<script>alert(1);</script>
'; document.getElementById("valueBox").innerHTML = CURRENT_VALUE;
</script>

The attacker can also simply break the JavaScript by inserting a backslash at the end of the string, thus escaping the quote at the end:

var CURRENT_VALUE = 'text\';

A simple newline anywhere in the string will also cause a syntax error (unterminated string literal). While these attacks do not allow direct XSS in this example, they may break critical security features, render the site unusable (Denial of Service), or allow XSS if another value can be manipulated - here the attacker supplies text\ and ;alert(1);' to a variant of this construct that passes two values:

var CURRENT_VALUE1 = 'text\'; var CURRENT_VALUE2 = ';alert(1);'';

Since the string-ending quote was escaped, the quote that is supposed to start the second string instead closes the first, turning the remaining content into JavaScript. This brings us to the statement above: If it is possible to place anything in a place of the document structure where it is not supposed to go (e.g. outside a JavaScript string literal), it is a security issue that must be fixed. It might not be exploitable - or you may simply not be seeing the way to exploit it. Don't take that risk!

These are only issues with the first line in our example. The second line directly inserts the value into the document as HTML, thus allowing XSS. To exploit this, the attacker must avoid the script end tag due to the issue mentioned above, so he uses a non-existing image with an error handler. His input <img src=1 onerror=alert(1)> results in:

<script>
  var CURRENT_VALUE = '<img src=1 onerror=alert(1)>';
  document.getElementById("valueBox").innerHTML = CURRENT_VALUE;
</script>

The innerHTML assignment puts the image tag into the document, and since "1" is not a valid URL, the error handler is executed. Note that this is not perfectly valid HTML, since the quotes around the attributes are missing. It is still valid enough to work, and avoids the quotes being mangled due to escaping.

Simply HTML-escaping the output value using functions like htmlspecialchars() on the server side (when writing it to the variable assignment line) will prevent some of these attacks and might make others unexploitable or harder to exploit. However, it is incorrect and dangerous, and it leaves other means of attack open!

Most notably, the attacker might decide to do what you should have done, and properly escape his attack sequence for you. This will leave the backslash \ as the only special character, giving an input like \u003Cimg src=1 onerror=alert(1)\u003E (note that any remaining character, i.e. the spaces, parentheses, equals signs and letters, could also be escaped). This will pass through your HTML escape function unchanged, resulting in the following code:

<script>
  var CURRENT_VALUE = '\u003Cimg src=1 onerror=alert(1)\u003E';
  document.getElementById("valueBox").innerHTML = CURRENT_VALUE;
</script>

The JavaScript parser will interpret the escape sequences and insert the XSS code into your document.
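
The effect can be reproduced in a few lines. In this sketch, escapeHtml is a hypothetical stand-in for a server-side HTML-escaping function, and JSON.parse is used only to mimic how the JavaScript parser decodes \u escape sequences in a string literal, without executing anything.

```javascript
// Hypothetical stand-in for a server-side HTML-escaping function.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// The attacker's pre-escaped input (String.raw keeps the backslashes literal).
const attackerInput = String.raw`\u003Cimg src=1 onerror=alert(1)\u003E`;

// HTML escaping finds nothing to escape, so the input passes unchanged.
const escaped = escapeHtml(attackerInput);
console.log(escaped === attackerInput); // true

// Decoding the string literal, as the JavaScript parser would, restores the tag.
const decoded = JSON.parse('"' + escaped + '"');
console.log(decoded); // <img src=1 onerror=alert(1)>
```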


There are two correct ways to escape in this situation:

  • Method 1 - JS escaping server side, HTML escaping client side (recommended)
    • On the server, properly (see below) escape the value using JavaScript escape values.
    • In the client-side JavaScript, ensure your code escapes the text before inserting it into the document, using e.g. the .text() setter of jQuery.
  • Method 2 - HTML escaping, then JS escaping, both server side (not recommended)
    • On the server, first escape the value for HTML.
    • On the server, then properly (see below) escape the result using JavaScript escape values before writing it into the script block. The client-side script then inserts it into the document as HTML.

Method 2 allows you to deliver server-generated custom HTML to the client. You need to escape the HTML like any other HTML output (e.g. using htmlspecialchars in PHP). The escaped content then gets passed to the client side, which directly dumps it into the document. This means the client side cannot use the text for any non-HTML context, and attempting to do so may lead to a security issue. As you can see, the escaping is done in reverse order: The format that gets interpreted last (HTML, in this case) gets escaped first, then the entire string is "wrapped" by escaping in the outer format.

The recommended approach is to keep text unescaped until it is ready for output, then escape right before it is output (i.e. when the context is known). Consistently following this approach will also avoid double-encoding (i.e. showing your users HTML entities like &amp;amp; in the text).
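
Double-encoding is easy to reproduce. In the sketch below, escapeHtml is a hypothetical helper; the value is escaped once when it is stored and once more at output, which is exactly what happens when it is unclear whether a variable is already escaped.

```javascript
// Hypothetical HTML-escaping helper, for illustration only.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

const stored = "Tom & Jerry";

// Escaped once, right at output: the browser displays "Tom & Jerry".
const escapedOnce = escapeHtml(stored);

// Escaped at storage time and again at output: the browser displays
// the entity itself instead of the ampersand.
const escapedTwice = escapeHtml(escapedOnce);

console.log(escapedOnce);
console.log(escapedTwice);
```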

How to properly escape for JavaScript inside HTML: Ensure that characters like < which have no special meaning in JavaScript but do have a special meaning in HTML also get escaped. Do not write your own escaping routines, as you will most likely miss something. Use existing libraries. For current versions of PHP, you may want to consider using json_encode() with the following additional flags set:

...
<script>
  var CURRENT_VALUE = <?php echo json_encode($text,
      JSON_HEX_QUOT | JSON_HEX_TAG | JSON_HEX_AMP | JSON_HEX_APOS); ?>;
  $("#valueBox").text(CURRENT_VALUE);
</script>
...

The text will now be correctly rendered, even if it includes weird special characters.
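
For servers written in JavaScript, the same idea can be sketched with JSON.stringify plus an extra pass that hex-escapes the characters that are special in HTML. The function jsStringLiteral is a hypothetical helper written for this example, not a library function; a maintained library should be preferred where one is available. Unlike JSON_HEX_QUOT, this sketch leaves inner double quotes as \", so the result is meant for inline script blocks only, not for HTML attribute values.

```javascript
// Hypothetical helper: turns a text value into a JavaScript string literal
// that is safe to write into an inline <script> block.
function jsStringLiteral(text) {
  // JSON.stringify adds the surrounding quotes and escapes quotes,
  // backslashes and control characters.
  return JSON.stringify(String(text)).replace(/[<>&'\u2028\u2029]/g, (ch) =>
    // Replace characters that matter to the HTML parser (and the two
    // Unicode line separators) with \uXXXX escapes.
    "\\u" + ch.charCodeAt(0).toString(16).padStart(4, "0").toUpperCase()
  );
}

console.log(jsStringLiteral("<b>")); // "\u003Cb\u003E"
console.log(jsStringLiteral("</script><script>alert(1);</script>"));
console.log(jsStringLiteral("text\\"));
```

The output contains no literal <, so the HTML parser cannot find a script end tag inside it, and a trailing backslash is doubled so that it can no longer escape the closing quote.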