Cross-site scripting (XSS)
Persistent XSS vulnerabilities store the user input and include it in later output (e.g. a posting in a forum). This means that users do not need to visit a malicious or compromised site to be attacked.
To prevent this type of attack:
- Escape anything that is not a constant before including it in a response, as close to the output as possible (i.e. right in the line containing the “echo” or “print” call).
- If that is not possible (e.g. when building a larger HTML block), escape while building the block, and indicate in the variable name both that the content is pre-escaped and the context it was escaped for.
- Explicitly set the correct character set at the beginning of the document (i.e. as early as possible) and/or in the header.
- Do not forget URLs in redirector scripts.
- A Content Security Policy may be used as an additional security measure, but is not sufficient by itself to prevent attacks.
Escaping data directly at the output location makes it easier to check that all outputs are escaped – each and every variable used as a parameter for an output method must either be marked as pre-escaped or be wrapped in a corresponding escape command.
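The rules above can be sketched as follows. This is a minimal illustration, not a complete templating solution: the helper escapeHtml, the function renderComment and the variable htmlEscapedSignature are hypothetical names introduced only for this example.

```javascript
// Hypothetical helper: escapes the five characters that are special in
// HTML text and in quoted attribute values.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Hypothetical output function: every non-constant value is wrapped in an
// escape call in the same statement that produces the output.
function renderComment(author, body) {
  return "<p><b>" + escapeHtml(author) + "</b>: " + escapeHtml(body) + "</p>";
}

// Content that has to be escaped earlier carries that fact, and the
// context it was escaped for, in its name.
const htmlEscapedSignature = escapeHtml("Tom & <Jerry>");

console.log(renderComment("Mallory", "<script>alert(1)</script>"));
console.log("<footer>" + htmlEscapedSignature + "</footer>");
```

With this convention, a reviewer only has to check that each variable in an output statement is either wrapped in an escape call or named as pre-escaped.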
Not setting the character set may lead to guessing by the browser. Such guessing can be exploited to pass a string that seems harmless in your intended encoding, but is interpreted as a script tag in the encoding assumed by the browser. For HTML5, use <meta charset="utf-8" /> as the first element in the head section.
Complex XSS example with JS inside HTML
<script>
var CURRENT_VALUE = 'test';
document.getElementById("valueBox").innerHTML = CURRENT_VALUE; // INSECURE CODE - DO NOT USE.
</script>
The content of CURRENT_VALUE (in this example, the word test) is inserted into the page source dynamically by the server, for example from user input or a value from a database. The second line, which actually writes the data to the document, is often part of a script included from a file. There are many different ways to perform XSS attacks against such a construct unless proper escaping is used in every step. In our examples, the attacker wants to execute alert(1). The most direct attack is to supply the value ';alert(1);//, which results in the following HTML code and executes his code:
<script>
var CURRENT_VALUE = '';alert(1);//';
document.getElementById("valueBox").innerHTML = CURRENT_VALUE;
</script>
Note that this will work even if the value is escaped using an HTML-escaping function like htmlspecialchars(), if that function does not touch the single quote used in this example (before PHP 8.1, htmlspecialchars() left single quotes alone unless ENT_QUOTES was passed).
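The following sketch reproduces this attack outside a browser. It assumes an HTML escaper that leaves single quotes untouched; the names escapeHtmlKeepingSingleQuotes and renderAssignment are hypothetical, and alert is replaced by a counting stub so the generated line can be run under Node.

```javascript
// Assumption: mimics an HTML escaper that does not escape single quotes.
function escapeHtmlKeepingSingleQuotes(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// INSECURE on purpose: builds the assignment line the way the server in
// the example does, with HTML escaping only.
function renderAssignment(value) {
  return "var CURRENT_VALUE = '" + escapeHtmlKeepingSingleQuotes(value) + "';";
}

const line = renderAssignment("';alert(1);//");
console.log(line);

// Run the generated line with a stub in place of alert to see whether the
// attacker's payload executes.
let alertCalls = 0;
new Function("alert", line)(() => { alertCalls += 1; });
console.log("alert was called", alertCalls, "time(s)");
```

The payload contains no character the escaper touches, so it reaches the script block unchanged and the stub is called.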
Assuming the attacker cannot use the appropriate quote because it is filtered, he can instead supply the value </script><script>alert(1);</script>, which results in:
<script>
var CURRENT_VALUE = '</script><script>alert(1);</script>';
document.getElementById("valueBox").innerHTML = CURRENT_VALUE;
</script>
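Or, reindented for clarity:
<script>
var CURRENT_VALUE = '
</script>
<script>alert(1);</script>
';
document.getElementById("valueBox").innerHTML = CURRENT_VALUE;
</script>
The HTML parser ends the script block at the first script end tag, before any JavaScript is parsed. The first block therefore fails with a syntax error, and the attacker's second block is executed.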
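A rough simulation of how the browser splits this page into script blocks is sketched below. It rests on a simplifying assumption (a script element ends at the first "</script" regardless of JavaScript quoting) and ignores the finer points of real HTML parsing; splitScriptBodies is a hypothetical helper, not a browser API.

```javascript
// Simplified assumption: a script element ends at the first "</script",
// no matter whether that text sits inside a JavaScript string literal.
function splitScriptBodies(html) {
  const bodies = [];
  const lower = html.toLowerCase();
  let pos = 0;
  while (true) {
    const open = lower.indexOf("<script>", pos);
    if (open === -1) break;
    const start = open + "<script>".length;
    const close = lower.indexOf("</script", start);
    if (close === -1) break;
    bodies.push(html.slice(start, close).trim());
    pos = close + "</script".length;
  }
  return bodies;
}

const page =
  "<script>\n" +
  "var CURRENT_VALUE = '</script><script>alert(1);</script>';\n" +
  'document.getElementById("valueBox").innerHTML = CURRENT_VALUE;\n' +
  "</script>";

const scriptBodies = splitScriptBodies(page);
console.log(scriptBodies);
```

The first body is an unterminated string literal and the second is exactly the attacker's code, which matches the reindented listing.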
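If neither quotes nor tags can be used, a single backslash at the end of the value still escapes the closing quote and leaves the string literal unterminated:
var CURRENT_VALUE = 'text\';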
A simple newline anywhere in the string will also cause a syntax error (unterminated string literal). While these attacks do not allow direct XSS in this example, they may break critical security features, render the site unusable (Denial of Service), or allow XSS if another value can be manipulated. Here the attacker supplies text\ as the first value and ;alert(1);' as the second value to a variant of this construct that passes two values:
var CURRENT_VALUE1 = 'text\';
var CURRENT_VALUE2 = ';alert(1);'';
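The three cases just described (a lone trailing backslash, a backslash combined with a second controllable value, and a raw newline) can be checked with a small sketch. The helpers renderAssignments and tryRun are hypothetical, and alert is again replaced by a counting stub.

```javascript
// INSECURE on purpose: two values are placed into string literals without
// any JavaScript escaping.
function renderAssignments(value1, value2) {
  return "var CURRENT_VALUE1 = '" + value1 + "'; var CURRENT_VALUE2 = '" + value2 + "';";
}

// Compiles and runs the generated source with a stub for alert, reporting
// whether it failed to parse and how often alert was called.
function tryRun(source) {
  let alertCalls = 0;
  try {
    new Function("alert", source)(() => { alertCalls += 1; });
    return { syntaxError: false, alertCalls };
  } catch (err) {
    if (err instanceof SyntaxError) return { syntaxError: true, alertCalls };
    throw err;
  }
}

const backslashOnly = tryRun(renderAssignments("text\\", "harmless"));
const backslashPlusPayload = tryRun(renderAssignments("text\\", ";alert(1);'"));
const newlineInValue = tryRun(renderAssignments("line1\nline2", "harmless"));

console.log({ backslashOnly, backslashPlusPayload, newlineInValue });
```

Only the combination of both values runs the payload; the other two inputs merely break the script.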
These are only issues with the first line in our example. The second line directly inserts the value into the document as HTML, thus allowing XSS. To exploit this, the attacker must avoid the script end tag because of the issue described above, so he uses a nonexistent image with an error handler. His input <img src=1 onerror=alert(1)> results in:
<script>
var CURRENT_VALUE = '<img src=1 onerror=alert(1)>';
document.getElementById("valueBox").innerHTML = CURRENT_VALUE;
</script>
The innerHTML assignment puts the image tag into the document, and since "1" does not resolve to a loadable image, the error handler is executed. Note that the attribute values are not quoted. HTML accepts unquoted values that contain no spaces or quotes, and leaving the quotes out avoids them being mangled by escaping.
Simply HTML-escaping the output value using functions like htmlspecialchars() on the server side (when writing it to the variable assignment line) will prevent some of these attacks and might make others unexploitable or harder to exploit. However, it is the wrong escaping for this context, and it is dangerous because it leaves other means of attack open.
Most notably, the attacker might decide to do what you should have done, and properly JavaScript-escape his attack sequence for you. This leaves the backslash \ as the only special character, giving an input like \u003Cimg src=1 onerror=alert(1)\u003E (note that any remaining character, i.e. the spaces, parentheses, equals signs and letters, could also be escaped). This passes through your HTML escape function unharmed, resulting in the following code:
<script>
var CURRENT_VALUE = '\u003Cimg src=1 onerror=alert(1)\u003E';
document.getElementById("valueBox").innerHTML = CURRENT_VALUE;
</script>
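The sketch below shows why HTML escaping does not help here: the escaper finds nothing to change, and the JavaScript parser then turns the \u escapes back into angle brackets. escapeHtml is a hypothetical helper, and the value that would reach innerHTML is obtained by evaluating the generated line under Node.

```javascript
// Hypothetical HTML escaper covering &, <, >, " and '.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// The attacker's input contains literal backslash-u sequences, no tags.
const attackerInput = "\\u003Cimg src=1 onerror=alert(1)\\u003E";

// INSECURE on purpose: HTML escaping applied where JS escaping is needed.
const htmlEscapedInput = escapeHtml(attackerInput);
const source = "var CURRENT_VALUE = '" + htmlEscapedInput + "'; return CURRENT_VALUE;";

// What the innerHTML assignment would receive after JS parsing.
const valueSeenByInnerHTML = new Function(source)();

console.log(htmlEscapedInput === attackerInput);
console.log(valueSeenByInnerHTML);
```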
There are two correct ways to escape in this situation:
- Method 1 - JS escaping server side, HTML escaping client side (recommended)
- Method 2 - HTML escaping server side, JS escaping client side (not recommended)
- With Method 2, on the server, first escape the value for HTML, then escape the result for JavaScript
Method 2 allows you to deliver server-generated custom HTML to the client. You need to escape the HTML like any other HTML output (e.g. using htmlspecialchars in PHP). The escaped content then gets passed to the client side, which directly dumps it into the document. This means the client side cannot use the text for any non-HTML context, and attempting to do so may lead to a security issue. As you can see, the escaping is done in reverse order: the format that gets interpreted last (HTML, in this case) gets escaped first, then the entire string is "wrapped" by escaping in the outer format.
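The reverse order of escaping in Method 2 can be sketched as follows, written in JavaScript as if the server ran on Node. The helpers escapeHtml and toScriptSafeJsLiteral and the variable names are hypothetical, and the set of characters replaced in the JS literal is an assumption about what is needed inside a script block.

```javascript
// Inner format (interpreted last): HTML.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Outer format (interpreted first): a JS string literal inside a script
// block. JSON.stringify handles quotes, backslashes and newlines; the
// extra replacements keep tag characters out of the page source.
function toScriptSafeJsLiteral(text) {
  return JSON.stringify(String(text))
    .replace(/</g, "\\u003C")
    .replace(/>/g, "\\u003E")
    .replace(/&/g, "\\u0026")
    .replace(/\u2028/g, "\\u2028")
    .replace(/\u2029/g, "\\u2029");
}

const untrusted = "<img src=1 onerror=alert(1)>";

// Step 1: escape for HTML while building the server-generated markup.
const htmlEscapedBlock = "<b>" + escapeHtml(untrusted) + "</b>";

// Step 2: wrap the whole HTML string by escaping it for JavaScript.
const scriptLine = "var CURRENT_HTML = " + toScriptSafeJsLiteral(htmlEscapedBlock) + ";";

// What the client would assign to innerHTML after JS parsing.
const htmlSeenByClient = new Function(scriptLine + " return CURRENT_HTML;")();

console.log(scriptLine);
console.log(htmlSeenByClient);
```

The client receives exactly the HTML the server built, with the untrusted part still HTML-escaped, so it must only ever be used in an HTML context.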
The recommended approach is to keep text unescaped until it is ready for output, then escape right before it is output (i.e. when the context is known). Consistently following this approach will also avoid double-encoding (i.e. showing your users HTML entities like &amp;amp; in the page source, which appear as &amp; in the displayed text).
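Double-encoding is easy to reproduce: the sketch below escapes a value once when it is stored and again when it is output. escapeHtml is a hypothetical helper and the sample text is made up.

```javascript
// Hypothetical HTML escaper covering &, <, >, " and '.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

const rawText = "Fish & Chips";

// Correct: keep the raw text and escape once, right at output.
const escapedOnce = escapeHtml(rawText);

// Mistake: text that was already escaped earlier is escaped again at output.
const escapedTwice = escapeHtml(escapedOnce);

console.log(escapedOnce);  // browser displays: Fish & Chips
console.log(escapedTwice); // browser displays: Fish &amp; Chips
```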
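With Method 1 in PHP, the server-side JavaScript escaping can be done using json_encode() with the additional flags set. The example uses jQuery's text() for the client-side HTML escaping: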
...
<script>
var CURRENT_VALUE = <?php echo json_encode($text, JSON_HEX_QUOT | JSON_HEX_TAG | JSON_HEX_AMP | JSON_HEX_APOS); ?>;
$("#valueBox").text(CURRENT_VALUE);
</script>
...
The text will now be correctly rendered, even if it includes weird special characters.
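For servers written in JavaScript, a rough counterpart to the json_encode() call is sketched below. It is an approximation and not a byte-for-byte equivalent of the PHP flags: toScriptSafeJson is a hypothetical helper that assumes a JSON-serializable value, and the sample text is made up.

```javascript
// Hypothetical helper: JSON-encodes a value and replaces the characters
// that are risky inside a script block with \u escapes. Assumes the value
// is JSON-serializable (JSON.stringify returns undefined otherwise).
function toScriptSafeJson(value) {
  return JSON.stringify(value)
    .replace(/</g, "\\u003C")
    .replace(/>/g, "\\u003E")
    .replace(/&/g, "\\u0026")
    .replace(/'/g, "\\u0027")
    .replace(/\u2028/g, "\\u2028")
    .replace(/\u2029/g, "\\u2029");
}

const text = "It's </script><script>alert(1)</script> \\ \"quoted\"\nsecond line";

// Server side: JS escaping only, no HTML escaping.
const scriptLine = "var CURRENT_VALUE = " + toScriptSafeJson(text) + ";";

// Simulates the browser parsing the generated line.
const roundTripped = new Function(scriptLine + " return CURRENT_VALUE;")();

console.log(scriptLine);
console.log(roundTripped === text);
```

On the client, the value should then be written with an API that treats it as text, such as jQuery's text() from the example above or the DOM property textContent, so that the HTML escaping happens at the point of output.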