A Link Rot Bestiary/Chapter 3 : Soft 404
A soft 404 is a URL that serves content different from the original. For example, http://www.foxnews.com/us/2009/09/13/tennis-great-jack-kramer-dead/ redirects to https://www.foxnews.com/us instead of serving the original article.
Soft 404s are most commonly redirects, as in the foxnews example. However, they can also be static pages whose content has changed; this is called content drift. The classic example is a weather report. Other examples are sports scores and financial prices.
Soft 404s can also come from domain name squatters, blank pages, content management changes, spam sites, bot blockers, and rate limiters; the possibilities are endless. Conceptually, the page returns a status of 200 but does not return the intended content: in effect a 404, and thus "soft".
Detection methods
Soft 404s are notoriously difficult to detect. This section describes some detection methods.
Key phrases
One method is to download the HTML content of a page and search it for known key phrases like "No page found". This method has limitations because of the wide variety of possible phrases in English, much less in the thousands of other languages in the world.
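The key phrase search can be sketched as follows. This is a minimal illustration, not a complete detector: the phrase list `KEY_PHRASES` and the helper names `strip_tags` and `find_key_phrases` are assumptions made for this example, and the tag stripping is deliberately crude. It operates on HTML that has already been downloaded.

```python
import re

# Hypothetical phrase list for illustration only. A usable list would be
# much longer and would need entries for other languages.
KEY_PHRASES = [
    "no page found",
    "page not found",
    "404 not found",
    "page you requested could not be found",
]


def strip_tags(html: str) -> str:
    """Crudely reduce HTML to lowercase text with collapsed whitespace."""
    # Drop script and style blocks, including their contents.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html)
    # Replace every remaining tag with a space.
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def find_key_phrases(html: str) -> list[str]:
    """Return the known key phrases that occur in the page text."""
    text = strip_tags(html)
    return [phrase for phrase in KEY_PHRASES if phrase in text]
```

A non-empty result is only a hint that the page may be a soft 404. An article that happens to discuss "page not found" errors would also match, so the result is better used as one signal among several.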
URL analysis
URLs can be analyzed for soft 404s. In the above foxnews.com example, the redirect URL has a much shorter path than the original URL. Likewise, a URL that itself contains "404", such as .com/404.htm, is a sign of a soft 404.
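Both URL heuristics can be sketched in a few lines. The function names and the thresholds below (a final path of at most one segment that is at least two segments shorter than the source) are assumptions chosen for illustration, not established values. The "404" check is also narrowed here to a path segment named exactly 404, with or without a file extension, so that numeric identifiers which merely contain those digits do not match.

```python
import re
from urllib.parse import urlsplit


def path_segments(url: str) -> list[str]:
    """Return the non-empty segments of the URL's path."""
    return [seg for seg in urlsplit(url).path.split("/") if seg]


def looks_like_soft_404(source_url: str, final_url: str) -> bool:
    """Apply two URL heuristics to a (requested URL, final URL) pair."""
    src = path_segments(source_url)
    dst = path_segments(final_url)

    # Heuristic 1: the final URL has a segment such as "404" or "404.htm".
    if any(re.fullmatch(r"404(\.\w+)?", seg) for seg in dst):
        return True

    # Heuristic 2: the redirect landed on a much shorter path, such as a
    # home page or section page. Thresholds are illustrative assumptions.
    if source_url != final_url and len(dst) <= 1 and len(src) - len(dst) >= 2:
        return True

    return False
```

With the foxnews.com example, the source path has five segments and the final path has one, so the second heuristic fires. A redirect that only changes http to https and keeps the path is not flagged.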
Logging rules
Soft 404s are most often the result of redirects. This is because websites make changes but fail to redirect each old URL to its new location, instead defaulting everything to a home page (foxnews.com example above). Knowing this, it is possible to query a large number of URLs within a single domain and record the source URL and redirect URL in a 2-column table. It might look like:
- <source URL 1> <redirect URL 1>
- <source URL 2> <redirect URL 2>
- <source URL 3> <redirect URL 1>
- <source URL 4> <redirect URL 4>
- <source URL 5> <redirect URL 1>
Here "redirect URL 1" appears 3 times, which flags it as a possible soft 404. Once the soft 404 table is generated, rules can be added so that on the next run any link that redirects to a flagged URL is treated as a dead link.
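The table and the rule step can be sketched as follows, assuming the (source URL, redirect URL) pairs have already been collected for one domain. The function names and the `min_repeats` threshold of 3 are assumptions for this example; a suitable threshold will depend on how many URLs were sampled.

```python
from collections import Counter


def suspect_redirect_targets(pairs, min_repeats=3):
    """Return redirect URLs that many different source URLs end up on.

    `pairs` is an iterable of (source_url, redirect_url) tuples.
    `min_repeats` is an illustrative threshold, not an established value.
    """
    # Count only real redirects, where the final URL differs from the source.
    counts = Counter(target for source, target in pairs if source != target)
    return {target for target, n in counts.items() if n >= min_repeats}


def is_dead_link(redirect_url, rules):
    """Treat a link as dead when it redirects to a flagged URL."""
    return redirect_url in rules


# The table from the text, with placeholder strings standing in for URLs.
table = [
    ("source URL 1", "redirect URL 1"),
    ("source URL 2", "redirect URL 2"),
    ("source URL 3", "redirect URL 1"),
    ("source URL 4", "redirect URL 4"),
    ("source URL 5", "redirect URL 1"),
]
rules = suspect_redirect_targets(table)
```

Here `rules` contains only "redirect URL 1". Saving that set between runs lets later runs apply `is_dead_link` directly. A flagged URL is still only a candidate, since a site may legitimately merge several old pages into one new page, so a manual check before adding a rule is sensible.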
Third party packages
Third party packages for soft 404 detection:
- "soft404: a classifier for detecting soft 404 pages", uses machine learning