We just updated our algorithm to recover more pages. Our Wayback Downloader now also recovers URLs that have no internal link path leading to them.
To illustrate what we fixed, let’s first explain how our scraper used to work and how that caused a problem. Below is the text from a now-defunct FAQ item:
If you can browse to a page by starting from the front page of a certain date, then we download that page. Our software works like a human user who clicks on all links on the front page. Then it visits those links and again clicks on all the links it can find on those pages. It continues like this until it has found all pages.
It’s therefore important to pick a good date in archive.org. For example, if you pick the front page from the year 2010, it’s unlikely the software will find a link to a page that was created in 2016. Usually it’s best to pick a recent date.
This describes how our scraper used to work. It still works like this, but we added a way to download more pages that can’t be reached by a “human user who clicks on all links”.
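The “human user who clicks on all links” behaviour is essentially a breadth-first crawl of the internal link graph. Here is a minimal sketch of the idea (the function names and the toy link graph are illustrative, not our actual code):

```python
from collections import deque

def crawl(start_url, get_internal_links):
    """Breadth-first crawl: visit the start page, then every page
    reachable from it by following internal links, each page once."""
    seen = {start_url}
    queue = deque([start_url])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in get_internal_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

# Toy link graph: /old-post links out to the front page, but nothing
# links in to it, so the crawl never reaches it -- exactly the
# problem described in this post.
graph = {
    "/": ["/about", "/news"],
    "/about": ["/"],
    "/news": ["/"],
    "/old-post": ["/"],
}
pages = crawl("/", lambda u: graph.get(u, []))
print(pages)  # → ['/', '/about', '/news']
```

Note that the crawl finds every page with an inbound link path, but `/old-post` is silently missed.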
The process above could cause problems with recovering all pages from a domain. For example, say you wanted to recover a website and gave our scraper the front page from 2016 as a starting point. An old blog post from 2010 on the same domain might no longer have an internal link path leading to it. Maybe the blog post was a news item that was pushed off the front page by newer news items. Before, we were not able to recover this blog page, even when archive.org still had the content for it.
The backlink problem
This especially caused problems for customers who care about SEO. Some of those old blog posts had lots of backlinks pointing to them, according to tools like Majestic or Ahrefs. Those old blog posts also linked to the front page and other internal pages, so without them there was no path for the incoming link juice to flow through the rest of the website.
We did create automatic 301 redirects to the front page for any missing pages, but that is not an optimal solution. It creates a bad user experience, and it causes backlinks to be removed over time, because the content is simply not there. This technique is also very common among SEOs (see https://wordpress.org/plugins/link-juice-keeper/), so Google probably frowns upon it too. It’s better to avoid it as much as possible, especially when there is an opportunity to recover the original content.
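For the curious: a catch-all 301 of this kind typically looks roughly like the following in an Apache `.htaccess` file. This is a generic sketch of the technique, not our exact rules:

```apache
RewriteEngine On
# If the requested file or directory does not exist...
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
# ...send a permanent redirect to the front page.
RewriteRule ^ / [R=301,L]
```

Every missing URL answers with a 301 to the homepage, which is exactly why link-checking tools eventually treat those backlinks as pointing at dead content.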
Some customers managed to “merge” multiple dates into one website, but this was a cumbersome process, so we needed a better solution.
How our wayback scraper works in 2018
The same as before – except we now also do a post-processing step to find all the pages that didn’t have an internal link path. This process is completely automatic.
The scraper still functions “like a human clicking on all internal links”. After that process has finished, the algorithm checks archive.org to see whether it missed any URLs from dates in the past or future. It then simply adds those pages.
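The idea behind this post-processing step can be sketched with archive.org’s public CDX API, which lists every capture it has for a domain. This is only an illustration of the concept, not our production code; the helper names are made up:

```python
import json
from urllib.request import urlopen

def all_archived_urls(domain):
    """Ask the Wayback CDX API for every URL ever captured on the
    domain, collapsed to one row per unique URL."""
    api = ("http://web.archive.org/cdx/search/cdx?url=" + domain +
           "/*&output=json&fl=original&collapse=urlkey")
    with urlopen(api) as resp:
        rows = json.load(resp)
    return {row[0] for row in rows[1:]}  # rows[0] is the header row

def find_missed_urls(crawled_urls, archived_urls):
    """URLs archive.org knows about that the link-following crawl
    never reached; these are the pages added in post-processing."""
    return archived_urls - set(crawled_urls)

# Example with hypothetical data: the crawl found two pages, but the
# archive knows about a third one with no inbound link path.
crawled = {"http://example.com/", "http://example.com/about"}
archived = {"http://example.com/", "http://example.com/about",
            "http://example.com/old-post"}
print(find_missed_urls(crawled, archived))  # → {'http://example.com/old-post'}
```

The `collapse=urlkey` parameter makes the CDX API return one capture per unique URL, which keeps the candidate list small.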
If the domain has a history of multiple owners with different websites on it, then we now download all pages from all those different websites on the same domain. That might not always be a desirable outcome for everyone, but it’s a good way to illustrate how our new scraper works (see example below).
If multiple pages share the same URL, for example domain.com/index.html, we only recover the first page our scraper comes across. Here is an example of how big the difference can be for https://web.archive.org/web/20130707012635/http://www.bowdbeal.com:80/:
Old method, resulting in 10 pages
New method, resulting in 39 pages
Why is the difference so big?
For this domain, the difference is so large (10 vs. 39 pages) because the domain was used for multiple purposes. First it was used by a musician (see https://web.archive.org/web/20080827204449/http://www.bowdbeal.com/bio.html), and later it was used as a PBN domain in the construction niche (see https://web.archive.org/web/20130707012635/http://www.bowdbeal.com:80/).
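The first-page-wins rule mentioned earlier (when several snapshots share a URL such as domain.com/index.html) is simple first-come-first-served bookkeeping. A sketch, using made-up snapshot data for this domain:

```python
def dedupe_first_wins(snapshots):
    """Keep only the first snapshot seen for each URL, in order.
    `snapshots` is an iterable of (url, timestamp) pairs."""
    kept = {}
    for url, timestamp in snapshots:
        kept.setdefault(url, timestamp)
    return kept

snaps = [
    ("domain.com/index.html", "20080827"),  # musician-era front page
    ("domain.com/bio.html",   "20080827"),
    ("domain.com/index.html", "20130707"),  # later PBN-era front page
]
print(dedupe_first_wins(snaps))
# → {'domain.com/index.html': '20080827', 'domain.com/bio.html': '20080827'}
```

The later capture of index.html is discarded because the scraper already recovered an earlier page at that URL.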