The internet is a vast treasure trove of knowledge. But it is fleeting and there are no guarantees that the content you like will be there in the future. If you can’t afford to lose that content, you can use a web archiving tool to store a copy of the web page.
Many people use read-later services for saving web articles. These apps work best with text-based content and do not handle complicated webpage designs or media properly. Want some more control?
Let’s see how you can create your own clone of Instapaper or Pocket on your computer without losing any web page assets.
ArchiveBox is an open-source tool that lets you host your own alternative to an archiving service like the Wayback Machine. You don’t give up your privacy or get locked into a service you cannot control.
It takes the list of URLs you want to archive and creates a local, browsable HTML clone of the content in multiple formats. It includes local copies in HTML, a screenshot of the page, a PDF file, and WARC (Web ARChive).
These copies stay with you even if the original webpage disappears in the future.
ArchiveBox is written in Python 3. It also relies on dependencies like Wget, headless Chrome, youtube-dl, and other Unix tools to save the webpage. You don’t need a constantly running backend server. Just run it each time you want to import new links and update the static output.
Once the archiving completes, you can open the generated output/index.html in your browser to view the archive.
Advantages of ArchiveBox
- It archives the links in several file formats that work as backups.
- It tries to retain the original webpage using sophisticated capturing methods.
- It automatically extracts the content and saves it to a single folder.
- It provides a simple command-line interface for handling multiple links, feeds, and bookmarks. You only have to set it up once and then run it on a schedule to archive newer links.
Disadvantages of ArchiveBox
- ArchiveBox extracts all the assets from the webpage. It consumes significant disk space and is CPU intensive.
- The app requires three or more dependencies beyond Python 3.5. It takes trial-and-error to make these components work together.
- The app does not fully support Windows. You have to install Docker or enable Windows Subsystem for Linux (WSL). Even then, some features may not work.
Supported Operating Systems
ArchiveBox officially supports the following operating systems:
- macOS: 10.12 Sierra with Homebrew.
- Linux: Ubuntu, Debian (with APT). The app may (or may not) work in distros like Fedora, CentOS, SUSE, Arch, and more.
- BSD: FreeBSD, OpenBSD, NetBSD (with pkg).
ArchiveBox is a flexible web archiving tool. You must install the following dependencies and meet the minimum requirements.
- Python 3. Don’t use the default Python 2.7 that comes with macOS.
- Wget 1.16
- Chromium 59. If you already use Google Chrome, don’t install Chromium.
- youtube-dl (Optional): Media resources need a lot of storage space. Think carefully before archiving media from your bookmarks.
Set Up ArchiveBox
There are two ways of setting up ArchiveBox—Automatic and Manual.
In the automatic method, a helper script installs the app and its dependencies. But if an error arises, you won’t be able to tell which step caused it. It’s better to install the app manually.
For the purpose of demonstration, we’ll use macOS 10.14.6.
Installing the Dependencies
The best way to install dependencies is through a package manager called Homebrew. To understand its basics, check out this article on how to install Mac apps with Homebrew.
Open Terminal and type in
brew install python3 git wget curl youtube-dl
brew cask install chromium
(Skip this if you already have Google Chrome/Chromium installed in Applications)
Check the Version Number of All Dependencies
To check the version number of all dependencies, type in
dependency app --version
(Replace the dependency app with python3, wget, youtube-dl, and more)
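As a rough sketch, the version checks can also be batched with a few lines of Python instead of typing each command by hand. The tool list mirrors the Homebrew command above. The helper name first_version_line is made up for illustration, and the script assumes that each tool accepts a --version flag.

```python
import shutil
import subprocess

# Tools installed with Homebrew earlier; adjust the list to match your setup.
TOOLS = ["python3", "git", "wget", "curl", "youtube-dl"]


def first_version_line(tool):
    """Return the first line printed by `<tool> --version`.

    Returns None when the tool is not on the PATH or cannot be run.
    """
    if shutil.which(tool) is None:
        return None
    try:
        result = subprocess.run(
            [tool, "--version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # some tools print the version to stderr
            universal_newlines=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    lines = result.stdout.strip().splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__":
    for tool in TOOLS:
        version = first_version_line(tool)
        print("{}: {}".format(tool, version if version is not None else "not found"))
```

Compare the printed versions against the minimum requirements listed earlier, such as Wget 1.16.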
Download Your Bookmarks Export File
All read-later services and browsers can export your bookmarks as an HTML file. Follow the instructions in this article on how to export bookmarks from your browser. You can also save a single link or a list of URLs in a text file.
Clone the repo from GitHub. Open Terminal and type in
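ArchiveBox parses the export file itself, so the following step is optional. If you’d like to preview which links an export contains, or turn it into a plain list of URLs, a minimal sketch using only Python’s standard library could look like this. The class and function names are hypothetical, and the sample text is a tiny stand-in for a real browser export.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag whose URL starts with http:// or https://."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.startswith(("http://", "https://")):
            self.urls.append(href)


def extract_urls(html_text):
    """Return the unique http(s) links in html_text, in their original order."""
    parser = LinkCollector()
    parser.feed(html_text)
    seen = set()
    unique = []
    for url in parser.urls:
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique


# Hypothetical sample data in the style of a bookmarks export.
sample = """
<DL><p>
<DT><A HREF="https://example.com/" ADD_DATE="1565000000">Example</A>
<DT><A HREF="https://example.org/page">Another page</A>
<DT><A HREF="https://example.com/">Duplicate</A>
<DT><A HREF="javascript:void(0)">Bookmarklet</A>
</DL><p>
"""

for url in extract_urls(sample):
    print(url)
```

Writing the returned list to a text file, one URL per line, gives you a links file in the plain-text form mentioned above.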
git clone https://github.com/pirate/ArchiveBox
When you clone this repo, Git creates an ArchiveBox folder in the current directory, which is your Home directory in a new Terminal window. This folder contains the main application and configuration files.
Add Your URL to the Archive
To archive a single link, run the following from inside the ArchiveBox folder:
echo 'https://example.com' | ./archive
Navigate to your ArchiveBox folder to see the newly created output folder. Inside it, you’ll find an index.html file.
Adding Multiple Links to the Archive
When you want to save multiple links (dozens or more), it’s better to add your links to a text file. The app will parse the URLs inside the file and archive them. Open Terminal and type in
./archive [Path to Your File.txt]
If your file is located in the Downloads folder, your path will look like
./archive /Users/(Home directory name)/Downloads/links.txt
Depending on the number of links, the process can take anywhere from a few minutes to a few hours. To access your archive, open output/index.html in your browser. You can sort by column, search titles using the box in the upper-right corner, and see the total number of links at the bottom.
Click the favicon under the Files column to visit the details page. You’ll find links to each file format as seen in the screenshot. ArchiveBox also submits the same link to archive.org.
In the same way, export your Instapaper or Pocket links as an HTML file. Then, type in
You can also import a list of links from a feed URL. But remember that large imports may run into many failures or session timeouts. If there are thousands of URLs, it’s better to break them into smaller files to increase the success rate.
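One way to break a large list into smaller files is sketched below. It assumes a plain text file with one URL per line. The function names, the output file naming, and the default chunk size of 500 are arbitrary choices for illustration, not something ArchiveBox prescribes.

```python
from pathlib import Path


def chunk(items, size):
    """Yield consecutive slices of items, each holding at most size entries."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def split_links_file(source, size=500):
    """Write <name>_part_001.txt, <name>_part_002.txt, ... next to source.

    Returns the list of paths that were written.
    """
    source = Path(source)
    # Ignore blank lines so that every chunk holds real URLs only.
    urls = [line.strip() for line in source.read_text().splitlines() if line.strip()]
    written = []
    for number, group in enumerate(chunk(urls, size), start=1):
        target = source.with_name("{}_part_{:03d}.txt".format(source.stem, number))
        target.write_text("\n".join(group) + "\n")
        written.append(target)
    return written
```

Calling split_links_file on your links.txt produces files such as links_part_001.txt in the same folder. You can then pass them to ./archive one at a time and rerun only the ones that fail.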
The default settings work in most cases, but there are certain important parameters you can tweak to get more features. The configuration file lives in
Note: Do not modify this file, because your changes will be erased whenever you update the app. To create a persistent config file, type in
cp ~/ArchiveBox/etc/ArchiveBox.conf.default ~/.ArchiveBox.conf
The cp command creates a copy of the configuration file in your home directory. Because its name begins with a period, the file is hidden by default. To reveal it in Finder, press Cmd + Shift + Period. Then open the config file in TextEdit.
ArchiveBox offers you many options. Here are some important ones:
- ONLY_NEW: Set this to True to archive only newly added links. This is useful if you bookmark links regularly.
- TIMEOUT: The number of seconds to wait before giving up on a download, for example 60 or 120. If you see frequent timeout errors, increase it to 120 seconds.
- URL_BLACKLIST: You can use a regular expression to exclude certain domains, extensions, or URL patterns from the archive.
- FETCH_MEDIA: Fetch all audio and video files using youtube-dl. Set this to True only when you have enough storage.
- WGET_USER_AGENT: Use it to change the user agent during archiving. This option comes in handy if certain servers are blocking you.
For the full list of options, visit the ArchiveBox Configuration page.
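Before committing a URL_BLACKLIST value to your config, it can help to try the regular expression against a few sample URLs. The sketch below uses Python’s re module. The pattern and the URLs are hypothetical. It also assumes the pattern is searched anywhere in the URL and ignores case, which may differ from how your ArchiveBox version applies it.

```python
import re

# Hypothetical pattern: skip one domain plus direct links to large binary files.
URL_BLACKLIST = r"(://|\.)facebook\.com/|\.(exe|dmg|iso)$"

# Assumption: the pattern is searched anywhere in the URL, ignoring case.
pattern = re.compile(URL_BLACKLIST, re.IGNORECASE)

urls = [
    "https://example.com/article",
    "https://www.facebook.com/some-page",
    "https://example.com/downloads/installer.DMG",
]

for url in urls:
    verdict = "skip" if pattern.search(url) else "archive"
    print("{:8} {}".format(verdict, url))
```

If a URL you care about shows up as skipped, tighten the pattern before you paste it into the config file.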
Publishing Your Archive
The archive produced by ArchiveBox is compatible with any provider that can host static HTML, such as GitHub Pages.
You can also serve it from a home server or VPS by directly uploading the output folder to your web directory.
Make sure you’re not running any content as CGI or PHP; you want to host only static HTML files.
Hosting your archive has both pros and cons. When you download links from random sites, you must understand the dangers of hosting malicious CSS and JS files on your own domain. You may also want to disallow your archive in a robots.txt file to keep it out of search results.
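For the robots.txt part, a minimal sketch is shown below. It writes the standard two-line rule that asks all crawlers to stay away from the whole site. The function name and the idea of passing in your web root folder are illustrative assumptions, so adapt the path to wherever your host serves files from.

```python
from pathlib import Path

# Standard robots.txt rules asking every crawler to stay out of the whole site.
ROBOTS_TXT = "User-agent: *\nDisallow: /\n"


def write_robots_txt(web_root):
    """Create robots.txt inside web_root and return its path."""
    target = Path(web_root) / "robots.txt"
    target.write_text(ROBOTS_TXT)
    return target
```

Keep in mind that crawlers look for robots.txt only at the root of a domain, and that following it is voluntary. It keeps well-behaved search engines away but does not restrict who can open the pages.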
Download Entire Websites Offline
If you’re frustrated with Instapaper or Pocket, then ArchiveBox is an excellent alternative. Apart from web articles, you might want to archive entire websites to access them offline or to preserve their knowledge. If this interests you, read this piece on how to download any website for offline reading.