wbm-archiver
Python scripts to archive URLs to the Internet Archive’s Wayback Machine, or retrieve existing archived versions.
1. General-purpose scripts
These scripts work with any website. They read a list of URLs from a text file and submit them to the Wayback Machine using the waybackpy library.
| Script | Description |
|---|---|
| SaveToWaybackMachine_v2_30112021.py | Main script (v2, Nov 2021). Interactive: prompts for input file, operation mode, and output file. |
| SaveToWaybackMachine_v2_30112021_improvedVeraDeKok.py | Improved version by Vera de Kok, with better error handling. |
Features
Three modes of operation:
- Save pages — submit URLs to the Wayback Machine for archiving
- Retrieve latest — get the most recent archived version of a page
- Retrieve oldest — get the earliest archived version of a page
Requirements
pip install waybackpy
Usage
python SaveToWaybackMachine_v2_30112021.py
The script will prompt you for:
- Input file with URLs (one per line)
- Operation mode (save / retrieve latest / retrieve oldest)
- Output file for results
2. mmdc.nl-specific scripts
These scripts were developed for the mmdc.nl archiving project and are located in the _archiving-artifacts/scripts/ folder. They use the Internet Archive’s Save Page Now 2 (SPN2) API with authenticated access.
| Script | Description |
|---|---|
| SaveToWBM_mmdc_non-catalog-pages.py | Submits non-catalog URLs (317 static HTML pages, 112 PDFs, 38 images) to the WBM. Used in Dec 2025: 466/466 (100%) archived. |
| SaveToWBM_mmdc_catalog-pages.py | Submits pre-rendered catalog pages to the WBM. Used in Apr 2026: 11.738/11.738 (100%) archived. |
Features
Both scripts share the same robustness features, designed for long-running submissions (the catalog script ran for ~55 hours):
- Resume capability — progress saved after every URL; picks up exactly where it left off after a crash or interruption
- Graceful shutdown — Ctrl+C saves Excel + progress state before exit
- Automatic retry — exponential backoff on transient failures (timeouts, connection errors)
- Rate-limit handling — detects HTTP 429, pauses for 5 minutes + jitter
- Offline detection — after 3 consecutive failures, enters wait-for-connectivity mode with increasing delays (30s → 1m → 2m → 5m → 10m)
- Excel locking detection — retries if the Excel file is open in another application
- Pause file — create a
PAUSE.flagfile next to the script to pause after the current URL
Requirements
pip install requests openpyxl python-dotenv
Configuration
Both scripts require Internet Archive API credentials in a .env file:
IA_ACCESS_KEY=your_access_key
IA_SECRET_KEY=your_secret_key
Get your credentials at https://archive.org/account/s3.php.
Usage
# Navigate to the scripts folder
cd archived-sites/mmdc.nl/_archiving-artifacts/scripts
# Submit non-catalog pages (static HTML, PDFs, images)
python SaveToWBM_mmdc_non-catalog-pages.py
# Submit catalog pages (pre-rendered HTML)
python SaveToWBM_mmdc_catalog-pages.py
Both scripts read URLs from mmdc-urls-unified_15042026.xlsx and write results back to the same Excel (WBM URL, timestamp, HTTP status). To reset and start fresh, delete the corresponding progress JSON file in _archiving-artifacts/data/.
Adapting for other projects
These scripts can be adapted for other websites by:
- Changing the Excel file path and sheet name
- Adjusting the column mapping (which columns to read URLs from, which to write results to)
- Updating the
.envcredentials
The core submission logic (SPN2 API calls, retry handling, progress tracking) is reusable as-is.
For more context, see the mmdc.nl archiving documentation and the mmdc.nl lessons learned.
3. manuscripts.kb.nl-specific scripts
These scripts for WBM submission were developed for the manuscripts.kb.nl archiving project and are located in _archiving-artifacts/scripts/ for WBM submission.
| Script | Description |
|---|---|
| SaveToWBM_manuscripts_wiki_priority.py | Phase 1: fetches manuscripts.kb.nl URLs linked from Dutch Wikipedia and Wikimedia Commons via the MediaWiki API, then submits them to WBM. Used 10-11 Dec 2025: 61/61 (100%) archived in ~23 minutes. |
| SaveToWBM_manuscripts_bulk.py | Phase 2: bulk submission of all 7,433 spidered URLs, processed sheet by sheet (smallest first). Used 11-14 Dec 2025: 7,433/7,433 (100%) archived with <0.1% transient error rate. |
Features
The archiving scripts share the same robustness features as the mmdc.nl scripts:
- Resume capability — progress saved after every URL
- Automatic retry — exponential backoff on transient failures
- Rate-limit handling — 17s base delay, 5-minute pause on HTTP 429
- Streaming Excel updates — results written back to the spreadsheet every 5-10 URLs
Usage
# Navigate to the scripts folder
cd archived-sites/manuscripts.kb.nl/_archiving-artifacts/scripts
# Phase 1: Archive wiki-priority URLs
python SaveToWBM_manuscripts_wiki_priority.py
# Phase 2: Bulk archive all URLs
python SaveToWBM_manuscripts_bulk.py
For more context, see the manuscripts.kb.nl archiving documentation.