Lessons learned: archiving mmdc.nl to the Wayback Machine
Project: Medieval Manuscripts in Dutch Collections (mmdc.nl) preservation Team: Olaf Janssen (KB, human curator) & Claude/Jantien (AI assistant) Timeline: December 2025 - April 2026
1. The challenge
mmdc.nl - a medieval manuscripts database and cultural heritage resource hosted by the KB, national library of the Netherlands - was scheduled for shutdown on 15 December 2025. The site contained 12.204 URLs: 466 static pages/PDFs/assets and 11.738 manuscript catalog records. All needed to be preserved before the site went offline.
2. The JavaScript problem
The 11.738 catalog pages were the core challenge. Each page was a JavaScript Single Page Application (SPA):
- Browser loads an HTML shell with an empty
<div id="recordDetail"></div> - JavaScript (
ptp_v3.js) reads therecordIdfrom the URL - AJAX call fetches XML data from KB’s SRU API (
jsru.kb.nl) - XSLT transformation renders the content client-side
The Wayback Machine does NOT execute JavaScript. When WBM archives such pages, it captures only the empty HTML shell - the actual manuscript data is lost.
3. What we tried (and what failed)
Attempt 1: Standard WBM Save Page Now
Submit URLs directly to web.archive.org/save. Result: empty pages. WBM fetches raw HTML without executing JavaScript.
Attempt 2: WBM SPN2 API with authentication
Use Internet Archive credentials with capture flags. Result: still empty (9.311 bytes captured vs 33.545 when rendered). Even the authenticated API does not render JavaScript.
Attempt 3: Headless browser + WBM submit
Render the page locally with Playwright, then submit the URL to WBM. Result: still empty. WBM fetches fresh HTML from the server, ignoring any local rendering.
Method comparison experiment
We ran a controlled experiment comparing Method 2 (pre-render + submit) vs Method 3 (direct submit) on 5 test records:
| Method | Avg time/URL | Success rate | Projected time for 11.738 pages |
|---|---|---|---|
| Pre-render + submit | 41,4 s | 80% | 135 hours (5,6 days) |
| Direct submit | 7,6 s | 100% | 25 hours (1 day) |
Direct submission was 5x faster - but both methods captured empty shells. Neither could solve the JavaScript rendering problem.
4. The solution: local rendering pipeline
Attempt 4: Local rendering with Playwright (success)
The breakthrough was to stop trying to make WBM render the JavaScript, and instead:
- Render locally with a headless browser (Playwright)
- Save as standalone HTML with all CSS inlined and images embedded as base64
- Upload the pre-rendered files to a temporary webserver under a clean URL path
- Submit those static URLs to WBM
Unexpected problems along the way
The shutdown banner: After rendering, the page still showed a red shutdown notice. Investigation revealed it wasn’t in the rendered HTML - it was dynamically injected by an external script (default.js) that the saved HTML still referenced. Solution: strip all external <script> tags.
The broken logo: The logo used a path with ../.. segments (/static/site/../../assets/images/logo_mmdc.gif) that didn’t resolve in the standalone file. Solution: embed all images as base64 data URIs, making the HTML completely self-contained.
The final rendering pipeline
Implemented in render_catalog_full.py:
- Load page in Playwright, wait for
#recordDetailto become non-empty - Extract rendered DOM with inlined CSS
- Remove shutdown banner and external scripts
- Convert relative URLs to absolute
- Embed images as base64
- Save as standalone HTML (~150-160 KB per page)
Result: 11.738 self-contained HTML files, each working completely offline.
5. The rate limiting disaster
Static pages (Phase 1)
The first archiving run for the 429 static pages used 12 concurrent connections. Result: 83% failure rate due to WBM rate limiting.
“Very many pages failed!! … this is disappointing!” - Olaf
Solution: A sequential archiver with 5-second delays between requests and exponential backoff. The slower approach achieved 429/429 (100%) success. PDFs needed an extra-slow retry (10-second delays).
Key insight: Patience over speed. Going slower (5-10s between requests) achieved 100% success vs 83% failure with concurrent requests.
Catalog pages (Phase 2)
The 11.738 catalog pages were submitted sequentially over 6 calendar days (Apr 2-7, 2026) at ~2.000-3.000 submissions/day. After an initial pass (11.658 successful) and a 34-minute retry of the remaining 80 pages: 11.738/11.738 (100%) indexed in the Wayback Machine.
6. Key lessons for web archivists
- JavaScript is the enemy of web archiving. Modern SPAs require special handling. Standard crawlers capture empty shells.
- Always verify WBM captures. Don’t assume successful submission = successful capture. Open the archived page and check the content is actually there.
- Headless browsers are essential. Playwright, Puppeteer, or Selenium are necessary for JavaScript-rendered content.
- Watch for dynamic injection. External scripts can re-inject content (banners, tracking) even after removal. Strip all external script references.
- Base64 embedding = self-contained archives. Embedding images as data URIs creates truly standalone HTML files with no external dependencies.
- Go slow with WBM submissions. Sequential requests with delays beat concurrent requests. The WBM rate limiter is aggressive.
- Institutional relationships matter. Contact the Internet Archive early. Libraries and cultural heritage institutions often get special consideration.
7. Human-AI collaboration
This project was a collaboration between Olaf Janssen (human curator at the KB) and Claude/Jantien (AI assistant). The AI was renamed “Jantien” early in the project.
“no, you are called Jantien from now, Claude is your maiden name!” - Olaf
The collaboration pattern: human makes decisions and sets direction, AI executes technical tasks (writing scripts, debugging, documentation). Session logs served as shared memory between sessions. The iterative problem-solving - trying approaches, analyzing failures, pivoting - was genuinely collaborative.
8. Tools developed
| Script | Purpose |
|---|---|
| render_catalog_full.py | Render all 11.738 catalog pages to standalone HTML |
| render_static_pages.py | Render static pages to standalone HTML |
| SaveToWBM_mmdc_non-catalog-pages.py | Submit static pages to WBM with rate limiting |
| SaveToWBM_mmdc_catalog-pages.py | Submit catalog pages to WBM with rate limiting |
Combined from experiment reports and lessons-learned documents, December 2025 - April 2026