Offline Wikipedia Sync

Did you know that all of Wikipedia, compressed, is under 100 GB? I'll admit this is pretty tin-foil hat of me, but I find it fun to treat some aspects of life as a security exercise: identify the things I want to protect, identify the threats, and take steps to mitigate them. This mostly manifests as preparing for natural disasters and the like. One of the threats I've identified is total loss of internet access for an extended period of time. I'm sure most of you are like me: I rely heavily on the internet for information on how to perform tasks, especially tasks I rarely, if ever, do. Suppose we've lost internet entirely, and a critical piece of gear fails that we need to repair. Wouldn't it be nice to have a giant repository of information to dig through to figure it out? To mitigate this, I'm going to set up a local mirror of Wikipedia! My goals are:

  • Weekly sync of all of Wikipedia
  • Hosted on docker for local access

The stack

I'll be using Kiwix, which will serve the ZIM backup of Wikipedia. I'll create a VM on my Proxmox server dedicated to this, then set up a cron job that downloads updates to the ZIM every Wednesday night or something like that.

Setup

I'm just going to assume you already have a home lab if you're interested in this, or perhaps you have an RPi lying around that you want to do this on. Either way, I'm going to skip the part where we get a server running somewhere. I've deployed headless Ubuntu to Proxmox.

Install Docker

apt-get update && \
apt-get install -y \
    ca-certificates \
    curl \
    gnupg \
    lsb-release && \
mkdir -p /etc/apt/keyrings && \
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o /etc/apt/keyrings/docker.gpg && \
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
  $(lsb_release -cs) stable" | tee /etc/apt/sources.list.d/docker.list > /dev/null && \
apt-get update && \
apt-get install -y docker-ce docker-ce-cli containerd.io docker-compose-plugin

Download a .Zim to start with

Check the available dumps at https://download.kiwix.org/zim/. I'm going to download all of Wikipedia with pictures!

mkdir -p /var/kiwix/zims && cd /var/kiwix/zims && nohup wget https://download.kiwix.org/zim/wikipedia/wikipedia_en_all_maxi.zim &

Check that it's started by printing the nohup.out file:

tail nohup.out
135550K .......... .......... .......... .......... ..........  0% 2.03M 6h41m
135600K .......... .......... .......... .......... ..........  0% 24.0M 6h40m
135650K .......... .......... .......... .......... ..........  0% 2.08M 6h41m
135700K .......... .......... .......... .......... ..........  0% 26.2M 6h40m
135750K .......... .......... .......... .......... ..........  0% 2.07M 6h41m
135800K .......... .......... .......... .......... ..........  0% 26.9M 6h40m
135850K .......... .......... .......... .......... ..........  0% 2.03M 6h41m
135900K .......... .......... .......... .......... ..........  0% 25.4M 6h40m
135950K .......... .......... .......... .......... ..........  0% 1.96M 6h41m

Sweet, it'll get there eventually!

Configure the Docker Container

Because I want TLS inside the network, I use nginx everywhere to perform TLS termination and handle domain routing. That means I'll have a few containers running, so I'll use Compose to slap 'em together. This is the compose file I ended up with:

version: "3"
networks:
  kiwix:
    external: false
services:
  kiwix:
    image: kiwix/kiwix-serve
    container_name: kiwix
    command: wikipedia_en_all_maxi.zim
    restart: always
    networks:
      - kiwix
    volumes:
      - /var/kiwix/zims:/data
  nginx:
    image: nginx
    networks:
      - kiwix
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      - ./tls:/etc/tls:ro
    ports:
      - 0.0.0.0:80:80
      - 0.0.0.0:443:443
    restart: unless-stopped

Note: Be sure to change the command value to whichever ZIM filenames you want loaded into Kiwix.
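kiwix-serve accepts any number of ZIM files as arguments, so once a second dump (like the OpenStreetMap wiki one I grab later in this post) is sitting in /var/kiwix/zims, the service definition only needs its command line extended. A sketch of the amended kiwix service:

```yaml
  kiwix:
    image: kiwix/kiwix-serve
    container_name: kiwix
    # kiwix-serve takes multiple ZIM filenames; paths are relative to the /data mount
    command: wikipedia_en_all_maxi.zim openstreetmap-wiki_en_all_maxi.zim
    restart: always
    networks:
      - kiwix
    volumes:
      - /var/kiwix/zims:/data
```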

Generate TLS certificate

cd /var/kiwix && mkdir tls && cd tls
openssl req -newkey rsa:2048 -nodes -keyout key.pem -x509 -days 365 -out certificate.pem
openssl pkcs12 -inkey key.pem -in certificate.pem -export -out certificate.p12
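The req step prompts for the certificate fields interactively (and the pkcs12 export asks for an export password). If you'd rather skip the prompts, -subj fills in the subject non-interactively; kiwix.salmon.sec is my internal hostname, so substitute your own. Either way, you can sanity-check the result afterwards:

```shell
# non-interactive variant of the same cert generation (CN is my internal hostname)
openssl req -newkey rsa:2048 -nodes -keyout key.pem -x509 -days 365 \
    -subj "/CN=kiwix.salmon.sec" -out certificate.pem

# sanity-check: print the subject and validity window of the new cert
openssl x509 -in certificate.pem -noout -subject -dates
```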

Configure NGINX

touch /var/kiwix/nginx.conf

Content:

events {}
http {
    server {
        listen 80;
        listen [::]:80;
        server_name kiwix.salmon.sec;
        location / {
            return 301 https://$host$request_uri;
        }
    }
    server {
        listen 443 ssl;
        listen [::]:443 ssl;
        ssl_certificate /etc/tls/certificate.pem;
        ssl_certificate_key /etc/tls/key.pem;
        server_name kiwix.salmon.sec;
        location / {
            proxy_set_header Host            kiwix.salmon.sec;
            proxy_pass http://kiwix;
        }
    }
}

Bring up services

docker compose up -d

Test Web Interface

curl localhost:80
<html>
<head><title>301 Moved Permanently</title></head>
<body>
<center><h1>301 Moved Permanently</h1></center>
<hr><center>nginx/1.23.1</center>
</body>
</html>
curl --insecure https://localhost
... web response


Cronjob

We'll need a cron job to keep the file up to date. To work out how to detect a newer version, I'll have to poke around on the download site. First, let's just read the index:

curl https://download.kiwix.org/zim/
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
 <head>
  <title>Index of /zim</title>
 </head>
 <body>
<h1>Index of /zim</h1>
<pre>
<img src="/icons/blank.gif" alt="Icon "> 
    <a href="?C=N;O=D">Name</a>
    <a href="?C=M;O=A">Last modified</a>      
    <a href="?C=S;O=A">Size</a>  
    <a href="?C=D;O=A">Description</a><hr><img src="/icons/back.gif" 
    alt="[PARENTDIR]"> 
    <a href="/">Parent Directory</a>                             -
<img src="/icons/hand.right.gif" alt="[   ]"> <a href="README">README</a>                  2019-11-23 15:58  1.2K
<img src="/icons/folder.gif" alt="[DIR]"> <a href="gutenberg/">gutenberg/</a>              2022-08-29 02:55    -
<img src="/icons/folder.gif" alt="[DIR]"> <a href="ifixit/">ifixit/</a>                 2022-08-23 17:17    -
<img src="/icons/folder.gif" alt="[DIR]"> <a href="mooc/">mooc/</a>                   2022-04-24 06:25    -
<img src="/icons/folder.gif" alt="[DIR]"> <a href="other/">other/</a>                  2022-09-01 22:57    -
<img src="/icons/folder.gif" alt="[DIR]"> <a href="phet/">phet/</a>                   2021-08-04 06:25    -
<img src="/icons/folder.gif" alt="[DIR]"> <a href="psiram/">psiram/</a>                 2021-04-11 13:43    -
<img src="/icons/folder.gif" alt="[DIR]"> <a href="stack_exchange/">stack_exchange/</a>         2022-08-16 16:06    -
<img src="/icons/folder.gif" alt="[DIR]"> <a href="ted/">ted/</a>                    2022-08-10 19:16    -
<img src="/icons/folder.gif" alt="[DIR]"> <a href="videos/">videos/</a>                 2022-09-04 10:24    -
<img src="/icons/folder.gif" alt="[DIR]"> <a href="vikidia/">vikidia/</a>                2022-06-03 06:34    -
<img src="/icons/folder.gif" alt="[DIR]"> <a href="wikibooks/">wikibooks/</a>              2022-08-22 04:34    -
<img src="/icons/folder.gif" alt="[DIR]"> <a href="wikihow/">wikihow/</a>                2022-09-02 13:33    -
<img src="/icons/folder.gif" alt="[DIR]"> <a href="wikinews/">wikinews/</a>               2022-08-22 04:34    -
<img src="/icons/folder.gif" alt="[DIR]"> <a href="wikipedia/">wikipedia/</a>              2022-09-04 15:11    -
<img src="/icons/folder.gif" alt="[DIR]"> <a href="wikiquote/">wikiquote/</a>              2022-08-23 18:32    -
<img src="/icons/folder.gif" alt="[DIR]"> <a href="wikisource/">wikisource/</a>             2022-08-22 04:34    -
<img src="/icons/folder.gif" alt="[DIR]"> <a href="wikiversity/">wikiversity/</a>            2022-08-22 04:34    -
<img src="/icons/folder.gif" alt="[DIR]"> <a href="wikivoyage/">wikivoyage/</a>             2022-08-22 04:34    -
<img src="/icons/folder.gif" alt="[DIR]"> <a href="wiktionary/">wiktionary/</a>             2022-09-02 13:33    -
<img src="/icons/folder.gif" alt="[DIR]"> <a href="zimit/">zimit/</a>                  2022-08-30 08:36    -
<hr></pre>
</body></html>

This seems nice; we should be able to easily parse this with grep against our last known download date. Let's deal with the main Wikipedia dump first: the listing at https://download.kiwix.org/zim/wikipedia/ contains both the modification date and the filename for it. For now, just testing in a shell script, let's extract the modification date for wikipedia_en_all_maxi.zim:

#!/bin/bash
ROOT_URL=https://download.kiwix.org/zim/wikipedia/
RESP=$(curl -v --silent "$ROOT_URL" 2>&1 | grep wikipedia_en_all_maxi)
if [[ "$RESP" == "" ]]; then
    echo "Invalid response from download.kiwix.org, no match for 'wikipedia_en_all_maxi.zim'."
    echo
    curl "$ROOT_URL"
    exit 404
fi
NEW_DATE=""
NEW_URL=""
while read line ; do
    DATE=$(echo $line | grep -o -E '[0-9]{4}-[0-9]{2}-[0-9]{2}')
    if [[ "$NEW_DATE" == "" ]]; then
        NEW_DATE="$DATE"
        FILENAME=$(echo "$line" | grep -o -P '(?<=href\=").*(?=")')
        NEW_URL="$ROOT_URL$FILENAME"
    else
        if [[ $(date --date="$DATE" +%s) -gt $(date --date="$NEW_DATE" +%s) ]]; then
            NEW_DATE="$DATE"
            FILENAME=$(echo "$line" | grep -o -P '(?<=href\=").*(?=")')
            NEW_URL="$ROOT_URL$FILENAME"
        fi
    fi
done <<< "$RESP"
echo "Latest available download is $NEW_DATE at $NEW_URL"

This produces:

Latest available download is 2022-05-11 at https://download.kiwix.org/zim/wikipedia/wikipedia_en_all_maxi_2022-05.zim
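To make the two greps in that loop less magical, here's the same extraction run against a single sample line shaped like the index's entries (the 12:38 timestamp and 89G size are made up for illustration):

```shell
# a representative line from the Apache directory index (sample, not live output)
LINE='<a href="wikipedia_en_all_maxi_2022-05.zim">wikipedia_en_all_maxi_2022-05.zim</a>  2022-05-11 12:38   89G'

# the href lookaround pulls out the filename; the date regex pulls out "Last modified"
FILENAME=$(echo "$LINE" | grep -o -P '(?<=href\=").*(?=")')
DATE=$(echo "$LINE" | grep -o -E '[0-9]{4}-[0-9]{2}-[0-9]{2}')

echo "$FILENAME"   # wikipedia_en_all_maxi_2022-05.zim
echo "$DATE"       # 2022-05-11
```

Note that the date regex skips the `2022-05` embedded in the filename (no second day group) and only matches the full `2022-05-11` in the "Last modified" column.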

Now we just need to decide where to place this on our system and make some changes to how we're storing the files and mounting them into Docker. Currently, the zims folder looks like this:

root@ubuntu2004:/var/kiwix/zims# tree
.
├── openstreetmap-wiki_en_all_maxi.zim
└── wikipedia_en_all_maxi.zim

We're going to need a nested structure, with each ZIM symlinked back into the root. I'll perform this manually the first time so we can finish writing our cron job. The end result was this:

root@ubuntu2004:/var/kiwix/zims# tree
.
├── openstreet
│   └── 2021-03-09
│       └── openstreetmap-wiki_en_all_maxi.zim
├── openstreetmap-wiki_en_all_maxi.zim -> ./openstreet/2021-03-09/openstreetmap-wiki_en_all_maxi.zim
├── wikipedia
│   └── 2022-05-11
│       └── wikipedia_en_all_maxi.zim
└── wikipedia_en_all_maxi.zim -> ./wikipedia/2022-05-11/wikipedia_en_all_maxi.zim
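For reference, the manual shuffle that produced this layout was roughly the following (the dates come from the files I already had on disk; yours will differ):

```shell
cd /var/kiwix/zims

# tuck each existing ZIM into <name>/<date>/ ...
mkdir -p wikipedia/2022-05-11 openstreet/2021-03-09
mv wikipedia_en_all_maxi.zim wikipedia/2022-05-11/
mv openstreetmap-wiki_en_all_maxi.zim openstreet/2021-03-09/

# ...then symlink each back into the root, so Docker's mount path never changes
ln -s ./wikipedia/2022-05-11/wikipedia_en_all_maxi.zim wikipedia_en_all_maxi.zim
ln -s ./openstreet/2021-03-09/openstreetmap-wiki_en_all_maxi.zim openstreetmap-wiki_en_all_maxi.zim
```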

There's a parent folder for each backup; under that, a folder for each dated download; and the ZIM inside the date folder is symlinked back into the zims root. This lets the script simply update the symlink without making any changes to Docker. Now, let's parse this folder structure in our cron job to compare against the 'new' download. I've also replaced the static variables with simple positional CLI args.

#!/bin/bash
ROOT_URL="$1"
ROOT_DIR="$2"
FILENAME="$3"
echo $FILENAME
CURLGREP=${FILENAME/".zim"/""}
RESP=$(curl -v --silent "$ROOT_URL" 2>&1 | grep $CURLGREP)
if [[ "$RESP" == "" ]]; then
    echo "Invalid response from download.kiwix.org, no match for '$CURLGREP'."
    echo
    curl "$ROOT_URL"
    exit 404
fi
NEW_DATE=""
NEW_URL=""
# NOTE: the href result goes into its own variable so we don't clobber
# $FILENAME (arg $3), which the download and symlink steps below rely on.
while read -r line ; do
    DATE=$(echo "$line" | grep -o -E '[0-9]{4}-[0-9]{2}-[0-9]{2}')
    HREF=$(echo "$line" | grep -o -P '(?<=href\=").*(?=")')
    if [[ "$NEW_DATE" == "" ]]; then
        NEW_DATE="$DATE"
        NEW_URL="$ROOT_URL$HREF"
    else
        if [[ $(date --date="$DATE" +%s) -gt $(date --date="$NEW_DATE" +%s) ]]; then
            NEW_DATE="$DATE"
            NEW_URL="$ROOT_URL$HREF"
        fi
    fi
done <<< "$RESP"
echo "Latest available download is $NEW_DATE at $NEW_URL"
echo
for i in $(ls -d $ROOT_DIR/*); do 
    DATE=${i/"$ROOT_DIR"\//""}
    if [[ "$DATE" == "$NEW_DATE" ]]; then
        echo "Version already present on system, nothing to do. Exiting."
        exit 0
    fi
done
echo "Version not found on system, downloading..."
cd "$ROOT_DIR"
mkdir "$NEW_DATE" && cd "$NEW_DATE"
wget -O "$FILENAME" "$NEW_URL"
cd "$ROOT_DIR/.."
ln -s -f "$ROOT_DIR/$NEW_DATE/$FILENAME" "./$FILENAME"
echo
tree
echo "Backup complete!"
exit 0

Now it's ready to rock! Let's just copy it to our system and make a cronjob for it.

$ cd /var/kiwix && mkdir cron && cd cron && touch download_update.sh
$ vim download_update.sh # paste script
$ chmod +x download_update.sh
$ crontab -e
...
0 1 * * 7 /var/kiwix/cron/download_update.sh "https://download.kiwix.org/zim/wikipedia/" "/var/kiwix/zims/wikipedia" "wikipedia_en_all_maxi.zim" | tee -a /var/log/kiwix_backup.log
0 1 * * 6 /var/kiwix/cron/download_update.sh "https://download.kiwix.org/zim/other/" "/var/kiwix/zims/openstreet" "openstreetmap-wiki_en_all_maxi.zim" | tee -a /var/log/kiwix_backup.log
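For anyone who doesn't read crontab fluently, the five schedule fields are minute, hour, day-of-month, month, and day-of-week (where both 0 and 7 mean Sunday), so the two entries break down as:

```
# ┌ minute ┌ hour ┌ day-of-month ┌ month ┌ day-of-week
  0        1      *              *       7    # Wikipedia:      01:00 every Sunday
  0        1      *              *       6    # OpenStreetMap:  01:00 every Saturday
```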

Annnnnnnnnnnnndddd TIME!