feat: wikipedia_uk_all_maxi_2022-03 and wikipedia_ru_all_maxi_2022-03 #120

Merged 3 commits on Mar 16, 2022
README.md: 26 changes (17 additions & 9 deletions)
@@ -15,6 +15,7 @@ Putting Wikipedia Snapshots on IPFS and working towards making it fully read-write
- https://my.wikipedia-on-ipfs.org
- https://ar.wikipedia-on-ipfs.org
- https://zh.wikipedia-on-ipfs.org
+- https://uk.wikipedia-on-ipfs.org
- https://ru.wikipedia-on-ipfs.org
- https://fa.wikipedia-on-ipfs.org

@@ -115,24 +116,31 @@ It is advised to use a separate IPFS node for this:

```console
$ export IPFS_PATH=/path/to/IPFS_PATH_WIKIPEDIA_MIRROR
-$ ipfs init -p server,local-discovery,badgerds,randomports --empty-repo
+$ ipfs init -p server,local-discovery,flatfs,randomports --empty-repo
```
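
As long as `IPFS_PATH` stays exported, every subsequent `ipfs` invocation (including the daemon) operates on this dedicated repo instead of the default `~/.ipfs`. A minimal sketch:

```console
$ export IPFS_PATH=/path/to/IPFS_PATH_WIKIPEDIA_MIRROR
$ ipfs daemon
```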

-#### Tune datastore for speed
+#### Tune DHT for speed

-Make sure repo is initialized with datastore backed by `badgerds` for improved performance, or if you choose to use slower `flatfs` at least use it with `sync` set to `false`.
+Wikipedia has a lot of blocks; to publish them as fast as possible,
+enable the [Accelerated DHT Client](https://github.com/ipfs/go-ipfs/blob/master/docs/experimental-features.md#accelerated-dht-client):

-**NOTE:** While badgerv1 datastore _is_ faster, one may choose to avoid using it with bigger builds like English because of [memory issues due to the number of files](https://github.com/ipfs/distributed-wikipedia-mirror/issues/85). Potential workaround is to use [`filestore`](https://github.com/ipfs/go-ipfs/blob/master/docs/experimental-features.md#ipfs-filestore) that avoids duplicating data and reuses unpacked files as-is.
+```console
+$ ipfs config --json Experimental.AcceleratedDHTClient true
+```
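
The flag can be read back to confirm it took effect:

```console
$ ipfs config Experimental.AcceleratedDHTClient
true
```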

-#### Enable HAMT sharding
+#### Tune datastore for speed

-Configure your IPFS node to enable directory sharding
+Make sure the repo uses `flatfs` with `sync` set to `false`:

-```sh
-$ ipfs config --json 'Experimental.ShardingEnabled' true
+```console
+$ ipfs config --json 'Datastore.Spec.mounts' "$(ipfs config 'Datastore.Spec.mounts' | jq -c '.[0].child.sync=false')"
```
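
Reading the value back with the same `jq` filter confirms the change:

```console
$ ipfs config 'Datastore.Spec.mounts' | jq '.[0].child.sync'
false
```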

-This step won't be necessary when automatic sharding lands in go-ipfs (wip).
+**NOTE:** While the badgerv1 datastore is faster in some configurations, we choose to avoid it with bigger builds like English because of [memory issues due to the number of files](https://github.com/ipfs/distributed-wikipedia-mirror/issues/85). A potential workaround is [`filestore`](https://github.com/ipfs/go-ipfs/blob/master/docs/experimental-features.md#ipfs-filestore), which avoids duplicating data and reuses unpacked files as-is.

+#### HAMT sharding
+
+Make sure you use go-ipfs 0.12 or later, which shards big directories automatically.
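
A quick version check confirms the node qualifies (example output for an assumed 0.12.0 install):

```console
$ ipfs version
ipfs version 0.12.0
```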

### Step 3: Download the latest snapshot from kiwix.org

mirrorzim.sh: 2 changes (1 addition & 1 deletion)
@@ -84,7 +84,7 @@ fi

printf "\nEnsure zimdump is present...\n"
PATH=$PATH:$(realpath ./bin)
-which zimdump &> /dev/null || (curl --progress-bar -L https://download.openzim.org/release/zim-tools/zim-tools_linux-x86_64-3.0.0.tar.gz | tar -xvz --strip-components=1 -C ./bin zim-tools_linux-x86_64-3.0.0/zimdump && chmod +x ./bin/zimdump)
+which zimdump &> /dev/null || (curl --progress-bar -L https://download.openzim.org/release/zim-tools/zim-tools_linux-x86_64-3.1.0.tar.gz | tar -xvz --strip-components=1 -C ./bin zim-tools_linux-x86_64-3.1.0/zimdump && chmod +x ./bin/zimdump)

printf "\nDownload and verify the zim file...\n"
ZIM_FILE_SOURCE_URL="$(./tools/getzim.sh download $WIKI_TYPE $WIKI_TYPE $LANGUAGE_CODE all maxi latest | grep 'URL:' | cut -d' ' -f3)"
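
Since the pinned zim-tools version appears twice in the zimdump one-liner above, a refactor (a sketch only; `ZIM_TOOLS_VERSION` is a hypothetical variable, not part of the script) would keep future version bumps to a single edit:

```sh
# Pin the zim-tools release once and reuse it in both the URL and the archive path.
ZIM_TOOLS_VERSION=3.1.0
which zimdump &> /dev/null || (curl --progress-bar -L "https://download.openzim.org/release/zim-tools/zim-tools_linux-x86_64-${ZIM_TOOLS_VERSION}.tar.gz" \
  | tar -xvz --strip-components=1 -C ./bin "zim-tools_linux-x86_64-${ZIM_TOOLS_VERSION}/zimdump" && chmod +x ./bin/zimdump)
```
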
snapshot-hashes.yml: 13 changes (10 additions & 3 deletions)
@@ -37,13 +37,20 @@ zh:
  date: 2021-03-16
  ipns:
  ipfs: https://dweb.link/ipfs/bafybeiazgazbrj6qprr4y5hx277u4g2r5nzgo3jnxkhqx56doxdqrzms6y
+uk:
+  name: Ukrainian
+  original: uk.wikipedia.org
+  source: wikipedia_uk_all_maxi_2022-03.zim
+  date: 2022-03-09
+  ipns:
+  ipfs: https://dweb.link/ipfs/bafybeibiqlrnmws6psog7rl5ofeci3ontraitllw6wyyswnhxbwdkmw4ka
ru:
  name: Russian
  original: ru.wikipedia.org
-  source: wikipedia_ru_all_maxi_2021-03.zim
-  date: 2021-03-25
+  source: wikipedia_ru_all_maxi_2022-03.zim
+  date: 2022-03-12
  ipns:
-  ipfs: https://dweb.link/ipfs/bafybeieto6mcuvqlechv4iadoqvnffondeiwxc2bcfcewhvpsd2odvbmvm
+  ipfs: https://dweb.link/ipfs/bafybeiezqkklnjkqywshh4lg65xblaz2scbbdgzip4vkbrc4gn37horokq
fa:
  name: Persian
  original: fa.wikipedia.org
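
Each `ipfs:` entry above resolves through any public gateway; to co-host one of the new snapshots locally, its CID can be pinned directly (a sketch using the Ukrainian entry's CID from this diff):

```console
$ ipfs pin add bafybeibiqlrnmws6psog7rl5ofeci3ontraitllw6wyyswnhxbwdkmw4ka
```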