
feat: Move mirror creation to the cloud #123

Closed · wants to merge 9 commits
Conversation

@galargh (Contributor) commented Mar 14, 2022

I've seen #120 and was inspired to try building a Wikipedia mirror myself to see how it gets done. To make the exercise more useful, I decided to try to hunt down the issue that affected the Belarusian Wikipedia mentioned in the original issue. Since the steps were taking quite a long time on my machine, I decided to go beyond the original scope and take as many parts of the process off of my machine as possible.

In this PR:

  1. I update the Dockerfile so that it is on par with README.md and fully prepared to execute mirrorzim.sh.
  2. I add a Terraform configuration which creates a public ECR repository where Docker images can be stored. I also created such a repository in my PL AWS account and stored a Docker image there.
  3. I add a Terraform configuration which can be used to create an EC2 machine that exposes the IPFS ports to the world and has Docker installed. I used it myself to run the published Docker image and create a Belarusian Wikipedia mirror.
Original Proposal (outdated)
  • I create a GitHub Actions workflow which takes the same inputs as mirrorzim.sh, runs mirrorzim.sh in a container created from the newly updated Dockerfile, packages the outputs as tar.gz and finally uploads them to S3 (the S3 bucket creation is automated in the terraform directory added in this PR too)
  • I create a bash script which downloads the package created by the workflow from S3, unpacks it and adds the contents to IPFS
  • I create a Packer configuration which creates an AWS AMI in which IPFS is started on boot and the aforementioned bash script is available on PATH
  • I create a Terraform configuration which knows how to create the aforementioned S3 bucket AND an EC2 instance which runs the AMI
  • I fix the issue that affected the Belarusian Wikipedia - details in fix: main page in case it was created from exception #122
Testing
  1. I ran `terraform apply` in `terraform/ecr` to create the public ECR repository.
  2. I ran `docker build . --platform=linux/amd64 -f Dockerfile -t public.ecr.aws/c4h1q7d1/distributed-wikipedia-mirror -t public.ecr.aws/c4h1q7d1/distributed-wikipedia-mirror:$(date -u +%F) -t public.ecr.aws/c4h1q7d1/distributed-wikipedia-mirror:$(date -u +%s)` to create the Docker image.
  3. I ran `docker push --all-tags public.ecr.aws/c4h1q7d1/distributed-wikipedia-mirror` to push the image to the public ECR.
  4. I ran `terraform apply` in `terraform/ec2` to create an EC2 instance for myself.
  5. I ran `ssh -i <private_key> ec2-user@<public_dns>` to ssh into the machine.
  6. I ran `docker run --name wikipedia-on-ipfs --ulimit nofile=65536:65536 -d -p 4001:4001/tcp -p 4001:4001/udp public.ecr.aws/c4h1q7d1/distributed-wikipedia-mirror:latest --languagecode=be --wikitype=wikipedia` to create and publish a Belarusian Wikipedia mirror.
  7. I waited for the docker logs to show the CID - bafybeihs2ql4lnd5v7oscxwbblsqnp6krlvbvak4k3wmqvnb32cg73cpiq.
Original Proposal (outdated)
  1. I ran `packer build wikipedia-on-ipfs.pkr.hcl` which successfully created ami-02ff7a8cff61c5d41.
  2. I ran `terraform apply` which successfully created the S3 bucket and the EC2 instance (I think we should split the configs for the two).
  3. I ran the workflow to create the Belarusian wiki mirror - https://github.com/galargh/distributed-wikipedia-mirror/actions/runs/1980575877
    It spat out the following notice:
    You can now publish wikipedia_be_all_maxi_2022-03
    publish_website_from_s3.sh 'wikipedia_be_all_maxi_2022-03'
  4. I ssh'ed to the EC2 instance and ran `> publish.out publish_website_from_s3.sh 'wikipedia_be_all_maxi_2022-03' &` (the redirection before the command sends the script's stdout to publish.out while it runs in the background).
  5. I ssh'ed to the EC2 instance again, checked that publish_website_from_s3.sh was no longer running and copied the CID from publish.out.
  6. I can access the Belarusian wiki on https://bafybeic25i5malcaznpqgh6hcrqbntmfmebkutx44mzb3oyhnceen2s47a.ipfs.dweb.link/wiki/ and the links are working now 🥳

TODO

@lidel (Member) left a comment


Interesting! @galargh any idea how this cloud setup would behave/cost when a bigger ZIM is used?

The biggest one is wikipedia_en_all_maxi_2021-12.zim – 87G compressed, ~360GB unpacked on IPFS.
I was able to build it on a 1TB SSD; everything else was too slow. I'm wondering if we could get an emergency fast-build capability with this setup; at the same time I am not sure what the cost would be (and how it compares to buying a physical 1TB SSD).
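For scale, the numbers above imply roughly a 4x expansion when a ZIM is unpacked onto IPFS:

```shell
# Unpacked-to-compressed ratio for wikipedia_en_all_maxi_2021-12
# (87 GB compressed, ~360 GB unpacked, per the figures above)
awk 'BEGIN { printf "expansion: %.1fx\n", 360 / 87 }'
# → expansion: 4.1x
```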

Any guesstimates / back-of-napkin calculations?


Sidenote: Be mindful that the more we invest in the unpacking process, the harder it will be to avoid the sunk cost fallacy. On principle, I am worried about subsidizing AWS with all this unnecessary unpacking and building; I would rather donate the money to https://www.kiwix.org/en/support/ or fund a devgrant to remove the need for unpacking ZIMs (https://github.com/ipfs/devgrants/blob/devgrant/kiwix-js/targeted-grants/kiwix-js.md, #42 (comment)).

tools/start_ipfs.sh: outdated review thread (resolved)
@galargh (Contributor, Author) commented Mar 17, 2022

> Interesting! @galargh any idea how this cloud setup would behave/cost when a bigger ZIM is used?

Good point, I didn't think of bigger ZIMs at all. Bumping the SSD to 1TB on EC2 in the current setup would put the total yearly cost at around $1500 (see https://calculator.aws/#/estimate?id=bd698299f8d4943c9d41130fbaac0fa4d28bf24a), assuming the machine runs all year round and we stick with the default volume properties.
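A back-of-napkin version of that estimate (all rates are assumptions modeled on 2022-era us-east-1 on-demand pricing; the AWS calculator link above is authoritative):

```shell
#!/bin/sh
# Hedged yearly cost sketch: a 1 TB gp3 volume plus a small always-on instance.
# Both rates are assumptions, not quotes.
EBS_GB=1024        # volume size in GB
EBS_RATE=0.08      # USD per GB-month for gp3 (assumed)
EC2_RATE=0.0832    # USD per hour, e.g. t3.large on-demand (assumed)
storage=$(awk -v g=$EBS_GB -v r=$EBS_RATE 'BEGIN { printf "%.0f", g * r * 12 }')
compute=$(awk -v r=$EC2_RATE 'BEGIN { printf "%.0f", r * 24 * 365 }')
echo "storage: \$${storage}/yr, compute: \$${compute}/yr, total: \$$((storage + compute))/yr"
```

With these assumed rates the sketch lands in the same ballpark as the ~$1500 figure; swap in real rates from the calculator for an actual quote.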

That also poses a problem with GitHub Actions, as by default a runner only gets ~30GB of disk space.


My main goal with this setup was to try to support contributors (such as myself) that cannot complete the mirror creation on their own machines. I think in the meantime I might have gone a bit overboard. So let me take a step back and propose some simplifications.

Let's:

  • get rid of GitHub Actions
  • get rid of AMI creation
  • update the Dockerfile so that it's on par with the README
  • create a place where the docker image created from that Dockerfile could be stored
  • describe how to create a machine in AWS which can run that docker image
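For the last bullet, the "machine which can run that docker image" can be as small as an EC2 instance whose user data installs Docker. A hedged user-data configuration sketch for Amazon Linux 2 (the package and service names here are the stock AL2 ones; other distros will differ):

```shell
#!/bin/sh
# EC2 user-data sketch: prepare an Amazon Linux 2 host to run the image.
yum install -y docker
systemctl enable --now docker
usermod -aG docker ec2-user   # let ec2-user run docker without sudo
```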

I'll share the updated code shortly. Thank you for your comments, I found them really helpful :)

@kelson42 commented
On the zimdump side, we still have potential to improve speed with openzim/zim-tools#69. We made a similar multithreading implementation for zimcheck with success (though I don't remember the benchmarking results).

@galargh (Contributor, Author) commented May 13, 2022

I'm closing this PR in favor of one that only updates the Dockerfile - #126

@galargh galargh closed this May 13, 2022