Eliminate large files from git history #5

Open
r-barnes opened this issue Jul 2, 2019 · 16 comments

@r-barnes
Collaborator

r-barnes commented Jul 2, 2019

The repo contains a number of large files that you likely wanted to ignore; the largest are listed below. Collectively, they make the repo a 100MB download.

41e6f427c11b  7.7MiB analysis/output_files/ALT_DATA2_OUT/fft/fft_results.gif
a8267f9be190  7.9MiB analysis/output_files/results_1/xcor/cross-correlations.txt
e271bcab6381   11MiB analysis/output_files/results_1/fft/fft_results.gif
669261e09a05   21MiB analysis/output_data/ALT_DATA1_OUT/xcor/cross-correlations.txt
36cbe3d82cf2   36MiB scripts/core.45511
4ac01836f00a   36MiB scripts/core.53132
9c2bb6f1759f   36MiB scripts/core.171982
a6cecc16b57b   57MiB analysis/output_data/ALT_DATA1_OUT/fft/fft_analysis_animation.gif
6def6506d3f7   66MiB scripts/GENESIS.log
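
A listing like the one above can be produced with stock git plumbing. The following is only a sketch, not necessarily the command that generated this table: the helper name `largest_blobs` and the default count of 10 are illustrative, and sizes are printed as uncompressed bytes rather than MiB.

```shell
#!/usr/bin/env bash
# List the N largest blobs reachable from any ref (default N=10), largest last.
# Output columns: <object id> <uncompressed size in bytes> <path>
# `largest_blobs` is an illustrative helper name; run it inside a clone.
largest_blobs() {
  local count="${1:-10}"
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
    sed -n 's/^blob //p' |      # keep blobs only, drop commits and trees
    sort -k2,2n |               # sort numerically by size
    tail -n "$count"
}
```

Calling `largest_blobs 20` from inside a clone prints the twenty largest blobs, including ones that no longer exist in the current checkout.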

These can be removed with the BFG Repo-Cleaner using the following commands:

git clone --mirror https://github.com/kellykochanski/rescal-snow.git
java -jar ~/Downloads/bfg-1.12.13.jar --delete-folders 'output_files'  rescal-snow.git
java -jar ~/Downloads/bfg-1.12.13.jar --delete-folders 'output_data'  rescal-snow.git
java -jar ~/Downloads/bfg-1.12.13.jar --delete-files 'core.*'  rescal-snow.git
java -jar ~/Downloads/bfg-1.12.13.jar --delete-files 'GENESIS.log'  rescal-snow.git
java -jar ~/Downloads/bfg-1.12.13.jar --delete-files '*.o'  rescal-snow.git
java -jar ~/Downloads/bfg-1.12.13.jar --delete-files '*.py~'  rescal-snow.git
# Perhaps `scripts/DUN.csp` is also a temporary file? It takes up 10MB.

After that, check that things look alright, then run:

cd rescal-snow.git
git reflog expire --expire=now --all && git gc --prune=now --aggressive

The upside is that this reduces the repo size to either 11MB (with DUN.csp) or 1MB (without DUN.csp), which saves bandwidth and space for users.
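
One way to check the effect is to compare the packed size of the mirror before and after the cleanup. This is a minimal sketch; `repo_pack_size` is an illustrative helper name, and the figure it reports is the local on-disk pack size, which will not exactly match the download size GitHub serves.

```shell
#!/usr/bin/env bash
# Print the human-readable size of the packed objects in the current repository.
# `repo_pack_size` is an illustrative helper name.
repo_pack_size() {
  git count-objects -vH | awk -F': ' '$1 == "size-pack" { print $2 }'
}
```

Run it inside `rescal-snow.git` once before the BFG commands and again after the `git gc` step to see the difference.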

@kellykochanski
Owner

kellykochanski commented Jul 2, 2019 via email

@r-barnes
Collaborator Author

r-barnes commented Jul 3, 2019

They must be in the repo to appear in the readme, unless you host them elsewhere.

However, I don't think any of the files I've suggested purging are currently used by the repo. As far as I can tell, they are all large files that were mistakenly committed in the past. Removing them with git rm doesn't remove them from the history, so the repo only ever grows in size unless you rewrite history.
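
The point about git rm can be seen in a throwaway repository. This is a self-contained demonstration with made-up file names and commit messages; it does not touch rescal-snow.

```shell
#!/usr/bin/env bash
# Demonstration in a temporary repository: a file deleted with `git rm`
# is still stored in history.
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q .

head -c 1000000 /dev/urandom > big.bin   # 1 MB of incompressible data
git add big.bin
git -c user.name=demo -c user.email=demo@example.com commit -q -m "Add big.bin"

git rm -q big.bin
git -c user.name=demo -c user.email=demo@example.com commit -q -m "Remove big.bin"

# The file is gone from the working tree...
[ -e big.bin ] || echo "big.bin is gone from the working tree"
# ...but the blob is still reachable from the first commit.
git rev-list --objects --all | grep big.bin
```

The last command still prints the blob's object id next to `big.bin`, so every clone continues to download it.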

The files you show on the readme are stored in example_images and take only 3.2MB. They should be unaffected by the commands I suggest above.

@r-barnes
Collaborator Author

@kellykochanski: I thought we were fixing this prior to JOSS?

@kellykochanski
Owner

kellykochanski commented Sep 21, 2019

I haven't had time to get to it, and don't want to rush into messing with the git history.

@r-barnes
Collaborator Author

r-barnes commented Sep 25, 2019 via email

@kellykochanski
Owner

@r-barnes I used bfg as you suggested, and the repo is now 14MB (including the removal of DUN.csp - I think some additional docs with figures have been added since you opened this).

@r-barnes
Collaborator Author

r-barnes commented Sep 26, 2019 via email

@kellykochanski
Owner

kellykochanski commented Sep 26, 2019

bfg warned me... Any issue with just repeating the bfg calls after accepting the PRs?

@zbeekman
Contributor

I just went through a similar process with another repository, although in that case the goal was pruning and relocating sensitive information prior to open-sourcing a software package. I discovered that GitHub has write-protected refs for PRs. This means that you cannot prune data from these by default.

However, I think I have special settings in my git config to fetch these PR refs that most users do not have, so this may not be a real issue (at least not if you're only concerned about repo file size; it certainly is when you're removing sensitive info).
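
The comment does not show the actual config, so the following is an assumption about what such a setting typically looks like: an extra fetch refspec that maps GitHub's `refs/pull/<number>/head` refs into the local repository. The helper name `add_pr_fetch_refspec` and the `pr/` namespace are illustrative.

```shell
#!/usr/bin/env bash
# Assumption: the "special settings" resemble this commonly used refspec.
# It makes `git fetch <remote>` also download pull request heads.
# `add_pr_fetch_refspec` is a hypothetical helper name.
add_pr_fetch_refspec() {
  local remote="${1:-origin}"
  git config --add "remote.${remote}.fetch" \
    "+refs/pull/*/head:refs/remotes/${remote}/pr/*"
}
```

A clone without a refspec like this only fetches branches and tags, which is why most users would never download the PR refs. A `git clone --mirror`, by contrast, fetches every ref the server advertises.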

If it turns out that the PR refs keep the repository size bloated, then the only solutions are:

  1. Contacting GitHub support and asking them to delete the old PR refs (I'm not sure if they can/will do this for you)
  2. Deleting and recreating the repository.

Hopefully you won't need to do either and the PR refs won't muck this up for you.

@r-barnes
Collaborator Author

r-barnes commented Sep 26, 2019 via email

@zbeekman
Contributor

@r-barnes

Cool idea! So that cleans the whole repo and associated PRs all at once?

Not 100% sure what you're talking about here. If it's my point 2, "Deleting and recreating the repository", then I need to explain a little bit further:

What I really mean, is:

  1. Move/rename the original repository (or at least keep a local backup clone as a copy, in addition to the one you plan to run BFG on, and then delete the original)
  2. See if you have any PR refs in your local bare/mirrored repository with git show-ref
  3. Use `git update-ref -d refs/.../...` to delete the protected GitHub PR refs from that local clone
  4. Run BFG to eliminate bloat
  5. Run the git reflog and git gc commands recommended by BFG
  6. Create a new empty repo
  7. git push --mirror or whatever BFG recommends to the new repo
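
Steps 2 and 3 can be sketched as two small helpers. This assumes GitHub's pull request refs live under `refs/pull/` (for example `refs/pull/12/head`); the names `list_pr_refs` and `delete_pr_refs` are hypothetical, and the later steps appear only as comments because they rewrite and publish history.

```shell
#!/usr/bin/env bash
# Run inside the local bare/mirrored clone, and keep a separate backup first.
# Assumption: pull request refs are stored under refs/pull/.

# Step 2: show which PR refs exist locally.
list_pr_refs() {
  git for-each-ref --format='%(refname)' refs/pull
}

# Step 3: delete every local ref under refs/pull/ in one transaction.
delete_pr_refs() {
  git for-each-ref --format='delete %(refname)' refs/pull |
    git update-ref --stdin
}

# Steps 4-7, not executed here:
#   java -jar ~/Downloads/bfg-1.12.13.jar --delete-files 'core.*' rescal-snow.git
#   git reflog expire --expire=now --all && git gc --prune=now --aggressive
#   git push --mirror <URL of the new, empty repository>
```

Deleting the refs only affects the local mirror; the copies on GitHub stay in place.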

I would not recommend this, unless the repo size stays large after a normal pass with BFG. Even then, it's much easier to contact GitHub support and ask if they can delete the old protected PR refs.

I had to go through this procedure because I realized that upon open sourcing a repository, you could still access old PR refs which included the sensitive information that cannot be made public. If you do not need to do it, then please don't.

Also, if you haven't run BFG yet to prune history, you may want to do it either before the final submission or not at all; I'm not sure if it will mess with JOSS' machinery, DOI process, etc. and it will certainly affect tagging.

@kellykochanski
Owner

@zbeekman I ran bfg on the repository, though the changes were rejected from the then-open PR on kk/JOSS-fixes. Downloading rescal-snow is now down to 14MB from ~100MB.

I expect to have all open PRs closed at the time of JOSS acceptance, and will re-run bfg then - I can do this after finishing the corrections in your review, and merging the kk/JOSS-fixes branch, but before formal JOSS acceptance.

I hope bfg will work smoothly if all PRs are closed... Let me know if you think that it won't.

@zbeekman
Contributor

@kellykochanski: Yes it should work fine. IMO, you have images and stuff for the tutorials, and 14MB is probably how much space everything you want to keep takes up. But at the end of the day, I wouldn't bother with any steps that are more complicated than what you are doing. If you get complaints about rejected refs when you try to push due to PR refs, you can just delete them locally then try pushing again. (They will persist on the GitHub side, but I suspect this is fine and most people don't fetch them.)

@r-barnes
Collaborator Author

r-barnes commented Sep 26, 2019 via email

@zbeekman
Contributor

zbeekman commented Sep 26, 2019

[Edited for improved clarity 🤞]

@r-barnes: I'll pipe down and let you guys figure out what you want to do. My point was that it sounds like Kelly had success with BFG and got things down to 14MB. Deleting the entire github repository and re-creating it is (hopefully) beyond the scope of what you want/need to accomplish. At any rate, sorry for the confusion and feel free to ignore my previous comments.

If you run into troubles pushing back up to github after running BFG, let me know, it might be the PR refs issue, and I may know the solution. Either way I'd happily take a look.

@r-barnes
Collaborator Author

r-barnes commented Sep 26, 2019 via email
