Repo size > 250MB #713
Comments
One thing we can do is move the notebooks directory to a standalone repo and add it back as a submodule. The idea is for air-water-vv to continue being a repository of test problems that travis will run, so in the future we can migrate more of the tests that require data files to air-water-vv as well. I think your main point is that running bfg is a matter of timing, so we may want to discuss on Wednesday a timeline. |
Perhaps use erdc/proteus_old; through repository copy |
The size of the current repo including git-lfs files from a raw clone:
with
Current size based on website:
Output from
Sizes of largest 20 files:
|
I've created a cleaned copy of the repo with only the master branch existing at
For reference, the cleaned repo was made through the following steps:
There is an assumption that large files (>1MB) that are not referenced by the master branch are to be removed, and those that are need to be tracked by lfs. |
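The selection rule above could be checked before any rewrite with a sketch like the following. It is not the set of steps used for the cleaned repo. It assumes "referenced by the master branch" means "present in the tree at the branch tip", and it builds a throwaway repository with made-up file names (`big.bin`, `keep.bin`) so it can run anywhere. Paths containing spaces would be truncated by the `awk` field split.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print "<size-in-bytes> <path>" for every blob in the history of repo $1 that
# is larger than $3 bytes and is NOT in the tree at the tip of ref $2.
unreferenced_large_blobs() {
  local repo=$1 ref=$2 limit=$3
  git -C "$repo" rev-list --objects --all \
    | git -C "$repo" cat-file --batch-check='%(objecttype) %(objectsize) %(objectname) %(rest)' \
    | awk -v limit="$limit" '
        NR == FNR { kept[$3] = 1; next }   # first input: blobs in the tip tree
        $1 == "blob" && $2 > limit && !($3 in kept) { print $2, $4 }
      ' <(git -C "$repo" ls-tree -r "$ref") -
}

# Hypothetical demo repository: one large file is deleted again, one is kept.
demo=$(mktemp -d)
git init -q "$demo"
head -c 2000000 /dev/zero > "$demo/big.bin"
head -c 1500000 /dev/zero > "$demo/keep.bin"
git -C "$demo" add .
git -C "$demo" -c user.name=demo -c user.email=demo@example.com commit -q -m "add files"
git -C "$demo" rm -q big.bin
git -C "$demo" -c user.name=demo -c user.email=demo@example.com commit -q -m "drop big.bin"

# Only big.bin is reported: keep.bin is over 1MB but still in the tip tree,
# so under the stated rule it would be an lfs candidate instead.
unreferenced_large_blobs "$demo" HEAD 1000000
```

On a real clone the second argument would be `master` rather than `HEAD`.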
Looks like over a factor of 4 reduction, right? Does it seem like a reasonable plan to do one more release on the 1.7 branch with the history as is, then copy that repository to proteus-old, then run your cleaning commands? I suppose we should try to close out as many branches as possible before doing the history rewrite. Since those old branches would be intact on the backed up (old) repository, it wouldn't necessarily need a lot of coordination. |
Yes, it's quite a drastic reduction in size since I remove any file (object) that is not referenced by the master branch. Assuming none of the previous releases had any true dependencies on such large files, then it might be possible to preserve the various releases as branches on the repo. Regarding the number of branches, I think it would be best to do the following:
This doesn't prevent us from pruning/closing some of the older branches before step 1, but it becomes a matter of knowing which branches to close. It is important that everyone makes a fork of proteus afterward so that the actively developed branches have the same history as |
That sounds good. How about this for some of the details on how to get this finished:
|
Below is a list of the files and their sizes in the repo, in descending order. The first column is the size in kB, the second the packed size in kB, and the third the SHA of the file:
As one might expect, there are a number of files from the tests that need to be removed. There are also other types of files, e.g.:
12979 1606 fa5edb1220c22fd7fe89e987dd0a264c72b7f6a3 RANS2P2D.h.gch
Removing the first 150 or so files would shrink the repo by 70-80%. |
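To get a rough feel for how much the top of such a listing accounts for, the packed-size column can be summed. The sketch below assumes the four-column layout described above; only the `RANS2P2D.h.gch` row comes from the thread, and the other rows (with obviously fake SHAs) are made up for illustration. The percentage is relative to the rows in the listing, not to the whole repository.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Read "<size kB> <packed kB> <sha> <path>" rows on stdin (largest first) and
# report how much of the listed packed size the first N rows account for.
top_n_share() {
  local n=$1
  awk -v n="$n" '
    { total += $2; if (NR <= n) top += $2 }
    END { if (total > 0) printf "%d of %d kB packed (%.0f%%)\n", top, total, 100 * top / total }
  '
}

# First row is from the thread; the remaining rows are hypothetical.
listing='12979 1606 fa5edb1220c22fd7fe89e987dd0a264c72b7f6a3 RANS2P2D.h.gch
9000 1200 1111111111111111111111111111111111111111 example_result.h5
400 150 2222222222222222222222222222222222222222 example_mesh.dat
30 44 3333333333333333333333333333333333333333 example_source.py'

printf '%s\n' "$listing" | top_n_share 2
```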
Separate notebooks repo through the following instructions: https://medium.com/@ayushya/move-directory-from-one-repository-to-another-preserving-git-history-d210fa049d4b |
@zhanga is that for files that do not exist anymore on the latest master? Would be nice to get a 70-80% shrinkage! There are a few files .cpp and .c that are autogenerated code in the list, like ChRigidBody.cpp or WaveTools.cpp, so these can go |
@tridelat These are just the largest files in the repository, not just on the latest master. Part of the difficulty is that it's not clear sometimes which .cpp files are source code or cythonized/autogenerated code, which is why I'd go for a "remove top 150 + .h5 files" approach. It doesn't look like there's much stuff in the current |
I'm OK with just checking if the large cpp files are actually in the current master. |
@zhang-alvin yes that's what I thought, I meant to ask if it was the top 150 files that are anywhere in the repo but not in the latest master. The |
List of files to be deleted. The cleaning process doesn't look for unique paths and instead simply matches filenames for deletion. As a result, some tests will fail because their files share names with those being removed from history. Such files will be added back after the fact.
The first column is the size in bytes (i.e. the first item is about 98MB).
The repo was cloned with
The criteria for choosing the remaining files were:
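Since deletion matches on filename rather than path, it may help to list in advance which currently tracked files would be caught. A minimal sketch, assuming one deletion name per line in one file and one tracked path per line in another; all names below are hypothetical:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print every tracked path (file $2) whose basename appears in the list of
# names slated for deletion (file $1).
collisions() {
  local names_file=$1 paths_file=$2
  awk '
    NR == FNR { doomed[$0] = 1; next }   # first file: names to delete
    { n = split($0, parts, "/"); if (parts[n] in doomed) print }
  ' "$names_file" "$paths_file"
}

# Hypothetical inputs for illustration only.
work=$(mktemp -d)
printf '%s\n' 'results.h5' 'mesh.dat' > "$work/delete.txt"
printf '%s\n' 'old_runs/results.h5' 'tests/reference/results.h5' 'src/solver.py' \
  > "$work/tracked.txt"

# Both results.h5 paths are reported, including the one under tests/ that
# would need to be added back afterwards.
collisions "$work/delete.txt" "$work/tracked.txt"
```

On a real clone, the tracked paths could come from `git ls-tree -r --name-only master`.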
Some .cpp files are actually Cython-generated files. The resulting files are largely .h5, .ipynb, .bin, .dat, .txt, mesh, and html files. I tried looking into removing individual git blobs, but that didn't play well with git lfs for an unknown reason. |
Cleaned mirror clone pushed to this repo. The original mirror clone + git lfs fetch --all was 1.3 GB. The cleaned repo mirror clone with git lfs fetch --all is 444 MB. There's a possibility of additional size reduction when more branches are deleted. Users will see an even smaller footprint as a regular git clone + git lfs fetch --all yields only a 93 MB directory. |
List of files deleted in the second pass of cleanup via bfg. The repo size seems to have increased since the initial cleanup, possibly because of additional branches. It is currently at 96MB with git lfs fetch --all. Removing the listed files also didn't yield a size reduction proportional to the size of the files. That is, removing files totaling 20MB changed the repo size by only about 4MB in before-and-after tests. |
Closing the issue now as most goals were accomplished. |
The repo size is still unnecessarily large. I had previously attempted to migrate image and result files (e.g. h5, png, sms) to git-lfs with git filter-branch, but it doesn't look like that was thorough enough.
The size of the repo can be seen in kB under the size tag:
https://api.github.com/repos/erdc/proteus
Alternatively, one can check with
git count-objects -vH
which results in:
Looking at the 10 largest files in the repo's history, we see several image files, result (h5) files, and mesh (sms) files (obtained from this script):
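A listing of the largest files in history can be produced with plain git plumbing. The sketch below is not the script referenced above; it is one possible approach, and it builds a throwaway repository (the file names `big.h5` and `small.txt` are made up) so that it can run anywhere.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print "<size-in-bytes> <path>" for the N largest blobs reachable from any
# ref in the repository at $1 (default N=10), largest last.
largest_blobs() {
  local repo=$1 n=${2:-10}
  git -C "$repo" rev-list --objects --all \
    | git -C "$repo" cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' \
    | awk '$1 == "blob" { sub(/^blob /, ""); print }' \
    | sort -n \
    | tail -n "$n"
}

# Hypothetical demo repository: big.h5 is committed and then deleted again.
demo=$(mktemp -d)
git init -q "$demo"
head -c 200000 /dev/zero > "$demo/big.h5"
echo "hello" > "$demo/small.txt"
git -C "$demo" add .
git -C "$demo" -c user.name=demo -c user.email=demo@example.com commit -q -m "add files"
git -C "$demo" rm -q big.h5
git -C "$demo" -c user.name=demo -c user.email=demo@example.com commit -q -m "drop big file"

# big.h5 still shows up, because the blob remains in history after deletion.
largest_blobs "$demo" 5
```

The demo shows why deleting a file in a later commit does not shrink the repository: the blob stays reachable through the earlier commit until history is rewritten.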
There's a tool that can migrate any file of various types in the entire history of the repo called bfg. Running the following yielded a significantly smaller size for the repo:
The downside of this massive lfs migration is that the history will be modified. To avoid a mess of conflicting histories I think the following would need to be done:
git push --force