
Bad object refs on environment git repositories #1399

Open
ymartin-ovh opened this issue Aug 23, 2024 · 12 comments · Fixed by #1410
@ymartin-ovh

Hello,

We use r10k to create Puppet environments based on an active git repository. Sysadmins tend to create feature branches (and force-push in their dev environments).

Describe the Bug

Some environments (/etc/puppet/code/environments/dev_XXX) may get "stuck"; git operations fail with something like:

   fatal: bad object refs/remotes/cache/dev/YYY
   error: ssh://<upstream repo>.git did not send all necessary objects

I encountered two types of issues:

  • refs/remotes/cache/dev/YYY is gone upstream (merged or deleted) => maybe --prune should be added to def fetch(remote = 'cache') (see the sketch below)
  • refs/remotes/cache/dev/YYY: the local hash no longer exists because another developer issued git push --force on their branch
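
For the first failure mode, here is a sketch of what pruning on fetch does at the git level (the path is only an example, and this is not necessarily the exact command r10k runs):

   # Fetch from the cache remote and drop remote-tracking refs whose upstream branch is gone.
   git --git-dir /etc/puppet/code/environments/dev_XXX/.git fetch --prune cache
   # --prune deletes refs/remotes/cache/* entries for branches that no longer exist on
   # the cache remote, so a merged or deleted dev/YYY branch does not leave a stale
   # tracking ref behind for later git operations to trip over.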

Expected Behavior

On environment repositories (/etc/puppet/code/environments), maybe r10k should not do a full "git fetch cache", since we only need the code for one specific branch.

Steps to Reproduce

I think my issue is a race condition on an active repository (i.e. concurrent r10k environment deploys) combined with people issuing "git push --force" on branches.

Environment

  • Version 3.15.4
  • Platform Debian bookworm
@tmu-sprd

We experienced this too with versions 4.0.2 and 4.1.0. PR #1371 adds prune support; we have been using it in production for quite a while. In the beginning the bad object errors happened quite frequently, now they happen far less often.

@alfsch

alfsch commented Oct 2, 2024

I'm on r10k 4.1.0 and get these errors, too.

ERROR	 -> Command exited with non-zero exit code:
Command: git --git-dir /etc/puppetlabs/code/environments/development/.git --work-tree /etc/puppetlabs/code/environments/development checkout 2104bf32684f554882197022ca0f29099700f257 --force
Stderr:
fatal: bad object refs/remotes/cache/local_dev
Exit code: 128

It always happens when developers do some cleanup of the git history on their private branches.
With the command

git --git-dir /etc/puppetlabs/code/environments/development/.git --work-tree /etc/puppetlabs/code/environments/development reset --hard 2104bf32684f554882197022ca0f29099700f257

I'm able to mitigate this manually, but this is somewhat clumsy when r10k is called via a webhook.

@justinstoller
Member

I looked into this and I believe the root cause is that our thin environment repos use the cache repo as a "reference" or shared repo. This means the thin repo is essentially a working tree pointing into specific objects held in the cache repo. Sometimes, when the upstream is rebased and pulled down to the cache repo (which is a mirror of the upstream) and garbage collection happens, the objects that the thin repository points to will no longer exist. See https://git-scm.com/docs/git-clone#:~:text=NOTE%3A%20this%20is,will%20become%20corrupt. I'm not sure how best to fix that issue, as doing anything besides sharing with the cache repo will be slower and use more space on disk.
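
For illustration, a minimal sketch of that sharing mechanism (hypothetical URL and paths; not necessarily the exact commands r10k runs):

   # "cache1" is a full mirror of the upstream control repo.
   git clone --mirror https://example.com/puppet/control-repo.git cache1
   # "production" borrows cache1's object database instead of copying it.
   git clone --reference cache1 https://example.com/puppet/control-repo.git production
   cat production/.git/objects/info/alternates   # points at .../cache1/objects
   # "production" holds almost no objects of its own, so if cache1 garbage-collects
   # objects that production's checked-out commit still needs, production becomes corrupt.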

@justinstoller
Member

I believe @bastelfreak mentioned this elsewhere, but changing the refs that tags point to isn't a git best practice and isn't supported by r10k. Let us know the use case where branches don't work for you. I have a feeling that a lot of the confusion comes from the fact that Docker doesn't have "branches" like git does; instead it implements a branch-like feature it regrettably calls "tags".

@bastelfreak
Contributor

I agree with @justinstoller here. And my comment was at #1371 (comment)

@nabertrand
Contributor

@justinstoller just to confirm, the fatal: bad object refs/remotes/cache/XXX issue also occurs without changing which ref a tag points to. Our site does not update tag refs, but we do frequently delete branches and occasionally use force pushes. We have not been able to determine which action causes the issue, but it occurs frequently using only what I believe is the suggested 'branches as environments' method.

@justinstoller
Member

@nabertrand I believe the issue is with the rebasing. Let me give a more detailed rundown of how I think r10k and git are interacting:

I have a git repo hosted on GitHub, let's call it "upstream1", with a production branch. In git's object database it has this commit history for the production branch:

commit a -> commit b -> commit c -> commit d (HEAD)

R10k pulls down a full mirror of "upstream1" into its cache dir. Let's call that repo "cache1"; it contains an exact copy of all the git data in "upstream1".

R10k then creates a repo "production" and tells it to reference the git data in "cache1". This repo "production" is treated essentially as a copy-on-write clone of "cache1". It contains a worktree checked out at commit d but its git dir simply says "I'm a repo pointing to commit d in cache1".

On my dev machine I rebase my local history, collapsing commit c and commit d into one new commit with a better commit message. Git will now treat that as a new commit, commit e. I force push that to "upstream1" and now upstream1's git object database looks like this:

commit a -> commit b -> commit e (HEAD)
               `- commit c -> commit d (ORPHANED)

Commit c and commit d still live in the git object database, but they are unreachable by users. At some point before deploying with r10k, a git gc is run. It can occur after a commit, so let's say I add commit f on top of my rebased branch and trigger a gc, which cleans up the orphaned git objects. Now my git objects look more like this:

commit a -> commit b -> commit e -> commit f (HEAD)

And commit c and commit d have been garbage collected.

Now I do a deploy and "cache1" is updated to look exactly like "upstream1".

Then r10k sees that the deployed commit of production is commit d and that the head of cache1's production branch is commit f. R10k asks the production repo to update to commit f. It does so by saying to cache1, "I'm on commit d, send me the information on how to go from commit d to commit f." To which cache1 replies, "I don't know anything about commit d, I can't tell you how to get to commit f," and the process fails with something about "did not send all necessary objects".

In a full clone the production repo would return and say, "well I believe the parent of commit d is commit c, do you know about that?" and then "the parent of commit c is commit b, do you know about that?". THEN, the repo would be able to construct a path from commit d to commit f: drop d, drop c, apply e, apply f.

I think if we did a shallow clone of depth 1 we'd have the same issue in reconciling git histories; however, we'd also have copied all the git data from cache1 to production. So: same issue, but slower and with more disk space. However, if we did a shallow clone of depth 3 in the above case, we'd have cloned the last three commits (commits d, c, and b) to the production repo and been able to reconcile our commit histories.

So, I do think there's a way to solve the issue, but we'd need a shallow clone of depth X, where X is more commits than most folks would rebase away. And it would make r10k take up more space and run slower for everyone, which I'm not sure is a good tradeoff.
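
If it helps, here is a rough local reproduction of that sequence (all names and paths are made up; it sketches the underlying git behaviour rather than r10k's exact internals):

   git init --bare upstream.git
   git clone upstream.git work
   git -C work checkout -b production
   for m in a b c d; do git -C work commit --allow-empty -m "$m"; done
   git -C work push origin production

   git clone --mirror upstream.git cache1                        # the r10k cache
   git clone --reference cache1 -b production upstream.git env   # the thin environment repo

   git -C work reset --hard HEAD~2                               # drop commits c and d
   git -C work commit --allow-empty -m e                         # the rebased replacement
   git -C work push --force origin production

   git -C cache1 fetch --prune origin    # cache1 now mirrors the rewritten history
   git -C cache1 gc --prune=now          # the orphaned objects for c and d are deleted

   git -C env fetch origin               # env's refs still point at commit d, whose objects
   git -C env checkout origin/production # are gone, so git typically fails with "bad object"
                                         # or "did not send all necessary objects"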

@nabertrand
Contributor

Thanks for looking into this @justinstoller. What about allowing the user to customize how often loose objects and reflog entries are garbage collected? The performance and space hit of bumping the thresholds might be unacceptable/unnecessary for some sites, so making it customizable would allow users to tune the values as needed. Specifically, we might want to customize the following settings (quoting the git-config documentation):

gc.pruneExpire: When git gc is run, it will call prune --expire 2.weeks.ago (and repack --cruft --cruft-expiration 2.weeks.ago if using cruft packs via gc.cruftPacks or --cruft). Override the grace period with this config variable. The value "now" may be used to disable this grace period and always prune unreachable objects immediately, or "never" may be used to suppress pruning. This feature helps prevent corruption when git gc runs concurrently with another process writing to the repository; see the "NOTES" section of git-gc[1].

gc.reflogExpire: git reflog expire removes reflog entries older than this time; defaults to 90 days. The value "now" expires all entries immediately, and "never" suppresses expiration altogether. With "<pattern>" (e.g. "refs/stash") in the middle, the setting applies only to the refs that match the <pattern>.

gc.reflogExpireUnreachable: git reflog expire removes reflog entries older than this time and are not reachable from the current tip; defaults to 30 days. The value "now" expires all entries immediately, and "never" suppresses expiration altogether. With "<pattern>" (e.g. "refs/stash") in the middle, the setting applies only to the refs that match the <pattern>.

These types of entries are generally created as a result of using git commit --amend or git rebase and are the commits prior to the amend or rebase occurring. Since these changes are not part of the current project most users will want to expire them sooner, which is why the default is more aggressive than gc.reflogExpire.
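
For example, assuming these were set in the global gitconfig of the user that runs r10k (illustrative values only, not recommendations):

   # Never prune unreachable objects during gc (the most aggressive option).
   git config --global gc.pruneExpire never
   # Or, less drastically, keep unreachable reflog entries for 90 days instead of 30.
   git config --global gc.reflogExpireUnreachable 90.days.ago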

@justinstoller
Member

I think that'd be a good idea.

Is your environment a FOSS or PE install? I don't see us calling git-gc within r10k, but I do know we do some gc in PE via other tools.

Before we add options to r10k it would be nice to validate them. If your environment is a FOSS install and reproduces this fairly regularly, could you try setting some of these values in the gitconfig for the user r10k runs as? The defaults they list seem fairly benign (e.g. cleaning up unreachable commits more than 30 days old), but there may be some interaction there causing issues.

@nabertrand
Contributor

@justinstoller our environment is a FOSS install, but I thought perhaps the garbage collection was happening automatically when other non-gc git commands were run. I'd be glad to test this out, but won't have time until after the US holiday break. Could you re-open this issue? I think it was closed automatically when #1410 was merged.

justinstoller reopened this Dec 3, 2024
@nabertrand
Contributor

@justinstoller I'm currently testing setting gc.pruneExpire to never and so far have not run into the bad object error, but it's difficult to know for sure if the requisite events that normally trigger it have occurred. I didn't want to test all three settings at once to ensure we know which setting fixes the issue. If the errors return, I'd like to test gc.reflogExpireUnreachable next, but I'm not sure how long you'd like to hold off on additional testing before moving forward with the new release.

@justinstoller
Member

That testing strategy sounds great. I think we should go ahead and do a release very soon while you continue with the testing. I've actually been validating another PR that came in this week, #1412, and cleaning up our internal CI.

We don't do releases on Fridays, and I've been hesitant to release on a Thursday. But I've gotten everything through CI, and since we've been putting it off for so long I'm inclined to do a release later today, unless folks have concerns.
