Bad object refs on environment git repositories #1399

ymartin-ovh · 2024-08-23T09:00:21Z

Hello

We use r10k to create Puppet environments based on an active git repository.
Sysadmins tend to create feature branches (and force-push to them in their dev environments).

Describe the Bug

Some environments (/etc/puppet/code/environments/dev_XXX) may get "stuck": git operations fail with something like:

   fatal: bad object refs/remotes/cache/dev/YYY
   error: ssh://<upstream repo>.git did not send all necessary objects

I encountered two types of issues:

refs/remotes/cache/dev/YYY is gone (merged or deleted) => maybe --prune should be added (r10k/lib/r10k/git/shellgit/thin_repository.rb, line 33 in 99505c6: def fetch(remote = 'cache'))

refs/remotes/cache/dev/YYY: local hash does not exist anymore because DEV2 issued git push --force on his branch
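The first case can be illustrated outside r10k. The following is a minimal sketch, not r10k's code: it assumes a throwaway upstream and a clone whose remote is named "cache" (all paths and repository names are made up for the demo), deletes a branch upstream, and compares a plain fetch with a fetch using --prune.

```shell
#!/bin/sh
# Sketch: a branch deleted upstream leaves a stale remote-tracking ref
# behind until the fetch is run with --prune. Repository names and paths
# are hypothetical and exist only for this demo.
set -eu
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

# Upstream with a default branch and a feature branch dev/YYY.
git init -q "$work/upstream"
git -C "$work/upstream" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial"
git -C "$work/upstream" branch dev/YYY

# Clone it, naming the remote "cache" as in the error message above.
git clone -q -o cache "$work/upstream" "$work/env"

# The feature branch is merged or deleted upstream.
git -C "$work/upstream" branch -q -D dev/YYY

# Fetch with pruning explicitly off: the stale ref survives.
git -C "$work/env" -c fetch.prune=false fetch -q cache
before=$(git -C "$work/env" for-each-ref --format='%(refname)' refs/remotes/cache/dev)

# Fetch with --prune: the stale ref is removed.
git -C "$work/env" fetch -q --prune cache
after=$(git -C "$work/env" for-each-ref --format='%(refname)' refs/remotes/cache/dev)

echo "without --prune: ${before:-<none>}"
echo "with --prune:    ${after:-<none>}"
```

This only covers refs that no longer exist on the remote. It does not address the second case, where the ref still exists but the object it pointed to locally is gone.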

Expected Behavior

On environment repositories (/etc/puppet/code/environments), maybe r10k should not do a "git fetch cache" as we just need code for a specific branch.

Steps to Reproduce

I think my issue is a race condition on an active repository (i.e. concurrent r10k environment deploys) with people issuing "git push --force" on branches.

Environment

Version 3.15.4
Platform Debian bookworm


tmu-sprd · 2024-09-10T07:34:43Z

We experienced this too with versions 4.0.2 and 4.1.0. PR #1371 adds prune. We have been using it in production for quite a while. In the beginning, the bad object errors happened quite frequently; now they happen far less often.

alfsch · 2024-10-02T13:53:00Z

I'm on r10k 4.1.0 and get this error, too.

ERROR	 -> Command exited with non-zero exit code:
Command: git --git-dir /etc/puppetlabs/code/environments/development/.git --work-tree /etc/puppetlabs/code/environments/development checkout 2104bf32684f554882197022ca0f29099700f257 --force
Stderr:
fatal: bad object refs/remotes/cache/local_dev
Exit code: 128

It always happens when developers clean up the git history of their private branches.
With the command

git --git-dir /etc/puppetlabs/code/environments/development/.git --work-tree /etc/puppetlabs/code/environments/development reset --hard 2104bf32684f554882197022ca0f29099700f257

I'm able to mitigate this manually, but this is somewhat clumsy when r10k is called via webhook.

justinstoller · 2024-11-25T23:25:19Z

I looked into this and I believe the root cause is that our thin environment repos use the cache repo as a "reference" or shared repo. This means the thin repo is essentially a working tree pointing into specific objects held in the cache repo. Sometimes, when the upstream is rebased and pulled down to the cache repo (which is a mirror of the upstream) and garbage collection happens, the objects that the thin repository points to will no longer exist. See https://git-scm.com/docs/git-clone#:~:text=NOTE%3A%20this%20is,will%20become%20corrupt I'm not sure how best to fix that issue, as doing anything besides sharing with the cache repo will be slower and use more space on disk.
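To make the "reference or shared repo" idea concrete, here is a small sketch. It assumes the thin repo behaves roughly like a `git clone --shared` of the cache, which is an approximation for illustration and not a statement about r10k's exact setup; the repository names and paths are invented. It shows that such a clone records the cache's object directory in an alternates file instead of copying objects.

```shell
#!/bin/sh
# Sketch: a clone made with --shared borrows objects from its source
# through objects/info/alternates. Names and paths are hypothetical.
set -eu
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

# A small upstream with one commit.
git init -q "$work/upstream"
git -C "$work/upstream" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial"

# The cache: a full bare mirror of the upstream.
git clone -q --mirror "$work/upstream" "$work/cache.git"

# The thin environment repo: borrows objects from the cache.
git clone -q --shared -o cache "$work/cache.git" "$work/env"

# The alternates file names the object directory being borrowed from.
alternates=$(cat "$work/env/.git/objects/info/alternates")
echo "alternates: $alternates"

# count-objects reports the thin repo's own objects and its alternates.
git -C "$work/env" count-objects -v
```

Because the link is one-way, nothing in the cache repo knows that another repository still depends on its objects, which is the hazard the git-clone NOTE describes.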

justinstoller · 2024-11-25T23:28:01Z

I believe @bastelfreak mentioned this elsewhere. But changing the refs that tags point to isn't a git best practice and isn't supported by r10k. Let us know the use case where branches don't work for you. I have a feeling that a lot of confusion comes from the fact that Docker doesn't have "branches" like git does; instead it implements a branch-like feature it regrettably calls "tags".

bastelfreak · 2024-11-25T23:35:32Z

I agree with @justinstoller here. And my comment was at #1371 (comment)

nabertrand · 2024-11-26T06:20:27Z

@justinstoller just to confirm, the fatal: bad object refs/remotes/cache/XXX issue occurs outside of changing which ref a tag points to. Our site does not update tag refs, but does frequently delete branches and occasionally utilizes force pushes. We have not been able to determine which action is causing the issue, but the issue occurs frequently using only what I believe is the suggested 'branches as environments' method.

justinstoller · 2024-11-26T17:35:17Z

@nabertrand I believe the issue is with the rebasing. Let me give a more detailed rundown of how I think r10k and git are interacting:

I have a git repo hosted on GitHub, let's call it "upstream1", with a production branch. In git's object database it has this commit information for the production branch:

commit a -> commit b -> commit c -> commit d (HEAD)

R10k pulls down a full mirror of "upstream1" in its cache dir. Let's call that repo "cache1"; it contains an exact match of all the git data in "upstream1".

R10k then creates a repo "production" and tells it to reference the git data in "cache1". This repo "production" is treated essentially as a copy-on-write clone of "cache1". It contains a worktree checked out at commit d but its git dir simply says "I'm a repo pointing to commit d in cache1".

On my dev machine I rebase my local history, collapsing commit c and commit d into one new commit with a better commit message. Git will now treat that as a new commit, commit e. I force push that to "upstream1" and now upstream1's git object database looks like this:

commit a -> commit b -> commit e (HEAD)
               `- commit c -> commit d (ORPHANED)

Commit c and commit d still live in the git object database but they are unreachable by users. At some point before deploying with r10k, a git gc is run. It can occur after a commit, so let's say I add commit f on top of my rebased branch and trigger a gc which cleans up the orphaned git objects. Now my git objects look more like this:

commit a -> commit b -> commit e -> commit f (HEAD)

And commit c and commit d have been garbage collected.

Now I do a deploy and "cache1" is updated to look exactly like "upstream1".

Then r10k sees that the deployed commit of production is commit d and the head of cache1's production branch is commit f. R10k asks the production repo to update to commit f. It does so saying to cache1, "I'm on commit d, send me the information on how to go from commit d to commit f." To which cache1 says, "I don't know anything about commit d, I can't tell you how to get to commit f" and the process fails with something about "did not send all necessary objects".

In a full clone the production repo would return and say, "well I believe the parent of commit d is commit c, do you know about that?" and then "the parent of commit c is commit b, do you know about that?". THEN, the repo would be able to construct a path from commit d to commit f: drop d, drop c, apply e, apply f.
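The sequence can be reproduced in miniature. This is a sketch under simplifying assumptions, not a description of what r10k runs: the thin repo is approximated with `git clone --shared`, the force push is replaced by rewriting the upstream branch in place, only commit e is added, and the gc is run directly in the cache repo with `--prune=now` to skip the grace period. The `commit` and `exists` helpers and all paths are hypothetical.

```shell
#!/bin/sh
# Sketch: objects borrowed by a thin repo disappear after the cache repo
# is updated to rewritten history and garbage collected.
set -eu
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

# Hypothetical helper: empty commit in repo $1 with message $2.
commit() {
    git -C "$1" -c user.name=demo -c user.email=demo@example.com \
        commit -q --allow-empty -m "$2"
}

# Hypothetical helper: report whether repo $1 can read object $2.
exists() {
    if git -C "$1" cat-file -e "$2" 2>/dev/null; then
        echo present
    else
        echo missing
    fi
}

# upstream1: a -> b -> c -> d on branch "production".
git init -q "$work/upstream1"
git -C "$work/upstream1" symbolic-ref HEAD refs/heads/production
for msg in a b c d; do commit "$work/upstream1" "commit $msg"; done

# cache1 mirrors upstream1; the production repo borrows cache1's objects.
git clone -q --mirror "$work/upstream1" "$work/cache1.git"
git clone -q --shared -o cache "$work/cache1.git" "$work/production"
deployed=$(git -C "$work/production" rev-parse HEAD)   # commit d
before=$(exists "$work/production" "$deployed")

# Rewrite history upstream: drop c and d, add e.
git -C "$work/upstream1" reset -q --hard HEAD~2
commit "$work/upstream1" "commit e"

# cache1 picks up the rewritten branch, then unreachable objects are pruned.
git -C "$work/cache1.git" fetch -q --prune origin
git -C "$work/cache1.git" gc -q --prune=now

after=$(exists "$work/production" "$deployed")
echo "deployed commit before gc: $before"
echo "deployed commit after gc:  $after"

# A fetch in the thin repo is expected to fail at this point; the exact
# wording of the error depends on the git version, so it is only printed.
git -C "$work/production" fetch -q cache || echo "fetch in thin repo failed"
```

With the default two-week grace period instead of `--prune=now`, the same loss would only show up once the orphaned commits are old enough, which may explain why the error appears intermittently.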

I think if we did a shallow clone of depth 1 we'd have the same issue in reconciling git histories, however we'd also have copied all the git data from cache1 to production. So same issue but slower and more disk space. However, if we did a shallow clone of depth 3 in the above case, we'd have cloned the last three commits (commit d, c, and b) to the production repo and been able to reconcile our commit histories.

So, I do think there's a way to solve the issue. But we'd need a shallow clone of depth X, where X is more commits than most folks would rebase away. And it would make r10k take up more space and run slower for everyone, which I'm not sure is a good tradeoff.
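For comparison, a shallow clone keeps its own copy of the last few commits instead of borrowing them. The sketch below only demonstrates that property with an invented four-commit repository; it does not show that a shallow clone would actually resolve the error. A file:// URL is used because git ignores --depth when cloning from a plain local path.

```shell
#!/bin/sh
# Sketch: a depth-3 shallow clone holds three commits locally and has no
# alternates file. Repository names and paths are hypothetical.
set -eu
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

# upstream1 with four commits: a -> b -> c -> d.
git init -q "$work/upstream1"
for msg in a b c d; do
    git -C "$work/upstream1" -c user.name=demo -c user.email=demo@example.com \
        commit -q --allow-empty -m "commit $msg"
done

# Shallow clone of depth 3 over the file:// transport.
git clone -q --depth 3 "file://$work/upstream1" "$work/shallow"

# Number of commits reachable from HEAD in the shallow clone.
depth=$(git -C "$work/shallow" rev-list --count HEAD)

# A shallow clone does not borrow objects from another repository.
if [ -e "$work/shallow/.git/objects/info/alternates" ]; then
    borrows=yes
else
    borrows=no
fi

echo "commits available locally: $depth"
echo "uses alternates: $borrows"
```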

nabertrand · 2024-11-26T18:21:46Z

Thanks for looking into this @justinstoller. What about allowing the user to customize how often loose objects and reflog entries are garbage collected? The performance and space hit of bumping the thresholds might be unacceptable/unnecessary for some sites, so making it customizable could allow users to tune the values as needed. Specifically, we might want to customize:

gc.pruneExpire

When git gc is run, it will call prune --expire 2.weeks.ago (and repack --cruft --cruft-expiration 2.weeks.ago if using cruft packs via gc.cruftPacks or --cruft). Override the grace period with this config variable. The value "now" may be used to disable this grace period and always prune unreachable objects immediately, or "never" may be used to suppress pruning. This feature helps prevent corruption when git gc runs concurrently with another process writing to the repository; see the "NOTES" section of git-gc[1].

gc.reflogExpire

git reflog expire removes reflog entries older than this time; defaults to 90 days. The value "now" expires all entries immediately, and "never" suppresses expiration altogether. With "<pattern>" (e.g. "refs/stash") in the middle the setting applies only to the refs that match the <pattern>.

gc.reflogExpireUnreachable

git reflog expire removes reflog entries older than this time and are not reachable from the current tip; defaults to 30 days. The value "now" expires all entries immediately, and "never" suppresses expiration altogether. With "<pattern>" (e.g. "refs/stash") in the middle, the setting applies only to the refs that match the <pattern>.

These types of entries are generally created as a result of using git commit --amend or git rebase and are the commits prior to the amend or rebase occurring. Since these changes are not part of the current project most users will want to expire them sooner, which is why the default is more aggressive than gc.reflogExpire.

justinstoller · 2024-11-27T17:07:48Z

I think that'd be a good idea.

Is your environment a FOSS or PE install? I don't see us calling git-gc within r10k, but I do know we do some gc in PE via other tools.

Before we add options to r10k it would be nice to validate them. If your environment is a FOSS install and reproduces this fairly regularly, could you try setting some of these values in the gitconfig for the user r10k runs as? The defaults they list seem fairly benign (e.g. cleaning up unreachable commits more than 30 days old) but there may be some interaction there causing issues.

nabertrand · 2024-11-27T20:46:28Z

@justinstoller our environment is a FOSS install, but I thought perhaps the garbage collection was happening automatically when other non-gc git commands were run. I'd be glad to test this out, but won't have time until after the US holiday break. Could you re-open this issue? I think it was closed automatically when #1410 was merged.

nabertrand · 2024-12-04T18:05:19Z

@justinstoller I'm currently testing setting gc.pruneExpire to never and so far have not run into the bad object error, but it's difficult to know for sure if the requisite events that normally trigger it have occurred. I didn't want to test all three settings at once to ensure we know which setting fixes the issue. If the errors return, I'd like to test gc.reflogExpireUnreachable next, but I'm not sure how long you'd like to hold off on additional testing before moving forward with the new release.
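For anyone wanting to try the same setting, a minimal sketch follows. It writes gc.pruneExpire to a scratch config file so that it has no side effects. On a real host the equivalent would presumably be `git config --global gc.pruneExpire never`, run as the user r10k runs as; that part is an assumption about the deployment, not something confirmed in this thread.

```shell
#!/bin/sh
# Sketch: set gc.pruneExpire in a git config file and read it back.
# A temporary file stands in for the r10k user's ~/.gitconfig.
set -eu
cfg=$(mktemp)
trap 'rm -f "$cfg"' EXIT

# "never" suppresses pruning of unreachable objects during git gc.
git config --file "$cfg" gc.pruneExpire never

# Read the value back to confirm what git will see.
value=$(git config --file "$cfg" --get gc.pruneExpire)
echo "gc.pruneExpire = $value"
```

Setting the value to never trades disk space for safety, since unreachable objects are then kept indefinitely.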

justinstoller · 2024-12-05T18:07:22Z

That testing strategy sounds great. I think we should go ahead and do a release very soon while you continue with the testing. I've actually been validating another PR that came in this week #1412 and cleaning up our internal CI.

We don't do releases on Fridays and I've been hesitant to release on a Thursday. But I've gotten everything through CI and I'm inclined to do a release later today since we've been putting it off for so long unless folks have concerns.

ymartin-ovh added the bug label Aug 23, 2024

This was referenced Nov 22, 2024

fetch tags to local, if different to remote #1371 (Open)

Ensure old refs are pruned on fetch #1410 (Merged)

justinstoller closed this as completed in #1410 Nov 27, 2024

justinstoller mentioned this issue Nov 27, 2024

Support more minitar versions & prep for r10k 5.0 #major #1408 (Closed)

justinstoller reopened this Dec 3, 2024
