git - Why does a no-op filter-branch create divergence, and how do I fix that?


I have a situation where I merged a couple-years-worth of commits into a repository. One of the commits had, in its comment, a paste of an Address Sanitizer log related to the fix.

That doesn't sound so bad, except that Address Sanitizer logs look like this:

==10856==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x62a00000b201 at pc 0x47df61 bp 0x7fffffff2ca0 sp 0x7fffffff2c98
READ of size 1 at 0x62a00000b201 thread T0
    #0 0x47df60 in expand_series ../src/core/m-series.c:145
    #1 0x47e5a7 in extend_series ../src/core/m-series.c:187
    #2 0x466e0c in scan_quote ../src/core/l-scan.c:462
    #3 0x46a797 in scan_token ../src/core/l-scan.c:918
    #4 0x46e263 in scan_block ../src/core/l-scan.c:1188
    ...

And on it goes, up to #250 or so in this case. GitHub scans for #xxx patterns and, if they match an issue number, puts a note about the mention on the referenced issue. So GitHub thinks this commit is remarking on every issue and pull request, and it is doing this all the time.

I thought I'd use git filter-branch, since I don't mind breaking history (I had already used filter-branch to get rid of some stuff I didn't want). However, I did that other filter-branch before I did the merge, and then continued to work. Now that I've noticed this popping up in GitHub, I'd like to go back and rewrite it, and I don't mind if every commit on every branch after that point gets a new hash. That's okay with me.

I got the rewrite to work, but I can't figure out why there is so much divergence. It seems to have done some rewriting that affects things from before the commit whose comment I changed. As a simple test, I tried what I thought should be a no-op:

git filter-branch -f --msg-filter 'sed "s/a/a/g"' -- --all 

I'm no sed person, but my understanding is that this will redo the commit messages and substitute a for a. (Ayn Rand would be pleased.)

It doesn't diverge by as many commits as the actual replacement... 600 instead of 1000. But the fact that it diverges at all indicates I have some kind of misunderstanding here. How can I rewrite just that one commit message in the history without damaging any commits besides the ones that occur after it... and have it take effect on all branches?

If there is an existing message that does not end with a newline, sed may add one (at least some versions of sed do, including the one I tested here):

$ printf 'foo\nbar'
foo
bar$ printf 'foo\nbar' | sed 's/a/a/'
foo
bar
$

Which means that your test message filter might have altered a message. Based on your results, I'd guess that at least one commit, some 600 commits back from the branch tip(s), was modified this way. (I've seen this exact problem myself before.)

(Another possibility is some sort of Unicode normalization, although I haven't seen that happen with sed.)

Assuming that is the case, the trick is to find a command that will not affect the other commits. One way is to use the environment variable $GIT_COMMIT to identify the commit(s) to touch, and to make sure the filter is a no-op on all other commits (cat as the msg-filter might work better than sed, for instance):
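Before running a filter over the whole history, it may be worth checking whether it really is a byte-for-byte no-op on a message with no trailing newline. The following is only a sketch: the function name is_byte_exact_noop is made up for illustration, and it assumes mktemp and cmp are available.

```shell
# is_byte_exact_noop COMMAND [ARGS...]
# Feeds a two-line message WITHOUT a trailing newline to COMMAND and
# succeeds only if COMMAND's output is byte-for-byte identical to the input.
# (Hypothetical helper, not part of git.)
is_byte_exact_noop() {
    tmp_in=$(mktemp) && tmp_out=$(mktemp) || return 2
    printf 'foo\nbar' > "$tmp_in"          # note: no final newline
    "$@" < "$tmp_in" > "$tmp_out"          # run the candidate filter
    cmp -s "$tmp_in" "$tmp_out"            # compare raw bytes
    status=$?
    rm -f "$tmp_in" "$tmp_out"
    return "$status"
}

# Example use:
#   is_byte_exact_noop cat && echo "cat is safe"
#   is_byte_exact_noop sed 's/a/a/g' || echo "this sed alters the message"
```

Whether the sed line reports a change depends on the sed implementation, which is exactly why it is worth testing on the machine that will do the rewrite.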

... --msg-filter 'if [ "$GIT_COMMIT" = <the one> ]; then fix_msg; else cat; fi' ...

As for getting it to take effect on all branches, -- --all should do the trick already.
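Spelled out a bit more, the $GIT_COMMIT test above might look like the sketch below. The names fix_msg_filter and TARGET_COMMIT are hypothetical, the all-zero hash is a placeholder for the real commit ID, and the sed expression is just one assumed way to defuse the "#N" patterns; substitute whatever edit you actually want.

```shell
# Hypothetical placeholder: full hash of the one commit whose message needs fixing.
TARGET_COMMIT=${TARGET_COMMIT:-0000000000000000000000000000000000000000}

# Message filter: reads a commit message on stdin, writes the new one to stdout.
# git filter-branch sets $GIT_COMMIT to the ID of the commit being rewritten.
fix_msg_filter() {
    if [ "$GIT_COMMIT" = "$TARGET_COMMIT" ]; then
        # Assumed fix: turn "#12" into "frame 12" so it no longer looks
        # like an issue reference.
        sed 's/#\([0-9][0-9]*\)/frame \1/g'
    else
        # Every other commit: pass the message through untouched.
        cat
    fi
}
```

One way to use this would be to put the definition, plus a final call to fix_msg_filter, into an executable script and pass that script's path as the --msg-filter command. Since only the target commit goes through sed, any newline sed adds affects a commit that is being changed anyway.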


It sounds like you know why the remaining commits get new SHA-1s, but for completeness I'll include that as well. You can skip this part; it's here for other people reading the question.

If a commit is modified, it gets a new SHA-1 (by definition, since the SHA-1 is a checksum of the commit's contents). No big deal so far, but let's say there are five commits (all on master in this case, not that it matters) and you modify the middle one with a filter-branch filter:

A <- B <- C <- D <- E        [original]

Let's say the actual SHA-1 of C starts with 30001. Now let's build the partial result, as it would be in the middle of the filter-branch operation:

A <- B <- C'

Let's say that, by some weird coincidence, the new SHA-1 starts with 30002, i.e., version 2 of commit 3.

Now let's take a look at (part of) the original commit D:

$ git cat-file -p HEAD^
tree 954019cba5244a4a135ff62258660b3d2e3a8087
parent 30001...

Commit D refers, by number, to commit C. So filter-branch, while it changes nothing else in D, must construct a new commit D' that says parent 30002...:

A <- B <- C' <- D'

Likewise, filter-branch is forced to copy the old commit E to a new E':

A <- B <- C' <- D' <- E'     [replacement]

Hence, if filter-branch changes a commit, it also changes all subsequent commits. (This is true of git rebase as well. In fact, git rebase and git filter-branch are kind of cousins. Both read existing commits, apply some change(s), and write the results as new commits; filter-branch just does this programmatically, i.e., it has no --interactive mode. It also has a wide and complex set of specifications for how to make changes, and it can apply to multiple branches instead of one single branch.)
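The knock-on effect can be seen in a toy model that needs no repository at all. This is only an illustration: toy_id and toy_chain are made-up names, and cksum (a CRC) stands in for SHA-1. What it shares with git is the one property that matters here: each ID is computed over content that includes the parent's ID.

```shell
# toy_id PARENT MESSAGE
# Stand-in for a commit ID: a checksum over "parent line + message".
# Uses cksum (a CRC), NOT SHA-1; this is a toy model, not git's real format.
toy_id() {
    printf 'parent %s\n\n%s\n' "$1" "$2" | cksum | cut -d ' ' -f 1
}

# toy_chain MESSAGE...
# Prints one ID per commit of a linear history, oldest first.
toy_chain() {
    parent=none
    for msg in "$@"; do
        parent=$(toy_id "$parent" "$msg")   # each ID depends on the previous ID
        printf '%s\n' "$parent"
    done
}

# Example use: compare the two listings line by line.
#   toy_chain A B C D E
#   toy_chain A B "C with fixed message" D E
```

In the second listing the first two IDs match the original, while the third and every later one differ, even though the messages for D and E are unchanged.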

