todo.txt


SQL table structure:

blames:
repo, commit, file_ref, commit_author_ref, author_ref, lines (+ or -), time

authors:
author_name, email, author_ref

filenames:
filename, file_ref

file_refs
file_ref, ignored

commits: (maybe don't do this one at all, its' easy to look up with git)
repo, commit, date, timezone, revlist_order


SQL would look something like:

SELECT blames.date, blames.author, SUM(blames.lines) FROM blames
  INNER JOIN file_refs ON blames.file_ref == file_refs.file_ref
  WHERE blames.author="brad" 
    AND NOT file_refs.ignored
  GROUP_BY blames.author
  ORDER_BY blames.date


ideally we'd not run the damn blame on each revision, but the problem is that git doesn't know which blame
lines got removed without looking them up, and there doesn't seem to be a super easy way to do that, without
running blame. I could try to implement it, but I'm sure I'd get caught up in old lines, new lines,
etc. Rather just run 50x slower and let it be correct....


for each commit, starting from the beginning:

add an entry to the commits table. If there was already one there, skip this commit

for each file in ls-tree:

run git blame -w -C -M --porcelain, and parse it like the code in GitPython does. Since we started from the
beginning, we know that only the latest commit is of interest, so just skip anything that's not that (since we
already have entries for those! Maybe write some code that tests this, and then disable it later. Should be
able to see those commits in the commits table). for the ones that are the current commit, create a blames
entry.

actually, while discarding lines, pay attention to filenames. If there are any that don't match the current
filename, keep track of this because this signifies a file rename / move. Find the file_ref from the older
filename, and use that as the current file_ref, but also add an entry to the filenames table with the new
name. If there are no other filenames, then just check the filenames table and if it exists, use that file_ref
in the blames row. Otherwise, insert a new entry into file_refs (file_ref field should self-increment), and
add another new one to filenames with the recently-added file_ref

As an optimiazation, we only need to run this on files touched in the given commit


You can use `git diff rev rev^` to see the diff for the given rev. usefaul args for git diff:
-U0 show 0 extra lines around diff
--name-only onlhy output names of changed files
--no-color
-C -M detect copies and moves

run `git diff -C -M -U0 --no-color  rev rev^`, then parse the output:

lines starting with "---" and "+++" should be next to eachother and are filenames. Figure out how to use this
to detect renames?

Lines starting with "@@" show the line numbers for the preceding file.  e.g.: @@ -252,3 +240,3 @@ means that
in the old file, 3 lines starting at 252 were modified, and in the new one, 3 lines starting at 240 were
modified


--------------------------------------------------------------------------------

ok, whole new plan. Use the diff stuff above. Parse the lines. Run blame with -L on the old rev to see who
lost lines. Then just count the lines and whoever did the commit gained those lines. Make that as two seperate
entries. That way a single user who changed code can get points for being finicky


--------------------------------------------------------------------------------

Unrealted note: look into textconv when I get back. It looks like it allows you to specify a program to make
binary diffs nicer. I'm thinking it could be used with xcdump?

--------------------------------------------------------------------------------

latest issue: merge commits

How I'm handling them now is wrong. I'm ignoring them, but then when I look at the commit after a merge, and
try to figure out how many lines were lost, I'll look at the merge commit, and I might count lines lost from
the merge author, when the merge author never got credit for those lines in the first place

Instead, I need to detect if the parent is a merge. If so, I need to look at the *commit* from blame
line-porcelain. And if it is that merge commit, then I need to check both parents, and take whichever one was
committed last (since that will have usurped the previous one already). Other idea: since I'm doing these in
rev-list order, maybe I can just always grab the last commit I did, instead of using the ^ operator in git


This is still messed up. I need to include merges, and figure out how to parse the output properly


Nope, still not right. It's closer, but not I'm adding things up wrong. I can't just add in topo-order because
things happen on branches. E.g. I have to always add a rev to its parent to get a total blame or cumulative
blame sum, I can't just add it to the last rev in topo-order, because that might be from a different branch


--------------------------------------------------------------------------------
OK, I'm an actual fuckup. Here's the real problem. Say your repo looks like:

a--b----d
 \---c-/

and b and c both delete the same line from a. It will get double counted.

The problem is that when doing a sum of the rows, its valid to add a+b because b^ == a, but you can't add b+c
because c^ != b.

solution:

In any commit where the last commit in our database is not a parent of the current commit, we need to fall
back to doing a full blame analysis. To do this, we first compute a cumualtive blame up until the given
commit. We can store this in a cache table, perhaps. Then we use git blame over the entire file for each
changed file, and we figure out what the diff needs to look like to get us from the last blame to the new
one. As an optimization, we could compute the cumulative sum only for the files that changed as well.


--------------------------------------------------------------------------------
here's another idea to fix the above:

don't use git show anymore. Go back to using git diff, but use it where the previous revision is the last one
in the commit table, in topo order.

This is the same as pretending the history is totally flat, which is what we want for display purposes anyway!
Bring back lastRev!


--------------------------------------------------------------------------------

ok, lets try this again. It's actually a mix of both solutions. The merge needs to work the way it did where I
handled it properly. Using the graph above, b onto a is fine. Then for c, we use the diff, so we pretend c
applied to b. But then when we do d, we use the merge logic to make it apply to both b and c. Does this double
count?? Yes, this doesn't work at all.

This doesn't work. Lets think again about doing this where we virtually squash stuff. Maybe we can use
explicit diffs to skip over regions where things are confusing. Maybe we do a and b, then we ignore c and look
at a diff between b and d, which includes the work in c. That might work. This whole thing works great with
linear history, so maybe just ignore the patches where it isn't linear, maybe wrap the whole damn thing up
into a diff between a and d What if there were another merge after that though? think about this more. Nah,
this is crap too, because any long lived branch will be totally screwed.