Skip to content
This repository has been archived by the owner on Feb 8, 2018. It is now read-only.

history is screwed up, negative balance #1116

Closed
chadwhitacre opened this issue Jul 5, 2013 · 37 comments
Closed

history is screwed up, negative balance #1116

chadwhitacre opened this issue Jul 5, 2013 · 37 comments

Comments

@chadwhitacre
Copy link
Contributor

@pjz noticed, not sure yet who else might be affected.

@chadwhitacre
Copy link
Contributor Author

I checked with the backup I took before payday yesterday (#1107) and it's a problem there too. So this isn't something that snuck in yesterday during payday.

@chadwhitacre
Copy link
Contributor Author

This may yet be a payday bug though. The history page starts with your current balance and adds/subtracts from that for each record in the transfers and exchanges table.

@chadwhitacre
Copy link
Contributor Author

IRC

@chadwhitacre
Copy link
Contributor Author

According to transfers, you've received $273.89 and given $48.75, so I'd expect your balance to be $225.14 instead of $4.75.

@chadwhitacre
Copy link
Contributor Author

I count 58 affected accounts as of Gittip 56.

@chadwhitacre
Copy link
Contributor Author

So we've got a bug.

@chadwhitacre
Copy link
Contributor Author

(Though I guess we knew that.)

@chadwhitacre
Copy link
Contributor Author

One implication we'll have to follow up on is that I've used the balance as reported on the history page to drive manual payouts.

@chadwhitacre
Copy link
Contributor Author

I have a series of weekly backups from the past year. A good next step might be to run my checker script against each week's backup to find out when this behavior started.

@chadwhitacre
Copy link
Contributor Author

Here's the number of affected record for each backup I've got:

       0 2012-07-05.psql.log
       0 2012-07-10.psql.log
       0 2012-07-13.psql.log
       2 2012-07-17.psql.log
       2 2012-07-20.psql.log
      73 2012-07-20b.psql.log
       0 2012-07-27.psql.log
       0 2012-07-27b.psql.log
       0 2012-08-03.psql.log
       0 2012-08-03b.psql.log
       0 2012-08-10.psql.log
       0 2012-08-16.psql.log
       0 2012-08-17.psql.log
       0 2012-08-23.psql.log
       2 2012-08-24.psql.log
       2 2012-08-28.psql.log
       2 2012-08-31.psql.log
       5 2012-09-06.psql.log
       6 2012-09-12.psql.log
       6 2012-09-13.psql.log
       7 2012-09-20.psql.log
       7 2012-09-27.psql.log
       9 2012-10-03.psql.log
       9 2012-10-04.psql.log
      13 2012-10-11.psql.log
      18 2012-10-25.psql.log
      20 2012-11-01.psql.log
      22 2012-11-05.psql.log
      22 2012-11-08.psql.log
      22 2012-11-15.psql.log
      22 2012-11-22.psql.log
      24 2012-11-29.psql.log
      26 2012-12-06.psql.log
      27 2012-12-11.psql.log
      27 2012-12-17.psql.log
      27 2012-12-20.psql.log
      27 2012-12-27.psql.log
      27 2013-01-03.psql.log
      27 2013-01-10.psql.log
      27 2013-01-17.psql.log
      30 2013-01-24.psql.log
      30 2013-01-31.psql.log
      34 2013-02-05.psql.log
      34 2013-02-07.psql.log
      36 2013-02-08.psql.log
      36 2013-02-14.psql.log
      39 2013-02-21.psql.log
      40 2013-02-28.psql.log
      42 2013-03-07.psql.log
      46 2013-03-14.psql.log
      49 2013-03-21.psql.log
      55 2013-03-28.psql.log
      59 2013-04-03.psql.log
      59 2013-04-04.psql.log
      62 2013-04-06.psql.log
      62 2013-04-10.psql.log
      62 2013-04-11.psql.log
      64 2013-04-14.psql.log
      64 2013-04-18.psql.log
      66 2013-04-25.psql.log
      67 2013-05-02.psql.log
      69 2013-05-09.psql.log
      69 2013-05-13.psql.log
      69 2013-05-16.psql.log
      75 2013-05-23.psql.log
      77 2013-05-30.psql.log
      79 2013-06-06.psql.log
      84 2013-06-13.psql.log
      85 2013-06-20.psql.log
      88 2013-06-27.psql.log
      88 2013-06-27b.psql.log
      90 2013-06-29.psql.log
      90 2013-07-01.psql.log
      90 2013-07-04.psql.log

@chadwhitacre
Copy link
Contributor Author

So the problem seems to have been with us the whole time(!). It grew through week 7, which was a bit of a clusterhuck (#169). We apparently hit the reset button in cleaning that up, but it has since grown again. It is currently affecting 90 accounts out of 1559 with a balance (6%).

Presumably once an account is corrupted then the corruption carries through (rather than some accounts being corrupted some weeks, others the next).

The growth in corruption doesn't appear to line up with instances of the payday script crashing. We've had a handful of instances of that, and none in the past few months until this week, which isn't even represented above. The corruption appears to be part of payday in general and not triggered by crash-and-rerun.

@chadwhitacre
Copy link
Contributor Author

We should make sure we're accounting for fees properly in the check script. I don't see that fees could be implicated in the corruption itself, because @pjz has zero exchanges and great corruption.

@chadwhitacre
Copy link
Contributor Author

Phew. I wasn't subtracting payout fees from exchange amounts (the exchanges table is asymmetric: payin fees don't need to be subtracted from exchange amounts to compute balance but payout fees do). Corrected for that, here's what I'm now seeing:

       0 2012-07-05.psql.log
       0 2012-07-10.psql.log
       0 2012-07-13.psql.log
       2 2012-07-17.psql.log
       2 2012-07-20.psql.log
      73 2012-07-20b.psql.log
       0 2012-07-27.psql.log
       0 2012-07-27b.psql.log
       0 2012-08-03.psql.log
       0 2012-08-03b.psql.log
       0 2012-08-10.psql.log
       0 2012-08-16.psql.log
       0 2012-08-17.psql.log
       0 2012-08-23.psql.log
       0 2012-08-24.psql.log
       0 2012-08-28.psql.log
       0 2012-08-31.psql.log
       0 2012-09-06.psql.log
       0 2012-09-12.psql.log
       0 2012-09-13.psql.log
       0 2012-09-20.psql.log
       0 2012-09-27.psql.log
       0 2012-10-03.psql.log
       0 2012-10-04.psql.log
       0 2012-10-11.psql.log
       0 2012-10-25.psql.log
       0 2012-11-01.psql.log
       0 2012-11-05.psql.log
       0 2012-11-08.psql.log
       0 2012-11-15.psql.log
       0 2012-11-22.psql.log
       0 2012-11-29.psql.log
       0 2012-12-06.psql.log
       0 2012-12-11.psql.log
       1 2012-12-17.psql.log
       1 2012-12-20.psql.log
       1 2012-12-27.psql.log
       1 2013-01-03.psql.log
       1 2013-01-10.psql.log
       1 2013-01-17.psql.log
       2 2013-01-24.psql.log
       1 2013-01-31.psql.log
       1 2013-02-05.psql.log
       1 2013-02-07.psql.log
       1 2013-02-08.psql.log
       1 2013-02-14.psql.log
       1 2013-02-21.psql.log
       1 2013-02-28.psql.log
       1 2013-03-07.psql.log
       1 2013-03-14.psql.log
       1 2013-03-21.psql.log
       1 2013-03-28.psql.log
       1 2013-04-03.psql.log
       1 2013-04-04.psql.log
       1 2013-04-06.psql.log
       1 2013-04-10.psql.log
       1 2013-04-11.psql.log
       1 2013-04-14.psql.log
       1 2013-04-18.psql.log
       2 2013-04-25.psql.log
       2 2013-05-02.psql.log
       2 2013-05-09.psql.log
       2 2013-05-13.psql.log
       2 2013-05-16.psql.log
       2 2013-05-23.psql.log
       2 2013-05-30.psql.log
       2 2013-06-06.psql.log
       4 2013-06-13.psql.log
       4 2013-06-20.psql.log
       4 2013-06-27.psql.log
       4 2013-06-27b.psql.log
       4 2013-06-29.psql.log
       4 2013-07-01.psql.log
       4 2013-07-04.psql.log
       4 2013-07-11.psql.log
       5 2013-07-15.psql.log

@chadwhitacre
Copy link
Contributor Author

One interesting datum is that for the four accounts affected, three of them get corrupted only once. @pjz got corrupted twice. But in other words it's only happening to a few users (4 / 1559 = 0.2%) and it's only happened five times total. It's a rare event, thankfully.

What is it? Do these corruptions line up with payday crashes?

@chadwhitacre
Copy link
Contributor Author

The second in 2013-01-24 is explained by a blip in my handling of a manual payout.

@pjz
Copy link

pjz commented Jul 12, 2013

Last payday 'worked'... in that it adjusted from my previous balance correctly. I'm still down the $220ish amount, though.

@chadwhitacre
Copy link
Contributor Author

IRC ref

@chadwhitacre
Copy link
Contributor Author

I modified the 1116.sh script I'm using to process backups (in concert with 1116.py). It now processes backup files named on the command line. If none are named it processes all backups in the current directory. I processed two more backup files and appended the result to #1116.

@chadwhitacre
Copy link
Contributor Author

We picked up another corrupt record last week. We crashed two weeks ago 2013-07-04 (#1107). Each backup above generally is from immediately before payday is run, so the 2013-07-04 row above is the state of the db just before the 2013-07-04 payday, etc. 2013-07-11 went off without a hitch (#1131), except that after it we now have an additional corrupt record.

When we first experienced corruption during payday (#169) it turned out that the corruption entered in one week but didn't manifest until the next week. Is that the pattern here as well? How do the manifestations here line up with payday crashes?

@chadwhitacre
Copy link
Contributor Author

(I manually added in the dates corresponding to the Gittip week number.)

$ ack Traceback
2012-07-20  gittip-7.log:145:pid-34940 thread-140735091989696 (MainThread) Traceback (most recent call last):
2012-08-24  gittip-12.0.log:1612:pid-35707 thread-140735091989696 (MainThread) Traceback (most recent call last):
2012-09-06  gittip-14.2.log:515:pid-30377 thread-140735081639104 (MainThread) Traceback (most recent call last):
2012-09-06  gittip-14.log:1202:pid-29946 thread-140735081639104 (MainThread) Traceback (most recent call last):
2012-09-27  gittip-17.log:1898:pid-2821 thread-140735081639104 (MainThread) Traceback (most recent call last):
2012-12-06  gittip-27.log:765:pid-74030 thread-140735090330816 (MainThread) Traceback (most recent call last):
2012-12-13  gittip-28.log:2088:pid-63480 thread-140735090330816 (MainThread) Traceback (most recent call last):
2013-01-03  gittip-31.log:1944:pid-734 thread-140735090330816 (MainThread) Traceback (most recent call last):
2013-01-10  gittip-32.log:1196:pid-93273 thread-140735090330816 (MainThread) Traceback (most recent call last):
2013-01-24  gittip-34.log:214:pid-51984 thread-140735090330816 (MainThread) Traceback (most recent call last):
2013-07-04  gittip-57.log:870:pid-29045 thread-140735088401600 (MainThread) Traceback (most recent call last):
2013-07-04  gittip-57.log:893:Traceback (most recent call last):

@chadwhitacre
Copy link
Contributor Author

For kicks, here's the weeks we crashed, shown on the users chart:

screen shot 2013-07-15 at 3 21 17 pm

Actually, this is a good way to visualize this. Here's the weeks we saw corruption:

screen shot 2013-07-15 at 3 40 40 pm

@chadwhitacre
Copy link
Contributor Author

Well, I'd say that rules out crashing as the root cause! Back to the drawing board. :-)

@chadwhitacre
Copy link
Contributor Author

I believe pachinko first hit in Gittip 46 (#857). The frequency of corruption has increased since then, but we have that one case way back in week 29. Maybe there are multiple causes?

@chadwhitacre
Copy link
Contributor Author

Here's the data for @pjz:

                  Exchanged       Received           Balance
Week     Date           Gave Away          Check           Difference
 53   2013-06-06     0    37.50   229.38   191.88     3.75   188.13
 54   2013-06-13     0    39.75   239.63   199.88    11.75   188.13
 55   2013-06-20     0    42.75   252.39   209.64     4.75   204.89
 56   2013-06-27     0    45.75   261.64   215.89    11.00   204.89
 57   2013-07-04     0    48.75   273.89   225.14     4.75   220.39
 58   2013-07-11     0    51.75   286.14   234.39    14.00   220.39

P.S. Gave Away and Received are cumulative, not per-week.

@chadwhitacre
Copy link
Contributor Author

What is the pattern here? It's interesting that difference increases in stairstep (first by 16.76 and then by 15.50) while nothing else does. Where does the stairstep come from?

@chadwhitacre
Copy link
Contributor Author

Okay! I believe (in @pjz's case at least) that this is due to a bug where we record an ach credit locally even if the Balanced API call fails. The reason we attempt to credit @pjz in the first place here is because he has a Balanced account connected (even though he has no credit card or bank account on file).

In other words, Gittip believes that @pjz has been withdrawing money all along. The stairstep effect is due to the minimum payout threshold of $10.

@chadwhitacre
Copy link
Contributor Author

I keep logs for each payday. I inferred the above after seeing a Crediting ... failed log message for pjz in the log for week 53. Based on that, I just did an ack for "Crediting.*failed", and it lines up perfectly with the corruption we're seeing.

@chadwhitacre
Copy link
Contributor Author

The fix is somewhat tricky because we do want to call into record_credit, and the UPDATE to participants for balance also updates the last_ach_result as well. Need to write some tests and untangle the code paths here.

@pjz
Copy link

pjz commented Jul 18, 2013

So is it fixed? Is it going to happen again? Can you fix my balance yet?

@chadwhitacre
Copy link
Contributor Author

I have an idea for an easy fix, but I want to write a test first.

I put a pdb.set_trace in record_credit and ran the test suite. The test suite doesn't touch that method, apparently.

@chadwhitacre
Copy link
Contributor Author

We're at 75% coverage on gittip.billing.payday.

@chadwhitacre
Copy link
Contributor Author

Now to repair the damage ...

@chadwhitacre
Copy link
Contributor Author

Okay! Database repaired with the 1116.py script.

Affected users were: @colindean @nexxy @nicksergeant @richleland @pjz.

For those just joining us, we had a bug where we corrupted your balance on Gittip, because we mishandled the case where you attached a Balanced bank account (or credit card?) and then detached the bank account. We were still treating you as having a bank account attached, so Gittip was decrementing your balance each week even though there was no actual deposit happening. Sorry! :-(

@pjz
Copy link

pjz commented Jul 25, 2013

yay, fixed!

@colindean
Copy link
Contributor

!m @whit537

@tshepang
Copy link
Contributor

this must have been a difficult one @whit537 and thanks for handling it so well (and in such an open manner)

@chadwhitacre
Copy link
Contributor Author

@tshepang Having the weekly backups and payday log files for the past year certainly helped. :-)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants