-
Notifications
You must be signed in to change notification settings - Fork 308
Conversation
I'm pretty sure the initial load that took two days included README fetching, which we ripped out in #4211. I believe now we can load up from just the metadata, which should go much quicker. |
Curveball! |
|
Pretty sure we want to use https://github.com/djc/couchdb-python. |
Probably this API? https://pythonhosted.org/CouchDB/client.html#couchdb.client.Database.changes |
Underlying API: http://docs.couchdb.org/en/2.0.0/api/database/changes.html. |
Yeah, this is gonna be it. :-)
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, print_function, unicode_literals
from couchdb import Database
npm = Database('https://skimdb.npmjs.com/registry')
changes = npm.changes(limit=10)
import pdb; pdb.set_trace() |
Alright, I think to start with we should go for a naive approach where we have a single process/thread that consumes the registry stream and inserts/updates in our database all at once. Decoupling fetch and update will be more complicated and should be done because we're getting too far behind otherwise. We could even probably build a quick dashboard to show far behind we are—or log over to Librato. |
Shelving: #!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, print_function, unicode_literals
from couchdb import Database
from gratipay import wireup
def go(db):
npm = Database('https://skimdb.npmjs.com/registry')
changes = npm.changes(feed='continuous', include_docs=True)
for change in changes:
doc = change['doc']
if 'name' not in doc:
continue # not a package, probably a design doc*
name = doc['name']
description = doc.get('description', '')
emails = [e for e in [m.get('email') for m in doc.get('maintainers', [])] if e]
try:
db.run( "update packages set description=%s, emails=%s where package_manager='npm' and name=%s"
, (description, emails, name)
)
except:
db.run('insert into packages () values ()')
# * https://github.com/npm/registry/blob/aef8a275/docs/follower.md#clean-up
if __name__ == '__main__':
env = wireup.env()
db = wireup.db(env)
go(db) |
This is basically a rewrite of this subsystem. |
This will actually be much better though. No more |
Eep! Time to upgrade to Postgres 9.6 locally. 😊
|
Okay! I'm going to run this locally and see how long it takes and how it behaves. |
Started a long run ... |
|
I'm at about 80,000 after about 10(?) minutes. |
So maybe an hour for the whole thing? |
That's way better than two days, anyway. |
http://docs.couchdb.org/en/2.0.0/api/database/changes.html#polling |
Out of time for now.
|
Deletes should remove locally. Need to take care to unlink teams from packages when deleting packages. It's okay for the team to stick around, I think? It'll have a 404 homepage on npmjs. |
gratipay/sync_npm.py
Outdated
connection.commit() | ||
|
||
|
||
def delete(cursor, processed): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it'd be clearer if we renamed processed
to processed_doc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 631dfc9.
gratipay/sync_npm.py
Outdated
for change in change_stream(last_seq): | ||
if change.get('deleted'): | ||
# Hack to work around conflation of design docs and packages in updates | ||
op, doc = delete, {'name': change['id']} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a bit confusing. delete
takes a dictionary, although it only needs one string
as the argument. Also, we don't need to pass the fake doc ({'name': change['id']}
) through process_doc
.
At the cost of a line or two more, I think this can be simplified.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something along the lines of:
Raw version:
Before:
with db.get_connection() as connection:
for change in change_stream(last_seq):
if change.get('deleted'):
# Hack to work around conflation of design docs and packages in updates
op, doc = delete, {'name': change['id']}
else:
op, doc = upsert, change['doc']
processed = process_doc(doc)
if not processed:
continue
cursor = connection.cursor()
op(cursor, processed)
cursor.run('UPDATE worker_coordination SET npm_last_seq=%(seq)s', change)
connection.commit()
def delete(cursor, processed):
cursor.run("DELETE FROM packages WHERE package_manager='npm' AND name=%(name)s", processed)
def upsert(cursor, processed):
cursor.run('''
INSERT INTO packages
(package_manager, name, description, emails)
VALUES ('npm', %(name)s, %(description)s, %(emails)s)
ON CONFLICT (package_manager, name) DO UPDATE
SET description=%(description)s, emails=%(emails)s
''', processed)
After:
with db.get_connection() as connection:
for change in change_stream(last_seq):
cursor = connection.cursor()
if change.get('deleted'):
# Hack to work around conflation of design docs and packages in updates
delete(cursor, change['id'])
else:
upsert(cursor, process_doc(doc))
cursor.run('UPDATE worker_coordination SET npm_last_seq=%(seq)s', change)
connection.commit()
def delete(cursor, package_name):
cursor.run("DELETE FROM packages WHERE package_manager='npm' AND name=%s", package_name)
def upsert(cursor, processed_doc):
cursor.run('''
INSERT INTO packages
(package_manager, name, description, emails)
VALUES ('npm', %(name)s, %(description)s, %(emails)s)
ON CONFLICT (package_manager, name) DO UPDATE
SET description=%(description)s, emails=%(emails)s
''', processed_doc)
Only downside I see here is that we're doing a little bit more work (calling process_doc
, checking the deleted
key) inside the transaction
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That doesn't account for skipping docs with no name
key. How about 631dfc9?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes 631dfc9 looks good
gratipay/cli/sync_npm.py
Outdated
with sentry.teller(env): | ||
consume_change_stream(production_change_stream, db) | ||
try: | ||
last_seq = get_last_seq(db) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, if we're calling get_last_seq
here anyway - might make sense to simplify the function definition of consume_change_stream
to accept the stream directly, and not a function that has to be called with seq
to return the stream?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in af41409.
return self | ||
|
||
def __exit__(self, exc_type, exc_value, traceback): | ||
self.tell_sentry(exc_type, {}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does exc_type
have all the details that we need to send to sentry? Shouldn't we pass traceback
and exc_value
? (I'm not sure what they are, but traceback
sure sounds important)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sentry accesses Python's global exception state directly during captureException
(via sys.exc_info
, presumably), so we don't have to pass it through these function calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting
import traceback | ||
|
||
from aspen import log | ||
from gratipay import wireup |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
utils
importing wireup
? 😛 That seems hacky. No neater way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(This PR is already 1.5x our 400-net-lines rule of thumb.)
/me looking into failures ... |
I'm good once travis is :) |
Travis is good! :-D |
|
Some discussion of error reporting and retry architecture in slack. |
Merging and deploying... |
When you're done you could try adding an instrument to http://inside.gratipay.com/appendices/health for npm sync lag. Librato is in 1Password so it would be a good test for that as well. :) |
Hmm, I had a |
I forgot to add the env var 😞 Gratipay was down for around 3 minutes, back up now |
Now to figure out how to run the syncer. Add it to the heroku procfile? |
I'm going to try to run as a one-off dyno first |
Maybe pyc files kept Git from removing it after the switch from |
Picks up from #4148 and #4427 (comment). Part of #4427.
We managed to load up a snapshot of npm back on #4148, but only barely enough to show stubby pages. Now that #4305 is pretty much ready to go, it's time to get some more robust npm syncing in place. Turns out the old API we were depending on is going away, so we need to rebuild this subsystem around npm's CouchDB change stream.
Specs
Todo
ijson
dep, addcouchdb