Comment stream dropping comments? #1043
@bicubic unfortunately the /r/all comment stream isn't 100% reliable. If you can find a way to increase reliability we'd definitely love to incorporate those ideas. What part of the documentation would make sense to update to state this observation?
The stream docs I guess. Do you have any clues as to why it's not reliable? Is it the reddit api or is it praw itself? Are you aware of the volume drop I mentioned around December?
I'm not aware of any such volume drop. PRAW can grab up to 100 comments in a single request and makes requests roughly once a second (assuming only a single service is running with your credentials). That means that if Reddit ever gets more than 100 comments in a single second (more precisely, since the last request), items will be missed. The other part PRAW relies on is how quickly Reddit updates those listings. Observation suggests they're pretty reliable at returning results, but anecdotal evidence suggests that for very active streams (all comments) the listing isn't always perfectly up to date, since it's constantly being updated. For better results, monitor only the communities you're interested in if you want a real-time stream. For non-real-time results, give pushshift a try.
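For the "monitor only communities you're interested in" suggestion, a minimal sketch (assuming credentials are configured in praw.ini; the subreddit names are placeholders):

```python
import praw

reddit = praw.Reddit()

# "+"-joined names yield a single stream covering several communities.
subreddit = reddit.subreddit("redditdev+learnpython+programming")
for comment in subreddit.stream.comments():
    print(comment.id, comment.subreddit.display_name)
```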
That is not consistent with the failure mode I am seeing. Below is a count of comments ingested per approximately 1 second. The counts are floats because the counting mechanism is timed by the stream processor firing, so the elapsed delta time may be some fraction over 1 s. Note that at no point does the throughput approach 100/s, and in fact there is a pretty clear pattern of the throughput dropping to almost 0 on some calls.
Thanks for the data. If you can narrow it down more, I'd love to be wrong so that things can be fixed. You can try logging the actual requests to see if that sheds any light on the missed comments: https://praw.readthedocs.io/en/latest/getting_started/logging.html Keep in mind too that comments might be going into a spam filter, in which case they won't show up in the listings. They should be re-added when approved, but I don't know where in the listings they are added.
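The linked page amounts to attaching a DEBUG-level handler to the prawcore logger, which prints every HTTP request PRAW makes:

```python
import logging

handler = logging.StreamHandler()
handler.setLevel(logging.DEBUG)
logger = logging.getLogger("prawcore")
logger.setLevel(logging.DEBUG)
logger.addHandler(handler)
```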
I'm not sure how I can help narrow it down since I have zero understanding of praw, but here is something you can easily test yourself: https://gist.github.com/bicubic/774bf7ae25c29d78acb39d6d2b07849c A couple of observations:
I don't know how praw works under the hood, but if it's calling a similar endpoint and sleeping between calls, then I can see how it would lose fresh comments due to the 100-item return limit. Let me know if I can help further.
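The gist itself isn't reproduced above, but a bare-bones poller of the kind described (hitting the public listing endpoint directly and sleeping between calls) might look like this; the User-Agent string is a placeholder:

```python
import time

import requests

seen = set()  # unbounded here for brevity; a real poller would cap this
while True:
    resp = requests.get(
        "https://www.reddit.com/r/all/comments.json",
        params={"limit": 100},
        headers={"User-Agent": "comment-loss-test by u/example"},
    )
    for child in resp.json()["data"]["children"]:
        comment_id = child["data"]["id"]
        if comment_id not in seen:
            seen.add(comment_id)
            print(comment_id)
    time.sleep(1)  # one request per second; >100 new comments/s are lost
```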
I tried this script
And here is the result log. As we can see, there are many dropped comments as @bicubic pointed out, and some of them can be retrieved later via api/info:
I too am guessing it's a cache problem (rather than a private subreddit or an unapproved comment). As the second GET /comments request in the log suggests (it returns only 9 comments and too many holes), is PRAW requesting too fast?
My testing suggests that this might have to do with the suboptimal `before` adjusting. I've modified bicubic's script to output comment IDs to a text file:

```python
import os

import praw

reddit = praw.Reddit()
subreddit = reddit.subreddit('all')
with open(os.path.splitext(__file__)[0] + '.txt', 'w') as fh:
    for comment in subreddit.stream.comments(pause_after=None, skip_existing=True):
        print(comment, file=fh)
```

I've compared the two text files…
Here are some results, running both scripts (in parallel) for 2 minutes, and also some 5-minute trials. The first column is the number of …
The results vary inconsistently between trials, but bicubic's stream seems to have a higher potential to win by a bigger margin. Now, I've repeated the same tests but changed the following line (line 172 in dba778a) to:

```python
list(function(limit=limit, params={"before": None}))
```

Here are some results: …
PRAW's stream now consistently beats bicubic's stream by a very significant margin. I can't explain why the results differ so much here. I almost feel like there's some flaw in my testing since I'm getting such positive results, but I'm certain that the `before` adjusting is a factor in PRAW's low throughput when streaming comments from r/all.
I noticed this discussion about …
So where do things stand? What change to PRAW, if any, would produce better results? It'd be great to pull such a change in if one exists.
Well, assuming that it is a matter of before adjusting, one way to solve things would be to just lose the before adjusting, but this would reduce the efficiency of the stream. Doing this would only benefit those who are specifically streaming comments from r/all, so it is not a good solution. The best thing to do now would probably be to have PRAW detect that r/all is being streamed from and have it use a … This could be implemented by having …
Can you say more about losing the efficiency of the stream? For slower streams, PRAW introduces longer waits when nothing has changed, so maybe dropping that param altogether is fine. I'd personally prefer higher accuracy over efficiency, and I suspect many PRAW users would as well.
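For context, the longer waits come from an exponential back-off between empty fetches. A minimal sketch of the idea (not PRAW's exact code):

```python
import random

class BackoffCounter:
    """Exponential back-off with jitter: wait longer each time nothing new
    arrives; reset as soon as the stream yields items again."""

    def __init__(self, max_counter):
        self._base = 1
        self._max = max_counter

    def counter(self):
        value = self._base + random.random() - 0.5  # +/- 0.5 s of jitter
        self._base = min(self._base * 2, self._max)
        return value

    def reset(self):
        self._base = 1
```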
@bboe I have no idea, because I don't know what the root cause of this problem is.
I think removing the before adjusting would be an acceptable fix for now, although note that everyone who's not streaming from r/all is going to be slightly worse off, which makes me a little bit uncomfortable. But it's not like any other reddit api library does any sort of fancy … I'll reintroduce before adjusting with a more optimal algorithm by the time I'm through with #1025. (My new streaming implementation already tries to detect the target listing's activity and adjusts …)
I'm guessing this isn't praw's fault, but reddit not updating the indexes powering the before parameter fast enough, causing it to not properly find submissions "before" the given id if that id was added to the index only a second earlier. The average number of comments submitted to reddit each second has grown something like 30% in the last year, so I wouldn't be surprised if something on their side isn't able to keep up. I think removing the before parameter is the right solution, and it will at worst undetectably decrease performance. Worst case, it fetches 100 items instead of 1, then iterates over all 100 client side and throws away 99 of them, which is a very fast operation compared to the request time. It might be worth special-casing r/all rather than doing this globally, though, since no single subreddit (or even a collection of them) accounts for a majority of new comments.
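Client-side deduplication only needs a bounded set of recently seen ids (PRAW keeps a similar helper, BoundedSet, internally). A sketch:

```python
from collections import OrderedDict

class BoundedSet:
    """Remember only the most recent max_items ids, evicting the oldest."""

    def __init__(self, max_items):
        self.max_items = max_items
        self._set = OrderedDict()

    def __contains__(self, item):
        return item in self._set

    def add(self, item):
        self._set[item] = None
        if len(self._set) > self.max_items:
            self._set.popitem(last=False)  # drop the oldest entry
```

Checking 100 fetched items against such a set and discarding 99 of them is a few dict lookups, negligible next to the roughly one-second request interval.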
None of the approaches discussed here are actually capable of capturing 100% of r/all; that should be kept in mind for any changes to praw. For some use cases a loss rate of 5% is just as bad as 25%. Perhaps it's worth considering a different approach for those who do want the 100% r/all firehose, and treating that as a special case.
Aside from this bug, this approach absolutely will capture 100% of r/all as long as you don't make any other requests. And if you are planning to make other requests, there's no endpoint that will let you catch up, since no endpoint returns more than a hundred objects. It could be made more robust by using an incremental-id-based approach like pushshift does, but that won't solve the underlying problem of reddit getting 60-70 new comments a second while the client can only retrieve 100 at a time. Just use pushshift. That's why it was created.
Having run an incremental-ID fetch solution for the last few days, I see multiple time periods where the throughput rate exceeds 100 messages/s. Does that not impose a guaranteed loss on any orthodox api-based approach?
It averages out to 60-70 comments a second. As long as you keep track of which ids you have processed and which you haven't, you can just keep requesting the ones you haven't, and you'll eventually catch up. In times of peak activity you might fall minutes behind, but as long as you're only requesting comments and not doing anything else you'll be fine. How are you running an incremental-ID approach?
Probably in a similar way to pushshift: with a pool of workers across multiple IPs, each making up to N polls per second for 100 explicit IDs. I don't think such an approach is viable on a single node without technically violating api rate limits.
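For illustration, a single-node version of an incremental-ID poller might look like the sketch below. It uses PRAW's info() endpoint, `handle` is a hypothetical callback, and this is the single-request-at-a-time shape rather than a multi-IP pool:

```python
import praw

DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"

def to_base36(number):
    # Reddit ids are base-36 strings; Python only ships the reverse, int(s, 36).
    result = ""
    while number:
        number, remainder = divmod(number, 36)
        result = DIGITS[remainder] + result
    return result or "0"

reddit = praw.Reddit()  # assumes credentials configured in praw.ini

# Seed the cursor from the newest comment on r/all.
newest = next(iter(reddit.subreddit("all").comments(limit=1)))
cursor = int(newest.id, 36)

while True:
    # Request the next 100 ids explicitly, whether or not they have
    # appeared in any listing yet.
    fullnames = ["t1_" + to_base36(n) for n in range(cursor, cursor + 100)]
    for comment in reddit.info(fullnames):
        handle(comment)  # hypothetical handler
    cursor += 100  # naive: a real poller re-requests ids that never came back
```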
Well, I'm reasonably sure that u/Stuck_In_the_Matrix only runs one request a second for pushshift, and he fetches comments and submissions in the same request. He's talked recently about adding a second requester, but he hasn't needed to yet. I'm not really sure where you're getting multiple hundreds of distinct ids a second from; reddit just doesn't get that much content other than in a few brief periods during major sporting events. This might be getting a bit off topic, though.
Correct; I have a surplus of polling capacity because I'm re-polling comments to watch scores and thread progression. To consume 100% of the firehose in real time, you need to be capable of fetching up to 2x100-item queries per second. Re-polling at a later time is not ideal, since comments get moderated or deleted seconds after creation; re-polling also rules out use cases like bots reacting to firehose comments in real time. tl;dr: I don't think any orthodox approach can consume the firehose at 100% in real time, and this will increasingly become an issue as daily comment volumes keep growing. For that reason it might be worthwhile to treat the r/all stream as a special case in praw.
Can y'all see if the following PR improves the streaming functionality? https://github.com/praw-dev/praw/pull/1050/files Thanks!
Yes, #1050 works well for me, @bboe. I've noticed you've decided to do away with limit adjusting as well. You once mentioned to me that part of the reason for the param adjusting was to avoid cached results; I'd like to know about any relevant discussions, since there must have been a reason for adding all this adjusting in the first place.
I just want to note that I'm seeing the same issues mentioned here, and I'm eagerly awaiting a new version that includes #1050 in the hope that it causes fewer comments to be dropped.
To resolve the issue, maybe it can be noted that the … I have confirmed that you can do base-36 conversions with the built-in `int` function:

```python
In [2]: int("fdilbrw", base=36)
Out[2]: 33469023452
```
Noting a known issue is always good, but just noting the issue isn't the same as fixing it. For this reason, it would be better if we could improve the streams to avoid dropping items entirely, but this is difficult to achieve.
In my case, I'm looping through a list of subreddit titles and passing … The following … prints 1 comment ID per second, per subreddit. I thought it might be a rate-limiting issue, but every first subreddit that I stream comments from does the same as all the subreddits following it. That doesn't rule out rate limiting, but I can't figure out what's wrong.
Maybe the subreddit really does output 1 comment/sec. Try on …
Since this issue hasn't been fully resolved yet, is there at least a method to decrease the proportion of comments that are dropped? I've noticed that the bot I wrote fails to process most comments.
This issue is stale because it has been open for 20 days with no activity. Remove the Stale label or comment or this will be closed in 10 days. |
This issue was closed because it has been stale for 10 days with no activity. |
Issue Description
A simple consumer like the one below seems not to be processing all comments.
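The snippet itself didn't survive in this copy of the issue; the consumer described is presumably of roughly this shape (credentials in praw.ini and the `process` handler are assumptions):

```python
import praw

reddit = praw.Reddit()

for comment in reddit.subreddit("all").stream.comments():
    process(comment)  # hypothetical, cheap handler; client IO is not the bottleneck
```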
I observed a drop in total comment throughput with the praw stream sometime around December 2018, and it has never really recovered.
I have tested by manually making a number of comments and observing that some of them don't get captured by the praw stream.
IO on the client side is not a limiting factor.
System Information