Issue 2 CC Notebooks > Examples > Experimental > Basic List Statistics #527

Cattekwaad · 2018-07-31T07:43:54Z

Cattekwaad
Jul 31, 2018

Issue 2: when running #threads with most replies (lines 144 - 270), the answers don't always match the counts in the IETF archives, and are sometimes off by a significant margin from my qualitative hand count. It would be interesting to see where this difference comes from.

sbenthall · 2018-07-31T14:36:07Z

sbenthall
Jul 31, 2018
Maintainer

Would it be possible for you to describe the process you go through to get your hand count?

That way somebody else can replicate the error.


Cattekwaad · 2018-07-31T15:28:07Z

Cattekwaad
Jul 31, 2018
Author

Hi Sebastian, the process for the hand count involved spending 1.5 months reading every single email in the IETF HRPC archive (https://mailarchive.ietf.org/arch/browse/hrpc/) for the period October 2014 until November 2017.

I coded every single email in a spreadsheet for who sent it, the main topics, how contentious it was (measured by the number of emails in a thread) and some mental notes for myself.

Additionally, because hand counts are prone to mistakes and oversights, I triangulated my hand count by looking up each thread at the URL above and checking how many replies there were per thread.


sbenthall · 2018-07-31T15:46:04Z

sbenthall
Jul 31, 2018
Maintainer

Ah :-)

The thread counting code is 4 years old and was written by an undergrad RA who has long since moved on to bigger and better things, so it is very good that you're making us revisit it. It's likely there is a bug in the core BigBang code:

https://github.com/datactive/bigbang/blame/master/bigbang/thread.py

Of course we cannot replicate your 1.5 months of qualitative work. But maybe there is a way to break the problem down into a smaller task that exposes the error.

Would you be able to identify one or two (or more) particular threads for which there is a discrepancy between what BigBang reports and your findings? How many messages do you have in your count for these threads?


Cattekwaad · 2018-07-31T16:00:27Z

Cattekwaad
Jul 31, 2018
Author

hi,

I certainly wouldn't recommend that, no ;-).

A good example is the thread [hrpc] Examining existing Venue Selection criteria, which in bigbang gave a much higher count than in my handcount (I don't have the exact numbers at hand right now).

Another example is the peak number of emails in year three ([hrpc] Human Rights Research Group Call on draft-irtf-hrpc-research-07). My handcount gave 53 replies, the IETF URL gave 56 and bigbang gave 53.

Similarly, for [hrpc] Comments about draft-irtf-hrpc-research-07 handcount is 35, URL is 41, and bigbang gives 32.


sbenthall · 2018-07-31T19:45:05Z

sbenthall
Jul 31, 2018
Maintainer

Ok, thank you. This is tricky.

Some background: For message threading, BigBang uses the In-Reply-To header on each email, which has a reference to a unique code which is some other message's Message-ID. That creates a machine-readable network/tree structure relating different messages.
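To make the mechanism concrete, here is a minimal sketch of header-based threading using only the Python standard library. It is not BigBang's implementation: `thread_sizes` and the sample messages are hypothetical, and the sketch assumes each In-Reply-To header carries a single Message-ID.

```python
from collections import Counter
from email import message_from_string


def thread_sizes(raw_messages):
    """Count messages per thread by following In-Reply-To up to a root."""
    parent = {}  # Message-ID -> parent Message-ID, or None for a thread root
    for raw in raw_messages:
        msg = message_from_string(raw)
        msg_id = (msg["Message-ID"] or "").strip()
        # Assumption: In-Reply-To holds exactly one Message-ID.
        reply_to = (msg["In-Reply-To"] or "").strip() or None
        if msg_id:
            parent[msg_id] = reply_to

    def root_of(msg_id):
        seen = set()
        # Stop at a root, at a parent missing from the archive, or on a cycle.
        while parent.get(msg_id) in parent and msg_id not in seen:
            seen.add(msg_id)
            msg_id = parent[msg_id]
        return msg_id

    return Counter(root_of(m) for m in parent)


# Hypothetical messages: three linked by headers, one that only looks like a reply.
SAMPLE = [
    "Message-ID: <1@example.org>\nSubject: [hrpc] Topic\n\nroot",
    "Message-ID: <2@example.org>\nIn-Reply-To: <1@example.org>\n"
    "Subject: Re: [hrpc] Topic\n\nreply",
    "Message-ID: <3@example.org>\nIn-Reply-To: <2@example.org>\n"
    "Subject: Re: [hrpc] Topic\n\nreply to reply",
    "Message-ID: <4@example.org>\nSubject: Re: [hrpc] Topic\n\nno reply header",
]
sizes = thread_sizes(SAMPLE)
```

In this sketch the fourth message becomes its own one-message thread, because nothing in its headers ties it to the root, even though a human reader would group it with the others.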

Here's the code that builds each thread:
https://github.com/datactive/bigbang/blob/master/bigbang/archive.py#L203

Properly configured mail clients should respect these headers. Also, it is conventional for email replies to have subjects formed from the subject of the root of the thread, i.e.: "Re: [hrpc] Comments about yadda yadda". But that's just a convention; it's possible to reply to an email and change the subject to something else. It's also possible to have the subject of an email look like a reply, but have it not have that header set correctly.
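For contrast, a subject-based grouping of the kind a human reader might apply could be sketched as below. This is an illustration only: `normalize_subject`, `subject_thread_sizes`, and the sample subjects are hypothetical, and the prefix pattern covers only "Re:" and "Fw:"/"Fwd:" style markers.

```python
import re
from collections import Counter

# Leading reply/forward markers such as "Re:", "RE:", "Fwd:", possibly repeated.
_PREFIX = re.compile(r"^\s*((re|fwd?)\s*:\s*)+", re.IGNORECASE)


def normalize_subject(subject):
    """Strip reply prefixes and collapse whitespace so replies match the root."""
    return " ".join(_PREFIX.sub("", subject).split()).lower()


def subject_thread_sizes(subjects):
    """Count messages per normalized subject line."""
    return Counter(normalize_subject(s) for s in subjects)


# Hypothetical subjects: three variants of one thread, plus a renamed reply.
SUBJECTS = [
    "[hrpc] Topic",
    "Re: [hrpc] Topic",
    "RE: Re: [hrpc]  Topic",
    "New subject (was: [hrpc] Topic)",
]
subject_sizes = subject_thread_sizes(SUBJECTS)
```

Grouping by subject catches replies with missing headers but loses replies whose subject was changed, so the two approaches can disagree in both directions.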

I would expect that the IETF web archive also uses In-Reply-To for threading, though it's hard to be sure.

May I ask how you determined messages to be part of the same thread in your hand-count?

Discrepancies between your hand count and the URL are not something BigBang can really address. But if you could identify any particular messages that the URL finds that are not in your hand count (for example, in [hrpc] Human Rights Research Group Call on draft-irtf-hrpc-research-07) or vice versa, then we can look them up with BigBang and see how they are being treated.
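One way to narrow this down is to compare the two counts message by message rather than total by total. The sketch below assumes each source can be reduced to a list of comparable keys, for example (date, sender) pairs from the spreadsheet and from the archive page; `compare_counts` and the sample data are hypothetical.

```python
def compare_counts(source_a, source_b):
    """Report which message keys appear in only one of two sources."""
    a, b = set(source_a), set(source_b)
    return {
        "only_in_a": sorted(a - b),
        "only_in_b": sorted(b - a),
        "in_both": len(a & b),
    }


# Hypothetical keys: (date, sender) pairs for a single thread.
hand = [("2017-03-01", "alice"), ("2017-03-02", "bob")]
archive = [
    ("2017-03-01", "alice"),
    ("2017-03-02", "bob"),
    ("2017-03-03", "carol"),
]
diff = compare_counts(hand, archive)
```

The entries under `only_in_a` and `only_in_b` are the specific messages worth looking up in BigBang.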

Discrepancies between the URL and BigBang are a little more concerning. Counting from the URL now for [hrpc] Comments about draft-irtf-hrpc-research-07, I see 36 in the thread (how did you get 41?). BigBang, freshly collecting data from [hrpc] is giving me 26 (??)


Cattekwaad · 2018-09-06T15:13:55Z

Cattekwaad
Sep 6, 2018
Author

Hi Sebastian, all,

Apologies for the delay in answering; some replies in-line:

On Tue, Jul 31, 2018 at 9:45 PM, Sebastian Benthall wrote:

> Ok, thank you. This is tricky.
>
> Some background: For message threading, BigBang uses the In-Reply-To header on each email, which has a reference to a unique code which is some other message's Message-ID. That creates a machine-readable network/tree structure relating different messages.
>
> Here's the code that builds each thread: https://github.com/datactive/bigbang/blob/master/bigbang/archive.py#L203
>
> Properly configured mail clients should respect these headers. Also, it is conventional for email replies to have subjects formed from the subject of the root of the thread, i.e.: "Re: [hrpc] Comments about yadda yadda". But that's just a convention; it's possible to reply to an email and change the subject to something else. It's also possible to have the subject of an email *look* like a reply, but have it not have that header set correctly.
>
> I would expect that the IETF web archive also uses In-Reply-To for threading, though it's hard to be sure.

This is good to know! thx.

> May I ask how you determined messages to be part of the same thread in your hand-count?

I literally read each email and tracked who was responding to what in a set of spreadsheets. Not perfect and prone to human error, but it has the advantage of being able to catch the issues regarding changing subject headers etc. when still replying to a particular thread/topic.

> Discrepancies between your hand count and the URL are not something BigBang can really address. But if you could identify any particular messages that the URL finds that are not in your hand count (for example, in [hrpc] Human Rights Research Group Call on draft-irtf-hrpc-research-07) or vice versa, then we can look them up with BigBang and see how they are being treated.

Will get back to you on this.

> Discrepancies between the URL and BigBang are a little more concerning. Counting from the URL now for [hrpc] Comments about draft-irtf-hrpc-research-07, I see 36 in the thread (how did you get 41?). BigBang, freshly collecting data from [hrpc] is giving me 26 (??)

Yes - I had also seen that. Any idea how it occurred?

Best,

…


--
Corinne Cath - Speth
Ph.D. Candidate, Oxford Internet Institute & Alan Turing Institute
Web: www.oii.ox.ac.uk/people/corinne-cath
Twitter: @C_Cath


npdoty · 2018-09-19T20:26:25Z

npdoty
Sep 19, 2018
Collaborator

> Discrepancies between the URL and BigBang are a little more concerning. Counting from the URL now for [hrpc] Comments about draft-irtf-hrpc-research-07, I see 36 in the thread (how did you get 41?). BigBang, freshly collecting data from [hrpc] is giving me 26 (??)
>
> Yes - I had also seen that. Any idea how it occurred?

It could be that the IETF's web archives, or especially the human counter, might detect "Re: blah" as the same thread as "blah" even if the client didn't include an In-Reply-To header, or there might be variations in use of headers that bigbang doesn't currently catch: for example, is bigbang also using the References header as part of the tree calculation? There can be a long list of message IDs in the References header, but I'm not sure all clients use it. (And the spec says that In-Reply-To can also have multiple values, although I feel like that's less common in practice: https://tools.ietf.org/html/rfc5322#section-3.6)
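As a rough sketch of the References idea, the function below picks a parent Message-ID from In-Reply-To when present and otherwise falls back to the last entry in References, which RFC 5322 describes as the parent's own References followed by the parent's Message-ID. Whether BigBang does anything like this is exactly the open question; `parent_id` and the sample message are hypothetical.

```python
import re
from email import message_from_string

# A Message-ID token in angle brackets, e.g. <abc@example.org>.
_MSG_ID = re.compile(r"<[^<>\s]+>")


def parent_id(raw_message):
    """Return the most likely parent Message-ID, or None for a thread root."""
    msg = message_from_string(raw_message)
    in_reply_to = _MSG_ID.findall(msg["In-Reply-To"] or "")
    if in_reply_to:
        # In-Reply-To may list several IDs; this sketch takes the first.
        return in_reply_to[0]
    references = _MSG_ID.findall(msg["References"] or "")
    # References lists ancestors oldest first, so the last one is the parent.
    return references[-1] if references else None


# Hypothetical reply that carries References but no In-Reply-To.
RAW = (
    "Message-ID: <3@example.org>\n"
    "References: <1@example.org> <2@example.org>\n"
    "Subject: Re: [hrpc] Topic\n\nbody"
)
parent = parent_id(RAW)
```

A message like this one would be counted as a separate thread by a tree built from In-Reply-To alone, which is one plausible source of undercounting.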

Specific examples would definitely be useful in understanding and uncovering particular bugs, but I suspect that there may always be some variation in these particular counts because there might not even be an easily agreed upon gold standard.


sbenthall · 2021-12-08T17:34:26Z

sbenthall
Dec 8, 2021
Maintainer

Moving this to a discussion as it seems to be a broader discussion about the validation of automated and manual data practices.
