Added comment-groups.csv export. #1852

samskivert · 2024-11-28T01:47:30Z

Colin will be using this data (or some filtered version of it) to pass to an LLM when it wants to summarize things.

The code uses the summarized data from the PCA JSON blob instead of computing things from the raw comments and votes tables. The latter approach results in numbers that don't match up exactly with the data that appears on the HTML version of the report (our numbers are a little higher, so the Clojure backend must be filtering out some votes/voters that we are not).

We want the LLM to see the exact same data that's on the HTML page because it might refer to specific numbers and we want those numbers to be exactly the same as the numbers the user sees.

jucor · 2024-12-02T15:22:16Z

@samskivert Do you have code somewhere that does "the latter" (i.e. computes the PCA from the raw comments and votes tables), gets these higher numbers, and compares them with the PCA JSON blob stored in the database by the Clojure code? Maybe we could keep it in a side repo. I'm thinking it could be handy as a point of comparison if/when I need to dive back into the Clojure PCA code. Thanks!

In particular, if I'm understanding correctly, the PCA Clojure code at https://github.com/compdemocracy/polis/blob/edge/math/src/polismath/math/pca.clj has various scalings and tricks (as every custom implementation has, of course), so a separate implementation that doesn't match is super helpful for surfacing those.

samskivert · 2024-12-02T21:08:49Z

Alas, I don't have anything that fancy. The database contains a JSON blob that includes the PCA results and various things derived from that, including an assignment of each user to a cluster.

In this case, we want the votes on each comment tallied by group/cluster, so I wrote code to compute that from the group/cluster assignments in the PCA blob. However, the counts I got were slightly different from the counts that show up on the report page for the conversation. I applied an equivalent "ignore this user if they didn't vote on enough issues" rule, which we know the Clojure code applies, but even with that the numbers were still slightly different.
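For illustration, the tallying described above might look roughly like the sketch below. It is not the code from this PR: the `Vote` shape, the `groupOf` map, and the `minVotes` threshold are all hypothetical names, and the real Clojure filtering evidently does something more than this single rule, since the counts did not match.

```typescript
// Hypothetical shapes for illustration; not taken from the Polis codebase.
type Vote = { pid: number; tid: number; vote: "agree" | "disagree" | "pass" };
type Tally = { agrees: number; disagrees: number; passes: number };

// Tally votes per (comment, group), skipping participants who have no
// group assignment or who cast fewer than `minVotes` votes overall.
function tallyByGroup(
  votes: Vote[],
  groupOf: Map<number, number>, // participant id -> group id (assumed to come from the PCA blob)
  minVotes: number
): Map<string, Tally> {
  // First pass: count how many votes each participant cast.
  const votesPerPid = new Map<number, number>();
  for (const v of votes) {
    votesPerPid.set(v.pid, (votesPerPid.get(v.pid) ?? 0) + 1);
  }

  // Second pass: accumulate tallies keyed by "commentId:groupId".
  const tallies = new Map<string, Tally>();
  for (const v of votes) {
    const gid = groupOf.get(v.pid);
    if (gid === undefined) continue; // participant was not clustered
    if ((votesPerPid.get(v.pid) ?? 0) < minVotes) continue; // too few votes
    const key = `${v.tid}:${gid}`;
    const t = tallies.get(key) ?? { agrees: 0, disagrees: 0, passes: 0 };
    if (v.vote === "agree") t.agrees++;
    else if (v.vote === "disagree") t.disagrees++;
    else t.passes++;
    tallies.set(key, t);
  }
  return tallies;
}
```

Any difference between this kind of filter and whatever the backend actually does would show up as small count mismatches like the ones described here.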

I looked into how the data shown on the report page was being computed, and it turns out that this data is also part of the JSON blob that comes from Clojure; it's just being reproduced directly in the HTML. So I switched the CSV report to use the same Clojure-originated data that the report page uses. The intention for this report is to give it to an LLM for use in summarizing the contents of the report page, so we wanted the numbers to match up exactly.
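As a rough sketch of that approach, the export can simply serialize per-group tallies that were already computed upstream, without recounting anything. The `GroupVotes` shape and the CSV column names below are assumptions made for this example; they are not the actual blob layout or the actual columns of comment-groups.csv.

```typescript
// Hypothetical shape: group id -> comment id -> precomputed tally.
type GroupTally = { agrees: number; disagrees: number; passes: number };
type GroupVotes = Record<string, Record<string, GroupTally>>;

// Serialize precomputed tallies to CSV, one row per (comment, group).
// No counting happens here, so the numbers match the source exactly.
function groupVotesToCsv(groupVotes: GroupVotes): string {
  const byNumber = (a: string, b: string) => Number(a) - Number(b);
  const groupIds = Object.keys(groupVotes).sort(byNumber);

  // Collect every comment id that appears in any group.
  const commentIds = new Set<string>();
  for (const gid of groupIds) {
    for (const tid of Object.keys(groupVotes[gid])) commentIds.add(tid);
  }

  const lines = ["comment_id,group_id,agrees,disagrees,passes"];
  for (const tid of Array.from(commentIds).sort(byNumber)) {
    for (const gid of groupIds) {
      const t = groupVotes[gid][tid];
      if (!t) continue; // this group has no tally for this comment
      lines.push([tid, gid, t.agrees, t.disagrees, t.passes].join(","));
    }
  }
  return lines.join("\n");
}
```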

So I did write code to count up the number of votes per group based on the PCA-based group assignments that came from Clojure, but I did not reproduce the entire PCA computation process. That would be much harder! :) But still worthwhile, particularly if the hope is to someday convert all the Clojure code to TypeScript to simplify the system.

Maybe Pol.is can do a summer of code internship some time, and have a budding data science-leaning CS student take on the job of recreating that code in TypeScript. It'd be a nice clear task and there is a lot of data they could use to test that their implementation was generating equivalent results.

ballPointPenguin approved these changes Dec 1, 2024
