Added comment-groups.csv export. #1852

samskivert
Collaborator

Colin will be using this data (or some filtered version of it) to pass to an LLM when it wants to summarize things.

The code uses the summarized data from the PCA JSON blob instead of computing things from the raw comments and votes tables. Computing from the raw tables yields numbers that don't exactly match the data shown on the HTML version of the report (our numbers are a little higher, so the Clojure backend is presumably filtering out some votes/voters that we are not).

We want the LLM to see the exact same data that's on the HTML page because it might refer to specific numbers and we want those numbers to be exactly the same as the numbers the user sees.

@jucor
Contributor

jucor commented Dec 2, 2024

@samskivert Do you have code somewhere that does "the latter" (i.e. computes the PCA from the raw comments and votes tables), gets these higher numbers, and compares them with the PCA JSON blob stored in the database by the Clojure code? Maybe we could keep it in a side repo. I'm thinking it could be handy as a point of comparison if/when I need to dive back into the Clojure PCA code. Thanks!

In particular, if I'm understanding correctly, the PCA Clojure code at https://github.com/compdemocracy/polis/blob/edge/math/src/polismath/math/pca.clj has various scalings and tricks (as every custom implementation does, of course), so a separate implementation that doesn't match is super helpful for surfacing those.

@samskivert
Collaborator Author

Alas, I don't have anything that fancy. The database contains a JSON blob that includes the PCA results and various things derived from that, including an assignment of each user to a cluster.

In this case, we want the votes on each comment tallied by group/cluster. I wrote code to compute that from the group/cluster assignments in the PCA blob, but the counts I got were slightly different from the counts that show up on the report page for the conversation. I applied an equivalent "ignore this user if they didn't vote on enough issues" rule, which we know the Clojure code performs, but even with that the numbers were still slightly different.
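For readers following along, here is a rough sketch of the kind of tally described above. It is not the code from this PR: the `Vote` shape, the `1`/`-1`/`0` sign convention, the `groupOf` lookup, and the `minVotes` threshold are all hypothetical names chosen for illustration, and the real schema and filtering rule may differ.

```typescript
// Hypothetical shapes for illustration only; not the actual Polis schema.
// The sign convention (1 = agree, -1 = disagree, 0 = pass) is an assumption.
type Vote = { pid: number; tid: number; vote: -1 | 0 | 1 };
type Tally = { agree: number; disagree: number; pass: number };

// Count votes on each comment (tid) per group, skipping participants who
// have no group assignment or who cast fewer than `minVotes` votes.
function tallyByGroup(
  votes: Vote[],
  groupOf: Record<number, number>, // pid -> group id, e.g. read from the PCA blob
  minVotes: number,
): Record<string, Tally> {
  // First pass: how many votes did each participant cast?
  const votesPerPid: Record<number, number> = {};
  for (const v of votes) {
    votesPerPid[v.pid] = (votesPerPid[v.pid] ?? 0) + 1;
  }

  // Second pass: accumulate tallies keyed by "groupId:commentId".
  const tallies: Record<string, Tally> = {};
  for (const v of votes) {
    const gid = groupOf[v.pid];
    if (gid === undefined) continue; // participant not in any group
    if ((votesPerPid[v.pid] ?? 0) < minVotes) continue; // too few votes
    const key = `${gid}:${v.tid}`;
    const t = tallies[key] ?? { agree: 0, disagree: 0, pass: 0 };
    if (v.vote === 1) t.agree += 1;
    else if (v.vote === -1) t.disagree += 1;
    else t.pass += 1;
    tallies[key] = t;
  }
  return tallies;
}
```

A tally like this only matches the report page if its filtering is identical to whatever the Clojure code does, which is exactly the part that proved hard to reproduce.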

I looked into how the data shown on the report page is computed, and it turns out that this data is also part of the JSON blob that comes from Clojure; it's just reproduced directly in the HTML. So I switched the CSV report to use the same Clojure-originated data that the report page uses. The intention for this report is to give it to an LLM for use in summarizing the contents of the report page, so we want the numbers to match up exactly.
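As an illustration of that approach, the sketch below flattens a precomputed per-group vote summary into CSV rows without recounting anything. The `GroupVotes` shape, its field names, and the column headers are assumptions made for this example; they are not taken from the actual blob or from the export in this PR.

```typescript
// Assumed shape of a per-group vote summary; all field names are hypothetical.
type CommentCounts = { agree: number; disagree: number; pass: number };
type GroupVotes = Record<string, { votes: Record<string, CommentCounts> }>;

// Emit one CSV row per (group, comment) pair, copying the counts as-is so the
// output matches whatever produced the summary.
function groupVotesToCsv(groupVotes: GroupVotes): string {
  const byNumber = (a: string, b: string) => Number(a) - Number(b);
  const lines = ["group_id,comment_id,agrees,disagrees,passes"];
  for (const gid of Object.keys(groupVotes).sort(byNumber)) {
    const votes = groupVotes[gid]?.votes ?? {};
    for (const tid of Object.keys(votes).sort(byNumber)) {
      const c = votes[tid];
      if (c === undefined) continue;
      lines.push([gid, tid, c.agree, c.disagree, c.pass].join(","));
    }
  }
  return lines.join("\n");
}
```

Because the function only copies numbers through, any filtering applied upstream carries over to the CSV unchanged.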

So I did write code to count up the number of votes per group based on the PCA-based group assignments that came from Clojure, but I did not reproduce the entire PCA computation process. That would be much harder! :) But still worthwhile, particularly if the hope is to someday convert all the Clojure code to TypeScript to simplify the system.

Maybe Pol.is can do a summer of code internship some time, and have a budding data science-leaning CS student take on the job of recreating that code in TypeScript. It'd be a nice clear task and there is a lot of data they could use to test that their implementation was generating equivalent results.
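A comparison harness for that kind of port could start out very small. The sketch below diffs two sets of counts keyed by an arbitrary string (for example `"groupId:commentId"`); the `Counts` and `Mismatch` types and the `diffCounts` name are hypothetical, and a real check of PCA output would need numeric tolerances rather than exact equality.

```typescript
// Hypothetical helper types for comparing a reference result with a port.
type Counts = Record<string, number>;
type Mismatch = { key: string; expected: number; actual: number };

// Report every key whose count differs between the two inputs.
// A key missing on one side is treated as a count of zero.
function diffCounts(expected: Counts, actual: Counts): Mismatch[] {
  const keys: Record<string, true> = {};
  for (const k of Object.keys(expected)) keys[k] = true;
  for (const k of Object.keys(actual)) keys[k] = true;

  const mismatches: Mismatch[] = [];
  for (const key of Object.keys(keys).sort()) {
    const e = expected[key] ?? 0;
    const a = actual[key] ?? 0;
    if (e !== a) mismatches.push({ key, expected: e, actual: a });
  }
  return mismatches;
}
```

An empty result means the two implementations agree on every key; a non-empty one points at the specific group/comment pairs to investigate.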
