Output matching fragments in machine-readable format #1065

tsieger · 2023-03-01T10:38:23Z

tsieger
Mar 1, 2023

As an alternative to console output, would it be possible to report matching fragments in a machine-readable format?

My use case is this: I need to transfer the matching fragments to another system. Currently, I use the script tool to save the console output, with its color-coded matching fragments, to a text file, and then parse that file to find where each matching fragment starts and ends. However, this is quite complex and slow.

Would it be possible to report matching fragments in e.g. JSON or CSV form, with each fragment represented by e.g. <starting line number>:<starting char index at line>, <ending line number>:<ending char index at line>?

rien · 2023-03-02T12:14:32Z

rien
Mar 2, 2023
Maintainer

We've had this kind of output in the past, but it was removed because it produced too much output and slowed down the analysis significantly. We now calculate the matching fragments on the fly because that should be fast enough for most of our use cases.

Since you are doing some advanced things with Dolos, I would suggest using the library (@dodona/dolos-lib). Here is an example of how you can use the library to print out the relevant fragments:

// Assumes `files` is an array of paths to the source files to compare.
import { Dolos } from "@dodona/dolos-lib";

const dolos = new Dolos();
const report = await dolos.analyzePaths(files);

for (const pair of report.allPairs()) {
  for (const fragment of pair.buildFragments()) {
    const left = fragment.leftSelection;
    const right = fragment.rightSelection;
    // Print where the fragment is located in both files.
    console.log(`${pair.leftFile.path}:{${left.startRow},${left.startCol} -> ${left.endRow},${left.endCol}} matches with ${pair.rightFile.path}:{${right.startRow},${right.startCol} -> ${right.endRow},${right.endCol}}`);
  }
}

For the full example, visit https://github.com/rien/dolos-lib-example
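If JSON is preferred over console lines, the same selection fields could be serialized along these lines. This is only a minimal sketch: `selectionToRange` and `fragmentsToJson` are hypothetical helper names, not part of the library, and the sample data is hand-written rather than real Dolos output. It only assumes that each fragment exposes `leftSelection` and `rightSelection` with the `startRow`, `startCol`, `endRow` and `endCol` fields used above.

```javascript
// Hypothetical helper: turn one selection into "<row>:<col>" start/end strings.
function selectionToRange(sel) {
  return {
    start: `${sel.startRow}:${sel.startCol}`,
    end: `${sel.endRow}:${sel.endCol}`,
  };
}

// Hypothetical helper: build a JSON document describing all fragments of one pair.
function fragmentsToJson(leftPath, rightPath, fragments) {
  const records = fragments.map((fragment) => ({
    left: { file: leftPath, ...selectionToRange(fragment.leftSelection) },
    right: { file: rightPath, ...selectionToRange(fragment.rightSelection) },
  }));
  return JSON.stringify(records, null, 2);
}

// Usage with hand-written sample data (plain objects, not real Dolos output).
const sample = [
  {
    leftSelection: { startRow: 3, startCol: 0, endRow: 7, endCol: 12 },
    rightSelection: { startRow: 10, startCol: 4, endRow: 14, endCol: 16 },
  },
];
console.log(fragmentsToJson("a.js", "b.js", sample));
```

In a real run, the `fragments` argument would presumably be the result of `pair.buildFragments()`, and the two paths would come from `pair.leftFile.path` and `pair.rightFile.path`.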

Let us know if you need more information.


rien · 2023-03-02T12:15:03Z

rien
Mar 2, 2023
Maintainer

Out of curiosity: which system are you integrating Dolos with? We're currently doing our own integration as well, so we might be able to share some ideas.


tsieger · 2023-03-02T14:22:32Z

tsieger
Mar 2, 2023
Author

Thanks a lot. That was easy ;-). Is there documentation for @dodona/dolos-lib somewhere? (The doc link just points to the https://dolos.ugent.be/ website.) Or is the API perhaps intuitive enough on its own?

I'm integrating Dolos into our internal, proprietary faculty information system, replacing an older plagiarism detection system that is clearly outperformed by Dolos. I have written some shell scripts to preprocess the source files, split them by language, run Dolos on each language separately, make plots using R, pick the most similar pairs and run Dolos on those pairs, and finally capture and parse the color console output to produce a JSON report. I would be happy to share ideas and code, if you like.
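To illustrate the "split by language" step only: my actual scripts are shell, but the idea could be sketched in JavaScript roughly as follows. The helper names `extensionOf` and `groupByExtension` and the sample paths are made up for this example, and the sketch assumes that the file extension is a good enough proxy for the language.

```javascript
// Hypothetical helper: lower-cased extension of a path, or "" if there is none.
function extensionOf(file) {
  const name = file.slice(file.lastIndexOf("/") + 1);
  const dot = name.lastIndexOf(".");
  return dot > 0 ? name.slice(dot + 1).toLowerCase() : "";
}

// Hypothetical helper: group paths by extension so that each group can be
// analyzed in a separate run.
function groupByExtension(files) {
  const groups = new Map();
  for (const file of files) {
    const ext = extensionOf(file);
    if (!groups.has(ext)) groups.set(ext, []);
    groups.get(ext).push(file);
  }
  return groups;
}

// Usage with made-up paths.
const groups = groupByExtension(["hw1/a.py", "hw1/b.py", "hw1/c.java", "hw1/Makefile"]);
for (const [ext, members] of groups) {
  console.log(`${ext || "(none)"}: ${members.join(", ")}`);
}
```

Each group would then be passed to its own Dolos run.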

1 reply

rien Mar 2, 2023
Maintainer

There isn't really a documentation page for the library at the moment, but the API should indeed be more or less intuitive. Get in touch if something is unclear :)

We're planning to integrate Dolos with our own system as well, and to automate as much of the plagiarism detection flow as possible to reduce the load on teachers. If you have any insights or ideas from your experience that would help us, please share :) Our implementation will be open source, so you will be able to see our final result.

tsieger · 2023-03-06T12:16:09Z

tsieger
Mar 6, 2023
Author

OK, let me share some ideas / experience:

I plot the similarity score against the longest fragment to produce a 2D visualization of the similarities between pairs (see e.g. here or here). The purpose of these plots is: a) to study the shape of the distribution of the similarities, to get an intuitive overview of what is OK and which pairs are potential outliers deserving special attention; b) to estimate a reasonable threshold for the similarity score and/or the longest fragment (and to give users visual feedback on their choice of threshold); and c) to instantly identify outlying pairs in plots where the IDs of the specific sources appear.
As part of preprocessing, I limit the size (the number of lines) of the sources to be analyzed to a reasonable extent, to keep Dolos running well (i.e. not crashing or taking ages to complete).
Fig. 6 of the Dolos paper (2021) demonstrates that the default parameters yield almost maximal F1 scores compared to the other tested parameter values, but it is not clear whether some untested choice of parameters would yield even better outcomes. Considering that the tested window sizes were 17, 20, 25, 30, 35 and 40, and the chosen default was 17 (at the border of the parameter space!), it is not evident whether e.g. smaller window sizes would yield a higher F1 score. (Why did you omit smaller window sizes from the evaluation?) So, to make my analysis possibly more sensitive (and also to get smoother pictures thanks to less "granularized" values of the reported similarity measures :-) ), I run Dolos twice: first, I try a much smaller window size (10 or even 5), and, if Dolos crashes (maybe, from version 2.1.0, it will no longer crash?), I fall back and run Dolos again with the default window size. BTW, is there any disadvantage (besides longer runtime and higher memory usage) to using window sizes smaller than the default?
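The two-pass strategy from the last point could be sketched roughly like this. It is a simplified illustration rather than my actual scripts: `analyzeWithFallback` is a hypothetical helper, and `runAnalysis` is a caller-supplied callback (also hypothetical, not part of Dolos) that is assumed to reject when a run crashes, for example when the Dolos process exits with a non-zero status.

```javascript
// Try each window size in order and return the first run that succeeds.
// `runAnalysis` is a hypothetical callback supplied by the caller.
async function analyzeWithFallback(runAnalysis, windowSizes) {
  let lastError;
  for (const windowSize of windowSizes) {
    try {
      const report = await runAnalysis(windowSize);
      return { windowSize, report };
    } catch (error) {
      lastError = error; // remember the failure and try the next size
    }
  }
  throw lastError ?? new Error("no window sizes given");
}

// Stand-in for a real run: pretend that window sizes below 17 always fail.
const fakeRun = async (windowSize) => {
  if (windowSize < 17) throw new Error(`window size ${windowSize} failed`);
  return { windowSize };
};

analyzeWithFallback(fakeRun, [5, 17]).then(({ windowSize }) => {
  console.log(`analysis succeeded with window size ${windowSize}`);
});
```

Listing the sizes from most to least sensitive (e.g. 5, then the default 17) keeps the fallback order explicit and makes it easy to add an intermediate size such as 10.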

I would be grateful for any comments on my approach.

