Shortcomings in the current matching engine and possible improvements #1400

rien (Maintainer) · 2024-03-01T15:12:53Z

This discussion tracks some of the shortcomings of our current engine with the aim of finding improvements for them.

Current shortcomings

Repeating patterns can cause a large number of independent and overlapping matches (see "Handle repeating sequences in a file", #820)
Comments, no-op statements, and other directives that do not influence program flow are included in the AST and might skew some results
There could be "red flags" indicating plagiarism that Dolos doesn't recognize (a repeated typo in variable names or comments, matching fragments of very unconventional or poorly written code, ...)
Dolos struggles with large datasets and cannot handle multiple files per submission
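
To make the first shortcoming concrete, here is a minimal sketch of how a repeating sequence inflates the number of matches. It assumes a simplified matcher that pairs up every identical k-gram of tokens; the functions `kgram_positions` and `count_matches` and the token streams are hypothetical and are not taken from the Dolos code base.

```python
from collections import defaultdict


def kgram_positions(tokens, k):
    """Map each k-gram (as a tuple) to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(tokens) - k + 1):
        index[tuple(tokens[i:i + k])].append(i)
    return index


def count_matches(left, right, k):
    """Count (left_pos, right_pos) pairs that share an identical k-gram."""
    left_index = kgram_positions(left, k)
    right_index = kgram_positions(right, k)
    return sum(
        len(positions) * len(right_index[gram])
        for gram, positions in left_index.items()
        if gram in right_index
    )


# Hypothetical token streams of equal length (50 tokens each):
# one repeats the same statement ten times, the other has no repetition.
repeated = ["call", "(", "arg", ")", ";"] * 10
distinct = [f"t{i}" for i in range(50)]

repeated_matches = count_matches(repeated, repeated, k=5)  # 424
distinct_matches = count_matches(distinct, distinct, k=5)  # 46
```

Both files contain 46 k-grams, but the repeating file yields roughly nine times as many match pairs because every occurrence of a k-gram matches every other occurrence.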

rien (Maintainer, Author) · 2024-03-01T15:31:09Z

Quick calculation of the most similar submissions (and other metrics)

While calculating similarity scores, Dolos spends a lot of time calculating metrics for each pair of submissions, which scales quadratically with the number of submissions.

However, often we're just interested in the top N pairs according to our metrics (similarity, longest fragment, total overlap). Finding a way to dynamically find these "most suspicious" pairs could drastically increase the number of submissions we can handle.

Unfortunately, this would require switching to a lazier or request-based algorithm. We currently calculate almost everything during the analysis. On-demand calculation would require a way to interact with a back-end, or the ability to calculate the metrics on the fly in the front-end.
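
One possible way to get the top N pairs without scoring every pair is an inverted index from fingerprints to submissions, so that only pairs sharing at least one fingerprint are ever counted. The sketch below is an assumption about how this could look, not how Dolos works today; `top_pairs`, the submission names, and the integer fingerprints are all hypothetical, and the score is simply the number of shared fingerprints.

```python
import heapq
from collections import Counter, defaultdict
from itertools import combinations


def top_pairs(fingerprints, n):
    """Return the n pairs sharing the most fingerprints, best first."""
    # Inverted index: fingerprint -> set of submissions containing it.
    index = defaultdict(set)
    for name, prints in fingerprints.items():
        for fp in prints:
            index[fp].add(name)

    # Only pairs that co-occur under at least one fingerprint are counted.
    shared = Counter()
    for names in index.values():
        for pair in combinations(sorted(names), 2):
            shared[pair] += 1

    return heapq.nlargest(n, shared.items(), key=lambda item: item[1])


# Hypothetical fingerprint sets per submission.
submissions = {
    "alice": {1, 2, 3, 4},
    "bob": {2, 3, 4, 5},
    "carol": {9, 10},
    "dave": {4, 9, 10},
}
best = top_pairs(submissions, 2)
```

Pairs without any shared fingerprint (here "alice" and "carol") never enter the counter. A fingerprint that occurs in almost every submission, such as one from template code, would still generate a quadratic number of pairs, so a real implementation would probably need to cap or filter those.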

Using an advanced index structure such as a suffix tree would possibly help in this regard, but brings its own challenges.
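
As an illustration of what such an index could provide, the sketch below finds the longest shared fragment between two token sequences. Note the substitution: it uses a naively built suffix array (sorting all suffixes) instead of a suffix tree, because it fits in a few lines; it is far too slow for real datasets. The function name `longest_common_fragment` and the example tokens are hypothetical.

```python
def longest_common_fragment(left, right):
    """Longest run of tokens occurring in both sequences (naive suffix array)."""
    ids = {}

    def encode(seq):
        # Map tokens to small integers so suffixes can be compared cheaply.
        return [ids.setdefault(tok, len(ids)) for tok in seq]

    # A unique sentinel (-1) separates the sequences, so no match spans both.
    combined = encode(left) + [-1] + encode(right)
    boundary = len(left)

    # Naive suffix array: suffix start positions sorted by suffix content.
    suffixes = sorted(range(len(combined)), key=lambda i: combined[i:])

    best_len, best_start = 0, 0
    for a, b in zip(suffixes, suffixes[1:]):
        # Only neighbours that start in different sequences are interesting.
        if (a < boundary) == (b < boundary):
            continue
        lcp = 0
        while (a + lcp < len(combined) and b + lcp < len(combined)
               and combined[a + lcp] == combined[b + lcp]):
            lcp += 1
        if lcp > best_len:
            best_len, best_start = lcp, min(a, b)  # min() is the start in `left`
    return left[best_start:best_start + best_len]


left = "def f ( x ) : return x + 1".split()
right = "def g ( y ) : return y + 1".split()
fragment = longest_common_fragment(left, right)  # [")", ":", "return"]
```

The longest common fragment always shows up between two neighbouring entries of the sorted suffix list, which is why a single pass over adjacent pairs is enough once the index exists.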

Slower, more advanced calculation of pair edit distance

Once the general metrics are calculated, we often want to inspect a pair of submissions more closely. Since we will only be inspecting one pair at a time, we can afford to use slower algorithms that deliver more exact results, especially if these algorithms can benefit from a precalculated index structure like a suffix tree.

For example: we could compute the edit distance between the two ASTs and, when desired, include the exact changes. This would allow Dolos to list exactly which transformations are needed to convert one submission into the other. We can find inspiration for this in the tool difftastic.
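
A rough sketch of the idea, under a strong simplification: it computes a Levenshtein edit script over flat token sequences rather than a true tree edit distance over ASTs, and it is not the algorithm difftastic uses. The function `edit_script` and the example tokens are hypothetical.

```python
def edit_script(source, target):
    """Return (distance, operations) that turn source tokens into target tokens."""
    rows, cols = len(source) + 1, len(target) + 1
    # cost[i][j] = number of edits needed to turn source[:i] into target[:j].
    cost = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        cost[i][0] = i
    for j in range(cols):
        cost[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = cost[i - 1][j - 1] + (source[i - 1] != target[j - 1])
            cost[i][j] = min(substitution, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Walk back from the bottom-right corner to recover the operations.
    ops, i, j = [], len(source), len(target)
    while i > 0 or j > 0:
        differs = i > 0 and j > 0 and source[i - 1] != target[j - 1]
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + differs:
            if differs:
                ops.append(("replace", source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("delete", source[i - 1], None))
            i -= 1
        else:
            ops.append(("insert", None, target[j - 1]))
            j -= 1
    return cost[-1][-1], ops[::-1]


source = ["total", "=", "a", "+", "b"]
target = ["sum", "=", "a", "+", "b", ";"]
distance, ops = edit_script(source, target)
# distance == 2
# ops == [("replace", "total", "sum"), ("insert", None, ";")]
```

The table costs O(len(source) × len(target)) time and memory, which is acceptable for a single pair inspected on demand but not for all pairs in a dataset.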
