Link checking at course context #173

Open · wants to merge 1 commit into base: master

Conversation


@tuanngocnguyen tuanngocnguyen commented Oct 26, 2022

This is an MR for issue #174.

Testing Instructions

Set up

Refer to the README file:
8cf69da#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5R145

  1. Enable course mode.
  2. Set the listed permissions for the bot account (a sketch of granting them via the Moodle API follows this list):
  • View subjects without participation: moodle/course:view
  • View hidden subjects: moodle/course:viewhiddencourses
  • View hidden sections: moodle/course:viewhiddensections
  • View hidden book chapters: mod/book:viewhiddenchapters
  • View hidden activities: moodle/course:viewhiddenactivities
  • See hidden categories: moodle/category:viewhiddencategories
  3. Set your email in 'Notify course link checker to this email address'.
  4. Make sure "Parallel crawling task" is enabled.
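For reference, a minimal sketch of granting these capabilities from a Moodle CLI script; the role shortname 'linkcheckerbot' and the script path are assumptions for illustration, not part of the patch:

<?php
// Minimal sketch, assuming a dedicated bot role with the (hypothetical)
// shortname 'linkcheckerbot'; grants the capabilities above at system level.
define('CLI_SCRIPT', true);
require(__DIR__ . '/../../config.php'); // Path is an assumption; adjust to your dirroot.

$capabilities = [
    'moodle/course:view',
    'moodle/course:viewhiddencourses',
    'moodle/course:viewhiddensections',
    'mod/book:viewhiddenchapters',
    'moodle/course:viewhiddenactivities',
    'moodle/category:viewhiddencategories',
];
$context = \context_system::instance();
$roleid = $DB->get_field('role', 'id', ['shortname' => 'linkcheckerbot'], MUST_EXIST);
foreach ($capabilities as $capability) {
    assign_capability($capability, CAP_ALLOW, $roleid, $context->id, true);
}
$context->mark_dirty(); // Flush cached capability definitions.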

Run crawling for a course

  1. Go to a course.
  2. Go to the course admin settings.
  3. Click on "Link crawler robot".
  4. Click on 'Run checker'; the job is now added to the course crawling queue (see the sketch after this list).
  5. After a few seconds, refresh the page; you should be able to see the progress.
  6. Once the crawling finishes, you should receive a notification email.
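A minimal sketch of what 'Run checker' might do behind the scenes, assuming the patch queues an ad hoc task per course; the class name \tool_crawler\task\crawl_course is hypothetical:

// Sketch only: queue one ad hoc crawl task for the current course.
$task = new \tool_crawler\task\crawl_course(); // Hypothetical class name.
$task->set_custom_data(['courseid' => $course->id]);
\core\task\manager::queue_adhoc_task($task);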

View report

  1. Once the crawling has finished on a course:
  2. Go back to the "Link crawler robot" page of the course > click on "Crawled Link report".
  3. Or go to course settings > Reports > click on "Crawled Link report".
  4. The 'response' column should show "working/not working" depending on the HTTP code: links with a "200" HTTP code should be shown as "working", while those with any other status should be shown as "Not working" (see the sketch after this list).
  5. The response column can be sorted.
  6. There is a brief description of each HTTP code.
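A minimal sketch of the response rule these instructions describe; the function name is invented, and the real report presumably builds its labels from lang strings:

// Sketch: HTTP 200 is reported as "working", anything else as "Not working".
function tool_crawler_response_label(?string $httpcode): string {
    return $httpcode === '200' ? 'working' : 'Not working';
}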

@dmitriim (Member)

@tuanngocnguyen can you please create an issue first and describe the problem we are trying to solve with this patch. Thanks!

@tuanngocnguyen (Author)

Pending task: clean up and add info to the README file.

@tuanngocnguyen tuanngocnguyen force-pushed the 393847_course_mode_Master branch 13 times, most recently from 0c01682 to 8cf69da on October 27, 2022 02:27
@tuanngocnguyen tuanngocnguyen changed the title from "[DRAFT]: Link checking at course context" to "Link checking at course context" on Oct 27, 2022
// Reference: https://datatracker.ietf.org/doc/html/rfc7231.

/** @var array $httpcodes Map of HTTP codes to short descriptions. */
private static $httpcodes = [
Contributor

I do not think this map should exist; if we show an HTTP code then we should show the HTTP string that the server returns, not make one up based on the code alone.

Author

The client wants a layman's explanation of the HTTP code, so I added some brief info.

Contributor

Wow! OK, today I learnt something major. In HTTP/1.1 you would get back a header something like:

HTTP/1.1 404 Not Found

The code is '404' and the Reason-Phrase is 'Not Found', and we could have captured this and used it directly instead of making it up in this mapping. In HTTP/2 this is dropped completely and there is no reason phrase at all, which is really surprising and will have impacts in lots of places.

So yes, I guess we do need this mapping. In which case, can you make these language strings?
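For illustration only (not part of the patch), a small sketch of how the Reason-Phrase could be parsed from an HTTP/1.1 status line; for an HTTP/2 response the capture would simply come back empty:

// Sketch: split an HTTP/1.1 status line into Code and Reason-Phrase.
$statusline = 'HTTP/1.1 404 Not Found';
if (preg_match('~^HTTP/\S+ (\d{3})\s*(.*)$~', $statusline, $matches)) {
    $code = (int) $matches[1];  // 404
    $reason = $matches[2];      // 'Not Found' (empty for an HTTP/2 response).
}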

Author

Yep, sure, I will change them to lang strings.

Author

Hi @brendanheywood,

I have changed them to lang strings.
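A minimal sketch of what the lang-string lookup might look like; the key format 'httpcode_<code>' and the fallback key are assumptions, not the actual identifiers in the patch:

// Sketch: resolve a human-readable description for an HTTP code.
$key = 'httpcode_' . $httpcode; // E.g. 'httpcode_404'; the key format is assumed.
if (get_string_manager()->string_exists($key, 'tool_crawler')) {
    $description = get_string($key, 'tool_crawler');
} else {
    $description = get_string('httpcode_unknown', 'tool_crawler'); // Assumed fallback key.
}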

LEFT JOIN {tool_crawler_edge} l ON l.b = b.id
LEFT JOIN {tool_crawler_url} a ON l.a = a.id
LEFT JOIN {course} c ON c.id = a.courseid
WHERE b.httpcode != '200' AND c.id = $courseid";
Contributor

Anything which is 1xx, 2xx or 3xx should not be considered broken.
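(A minimal sketch of the classification being suggested here, as an assumption about the eventual fix rather than the patch itself, in contrast to the 200-only rule in the testing instructions above:)

// Sketch: only 4xx and 5xx responses count as broken; 1xx/2xx/3xx do not.
function tool_crawler_code_is_broken(?string $httpcode): bool {
    return $httpcode !== null && (int) $httpcode >= 400;
}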

Author

@tuanngocnguyen tuanngocnguyen Nov 2, 2022

This was actually based on this:
https://github.com/catalyst/moodle-tool_crawler/blob/MOODLE_310_STABLE/classes/robot/crawler.php#L457

Probably this can be fixed in another issue/MR.

Contributor

sure

$courselockfactory = \core\lock\lock_config::get_lock_factory('tool_crawler_process_course_queue');
$courseid = 0;
foreach ($allowedcourses as $id) {
if ($courselock = $courselockfactory->get_lock($id, 0)) {
Contributor

Why are you locking at the course level? You should only need to lock at the level of a specific URL. Locking at the course level will break the parallelism, be substantially slower, and force the crawling to run in serial.

Author

@tuanngocnguyen tuanngocnguyen Nov 2, 2022

The idea is to lock the crawling within a course: the process has to complete in one course before it can move to another one.
Parallelism should still work for URLs within the currently locked course.

Contributor

That doesn't sound like a feature that we'd actually want? If I am crawling one course then the system should be able to both crawl other courses at the same time, and crawl multiple URLs within a course marked for crawl.

But my reading of this code means that you will only ever crawl a single URL within a course at a time, EXCEPT that there is a bug in the code on line 549: if you didn't get a lock, it ignores that, grabs the first course, and then blindly keeps going, when in theory it should have stopped. In general, if you cannot get a lock for a certain thing then you must not keep going and do the thing. This means that the locking is broken for the first course in $allowedcourses.

I do not think there should be any locking at the course level; the only thing it will do is slow down the system overall.
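(For illustration only, a sketch of lock handling that honours a failed acquisition; this assumes course-level locking is kept at all, which the comment above argues against:)

// Sketch: never proceed with a course whose lock could not be acquired.
foreach ($allowedcourses as $id) {
    $courselock = $courselockfactory->get_lock($id, 0);
    if (!$courselock) {
        continue; // No lock: skip this course instead of blindly carrying on.
    }
    try {
        $courseid = $id;
        // ... crawl the queued URLs for this course ...
    } finally {
        $courselock->release();
    }
    break; // Process one locked course per run.
}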

Author

Regarding "both crawl other courses at the same time, and crawl multiple URLs within a course marked for crawl":

Currently we have a configuration option for the number of parallel workers, so what is the reasonable way to allocate them?
E.g. if we have 10 workers and 5 queued courses, should we have 2 or 10 workers for each course? (Is there any limit if we allow 10 for each course?)
Or should we leave the crawling to run arbitrarily?

Contributor

I think in a first version it should just be first in, best dressed, until you find problems to deal with, and then solve them. If you only allow one course to run at a time, then a course with 1000 URLs to crawl blocking another course with 10 URLs from being crawled seems very unfair to me. A course with 1000 URLs may also need multiple crawl passes to progressively discover all 1000 URLs, so it may not be as fast as simply 1000 divided by the number of tasks.

Also, you cannot always know what course a given URL is in until after you have crawled it the first time. I think there is an existing bug where the parsing for this is done in parse_html but the result is never passed back and updated in tool_crawler_url, so I think there will be URLs missed here.

Author

Hi @brendanheywood,

I have removed the course lock.

Unlike site crawling, in course mode the crawler needs to choose a course to crawl (I pick one randomly, as sketched below), otherwise it will not know where to start.

Would you please have another review once you have time? Thanks!
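(A minimal sketch of the random course selection described above; the variable names follow the earlier snippet and are otherwise assumptions:)

// Sketch: pick a random course from the queue to start crawling from.
if (!empty($allowedcourses)) {
    $courseid = $allowedcourses[array_rand($allowedcourses)];
}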
