Link checking at course context #173

Open · wants to merge 1 commit into base: master

Conversation


@tuanngocnguyen tuanngocnguyen commented Oct 26, 2022

This is an MR for issue #174.

Testing Instructions

Set up

Refer to the README file:
8cf69da#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5R145

  1. Enable course mode.
  2. Set the listed permissions for the bot account (a sketch of granting them via the Moodle API follows this list):
  • View subjects without participation: moodle/course:view
  • View hidden subjects: moodle/course:viewhiddencourses
  • View hidden sections: moodle/course:viewhiddensections
  • View hidden book chapters: mod/book:viewhiddenchapters
  • View hidden activities: moodle/course:viewhiddenactivities
  • See hidden categories: moodle/category:viewhiddencategories
  3. Set your email in 'Notify course link checker to this email address'.
  4. Make sure "Parallel crawling task" is enabled.
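For reference, a minimal sketch of granting these capabilities from a Moodle CLI script; the role shortname 'linkcheckerbot' and the script path are assumptions for illustration, not part of the patch:

<?php
// Minimal sketch, assuming a dedicated bot role with the (hypothetical)
// shortname 'linkcheckerbot'; grants the capabilities above at system level.
define('CLI_SCRIPT', true);
require(__DIR__ . '/../../config.php'); // Path is an assumption; adjust to your dirroot.

$capabilities = [
    'moodle/course:view',
    'moodle/course:viewhiddencourses',
    'moodle/course:viewhiddensections',
    'mod/book:viewhiddenchapters',
    'moodle/course:viewhiddenactivities',
    'moodle/category:viewhiddencategories',
];
$context = \context_system::instance();
$roleid = $DB->get_field('role', 'id', ['shortname' => 'linkcheckerbot'], MUST_EXIST);
foreach ($capabilities as $capability) {
    assign_capability($capability, CAP_ALLOW, $roleid, $context->id, true);
}
$context->mark_dirty(); // Flush cached capability definitions.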

Run crawling for a course

  1. Go to a course.
  2. Go to the course admin settings.
  3. Click on "Link crawler robot".
  4. Click on 'Run checker'; the job is now added to the course crawling queue (see the sketch after this list).
  5. After a few seconds, refresh the page; you should be able to see the progress.
  6. Once the crawling finishes, you should receive a notification email.
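A minimal sketch of what 'Run checker' might do behind the scenes, assuming the patch queues an ad hoc task per course; the class name \tool_crawler\task\crawl_course is hypothetical:

// Sketch only: queue one ad hoc crawl task for the current course.
$task = new \tool_crawler\task\crawl_course(); // Hypothetical class name.
$task->set_custom_data(['courseid' => $course->id]);
\core\task\manager::queue_adhoc_task($task);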

View report

  1. Once the crawling has finished on a course:
  2. Go back to the "Link crawler robot" page of the course > click on "Crawled Link report".
  3. Or go to course settings > Reports > click on "Crawled Link report".
  4. The 'response' column should show "working/not working" depending on the HTTP code: links with a "200" HTTP code should be shown as "working", while those with any other status should be shown as "Not working" (see the sketch after this list).
  5. The response column can be sorted.
  6. There is a brief description of each HTTP code.
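A minimal sketch of the response rule these instructions describe; the function name is invented, and the real report presumably builds its labels from lang strings:

// Sketch: HTTP 200 is reported as "working", anything else as "Not working".
function tool_crawler_response_label(?string $httpcode): string {
    return $httpcode === '200' ? 'working' : 'Not working';
}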

@dmitriim (Member)

@tuanngocnguyen can you please create an issue first and describe the problem we are trying to solve with this patch. Thanks!

@tuanngocnguyen (Author)

Pending task: clean up and add info to the README file.

@tuanngocnguyen tuanngocnguyen force-pushed the 393847_course_mode_Master branch 13 times, most recently from 0c01682 to 8cf69da on October 27, 2022 02:27
@tuanngocnguyen tuanngocnguyen changed the title from "[DRAFT]: Link checking at course context" to "Link checking at course context" on Oct 27, 2022
// Reference: https://datatracker.ietf.org/doc/html/rfc7231.

/** @var array $httpcodes Map of HTTP codes to short descriptions. */
private static $httpcodes = [
Contributor

I do not think this map should exist; if we show an HTTP code then we should show the HTTP string that the server returns, not make one up based on the code alone.

Author

The client wants a layman's explanation of the HTTP code, so I added some brief info.

Contributor

Wow! OK, today I learnt something major. In HTTP/1.1 you would get back a header something like:

HTTP/1.1 404 Not Found

The code is '404' and the Reason-Phrase is 'Not Found', and we could have captured this and used it directly instead of making it up in this mapping. In HTTP/2 this is dropped completely and there is no reason phrase at all, which is really surprising and will have impacts in lots of places.

So yes, I guess we do need this mapping. In which case, can you make these language strings?
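For illustration only (not part of the patch), a small sketch of how the Reason-Phrase could be parsed from an HTTP/1.1 status line; for an HTTP/2 response the capture would simply come back empty:

// Sketch: split an HTTP/1.1 status line into Code and Reason-Phrase.
$statusline = 'HTTP/1.1 404 Not Found';
if (preg_match('~^HTTP/\S+ (\d{3})\s*(.*)$~', $statusline, $matches)) {
    $code = (int) $matches[1];  // 404
    $reason = $matches[2];      // 'Not Found' (empty for an HTTP/2 response).
}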

Author

Yep, sure, I will change them to lang strings.

Author

Hi @brendanheywood,

I have changed them to lang strings.
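A minimal sketch of what the lang-string lookup might look like; the key format 'httpcode_<code>' and the fallback key are assumptions, not the actual identifiers in the patch:

// Sketch: resolve a human-readable description for an HTTP code.
$key = 'httpcode_' . $httpcode; // E.g. 'httpcode_404'; the key format is assumed.
if (get_string_manager()->string_exists($key, 'tool_crawler')) {
    $description = get_string($key, 'tool_crawler');
} else {
    $description = get_string('httpcode_unknown', 'tool_crawler'); // Assumed fallback key.
}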

LEFT JOIN {tool_crawler_edge} l ON l.b = b.id
LEFT JOIN {tool_crawler_url} a ON l.a = a.id
LEFT JOIN {course} c ON c.id = a.courseid
WHERE b.httpcode != '200' AND c.id = $courseid";
Contributor

Anything which is 1xx, 2xx or 3xx should not be considered broken.
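(A minimal sketch of the classification being suggested here, as an assumption about the eventual fix rather than the patch itself, in contrast to the 200-only rule in the testing instructions above:)

// Sketch: only 4xx and 5xx responses count as broken; 1xx/2xx/3xx do not.
function tool_crawler_code_is_broken(?string $httpcode): bool {
    return $httpcode !== null && (int) $httpcode >= 400;
}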

Author

@tuanngocnguyen tuanngocnguyen Nov 2, 2022

This was actually based on this:
https://github.com/catalyst/moodle-tool_crawler/blob/MOODLE_310_STABLE/classes/robot/crawler.php#L457

Probably this can be fixed in another issue/MR.

Contributor

sure

$courselockfactory = \core\lock\lock_config::get_lock_factory('tool_crawler_process_course_queue');
$courseid = 0;
foreach ($allowedcourses as $id) {
if ($courselock = $courselockfactory->get_lock($id, 0)) {
Contributor

Why are you locking at the course level? You should only need to lock at the level of a specific URL. Locking at the course level will break the parallelism, be substantially slower, and force the crawling to run in serial.

Author

@tuanngocnguyen tuanngocnguyen Nov 2, 2022

The idea is to lock the crawling within a course: the process has to complete in one course before it can move to another one.
Parallelism should still work for URLs within the currently locked course.

Contributor

That doesn't sound like a feature that we'd actually want? If I am crawling one course then the system should be able to both crawl other courses at the same time, and crawl multiple URLs within a course marked for crawl.

But my reading of this code means that you will only ever crawl a single URL within a course at a time, EXCEPT that there is a bug in the code on line 549: if you didn't get a lock, it ignores that, grabs the first course, and then blindly keeps going, when in theory it should have stopped. In general, if you cannot get a lock for a certain thing then you must not keep going and do the thing. This means that the locking is broken for the first course in $allowedcourses.

I do not think there should be any locking at the course level; the only thing it will do is slow down the system overall.
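(For illustration only, a sketch of lock handling that honours a failed acquisition; this assumes course-level locking is kept at all, which the comment above argues against:)

// Sketch: never proceed with a course whose lock could not be acquired.
foreach ($allowedcourses as $id) {
    $courselock = $courselockfactory->get_lock($id, 0);
    if (!$courselock) {
        continue; // No lock: skip this course instead of blindly carrying on.
    }
    try {
        $courseid = $id;
        // ... crawl the queued URLs for this course ...
    } finally {
        $courselock->release();
    }
    break; // Process one locked course per run.
}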

Author

Regarding "both crawl other courses at the same time, and crawl multiple URLs within a course marked for crawl":

Currently we have a configuration option for the number of parallel workers, so what is the reasonable way to allocate them?
E.g. if we have 10 workers and 5 queued courses, should we have 2 or 10 workers for each course? (Is there any limit if we allow 10 for each course?)
Or should we leave the crawling to run arbitrarily?

Contributor

I think in a first version it should just be first in, best dressed, until you find problems to deal with, and then solve them. If you only allow one course to run at a time, then a course with 1000 URLs to crawl blocking another course with 10 URLs from being crawled seems very unfair to me. A course with 1000 URLs may also need multiple crawl passes to progressively discover all 1000 URLs, so it may not be as fast as simply 1000 divided by the number of tasks.

Also, you cannot always know what course a given URL is in until after you have crawled it the first time. I think there is an existing bug where the parsing for this is done in parse_html but the result is never passed back and updated in tool_crawler_url, so I think there will be URLs missed here.

Author

Hi @brendanheywood,

I have removed the course lock.

Unlike site crawling, in course mode the crawler needs to choose a course to crawl (I pick one randomly, as sketched below), otherwise it will not know where to start.

Would you please have another review once you have time? Thanks!
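(A minimal sketch of the random course selection described above; the variable names follow the earlier snippet and are otherwise assumptions:)

// Sketch: pick a random course from the queue to start crawling from.
if (!empty($allowedcourses)) {
    $courseid = $allowedcourses[array_rand($allowedcourses)];
}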
