Link checking at course context #173
base: master
Conversation
@tuanngocnguyen can you please create an issue first and describe the problem we are trying to solve with this patch. Thanks!
Pending task: clean up and add info to readme file
Force-pushed 0c01682 to 8cf69da
classes/helper.php (outdated)
// Reference: https://datatracker.ietf.org/doc/html/rfc7231.

/** @var array $httpcodes List of HTTP status codes. */
private static $httpcodes = [
I do not think this map should exist. If we show an HTTP code then we should show the HTTP string that the server returns, not make one up based on the code alone.
The client wants a layman's explanation of the HTTP code, so I added some brief info.
Wow! OK, today I learnt something major. In HTTP/1.1 you would get back a header something like:
HTTP/1.1 404 Not Found
The Code is '404' and the Reason-Phrase is 'Not Found', and we could have captured this and used it directly instead of making it up in this mapping. In HTTP/2 this is dropped completely and there is no reason phrase at all, which is really surprising and will have impacts in lots of places.
So yes, I guess we do need this mapping. In that case, can you make these language strings?
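For illustration, a minimal sketch of the language-string approach in Moodle, assuming hypothetical keys like httpcode404 (the keys and wording actually used in the patch may differ):

// lang/en/tool_crawler.php — hypothetical string keys; names are assumptions.
$string['httpcode404'] = 'Not Found: the linked page does not exist.';
$string['httpcode500'] = 'Internal Server Error: the remote server failed.';

// classes/helper.php — look up the layman explanation for a given code.
$explanation = get_string('httpcode' . $httpcode, 'tool_crawler');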
Yep, sure, I will change them to lang strings.
I have changed them to lang strings
LEFT JOIN {tool_crawler_edge} l ON l.b = b.id
LEFT JOIN {tool_crawler_url} a ON l.a = a.id
LEFT JOIN {course} c ON c.id = a.courseid
WHERE b.httpcode != '200' AND c.id = $courseid";
Anything which is 1xx, 2xx or 3xx should not be considered broken.
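For example, the condition could be tightened roughly as follows. This is only a sketch: the FROM clause and SELECT list are inferred from the joins above, and it assumes httpcode is stored as text (hence Moodle's sql_cast_char2int()); it also binds $courseid as a parameter rather than interpolating it into the SQL.

// Hypothetical revision: only 4xx/5xx responses count as broken.
global $DB;
$sql = "SELECT b.*
          FROM {tool_crawler_url} b
     LEFT JOIN {tool_crawler_edge} l ON l.b = b.id
     LEFT JOIN {tool_crawler_url} a ON l.a = a.id
     LEFT JOIN {course} c ON c.id = a.courseid
         WHERE " . $DB->sql_cast_char2int('b.httpcode') . " >= 400
           AND c.id = :courseid";
$broken = $DB->get_records_sql($sql, ['courseid' => $courseid]);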
This was actually based on this:
https://github.com/catalyst/moodle-tool_crawler/blob/MOODLE_310_STABLE/classes/robot/crawler.php#L457
Probably this can be fixed in another issue/MR
sure
classes/robot/crawler.php (outdated)
$courselockfactory = \core\lock\lock_config::get_lock_factory('tool_crawler_process_course_queue');
$courseid = 0;
foreach ($allowedcourses as $id) {
    if ($courselock = $courselockfactory->get_lock($id, 0)) {
Why are you locking at the course level? You should only need to lock at the level of a specific URL. Locking at the course level will break the parallelism, be substantially slower, and force the crawling to run serially.
The idea is to lock the crawling within a course: it has to complete the process in one course before it can go to another one.
Parallelism should still work for URLs within the currently locked course.
That doesn't sound like a feature that we'd actually want? If I am crawling one course then the system should be able to both crawl other courses at the same time and crawl multiple URLs within a course marked for crawl.
But my reading of this code is that you will only ever crawl a single URL within a course at a time, EXCEPT that there is a bug in the code on line 549: if you didn't get a lock, it ignores that, grabs the first course, and blindly keeps going when in theory it should have stopped. In general, if you cannot get a lock for a certain thing then you must not keep going and do the thing. This means that the locking is broken for the first course in $allowedcourses.
I do not think there should be any locking at the course level; the only thing it will do is slow down the system overall.
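For reference, a minimal sketch of the per-URL locking pattern being suggested here; 'tool_crawler_crawl_url', $queuedurls and crawl_url() are illustrative names, not the plugin's actual API. The key point is that a URL is skipped when its lock cannot be obtained, never crawled anyway:

$lockfactory = \core\lock\lock_config::get_lock_factory('tool_crawler_crawl_url');
foreach ($queuedurls as $url) {
    $lock = $lockfactory->get_lock($url->id, 0);
    if (!$lock) {
        // Another worker holds this URL: skip it rather than crawl without the lock.
        continue;
    }
    try {
        crawl_url($url);
    } finally {
        $lock->release();
    }
}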
Regarding "both crawl other courses at the same time, and crawl multiple urls within a course marked for crawl":
Currently we have a configuration for the number of parallel workers, so what is a reasonable way to allocate them?
E.g. if we have 10 workers and 5 queued courses, should we have 2 or 10 workers per course (and is there any limit if we allow 10 per course)?
Or should we leave the crawling to run arbitrarily?
I think in a first version it should just be first in, best dressed, until you find problems to deal with and then solve them. If you only allow one course to run at a time, then a course with 1000 URLs to crawl blocking another course with 10 URLs from being crawled seems very unfair to me. A course with 1000 URLs may also need multiple crawls to progressively discover all 1000 URLs, so it may not be as fast as simply 1000 divided by the number of tasks.
Also, you cannot always know what course a given URL is in until after you have crawled it the first time. I think there is an existing bug where the parsing for this is done in parse_html but the result is never passed back and updated in tool_crawler_url, so there will be URLs missed here I think.
Hi @brendanheywood,
I have removed the course lock.
Unlike site crawling, in course mode the crawler needs to choose a course to crawl (I pick one randomly), otherwise it would not know where to start.
Would you please have another review once you have time? Thanks.
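A minimal sketch of that random pick, assuming $allowedcourses is an array of candidate course ids:

// Pick a random course id from the allowed list as the crawl starting point.
$courseid = $allowedcourses[array_rand($allowedcourses)];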
Force-pushed 8cf69da to 2b22498
Force-pushed 2b22498 to fe36c4e
This is an MR for issue #174.
Testing instructions
Set up
Refer to the readme file:
8cf69da#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5R145
Run crawling for a course
View report