You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I've been doing some observations in the code of Heritrix related to the URI precedence, what led me to the URI cost, and I think that is something I'm missing or there should be some kind of warning for the situation I observed. I've observed that with the default URI precedence policy, CostUriPrecedencePolicy, the precedence of the URI is assigned to the cost of the URI. I guess that the precedence of a URI is the way a URI is prioritized in the assigned queue when a URI is gathered to crawl, and the cost is the necessary "effort" for crawling that URI (used for things like rotate among the different queues). So, in CostUriPrecedencePolicy:
Now, the only way in the code that I've observed that getPrecedence() from class CrawlURI is used is in calculateInsertKey(CrawlURI) from class DbdMultipleWorkQueues. There, the URI precedence is used for building a binary value where some bytes of it is the URI precedence, but only 8 bits, 1 signed byte, from the URI precedence is used (in the code comment it says 1 signed byte, but I'm not sure where the sign bit is being used, and I think it's not being used at all):
// FROM THE COMMENT OF THE FUNCTION: "NOTE: Dangers here are: (1) priorities or precedences over 2^7 (signed byte comparison)"longprecedence = Math.min(curi.getPrecedence(), 127);
On the other hand, when I was observing the code, I was not very sure that class DbdMultipleWorkQueues is the class that is always used for prioritize the URIs in the queues based on they precedence value. Because of this, I looked if the cost was the actual value being used for what I understand that is the precedence value (i.e. the metric used for prioritizing which URIs should be crawled next), and I found that getHolderCost() was only being used in processFinish(CrawlURI) from class WorkQueueFrontier, and with that method name doesn't sound like something that is being used all the time for prioritizing, but in class CrawlURI I fount out:
/** spot for an integer cost to be placed by external facility (frontier). * cost is truncated to 8 bits at times, so should not exceed 255 */protectedintholderCost = UNCALCULATED;
The cost of the URI, according to the comment of the variable, should not exceed 255. Well, the only place where this value is used for something is in processFinish(CrawlURI) from class WorkQueueFrontier:
// ...intholderCost = curi.getHolderCost();
// ...wq.expend(holderCost); // all retries but DEFERRED cost// ...holderCost = 0; // no charge for disregarded URIs// ...wq.expend(holderCost); // successes & failures charge cost to queue
So, in this above code snippet I don't see any reason why the URI cost should not exceed 255, what leads to my guesses (of course, I might be wrong since I haven't dived as deep as might be necessary in the code in order to understand these concepts, that's why I detailed that much all the issue, sorry :/):
The URI cost might exceed 255 without any problem, but that would affect the default, which is not the only one, URI precedence policy (i.e. class CostUriPrecedencePolicy). This makes that the comment of "not exceed 255" is a comment which makes the URI cost being dependent of a specific URI precedence policy. If this is true, is just that the comment, I think, shouldn't be there, but in class CostUriPrecedencePolicy since is the one using the URI cost in order to assign the precedence.
I understand that the value 255 of the comment is making reference to 8 bits, but it seems that it's talking about an unsigned int value, but if this comment is based on the limit of 8 bits from class DbdMultipleWorkQueues, which is what I think, then this is a signed int value. I mean, it seems that the cost might be set with a value higher than 127, but if this is done, then the value 128 would be the value 0 if we use a signed int value, 129 would be -1, and so on... But still I see that Math.min is used in class DbdMultipleWorkQueues with a max. value of 127, so I'm not pretty sure about this point, since it seems that is discarding the sign bit.
I think there should be a warning somewhere if the precedence is higher than the maximum value of the URI precedence value, since if a different cost URI assignment policy is being used (not class UnitCostAssignmentPolicy), if the cost URI policy is the default (i.e. class CostUriPrecedencePolicy), the cost of the cost URI assignment should not exceed 127 if I'm not wrong.
Have I missed something or all the explained is true and some kind of explicit warning is necessary if a cost or precedence exceeds the implicit maximum values? I know, many guesses, it might all be something weird in my head and nothing all of this makes any sense at all. If that's the case, I'm truly sorry :(
If something of this is right, I've implemented a custom cost URI assignment policy, whose maximum cost value is 10000, is this value being clipped to 255 or 127 or -254? :'(
Thank you!
The text was updated successfully, but these errors were encountered:
cgr71ii
added a commit
to cgr71ii/heritrix3
that referenced
this issue
Sep 25, 2022
Originally DbdMultipleWorkQueues used the least significant byte of the holder cost (with positive overflow wrapping) and when precedence was introduced as a separate field it now clips to 127. This probably explains why the comment in CostUriPrecedencePolicy mentions 255 as maximum instead of 127, as that was at one point true.
If something of this is right, I've implemented a custom cost URI assignment policy, whose maximum cost value is 10000, is this value being clipped to 255 or 127 or -254?
My reading of the code is that when the value is used to affect the precedence (i.e. the order of URIs in a single queue) it is clipped to 127. When it's used to charge cost (which affects prioritization between different queues) then the full int value is used without clipping or wrapping.
I think there should be a warning somewhere if the precedence is higher than the maximum value of the URI precedence value
Yes, that does seem like a useful warning to have.
I agree the "should not exceed 255" holderCost comment should probably either be removed or updated to be more accurate (that the value will be clipped to 127 when using CostUriPrecedencePolicy).
Hi!
I've been doing some observations in the code of Heritrix related to the URI precedence, what led me to the URI cost, and I think that is something I'm missing or there should be some kind of warning for the situation I observed. I've observed that with the default URI precedence policy,
CostUriPrecedencePolicy
, the precedence of the URI is assigned to the cost of the URI. I guess that the precedence of a URI is the way a URI is prioritized in the assigned queue when a URI is gathered to crawl, and the cost is the necessary "effort" for crawling that URI (used for things like rotate among the different queues). So, inCostUriPrecedencePolicy
:Now, the only way in the code that I've observed that
getPrecedence()
fromclass CrawlURI
is used is incalculateInsertKey(CrawlURI)
fromclass DbdMultipleWorkQueues
. There, the URI precedence is used for building a binary value where some bytes of it is the URI precedence, but only 8 bits, 1 signed byte, from the URI precedence is used (in the code comment it says 1 signed byte, but I'm not sure where the sign bit is being used, and I think it's not being used at all):On the other hand, when I was observing the code, I was not very sure that
class DbdMultipleWorkQueues
is the class that is always used for prioritize the URIs in the queues based on they precedence value. Because of this, I looked if the cost was the actual value being used for what I understand that is the precedence value (i.e. the metric used for prioritizing which URIs should be crawled next), and I found thatgetHolderCost()
was only being used inprocessFinish(CrawlURI)
fromclass WorkQueueFrontier
, and with that method name doesn't sound like something that is being used all the time for prioritizing, but inclass CrawlURI
I fount out:The cost of the URI, according to the comment of the variable, should not exceed 255. Well, the only place where this value is used for something is in
processFinish(CrawlURI)
fromclass WorkQueueFrontier
:So, in this above code snippet I don't see any reason why the URI cost should not exceed 255, what leads to my guesses (of course, I might be wrong since I haven't dived as deep as might be necessary in the code in order to understand these concepts, that's why I detailed that much all the issue, sorry :/):
class CostUriPrecedencePolicy
). This makes that the comment of "not exceed 255" is a comment which makes the URI cost being dependent of a specific URI precedence policy. If this is true, is just that the comment, I think, shouldn't be there, but inclass CostUriPrecedencePolicy
since is the one using the URI cost in order to assign the precedence.class DbdMultipleWorkQueues
, which is what I think, then this is a signed int value. I mean, it seems that the cost might be set with a value higher than 127, but if this is done, then the value 128 would be the value 0 if we use a signed int value, 129 would be -1, and so on... But still I see thatMath.min
is used inclass DbdMultipleWorkQueues
with a max. value of 127, so I'm not pretty sure about this point, since it seems that is discarding the sign bit.class UnitCostAssignmentPolicy
), if the cost URI policy is the default (i.e.class CostUriPrecedencePolicy
), the cost of the cost URI assignment should not exceed 127 if I'm not wrong.Have I missed something or all the explained is true and some kind of explicit warning is necessary if a cost or precedence exceeds the implicit maximum values? I know, many guesses, it might all be something weird in my head and nothing all of this makes any sense at all. If that's the case, I'm truly sorry :(
If something of this is right, I've implemented a custom cost URI assignment policy, whose maximum cost value is 10000, is this value being clipped to 255 or 127 or -254? :'(
Thank you!
The text was updated successfully, but these errors were encountered: