allocate does not distinguish why it fails #1136
Comments
allocate has several ways it can end (Line 532 in 22d7f46):
return allocateOneBoard
In each case, if an empty list is returned, it needs to be determined whether the job should be destroyed.
The easiest case is when allocate returns an empty list. It should destroy the job with "Bad request". There is zero expectation that trying again will work, so keeping the job makes no sense.
allocateOneBoard should never destroy the board. If this fails, either all boards are full or there is a temporary database issue.
allocate(Connection conn): the caller of allocate() has a try but no catch. That means that if a job request is weird and raises an exception, it will stay in the queue and, if the error repeats, block the queue. For example, DimensionEstimate(int numBoards, Rectangle max) throws an IllegalArgumentException. The suggestion is that allocate(), which has the jobId, also has a try with a catch. The alternative of doing the destroy before each throw requires passing jobId down and leaves the risk of a throw not doing the destroy.
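A minimal sketch of the try-with-catch idea, to show where the destroy would sit. Everything here is hypothetical: GuardedAllocation, Allocator, allocateAll and the destroyJob callback are illustration names, not code from AllocatorTask.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntConsumer;

/** Hypothetical sketch; none of these names are taken from AllocatorTask. */
final class GuardedAllocation {
    /** Stand-in for the real per-job allocation step, which may throw. */
    interface Allocator {
        boolean allocate(int jobId);
    }

    /**
     * Try each queued job in turn. A job whose request throws
     * IllegalArgumentException is handed to destroyJob and skipped,
     * so it cannot be retried forever at the head of the queue.
     *
     * @return the IDs of the jobs that were allocated
     */
    static List<Integer> allocateAll(List<Integer> queuedJobIds,
            Allocator allocator, IntConsumer destroyJob) {
        List<Integer> allocated = new ArrayList<>();
        for (int jobId : queuedJobIds) {
            try {
                if (allocator.allocate(jobId)) {
                    allocated.add(jobId);
                }
            } catch (IllegalArgumentException e) {
                // Bad request: retrying cannot help, so destroy the job
                // here, where the job ID is known.
                destroyJob.accept(jobId);
            }
        }
        return allocated;
    }
}
```

Catching at the level that holds the job ID means no lower-level throw can skip the destroy, which is the risk with destroying before each throw.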
new DimensionEstimate(numBoards, max) does check whether the requested size is too big for the machine, but I disagree with what it does when it finds such a case. Currently it reduces the size to the maximum possible. This is ugly, however, as there is now a job which will block the queue because it needs the whole machine (in one or both dimensions) and which, when it finally returns, is likely to be useless to a user who correctly entered the size they need. Suggestion: raise IllegalArgumentException AND somewhere destroy the job!
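As a rough illustration of rejecting instead of clamping: the class below is a hypothetical stand-in, not the real DimensionEstimate, and it takes an explicit width and height rather than numBoards and a Rectangle.

```java
/** Hypothetical sketch; not the real DimensionEstimate class. */
final class StrictDimensions {
    final int width;
    final int height;

    /**
     * @throws IllegalArgumentException if the request is not positive,
     *         or is larger than the machine in either dimension
     */
    StrictDimensions(int width, int height, int maxWidth, int maxHeight) {
        if (width < 1 || height < 1) {
            throw new IllegalArgumentException(
                    "dimensions must be positive: " + width + "x" + height);
        }
        if (width > maxWidth || height > maxHeight) {
            // Reject rather than shrink: a smaller allocation is unlikely
            // to be what the user asked for.
            throw new IllegalArgumentException(
                    "request " + width + "x" + height + " exceeds machine "
                    + maxWidth + "x" + maxHeight);
        }
        this.width = width;
        this.height = height;
    }
}
```

The exception still has to be caught somewhere that knows the job ID so that the job is destroyed rather than left in the queue.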
The boards table has a generated column "may_be_allocated", which is used to find a set of boards not used by other jobs.
This is needed when finding boards to allocate. What will be needed is a second generated column, "could_be_allocated", which includes the near-permanent conditions "board_num IS NOT NULL" AND "(functioning IS NULL OR functioning != 0)" but excludes the temporary condition "allocated_job IS NULL". Why?
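To make the difference between the two columns concrete, here is a small sketch of the two conditions as Java predicates. It assumes that may_be_allocated is exactly the conjunction of the three quoted conditions; BoardFlags is a hypothetical class, not part of the allocation server.

```java
/** Hypothetical mirror of the three board columns quoted above. */
final class BoardFlags {
    final Integer boardNum;     // null when the board has no number
    final Integer functioning;  // null or non-zero counts as usable
    final Integer allocatedJob; // null when no job holds the board

    BoardFlags(Integer boardNum, Integer functioning, Integer allocatedJob) {
        this.boardNum = boardNum;
        this.functioning = functioning;
        this.allocatedJob = allocatedJob;
    }

    /** Near-permanent conditions only: the board is usable in principle. */
    boolean couldBeAllocated() {
        return boardNum != null
                && (functioning == null || functioning != 0);
    }

    /** Usable in principle and not currently held by another job. */
    boolean mayBeAllocated() {
        return couldBeAllocated() && allocatedJob == null;
    }
}
```

A board that is busy but healthy passes couldBeAllocated() and fails mayBeAllocated(), which is the case the existing column cannot tell apart from a dead board.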
All of the allocate methods eventually call setAllocation (Line 1203 in 22d7f46).
This method starts by running getConnectedBoards. While not expected, this could end in an empty list, most likely due to a change in the state of a board. On an empty list there are three options.
My opinion is that options 1 or 2 are enough. Option 2 without a count of tries (as in master) gives Murphy the tools for a hang.
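If one of the options is to try again, a count of tries could look roughly like the sketch below. BoundedRetry and firstNonEmpty are hypothetical names, and the supplier merely stands in for a query such as getConnectedBoards.

```java
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical sketch of a retry that cannot spin forever. */
final class BoundedRetry {
    /**
     * Run the query until it yields a non-empty list, at most maxTries
     * times.
     *
     * @return the first non-empty result, or an empty list if every try
     *         came back empty
     */
    static <T> List<T> firstNonEmpty(Supplier<List<T>> query, int maxTries) {
        for (int attempt = 0; attempt < maxTries; attempt++) {
            List<T> result = query.get();
            if (!result.isEmpty()) {
                return result;
            }
        }
        // Out of tries: report failure so the caller can choose another
        // option instead of hanging.
        return List.of();
    }
}
```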
allocateDimensions has another reason it could return an empty list: its search query, find_rectangle.sql, could return an empty list. This could have more than one cause, and we need to distinguish between A and B. My suggestion is to create a version of find_rectangle.sql which uses a new "could_be_allocated" virtual column. Note: allocateDimensions is called by any job with n_boards > 1, so these could be user jobs, which should be allowed to stay in the queue and eventually block (as long as they are not impossible, as above).
allocateTriadsAt uses find_rectangle_at.sql; both also use may_be_allocated, so, like find_rectangle.sql, it should be rerun with "could_be_allocated" to see if it makes sense to keep the job (request) and allow it to block. Note: allocateTriadsAt only makes sense for admins and tests, so an alternative here is to try to allocate once and, if unsuccessful, destroy the job.
AllocateDimensions uses a combination of getRectangles; both also use may_be_allocated, so, like find_rectangle.sql, it should be rerun with "could_be_allocated" to see if it makes sense to keep the job (request) and allow it to block.
allocateBoard uses findSpecificBoard, which also uses may_be_allocated, so, like find_rectangle.sql, it should be rerun with "could_be_allocated" to see if it makes sense to keep the job (request) and allow it to block. Note: allocateBoard only makes sense for admins and tests, so an alternative here is to try to allocate once and, if unsuccessful, destroy the job.
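The rerun idea shared by these comments could be sketched as a two-step decision. All names are hypothetical: the two suppliers stand in for running a search with may_be_allocated and then with a "could_be_allocated" column.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch; the names below do not exist in AllocatorTask. */
final class EmptyResultTriage {
    enum Decision { ALLOCATE, KEEP_QUEUED, DESTROY }

    /**
     * @param fitsFreeBoards   true if the search over currently free
     *                         boards (may_be_allocated) finds a place
     * @param fitsUsableBoards true if the search over all usable boards
     *                         (could_be_allocated) finds a place
     */
    static Decision decide(BooleanSupplier fitsFreeBoards,
            BooleanSupplier fitsUsableBoards) {
        if (fitsFreeBoards.getAsBoolean()) {
            return Decision.ALLOCATE;
        }
        // Only run the second search when the first one found nothing.
        if (fitsUsableBoards.getAsBoolean()) {
            // The machine is merely busy: keep the job, let it block.
            return Decision.KEEP_QUEUED;
        }
        // Impossible even on an empty machine: waiting cannot help.
        return Decision.DESTROY;
    }
}
```

For the admin-only and test-only paths, the alternative suggested above would skip the second search and go straight from a failed first search to destroying the job.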
While it is true that the method called by the timer, allocate() (Line 294 in 22d7f46), is synchronized, should we worry about what happens if there is a database error during allocate? The most worrying is an error during setAllocation (Line 1203 in 22d7f46).
This could leave some boards allocated while the job never leaves the queued state, so it remains in the queue and remains possible as soon as the board loses the allocated marker. With the job remaining queued there is no cleanup (that I know of), so the boards remain allocated. My suggestion is that there is a task that checks for database weirdness and is run occasionally (Line 650 in 22d7f46).
I also think it is required that all jobs eventually are either allocated, destroyed or become queue blockers. So giving a job a priority of zero (so its importance does not increase), or continuously (without a count break) setting its importance back to zero, must be avoided.
It is worth noting that all work with the database is transactional, i.e. it shouldn't be possible to leave the database in a poor state because of a failed query / update.
Summary of work to be done:
With those fixes, we can then further investigate whether the priority / importance route is working well enough. One suggestion there is to make all priorities 1 regardless of size, so the importance doesn't grow too much too quickly, while still ensuring that a job that has been waiting a while gets allocated first when all resources for it have been freed. The issue there is how many other jobs will jump in front of it between allocations, but that is likely not a big problem at this point. There are likely other possible solutions to this, such as having board priorities for allocation, where a set of boards needed to allocate something on the queue are "ear-marked" and other jobs then avoid them for a bit unless absolutely required (probably just an order by sum of some field would then do this). This is left for future work though.
As for the queue, the only way to absolutely prevent a single job (checked to be possible on an empty machine) from hanging forever is to eventually allow it to block the queue. All playing with priority is just minor timing. Much longer term, the way to treat blocker jobs may be to pre-allocate them to boards even though these boards are already in use.
Pre-allocation could work to some extent; it might then be easy enough to have a count of pre-allocated jobs per board to avoid having to kill jobs when there are too many, and then order the result of allocation for an unallocated job by that value, as well as the last-used-date of the boards to make them move around a bit in general. That way the next boards to be used should be the ones that are less pre-allocated, so the job should then move through the queue more quickly. Nice idea overall!
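The ordering described here might look something like the sketch below. It is only an illustration: Candidate and its fields are hypothetical, and it assumes that "move around" means preferring the board that was used longest ago.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of ordering candidate boards for allocation. */
final class BoardOrdering {
    static final class Candidate {
        final String name;
        final int preAllocatedJobs;     // jobs pre-allocated to this board
        final long lastUsedEpochSecond; // when the board was last used

        Candidate(String name, int preAllocatedJobs,
                long lastUsedEpochSecond) {
            this.name = name;
            this.preAllocatedJobs = preAllocatedJobs;
            this.lastUsedEpochSecond = lastUsedEpochSecond;
        }
    }

    /**
     * Least pre-allocated boards first; ties go to the board that was
     * used longest ago. The input list is left unchanged.
     */
    static List<Candidate> inPreferenceOrder(List<Candidate> boards) {
        List<Candidate> sorted = new ArrayList<>(boards);
        sorted.sort(Comparator
                .comparingInt((Candidate c) -> c.preAllocatedJobs)
                .thenComparingLong(c -> c.lastUsedEpochSecond));
        return sorted;
    }
}
```

In the real server this would presumably be an ORDER BY in the allocation query rather than a sort in Java.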
Full path to where you find the problem:
JavaSpiNNaker/SpiNNaker-allocserv/src/main/java/uk/ac/manchester/spinnaker/alloc/allocator/AllocatorTask.java
Line 532 in 22d7f46
What is the problem you see?
#1042
What do you think it should do instead?
short term: #1135
longer term.
distinguish between
A. Wrong values passed into allocate
JavaSpiNNaker/SpiNNaker-allocserv/src/main/java/uk/ac/manchester/spinnaker/alloc/allocator/AllocatorTask.java
Line 571 in 22d7f46
B. TODO, but will include things like too many boards, requesting dead board(s), requesting invalid board(s)
A. The machine is very busy
B. The job is very big
C. The job requests specific board(s)