Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

allocate does not distinguish why it fails #1136

Open
Christian-B opened this issue Feb 28, 2024 · 17 comments
Open

allocate does not distinguish why it fails #1136

Christian-B opened this issue Feb 28, 2024 · 17 comments

Comments

@Christian-B
Copy link
Member

Christian-B commented Feb 28, 2024

Full path to where you find the problem:

What is the problem you see?
#1042

What do you think it should do instead?
short term: #1135

longer term.
distinguish between

  1. allocations that are bad IE will never work
    A. Wrong values passed into allocate
    B. todo but will include things like too many boards, request a dead board(s), reguest an invalid board(s),
  2. allocations that started but then failed for a weird reason that may be recoverable
  3. allocations that failed because the boards requested are not available
    A. The machine is very busy
    B. The job is very big
    C. The job requests specific board(s)
@Christian-B
Copy link
Member Author

Christian-B commented Feb 28, 2024

allocate has several ways it can end

return allocateOneBoard
return allocateDimensions
return allocateTriadsAt
return allocateBoard(
return List.of()

In each case if an empty list is returned it needs to be determined if the job needs to be destroyed.

@Christian-B
Copy link
Member Author

Christian-B commented Feb 28, 2024

The easiest is when allocate returns an empty list.

It should destroy the job "Bad request"

There is zero expectation that trying again will work so keeping the jobs make no sense.

@Christian-B
Copy link
Member Author

allocateOneBoard should never destroy the board.

if this fails either all boards are full or there is a temporary database issue.
In both cases keeping it in the requests queue makes sense

@Christian-B
Copy link
Member Author

allocate(Connection conn) the caller of allocate() has a try but no catch.

What that means is that if a job request is weird and raises an exception it will stay in the queue and if the error repeats block the queue.

For example DimensionEstimate(int numBoards, Rectangle max) throws an IllegalArgumentException

The suggestion is that allocate() which has the jobId also has a try with catch.
On exceptions like IllegalArgumentException it should destroy the job.

the alternative of doing the destroy before throws requires passing jobId down and leave the risk of a throw not doing the destroy.

@Christian-B
Copy link
Member Author

new DimensionEstimate(numBoards, max) does check if the requested size is too big for the machine but I disagree with what it does when it finds such a case.

Currently it will reduce the size to the max possible.

This is however ugly as there is now have a job which will block the queue as it needs the whole machine (in 1 or both dimensions) and when finally returns is likely to be useless to a user who correctly entered the size they need.

Suggested raise IllegalArgumentException AND somewhere destroy the job!

@Christian-B
Copy link
Member Author

The boards table has a generated column "may_be_allocated" which is used to find a set of boards not used by other jobs.

may_be_allocated INTEGER GENERATED ALWAYS AS ( -- generated COLUMN

This is needed when finding boards to allocate.

What will be needed is a second generated column "could_be_allocated" which includes the near permanent conditions "board_num IS NOT NULL" AND " (functioning IS NULL OR functioning != 0)" but excludes the temporary condition "allocated_job IS NULL"

Why?
This would allow there to be secondary queries based on "could_be_allocated" which would check if the query could ever return a result once all allocated_jobs have ended.

@Christian-B
Copy link
Member Author

Christian-B commented Feb 28, 2024

All of the allocate methods eventually call setAllocation

private Collection<BMPAndMachine> setAllocation(AllocSQL sql, int jobId,

This method starts by running getConnectedBoards
https://github.com/SpiNNakerManchester/JavaSpiNNaker/blob/22d7f46d259a40fe84f2809275cc3df8463713de/SpiNNaker-allocserv/src/main/resources/queries/connected_boards_at_coords.sql

While not expected this could end in an empty list most likely due to change in the state of a board.

On an empty list there are three options.

  1. Destroy the job as this is rare and safest
  2. Leave the job in state queued to try allocation again next time. Should only be done if there is a count of number of tries. (can be the same count as queued -> power on -> error -> queued
  3. Run an adapted version of the query to check if it could have worked without state changes. Ideally also with the count.

My opinion is that options 1 or 2 are enough.

Option 2 without a count of tries (as in master) is giving Murphy the tools for a hang.

@Christian-B
Copy link
Member Author

Christian-B commented Feb 28, 2024

allocateDimensions has another reason it could return an empty list.

Its search query find_rectangle.sql could return an empty list
see https://github.com/SpiNNakerManchester/JavaSpiNNaker/blob/22d7f46d259a40fe84f2809275cc3df8463713de/SpiNNaker-allocserv/src/main/resources/queries/find_rectangle.sql

This could be caused by
A, There are too many already allocated boards
B. The query is impossible even on an machine with all possible board available

We need to distinguish between A and B
A. Should be allowed to stay in the queue and block other jobs until boards are freed.
B. This should cause the job to be destroyed

My suggestion is to create a find_rectangle.sql which uses a new "could_be_allocated" virtual column.
If that returns data A is likely if it fails consider B.

Note: allocateDimensions is called by any job with n_boards > 1 so could be user jobs which should be allowed to stay in the queue and eventually block. (as long as they are not impossible as above)

@Christian-B
Copy link
Member Author

Christian-B commented Feb 28, 2024

allocateTriadsAt uses find_rectangle_at.sql
https://github.com/SpiNNakerManchester/JavaSpiNNaker/blob/22d7f46d259a40fe84f2809275cc3df8463713de/SpiNNaker-allocserv/src/main/resources/queries/find_rectangle_at.sql
and then connectedSize
https://github.com/SpiNNakerManchester/JavaSpiNNaker/blob/22d7f46d259a40fe84f2809275cc3df8463713de/SpiNNaker-allocserv/src/main/resources/queries/allocation_connected.sql

both also use may_be_allocated so like find_rectangle.sql should be rerun with "could_be_allocated" to see if t makes sends to keep the job(reguest) and allow it to block.

Note: allocateTriadsAt only makes sense for admins and tests so an alternative here is try to allocate once and if unsuccessful destory the job.

@Christian-B
Copy link
Member Author

AllocateDimensions uses a combination of getRectangles
https://github.com/SpiNNakerManchester/JavaSpiNNaker/blob/22d7f46d259a40fe84f2809275cc3df8463713de/SpiNNaker-allocserv/src/main/resources/queries/find_rectangle.sql
and then aloop with connectedSize
https://github.com/SpiNNakerManchester/JavaSpiNNaker/blob/22d7f46d259a40fe84f2809275cc3df8463713de/SpiNNaker-allocserv/src/main/resources/queries/allocation_connected.sql

both also use may_be_allocated so like find_rectangle.sql should be rerun with "could_be_allocated" to see if t makes sends to keep the job(request) and allow it to block.

@Christian-B
Copy link
Member Author

Christian-B commented Feb 28, 2024

allocateBoard uses findSpecificBoard
https://github.com/SpiNNakerManchester/JavaSpiNNaker/blob/22d7f46d259a40fe84f2809275cc3df8463713de/SpiNNaker-allocserv/src/main/resources/queries/find_location.sql

also uses may_be_allocated so like find_rectangle.sql should be rerun with "could_be_allocated" to see if t makes sends to keep the job(request) and allow it to block.

Note: allocateBoard only makes sense for admins and tests so an alternative here is try to allocate once and if unsuccessful destory the job.

@Christian-B
Copy link
Member Author

While it is true that the method call by the timer allocate()


is synchronized should we worry about what happens if there is a database error during allocate.

The most worring is an error during setAllocation

private Collection<BMPAndMachine> setAllocation(AllocSQL sql, int jobId,

This could leave some boards allocated yet the never leaving the queued state and so remaining in the queue and remaining possible as soon as the board loses the allocated marker.

with the job remaining queued there is no cleanup (that I know of) so the boards remain allocated.
Enough boards left allocated and a big or specific enough job request and the queue hangs.

my suggestion is that there is a task of checking for database weirdness which is run occasionally.
One possible moment to run such a task if when a job reaches a level of importance that it blocks other jobs.

@Christian-B
Copy link
Member Author

I also think it is required that all jobs eventually are either allocated, destroyed or become queue blockers.
Otherwise they could hang forever.

So giving a job a priority of zero (so importance does not increase) or continuously (without a count break) of setting their importance back (to zero) must be avoided .

@rowleya
Copy link
Member

rowleya commented Feb 29, 2024

It is worth noting that all work with the database is transactional i.e. it shouldn't be possible to leave the database in a poor state because of a failed query / update.

@rowleya
Copy link
Member

rowleya commented Feb 29, 2024

Summary of work to be done:

  1. Remove code that deletes the job_queue entry that happens when the job is allocated. This is an error that shouldn't happen.
  2. Add a fail count to the job / job_queue as appropriate. When a job returns to queue state, this should be incremented. When the value reaches a specified maximum (3 seems reasonable), set the job to destroyed with reason something like "Repeated errors allocating job. Please try again or report to ...".
  3. Find a way to query if a job is possible at all i.e. if it is too big or is forced to use boards that are permanently down. If that is possible, when a job tries this, destroy it immediately with "Job request impossible with current machines".

With those fixes, we can then further investigate if the priority / importance route is working well enough. One suggestion there is to make all priorities 1 regardless of size, so the importance doesn't grow too much too quickly, but still ensures that a job that has been waiting a while still gets allocated first when all resources for it have been freed. The issue there comes as to how much other jobs will jump in front of it between allocations, but that is likely not a big problem at this point. There are likely other possible solutions to this, such as having board priorities for allocation, where a set of boards needed to allocate something on the queue are "ear-marked" and then other jobs can avoid them for a bit unless absolutely required (probably just an order by sum of some field would then do this). This is left for future work though.

@Christian-B
Copy link
Member Author

  1. Is this not part of Job fail on alloc failed #1135
  2. The count is also needed for the queued -> pending -> error -> queued jobs. . It is more urgent so should not await resolving this issue
  3. The plan was only to check a job is impossible after the first fail to allocate() Where allocate returns an empty list of bmp's and the job is never set pending.

As for the queue the only way to absolutely prevent a single job (checked to be possible on an empty machine) from hanging forever is to eventually allow it to block the queue. All playing with priority is just minor timing.

Much longer term the way to treat blocker job may be to preallocate them to boards even though these boards are already in use.
Then mark these boards as allocated to job X reserved by job Y
Allow that job to have a high importance score but keep it just below blocking.
Other jobs may not use the reserved board (even if currently not allocated)
If a third job arrives wanting the same board(s) or very large number of boards destroy the third job ("currently too busy")
The preallocated job should be allowed to try any combination of free and its reserved boards.

@rowleya
Copy link
Member

rowleya commented Feb 29, 2024

  1. Yes that could be merged I guess; it will fix the issue and allow for no-queuing if desired
  2. Yes, that is what I meant; only when a job goes back to queue should this be incremented. That can be a new PR on top of 1.
  3. OK, that sounds good. A thought was if a job fails to allocate (i.e. allocate returns empty list) to try to allocate ignoring existing jobs. If that fails, it must be impossible.

Pre-allocation could work to some extent; it might then be easy enough to have a count of pre-allocated jobs per board to avoid having to kill jobs when there are too many, and then order the result of allocation for an unallocated job by that value, as well as the last-used-date of the boards to make them move around a bit in general. That way the next boards to be used should be the ones that are less pre-allocated, so the job should then move through the queue more quickly. Nice idea overall!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants