This repository has been archived by the owner on Aug 3, 2023. It is now read-only.

Keep track of failing boards and don't use them to allocate #26

Open
rowleya opened this issue Nov 16, 2017 · 5 comments

@rowleya
Member

rowleya commented Nov 16, 2017

Currently, if a board is failing, it will keep being assigned to other users, where it may continue to be a source of failure. The server should keep a "blacklist" of boards that should be avoided. How the blacklist is updated is open to debate: it could count failures and blacklist a board once a threshold is reached, it could periodically probe the boards, or it could combine the two.
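
To make the failure-counting option concrete, here is a minimal sketch. It is not spalloc-server code: the `BoardBlacklist` class, its method names, the default threshold of 3, and the use of `(x, y, z)` tuples as board identifiers are all assumptions made for illustration. It also assumes that a blacklisted board stays blacklisted until someone clears it explicitly.

```python
class BoardBlacklist:
    """Hypothetical failure-count blacklist (illustration only)."""

    def __init__(self, threshold=3):
        # Number of consecutive failures before a board is blacklisted
        # (the value 3 is an arbitrary assumption).
        self.threshold = threshold
        self._failures = {}        # board id -> consecutive failure count
        self._blacklisted = set()  # board ids that must not be allocated

    def record_failure(self, board):
        """Count a failure; blacklist the board once the threshold is hit."""
        count = self._failures.get(board, 0) + 1
        self._failures[board] = count
        if count >= self.threshold:
            self._blacklisted.add(board)

    def record_success(self, board):
        """Reset the failure count; does not un-blacklist the board."""
        self._failures.pop(board, None)

    def clear(self, board):
        """Remove a board from the blacklist, e.g. after a manual check."""
        self._failures.pop(board, None)
        self._blacklisted.discard(board)

    def is_blacklisted(self, board):
        return board in self._blacklisted

    def usable(self, boards):
        """Filter candidate boards down to those not blacklisted."""
        return [b for b in boards if b not in self._blacklisted]
```

An allocator could pass its candidate boards through `usable()` before choosing, and a periodic probe could feed `record_failure()` and `record_success()` from the same object, which would give the combined approach.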

@alan-stokes
Contributor

How would you detect a failure in this case?

@rowleya
Member Author

rowleya commented Nov 16, 2017

OK, so when you power on a board, a number of things can happen:

  • The board doesn't respond - this indicates a BMP failure.
  • The board doesn't get the correct FPGA ids - this could mean that the flash of the board is broken.

We can detect these things quite easily during a power-on command. A periodic test would therefore have to power the boards on periodically.
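
The two outcomes above could be told apart roughly as follows. This is only a sketch: `power_on` and `read_fpga_ids` are hypothetical callables standing in for whatever the server uses to talk to the BMP, and treating an unresponsive BMP as a `TimeoutError` is an assumption, not the real interface.

```python
from enum import Enum


class PowerOnResult(Enum):
    OK = "ok"
    BMP_UNRESPONSIVE = "bmp_unresponsive"  # board did not respond
    BAD_FPGA_ID = "bad_fpga_id"            # possible broken flash


def check_power_on(power_on, read_fpga_ids, expected_ids):
    """Classify the outcome of powering on one board.

    power_on and read_fpga_ids are hypothetical callables; both are
    assumed to raise TimeoutError if the BMP does not answer.
    """
    try:
        power_on()
        ids = list(read_fpga_ids())
    except TimeoutError:
        return PowerOnResult.BMP_UNRESPONSIVE
    if ids != list(expected_ids):
        return PowerOnResult.BAD_FPGA_ID
    return PowerOnResult.OK
```

Any result other than `PowerOnResult.OK` would be the signal to stop allocating that board.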

@alan-stokes
Contributor

OK, so we're not thinking about SDRAMs, new dead cores, etc.?

@rowleya
Member Author

rowleya commented Nov 16, 2017

No - to be clear, this is just to stop repeated attempts to allocate the same board from producing the same server error each time (as would currently happen if there is a board error). As spalloc-server only talks to the BMP, that is enough. This can detect transient errors, i.e. things that can be fixed by manual intervention (either someone pressing the reset button or re-flashing a board).

An extension of this is to run Luis's tests periodically as well to ensure the boards are tested, but I don't believe this is necessary.

@Christian-B
Member

It should cover anything where the spalloc server should not allocate the board again until at least a manual check.

lplana added the bug label Jul 17, 2018
dkfellows added this to the 5.1.0 milestone Aug 12, 2019
dkfellows modified the milestones: 5.1.0, 6.0.0 Nov 25, 2019
dkfellows modified the milestones: 6.0.0, 7.0.0 Apr 12, 2021