Update doc for server arguments #2742
Conversation
I love the detailed and educational docs of the parameters. Two suggestions:
- We are documenting the official usage, so we can move the educational part to other unofficial repos, like my ML sys tutorial. 😂
- Keep things concise. If we want to explain a concept, one sentence of educational explanation plus a link to the details would be better.
</details>

## Model and tokenizer
Cool. But for the docs, always keep one first-order title `#` and several second-order titles `##`; do not use fourth-order titles `####`.
Adjusted to include Server Arguments title.
Perfect
docs/backend/server_arguments.md
Outdated
|
||
* `tp_size`: This parameter is important if we have multiple GPUs and our model doesn't fit on a single GPU. *Tensor parallelism* means we distribute our model weights over multiple GPUs. Note that his technique is mainly aimed at *memory efficency* and not at a *higher throughput* as there is inter GPU communication needed to obtain the final output of each layer. For better understanding of the concept you may look for example [here](https://pytorch.org/tutorials/intermediate/TP_tutorial.html#how-tensor-parallel-works). | ||
|
||
* `stream_interval`: If we stream the output to the user this parameter determines at which interval we perform streaming. The interval length is measured in tokens. |
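For illustration, a minimal launch sketch using these two flags. Treat it as an assumption-laden example, not an official invocation: the flag spellings (`--tp-size`, `--stream-interval`) are inferred from the parameter names above, and the model path is a placeholder.

```
# Hypothetical sketch: shard the weights over 2 GPUs and flush streamed
# output to the client every 4 tokens. Flag names are inferred from the
# parameters documented above; the model path is a placeholder.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --tp-size 2 \
  --stream-interval 4
```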
I am not so sure about this one. Could you double-check it and make it clearer?
I will look into this more carefully. For now I have left it as a TODO and will come back to it at the end.
Cool!
docs/backend/server_arguments.md
Outdated
* `random_seed`: Can be used to enforce deterministic behavior.

* `constrained_json_whitespace_pattern`: When using the `Outlines` grammar backend, this can be used to allow JSON with syntactic newlines, tabs, or multiple spaces. (See the sketch below for an example value.)
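As a sketch of how these two arguments might be combined: the flag spellings (`--random-seed`, `--grammar-backend`, `--constrained-json-whitespace-pattern`) are inferred from the parameter names, and both the backend selector and the regex value are illustrative assumptions.

```
# Hypothetical sketch: fixed seed for reproducibility, Outlines grammar
# backend, and a relaxed JSON whitespace pattern (illustrative regex).
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --random-seed 42 \
  --grammar-backend outlines \
  --constrained-json-whitespace-pattern "[\n\t ]*"
```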
I think we can create a `##` section for constrained decoding parameters.
In general I think we could restructure the whole section. I suggest doing that after I have included all the parameters.
Perfect! Thanks so much for the help!
docs/backend/server_arguments.md
Outdated
* `dist_init_addr`: The TCP address used for initializing PyTorch’s distributed backend (e.g. `192.168.0.2:25000`).
* `nnodes`: Total number of nodes in the cluster.
* `node_rank`: Rank (ID) of this node among the `nnodes` in the distributed setup. (A two-node launch sketch follows this list.)
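A hedged sketch of how the three arguments fit together in a two-node deployment. The flag spellings are inferred from the parameter names and the address is a placeholder; the second node would run the same command with `--node-rank 1`.

```
# Hypothetical sketch: node 0 of a 2-node cluster. 192.168.0.2:25000 is a
# placeholder and must be reachable from every node; flag names are
# inferred from the parameters documented above.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --dist-init-addr 192.168.0.2:25000 \
  --nnodes 2 \
  --node-rank 0
```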
## Model override args in JSON
Better to call this `## Constrained Decoding`.
Amazing work! We are close to the end!
</details>

## Model and tokenizer
Perfect
Great, we are close to the end. Are there any parameters left? If not, after fixing these parameters, we can let yineng review.
Great! We made it!
@@ -66,7 +66,7 @@ In this document we aim to give an overview of the possible arguments when deplo

* `watchdog_timeout`: Adjusts the watchdog thread’s timeout before killing the server if batch generation takes too long.
* `download_dir`: Use to override the default Hugging Face cache directory for model weights.
* `base_gpu_id`: Use to adjust the first GPU used to distribute the model across available GPUs.
* `allow_auto_truncate`: Automatically truncate requests that exceed the maximum input length. (A combined launch sketch follows below.)
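For illustration, one launch line combining the four arguments in this hunk. The flag spellings are inferred from the parameter names and all values are placeholders, so treat this as a sketch rather than a verified invocation.

```
# Hypothetical sketch: custom weight cache, start placement at GPU 1,
# a 5-minute watchdog, and auto-truncation of over-long prompts.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --download-dir /data/hf-cache \
  --base-gpu-id 1 \
  --watchdog-timeout 300 \
  --allow-auto-truncate
```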
Great!
@zhyncs Awaiting a final go-over.
Motivation
As explained here, the current documentation of the backend needs an update, which we intend to implement in this PR.
Checklist