YARN-11444. Improve YARN md documentation format. (#6711) Contributed by Shilun Fan.

Reviewed-by: Ayush Saxena <[email protected]>
Signed-off-by: Shilun Fan <[email protected]>
slfan1989 authored Apr 7, 2024
1 parent 73e6931 commit 8c378d1
Showing 14 changed files with 22 additions and 22 deletions.
Original file line number Diff line number Diff line change
@@ -633,7 +633,7 @@ The following configuration parameters can be configured in yarn-site.xml to con
| `yarn.resourcemanager.reservation-system.planfollower.time-step` | *Optional* parameter: the frequency in milliseconds of the `PlanFollower` timer. Long value expected. The default value is *1000*. |


The `ReservationSystem` is integrated with the `CapacityScheduler` queue hierachy and can be configured for any **LeafQueue** currently. The `CapacityScheduler` supports the following parameters to tune the `ReservationSystem`:
The `ReservationSystem` is integrated with the `CapacityScheduler` queue hierarchy and can be configured for any **LeafQueue** currently. The `CapacityScheduler` supports the following parameters to tune the `ReservationSystem`:

| Property | Description |
|:---- |:---- |
@@ -879,7 +879,7 @@ Changing queue/scheduler properties and adding/removing queues can be done in tw
Remove the queue configurations from the file and run refresh as described above

### Enabling periodic configuration refresh
Enabling queue configuration periodic refresh allows reloading and applying the configuration by editing the *conf/capacity-scheduler.xml* without the necessicity of calling yarn rmadmin -refreshQueues.
Enabling queue configuration periodic refresh allows reloading and applying the configuration by editing the *conf/capacity-scheduler.xml* without the necessity of calling yarn rmadmin -refreshQueues.

| Property | Description |
|:---- |:---- |
Original file line number Diff line number Diff line change
@@ -173,5 +173,5 @@ class and want to give it a try in your Hadoop cluster.


Firstly, put the jar file under a directory in Hadooop classpath.
(recommend $HADOOP_COMMOND_HOME/share/hadoop/yarn). Secondly,
(recommend $HADOOP_COMMAND_HOME/share/hadoop/yarn). Secondly,
follow the configurations described in [Pluggable Device Framework](./PluggableDeviceFramework.html) and restart YARN.
Original file line number Diff line number Diff line change
@@ -216,7 +216,7 @@ The following properties should be set in yarn-site.xml:
Optional. This configuration setting determines the capabilities
assigned to docker containers when they are launched. While these may not
be case-sensitive from a docker perspective, it is best to keep these
uppercase. To run without any capabilites, set this value to
uppercase. To run without any capabilities, set this value to
"none" or "NONE"
</description>
</property>
@@ -568,7 +568,7 @@ There are several challenges with this bind mount approach that need to be
considered.

1. Any users and groups defined in the image will be overwritten by the host's users and groups
2. No users and groups can be added once the container is started, as /etc/passwd and /etc/group are immutible in the container. Do not mount these read-write as it can render the host inoperable.
2. No users and groups can be added once the container is started, as /etc/passwd and /etc/group are immutable in the container. Do not mount these read-write as it can render the host inoperable.

This approach is not recommended beyond testing given the inflexibility to
modify running containers.
@@ -715,7 +715,7 @@ Fine grained access control can also be defined using `docker.privileged-contain
docker.trusted.registries=library
```
In development environment, local images can be tagged with a repository name prefix to enable trust. The recommendation of choosing a repository name is using a local hostname and port number to prevent accidentially pulling docker images from Docker Hub or use reserved Docker Hub keyword: "local". Docker run will look for docker images on Docker Hub, if the image does not exist locally. Using a local hostname and port in image name can prevent accidental pulling of canonical images from docker hub. Example of tagging image with localhost:5000 as trusted registry:
In development environment, local images can be tagged with a repository name prefix to enable trust. The recommendation of choosing a repository name is using a local hostname and port number to prevent accidentally pulling docker images from Docker Hub or use reserved Docker Hub keyword: "local". Docker run will look for docker images on Docker Hub, if the image does not exist locally. Using a local hostname and port in image name can prevent accidental pulling of canonical images from docker hub. Example of tagging image with localhost:5000 as trusted registry:
```
docker tag centos:latest localhost:5000/centos:latest
Original file line number Diff line number Diff line change
@@ -41,7 +41,7 @@ Graceful Decommission of YARN Nodes is the mechanism to decommission NMs while m

To do a normal decommissioning:

1. Start a YARN cluster (with NodeManageres and ResourceManager)
1. Start a YARN cluster (with NodeManagers and ResourceManager)
2. Start a yarn job (for example with `yarn jar...` )
3. Add `yarn.resourcemanager.nodes.exclude-path` property to your `yarn-site.xml` (Note: you don't need to restart the ResourceManager)
4. Create a text file (the location is defined in the previous step) with one line which contains the name of a selected NodeManager
@@ -112,7 +112,7 @@ host3

Note: In the future more file formats are planned with timeout support. Follow the [YARN-5536](https://issues.apache.org/jira/browse/YARN-5536) if you are interested.

Important to mention, that the timeout is not persited. In case of a RM restart/failover the node will be immediatelly decommission. (Follow the [YARN-5464](https://issues.apache.org/jira/browse/YARN-5464) for changes in this behavior).
Important to mention, that the timeout is not persisted. In case of a RM restart/failover the node will be immediately decommission. (Follow the [YARN-5464](https://issues.apache.org/jira/browse/YARN-5464) for changes in this behavior).

### Client or server side timeout

Original file line number Diff line number Diff line change
@@ -106,7 +106,7 @@ Step 4. Configure a valid RPC address for the NodeManager.

Step 5. Auxiliary services.

* NodeManagers in a YARN cluster can be configured to run auxiliary services. For a completely functional NM restart, YARN relies on any auxiliary service configured to also support recovery. This usually includes (1) avoiding usage of ephemeral ports so that previously running clients (in this case, usually containers) are not disrupted after restart and (2) having the auxiliary service itself support recoverability by reloading any previous state when NodeManager restarts and reinitializes the auxiliary service.
* NodeManagers in a YARN cluster can be configured to run auxiliary services. For a completely functional NM restart, YARN relies on any auxiliary service configured to also support recovery. This usually includes (1) avoiding usage of ephemeral ports so that previously running clients (in this case, usually containers) are not disrupted after restart and (2) having the auxiliary service itself support recoverability by reloading any previous state when NodeManager restarts and reinitialized the auxiliary service.

* A simple example for the above is the auxiliary service 'ShuffleHandler' for MapReduce (MR). ShuffleHandler respects the above two requirements already, so users/admins don't have to do anything for it to support NM restart: (1) The configuration property **mapreduce.shuffle.port** controls which port the ShuffleHandler on a NodeManager host binds to, and it defaults to a non-ephemeral port. (2) The ShuffleHandler service also already supports recovery of previous state after NM restarts.

Original file line number Diff line number Diff line change
@@ -52,7 +52,7 @@ There is no reason to set them both. If the system runs with swap disabled, both
Virtual memory measurement and swapping
--------------------------------------------

There is a difference between the virtual memory reported by the container monitor and the virtual memory limit specified in the elastic memory control feature. The container monitor uses `ProcfsBasedProcessTree` by default for measurements that returns values from the `proc` file system. The virtual memory returned is the size of the address space of all the processes in each container. This includes anonymous pages, pages swapped out to disk, mapped files and reserved pages among others. Reserved pages are not backed by either physical or swapped memory. They can be a large part of the virtual memory usage. The reservabe address space was limited on 32 bit processors but it is very large on 64-bit ones making this metric less useful. Some Java Virtual Machines reserve large amounts of pages but they do not actually use it. This will result in gigabytes of virtual memory usage shown. However, this does not mean that anything is wrong with the container.
There is a difference between the virtual memory reported by the container monitor and the virtual memory limit specified in the elastic memory control feature. The container monitor uses `ProcfsBasedProcessTree` by default for measurements that returns values from the `proc` file system. The virtual memory returned is the size of the address space of all the processes in each container. This includes anonymous pages, pages swapped out to disk, mapped files and reserved pages among others. Reserved pages are not backed by either physical or swapped memory. They can be a large part of the virtual memory usage. The reservable address space was limited on 32 bit processors but it is very large on 64-bit ones making this metric less useful. Some Java Virtual Machines reserve large amounts of pages but they do not actually use it. This will result in gigabytes of virtual memory usage shown. However, this does not mean that anything is wrong with the container.

Because of this you can now use `CGroupsResourceCalculator`. This shows only the sum of the physical memory usage and swapped pages as virtual memory usage excluding the reserved address space. This reflects much better what the application and the container allocated.

Original file line number Diff line number Diff line change
@@ -29,7 +29,7 @@ Some of the pain points for current device plugin development and integration
are listed below:


* At least 6 classes to be implemented (If you wanna support
* At least 6 classes to be implemented (If you want to support
Docker, you’ll implement one more “DockerCommandPlugin”).
* When implementing the “ResourceHandler” interface,
the developer must understand the YARN NM internal concepts like container
Original file line number Diff line number Diff line change
@@ -41,7 +41,7 @@ With reference to the figure above, a typical reservation proceeds as follows:

* **Step 2** The ReservationSystem leverages a ReservationAgent (GREE in the figure) to find a plausible allocation for the reservation in the Plan, a data structure tracking all reservation currently accepted and the available resources in the system.

* **Step 3** The SharingPolicy provides a way to enforce invariants on the reservation being accepted, potentially rejecting reservations. For example, the CapacityOvertimePolicy allows enforcement of both instantaneous max-capacity a user can request across all of his/her reservations and a limit on the integral of resources over a period of time, e.g., the user can reserve up to 50% of the cluster capacity instantanesouly, but in any 24h period of time he/she cannot exceed 10% average.
* **Step 3** The SharingPolicy provides a way to enforce invariants on the reservation being accepted, potentially rejecting reservations. For example, the CapacityOvertimePolicy allows enforcement of both instantaneous max-capacity a user can request across all of his/her reservations and a limit on the integral of resources over a period of time, e.g., the user can reserve up to 50% of the cluster capacity instantaneously, but in any 24h period of time he/she cannot exceed 10% average.

* **Step 4** Upon a successful validation the ReservationSystem returns to the user a ReservationId (think of it as an airline ticket).

Original file line number Diff line number Diff line change
@@ -130,7 +130,7 @@ resource may also have optional minimum and maximum properties. The properties
must be named `yarn.resource-types.<resource>.minimum-allocation` and
`yarn.resource-types.<resource>.maximum-allocation`.

The `yarn.resource-types` property and any unit, mimimum, or maximum properties
The `yarn.resource-types` property and any unit, minimum, or maximum properties
may be defined in either the usual `yarn-site.xml` file or in a file named
`resource-types.xml`. For example, the following could appear in either file:

Original file line number Diff line number Diff line change
@@ -651,7 +651,7 @@ There are several challenges with this bind mount approach that need to be
considered.

1. Any users and groups defined in the image will be overwritten by the host's users and groups
2. No users and groups can be added once the container is started, as /etc/passwd and /etc/group are immutible in the container. Do not mount these read-write as it can render the host inoperable.
2. No users and groups can be added once the container is started, as /etc/passwd and /etc/group are immutable in the container. Do not mount these read-write as it can render the host inoperable.

This approach is not recommended beyond testing given the inflexibility to
modify running containers.
Original file line number Diff line number Diff line change
@@ -859,7 +859,7 @@ Below is the elements of a single event object. Note that `value` of
| Item | Data Type | Description|
|:---- |:---- |:---- |
| `eventtype` | string | The event type |
| `eventinfo` | map | The information of the event, which is orgainzied in a map of `key` : `value` |
| `eventinfo` | map | The information of the event, which is organized in a map of `key` : `value` |
| `timestamp` | long | The timestamp of the event |

### Response Examples:
@@ -1317,7 +1317,7 @@ None
| `queue` | string | The queue to which the application submitted |
| `appState` | string | The application state according to the ResourceManager - valid values are members of the YarnApplicationState enum: `FINISHED`, `FAILED`, `KILLED` |
| `finalStatus` | string | The final status of the application if finished - reported by the application itself - valid values are: `UNDEFINED`, `SUCCEEDED`, `FAILED`, `KILLED` |
| `progress` | float | The reported progress of the application as a percent. Long-lived YARN services may not provide a meaninful value here —or use it as a metric of actual vs desired container counts |
| `progress` | float | The reported progress of the application as a percent. Long-lived YARN services may not provide a meaningful value here —or use it as a metric of actual vs desired container counts |
| `trackingUrl` | string | The web URL of the application (via the RM Proxy) |
| `originalTrackingUrl` | string | The actual web URL of the application |
| `diagnosticsInfo` | string | Detailed diagnostics information on a completed application|
@@ -2019,7 +2019,7 @@ querying some entities, such as Domains; here the API deliberately
downgrades permission-denied outcomes as empty and not-founds responses.
This hides details of other domains from an unauthorized caller.
1. If the content of timeline entity PUT operations is invalid,
this failure *will not* result in an HTTP error code being retured.
this failure *will not* result in an HTTP error code being returned.
A status code of 200 will be returned —however, there will be an error code
in the list of failed entities for each entity which could not be added.
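
A minimal sketch of what this means for a client: a 200 status alone does not prove that every entity was stored, so the response body should be inspected as well. The helper name `failed_entities` is hypothetical, and the `errors` key and its `entity` / `entitytype` / `errorcode` fields are assumptions about the response shape that should be checked against the Timeline Server documentation for your release.

```python
import json


def failed_entities(status_code: int, body: str) -> list:
    """Return the per-entity failures reported in a timeline PUT response.

    Assumption: failures are listed under an "errors" key in the JSON body.
    """
    if status_code != 200:
        # Transport or authorization problems still surface as HTTP errors.
        raise RuntimeError(f"unexpected HTTP status {status_code}")
    payload = json.loads(body) if body.strip() else {}
    return payload.get("errors", [])


# Hypothetical response body: HTTP 200, but one entity was rejected.
sample = '{"errors": [{"entity": "e1", "entitytype": "t1", "errorcode": 1}]}'
for failure in failed_entities(200, sample):
    print(failure["entity"], failure["errorcode"])
```

A caller would typically treat a non-empty list as a partial failure and retry or log the listed entities.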

Original file line number Diff line number Diff line change
@@ -80,7 +80,7 @@ By default, YARN will automatically detect and config GPUs when above config is
device number of GPUs is using `nvidia-smi -q` and search `Minor Number`
output.

When minor numbers are specified manually, admin needs to include indice of GPUs
When minor numbers are specified manually, admin needs to include indices of GPUs
as well, format is `index:minor_number[,index:minor_number...]`. An example
of manual specification is `0:0,1:1,2:2,3:4"`to allow YARN NodeManager to
manage GPU devices with indices `0/1/2/3` and minor number `0/1/2/4`.
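
To make the `index:minor_number[,index:minor_number...]` format concrete, here is a small Python sketch that splits such a value into (index, minor number) pairs. The function `parse_gpu_devices` is hypothetical and is not part of YARN; it only mirrors the format as described in the text.

```python
def parse_gpu_devices(spec: str) -> list:
    """Parse 'index:minor_number[,index:minor_number...]' into (index, minor) tuples."""
    pairs = []
    for item in spec.split(","):
        index_text, sep, minor_text = item.strip().partition(":")
        if not sep:
            raise ValueError(f"expected index:minor_number, got {item!r}")
        # int() rejects empty or non-numeric parts with ValueError.
        pairs.append((int(index_text), int(minor_text)))
    return pairs


# The manual specification from the text: indices 0/1/2/3, minor numbers 0/1/2/4.
print(parse_gpu_devices("0:0,1:1,2:2,3:4"))
```

Note that the index and the minor number can differ, as in the last pair, which is why both must be given.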
Expand Down
Original file line number Diff line number Diff line change
@@ -48,7 +48,7 @@ Currently only GET is supported. It retrieves information about the resource spe

### Security

The web service REST API's go through the same security as the web UI. If your cluster adminstrators have filters enabled you must authenticate via the mechanism they specified.
The web service REST API's go through the same security as the web UI. If your cluster administrators have filters enabled you must authenticate via the mechanism they specified.

### Headers Supported

@@ -70,7 +70,7 @@ This release supports gzip compression if you specify gzip in the Accept-Encodin

This release of the web service REST APIs supports responses in JSON and XML formats. JSON is the default. To set the response format, you can specify the format in the Accept header of the HTTP request.

As specified in HTTP Response Codes, the response body can contain the data that represents the resource or an error message. In the case of success, the response body is in the selected format, either JSON or XML. In the case of error, the resonse body is in either JSON or XML based on the format requested. The Content-Type header of the response contains the format requested. If the application requests an unsupported format, the response status code is 500. Note that the order of the fields within response body is not specified and might change. Also, additional fields might be added to a response body. Therefore, your applications should use parsing routines that can extract data from a response body in any order.
As specified in HTTP Response Codes, the response body can contain the data that represents the resource or an error message. In the case of success, the response body is in the selected format, either JSON or XML. In the case of error, the response body is in either JSON or XML based on the format requested. The Content-Type header of the response contains the format requested. If the application requests an unsupported format, the response status code is 500. Note that the order of the fields within response body is not specified and might change. Also, additional fields might be added to a response body. Therefore, your applications should use parsing routines that can extract data from a response body in any order.
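
The advice about field order can be illustrated with a short Python sketch that reads fields by name and ignores anything it does not recognize. The `id` and `user` values are taken from the sample response in this document; `extraField` and the helper name `read_app` are hypothetical, added only to show that reordered or additional fields do not break the client.

```python
import json


def read_app(body: str) -> dict:
    """Extract the fields this client needs by key, regardless of their order."""
    app = json.loads(body)["app"]
    # Look up by name; unknown fields are simply ignored.
    return {"id": app.get("id"), "user": app.get("user")}


# Same data in two different field orders, one with an extra (hypothetical) field.
first = '{"app": {"id": "application_1324057493980_0001", "user": "user1"}}'
second = '{"app": {"extraField": 1, "user": "user1", "id": "application_1324057493980_0001"}}'
print(read_app(first) == read_app(second))
```

Using `get` rather than direct indexing also keeps the client working if an optional field is absent.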

### Response Errors

@@ -101,7 +101,7 @@ Response Body:

```json
{
app":
"app":
{
"id":"application_1324057493980_0001",
"user":"user1",
Original file line number Diff line number Diff line change
@@ -576,7 +576,7 @@ system property in AM).

`[ ]` Web browser interaction verified in secure cluster.

`[ ]` REST client interation (GET operations) tested.
`[ ]` REST client integration (GET operations) tested.

`[ ]` Application continues to run after Kerberos Token expiry.

