
Commit

Feat: FRIKA integrated & roadmap (#2)
* feat: frika integrated & roadmap

Signed-off-by: Iztok Lebar Bajec <[email protected]>

* feat: nxt & amd go live

Signed-off-by: Iztok Lebar Bajec <[email protected]>

* fix: k8s text and wording

Signed-off-by: Iztok Lebar Bajec <[email protected]>

---------

Signed-off-by: Iztok Lebar Bajec <[email protected]>
itzsimpl authored Jun 16, 2024
1 parent 413594d commit 2d8b240
Showing 7 changed files with 1,684 additions and 4 deletions.
2 changes: 1 addition & 1 deletion docs/FRIDA/about.md
@@ -1,6 +1,6 @@
# About FRIDA

-As part of its strategy to support its researchers in their endeavor the University of Ljubljana, Faculty of Computer and Information Science ([UL FRI](https://www.fri.uni-lj.si)) actively invests in research infrastructure. In the summer of 2023, one such investment was the expansion of an existing, homegrown Slurm cluster with additional hardware. The Slurm cluster was initially set up with the help of the Development of Slovene in the Digital Environment (RSDO) project in 2020 when the first [NVIDIA DGX-A100](https://docs.nvidia.com/dgx/dgxa100-user-guide/) system was acquired. The most recent expansion consists of a new and more capable login node, and two compute nodes with 8 80GB/GPU GPUs each, one of them being an [NVIDIA DGX-H100](https://docs.nvidia.com/dgx/dgxh100-user-guide/), the currently most capable and sought-after NVIDIA DGX system. To mark this expansion the Slurm cluster was renamed to FRIDA.
+FRIDA is a Slurm cluster that was initially set up in 2020, with the help of the Development of Slovene in the Digital Environment (RSDO) project, when the first [NVIDIA DGX-A100](https://docs.nvidia.com/dgx/dgxa100-user-guide/) system was acquired. The most recent expansion consists of a new, more capable login node and two compute nodes with eight 80 GB GPUs each, one of them an [NVIDIA DGX-H100](https://docs.nvidia.com/dgx/dgxh100-user-guide/), currently the most capable and sought-after NVIDIA DGX system.

FRIDA is planned to be progressively expanded. For example, in Q1 of 2024 attention will be given to future and alternative technologies, with the addition of development-kit compute nodes equipped with AMD MI210 GPUs as well as NVIDIA Grace Hopper Superchips. Later expansions will concentrate on a faster InfiniBand data interconnect, faster and larger shared data storage, and additional compute nodes.
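
Jobs on FRIDA are submitted through Slurm. As an illustrative sketch only (the directive values below are hypothetical, not official FRIDA defaults or templates), a single-GPU batch job could look like this:

```bash
#!/bin/bash
#SBATCH --job-name=gpu-test      # job name shown in the queue
#SBATCH --gres=gpu:1             # request a single GPU
#SBATCH --cpus-per-task=8        # CPU cores for the task
#SBATCH --mem=64G                # host RAM for the job
#SBATCH --time=01:00:00         # wall-clock time limit

# List the GPUs Slurm allocated to this job.
nvidia-smi -L
```

Saved as `job.sh`, such a script would be submitted with `sbatch job.sh`; actual partition names, limits, and defaults are site-specific.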

7 changes: 7 additions & 0 deletions docs/FRIKA/about.md
@@ -0,0 +1,7 @@
# About FRIKA

FRIKA consists of three NVIDIA HGX Redstone GPU servers specifically dedicated to inferencing. The principal goal is to provide infrastructure for researchers and laboratories that want to offer their solutions (models/applications) as web services and thus promote their research/development work at UL FRI.

FRIKA currently runs as a set of individual nodes that provide resources via Incus virtual machines and/or containers. It is planned to be progressively expanded, depending on requirements. The long-term plan is to port all services to a Kubernetes-based cluster.
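
Under this model, allocating a VM on a node could be sketched with standard Incus commands along the following lines (the image, instance name, resource sizes, and GPU device shown here are illustrative assumptions, not the actual FRIKA configuration):

```bash
# Create a virtual machine with fixed CPU and memory limits
# (image, instance name, and sizes are illustrative).
incus launch images:ubuntu/22.04 inference-vm --vm \
  -c limits.cpu=8 -c limits.memory=40GiB

# Pass a GPU through to the instance.
incus config device add inference-vm gpu0 gpu

# Verify that the instance is running.
incus list inference-vm
```

On FRIKA these steps are performed by the administrators; users receive access to an already-provisioned VM matching their approved resource request.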

Research labs that own and manage their own inferencing systems may inquire about the possibility of integrating their infrastructure into FRIKA. Co-funding of future FRIKA expansions is also possible. All inquiries should be addressed to the UL FRI Management Board; the technical details will be coordinated by the FRIKA technical committee.
8 changes: 8 additions & 0 deletions docs/FRIKA/access.md
@@ -0,0 +1,8 @@
# Obtaining resources

Access to the FRIKA infrastructure may be granted (upon request) to all employees of UL FRI. Applications should be addressed to the UL FRI Management Board. They should briefly describe the services to be deployed on the allocated VM and the amount of resources needed (number of vCPUs, amount of VM memory, amount of GPU memory, amount of disk), together with a justification of the scope. The maximum amount of GPU memory that can be requested per GPU is 40 GB (the systems are based on the A100 40GB SXM4). The application should also state the expected level of utilization, i.e. the expected number of users, and how this fits in with the promotion of UL FRI. All websites and/or services deployed on FRIKA are expected to announce that they are running on the UL FRI FRIKA infrastructure (displaying the UL FRI logo is sufficient).

Once access is granted by the UL FRI Management Board, the technical questions should be directed to [email protected].

Resource usage is monitored, and in the case of a higher number of applications, the allocated quota may be reduced based on utilization history.

