
Use as SWAP #3

Open · snshn opened this issue Dec 14, 2014 · 33 comments

@snshn commented Dec 14, 2014

Was wondering if it would be possible to host a swap partition within vramfs, or to somehow patch vramfs to make it work as a swap partition?

My drive is encrypted, therefore I don't use swap partitions... but if this thing could give me 3 GB or so of a swap-like fs, we could be onto something...

Do you think it could work without FUSE, natively?

Oh, and great idea behind vramfs, really neat!

@Overv (Owner) commented Dec 14, 2014

It's possible to implement a block device with OpenCL backing it. It could probably be developed pretty quickly with something like BUSE.

@ptman commented Dec 14, 2014

If you can provide a block device, then you can also build RAID-0 on top of the block devices.

@Overv (Owner) commented Dec 14, 2014

@ptman That is a great point. I'm going to look into writing a kernel module to do this tomorrow. I've tried BUSE, but it seems to be bottlenecking because it's based on the network block device interface.

@snshn (Author) commented Dec 14, 2014

A kernel module and some kind of analogue to swapon/swapoff would make this thing look very serious.

Both FUSE and BUSE would definitely only slow things down.

Good luck @Overv, thanks for sharing!

@Overv (Owner) commented Dec 15, 2014

I've done some preliminary testing with BUSE and trivial OpenCL code. The read speed is 1.1 GB/s and the write speed 1.5 GB/s with ext4. Writing my own kernel module is going to take more time, and it'll still require a userspace daemon to interact with OpenCL.
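For anyone who wants to reproduce that kind of number, a rough dd-based throughput test over an ext4 filesystem on such a block device might look like the following. This is a minimal sketch: /dev/nbd0 and the mount point are assumptions, and dd only gives a coarse sequential figure.

    # Assumption: the BUSE-backed device shows up as /dev/nbd0.
    sudo mkfs.ext4 /dev/nbd0
    sudo mount /dev/nbd0 /mnt/vramblk

    # Sequential write throughput, bypassing the page cache.
    dd if=/dev/zero of=/mnt/vramblk/testfile bs=1M count=1024 oflag=direct

    # Drop caches, then measure sequential read throughput.
    sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
    dd if=/mnt/vramblk/testfile of=/dev/null bs=1M iflag=direct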

@snshn (Author) commented Dec 15, 2014

Wow, very good news, @Overv!

I think the daemon is necessary just to provide the proper RAID support across multiple vramfs-based block devices and to control the amount of memory dedicated per adapter... I believe a package named vramfs-tools containing vramfsd and vramfsctl could fit the purpose...

Wondering what @torvalds will think of this project, maybe it'll end up being included in the tree like tmpfs...
4GB of VRAM on my Linux laptop feels like such a waste... bet I'm not the only one who feels that way.

Thanks for your work, once again!

@agrover commented Dec 15, 2014

If you want a userspace-backed block (SCSI) device, I would encourage you to look at TCMU, which was just added in Linux 3.18. It's part of the LIO kernel target. Using it along with the loopback fabric and https://github.com/agrover/tcmu-runner may fill in some missing pieces. tcmu-runner handles the "you need a daemon" part, so the work would just consist of a VRAM-backed plugin for servicing SCSI commands like READ and WRITE. Then you'd have the basic block device, for swap or a filesystem or whatever.

(tcmu-runner is still alpha, but I think it would save you from writing kernel code and a daemon from scratch. Feedback welcome.)

@bisqwit (Contributor) commented Jan 4, 2020

While it is technically possible to create a file on vramfs and use it as swap, this is risky: what happens if vramfs itself, or one of the GPU libraries, gets swapped out? This can happen in a low-memory situation, i.e. exactly the situation that swap is designed to help with. The kernel cannot possibly know that restoring data from the swap depends on data that is… in the swap.
This is not an issue for kernel-space filesystem/storage drivers, because the kernel's own RAM never gets swapped, but it is a conundrum for user-space stuff.

@j123b567 commented:
For a kernel-space driver, it would be nice to use TTM/GEM directly to allocate video RAM buffers.

@bisqwit (Contributor) commented Jan 21, 2020

What are TTM/GEM?

Note that the slram/phram/mtdblock thing can only access at most like 256 MB of the memory, the size of the memory window (I guess) of the PCI device.

@j123b567 commented:
I don't know much, but they are interfaces for accessing GPU memory inside the kernel, so a driver can see all of the GPU memory, not just the directly accessible mapped part. https://www.kernel.org/doc/html/latest/gpu/drm-mm.html

My situation: an NVIDIA dedicated GPU with 4 GB of RAM and the nouveau driver, which lacks OpenCL support. This memory is not mapped into the memory space, so I can't use it via slram/phram.

@dhalsimax commented Oct 6, 2020

> It's possible to implement a block device with OpenCL backing it. It could probably be developed pretty quickly with something like BUSE.

The easy way to accomplish this is to use vramfs as-is: make a file on the vramfs disk, attach a loop device to that file, format the loop device with mkswap, and then swapon (sketched below). With this method everything seemed to work when I tried it. Anyway, the big issue with FUSE or BUSE is that both run in user space, and user space is swappable. I have not tried it, but suppose the memory of the vramfs process itself gets swapped out by the kernel; how would the kernel recover from a page fault when servicing it requires reloading those very pages in the first place? I am curious what would happen then.

Edit: sorry, I hadn't read the earlier comments; bisqwit already explained this. Anyway, I tried using it as swap, and after a while the system froze and needed a hard reboot (power off and on, sob)...
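For reference, a minimal sketch of that procedure, assuming vramfs is already mounted at /mnt/vram and a 3 GB swap file is wanted (the paths and sizes are illustrative):

    # Create the backing file with dd rather than truncate, so it has no holes.
    dd if=/dev/zero of=/mnt/vram/swapfile bs=1M count=3072

    # Attach a loop device; losetup -f --show prints the device it picked.
    LOOPDEV=$(sudo losetup -f --show /mnt/vram/swapfile)

    # Format and enable it as swap.
    sudo mkswap "$LOOPDEV"
    sudo swapon "$LOOPDEV"

    # Teardown, when done:
    # sudo swapoff "$LOOPDEV" && sudo losetup -d "$LOOPDEV"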

@LHLaurini commented:
> What happens if VRAMFS itself, or one of the GPU libraries, gets swapped?

Couldn't mlockall be used to prevent vramfs from getting swapped?

@montvid commented Dec 15, 2020

Wonderful idea! I am running an old headless server with a 1 GB DDR3 AMD card supporting OpenCL 1.1. I can use all the video RAM, as I only use SSH. Unfortunately vramfs does not let me create a swap-file-based swap; I get "swapon: /mnt/vram/swapfile: swapon failed: Invalid argument". Can it be fixed? I see OpenCL 1.2 has been merged into Mesa 20.3, so good times are ahead for this project.

@wonghang commented:
It doesn't work for me, even though I tried to mlockall() the userspace program's pages. I think the nvidia driver allocates some memory that can still be swapped. At some point, the computer gets into a deadlock when memory is low.

I also tried the BUSE/nbd approach. It doesn't work for me either.

I think we would need to get into the nvidia driver, carefully develop a block device kernel driver, and call these undocumented APIs:

cat /proc/kallsyms | grep rm_gpu_ops | sort -k 3
0000000000000000 t rm_gpu_ops_address_space_create	[nvidia]
0000000000000000 t rm_gpu_ops_address_space_destroy	[nvidia]
0000000000000000 t rm_gpu_ops_bind_channel_resources	[nvidia]
0000000000000000 t rm_gpu_ops_channel_allocate	[nvidia]
0000000000000000 t rm_gpu_ops_channel_destroy	[nvidia]
0000000000000000 t rm_gpu_ops_create_session	[nvidia]
0000000000000000 t rm_gpu_ops_destroy_access_cntr_info	[nvidia]
0000000000000000 t rm_gpu_ops_destroy_fault_info	[nvidia]
0000000000000000 t rm_gpu_ops_destroy_session	[nvidia]
0000000000000000 t rm_gpu_ops_device_create	[nvidia]
0000000000000000 t rm_gpu_ops_device_destroy	[nvidia]
0000000000000000 t rm_gpu_ops_disable_access_cntr	[nvidia]
0000000000000000 t rm_gpu_ops_dup_address_space	[nvidia]
0000000000000000 t rm_gpu_ops_dup_allocation	[nvidia]
0000000000000000 t rm_gpu_ops_dup_memory	[nvidia]
0000000000000000 t rm_gpu_ops_enable_access_cntr	[nvidia]
0000000000000000 t rm_gpu_ops_free_duped_handle	[nvidia]
0000000000000000 t rm_gpu_ops_get_channel_resource_ptes	[nvidia]
0000000000000000 t rm_gpu_ops_get_ecc_info	[nvidia]
0000000000000000 t rm_gpu_ops_get_external_alloc_ptes	[nvidia]
0000000000000000 t rm_gpu_ops_get_fb_info	[nvidia]
0000000000000000 t rm_gpu_ops_get_gpu_info	[nvidia]
0000000000000000 t rm_gpu_ops_get_non_replayable_faults	[nvidia]
0000000000000000 t rm_gpu_ops_get_p2p_caps	[nvidia]
0000000000000000 t rm_gpu_ops_get_pma_object	[nvidia]
0000000000000000 t rm_gpu_ops_has_pending_non_replayable_faults	[nvidia]
0000000000000000 t rm_gpu_ops_init_access_cntr_info	[nvidia]
0000000000000000 t rm_gpu_ops_init_fault_info	[nvidia]
0000000000000000 t rm_gpu_ops_memory_alloc_fb	[nvidia]
0000000000000000 t rm_gpu_ops_memory_alloc_sys	[nvidia]
0000000000000000 t rm_gpu_ops_memory_cpu_map	[nvidia]
0000000000000000 t rm_gpu_ops_memory_cpu_ummap	[nvidia]
0000000000000000 t rm_gpu_ops_memory_free	[nvidia]
0000000000000000 t rm_gpu_ops_own_page_fault_intr	[nvidia]
0000000000000000 t rm_gpu_ops_p2p_object_create	[nvidia]
0000000000000000 t rm_gpu_ops_p2p_object_destroy	[nvidia]
0000000000000000 t rm_gpu_ops_pma_alloc_pages	[nvidia]
0000000000000000 t rm_gpu_ops_pma_free_pages	[nvidia]
0000000000000000 t rm_gpu_ops_pma_pin_pages	[nvidia]
0000000000000000 t rm_gpu_ops_pma_register_callbacks	[nvidia]
0000000000000000 t rm_gpu_ops_pma_unpin_pages	[nvidia]
0000000000000000 t rm_gpu_ops_pma_unregister_callbacks	[nvidia]
0000000000000000 t rm_gpu_ops_query_caps	[nvidia]
0000000000000000 t rm_gpu_ops_query_ces_caps	[nvidia]
0000000000000000 t rm_gpu_ops_release_channel	[nvidia]
0000000000000000 t rm_gpu_ops_release_channel_resources	[nvidia]
0000000000000000 t rm_gpu_ops_report_non_replayable_fault	[nvidia]
0000000000000000 t rm_gpu_ops_retain_channel	[nvidia]
0000000000000000 t rm_gpu_ops_retain_channel_resources	[nvidia]
0000000000000000 t rm_gpu_ops_service_device_interrupts_rm	[nvidia]
0000000000000000 t rm_gpu_ops_set_page_directory	[nvidia]
0000000000000000 t rm_gpu_ops_stop_channel	[nvidia]
0000000000000000 t rm_gpu_ops_unset_page_directory	[nvidia]

to create a GPU session and allocate GPU memory in order to make a GPU swap truly possible.

@azureblue commented:
Hi guys, any update on this? Has anyone been able to reliably use VRAM as swap?

@bisqwit (Contributor) commented Dec 3, 2021

It only works if the following two conditions are met:

  1. The GPU driver code/data is never put in swap.
  2. The vramfs driver code/data is never put in swap.

If you can somehow guarantee these two aspects, then using VRAM as swap will work. (A quick check for the second condition is sketched below.)
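One non-authoritative way to sanity-check the second condition, assuming a single vramfs process and the standard /proc status fields:

    # VmLck should be close to VmSize if mlockall() succeeded,
    # and VmSwap should stay at 0 kB under memory pressure.
    grep -E 'VmSize|VmLck|VmSwap' "/proc/$(pidof vramfs)/status"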

@montvid commented Dec 3, 2021

Did not work for me the one time I tried it. Seems the project is abandoned...

@wonghang commented Dec 3, 2021

FUSE itself should be able to avoid being swapped, but when I attempted to add mlockall() to the vramfs code, it didn't work either. It appears that the GPU driver (nvidia) and the CUDA libraries were swapped out.

In the nvidia driver there are some undocumented functions (prefixed with rm_; run cat /proc/kallsyms | grep nvidia to see them) for accessing GPU memory. I think they are part of GPUDirect RDMA (https://docs.nvidia.com/cuda/gpudirect-rdma/index.html).
If we can somehow hook into them and write a kernel driver to handle the paging, it may be possible to use the GPU as swap.

@Atrate (Contributor) commented May 29, 2022

It is possible to achieve this, see https://wiki.archlinux.org/title/Swap_on_video_RAM , section FUSE.

> The vramfs driver code/data is never put in swap.

This can be achieved with https://wiki.archlinux.org/title/Swap_on_video_RAM#Complete_system_freeze_under_high_memory_pressure

I tested it under high memory pressure (stress -m 10 --vm-bytes 3G --vm-hang 10 on a 32 GB system) and it didn't fall over, but only after applying the aforementioned fix.
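For anyone repeating this test, one way to watch the VRAM-backed swap actually being exercised while stress runs (a sketch; the exact swap device name depends on your setup):

    # Confirm the swap device is active.
    swapon --show

    # Watch swap usage change under load.
    watch -n1 'grep -E "SwapTotal|SwapFree" /proc/meminfo'

    # In another terminal, apply memory pressure as above.
    stress -m 10 --vm-bytes 3G --vm-hang 10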

@bisqwit (Contributor) commented May 29, 2022

This looks like a proper solution indeed.

@Atrate (Contributor) commented Nov 27, 2022

I've tried implementing mlockall. If you want to, you can test whether it works for you and fixes deadlocks without needing to use a systemd service.

#32

@twobombs commented:
I would like to add to this discussion that the addition of vramfs as a block device would help with using vramfs as a dedicated L2ARC ZFS buffer.

We are using very big dedicated NVMe swap RAID arrays for quantum computing and need something faster than 8-16 NVMe sticks in RAID to collect the I/O in a buffer that is not in main memory.

We make use of a lot of (virtual) memory, so an L2ARC buffer in VRAM would be awesome; the GPUs would get a new lease on life, because we moved to CPU-only calculation due to the huge memory requirements of storing the eigenvector (think 8/16 TB).

@Atrate (Contributor) commented Jan 15, 2023

> I would like to add to this discussion that the addition of vramfs as a block device would help using vramfs as a dedicated L2ARC ZFS buffer. […]

@twobombs You can make a loop device with losetup, but NVMe RAID will probably be faster than VRAM swap; the performance is still somewhat lacking in certain areas.

@twobombs commented:
@Atrate thank you very much for the loop solution. I will look into this, and into whether ZFS will accept a loop device as cache. The swap I/O usage pattern is random read/write, not streaming; a PCIe VRAM device might offer better speeds while at the same time making the workload on the NVMe RAID devices more 'stream'-lined when changes are committed to the array.
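A minimal sketch of that idea, assuming vramfs is mounted at /mnt/vram and the pool is named tank (both names are illustrative, and whether a loop-backed L2ARC helps a given workload is a separate question):

    # Back a loop device with a file on vramfs.
    dd if=/dev/zero of=/mnt/vram/l2arc.img bs=1M count=3072
    LOOPDEV=$(sudo losetup -f --show /mnt/vram/l2arc.img)

    # Attach it to the pool as an L2ARC cache device.
    sudo zpool add tank cache "$LOOPDEV"

    # Verify; a cache device can later be detached with zpool remove.
    zpool status tank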

@twobombs commented Feb 3, 2023

I went a step further and added VRAM cache files for ZFS-based swap.
It is fairly hilarious to see the I/O come through in NVTOP.

[Screenshot: NVTOP showing GPU I/O, 2023-02-01]

@aedalzotto commented Aug 3, 2023

> It is possible to achieve this, see https://wiki.archlinux.org/title/Swap_on_video_RAM , section FUSE. […] I tested it under high memory pressure (stress -m 10 --vm-bytes 3G --vm-hang 10 on a 32 GB system) and it didn't fall over, but only after applying the aforementioned fix.

The solution seems to work for me, but when I increase swappiness from 10 to 180, the system simply freezes.
The same happens without increasing swappiness when running mprime.

I am running vramfs as a service, as the workaround cited above suggests. The only thing I think I am doing differently is using a loop device, since my swapfile was being created with holes.

Does anyone have an idea of what is happening?

UPDATE:
I checked the journal from the last boot, and it reported the following error:
kernel: [drm:amdgpu_dm_atomic_check [amdgpu]] ERROR [CRTC:82:crtc-0] hw_done or flip_done timed out

@Atrate (Contributor) commented Aug 12, 2023

In reply to: #3 (comment)

As suggested by fanzhuyifan and others above, I think that may be due to other GPU-management processes/libraries getting swapped out. Maybe a fix is possible with a lot of systemd unit editing, but that would require tracking down every single library and process required for the operation of a dGPU, and that seems like a chore.

@fanzhuyifan commented Aug 13, 2023

According to the documentation of mlockall:

> mlockall() locks all pages mapped into the address space of the
> calling process. This includes the pages of the code, data, and
> stack segment, as well as shared libraries, user space kernel
> data, shared memory, and memory-mapped files. All mapped pages
> are guaranteed to be resident in RAM when the call returns
> successfully; the pages are guaranteed to stay in RAM until later
> unlocked.

So shared libraries directly used by vramfs being swapped out should not be the reason for the system freezes.

Edit: Examining the resident size and virtual memory size of the vramfs process, I think the issue is that vramfs asks for additional memory to serve reads/writes.

@Atrate (Contributor) commented Aug 13, 2023

> Edit: Examining the resident size and virtual memory size of the vramfs process, I think the issue is that vramfs asks for additional memory to serve reads/writes.

Is it? mlockall is called with the MCL_CURRENT | MCL_FUTURE flags, so it should also prevent all future allocations of memory from being swapped, unless I misunderstood the documentation.

Code in vramfs:

if (mlockall(MCL_CURRENT | MCL_FUTURE)) {

Documentation:

       MCL_CURRENT
              Lock all pages which are currently mapped into the address
              space of the process.

       MCL_FUTURE
              Lock all pages which will become mapped into the address
              space of the process in the future.  These could be, for
              instance, new pages required by a growing heap and stack
              as well as new memory-mapped files or shared memory
              regions.

@fanzhuyifan commented Aug 13, 2023

> Edit: Examining the resident size and virtual memory size of the vramfs process, I think the issue is that vramfs asks for additional memory to serve reads/writes.

> Is it? mlockall is called with the MCL_CURRENT | MCL_FUTURE flags, so it should also prevent all future allocations of memory from being swapped, unless I misunderstood the documentation.

Here are the steps to prove my point (on Linux):

  1. Start vramfs, say creating a filesystem of size 2000 MB, and find the PID of the process.
  2. Run cat /proc/PID/status | grep Vm to find the memory information. On a particular run on my computer I got:

VmPeak: 7060808 kB
VmSize: 7060808 kB
VmLck: 6990308 kB
VmPin: 0 kB
VmHWM: 275588 kB
VmRSS: 275588 kB
VmData: 144976 kB
VmStk: 164 kB
VmExe: 132 kB
VmLib: 14156 kB
VmPTE: 628 kB
VmSwap: 0 kB

  3. Write random data to a file on the vramfs, and check memory usage again.
     First run dd if=/dev/random of=/tmp/vram/swapfile bs=1M count=1000, and then I got:

VmPeak: 7585096 kB
VmSize: 7388488 kB
VmLck: 7317988 kB
VmPin: 0 kB
VmHWM: 286148 kB
VmRSS: 286148 kB
VmData: 156092 kB
VmStk: 164 kB
VmExe: 132 kB
VmLib: 14156 kB
VmPTE: 668 kB
VmSwap: 0 kB

Note that VmPeak, VmSize, VmLck, VmHWM, VmRSS, VmData, and VmPTE (bolded in the original) all increased.

  4. Let's read that file and check memory usage again.
     First run sha256sum /tmp/vram/swapfile, and then I got:

VmPeak: 7585096 kB
VmSize: 7462220 kB
VmLck: 7391720 kB
VmPin: 0 kB
VmHWM: 296072 kB
VmRSS: 296072 kB
VmData: 165960 kB
VmStk: 164 kB
VmExe: 132 kB
VmLib: 14156 kB
VmPTE: 692 kB
VmSwap: 0 kB

The same entries, except VmPeak, increased again.

I believe this proves that vramfs asks for more memory when serving read and write requests. I am not saying the extra memory is swapped; I am just saying that vramfs sometimes asks for extra memory to serve reads and writes.

I suspect that this is the reason the computer freezes when using vramfs as swap, even with the mlockall call: in a system under high memory pressure, the OS tries to swap some memory pages out to vramfs. To serve this request, vramfs needs to perform writes to the VRAM, and in the process asks for more memory. Since there is no available memory left, the system freezes.

@jnturton commented Oct 6, 2023

> Since there is already no available memory, the system freezes.

Wouldn't we see OOM Killer entries in the kernel logs in this case?
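For what it's worth, a quick way to check for OOM-killer activity after such a freeze (standard Linux tooling, nothing vramfs-specific):

    # Kernel messages from the OOM killer, if it ran before the freeze.
    sudo dmesg | grep -iE 'out of memory|oom'

    # Or, on a systemd machine, search the previous boot's journal.
    journalctl -k -b -1 | grep -i oom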
