
blur function is too slow #986

Open
Aloxaf opened this issue Jul 8, 2019 · 22 comments

Comments

@Aloxaf commented Jul 8, 2019

blur is too slow; imageproc::filter::gaussian_blur_f32 is faster, but still slow.

Reproduction steps

use image::imageops::blur;
use image::{Rgb, RgbImage};
use imageproc::drawing::draw_filled_circle_mut;
use imageproc::filter::gaussian_blur_f32;
use std::time::Instant;

fn main() {
    let mut image = RgbImage::new(1000, 1000);
    draw_filled_circle_mut(&mut image, (500, 500), 500, Rgb([255, 255, 255]));

    let start = Instant::now();
    blur(&image, 100.0).save("rust_1.png").unwrap();
    dbg!(start.elapsed());

    let start = Instant::now();
    gaussian_blur_f32(&image, 100.0).save("rust_2.png").unwrap();
    dbg!(start.elapsed());
}

result is

[src/main.rs:13] start.elapsed() = 12.362780732s
[src/main.rs:17] start.elapsed() = 3.335411994s
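One caveat about the numbers above: `.save()` sits inside the timed region, so PNG encoding is counted in both measurements. A minimal sketch of a harness that times the blur and the encode as separate regions (the `time_it` helper is mine, not part of `image`):

```rust
use std::time::Instant;

// Hypothetical helper: run a closure and report how long it took, so the
// blur and the PNG encode can be timed separately instead of inside one
// combined `blur(...).save(...)` expression.
fn time_it<T>(label: &str, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let out = f();
    println!("{label}: {:?}", start.elapsed());
    out
}
```

With this, `let blurred = time_it("blur", || blur(&image, 100.0));` followed by `time_it("encode", || blurred.save("rust_1.png"))` would separate the two costs.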

and Pillow

from PIL import Image, ImageDraw, ImageFilter
from time import time

img  = Image.new('RGB', (1000, 1000))
draw = ImageDraw.Draw(img)
draw.ellipse((0, 0, 1000, 1000), fill='#ffffff')

start = time()
img.filter(ImageFilter.GaussianBlur(100)).save('python.png')
print(f'{time() - start}s')

result is 0.1433885097503662s

Here are the final images:

rust_1.png (it looks a little strange...)

rust_2.png

python.png

@HeroicKatora (Member)

Just to clarify: PIL indirectly uses a C implementation: https://github.com/python-pillow/Pillow/blob/ab9a25d623fdd7f8de3e724b538f5660eac589ae/src/libImaging/BoxBlur.c#L294

It's still somewhat embarrassingly slow.

@Aloxaf (Author) commented Jul 8, 2019

@HeroicKatora Yes, maybe it would be a good idea to port it to Rust?

@theotherphil (Contributor) commented Jul 8, 2019

Pillow appears to use (a variant on) iterated box filtering, as defined in http://www.mia.uni-saarland.de/Publications/gwosdek-ssvm11.pdf.

There's an open imageproc ticket to implement this: image-rs/imageproc#93
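For reference, a minimal sketch of the box-width selection behind iterated box filtering (the widely used "boxes for Gauss" derivation from the paper above; the function name is mine, not from imageproc or Pillow). Applying `n` box filters of these widths approximates a Gaussian of standard deviation `sigma`; it assumes `sigma` is large enough that the computed widths stay positive:

```rust
// Choose n odd box-filter widths whose iterated application approximates a
// Gaussian of standard deviation `sigma` (after Gwosdek et al.): the first
// m passes use width wl, the remaining passes use wl + 2.
fn boxes_for_gauss(sigma: f32, n: usize) -> Vec<usize> {
    // Ideal averaging-filter width for n passes.
    let w_ideal = (12.0 * sigma * sigma / n as f32 + 1.0).sqrt();
    let mut wl = w_ideal.floor() as i32;
    if wl % 2 == 0 {
        wl -= 1; // box widths must be odd
    }
    let wu = wl + 2;
    // Number of passes that should use the smaller width wl.
    let m_ideal = (12.0 * sigma * sigma
        - n as f32 * (wl * wl) as f32
        - 4.0 * n as f32 * wl as f32
        - 3.0 * n as f32)
        / (-4.0 * wl as f32 - 4.0);
    let m = m_ideal.round() as usize;
    (0..n)
        .map(|i| if i < m { wl as usize } else { wu as usize })
        .collect()
}
```

Each box pass is O(1) per pixel with a running sum, so the whole approximation is independent of `sigma`.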

@theotherphil (Contributor)

I'll have a look at implementing this over the coming weekend.

@Shnatsel (Contributor) commented Oct 2, 2019

There is actually a linear-complexity blur already implemented in Rust, where execution time is independent of blur radius: https://github.com/fschutt/fastblur

A copy of it was incorporated into the resvg codebase and polished to match Chrome's blur. You can find it in this file; search for box_blur. It should be fairly easy to extract it back into a common crate and/or copy it into image.
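The core trick behind such radius-independent blurs is a running window sum: each output sample is produced by one add and one subtract, so a box pass costs O(1) per pixel regardless of radius. A self-contained 1-D sketch (not the actual fastblur or resvg code):

```rust
// One horizontal box-blur pass using a running sum: the window total is
// updated incrementally, so per-sample cost does not depend on `radius`.
// Borders are handled by clamping (edge replication).
fn box_blur_1d(src: &[f32], radius: usize) -> Vec<f32> {
    let n = src.len();
    let window = 2 * radius + 1;
    let mut out = vec![0.0f32; n];
    let clamp = |i: isize| i.clamp(0, n as isize - 1) as usize;

    // Sum of the initial window centred on index 0.
    let mut sum: f32 = 0.0;
    for i in -(radius as isize)..=(radius as isize) {
        sum += src[clamp(i)];
    }
    for i in 0..n {
        out[i] = sum / window as f32;
        // Slide the window right: add the entering sample, drop the leaving one.
        sum += src[clamp(i as isize + radius as isize + 1)];
        sum -= src[clamp(i as isize - (radius as isize))];
    }
    out
}
```

A full 2-D blur runs this over rows, then over columns, repeating a few times to approximate a Gaussian.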

@vadixidav

You may want to take a look at the code here. It was originally based on a different paper, written by the author of AKAZE, ported to Rust by someone else, and then modified later by me (so it has been touched by many hands). It makes use of the following papers:

  • S. Grewenig, J. Weickert, C. Schroers, A. Bruhn. Cyclic Schemes for PDE-Based Image Analysis. Technical Report No. 327, Department of Mathematics, Saarland University, Saarbrücken, Germany, March 2013
  • S. Grewenig, J. Weickert, A. Bruhn. From box filtering to fast explicit diffusion. DAGM, 2010

It is actually not used for Gaussian blur in akaze itself. I am also not 100% sure it can help speed up Gaussian blur, but I think it can, based on what I have seen of its use in akaze; I am no expert. The way this algorithm works is that it determines a number of diffusion steps that start small and grow bigger as the stability increases (which may not apply to Gaussian blur?). See this file for how the specialized blur is applied (with ndarray). I was able to get pretty good speed by writing my filters with ndarray like that. My situation is a bit more specific to this library, but the code is free for you to take.
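The varying step sizes referred to above come from the Fast Explicit Diffusion (FED) scheme in the cited Grewenig, Weickert and Bruhn papers. A small sketch of the step-size formula, as I understand it from those papers (not taken from the akaze code):

```rust
use std::f64::consts::PI;

// FED step sizes: n explicit-diffusion time steps tau_i that are jointly
// stable even though individual steps exceed the usual stability limit
// tau_max. Their total is tau_max * (n^2 + n) / 3, so a few cycles cover a
// large diffusion (i.e. blur) time cheaply.
fn fed_tau(n: usize, tau_max: f64) -> Vec<f64> {
    (0..n)
        .map(|i| {
            let c = (PI * (2 * i + 1) as f64 / (4 * n + 2) as f64).cos();
            tau_max / (2.0 * c * c)
        })
        .collect()
}
```

The steps start near `tau_max / 2` and grow toward the end of the cycle, which matches the "start small and grow bigger" behaviour described above.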

@RReverser (Contributor) commented Mar 16, 2021

Also noticed this while comparing Wasm version of image::imageops::blur and JS implementation of Gaussian blur from https://github.com/nodeca/glur.

At high radius values the difference becomes ridiculous: e.g. for a 3872x2592 image and a 300px radius, the image crate's version executes in 94 seconds, while the JS version executes in 0.6 seconds (the same as for any other radius).

Having linear implementation in image crate would help a lot. fastblur from @Shnatsel's link looks promising, but it would be nice to integrate it into image or imageproc instead of relying on a 3rd-party dependency that is somewhat harder to discover.

@torfmaster mentioned this issue Aug 2, 2024
@torfmaster (Contributor)

If anyone is interested: I created a first draft to resolve this issue in #2302 and I am willing to finalize this.

@Shnatsel Shnatsel added the kind: slow Not wrong, but still unusable label Sep 14, 2024
@Shnatsel (Contributor) commented Oct 7, 2024

FWIW a new crate with multiple fast blur implementations was just released: https://github.com/awxkee/libblur

It relies on unsafe SIMD intrinsics, so it might not be usable as-is for image due to the no-unsafe policy, but we may be able to uplift parts of it or perhaps reuse the fallback implementations.

@awxkee (Contributor) commented Oct 21, 2024

> FWIW a new crate with multiple fast blur implementations was just released: https://github.com/awxkee/libblur
>
> It relies on unsafe SIMD intrinsics, so it might not be usable as-is for image due to the no-unsafe policy, but we may be able to uplift parts of it or perhaps reuse the fallback implementations.

I can extract part of the Gaussian code just for the image crate, with unsafe forbidden, if that makes sense.
I reworked the convolution part without unsafe, and the situation looks like this for a 5000x4000 RGB image:

fast_blur sigma 5: 793.836375ms
pure gaussian_blur 5 kernel size: 30.273625ms

fast_blur sigma 15: 803.334958ms
pure gaussian_blur 15 kernel size: 60.557ms

fast_blur sigma 35: 807.447583ms
pure gaussian_blur 35 kernel size: 133.914708ms

fast_blur sigma 151: 811.069792ms
pure gaussian_blur 151 kernel size: 495.628208ms

fast_blur sigma 251: 847.268125ms
pure gaussian_blur 251 kernel size: 875.879ms

pure gaussian_blur 251 kernel size with rayon: 125.079042ms

Parallelizing the pure Gaussian blur works pretty well: adding rayon instantly gives a ~10x performance improvement, so multi-threading is preferred for blurs.
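The parallelization is natural because each pass of a separable blur processes rows independently. A dependency-free sketch of the idea using std scoped threads (rayon's `par_chunks_mut` does the same with a work-stealing pool; `blur_row` is a stand-in for the real per-row kernel):

```rust
use std::thread;

// Stand-in per-row kernel; real code would run a 1-D blur pass here.
fn blur_row(row: &mut [f32]) {
    for v in row.iter_mut() {
        *v *= 0.5;
    }
}

// Split the image into bands of whole rows and hand each band to its own
// thread. Rows are disjoint, so no locking is needed.
fn blur_rows_parallel(pixels: &mut [f32], width: usize, n_threads: usize) {
    let rows = pixels.len() / width;
    let rows_per_chunk = ((rows + n_threads - 1) / n_threads).max(1);
    thread::scope(|s| {
        for chunk in pixels.chunks_mut(rows_per_chunk * width) {
            s.spawn(move || {
                for row in chunk.chunks_mut(width) {
                    blur_row(row);
                }
            });
        }
    });
}
```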

Perhaps I can also add stack blur, which is O(1) per pixel with respect to radius; it would be roughly 5-10x faster than a blur based on the CLT (iterated box filtering), which is what fast_blur uses here as far as I can tell.
Also, CLT-based blurs converge very quickly and offer little control, if you have noticed: images become completely blurred quite abruptly.

Here is the stack blur time with SIMD and multi-threading disabled, for the same image:

stackblur radius 151: 168.549ms

Even with unsafe forbidden, I wouldn't expect it to slow down 8x.

@Shnatsel (Contributor)

Image's fast_blur was contributed very recently, and has not received a great deal of attention yet. I'd be happy to see a PR migrating it to a faster algorithm.

I am somewhat concerned about the growing implementation complexity, but I think we could take it on for common and performance-sensitive operations such as resizing and blurring.

@awxkee (Contributor) commented Oct 21, 2024

All of this actually sounds pretty unreasonable to me :)

I do not think blur is all that common; it may be somewhat widespread, but common it is not.

And as for a PR: the crate's pixel storage, organization, and access patterns are things I would definitely like to stay away from, and the amount of required code is not small, so complexity would grow noticeably.

@Shnatsel (Contributor)

The blur operation isn't that common, but it usually does become a bottleneck whenever it crops up, so I think optimizing it is still worthwhile.

@awxkee (Contributor) commented Oct 21, 2024

I may agree.

But the issues with pixel storage, the lack of truly random access via iterators, and the lack of generic traits remain problems.

I may publish a separate crate, but having to deal with the crate's inherent problems makes it immediately unreasonable.

@Shnatsel (Contributor)

If you could list, in #2358, the exact problems with the crate's API that prevent these kinds of algorithms from being written on top of it, that would be very helpful!

We're looking to improve the API and ship v1.0 sometime in the foreseeable future, so information on what exactly is missing right now is very welcome.

@awxkee (Contributor) commented Oct 21, 2024

Raw row data must be fully accessible, either per color component separately or as a full set, e.g. a whole RGBA pixel (this is optional; if each color component is accessible, that is completely fine), via an iterator without any checks, and the call must be #[inline(always)].

This can be done with a contiguously allocated vector, which is pretty common and very nice.

Or by rows, if necessary. This is very good sometimes, and sometimes not, and it might confuse those who are not familiar with it: it is an approach where, instead of contiguous memory, the layout may contain gaps, and only references to the start of each row are stored, so the real image is a slice of row references, &[&[T]].

This would at least make methods like this one performant, and it would actually prevent some people from doing bad things, because a newcomer searching for how to do this would find that approach.

For any real algorithms, all generic traits must be implemented as well. And "all" is a very vague requirement, because sometimes you need bitxor or shift-right, sometimes fma when the device has it, and sometimes a natural logarithm or π obtained from a trait.

At the very least, all basic arithmetic must be covered.
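To make the row-access request concrete, here is a hypothetical sketch of the kind of API being described: a trait exposing each row as a plain slice so that hot loops index raw memory instead of going through per-pixel bounds-checked accessors. None of these names exist in the image crate; they only illustrate the shape.

```rust
// Hypothetical row-access trait: rows come back as plain slices.
trait RowAccess {
    type Subpixel;
    fn row(&self, y: usize) -> &[Self::Subpixel];
    fn row_mut(&mut self, y: usize) -> &mut [Self::Subpixel];
}

// One contiguous, row-major channel plane (the "contiguously allocated
// vector" layout mentioned above).
struct PlaneF32 {
    width: usize,
    data: Vec<f32>,
}

impl RowAccess for PlaneF32 {
    type Subpixel = f32;

    #[inline(always)]
    fn row(&self, y: usize) -> &[f32] {
        &self.data[y * self.width..(y + 1) * self.width]
    }

    #[inline(always)]
    fn row_mut(&mut self, y: usize) -> &mut [f32] {
        let w = self.width;
        &mut self.data[y * w..(y + 1) * w]
    }
}
```

Inside a convolution loop, `row(y)` is fetched once and then indexed directly, so the compiler can hoist the bounds check out of the inner loop.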

@awxkee (Contributor) commented Oct 21, 2024

The thing is that for convolution, for example, you have to access each pixel many, many times: for a blur with kernel size 151, each pixel in the image requires access to at least 150 + 150 surrounding pixels. If you can't optimize that away, it's game over.
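The arithmetic behind that point, as a tiny sketch (function names are mine): for a kernel of width k, a dense 2-D convolution reads about k² neighbours per pixel, while a separable row-plus-column pass reads about 2k, and a running-sum box filter reads O(1). That is why per-access overhead in the inner loop dominates everything.

```rust
// Approximate neighbour reads per output pixel for a kernel of width k.
fn reads_dense(k: u64) -> u64 {
    k * k // full 2-D convolution window
}

fn reads_separable(k: u64) -> u64 {
    2 * k // one horizontal pass + one vertical pass
}
```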

@Shnatsel (Contributor)

Yes, GenericImageView not allowing viewing rows is a known issue: #2300

@Shnatsel (Contributor)

Thank you for the input! I've linked your message from the relevant issues on the bug tracker. We'll see what we can do about these limitations.

@awxkee (Contributor) commented Oct 21, 2024

Yup. If that seems reasonable, I can still do the blur in a separate crate, because fixing all these limitations could take years :)

@Shnatsel (Contributor)

We're hoping to start a concerted effort on fixing the API sometime around January. But we won't know if they're any good until someone builds a complex image processing algorithm on top of them. Would you be available then to kick the tires of the new APIs and try to port stack blur to them?

@awxkee (Contributor) commented Oct 22, 2024

Yes, sure. Once the API looks at least worth trying, I can try to build something on top of it.
I wouldn't say stack blur is complex, but OK :)

Let me know if I can help.
