
Improving archiver space efficiency #3

Open · AsterNighT opened this issue Jul 19, 2023 · 13 comments

AsterNighT (Contributor) commented Jul 19, 2023

The idea of this project is quite interesting. I haven't really tested it for long, but it seems promising. The greatest drawback right now seems to be the archiver's storage consumption: it takes only about 3 minutes to produce 100 MB of screenshots.

The screenshots seem to contain a lot of duplicated content. It would be good to have a filter before an image is ever archived. A first idea is to check for duplicate images with hashes. It is also possible to "rank" images based on the text extracted from them, but this would require careful research and design.
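The hash idea could be prototyped with a difference hash (dHash). This is a minimal sketch, assuming each screenshot has already been decoded (and, in practice, downscaled to something like 9×8 pixels) into a 2-D grid of grayscale values; the threshold is an assumption, not a tuned value:

```python
# Sketch of hash-based screenshot deduplication (hypothetical thresholds).
def dhash(pixels: list[list[int]]) -> int:
    """Difference hash: one bit per horizontal neighbour comparison."""
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

seen: list[int] = []  # hashes of screenshots already archived

def should_archive(pixels: list[list[int]], threshold: int = 2) -> bool:
    """Drop a screenshot whose hash is near-identical to one already kept."""
    h = dhash(pixels)
    if any(hamming(h, s) <= threshold for s in seen):
        return False
    seen.append(h)
    return True
```

Identical (or nearly identical) frames land on the same hash neighbourhood, so only the first copy is archived; the Hamming threshold controls how aggressive the filter is.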

STRRL (Owner) commented Jul 20, 2023

Most mainstream video codecs (H.265 and others) reduce storage usage by encoding only the differences between frames. So I think this goal could be achieved by the future video-based archiver, which is also on the ROADMAP.

What do you think about it?

STRRL (Owner) commented Jul 20, 2023

But there are a lot of tuning options in the video encoding process; the early implementation probably won't be the best one (most likely)... 🫣

STRRL (Owner) commented Jul 20, 2023

I just tried it on my Linux machine:

  • captured 2 screens at 4K resolution for about 10 minutes
  • the images take 176 MiB
  • used ffmpeg for video encoding, with the default profile, H.265 encoding, and a 0.5 fps framerate
  • encoding all the images into video took 12 s, but consumed all the CPU while it ran
  • the final video output is 4 MiB for one screen and 3 MiB for the other
  • compression ratio: 176 / (4 + 3) ≈ 25×
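The experiment above can be approximated with an ffmpeg invocation along these lines. This is a hypothetical reconstruction: the exact flags, paths, and profile used were not shown, so treat them as assumptions:

```python
# Hypothetical ffmpeg command for the experiment above: encode a directory
# of PNG screenshots at 0.5 fps with the default libx265 profile.
def build_ffmpeg_cmd(pattern: str, out: str, fps: float = 0.5) -> list[str]:
    return [
        "ffmpeg",
        "-framerate", str(fps),   # one input frame every 2 seconds
        "-pattern_type", "glob",
        "-i", pattern,            # e.g. "screen-0/*.png" (placeholder path)
        "-c:v", "libx265",        # H.265/HEVC with default preset/CRF
        out,
    ]

cmd = build_ffmpeg_cmd("screen-0/*.png", "screen-0.mp4")

# Reported compression ratio: 176 MiB of images -> 4 MiB + 3 MiB of video.
ratio = 176 / (4 + 3)  # ~25x
```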

STRRL (Owner) commented Jul 20, 2023

With the default H.265 encoding, the video takes about 4 MiB for every 10 minutes.

STRRL (Owner) commented Jul 20, 2023

So it would take about 200 MiB for 8 hours of daily usage, and about 1.4 GiB per week.

STRRL (Owner) commented Jul 20, 2023

I think it's close enough to the performance of Rewind on macOS.

STRRL (Owner) commented Jul 20, 2023

The file size of the video always depends on the content, so it would grow with more complex content, but I don't think it would exceed 10× more space.

I think it's good enough to use for now. 🤩

What do you think about it? @AsterNighT

AsterNighT (Author) commented Jul 21, 2023

That makes sense. Actually, I hadn't heard of Rewind before.
The way Dejavu runs now uses about 15% of my CPU time (laptop, 6800H; mostly tesseract, I suppose). And I think we would need something like a live-streaming encoder. Not sure how much extra CPU that would take.

AsterNighT (Author) commented Jul 24, 2023

I tried a few seemingly viable ways to do screen capturing and video encoding.

  1. Call ffmpeg directly. It works, and the overhead is minimal. ffmpeg itself is cross-platform, but its arguments are not, and it does not provide an interface for processing the frames.
  2. Capture screens and feed them to https://github.com/ralfbiedert/openh264-rs. This does not seem to support frame-by-frame encoding (or it is supported only by the raw APIs; the documentation is limited). The documentation claims it is cross-platform; I haven't verified that personally.
  3. https://github.com/astraw/vpx-encode gives an example of encoding with libvpx. From the code, it supports frame-by-frame encoding, but it builds on neither my Windows nor my Linux machine.
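For option 1, one way to still process frames in-process before encoding is to pipe raw frames into ffmpeg's stdin rather than pointing it at image files. A minimal sketch (the resolution, pixel format, and output path here are placeholders, not the project's actual configuration):

```python
import subprocess

def rawvideo_cmd(width: int, height: int, out: str,
                 fps: float = 0.5) -> list[str]:
    """ffmpeg arguments for encoding raw RGB frames read from stdin."""
    return [
        "ffmpeg",
        "-f", "rawvideo", "-pix_fmt", "rgb24",
        "-s", f"{width}x{height}",
        "-framerate", str(fps),
        "-i", "-",            # "-" = read frame data from stdin
        "-c:v", "libx265",
        out,
    ]

def start_encoder(width: int, height: int, out: str) -> subprocess.Popen:
    # Each captured frame (width*height*3 RGB bytes) can be filtered or
    # transformed in-process, then written to proc.stdin.
    return subprocess.Popen(rawvideo_cmd(width, height, out),
                            stdin=subprocess.PIPE)

# usage (assuming `frame` holds width*height*3 bytes of RGB data):
# proc = start_encoder(3840, 2160, "screen-0.mp4")
# proc.stdin.write(frame)
# proc.stdin.close(); proc.wait()
```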

STRRL (Owner) commented Jul 25, 2023

I dove into the details of what Rewind does.

A first idea to me is to check for duplicated images with hashes. It is also possible to "rank" the images based on the text extracted from it, but this would require careful research and design.

It really does something similar: when there aren't many changes in the content, it drops some pictures, falling back to about 1 image per 20 s. When there are lots of changes on the screen, it uses the full 0.5 fps.
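That adaptive behaviour could be sketched like this. The change metric and threshold are assumptions for illustration, not Rewind's actual values:

```python
# Pick the capture interval from how much the screen changed recently:
# quiet screen -> ~1 frame per 20 s; busy screen -> 0.5 fps (1 frame per 2 s).
def capture_interval(change_score: float, threshold: float = 0.05) -> float:
    """change_score: fraction of pixels differing from the last kept frame."""
    return 2.0 if change_score >= threshold else 20.0
```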

STRRL (Owner) commented Jul 25, 2023

There are lots of algorithms for image similarity detection... I have been lost in them.

Maybe I'll implement a simple one (histogram comparison) and a heavy one (OpenCV), and make it extensible.
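The "simple one" could be a histogram intersection over grayscale pixel values, along these lines (a stdlib-only sketch; in practice the pixels would come from a decoded image buffer, and the bin count is an arbitrary choice):

```python
def histogram(pixels: list[int], bins: int = 16) -> list[float]:
    """Normalised grayscale histogram for pixel values in 0-255."""
    counts = [0] * bins
    for p in pixels:
        counts[p * bins // 256] += 1
    total = len(pixels)
    return [c / total for c in counts]

def similarity(a: list[int], b: list[int]) -> float:
    """Histogram intersection: 1.0 = identical distributions, 0.0 = disjoint."""
    ha, hb = histogram(a), histogram(b)
    return sum(min(x, y) for x, y in zip(ha, hb))
```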

AsterNighT (Author) commented:

I'm not sure, but would manually detecting image similarity outperform just encoding with a video codec? It would be more tunable, indeed. But if the goal is to compress text, wouldn't it be more accessible to use a compression algorithm than to manually detect text similarity and deduplicate it?

AsterNighT (Author) commented Jul 26, 2023

There are lots of algorithms for image similarity detection... I have been lost in them.

Maybe I'll implement a simple one (histogram comparison) and a heavy one (OpenCV), and make it extensible.

I don't think simple algorithms like histogram comparison would be very effective. Consider this: you are reading a very long markdown article on, say, GitHub. There will be loads of text, and clearly you would like that text to be recorded. But the histogram of the article stays almost the same as you scroll (after all, it is text only; in that sense the frames are "similar").

Or rather, maybe the filtering should be done after tesseract, not before, since it is ultimately the text that is searched, not the image itself.

I'm thinking of something like "retain the word set of the most recently captured X screenshots and calculate the similarity between the current picture and that set". I've never done such a thing before, so I'm not sure if it works.
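That last idea could be prototyped as a Jaccard similarity between the current screenshot's OCR'd words and the union of the last X kept screenshots' words. A sketch, where the window size, the 0.9 threshold, and the upstream tesseract integration are all assumptions:

```python
from collections import deque

class WordSetFilter:
    """Keep a screenshot only if its OCR'd words add enough new information
    relative to the most recently kept `window` screenshots."""

    def __init__(self, window: int = 5, threshold: float = 0.9):
        self.recent = deque(maxlen=window)  # word sets of kept screenshots
        self.threshold = threshold

    def should_keep(self, words: set) -> bool:
        pool = set().union(*self.recent) if self.recent else set()
        if pool and words:
            jaccard = len(words & pool) / len(words | pool)
            if jaccard >= self.threshold:
                return False  # mostly duplicate text; drop the screenshot
        self.recent.append(words)
        return True
```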
