
Merge history db files into a common db file to avoid db file explosion #518

Open
lee3164 wants to merge 3 commits into master
Conversation

@lee3164 (Author) commented Feb 28, 2020

No description provided.

@brian-brazil (Contributor)

I don't believe this is safe, nothing stops something else with that pid coming along while this is running. In addition there's nothing stopping this being run concurrently.

@lee3164 (Author) commented Feb 29, 2020

I don't believe this is safe, nothing stops something else with that pid coming along while this is running. In addition there's nothing stopping this being run concurrently.

  1. I think this method should be called at process exit, so nobody should call it at runtime. Both uwsgi and gunicorn have a hook to run a method at worker exit; see the sketch after this list.
  2. For safety, if you insist on it, we can use a file lock, but I think process exit is a low-probability event.
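For reference, a minimal gunicorn.conf.py sketch of the worker-exit hook mentioned in point 1, mirroring the multiprocess setup documented for this library (the uwsgi equivalent would use its own exit hook):

    from prometheus_client import multiprocess

    def child_exit(server, worker):
        # gunicorn calls this in the master when a worker exits;
        # mark_process_dead cleans up that worker's db files
        multiprocess.mark_process_dead(worker.pid)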

@lee3164 (Author) commented Feb 29, 2020

#443 describes the same problem I have. I used multiprocess mode for a few days and found that the scrape endpoint's latency was getting slower and slower, so I checked prometheus_multiproc_dir and found 1000+ db files: workers sometimes exit, so the number of db files increases dramatically.

@brian-brazil (Contributor)

We can't make any assumptions about how this code is used, and given it's in a multi-process environment it must avoid any race conditions.

How would this work for gauges?

@lee3164 (Author) commented Mar 5, 2020

We can't make any assumptions about how this code is used, and given it's in a multi-process environment it must avoid any race conditions.

How would this work for gauges?

As you can see in the code, we use the same method to merge metrics as is used when they are scraped. I have used this commit in my production environment and it works well; the latency of the scrape endpoint has been reduced significantly. We can use a file lock to avoid race conditions.

@lee3164 (Author) commented Mar 5, 2020

We should point out that this code should only be used in multiprocess mode.

@lee3164 (Author) commented Mar 5, 2020

I have added process-lock code, taken from django.core.files.locks.
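On POSIX, django.core.files.locks is essentially a thin wrapper around fcntl.flock. A minimal sketch of that kind of advisory lock (the merge_lock name is illustrative, not the PR's actual helper):

    import contextlib
    import fcntl

    @contextlib.contextmanager
    def merge_lock(lock_path):
        # open (or create) the lock file and hold an exclusive advisory lock
        # for the duration of the block
        with open(lock_path, 'a') as lock_file:
            fcntl.flock(lock_file, fcntl.LOCK_EX)
            try:
                yield
            finally:
                fcntl.flock(lock_file, fcntl.LOCK_UN)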

MmapedDict(merge_file).close()

# do merge, here we use the same method to merge
metrics = MultiProcessCollector.merge(files + merge_files, accumulate=False)
@lee3164 (Author)
@brian-brazil here I use the same method to merge metrics


I believe this will cause a deadlock: mark_process_dead is already holding the lock, and calling merge will try to acquire the same lock again.


I don't believe it will -- it looks like lock acquisition via flock is reentrant safe, but unlocking via flock isn't, so I think this will prematurely release the lock after the merge call here, at least on POSIX systems. I don't know what the locking behaviour on NT will be.

@brian-brazil (Contributor)

What happens if a process exists, sets a gauge, exits, and a new process comes along with the same pid?

@lee3164 (Author) commented Mar 5, 2020

What happens if a process exists, sets a gauge, exits, and a new process comes along with the same pid?

            for s in metric.samples:
                name, labels, value, timestamp, exemplar = s
                if metric.type == 'gauge':
                    without_pid_key = (name, tuple([l for l in labels if l[0] != 'pid']))
                    if metric._multiprocess_mode == 'min':
                        current = samples_setdefault(without_pid_key, value)
                        if value < current:
                            samples[without_pid_key] = value
                    elif metric._multiprocess_mode == 'max':
                        current = samples_setdefault(without_pid_key, value)
                        if value > current:
                            samples[without_pid_key] = value
                    elif metric._multiprocess_mode == 'livesum':
                        samples[without_pid_key] += value
                    else:  # all/liveall
                        samples[(name, labels)] = value
  1. We have 5 modes for Gauge in multiprocess mode: all, min, max, liveall, and livesum. In livesum and liveall mode we delete the dead process's db file; in min and max mode we compare the old and new values and choose the correct one; in all mode, if this event happens, the new value overwrites the old one, with the old code as well as with my commit.
  2. A process exiting and a new process getting the same pid is a low-probability event.

@brian-brazil (Contributor)

A process exiting and a new process getting the same pid is a low-probability event.

It's a certainty though at scale.

@lee3164 (Author) commented Mar 5, 2020

A process exiting and a new process getting the same pid is a low-probability event.

It's a certainty though at scale.

Yes, Gauge is really a problem in multiprocess mode; developers should think about it carefully. Do you have a good idea for reducing the latency in multiprocess mode when the number of db files increases?

@brian-brazil (Contributor)

Currently the main approach is avoiding creating lots of files in the first place.

@PierreF commented Aug 12, 2020

I'm also hitting this issue (uWSGI restarts workers often, resulting in lots of files, which causes the /metrics endpoint to become too slow). I'm interested in and willing to help get this PR merged.

With the exception of tests that fail due to:

  • nested lock acquisition: mark_process_dead and merge both take the lock
  • the pid=merge label being added on gauges in addition to pid=$original_pid, so we should either not add this duplicated pid label or ensure it's added "before".

What blocks this PR?

  • Locking seems correct. The merge is only performed on a process's own pid files (no pid-reuse issue) and with the lock file held.
  • Gauges are merged as they already are; we just read fewer files.
    • On gauges, this PR indeed doesn't fix the fact that the number of gauge metrics will increase... but that's not a regression.

@brian-brazil (Contributor)

Locking seems correct. The merge is only performed on a process's own pid files (no pid-reuse issue)

That's not the case, mark_process_dead runs in a different process.

@lee3164 (Author) commented Aug 13, 2020

I'm using this feature in my production environment now and there have been no problems so far.

@lee3164 (Author) commented Aug 13, 2020

Locking seems correct. The merge is only performed on a process's own pid files (no pid-reuse issue)

That's not the case, mark_process_dead runs in a different process.

mark_process_dead runs when the process is about to exit, and it only does its own merge work; using a lock serializes this process.

@PierreF commented Aug 13, 2020

In the uWSGI case, mark_process_dead will probably run at process exit (so with its own PID). At least that's how I did it (using a Python atexit handler). But in Gunicorn, mark_process_dead will be called by the master process for another PID.

This indeed opens a (small) race condition: mark_process_dead may be run with the PID of a terminated worker while a new worker has started using the same PID. The lock won't help there, since the new worker keeps an open file descriptor to the file that mark_process_dead will remove.

This should, however, be quite rare, as it requires the OS to reassign the PID of a recently terminated process. But nothing forbids it from happening.
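A sketch of the atexit-in-worker approach described above for uWSGI, assuming uwsgidecorators is available; note that atexit only runs on a clean interpreter shutdown, which is exactly the limitation discussed further down in this thread:

    import atexit
    import os

    from prometheus_client import multiprocess
    from uwsgidecorators import postfork

    @postfork
    def register_cleanup():
        # runs in each worker after fork, so os.getpid() is the worker's own PID
        atexit.register(multiprocess.mark_process_dead, os.getpid())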

@PierreF commented Aug 13, 2020

BTW, this race condition is not new. Gauge livesum and liveall already have this issue (a new worker opens the file, then mark_process_dead run by the master removes it). Or did I miss something that avoids this race condition and could be reused here?

@lee3164 (Author) commented Aug 13, 2020

I think we can use pid + process_create_time to name the file; this would avoid two processes with the same pid ending up with the same file name. See the sketch below.
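A sketch of that naming idea, assuming psutil is available (the current code derives file names from the metric type and pid only, so this is purely illustrative):

    import os

    import psutil

    def db_filename(metric_type):
        # include the process creation time so a reused pid gets a new file name
        create_time = int(psutil.Process(os.getpid()).create_time())
        return '{0}_{1}_{2}.db'.format(metric_type, os.getpid(), create_time)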

@brian-brazil (Contributor)

I think we can use pid + process_create_time to name the file; this would avoid two processes with the same pid ending up with the same file name.

There's nothing stopping two processes being created and terminated in the same second with the same PID.

(a new worker opens the file, then mark_process_dead run by the master removes it).

True, we should look at fixing that.

@lee3164 (Author) commented Aug 13, 2020

[screenshot of a comment in the psutil source]

There's nothing stopping two processes being created and terminated in the same second with the same PID.

Yes, but I think this is a very, very low-probability event. In psutil, I also found a comment related to this problem (screenshot above).

@PierreF commented Aug 13, 2020

In psutil, create_time has more precision than in our case (it's a floating-point number of seconds). But I agree that this event looks very low-probability (and in my opinion it's acceptable; maybe the merge should be an opt-in option?).

The best solution I can think of that doesn't have a "low probability" failure mode is to add a new method that must be called by the worker itself at exit, not by the master. It would merge the files. Users may or may not register this method, depending on whether the merge is wanted.

@brian-brazil (Contributor)

to add a new method that must be called by the worker itself at exit, not by the master.

I doubt that our users will have that level of control. This needs to work with all forms of multiprocess setups, including gunicorn and multiprocessing, and handle things like a segfault of a child correctly.

@lee3164 (Author) commented Aug 13, 2020

In psutil, create_time has more precision than in our case (it's a floating-point number of seconds). But I agree that this event looks very low-probability (and in my opinion it's acceptable; maybe the merge should be an opt-in option?).

The best solution I can think of that doesn't have a "low probability" failure mode is to add a new method that must be called by the worker itself at exit, not by the master. It would merge the files. Users may or may not register this method, depending on whether the merge is wanted.

Why can't a gunicorn worker register this method with atexit?

@brian-brazil (Contributor)

atexit is not generally reliable, and things might happen before it gets registered.

@indrif commented Sep 24, 2020

I still don't understand how to handle the growing number of db files. When is it safe to remove them? The documentation states that the files should be cleared on every deploy, which suggests that they can be removed without any special hesitation unless a process is currently writing to them. So what happens if I remove files? Do I risk losing data? Then how come it works to purge the files on a hard restart? Wouldn't an easy solution be to just have some kind of cleanup cron that removes old files?

I feel that the documentation regarding this issue is a bit lacking.

@brian-brazil (Contributor)

The documentation says that you should delete them when the process restarts; there's nothing more to it. Anything else is unsafe.

This issue is also not the place for usage questions, please take those to the prometheus-users mailing list.
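For context, the documented approach amounts to wiping the multiprocess directory whenever the whole application (not an individual worker) starts. A gunicorn.conf.py sketch, assuming prometheus_multiproc_dir points at a directory dedicated to this application:

    import glob
    import os

    def on_starting(server):
        # runs once in the master before any workers start
        path = os.environ['prometheus_multiproc_dir']
        for f in glob.glob(os.path.join(path, '*.db')):
            os.remove(f)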

@mhumpula left a comment

def collect(self) should be guarded by the lock too. Otherwise the file listing and the subsequent merge can have different views of the directory contents.
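A sketch of that suggestion, reusing the merge_lock helper sketched earlier; self._path is the multiprocess directory as in MultiProcessCollector, and the lock-file name is illustrative:

    import glob
    import os

    def collect(self):
        lock_path = os.path.join(self._path, 'merge.lock')
        with merge_lock(lock_path):
            # list and merge under the same lock so both see the same set of files
            files = glob.glob(os.path.join(self._path, '*.db'))
            return self.merge(files, accumulate=True)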

@vearutop

The issue from 2017, #204, is still not resolved, suggesting this client library is designed to serve metrics with multi-second latency.

I appreciate that @brian-brazil is so careful about unlikely race conditions, but it means this library is in no way production-ready.

@maciej-gol commented May 27, 2022

Have you guys looked into a shared database as storage? For example, local Redis or memcached? Their semantics for missing values align with Prometheus's (new/expired keys initialize to 0).
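A minimal sketch of that shared-storage idea using the redis-py client (illustrative only; this is not part of prometheus_client):

    import redis

    r = redis.Redis()  # assumes a local Redis on the default port

    def inc_counter(name, amount=1.0):
        # INCRBYFLOAT is atomic across worker processes, and a missing key
        # starts from 0, matching counter semantics
        r.incrbyfloat('metrics:' + name, amount)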

The documentation correctly states that in Python, multiprocessing is the dominant solution for handling concurrency. Unfortunately, it doesn't also mention that, due to how Python manages memory, another common practice is rotating workers after a certain number (+/- jitter) of requests have been handled. This makes the orphaned-pidfile issue more visible.

The suggested integration (removing the db pidfile) suffers from inflating counter values. When only one process gets rotated and its metrics file purged, Prometheus will scrape counters with lower values (lower by the amount stored in the removed file). Consequently, it will assume the new counter value is a full increment, which is usually not true. We also have no way to determine how much the counter had actually increased since the purge (if it's below the previously observed value).

A shared database complicates the deployment slightly (you need to maintain a local Redis/memcached). For cases where you already have one, though, this might be a good option.
