
Problem: AM1.16 MCPClient tries to refresh prometheus tmp database when prometheus is not enabled (and mcpclient crashed) #1711

mamedin opened this issue Oct 17, 2024 · 6 comments



mamedin commented Oct 17, 2024

Expected behaviour

  • No crash in MCPClient
  • If Prometheus is disabled in MCPClient, there is no need to refresh the tmp Prometheus database (it doesn't exist)

Current behaviour

Sometimes MCPClient tries to refresh the Prometheus database. I suspect this happens when a worker pool member is restarted after reaching its task limit (the new AM 1.16 pool-of-workers feature).

It happens at the start of a transfer, but not in every transfer.

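The suspected trigger — a pool worker being replaced after hitting its task limit — can be illustrated with Python's stock multiprocessing pool. This is only an analogy: the `maxtasksperchild` knob and the numbers below are assumptions for the sketch, not AM 1.16's actual pool implementation.

```python
import os
from multiprocessing import Pool

def task(i):
    # each result is the PID of the worker process that ran the task
    return os.getpid()

# maxtasksperchild makes the pool replace a worker after it has handled
# that many task batches, so worker PIDs change mid-run -- analogous to
# the AM 1.16 worker limit restarting pool members
with Pool(processes=1, maxtasksperchild=2) as pool:
    pids = set(pool.map(task, range(6)))

print(sorted(pids))
```

Any per-process state (such as a Prometheus tmp metrics file keyed by PID) has to survive these replacements, which is where a missing directory would bite.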
MCPServer log error:

DEBUG     2024-10-16 13:00:07  archivematica.mcp.server.packages:packages:_copy_from_transfer_sources:243:  source: 6ff352d0-f495-4330-920a-dcbe20a817b5.tar.gz, destination: /var/archivematica/sharedDirectory/tmp/tmp1gkokc0b/6ff352d0-f495-4330-920a-dcbe20a817b5.tar.gz
DEBUG     2024-10-16 13:00:11  archivematica.mcp.server.packages:packages:_start_package_transfer_with_auto_approval:424:  Package cbf9eee3-b327-4bb9-a2a3-915438751536: moving package to processing directory
DEBUG     2024-10-16 13:00:11  archivematica.mcp.server.packages:packages:_start_package_transfer_with_auto_approval:429:  Package cbf9eee3-b327-4bb9-a2a3-915438751536: starting workflow processing
DEBUG     2024-10-16 13:00:11  archivematica.mcp.server.jobs.chain:chain:__init__:89:  Creating JobChain 167dc382-4ab1-4051-8e22-e7f1c1bf3e6f for package cbf9eee3-b327-4bb9-a2a3-915438751536 (initial link 3229e01f-adf3-4294-85f7-4acb01b3fbcf)
INFO      2024-10-16 13:00:11  archivematica.mcp.server.jobs.client:client:run:104:  Running Extract zipped bag transfer (package cbf9eee3-b327-4bb9-a2a3-915438751536)
DEBUG     2024-10-16 13:00:11  archivematica.mcp.server.jobs.tasks:gearman_backend:submit:217:  Submitted gearman job d92fafcd-0b61-4ac1-93af-3e714db63a75 (extractzippedtransfer_v0.0)
ERROR     2024-10-16 13:00:11  archivematica.mcp.server.jobs.tasks:gearman_backend:update_task_results:238:  Gearman task batch d92fafcd-0b61-4ac1-93af-3e714db63a75 failed to execute
INFO      2024-10-16 13:00:11  archivematica.mcp.server.jobs.client:client:run:104:  Running Email fail report (package cbf9eee3-b327-4bb9-a2a3-915438751536)
DEBUG     2024-10-16 13:00:11  archivematica.mcp.server.jobs.tasks:gearman_backend:submit:217:  Submitted gearman job f19c544a-6b26-421c-bb3b-8552ec6c4cd6 (emailfailreport_v0.0)
ERROR     2024-10-16 13:00:11  archivematica.mcp.server.jobs.tasks:gearman_backend:update_task_results:238:  Gearman task batch f19c544a-6b26-421c-bb3b-8552ec6c4cd6 failed to execute
INFO      2024-10-16 13:00:11  archivematica.mcp.server.jobs.client:client:run:104:  Running Move to the failed directory (package cbf9eee3-b327-4bb9-a2a3-915438751536)
DEBUG     2024-10-16 13:00:11  archivematica.mcp.server.jobs.tasks:gearman_backend:submit:217:  Submitted gearman job 664f634a-9034-4a54-b85e-91438aae55d3 (movetransfer_v0.0)
ERROR     2024-10-16 13:00:11  archivematica.mcp.server.jobs.tasks:gearman_backend:update_task_results:238:  Gearman task batch 664f634a-9034-4a54-b85e-91438aae55d3 failed to execute

MCPClient log in syslog:

Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]: archivematicaClient.py: DEBUG     2024-10-16 13:00:11,421  archivematica.mcp.client.gearman:handle_job:95:  Gearman job request d92fafcd-0b61-4ac1-93af-3e714db63a75 received for extractzippedtransfer_v0.0
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]: archivematicaClient.py: ERROR     2024-10-16 13:00:11,424  archivematica.mcp.client.gearman:on_job_exception:111:  An unhandled exception occurred processing a Gearman job
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]: Traceback (most recent call last):
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:  File "/usr/share/archivematica/virtualenvs/archivematica/lib/python3.9/site-packages/gearman/worker.py", line 218, in on_job_execute
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:    job_result = function_callback(self, current_job)
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:  File "/opt/archivematica/archivematica/src/MCPClient/lib/client/gearman.py", line 101, in handle_job
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:    with metrics.task_execution_time_histogram.labels(script_name=task_name).time():
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:  File "/usr/share/archivematica/virtualenvs/archivematica/lib/python3.9/site-packages/prometheus_client/metrics.py", line 203, in labels
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:    self._metrics[labelvalues] = self.__class__(
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:  File "/usr/share/archivematica/virtualenvs/archivematica/lib/python3.9/site-packages/prometheus_client/metrics.py", line 600, in __init__
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:    super().__init__(
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:  File "/usr/share/archivematica/virtualenvs/archivematica/lib/python3.9/site-packages/prometheus_client/metrics.py", line 151, in __init__
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:    self._metric_init()
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:  File "/usr/share/archivematica/virtualenvs/archivematica/lib/python3.9/site-packages/prometheus_client/metrics.py", line 628, in _metric_init
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:    self._sum = values.ValueClass(self._type, self._name, self._name + '_sum', self._labelnames, self._labelvalues, self._documentation)
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:  File "/usr/share/archivematica/virtualenvs/archivematica/lib/python3.9/site-packages/prometheus_client/values.py", line 68, in __init__
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:    self.__reset()
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:  File "/usr/share/archivematica/virtualenvs/archivematica/lib/python3.9/site-packages/prometheus_client/values.py", line 82, in __reset
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:    files[file_prefix] = MmapedDict(filename)
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:  File "/usr/share/archivematica/virtualenvs/archivematica/lib/python3.9/site-packages/prometheus_client/mmap_dict.py", line 64, in __init__
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:    self._f = open(filename, 'rb' if read_mode else 'a+b')
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]: FileNotFoundError: [Errno 2] No such file or directory: '/tmp/prometheus-stats7rrndq4_/histogram_1782.db'
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]: archivematicaClient.py: DEBUG     2024-10-16 13:00:11,461  archivematica.mcp.client.gearman:handle_job:95:  Gearman job request
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]: archivematicaClient.py: ERROR     2024-10-16 13:00:11,462  archivematica.mcp.client.gearman:on_job_exception:111:  An unhandled exception occurred processing a Gearman job
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]: Traceback (most recent call last):
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:  File "/usr/share/archivematica/virtualenvs/archivematica/lib/python3.9/site-packages/gearman/worker.py", line 218, in on_job_execute
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:    job_result = function_callback(self, current_job)
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:  File "/opt/archivematica/archivematica/src/MCPClient/lib/client/gearman.py", line 101, in handle_job
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:    with metrics.task_execution_time_histogram.labels(script_name=task_name).time():
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:  File "/usr/share/archivematica/virtualenvs/archivematica/lib/python3.9/site-packages/prometheus_client/metrics.py", line 203, in labels
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:    self._metrics[labelvalues] = self.__class__(
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:  File "/usr/share/archivematica/virtualenvs/archivematica/lib/python3.9/site-packages/prometheus_client/metrics.py", line 600, in __init__
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:    super().__init__(
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:  File "/usr/share/archivematica/virtualenvs/archivematica/lib/python3.9/site-packages/prometheus_client/metrics.py", line 151, in __init__
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:    self._metric_init()
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:  File "/usr/share/archivematica/virtualenvs/archivematica/lib/python3.9/site-packages/prometheus_client/metrics.py", line 628, in _metric_init
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:    self._sum = values.ValueClass(self._type, self._name, self._name + '_sum', self._labelnames, self._labelvalues, self._documentation)
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:  File "/usr/share/archivematica/virtualenvs/archivematica/lib/python3.9/site-packages/prometheus_client/values.py", line 68, in __init__
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:    self.__reset()
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:  File "/usr/share/archivematica/virtualenvs/archivematica/lib/python3.9/site-packages/prometheus_client/values.py", line 82, in __reset
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:    files[file_prefix] = MmapedDict(filename)
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:  File "/usr/share/archivematica/virtualenvs/archivematica/lib/python3.9/site-packages/prometheus_client/mmap_dict.py", line 64, in __init__
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:    self._f = open(filename, 'rb' if read_mode else 'a+b')
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]: FileNotFoundError: [Errno 2] No such file or directory: '/tmp/prometheus-stats7rrndq4_/histogram_1782.db'
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]: archivematicaClient.py: DEBUG     2024-10-16 13:00:11,489  archivematica.mcp.client.gearman:handle_job:95:  Gearman job request 664f634a-9034-4a54-b85e-91438aae55d3 received for movetransfer_v0.0
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]: archivematicaClient.py: ERROR     2024-10-16 13:00:11,490  archivematica.mcp.client.gearman:on_job_exception:111:  An unhandled exception occurred processing a Gearman job
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]: Traceback (most recent call last):
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:  File "/usr/share/archivematica/virtualenvs/archivematica/lib/python3.9/site-packages/gearman/worker.py", line 218, in on_job_execute
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:    job_result = function_callback(self, current_job)
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:  File "/opt/archivematica/archivematica/src/MCPClient/lib/client/gearman.py", line 101, in handle_job
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:    with metrics.task_execution_time_histogram.labels(script_name=task_name).time():
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:  File "/usr/share/archivematica/virtualenvs/archivematica/lib/python3.9/site-packages/prometheus_client/metrics.py", line 203, in labels
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:    self._metrics[labelvalues] = self.__class__(
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:  File "/usr/share/archivematica/virtualenvs/archivematica/lib/python3.9/site-packages/prometheus_client/metrics.py", line 600, in __init__
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:    super().__init__(
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:  File "/usr/share/archivematica/virtualenvs/archivematica/lib/python3.9/site-packages/prometheus_client/metrics.py", line 151, in __init__
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:    self._metric_init()
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:  File "/usr/share/archivematica/virtualenvs/archivematica/lib/python3.9/site-packages/prometheus_client/metrics.py", line 628, in _metric_init
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:    self._sum = values.ValueClass(self._type, self._name, self._name + '_sum', self._labelnames, self._labelvalues, self._documentation)
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:  File "/usr/share/archivematica/virtualenvs/archivematica/lib/python3.9/site-packages/prometheus_client/values.py", line 68, in __init__
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:    self.__reset()
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:  File "/usr/share/archivematica/virtualenvs/archivematica/lib/python3.9/site-packages/prometheus_client/values.py", line 82, in __reset
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:    files[file_prefix] = MmapedDict(filename)
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:  File "/usr/share/archivematica/virtualenvs/archivematica/lib/python3.9/site-packages/prometheus_client/mmap_dict.py", line 64, in __init__
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]:    self._f = open(filename, 'rb' if read_mode else 'a+b')
Oct 16 13:00:11 RACAM-RH-DEVMPL python[1782]: FileNotFoundError: [Errno 2] No such file or directory: '/tmp/prometheus-stats7rrndq4_/histogram_1782.db'
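The FileNotFoundError in these tracebacks comes from prometheus_client's multiprocess mode, which mmaps per-PID `.db` files under the directory named in the `PROMETHEUS_MULTIPROC_DIR` environment variable. A minimal sketch of the failure mode, assuming only that the tmp directory disappears before a labelled metric first writes its file (the metric and label names below are illustrative stand-ins for `task_execution_time_histogram`, not Archivematica's code):

```python
import os
import shutil
import tempfile

# multiprocess mode is selected from this env var at import time,
# so it must be set before prometheus_client is imported
mp_dir = tempfile.mkdtemp(prefix="prometheus-stats")
os.environ["PROMETHEUS_MULTIPROC_DIR"] = mp_dir
os.environ["prometheus_multiproc_dir"] = mp_dir  # older client versions

from prometheus_client import Histogram

# hypothetical metric mirroring the MCPClient histogram
hist = Histogram("task_execution_time", "Task timing", ["script_name"])

# simulate the tmp database directory vanishing (e.g. tmp cleanup or a
# worker restart tearing it down) before the first observation
shutil.rmtree(mp_dir)

result = "no error"
try:
    # .labels() creates the child metric, which opens histogram_<pid>.db
    # under the (now missing) multiproc directory
    hist.labels(script_name="extractzippedtransfer_v0.0").observe(1.0)
except FileNotFoundError:
    result = "FileNotFoundError"

print(result)
```

This matches the traceback shape (`labels` → `_metric_init` → `MmapedDict` → `open`), which suggests the worker's multiproc tmp directory was removed while the process kept serving jobs.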

Steps to reproduce

It isn't easy to reproduce; it only happened once on each of two system pipelines, so I suspect it only happens when a worker is restarted.

Your environment (version of Archivematica, operating system, other relevant details)

Rocky 9, AM1.16

SS on a separate VM
2 pipelines

The issue happened in both pipelines (identical config)


For Artefactual use:

Before you close this issue, you must check off the following:

  • All pull requests related to this issue are properly linked
  • All pull requests related to this issue have been merged
  • A testing plan for this issue has been implemented and passed (testing plan information should be included in the issue body or comments)
  • Documentation regarding this issue has been written and merged (if applicable)
  • Details about this issue have been added to the release notes (if applicable)

mamedin commented Oct 17, 2024

As a workaround I enabled Prometheus monitoring on the MCPClients, bound to localhost:

ARCHIVEMATICA_MCPCLIENT_MCPCLIENT_PROMETHEUS_BIND_PORT=7998
ARCHIVEMATICA_MCPCLIENT_MCPCLIENT_PROMETHEUS_BIND_ADDRESS=127.0.0.1


mamedin commented Oct 17, 2024

I suspect we are missing a conditional in some places, like here:

https://github.com/artefactual/archivematica/blob/qa/1.x/src/MCPClient/lib/client/pool.py#L158

We should check that monitoring is enabled before running the metrics functions.
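What such a conditional could look like, as an illustrative sketch — the flag, decorator, and function bodies here are assumptions for the example, not Archivematica's actual implementation:

```python
import functools

PROMETHEUS_ENABLED = False  # e.g. true only when a bind port is configured

def skip_if_prometheus_disabled(func):
    """Turn a metrics helper into a no-op unless monitoring is enabled."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if not PROMETHEUS_ENABLED:
            return None
        return func(*args, **kwargs)
    return wrapper

@skip_if_prometheus_disabled
def worker_exit(pid):
    # would flush/clean up per-worker metric files here
    return f"flushed metrics for worker {pid}"

print(worker_exit(1782))  # prints None while monitoring is disabled
```

The key point is that every code path touching the metrics objects (including inline uses like `histogram.labels(...).time()`) would need the same guard, not just the decorated helpers.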

replaceafill (Member) commented

We should check the monitoring is enabled before running metrics functions

I think we do, but maybe we've missed some code path. The metrics.worker_exit function you reference is decorated with skip_if_prometheus_disabled which checks exactly that.

Before applying this workaround what were the values of those two settings in the MCPClient configuration?


currmie commented Oct 22, 2024

Do we have a sense of whether this occurs any time there is a batch task?


mamedin commented Oct 23, 2024

We should check the monitoring is enabled before running metrics functions

I think we do, but maybe we've missed some code path. The metrics.worker_exit function you reference is decorated with skip_if_prometheus_disabled which checks exactly that.

Before applying this workaround what were the values of those two settings in the MCPClient configuration?

Neither value was defined before in the /etc/sysconfig/archivematica-mcp-client file, so per the README.md, with no port defined: "If this value is not provided, metrics will not be served".


mamedin commented Oct 30, 2024

I couldn't reproduce the issue in a similar environment:

Separate SS and dashboard VMs (Rocky 9, AM 1.16)

Starting transfers with amclient; transfer type: zipped bag.

I ran the following loop many times and never got the error (with SLEEP_TIME=0 I got some failures because the same tarball was already in the shared directory, but it never printed the Prometheus error):

#!/bin/bash

#Use AM virtualenv
source /usr/share/archivematica/virtualenvs/archivematica/bin/activate

#VARS
API_KEY=XXXXXXXXXXX
AM_USER=test
AM_URL=https://YYYYYYYYYYYY.archivematica.net
TRANSFER_SOURCE=ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
TRANSFER_NAME=tarredbag-from-amclient
TRANSFER_TYPE="zipped bag"
PROCESSING_CONFIG=automated
TRANSFER_DIR_OR_FILE=artefactual/archivematica-sampledata/SampleTransfers/BagExamples/TarGzBags/TarredBag.tar.gz
SLEEP_TIME=1

# Create loop
for i in {1..100}; do

        echo "Running a zipped bag transfer with amclient: $i"
        
        amclient create-package \
        --am-user-name ${AM_USER} \
        --am-url ${AM_URL} \
        --transfer-source ${TRANSFER_SOURCE} \
        --transfer-name ${TRANSFER_NAME} \
        --transfer-type "${TRANSFER_TYPE}" \
        --processing-config ${PROCESSING_CONFIG} \
        ${API_KEY} \
        ${TRANSFER_DIR_OR_FILE}

        # sleep
        echo "Sleep ${SLEEP_TIME}s"
        sleep ${SLEEP_TIME}

        # Clean dashboard
        echo "Cleaning dashboard"
        amclient close-completed-transfers \
        --am-user-name ${AM_USER} \
        --am-url ${AM_URL} \
        ${API_KEY}


        amclient close-completed-ingests \
        --am-user-name ${AM_USER} \
        --am-url ${AM_URL} \
        ${API_KEY}

done

@replaceafill replaceafill self-assigned this Dec 3, 2024
@replaceafill replaceafill added the Status: ready The issue is sufficiently described/scoped to be picked up by a developer. label Dec 3, 2024
@replaceafill replaceafill added this to the 1.17.0 milestone Dec 3, 2024
@replaceafill replaceafill added Status: in progress Issue that is currently being worked on. and removed Status: ready The issue is sufficiently described/scoped to be picked up by a developer. labels Dec 4, 2024
@replaceafill replaceafill removed this from the 1.17.0 milestone Dec 10, 2024
@replaceafill replaceafill removed their assignment Dec 10, 2024
@replaceafill replaceafill removed the Status: in progress Issue that is currently being worked on. label Dec 10, 2024