Bulk Rest-API does not stage files with broken disk locations #7607

christianvoss opened this issue Jun 24, 2024 · 4 comments

@christianvoss

Hi,

we've been observing some curious behaviour with the bulk staging service. It appears the bulk service does not trigger stages if disk locations are known to dCache, even when the pools holding those replicas are offline. We observed this recently when a storage node had to be taken out of production for a week and we wanted to stage back some files needed by our users.

I've reproduced this with the latest 9.2 dCache release, 9.2.21. This is what we see when we try to stage a NEARLINE file via the bulk REST API:

{
"nextId": -1,
"uid": "f9b987ee-02b5-4ba8-a334-df4b24ed4b6a",
"arrivedAt": 1719238168621,
"startedAt": 1719238168720,
"lastModified": 1719238168753,
"status": "COMPLETED",
"targetPrefix": "/",
"targets": [
{
"target": "/pnfs/desy.de/exfel/archive/XFEL/raw/FXE/201802/p002271/r0081/RAW-R0081-LPD09-S00003.h5",
"state": "SKIPPED",
"submittedAt": 1719238168635,
"startedAt": 1719238168635,
"finishedAt": 1719238168750,
"id": 242049
}
]
}

The operation is always skipped, yet dCache correctly reports the file as NEARLINE: "fileLocality": "NEARLINE".
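For reference, the request above was submitted to the bulk REST endpoint along these lines (a minimal sketch, not our exact invocation; the frontend URL/port, the STAGE activity arguments and the proxy-based authentication are assumptions):

# Minimal sketch of submitting a bulk STAGE request via the dCache REST API.
# The frontend host/port and the credential handling below are assumptions.
import json
import os
import requests

frontend = "https://dcache-door-xfel01.desy.de:3880"   # assumed frontend URL
path = "/pnfs/desy.de/exfel/archive/XFEL/raw/FXE/201802/p002271/r0081/RAW-R0081-LPD09-S00003.h5"

session = requests.Session()
session.verify = "/etc/grid-security/certificates"
session.cert = f"/tmp/x509up_u{os.getuid()}"            # X.509 proxy

r = session.post(f"{frontend}/api/v1/bulk-requests",
                 headers={"accept": "application/json",
                          "content-type": "application/json"},
                 data=json.dumps({"activity": "STAGE",
                                  "target": [path],
                                  "expandDirectories": "none"}))
r.raise_for_status()
print(r.status_code, r.headers.get("request-url"))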

In contrast, staging via SRM triggers a restore from tape immediately:

[vossc@naf-it01] [dev/vossc/no-macaroon-voms-directly] pnfs_qos_api $ srm-bring-online -lifetime=864000 srm://dcache-door-xfel01.desy.de:8443/pnfs/desy.de/exfel/archive/XFEL/raw/FXE/201802/p002271/r0081/RAW-R0081-LPD09-S00003.h5

[dcache-head-xfel02] (local) vossc > \sn pnfsidof /pnfs/desy.de/exfel/archive/XFEL/raw/FXE/201802/p002271/r0081/RAW-R0081-LPD09-S00003.h5
00005283EB13A8A943E9938C32E0BFFF47FC

[dcache-head-xfel02] (local) vossc > \sp rc ls
00005283EB13A8A943E9938C32E0BFFF47FC@world-net-/ m=1 r=0 [dcache-xfel499-01] [Waiting for stage: dcache-xfel499-01 06.24 16:10:40] {0,}

[dcache-head-xfel02] (local) vossc > \s dcache-xfel499-01 rh ls
a928e3c3-6454-4151-b186-0f3ab7b93757 ACTIVE Mon Jun 24 16:10:40 CEST 2024 Mon Jun 24 16:10:40 CEST 2024 00005283EB13A8A943E9938C32E0BFFF47FC xfel:FXE-2018

Is it possible for bulk to behave like SRM does here, or would the procedure be to 'disable' the location in chimera before triggering the stage?

Thanks a lot,
Christian

@DmitryLitvintsev
Member

Yes. Bulk relies purely on the location information in chimera.
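To see which locations bulk will find there, you can ask the frontend's namespace endpoint; a quick sketch (assuming the usual locality/locations query parameters and an X.509 proxy, with host and path here only as examples):

# Sketch: list the locality and disk locations dCache reports for a path.
# Host, proxy path and query parameters are assumptions for illustration.
import os
import requests

frontend = "https://uqbar.fnal.gov:3880"
path = "/pnfs/fs/usr/fermilab/users/litvinse/apache-maven-3.9.8-bin.tar.gz"

r = requests.get(f"{frontend}/api/v1/namespace{path}",
                 params={"locality": "true", "locations": "true"},
                 headers={"accept": "application/json"},
                 cert=f"/tmp/x509up_u{os.getuid()}",
                 verify="/etc/grid-security/certificates")
r.raise_for_status()
info = r.json()
print(info.get("fileLocality"), info.get("locations"))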

@christianvoss
Author

Hi Dmitry,

thanks a lot for the clarification. Just to confirm: is it supposed to stage from tape even when a pool is offline?

Thanks a lot,
Christian

@DmitryLitvintsev
Member

Hi Christian,

sorry it took a long time to get to this.

With the pool disabled (pool disable -strict) or stopped, the system works as designed:

 [uqbar] (bulk@bulk1Domain) admin > \sn cacheinfoof 000081BB1343AB6A4783B342B28269116004 
 rw-uqbar-3


#  systemctl stop dcache@<pool domain>.service

[uqbar] (local) admin > \c rw-uqbar-3 
(1) Cell does not exist.

Execute the script to stage the file:

$ python pin_many.py /pnfs/fs/usr/fermilab/users/litvinse/apache-maven-3.9.8-bin.tar.gz 
201 https://uqbar.fnal.gov:3880/api/v1/bulk-requests/92edcdd1-c04d-4d63-a85c-c9cdc28e7fcd
Checking status
200 {
  "nextId" : -1,
  "uid" : "92edcdd1-c04d-4d63-a85c-c9cdc28e7fcd",
  "arrivedAt" : 1726695461378,
  "startedAt" : 1726695461395,
  "lastModified" : 1726695461395,
  "status" : "STARTED",
  "targetPrefix" : "/pnfs/fs/usr/fermilab",
  "targets" : [ {
    "target" : "/pnfs/fs/usr/fermilab/users/litvinse/apache-maven-3.9.8-bin.tar.gz",
    "state" : "RUNNING",
    "submittedAt" : 1726695461389,
    "startedAt" : 1726695461389,
    "id" : 363
  } ]
}

Observe:

ID           | ARRIVED             |            MODIFIED |        OWNER |     STATUS | UID
...
170          | 2024/09/18-16:37:41 | 2024/09/18-16:37:41 |    8637:3200 |  COMPLETED | 92edcdd1-c04d-4d63-a85c-c9cdc28e7fcd
[uqbar] (bulk@bulk1Domain) admin > request ls 92edcdd1-c04d-4d63-a85c-c9cdc28e7fcd 
ID           | ARRIVED             |            MODIFIED |        OWNER |     STATUS | UID
170          | 2024/09/18-16:37:41 | 2024/09/18-16:37:41 |    8637:3200 |  COMPLETED | 92edcdd1-c04d-4d63-a85c-c9cdc28e7fcd
[uqbar] (bulk@bulk1Domain) admin > \sn cacheinfoof 000081BB1343AB6A4783B342B28269116004 
 rw-uqbar-3 rw-uqbar-9

The script:

#!/usr/bin/env python3

import json
import os
import sys

import requests
from requests.exceptions import HTTPError

import urllib3
urllib3.disable_warnings()

base_url = "https://uqbar.fnal.gov:3880/api/v1/bulk-requests"
#base_url = "https://cmsdcatape.fnal.gov:3880/api/v1/bulk-requests"

if __name__ == "__main__":
    # Authenticate with the X.509 proxy; the proxy file holds both the
    # certificate and the key, so a single path is enough for requests.
    session = requests.Session()
    session.verify = "/etc/grid-security/certificates"
    session.cert = f"/tmp/x509up_u{os.getuid()}"

    headers = {"accept": "application/json",
               "content-type": "application/json"}

    # PIN request for all paths given on the command line.
    data = {
        "target": sys.argv[1:],
        "clearOnFailure": "true",
        "expandDirectories": "none",
        "activity": "PIN",
        "arguments": {
            "lifetime": "24",
            "lifetime-unit": "HOURS"
        }
    }

    # Submit the bulk request.
    try:
        r = session.post(base_url,
                         data=json.dumps(data),
                         headers=headers)
        r.raise_for_status()
        print(r.status_code, r.headers['request-url'])
    except HTTPError as exc:
        print(exc)
        sys.exit(1)

    # Poll the request URL returned by the door for the current status.
    rq = r.headers['request-url']
    print("Checking status")
    r = session.get(rq, headers=headers)
    r.raise_for_status()
    print(r.status_code, r.text)
I will re-check (by eye) the condition that sets the SKIPPED status.

@DmitryLitvintsev
Member

Could you do me a favor: could you use the exact same script (you'll need a VOMS proxy) to run the exact same exercise?
