Problems on 15TB ZFS flatfs repo: mutating MFS dir with 25k entries, ipfs add hang #10588

Open · 3 tasks done

ProximaNova opened this issue Nov 17, 2024 · 3 comments

Labels: kind/bug (A bug in existing code, including security flaws), need/author-input (Needs input from the original author), need/triage (Needs initial labeling and prioritization)
ProximaNova commented Nov 17, 2024

Checklist

Installation method

dist.ipfs.tech or ipfs-update

Version

Kubo version: 0.32.1
Repo version: 16
System version: amd64/linux
Golang version: go1.23.3

Config

{
  "API": {
    "HTTPHeaders": {
      "Access-Control-Allow-Methods": [
        "PUT",
        "POST",
        "GET"
      ],
      "Access-Control-Allow-Origin": [
        "http://10.0.0.232:5001",
        "http://localhost:3000",
        "http://127.0.0.1:5001",
        "https://webui.ipfs.io"
      ]
    }
  },
  "Addresses": {
    "API": "/ip4/10.0.0.232/tcp/5001",
    "Announce": null,
    "AppendAnnounce": null,
    "Gateway": "/ip4/10.0.0.232/tcp/8080",
    "NoAnnounce": null,
    "Swarm": [
      "/ip4/0.0.0.0/tcp/4001",
      "/ip6/::/tcp/4001",
      "/ip4/0.0.0.0/udp/4001/webrtc-direct",
      "/ip4/0.0.0.0/udp/4001/quic-v1",
      "/ip4/0.0.0.0/udp/4001/quic-v1/webtransport",
      "/ip6/::/udp/4001/webrtc-direct",
      "/ip6/::/udp/4001/quic-v1",
      "/ip6/::/udp/4001/quic-v1/webtransport"
    ]
  },
  "AutoNAT": {},
  "Bootstrap": [
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmNnooDu7bfjPFoTZYxMNLWUQJyrVwtbZg5gBMjTezGAJN",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmQCU2EcMqAqQPR2i9bChDtGNJchTbq5TbXJJ16u19uLTa",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmbLHAnMoJPWSCR5Zhtx6BHJX9KiKNN6tpvbUcqanj75Nb",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmcZf59bWwK5XFi76CZX8cbJ4BhTzzA3gU1ZjYZcYW3dwt",
    "/ip4/104.131.131.82/tcp/4001/p2p/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ",
    "/ip4/104.131.131.82/udp/4001/quic-v1/p2p/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ"
  ],
  "DNS": {
    "Resolvers": {}
  },
  "Datastore": {
    "BloomFilterSize": 0,
    "GCPeriod": "99999999999999999999h",
    "HashOnRead": false,
    "Spec": {
      "mounts": [
        {
          "child": {
            "path": "blocks",
            "shardFunc": "/repo/flatfs/shard/v1/next-to-last/2",
            "sync": true,
            "type": "flatfs"
          },
          "mountpoint": "/blocks",
          "prefix": "flatfs.datastore",
          "type": "measure"
        },
        {
          "child": {
            "compression": "none",
            "path": "datastore",
            "type": "levelds"
          },
          "mountpoint": "/",
          "prefix": "leveldb.datastore",
          "type": "measure"
        }
      ],
      "type": "mount"
    },
    "StorageGCWatermark": 908070605040302000,
    "StorageMax": "12345678910123456789GB"
  },
  "Discovery": {
    "MDNS": {
      "Enabled": true
    }
  },
  "Experimental": {
    "FilestoreEnabled": false,
    "GraphsyncEnabled": false,
    "Libp2pStreamMounting": false,
    "OptimisticProvide": false,
    "OptimisticProvideJobsPoolSize": 0,
    "P2pHttpProxy": false,
    "StrategicProviding": false,
    "UrlstoreEnabled": false
  },
  "Gateway": {
    "APICommands": [],
    "DeserializedResponses": null,
    "DisableHTMLErrors": null,
    "ExposeRoutingAPI": null,
    "HTTPHeaders": {},
    "NoDNSLink": false,
    "NoFetch": true,
    "PathPrefixes": [],
    "PublicGateways": null,
    "RootRedirect": ""
  },
  "Identity": {
    "PeerID": "12D3KooWLowyP9yZLaxzactf2WAhR6cT1nahBmeRM23zPQR3CP4T"
  },
  "Internal": {},
  "Ipns": {
    "RecordLifetime": "",
    "RepublishPeriod": "",
    "ResolveCacheSize": 128
  },
  "Ipwb": {
    "Replay": {
      "Host": "localhost",
      "Index": "/ipfs/bafkreic44im7s6qqtzq7gh3ose7nwkmmrohgmbyhqvxsdp3w5m6j6qdgeu",
      "Port": 2016
    }
  },
  "Migration": {
    "DownloadSources": [],
    "Keep": ""
  },
  "Mounts": {
    "FuseAllowOther": true,
    "IPFS": "/ipfs",
    "IPNS": "/ipns"
  },
  "Peering": {
    "Peers": null
  },
  "Pinning": {
    "RemoteServices": {}
  },
  "Plugins": {
    "Plugins": null
  },
  "Provider": {
    "Strategy": ""
  },
  "Pubsub": {
    "DisableSigning": false,
    "Router": ""
  },
  "Reprovider": {},
  "Routing": {
    "AcceleratedDHTClient": false,
    "Methods": null,
    "Routers": null
  },
  "Swarm": {
    "AddrFilters": null,
    "ConnMgr": {},
    "DisableBandwidthMetrics": false,
    "DisableNatPortMap": false,
    "RelayClient": {},
    "RelayService": {},
    "ResourceMgr": {},
    "Transports": {
      "Multiplexers": {},
      "Network": {},
      "Security": {}
    }
  }
}

Description

Things were working fairly fast and OK (not great, but OK); then, after a certain event a day ago, things got way slower or stopped working. Setup: a ZFS mirror pool of two 18 TB HDDs which mostly contains IPFS data, about 15 TB of it. I say only "OK" because pinning stopped working about a month ago; I occasionally saw an error like "cannot fix 1800 pins". A day ago I was running the following over a list of 1,105,578 IPFS CIDs (totaling 1.2 TB):
$ cat /home/u/Downloads/t5.txt | xargs -d "\n" sh -c 'for args do cid="$(echo "$args" | sed "s/ .*//g")"; fn="$(echo "$args" | sed "s/.* //g")"; date -u; ipfs files cp -p "/ipfs/$cid" "/dup/organize/4chan/mlp/$(echo "$fn" | perl -pE "s/^(....).*/\1/g")/$(echo "$fn" | perl -pE "s/^....(..).*/\1/g")/$fn"; date -u; done' _ >> /home/u/Downloads/t6.txt

What that command does: the input is many lines of the form "[raw blocks CID] [Unix timestamp filename]", where each file is 1 KB to 4 MB in size. The command ran in offline mode yesterday; no ipfs daemon was running. It copies each file into an MFS path, like "ipfs files cp -p /ipfs/[cid] /dup/organize/4chan/mlp/1419/00/1419000480902.jpg", and logs the timestamp of each "ipfs files cp" command to the file "t6.txt".
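For readability, here is the same loop written out as a plain script (a sketch, assuming every line of t5.txt has exactly two space-separated fields):

while read -r cid fn; do
  d1=$(printf '%s' "$fn" | cut -c1-4)  # first four characters of the filename, e.g. "1419"
  d2=$(printf '%s' "$fn" | cut -c5-6)  # next two characters, e.g. "00"
  date -u
  ipfs files cp -p "/ipfs/$cid" "/dup/organize/4chan/mlp/$d1/$d2/$fn"
  date -u
done < /home/u/Downloads/t5.txt >> /home/u/Downloads/t6.txt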

That was the event that I think messed things up. It performed 25,859 copy operations into MFS. Since I canceled that command 24 or 48 hours ago, I have had persistent problems with my IPFS setup, such as the daemon not starting or hanging: ipfs/ipfs-docs#1956 (not a problem anymore). The following problem remains: adding a small file to IPFS never finishes. This ran for about 30 minutes and didn't exit:

$ utc; ipfs add -rH --cid-version=1 --chunker=size-1048576 /home/u/Downloads/98-gfa.gif; utc
2024-11-17T22:58:52.545136781Z
added bafybeibfcytwdefk2hmatub3ab4wvfyei34xkwqz5ubzrqwslxi3d5ehau 98-gfa.gif
 1.35 MiB / 1.35 MiB [======================================] 100.00%
^C
Error: canceled
2024-11-17T23:23:04.879763382Z
$ # hit ctrl+c

And as said above, pinning doesn't work, so ipfs --offline pin add --progress bafybeibfcytwdefk2hmatub3ab4wvfyei34xkwqz5ubzrqwslxi3d5ehau stays stuck at "Fetched/Processed 0 nodes". About the 25,859 operations before things went bad: the start of the log file shows files being copied to MFS quickly, and the end shows it going much slower:

$ head -n8 /home/u/Downloads/t6.txt
Sat Nov 16 02:58:39 AM UTC 2024
Sat Nov 16 02:58:40 AM UTC 2024
Sat Nov 16 02:58:40 AM UTC 2024
Sat Nov 16 02:58:40 AM UTC 2024
Sat Nov 16 02:58:40 AM UTC 2024
Sat Nov 16 02:58:40 AM UTC 2024
Sat Nov 16 02:58:40 AM UTC 2024
Sat Nov 16 02:58:40 AM UTC 2024
$ tail -n8 /home/u/Downloads/t6.txt
Sun Nov 17 05:28:35 AM UTC 2024
Sun Nov 17 05:28:40 AM UTC 2024
Sun Nov 17 05:28:40 AM UTC 2024
Sun Nov 17 05:28:44 AM UTC 2024
Sun Nov 17 05:28:44 AM UTC 2024
Sun Nov 17 05:28:48 AM UTC 2024
Sun Nov 17 05:28:48 AM UTC 2024
$ # slow MFS

About a week ago I saw some error about a dirty flag not being cleared. I have attached the output file of "$ ipfs diag profile" for more details. If there's something to be learned from this, I guess it's not to copy many files to MFS without the IPFS daemon running. I was trying to copy more than one million files but only copied about 25,000. I've also seen some weirdness with the "ipfs files" set of commands (copy/move) in the past.
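One mitigation I might try for future batches (hedged: ipfs files flush is a real command, but whether it avoids this failure mode is a guess on my part) is to flush the MFS directory to disk every few thousand copies instead of letting the dirty state accumulate:

$ ipfs files flush /dup/organize/4chan/mlp   # should print the CID of the flushed path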

Related issue: #10383, titled "Improve data onboarding speed: ipfs add and ipfs dag import|export" (I recommend using raw blocks instead).

ProximaNova added the kind/bug and need/triage labels on Nov 17, 2024
ProximaNova (Author) commented Nov 17, 2024

GitHub said "File size too big: 25 MB are allowed, 50 MB were attempted to upload." So here's "ipfs-profile-2024-11-17-utc.zip":
https://files.catbox.moe/ic3ztm.zip or at ipfs://bafybeiarzkmarspokfjnhzxy7eeyquozca22robaazotpegjlabzlhmime

Update edit: This really is worse than expected. The add command hanging on a small file (and sometimes OOM-killing things) is one thing, but missing files is far worse. The ipfs datastore index cannot find the following block, meaning it is either completely gone or the index got messed up:

$ ipfs --offline ls -s bafybeiftetdkn3mxxfpi5jvtcxxlzjxcpezhhu7oemspuszlqdb6fkqm2m/selenium 1>/dev/null
Error: block was not found locally (offline): ipld: could not find bafybeiayzj2qrplhxgmircbo7mn7d6imcx2tceiiwjrvvmn72cenzrv7wa
$

I had 100% of that folder in the past:
https://web.archive.org/web/20241108195911/https://trustless-gateway.link/ipfs/bafybeiftetdkn3mxxfpi5jvtcxxlzjxcpezhhu7oemspuszlqdb6fkqm2m?format=car

Timeline:
2024-11-08: had 100% of that folder (this is before the ZFS message on the next line about the mirror pool being done)
2024-11-11: resilvered 15.0T in 2 days 08:20:30 with 0 errors
2024-11-16: bad event with MFS while running completely offline

Point is that no filesystem or data corruption or data screw-up happened other than ipfs's. That bafybeif...qm2m folder = 321 blocks (32,722,333 bytes). It corresponds to these .data files:
[snip]
That's 322 of them. I have 78 but am missing 244. (Tested by using two different repos on two different computers.) So those files got deleted somehow; good thing there was a remote capture of that CAR file. I can only hope this is an isolated error and I'm not missing anything else. What deleted them? If I run $ ipfs files stat / then I can see CumulativeSize: 3673471960804799. This 3.673-petabyte number is not really true; it is just the non-deduplicated size. It's 3.67 PB because I occasionally captured the CID of the MFS root and put it in an MFS folder. This matters because StorageGCWatermark=908070605040302000 and StorageMax=12345678910123456789GB. Ten million zettabytes (the storage max) is far larger than 3.7 PB. Therefore, garbage collection (gc) should not have triggered.
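Checking those magnitudes with bc (the 64-bit comparison is my own hedged aside; I have not verified how Kubo stores this value internally):

$ echo '12345678910123456789 * 10^9' | bc   # StorageMax in bytes, ~1.2 * 10^28
12345678910123456789000000000
$ echo '2^64 - 1' | bc   # largest 64-bit byte count, ~1.8 * 10^19
18446744073709551615

So StorageMax as configured is vastly larger than the 3.67 PB CumulativeSize, and would not even fit in a 64-bit byte count.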

I didn't run any gc command, and as far as I know, no gc was run in the background. There's no point in running mirrored HDDs if the data gets deleted by some software (RAID is not a backup). This is really annoying and frustrating.

lidel changed the title from "ipfs suddenly became very slow or nonfunctional (Kubo 0.32.1)" to "ipfs suddenly became very slow or nonfunctional (15TB ZFS + MFS)" on Nov 19, 2024
lidel changed the title to "Various problems with 15TB flatfs/ZFS repo (mutating MFS dir with 25k entries, ipfs add hangs)" on Nov 19, 2024
lidel changed the title to "Problems on 15TB ZFS flatfs repo: mutating MFS dir with 25k entries, ipfs add hang" on Nov 19, 2024
gammazero (Contributor) commented Nov 19, 2024

Note: It is not necessarily that 244 out of the 322 blocks were deleted. Adding files to MFS does not necessarily download all the data; rather, it creates references to the data. Did you explicitly preload the data after ipfs files cp using ipfs refs -r, ipfs dag stat, or ipfs pin add?
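For example, one way to check this offline, using the root CID from the earlier comment (any child that is not in the local blockstore should fail with a "block was not found locally (offline)" error):

$ ipfs --offline refs -r bafybeiftetdkn3mxxfpi5jvtcxxlzjxcpezhhu7oemspuszlqdb6fkqm2m > /dev/null

Running the same command without --offline while the daemon is online should instead fetch the missing blocks from the network.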

It'd be helpful to get profiles during the bad events such as the following (see the example command after this list):

  • While getting close to OOMing
  • While it's hanging trying to add a small file
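For example (collection options per ipfs diag profile --help; 30s is an arbitrary choice):

$ ipfs diag profile --profile-time 30s   # run in a second terminal while ipfs add is stuck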

gammazero added the need/author-input label on Nov 19, 2024
hsanjuan (Contributor) commented

I think your leveldb datastore blew up.

Does add work with --pin=false?

leveldb is used to store the pinset and the MFS root; it is probably hanging there.

That doesn't explain your "data missing" part, as the blocks are stored in flatfs (the blocks folder). Please convert your CID to a file name with https://go.dev/play/p/WOuGq_mJWbU, then look whether there is a file named like that (ending in .data) somewhere in blocks. AFAIK the only way IPFS would have removed it is by GC'ing, but as you said, that shouldn't have triggered.
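As a possible shortcut to the playground link: flatfs names each block file after the uppercase, unpadded base32 encoding of the CID's multihash, so Kubo's own ipfs cid format may produce the same name (a sketch; I'm assuming the -b/-f flags behave as documented, and using the missing CID from the earlier error message):

$ fn="$(ipfs cid format -b base32upper -f '%M' bafybeiayzj2qrplhxgmircbo7mn7d6imcx2tceiiwjrvvmn72cenzrv7wa).data"
$ find "${IPFS_PATH:-$HOME/.ipfs}/blocks" -name "$fn"   # searching avoids computing the shard directory; adjust the repo path as needed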
