AdaptiveCaching

Caching decisions can be made in two places:

In VP, I could use the dataset name to help decide whether or not to create a VP replica.
In Rucio, when deciding whether or not to prepend xcache to the path. These decisions would be based on the filename.
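
The Rucio-side decision could be sketched roughly as follows. This is only an illustration: the xcache endpoint, the function names, and the skip list are assumptions, and the only rule taken from these notes is that files starting with "panda.um." gain nothing from caching.

```python
# Hypothetical xcache endpoint; a real deployment would supply its own.
XCACHE_PREFIX = "root://xcache.example.org:1094//"

# Filename prefixes that should bypass the cache (assumed rule set).
SKIP_PREFIXES = ("panda.um.",)


def should_use_xcache(path: str) -> bool:
    """Decide from the filename alone whether the file should go through xcache."""
    filename = path.rsplit("/", 1)[-1]
    return not filename.startswith(SKIP_PREFIXES)


def cached_path(origin_url: str, xcache_prefix: str = XCACHE_PREFIX) -> str:
    """Prepend the xcache endpoint to the origin URL when caching is worthwhile."""
    if should_use_xcache(origin_url):
        return xcache_prefix + origin_url
    return origin_url
```

Keeping the rule in one small predicate means the skip list can grow (for example, with other rarely reused file types) without touching the path handling.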

Questions to be answered:

Are there datasets that need not be cached at all? Yes: datasets matching ds=panda.um.* always have a cache hit rate of 1.
Why does caching efficiency differ between sites?
Can we optimize VP placements so that more datasets are reused more often?

Data sources

Find out what the .log.tgz files are for. These are almost never accessed twice. Answer: they belong to pmerge jobs. A pmerge job has two input datasets: one containing log files and the other containing ROOT files. Both the dataset names and the filenames always start with "panda.um.". Files are accessed only once. Fill factor is 100%.
Find out when, why, and from where .lib.tgz files are read.
Write code that uses Rucio to get the file-to-dataset mapping.
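
A minimal sketch of the file-to-dataset lookup, assuming the Rucio Python client and its `list_parent_dids` call. The helper name `file_to_datasets` and the shape of its input are assumptions made for illustration, and the client is passed in so the function can be exercised without a configured Rucio environment.

```python
def file_to_datasets(client, files):
    """Map each (scope, name) file to the datasets that contain it.

    `client` is expected to behave like rucio.client.Client, i.e. to provide
    list_parent_dids(scope, name) yielding dicts with 'scope', 'name', 'type'.
    """
    mapping = {}
    for scope, name in files:
        parents = []
        for parent in client.list_parent_dids(scope, name):
            # Keep only datasets; compare on the string form so that both a
            # plain "DATASET" string and an enum-like value are accepted.
            if str(parent.get("type", "")).upper().endswith("DATASET"):
                parents.append((parent["scope"], parent["name"]))
        mapping[(scope, name)] = parents
    return mapping


# Usage (requires a configured Rucio environment):
#   from rucio.client import Client
#   print(file_to_datasets(Client(), [("some.scope", "some.file.root")]))
```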

VP has been set to not create a Virtual Placement for datasets starting with "panda.um.". The source for pmerge jobs is always the scratch disk, and all of the VP queues have a scratch disk mounted for r/w. That is why, even with no VP replica, local pmerge jobs will run in the VP queue. As long as the scratch disk is local to the VP queue, these jobs should neither use xcache nor be visible in gStream. Sites that mount a remote scratch disk will still see pmerge jobs in xcache.
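
The rule described above can be written down as a short sketch. The function and constant names are hypothetical and this is not the actual VP code; it only restates the prefix rule and the scratch-disk consequence from the paragraph above.

```python
# Dataset name prefixes for which no Virtual Placement is created.
NO_VP_PREFIXES = ("panda.um.",)


def should_create_vp(dataset: str) -> bool:
    """Return False for datasets that VP is configured to skip."""
    # Accept both "name" and "scope:name" forms.
    name = dataset.split(":", 1)[-1]
    return not name.startswith(NO_VP_PREFIXES)


def pmerge_goes_through_xcache(scratch_is_local: bool) -> bool:
    """pmerge input is read from scratch disk; only a remote mount passes through xcache."""
    return not scratch_is_local
```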
