Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

slicer: Track and record paths in DB #77

Closed
wants to merge 10 commits into from

Conversation

woky
Copy link
Contributor

@woky woky commented Jun 21, 2023

This commit completes the Chisel DB implementation by recording path
values in the DB.

Recording paths is trickier because

  1. we don't know which paths to record until all packages are sliced
    because of glob patterns,
  2. we need to record paths to slices relationship even for implicitly
    created parent directories, and
  3. we want to compute digests only once for source paths with multiple
    target paths.

This commit introduces the pathTracker interface and implementation that
tackles the problems above. Its methods are called in various places in
slicer and as callbacks in deb/extract to update the state of tracked
paths. At a high level, these are the steps taken to track paths:

  1. The callback interface in deb/extract is utilized to track created
    paths, their attributes, and their content digests.
  2. The same information is tracked for non-extracted content paths.
  3. Until-paths are untracked.
  4. Slices are assigned to both requested paths and their implicit
    parent directories.
  5. Digests of mutated files are updated.

After that, the tracked paths are recorded in the DB.

Subjective performance impact is minimal.

Note that copyright files are not owned by any slice. It is a bug that
will be fixed in future commits. That's why the nil Slices attribute was
kept in expected DB objects in test cases.

@woky
Copy link
Contributor Author

woky commented Jun 21, 2023

This PR depends on #78, #83 and #72

@woky woky added the Blocked Waiting for something external label Jun 21, 2023
@woky woky force-pushed the pub/slicer-cdb-record-paths branch 5 times, most recently from dcf4699 to c56fd2f Compare June 22, 2023 10:38
Copy link

@anpep anpep left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only judging coding style and idioms here as per the cross-team Go review initiative, so a bunch of nitpicks on what I consider a good PR (:

cmd/chisel/cmd_cut.go Outdated Show resolved Hide resolved
internal/archive/archive.go Outdated Show resolved Hide resolved
internal/db/db.go Show resolved Hide resolved
internal/db/types.go Outdated Show resolved Hide resolved
internal/slicer/pathtrack.go Outdated Show resolved Hide resolved
internal/slicer/slicer.go Outdated Show resolved Hide resolved
@woky
Copy link
Contributor Author

woky commented Aug 3, 2023

Hello @anpep

Thank you very much for your suggestion.s Due to the way the PRs are structured, I cannot fix these issue in this PRs, because they are in code introduced by commits from other downstream PRs, and as I have one downstream PR with bunch of fixup commits waiting to be reviewed, I will rebase this one with your suggestion when that WIP one gets squashed and merged.

@woky woky mentioned this pull request Aug 7, 2023
1 task
@woky woky force-pushed the pub/slicer-cdb-record-paths branch 3 times, most recently from c9272f5 to f0db49e Compare August 30, 2023 12:04
@woky woky force-pushed the pub/slicer-cdb-record-paths branch 2 times, most recently from a145894 to e9738dd Compare September 19, 2023 06:07
@woky woky force-pushed the pub/slicer-cdb-record-paths branch 10 times, most recently from 2cbe46f to 3f980be Compare October 9, 2023 19:15
Currently we only test with the embedded base-files files package. This
quite limits what we can test. But since commit a3315e3
("testutil/pkgdata: Create deb programatically") we have the ability to
define our own custom package content.

Add support for testing with custom packages. These can be defined in
each test case per archive name. The content can be defined directly in
test case definition with the help of testutil.MustMakeDeb().

The package content is wrapped in the testPackage structure. This
introduces one level of nesting in each test case package definition. In
future, we will add package metadata to the package definition, namely
version. So wrapping the package content in the structure now and later
just adding a new field to it the structure later will make the future
diff much smaller.
At some point we will need to get information about (deb) packages from
(Go) packages other than archive, namely from the slicer package.
For instance:
- Package entries in Chisel DB will include version, digest and
  architecture.
- Multi-archive package selection will need to know versions of a
  package in all configured archives.

Add Info() method to the Archive interface that returns a new
PackageInfo interface that contains accessors for the underlying
control.Section of the package. Currently these accessors are Name(),
Version(), Arch() and SHA256() methods.

The reason to introduce PackageInfo interface instead of returing
control.Section directly is keep the archive abstraction and not leak
implementation details.
@woky woky force-pushed the pub/slicer-cdb-record-paths branch from 3f980be to 8fb41c2 Compare October 12, 2023 16:40
woky added 8 commits October 20, 2023 10:07
From the code:

  These functions exists because we work with slash terminated
  directory paths that come from deb package tarballs but standard
  library path functions treat slash terminated paths as unclean.

We use ad-hoc code to make filepath.Dir() slash terminated in two places
right now. For example this code

  targetDir := filepath.Dir(strings.TrimRight(targetPath, "/")) + "/"

is not strictly correct as "/a/b/.///" will be turned into "/a/b/./"
which is still "/a/b".

Since we deal with slash terminated directory paths everywhere, create
and use dedicated helper functions to process them.
The purpose of parent directory handling in deb/extract.go is to create
parent directories of requested paths with the same attributes (only
mode as of now) that appear in the package tarball. However, the current
implementation is not correct when glob paths are requested.

In what follows parent directory refers to a directory path that is not
explicitly requested for extraction, but that is the parent of other
paths that are requested for extraction, and so it is assumed to be
implicitly requested for extraction.

Currently, whether a package path should be extracted is determined by
the shouldExtract() function that iterates over requested paths and for
each checks whether it matches the package path if it's glob, or if it's
non-glob, whether it equals the package path or whether some of its
target paths have the package path as the parent.

There are two problems with this implementation:

  1) It only checks whether a package path is the parent of any target
     path of a requested non-glob path. It does not, and probably even
     cannot, check whether it is the parent of a requested glob path.

  2) It iterates over the map of requested paths for every package path,
     even though for requested non-glob paths, it can match by directory
     lookup. And in each iteration, it checks whether a requested path
     is a glob by searching for wildcards in it.

This commit addresses mainly the first problem, but it touches the
second one as well.

Track modes of directories as we encounter them in the tarball. Then,
when creating a target path, create its missing parent directories with
modes with which they were recorded in the tarball, or 0755 if they were
not recorded yet. In the latter case, the directory mode is also tracked
so that the directory can be recreated with the proper mode if we find
it in the tarball later. This algorithm works for both glob and non-glob
requested paths.

Since the matching that was previously done in the shouldExtract()
function is not used anymore, the function was removed. As part of this
change, the requested glob paths are recorded before the extraction
begins into the dedicated list which is scanned only when requested
non-glob path lookup fails.

We still match requested non-glob and glob paths separately. Ideally, we
would use some kind of pattern matching on trees, something like a radix
tree which also supports wildcards, but this commit is humble.

One consequence of this change is that when an optional path that
doesn't exist in the tarball is requested, its parent directories are
not created (if they are not the parents of other requested paths). But
this new behavior is arguably more natural than the old behavior where
we created parent directories for non-existent paths, which seems to
have been just an artifact of the implementation. Therefore, one test
had to be changed for this behavior change.

Since we do not allow optional paths to be defined in slices, this
change is visible only to callers of the deb.Extract() function. In
chisel, these callers are extract test and slicer. The latter depended
on the old behavior to create parent directories for non-extracted
content paths by including their direct parent directories in the
requested paths. The behavior of chisel is preserved by changing slicer
to include all, and not just direct, parent directories of non-extracted
content paths in the requested paths.
We use the source tar entry header to compute fs.FileMode. When the
target path has a different mode, we modify the source header first.
Consequently, when a source path is mapped to multiple target paths, and
when one target path overrides the mode and the following one doesn't,
the following one will have the first one's mode. Fix it by remembering
the source header mode.
This commit adds support for callbacks from deb/extract.go. These
callbacks will be used by the Chisel DB implementation that will follow
to record metadata about installed paths.

The reason for adding callbacks instead of handling path recording in
deb/extract is to keep the concerns separated between internal packages.
In Chisel DB we need to record paths to slices relationship and digests
of regular files. But deb/extract doesn't know about slices. slicer on
the other hand knows about slices, and so it's the proper place to drive
Chisel DB creation. But slicer needs to cooperate with deb/extract
because it doesn't know which paths will be matched by globs and because
it needs the content of matched regular files. The callbacks provide a
mechanism for slicer to accomplish this.
This commit adds the skeleton db package that will grow into the full
Chisel DB implementation with forthcoming commits. The package contains
only New(), Save(), and Load() functions for creating, writing, and
reading the database, respectively.

The API exposes underlying jsonwall interfaces for both reading and
writing. An attempt was made to build an abstraction on top of jsonwall,
but it was later abandoned because of Go type system limitations and
difficulty to test.

The schema is set to 0.1 when writing. An error is returned when the
schema isn't 0.1 when reading. The schema and database location in the
output root directory are implementation details that are not exposed by
the API.
This commit adds types used by Chisel DB along with JSON serialization
code.

There are public types which don't contain any serialization-specific
fields, like kind, and all their fields are assignable from data types
used during slicing. They have private counterparts that are used for
JSON serialization.
This is the first commit that adds support for the DB into the slicer
and the cut command.

The database is created and saved in the cut command and populated in
slicer. Currently, only packages and slices are recorded in the DB.
Recording of paths is added in the forthcoming commit because it is more
intricate.

Slicer test is extended to allow testing of created DB objects. A list
of expected DB objects is added to each existing test case. These lists
are currently quite boring as most test cases use only the embedded
base-files package, but it'll be extended with path entries in the
forthcoming commit.
This commit completes the Chisel DB implementation by recording path
values in the DB.

Recording paths is trickier because

  1. we don't know which paths to record until all packages are sliced
     because of glob patterns,
  2. we need to record paths to slices relationship even for implicitly
     created parent directories, and
  3. we want to compute digests only once for source paths with multiple
     target paths.

This commit introduces the pathTracker interface and implementation that
tackles the problems above. Its methods are called in various places in
slicer and as callbacks in deb/extract to update the state of tracked
paths. At a high level, these are the steps taken to track paths:

  1. The callback interface in deb/extract is utilized to track created
     paths, their attributes, and their content digests.
  2. The same information is tracked for non-extracted content paths.
  3. Until-paths are untracked.
  4. Slices are assigned to both requested paths and their implicit
     parent directories.
  5. Digests of mutated files are updated.

After that, the tracked paths are recorded in the DB.

Subjective performance impact is minimal.

Note that copyright files are not owned by any slice. It is a bug that
will be fixed in future commits. That's why the nil Slices attribute was
kept in expected DB objects in test cases.
@woky woky force-pushed the pub/slicer-cdb-record-paths branch from 8fb41c2 to 926fd0d Compare October 20, 2023 08:08
@letFunny
Copy link
Collaborator

Per our offline discussion, I think our best way forward is to close this PR and discuss with @rebornplusplus to see if it is a priority right now, and how to amend the code here to make it compatible with the latest changes. We will take that discussion offline and we can always re-open the PRs or create new ones as we see fit.

@cjdcordeiro cjdcordeiro added the Decaying It's been a while, close or act on it label Dec 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Blocked Waiting for something external Decaying It's been a while, close or act on it
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants