support for SOURCE_DATE_EPOCH in sdist. #2133

Open
Carreau opened this issue May 24, 2020 · 8 comments

@Carreau
Contributor

Carreau commented May 24, 2020

SOURCE_DATE_EPOCH is useful for reproducible builds: when it is set, no timestamp should be greater than this value.

It seems that setuptools' sdist does not support SOURCE_DATE_EPOCH. I've traced it to the following:

sdist inherits from Command, which leads to these successive calls:

Lib/distutils/cmd.py:Command.make_archive
Lib/distutils/archive_util.py:make_archive
Lib/distutils/archive_util.py:ARCHIVE_FORMATS
Lib/distutils/archive_util.py:make_tarball

make_tarball seems to be the right place to monkeypatch to look for SOURCE_DATE_EPOCH, as it can itself pass a filter to tarfile.add(), which will ensure the mtime is bounded (it already passes a filter to set uid/gid).
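
To illustrate the filter idea, here is a minimal sketch, not the actual setuptools or distutils code. The helper names `clamp_mtime_filter` and `build_tar` are hypothetical, and the archive is written uncompressed to memory to keep the example small. It relies only on the `filter` argument of `tarfile.TarFile.add()`.

```python
import io
import os
import tarfile


def clamp_mtime_filter(epoch):
    """Return a tarfile filter that caps each member's mtime at `epoch`."""
    def _filter(tarinfo):
        # Keep mtime a float so the header layout does not depend on
        # whether the clamp was applied.
        tarinfo.mtime = min(float(tarinfo.mtime), float(epoch))
        return tarinfo
    return _filter


def build_tar(src_dir, epoch=None):
    """Build an uncompressed tar of `src_dir` in memory and return its bytes.

    If `epoch` is not given, SOURCE_DATE_EPOCH is read from the environment;
    if that is unset too, timestamps are left untouched.
    """
    if epoch is None:
        raw = os.environ.get("SOURCE_DATE_EPOCH")
        epoch = int(raw) if raw is not None else None
    tar_filter = clamp_mtime_filter(epoch) if epoch is not None else None
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        # "pkg" is an arbitrary archive-root name chosen for this sketch.
        tar.add(src_dir, arcname="pkg", filter=tar_filter)
    return buf.getvalue()
```

Clamping only bounds the timestamps; two builds are byte-identical only when every file's mtime is already at or above the epoch (or otherwise equal between builds).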

With this, most sdists (except tgz) are reproducible. TGZ has one remaining problem: GzipFile writes time.time() into the header, and that's a bit harder to patch.
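
For reference, the gzip timestamp can be pinned when `GzipFile` is constructed directly, through its `mtime` argument. The sketch below only demonstrates that argument; the helper name `gzip_bytes` is hypothetical, and this is not how distutils opens its gzip stream, which is why patching is harder there.

```python
import gzip
import io


def gzip_bytes(data, mtime):
    """Compress `data` with a fixed header timestamp and return the bytes."""
    buf = io.BytesIO()
    # filename="" keeps a file name out of the header; mtime replaces the
    # default of time.time().
    with gzip.GzipFile(filename="", fileobj=buf, mode="wb", mtime=mtime) as gz:
        gz.write(data)
    return buf.getvalue()
```

Calling `gzip_bytes` twice with the same data and the same `mtime` should give identical output, since the timestamp is the only time-dependent field in the header.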

Carreau added a commit to Carreau/setuptools that referenced this issue May 24, 2020
This pulls in just enough of distutils and modifies the make_tarball
function in order to respect SOURCE_DATE_EPOCH; this ensures that
_when set_ no timestamp in the final archive is greater than that timestamp.

This allows byte-for-byte reproducible builds, though it is not always
sufficient. For example:

 - This does not work with `gztar`, and zip does embed a timestamp in
 the header which currently is `time.time()` in the standard library.

 - It does not help if some fields passed to setup.py have non-deterministic
 ordering (for example, using sets for dependencies).

 Partial work toward pypa#2133; with this I was able to make two byte-identical
 sdists of IPython.
Carreau added a commit to Carreau/setuptools that referenced this issue May 25, 2020
This pulls in just enough of distutils and modifies the make_tarball
function in order to respect SOURCE_DATE_EPOCH; this ensures that
_when set_ no timestamp in the final archive is greater than that timestamp.

This allows byte-for-byte reproducible builds, though it is not always
sufficient. For example:

 - This does not work with `gztar`, and zip does embed a timestamp in
 the header which currently is `time.time()` in the standard library.

 - It does not help if some fields passed to setup.py have non-deterministic
 ordering (for example, using sets for dependencies).

 Partial work toward pypa#2133; with this I was able to make two byte-identical
 sdists of IPython.

You will see three types of modifications:

 - Referring explicitly to parts of the distutils namespace in a couple of
 places, to avoid duplicating more code. Note that although some names
 do _not_ change, name resolution is relative to the current
 module, so unchanged functions will now use our modified versions.

 - Override `make_archive` in sdist to use our patched version of the
 functions in archive_util.

 - Update make_tarball to look for SOURCE_DATE_EPOCH in the environment and
 set up a filter to modify mtime while tarring.
@joshuagl

joshuagl commented Feb 9, 2021

There's some excellent work towards this started in #2136, thanks @Carreau! Are you planning to pick this up? If not, perhaps I could help finish up this work?

We would like to be able to produce reproducible sdists for python-tuf. (Curious readers can see: theupdateframework/python-tuf#1269)

@Carreau
Contributor Author

Carreau commented Feb 9, 2021

> Are you planning to pick this up? If not, perhaps I could help finish up this work?

At some point, but I don't have much time these days. Feel free to take over.

@tiran
Contributor

tiran commented Mar 18, 2021

I'm interested in reproducible sdists, too. Reproducible artifacts make it much easier to verify the provenance of code.

@dalcinl
Contributor

dalcinl commented Aug 24, 2023

Just in case this is useful to others, I paste below a self-contained hunk of monkeypatching that allowed me to get reproducible (same sha256 hash) sdist tarballs. This hunk of code can be dumped in setup.py, for example.

# Support for Reproducible Builds
# https://reproducible-builds.org/docs/source-date-epoch/

import os

timestamp = os.environ.get('SOURCE_DATE_EPOCH')
if timestamp is not None:
    import distutils.archive_util as archive_util
    import stat
    import tarfile
    import time

    timestamp = float(max(int(timestamp), 0))

    class Time:
        @staticmethod
        def time():
            return timestamp
        @staticmethod
        def localtime(_=None):
            return time.localtime(timestamp)

    class TarInfoMode:
        def __get__(self, obj, objtype=None):
            return obj._mode
        def __set__(self, obj, stmd):
            ifmt = stat.S_IFMT(stmd)
            mode = stat.S_IMODE(stmd) & 0o7755
            obj._mode = ifmt | mode

    class TarInfoAttr:
        def __init__(self, value):
            self.value = value
        def __get__(self, obj, objtype=None):
            return self.value
        def __set__(self, obj, value):
            pass

    class TarInfo(tarfile.TarInfo):
        mode = TarInfoMode()
        mtime = TarInfoAttr(timestamp)
        uid = TarInfoAttr(0)
        gid = TarInfoAttr(0)
        uname = TarInfoAttr('')
        gname = TarInfoAttr('')

    def make_tarball(*args, **kwargs):
        tarinfo_orig = tarfile.TarFile.tarinfo
        try:
            tarfile.time = Time()
            tarfile.TarFile.tarinfo = TarInfo
            return archive_util.make_tarball(*args, **kwargs)
        finally:
            tarfile.time = time
            tarfile.TarFile.tarinfo = tarinfo_orig

    archive_util.ARCHIVE_FORMATS['gztar'] = (
        make_tarball, *archive_util.ARCHIVE_FORMATS['gztar'][1:],
    )

A few explanations follow:

  1. The timestamp value has to be converted to float. Keeping it as an int does not work: the final tarball will be missing the PAX header. This is because the code in tarfile assumes mtime is a float.
  2. I had to replace tarfile.time to prevent the current timestamp from being injected into the compressed gzip stream.
  3. The monkeypatch of TarInfo.mode may not be strictly necessary, but it helps with differing user umask settings.
  4. The username/groupname and userid/groupid information stored in the tarball has to be overridden. I originally picked user/users and 1000/100; following the recommendation from @haampie in the comment below, the username/groupname are now set to the empty string and userid/groupid to zero.
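
One way to check the "same sha256 hash" claim is to build the sdist twice and compare digests. The sketch below uses only `hashlib`; the helper names `sha256_of` and `same_digest` are hypothetical and not part of setuptools.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 16):
    """Return the hex SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large tarballs are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_digest(path_a, path_b):
    """Return True if both files have the same SHA-256 digest."""
    return sha256_of(path_a) == sha256_of(path_b)
```

For a meaningful check, the two builds should be produced from separate checkouts or at different times, with the same SOURCE_DATE_EPOCH value exported for both.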

PS: Maybe this approach is simple enough to incorporate into setuptools?

@Carreau
Contributor Author

Carreau commented Aug 28, 2023

@dalcinl thanks, that is great!

@haampie
Contributor

haampie commented Oct 5, 2023

Better to use uid = gid = 0, and set uname/gname to empty string.

Otherwise you're in for fun surprises when extracting the tarball as root on systems that have uid/gid 1000.

In particular, files that are executable only by their owner would now be executable by this uid-1000 user, which can be a security issue.

Python itself notably tries to change ownership in tarfile: https://github.com/python/cpython/blob/2bbbab212fb10b3aeaded188fb5d6c001fb4bf74/Lib/tarfile.py#L2530
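
As a small illustration of this recommendation, the same normalization can be written as an ordinary tarfile filter rather than a descriptor-based monkeypatch. This is a sketch only: the names `normalize_owner` and `tar_from_bytes` are hypothetical, and the 1000/100 ownership is set just to simulate a member created by a regular user.

```python
import io
import tarfile


def normalize_owner(tarinfo):
    """Reset ownership to uid = gid = 0 with empty user and group names."""
    tarinfo.uid = tarinfo.gid = 0
    tarinfo.uname = tarinfo.gname = ""
    return tarinfo


def tar_from_bytes(name, data, owner_filter=normalize_owner):
    """Build a one-member tar in memory, applying `owner_filter` to the member."""
    info = tarfile.TarInfo(name)
    info.size = len(data)
    # Simulate a member created by an ordinary user account.
    info.uid, info.gid = 1000, 100
    info.uname, info.gname = "user", "users"
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        tar.addfile(owner_filter(info), io.BytesIO(data))
    return buf.getvalue()
```

The same `normalize_owner` function can also be passed as `filter=` to `TarFile.add()` when adding files from disk.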

@dalcinl
Contributor

dalcinl commented Oct 5, 2023

> Better to use uid = gid = 0, and set uname/gname to empty string.

I've updated the code snippet as per your recommendation. Thanks.

@wimglenn
Contributor

wimglenn commented May 15, 2024

The snippet from @dalcinl is very helpful, thanks!

For those who normally use pyproject.toml and don't have a setup.py at all, I've adapted the idea into a build backend at https://github.com/wimglenn/setuptools-reproducible, which can be used like this:

[build-system]
requires = ["setuptools-reproducible"]
build-backend = "setuptools_reproducible"

Some notes:

  • This wraps setuptools, and otherwise behaves identically to the setuptools.build_meta backend.
  • I did not bother to patch distutils.archive_util; we can just patch the tarfile module directly. With PEP 517 we're already working in an isolated environment when the build backend hooks are called.
  • The patch of localtime seems unnecessary: it's only used by the TarFile.list method, which is not needed at build time. I've left it out.

It's tested on {macOS, Linux, Windows} x Py-3.{8,9,10,11,12}. It does not work in Python-3.7 (and I did not bother to investigate why, since 3.7 is EOL now).
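
To make "patch the tarfile module directly" concrete, here is a rough sketch of a scoped patch. It is not the code from setuptools-reproducible; the names `patched_tarinfo` and `FixedMtimeTarInfo` are hypothetical, and only mtime is pinned to keep the example short.

```python
import contextlib
import io
import tarfile


@contextlib.contextmanager
def patched_tarinfo(timestamp):
    """Temporarily make every TarFile create members with a fixed mtime."""

    class FixedMtimeTarInfo(tarfile.TarInfo):
        # Reads always return the fixed timestamp (as a float); assignments
        # made by tarfile are silently ignored.
        mtime = property(
            lambda self: float(timestamp),
            lambda self, value: None,
        )

    original = tarfile.TarFile.tarinfo
    tarfile.TarFile.tarinfo = FixedMtimeTarInfo
    try:
        yield
    finally:
        # Always restore the stock class, even if the build fails.
        tarfile.TarFile.tarinfo = original
```

Any archive written with `tarfile` inside the `with patched_tarinfo(...)` block gets the fixed mtime on members created through `TarFile.add()`; the stock `TarInfo` class is restored on exit.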
