Precise versioning with local branches #118

Open · wants to merge 3 commits into base: next_release
Conversation

josiahjohnston (Contributor)

This enables clear records of local versions of software, which can be invaluable during R&D for customizations. For example, let's say I check out a current copy of the development branch, then add new modules and customize behavior to deal with edge cases and subtle bugs. Each commit I make may result in different solutions for the same dataset, but if every version is labeled as v2.0.4, I lack a clear record of which scenarios I need to re-execute, or how I generated a particular set of results.

PEP 440 explains the concept of local identifiers for this type of use case. In the development environment of my example, installing a copy of switch via pip install path/to/checkout will update the version from 2.0.4 to 2.0.4+[git_sha], or if I have uncommitted changes in the repository, it will be 2.0.4+[git_sha]+localmod. If the current git checkout is tagged as a release (having a git tag starting with 2 in our case), then the local modifier suffix is dropped.
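A minimal sketch of how such a local version string might be derived (the function name and suffix scheme are illustrative, not the PR's actual code; note that PEP 440 allows only one "+", so additional qualifiers join the local segment with dots, e.g. 2.0.4+abc1234.localmod):

```python
import subprocess

BASE_VERSION = "2.0.4"  # hard-coded fallback, as in version.py

def get_local_version(base=BASE_VERSION):
    """Return base, base+sha, or base+sha.localmod (PEP 440 local version)."""
    def git(*args):
        return subprocess.check_output(
            ("git",) + args, stderr=subprocess.DEVNULL
        ).decode().strip()
    try:
        sha = git("rev-parse", "--short", "HEAD")
        tags = git("tag", "--points-at", "HEAD").split()
        dirty = bool(git("status", "--porcelain"))
    except (OSError, subprocess.CalledProcessError):
        return base  # git missing or not a repository: fail gracefully
    if any(t.startswith("2") for t in tags) and not dirty:
        return base  # HEAD is exactly a tagged release: drop the suffix
    return base + "+" + sha + (".localmod" if dirty else "")
```

Because the function fails gracefully, it always returns something usable, whether or not the code is running inside a git checkout.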

This implementation should have no impact on "quickstart" instructions that install from pypi or conda repositories.

This implementation will try to find the precise local version (it relies on git being installed) and write it into switch_model/data/installed_version.txt in the installed package directory. If the attempt to call a git subprocess fails, it will print a warning and provide the base version recorded in switch_model/version.py. version.py will attempt to load installed_version.txt from the data directory and return that string if available; if unavailable, it will return the hard-coded version number. Finally, the version is written to the outputs directory to ensure a clear record for archival purposes. This version number is accessible in a) the pip catalog, b) switch --version, c) switch_model.__version__, and d) outputs/software_version.txt.
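The fallback logic in version.py might look roughly like this (a hedged sketch; the file paths follow the description above, and the function name is assumed):

```python
import os

__version__ = "2.0.4"  # base version hard-coded in version.py

def get_version():
    """Prefer the precise version stamped at install time in
    data/installed_version.txt; fall back to the base version."""
    path = os.path.join(
        os.path.dirname(os.path.abspath(__file__)),
        "data", "installed_version.txt",
    )
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:  # file not stamped (e.g., install without git)
        return __version__
```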

I've used this pattern successfully in other software for scientific computing & medical devices, and it has been a life-saver. The code used here has worked effectively in Mac & Linux environments, and can be compatible with docker packaging. It could use validation in a Windows environment (minimally a basic sniff test), but since it is a nonessential add-on that fails gracefully, I expect it could be integrated even if it doesn't work seamlessly in all development environments.

Additionally, I think we would be better served if pre-release branches update the hard-coded version from 2.0.4 to 2.0.4+next_release, or a similar indication that it isn't a packaged release, and hasn't received the same degree of scrutiny.

@josiahjohnston josiahjohnston changed the base branch from master to next_release August 8, 2019 18:59
@josiahjohnston (Contributor, Author)

One upshot of this commit is that installing in developer mode with --editable would be contraindicated if one wished to maintain clean records. That said, if you are doing quick edit/test cycles without committing, it wouldn't matter: clean records would be superfluous, and this won't help track snapshots of uncommitted code (it just flags them as uncommitted).

@mfripp (Member) commented Aug 9, 2019

Hmm, I definitely need to think a little about this. A few points:

  1. In practice, it seems like publicly published models can usually just use a particular released version of Switch. We should try to make releases often, so new features don't languish in unreleased software for long. This should be workable, and certainly simpler than trying to peg a published model to a particular commit.
  2. On the other hand, if we do want to peg a published model to a particular git commit, it is possible to do that, either by giving instructions to checkout that commit, or by including a copy of Switch in the model's repository as a submodule.
  3. Your changes seem to be focused more on re-running models as you update pre-released versions of Switch.
  4. This seems like a lot of extra stuff in general, and especially a lot of stuff to support that specialized use case (e.g., a new data directory within switch_model, which won't even be writeable in many cases, as well as a whole new numbering system). I would expect most users to use a vanilla version of Switch, and create their own custom modules in the study directory that can be managed along with the rest of the study data.
  5. Can this particular use case be managed differently, e.g., by using make to run switch to recreate your outputs, with a dependency on various modules and data files?
  6. I think the best practice is to update the version number whenever a branch is created for the next version. That way we can write and test the data upgrade scripts along with the rest of the code for that version. However, this only really works with a linear commit path. I don't know how we would handle data upgrades within feature branches. It may automatically be OK, as long as they all branch off the next version branch, rather than the master branch.
  7. I think the master branch in the main repository should always correspond to the currently released version of Switch. Users should always beware that anything other than master is prerelease. Again, as long as we merge feature/next-version branches into master and release them often, this shouldn't inconvenience colleagues who depend on near-cutting-edge features. And if they need prerelease software, they can just checkout that particular feature branch (possibly from a forked repository), install as developer, and pull as needed.

I haven't really thought through how all this relates to what you're doing in this branch, but I at least wanted to share my initial reactions.

@josiahjohnston (Contributor, Author)

Thanks for the quick feedback.

Re: Points 1-3
Yup, the local version suffix is primarily focused on automatically and accurately tracking code and results during the course of active development. It's also applicable to custom branches that never make it into the master branch.

For people who stick to official releases, the only impact will be an unambiguous record of which version of Switch was used to make their results, and a clear indication of whether they accidentally wandered into a branch that diverged from an official release.

In other projects, I've found accurate tracking (and recording) of local versions to be invaluable for expedient troubleshooting and for retrospectively understanding how results change as code evolves. While local versioning can be helpful for releasing results for a study (like the pegged git checkout strategy you describe), I've primarily used it for maintaining good records internally.

As far as I can remember, every study I've done or collaborated on has required some code customizations, only a subset of which ever made it into a master branch. This is both with Switch v2 & v1. I expect the only exceptions to the need for custom branches will be if every edit that is needed for a particular study is accepted into the master branch and tested for backwards compatibility (easier to guarantee if working solo or unilaterally, harder if working on a shared codebase).

Even in cases where people primarily wished to adjust inputs of an established study (like the Rhodium Group's extension of a Hawaiian study), they still required custom exports and other tweaks. As the codebase evolves and matures, the need to push the boundaries may reduce, but I don't expect it to fully disappear.

Re: Point 4
Yes, this has some extra stuff, but it all conforms to Python standards. And yes, package data is intended to be write-once during install. Data directories are another concept I've come to value from other projects, and are great for things like this, default configuration files, test data, or other data assets that commonly accompany a software project. There are other styles of setting up python data directories, but this is by far the most stable that I've found after considerable research and testing.

For people who use official releases of Switch, this will have no impact on the version they see.

While Switch 2.0 makes it possible to do any customization by writing new modules outside of the switch_model package (including copy + edit of core modules), I generally recommend learning git and committing to a branch because:

  • It promotes code versioning and incremental progress, a foundation of good practice in scientific computing and replicable science. Many (most) people I work with will not use version control if they start writing code in a new, empty directory, but are more inclined toward good habits in a pre-existing repository.
  • It drastically simplifies pull requests.
  • Pull requests of generally useful code are a great way of collaborating, giving back to the larger community, and reducing global workload.
  • Pull requests of region-specific subdirectories along with test cases reduce the chance of bitrot, where the core modules become incompatible with custom modules. That was the basis of our recommendation in how_to_collaborate.txt to make new subdirectories for individual regions/institutions (e.g., hawaii).

Re: Point 5
No, writing makefiles to peg specific versions is impractical. The goal is to have clear records of which version of a moving codebase I've used for a particular run (without an extra step of copying and pasting a manual record), not to retroactively give a recipe for replicating results after I've finished everything.

Re: Point 6
Agreed, but I'd use a base version number like 2.0.5-alpha for the pre-release branch that follows 2.0.4, rather than 2.0.5. In this use case, I also prefer either an automated local versioning system as implemented here, or other automated tools for version incrementing that bump the alpha suffix from .0 to .1, etc., before every git commit. Although, as you pointed out, a sequential versioning system breaks down with non-linear branches and merges. The local versioning system, in conjunction with a reasonable base version, addresses these complexities better than other approaches I've read about to date.

Re: Point 7
That isn't my top choice, but I could live with it. I prefer to have master be the branch that is moving toward the next release, and to rely on tags to specify which specific versions are releases. If people want released versions, they should install from conda/PyPI repos, or nab a tagged checkout for a particular release. That's the pattern I've seen and worked with most in various GitHub projects.

@josiahjohnston (Contributor, Author)

I forgot to respond to the data upgrade issue. I see support for data upgrades & backwards compatibility as strictly limited to official sequential releases.
I don't see a need for data upgrades on side branches with any of the use cases I'm familiar with, and wouldn't be able to comment on the feasibility of that without understanding specific use cases.

… exact version of code can be known: release[+gitsha]+localmod.

If git is available, the release will be based on the last tag in the git history that starts with "2". [+gitsha] will be ignored if this is exactly a release. +localmod will be dropped if there are no uncommitted modifications to the code.
If git is unavailable or the attempt to get more precise information fails for whatever reason, the base version recorded in version.py will be used.

Also, save the installed switch version in the output directory for improved record keeping and reproducibility, especially during active software development.
…ged version. Also use the git-standard "dirty" suffix instead of "localmod" for installations from code that hasn't been committed.
@mfripp (Member) commented Jul 17, 2020

Finally getting back to this pull request, and I forgot we even had this much discussion of it. I'll check back to your comments above, but after looking at the code, I'm inclined to simplify this a lot:

  • continue with PEP-440 public version numbers (N.N.N[{a|b|rc}N][.postN][.devN]) for setup.py, switch_inputs_version.txt and identification of upgrade scripts
    • these applications require sequential version numbering, which isn't possible with versions based on commit SHAs (different SHAs can be on different forks on the way to a later release)
    • this means upgrade scripts should always focus on upgrading from one release to another, not between micro-releases (makes sense, since SHAs don't follow a particular sequence anyway)
    • I'd be happy to add .devN suffixes, incremented every time we pull a feature into the next_release branch; that should also support some collaboration methods, but not everything you want
      • I will also rename next_release to develop, since we are basically following the 'gitflow' workflow.
  • if the current installation is not in a git repository, switch --version and outputs/model_config.json (coming soon!) report just the main version number.
  • if the current installation is in a git repository and doesn't match a public release, then
    • switch --version reports the main version number, plus current SHA1, plus a list of uncommitted changes (if any) in the repository and possibly a list of "local modules" (modules in the local directory or maybe the whole search path)
    • switch solve stores the main version number and a list of all uncommitted modules that are currently loaded (from switch_model or local) in model_config.json
    • maybe these could be generalized to report origin info (package name or repository/SHA1/commitment status) for any local modules (generally one-off customizations for a particular project)
    • at the moment I envision these as fairly free text from switch --version and key-value pairs under switch_version in model_config.json. That should give enough info for automated tools to decide what configuration is needed to reproduce a particular model (although they'll probably need to give up if the model used modules with uncommitted changes).
    • this info is looked up whenever version information is needed; there's no need to remember to update a file in the switch_model/data directory
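A runtime lookup along these lines could be sketched as follows (the function name and returned keys are illustrative, not an agreed interface):

```python
import subprocess

def repository_status(path="."):
    """Gather live git info for reporting in `switch --version` or
    model_config.json; return None when the directory is not inside
    a git repository (or git is not installed)."""
    def git(*args):
        return subprocess.check_output(
            ("git", "-C", path) + args, stderr=subprocess.DEVNULL
        ).decode().strip()
    try:
        return {
            "commit": git("rev-parse", "HEAD"),
            "uncommitted_changes": git("status", "--porcelain").splitlines(),
        }
    except (OSError, subprocess.CalledProcessError):
        return None  # clean installation outside a repository
```

Because the status is gathered at runtime, there is nothing to stamp at install time in this workflow.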

This simpler version would support

  • sequential versioning for packaging (PyPi) and data upgrade scripts
  • automatic identification of the exact version used for a particular model
    • if users are careful to commit changes before running the final model, this would enable easy re-installation via pip install https://github.com/switch-model/switch/archive/95c9d0f43b23.zip or similar
    • if users aren't careful about committing, this would at least give them an idea where to start for getting back to their original setup
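The key-value pairs under switch_version in model_config.json might take a shape like this (all key names and values are illustrative, not an agreed schema):

```python
import json

# Hypothetical record stored in outputs/model_config.json.
record = {
    "switch_version": {
        "version": "2.0.5.dev3",             # public PEP 440 version
        "repository_sha": "95c9d0f43b23",    # commit, when run from a repo
        "uncommitted_changes": [],           # tracked files with local edits
        "local_modules": ["custom_exports"], # one-off project modules
    }
}
text = json.dumps(record, indent=2)
```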

@mfripp (Member) commented Jul 22, 2020

@josiahjohnston, I think we have two fairly different workflows for using Switch, so I'm looking for something that will work for both. To do that, it would help to know a little more about your workflow.

The code in this branch seems to assume that you will run python setup.py or pip from inside the local git repository for your development copy of Switch, and that this will make a copy of the switch_model package in another location. This makes sense if you are using a virtual environment — you'd first create and activate the environment, then cd to the Switch repository, then run pip or setup.py there. But I wanted to ask, do you use that same workflow somehow for your Docker containers? Those seem to be self-contained, so I'm wondering how you migrate code from the Switch repository into them. Do you mount the host file system inside the Docker container, then cd to the Switch repository and run pip or setup.py to copy the code into the container? Or do you have some other procedure?

By the way, my workflow is generally to have one environment that I use for most active models, and I use pip install --editable . to give it access to the main Switch repository. It's kind of fast and loose, but it allows rapid turnaround between revising and running the model. For older models, I can then install a matching release of Switch, or possibly even a matching commit.

In your workflow, the git repository is visible when you run setup.py, but it's not visible at runtime (and the code doesn't change after you run setup.py). So you need to stamp the installation with the git status. In my workflow, the code may change after I run setup.py, but my code can see the git repository at runtime (that's where it runs from). So I can/must check the git status at runtime, and I don't want the installation to be pre-stamped with a local version number. I think it's possible to reconcile these, but I need to be clearer on how you're using Docker.

@josiahjohnston (Contributor, Author)

I used virtual environments instead of docker containers; docker containers were the next step up. I tried offering to set those up while I was still working on this in a professional capacity (not sure if I communicated that intent well), but never got around to it. It's not clear to me whether that would help usability for the target user base. Dockerfiles are easy enough to set up, and I might be able to pull one together if I stayed up late some night.

If you went with dockerfiles, then docker build would be analogous to pip install in a virtual environment, except you'd have archives of your prior builds. Each docker build could have its own uniquely tagged version, and you could keep as many of those as you wanted. If set up properly, they'd all share the same underlying layers, so additional builds wouldn't take up much disk space. I'd probably set up a script to encapsulate the docker build command and tag each image with the precise version number.

Yup, you are right about the impacts of --editable. That's fast and loose and has no way of tracking what code produced a given set of results. Fine for quick iterations where you are tracking a few things in your head; bad for archiving results and reconstructing them later. This traceability is important for quality lab notebooks, research publications, or public proceedings, since tiny changes to formulations can lead to big changes in outputs. This is especially true for people who don't have PhDs in energy modeling and 15+ years of experience writing and critically interpreting models.

Yup, data upgrade support wouldn't and shouldn't be applied to the precise versions that only differ in the git hash suffix. That functionality only applies if you bother bumping the version number.

All that being said, most people I've worked with are sloppy about git repos and traceability. I keep hoping people will up their game, possibly with the aid of data science curriculums and "Best Practices in Scientific Computing", but that's probably too optimistic. I regard this functionality as crucial for traceability and reproducibility in scientific computing. This is especially important when planning major long-term societal investments and the fate of our planet under global warming, since minor changes to models can produce wildly different results (whether by intention or accident), long-term models are often not numerically stable, and inputs have large uncertainties (both for the present and for long-term forecasts). But if most practitioners never bother to go through systematic processes, and most published policy papers on energy models decline to release their datasets or code, then I don't know if this feature matters from a practical perspective. And if your use cases involve releasing code and final runs with a single version of code, without needing traceability in your intermediate runs because you are that good, then maybe this isn't useful for you either.

I don't know what changes you are proposing or how they would impact the things I used day-to-day to solve my pain points. I'm not working with this codebase in a professional capacity now and don't have the bandwidth to contribute in any real way, or to take a deeper dive into how active or hypothetical energy modelers will use this software. If this PR seems useful to you or other users, then keep it. If not, do whatever seems useful. If I manage to return to this in the future, I'll take a look at the outcome and can always restore the portions I need for my process and workflow.

@mfripp (Member) commented Jul 24, 2020

Thanks, that's good to know. I may postpone this for now because it's getting complicated. For later reference, I think there is a strategy that could meet both of our needs (stamping a copy of Switch with repository status while copying it into a virtual environment, and also retrieving repository status directly from a developer install of Switch):

  • continue to record the main version number manually in setup.py, but add .devN suffixes as commits are added to the develop branch
  • add a function that retrieves repository status for the current package, if available (most recent commit, maybe a list of modified, tracked files)
  • call the repository status function after each model is solved and store the version number, repository status and other info in outputs/model_config.json
  • in setup.py, subclass setuptools.command.build_py.build_py (or maybe build); have this stamp repository info into switch_model/data/repository_status.txt
    • similar to https://www.anomaly.net.au/blog/running-pre-and-post-install-jobs-for-your-python-packages/
    • this needs to find repository status in the current working directory (as your current code does), not in the location where setup.py is stored, because pip copies the package to a temporary directory before calling setup.py; at that point, current working directory is the only way to know where the package came from
      • this needs to handle the case where someone installs from pypi or conda-forge but happens to be sitting in a Switch directory
        • this may mean we should stamp the distribution at an earlier stage, always before upload or copying; see below
  • whenever repository status is needed:
    • if Switch is running from a repository, gather live repository info
    • otherwise, if switch_model/data/repository_status.txt is available, use that
      • use importlib.resources.read_text('switch_model.data', 'repository_status.txt')
    • otherwise, report that this is a clean installation
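The build_py subclass idea could be sketched roughly as follows for setup.py (the class name is illustrative and error handling is minimal; this is a sketch of the strategy, not the actual implementation):

```python
import os
import subprocess
from setuptools.command.build_py import build_py

class StampRepositoryStatus(build_py):
    """Copy package sources as usual, then write the current repository
    SHA into switch_model/data/repository_status.txt in the build tree."""

    def run(self):
        build_py.run(self)
        try:
            # Look in the current working directory, since pip may run
            # setup.py from a temporary copy of the package; at that
            # point cwd is the only way to know where the package came from.
            sha = subprocess.check_output(
                ["git", "-C", os.getcwd(), "rev-parse", "HEAD"],
                stderr=subprocess.DEVNULL,
            ).decode().strip()
        except (OSError, subprocess.CalledProcessError):
            return  # not in a repository: leave the build unstamped
        target = os.path.join(self.build_lib, "switch_model", "data")
        os.makedirs(target, exist_ok=True)
        with open(os.path.join(target, "repository_status.txt"), "w") as f:
            f.write(sha + "\n")

# In setup.py: setup(..., cmdclass={"build_py": StampRepositoryStatus})
```

Guarding the case where someone installs from PyPI while sitting in a Switch checkout would need an extra check (e.g., that the cwd repository actually contains the package being built).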

I'm a little unsure how this fits with distributions, though. PyPI uses wheels, which could potentially be stamped with repository info during the build process. If the repository info is then reported as part of the version number, it may prevent the wheel from uploading to PyPI (probably a good thing). If it isn't, then we can freely upload a dev or final version without worrying about whether it has been committed to the repository yet (maybe a good thing, maybe not). On the other hand, the conda-forge package builds from the source distribution on PyPI. I don't think this goes through a 'build' phase before it is uploaded, so I'd need to find some other hook to stamp the source distribution.

staadecker pushed commits referencing this pull request on Jan 28 and Jan 29, 2023: Cleanup and minor fixes to get_inputs post process