Distributing software module #139

Merged
merged 9 commits into from
Oct 2, 2024
45 changes: 45 additions & 0 deletions modules/distributing/distributing.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,45 @@
---
title: Distributing Software
type: reading
order: 4
---

# Distributing software (10 minutes)

How do you make it easy for someone else to obtain a copy and get it set up on their computer so that they can use it?

Modern software contsists of an often large collection of components (libraries, packages) that are combined together to form an application. This whole collection needs to be reproduced on the computer of the user for things to work. There are two ways of doing that: 1) combining them all together on the computer of the developer, and then wrapping everything up into a package, installer, container image, or VM image that is sent to the user, or 2) putting the components that you made yourself on the Internet (as a package), and relying on the user to download the other components (packages) and assembling it all together into a working application.

Suggested change
Modern software contsists of an often large collection of components (libraries, packages) that are combined together to form an application. This whole collection needs to be reproduced on the computer of the user for things to work. There are two ways of doing that: 1) combining them all together on the computer of the developer, and then wrapping everything up into a package, installer, container image, or VM image that is sent to the user, or 2) putting the components that you made yourself on the Internet (as a package), and relying on the user to download the other components (packages) and assembling it all together into a working application.
Modern software consists of an often large collection of components (libraries, packages) that are combined together to form an application. This whole collection needs to be reproduced on the computer of the user for things to work. There are two ways of doing that: 1) combining them all together on the computer of the developer, and then wrapping everything up into a package, installer, container image, or VM image that is sent to the user, or 2) putting the components that you made yourself on the Internet (as a package), and relying on the user to download the other components (packages) and assembling it all together into a working application.


## Monolithic applications

Option 1) works for applications that are more or less independent of each other. If they're used together, then it's by saving a file from one and opening it in another application. Each application contains all the bits it needs, and is installed on the user's computer in a separate folder, away from everything else. That means that different applications don't get in each other's way, but it's also rather inefficient if many applications use the same component, because you end up with many copies of that component.

If you do choose option 1), then you still have a choice between making a package, an installer, a container image, or a virtual machine image. A package is an archive (think a ZIP-file, which it often literally is) that contains, in this case, all the components needed by the application. Since it's just a file, a package needs to be installed by a special program called a package manager. The App Store or Play Store on your phone is such a program.
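To make the "a package is just an archive" idea concrete, here is a minimal sketch in Python. It builds a toy package as an in-memory ZIP file and lists what is inside. The layout and the `METADATA` file are hypothetical and much simpler than any real package format, although Python wheels (`.whl` files) are in fact ZIP archives.

```python
import io
import zipfile


def build_toy_package(name: str, version: str, dependencies: list[str]) -> bytes:
    """Return the bytes of a ZIP archive holding some code and a metadata file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # The component itself: a single, tiny Python module.
        archive.writestr(f"{name}/__init__.py", "def greet():\n    return 'hello'\n")
        # Hypothetical metadata describing the package and what it needs.
        metadata = [f"Name: {name}", f"Version: {version}"]
        metadata += [f"Requires: {dep}" for dep in dependencies]
        archive.writestr("METADATA", "\n".join(metadata) + "\n")
    return buffer.getvalue()


def list_package_contents(package_bytes: bytes) -> list[str]:
    """List the files stored inside the package archive."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as archive:
        return archive.namelist()


package = build_toy_package("toyapp", "1.0", ["numpy"])
print(list_package_contents(package))
```

A package manager does roughly the reverse: it opens the archive, reads the metadata, and copies the files to the right place on the user's computer.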

An installer is itself a computer program that also contains all the components needed by the application. It gets downloaded by the user, who then runs it, after which it copies all the components from within itself onto the user's computer. The application can then run there just like one installed from a package using a package manager.

A container image is a special kind of package. It also contains all the parts needed to run a program, but it is run in a special isolated environment called a container. A normal application can access everything else on the computer, including files and parts of other applications. It's set up to use its own components of course, but it could access other things if it wanted or needed to. An application that runs in a container can't do this, because it's isolated from everything else except for the operating system. This is an advantage, for example, if the software runs on a server that is accessible from the Internet, because it provides some security. It also makes it easy to run many copies of the software on many servers, so that you can serve many users.

Finally, a virtual machine (VM) is even more isolated. A VM image contains its own operating system together with the application, so that the running application cannot even access the operating system on the user's computer. This has similar advantages to a container and is more secure, but it's also slower than using containers.

So these are the different ways option 1), distributing a monolithic application with everything included, can be implemented. As said, this reduces potential compatibility problems, but isn't very efficient because you end up with many copies of everything.

## Separate packages

Option 2) is more efficient than option 1), because the user can just install each component once, and then every other component that needs it can use it. There are drawbacks here as well though. First, the user needs to figure out which components are needed for a particular application, and then install them one by one. This puts them in an unpleasant place called "dependency hell".

Dependency hell was mostly solved by the invention of package managers, which automate the process of downloading and installing the required components. Examples are pip, conda, apt, and Homebrew. If each component is put into a package with some metadata that describes which other packages it needs, then the package manager can do all that automatically, at least assuming that everything is Open Source and freely available online, because it cannot go to the shop to buy a license for everything. Still, often everything is Open Source and then this saves a huge amount of work. Dependency hell is not the only problem, however.
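The following sketch illustrates how a package manager could use that metadata to work out what to install, and in which order. It is a simplified illustration, not how pip, conda, apt, or Homebrew are actually implemented. The package names and the `INDEX` dictionary are made up, and the sketch assumes there are no circular dependencies.

```python
# Hypothetical package index: package name -> names of the packages it needs.
INDEX = {
    "myscript": ["plotlib", "numlib"],
    "plotlib": ["numlib", "imagelib"],
    "numlib": [],
    "imagelib": [],
}


def install_order(package, index, installed=None):
    """Return packages in an order where dependencies come before dependents.

    Assumes the index contains no circular dependencies.
    """
    if installed is None:
        installed = []
    # First make sure everything this package needs is installed...
    for dependency in index[package]:
        if dependency not in installed:
            install_order(dependency, index, installed)
    # ...then install the package itself, but only once.
    if package not in installed:
        installed.append(package)
    return installed


print(install_order("myscript", INDEX))
# prints ['numlib', 'imagelib', 'plotlib', 'myscript']
```

Note that `numlib` is needed by two packages but appears only once in the result, which is exactly the efficiency gain of option 2).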

Software is continuously developed, and that means that it changes over time. Those changes sometimes change how a component is used by other components, which then need to be updated too. So the user may end up with an older program that only works with an older version of component X, while they also want to use a different, newer program that works only with a newer version of X. A good package manager will give an error message in that case, but that doesn't solve the problem. Which version do you install?
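As a rough sketch of how such a conflict can be detected, the snippet below checks whether there is any version of X that satisfies every program at once. The program names and version ranges are invented for illustration, and real package managers support much richer version constraints than a single range.

```python
# Hypothetical requirements: program -> (lowest, highest) version of X it supports.
# Versions are (major, minor) tuples so that they compare in the right order.
REQUIREMENTS = {
    "old_program": ((1, 0), (1, 9)),
    "new_program": ((2, 0), (2, 5)),
}


def common_version_range(requirements):
    """Return the (lowest, highest) range all programs accept, or None on conflict."""
    lowest = max(low for low, _ in requirements.values())
    highest = min(high for _, high in requirements.values())
    if lowest > highest:
        return None  # no single version of X works for every program
    return lowest, highest


print(common_version_range(REQUIREMENTS))  # prints None: the two programs conflict
```

When the result is `None`, no single installation of X can serve both programs, which is the situation the question above is about.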

There are again two common solutions to this: distributions and environments. A distribution, like Ubuntu, is made by a group of people who create a collection of packages that are all compatible with each other, meaning that every package in it that uses package X works with the same version of package X, namely the one that's included in the distribution. This takes a significant amount of work, but it's very nice because you only have one version of everything, and maximal space efficiency. Of course there are still updates, but they happen only once every six months to several years, and then everything is updated at once. That does mean that you don't get the latest version right away, but also that things just work and don't suddenly break. (Cathedral!)

Another way to fix the multiple versions of X problem is to use environments. An environment is a separate part of the computer into which packages can be installed, in such a way that only packages within the environment are combined. So now you can install one application in one environment with one version of X, and the other application in another environment with another version of X. That costs more disk space, but it's easier to get the latest stuff, and it doesn't require all the work of constantly ensuring everything is compatible. So this makes option 2) look a bit more like option 1) again, although you can still have fewer environments than you have applications. (Bazaar!)
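For Python, one way to create such environments is the standard library's `venv` module. The sketch below creates two separate, empty environments in a temporary folder. The environment names are made up, and `with_pip=False` is used only to keep the example fast and offline; in practice you would normally want pip available inside the environment.

```python
import tempfile
import venv
from pathlib import Path


def create_environment(parent: Path, name: str) -> Path:
    """Create an empty virtual environment and return its directory."""
    env_dir = parent / name
    # with_pip=False keeps this sketch quick; a real environment would
    # usually include pip so that packages can be installed into it.
    venv.create(env_dir, with_pip=False)
    return env_dir


with tempfile.TemporaryDirectory() as tmp:
    env_old = create_environment(Path(tmp), "env-for-old-program")
    env_new = create_environment(Path(tmp), "env-for-new-program")
    # Each environment is its own folder with its own configuration file.
    print((env_old / "pyvenv.cfg").exists(), (env_new / "pyvenv.cfg").exists())
```

Each environment gets its own folder for installed packages, so one can hold the older version of X and the other the newer one.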

## Which option to choose when

Scientific software is often a script, which is basically the topmost component in the whole collection of components. Scripts mostly just tell other components what to do. Since the script isn't used by other components, it can be packaged as an application in either of the above-mentioned ways. Users can then install and run it to *reproduce* the results, but not easily use it in their own script or modify it to do something different but related.

Sometimes, scientists (or Research Software Engineers!) develop components that are intended for use by others in their scripts, or even in other components. Those need to be packaged as packages for a package manager, because they need to be combined with other packages on the user's computer. (The user is a programmer, in this case!) This allows the software to be *reused* by others in their scripts.

Finally, for others to be able to modify the software and perhaps contribute some new features or fixes back to it, the source code of the software needs to be available through a public repository. Package managers and installers don't normally install software in a way that makes it easy to modify, as that's not what they're designed for. To be able to modify the software, you need the source code, in a version control system. So besides publishing to a package or container repository, don't forget to make a public Git repository too!
19 changes: 19 additions & 0 deletions modules/distributing/exercise-tracking.md
@@ -0,0 +1,19 @@
---
title: Dependency tracking
type: exercise
order: 3
---

## Dependency tracking (10 minutes)

A common place to specify dependencies is in a file called `requirements.txt`, `pyproject.toml` or `environment.yml`.

Go into a source code repository of a piece of software you know and try to track down dependencies. Try to also find the source code of one of the dependencies and see if you can find the dependencies of this dependency. How many layers of this "dependency tree" can you follow?

You can also use one of the following projects:

- [ESMValTool](https://research-software-directory.org/software/esmvaltool)
- [LitStudy](https://research-software-directory.org/software/litstudy)
- [Haddock](https://research-software-directory.org/software/haddock3)
- [worcs](https://cjvanlissa.github.io/worcs/index.html)
- [democracy-topic-modelling](https://research-software-directory.org/software/democracy-topic-modelling)
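If the software is a Python package that is installed on your computer, you can also follow the dependency tree programmatically. The sketch below is one possible approach, using the standard library's `importlib.metadata`. The helper names `direct_dependencies` and `count_layers` are made up for this example, optional ("extra") dependencies are skipped, and packages that are not installed locally simply end the search, so the result may be lower than what you find by reading the repositories.

```python
import re
from importlib import metadata


def direct_dependencies(package: str) -> list[str]:
    """Names of the packages that an installed package declares it requires."""
    try:
        requirements = metadata.requires(package) or []
    except metadata.PackageNotFoundError:
        return []  # not installed here, so we cannot look any deeper
    names = []
    for requirement in requirements:
        if re.search(r"extra\s*==", requirement):
            continue  # skip optional dependencies
        # A requirement string starts with the package name, e.g. "numpy>=1.20".
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", requirement)
        if match:
            # Normalise the name so "Foo_Bar" and "foo-bar" are treated alike.
            names.append(re.sub(r"[-_.]+", "-", match.group(0)).lower())
    return names


def count_layers(package, lookup=direct_dependencies, path=()):
    """Number of layers in the dependency tree below `package` (0 if none)."""
    path = path + (package,)
    depths = [
        1 + count_layers(dependency, lookup, path)
        for dependency in lookup(package)
        if dependency not in path  # guard against circular dependencies
    ]
    return max(depths, default=0)


# The answer depends on what is installed on your computer.
print(count_layers("pip"))
```

The `lookup` parameter makes it possible to plug in another source of dependency information, for example one that reads `requirements.txt` files from a repository.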
7 changes: 7 additions & 0 deletions modules/distributing/further-reading.md
@@ -0,0 +1,7 @@
---
title: Further reading
type: reading
order: 5
---

- Blogpost: [Understanding the “Why” of VM’s, Containers, & Virtual Environments](https://medium.com/kitchen-sink-data-science/software-fundamentals-for-machine-learning-series-understanding-the-why-of-vms-containers-89621cf66d23) Blogpost on the difference between
13 changes: 13 additions & 0 deletions modules/distributing/index.md
@@ -0,0 +1,13 @@
---
title: Distributing Software
category: Good Practices
order: 15
abstract: Software needs to be distributed to be used by others. What are environments, packages and containers and how do they help?
author: eScience Center
thumbnail: "thumbnail-containers.jpg"
visibility: visible
---


Photo by <a href="https://unsplash.com/@frankiefoto?utm_content=creditCopyText&utm_medium=referral&utm_source=unsplash">frank mckenna</a> on <a href="https://unsplash.com/photos/assorted-color-filed-intermodal-containers-tjX_sniNzgQ?utm_content=creditCopyText&utm_medium=referral&utm_source=unsplash">Unsplash</a>

10 changes: 10 additions & 0 deletions modules/distributing/info.md
@@ -0,0 +1,10 @@
---
title: Learning objectives
type: info
order: 0
---

Obtain the skills and knowledge necessary to address the following questions:
- What is software distribution and what aspects of it are important for research software?
- Why is it important to think about dependency management?
- What are environments, dependencies, packages and containers?
Binary file added modules/distributing/media/fire.png
Binary file added modules/distributing/media/shopping-list.png
103 changes: 103 additions & 0 deletions modules/distributing/slides-distributing.md
@@ -0,0 +1,103 @@
---
title: Distributing Software
type: slides
order: 1
author: Jaro Camphuijsen, Lourens Veen
---

<!-- .slide: data-state="title" -->

# Distributing Software

===

<!-- .slide: data-state="standard" -->

## Why distribute?

- For your future self
- For others that might be interesting

interested

- For reproducibility
- For reusability

note:
There are many reasons why you would want to distribute your software.

===

<!-- .slide: data-state="standard" -->

## Why can't I just publish and be done?

- A piece of software never operates in isolation.
- Depends on other software (third party packages, libraries)
- Depends on system software (operating system, drivers, firmware)
- Depends on hardware (your computer and the chips inside, display or printer)
- The world (hardware, software, people) around your software is constantly evolving

note:
Software by nature always depends on other software and hardware.

===

<!-- .slide: data-state="standard" data-background-image="media/fire.png"-->

note: Sometimes you enter dependency hell

===

<!-- .slide: data-state="standard" -->

## What issues may arise?

- Many dependencies
- Long chains of dependencies
- Conflicting dependencies
- Circular dependencies
- Package manager dependencies
- Diamond dependency

... and all of these are changing.


===

<!-- .slide: data-state="standard" -->

## What solutions exist?

Isolation or specification

===

<!-- .slide: data-state="standard" -->

## Isolation

![Layers of isolation](media/distributing-software-layers.png)

===

<!-- .slide: data-state="standard" -->

## Specification

Let the user (or some tool) solve the problem...

- requirements.txt
- environment.yml
- pyproject.toml
- package.json
etc...

note:
Specify the dependencies in a file and let the user build their own environment, container or vm.

===

## Drawbacks

Large amount of isolation enhances reproducibility but decreases flexibility.

How are you defining flexibility here? Ability to use the software as a dependency somewhere else?



===