Open-source, cross-platform, package and environment management, for any language
NumFOCUS says: "Conda is an open source package management system and environment management system for installing multiple versions of software packages and their dependencies and switching easily between them. It works on Linux, OS X and Windows, and was created for Python programs but can package and distribute any software."
- Who should care about conda?
- What is the problem?
- What are the other solutions?
- What is conda?
- Computational Scientist (numerical science, modeling)
- Data Analyst (ETL, EDA, data viz, comp stats, machine learning)
- Developer (re-usable code, testable, reproduceable results)
- SysAdmin/DevOps (install, config, audit, test, security)
- get started using Python for data analysis.
- experiment with pkgs without affecting system installs.
- run others' code with different versions, dependencies.
- test against multiple versions for portability testing.
- run pre-deployment testing, before using it in production.
- reproduce old analysis env for regression/migration testing.
- You are a data analyst with e.g. strong skills in statistics and domain knowledge.
- want to give up proprietary solution.
- need a quick way to set-up a scientific python stack.
- want an extensible ecosystem, reuse community code.
- need reproducible envs to rerun old analyses, reproduce results, regression test code in new envs.
- need the ability to switch between multiple versions, to collaborate with others.
You may need to build from source and install:
- python interpreter (IPython)
- vectorized array library (numpy, xarray)
- data store and file I/O library (h5py, pytables, netcdf4)
- visualization library (matplotlib, seaborn, bokeh)
- math, statistics library (scipy, statsmodels, scikit-learn)
- maybe mapping/GIS library (bokeh-geo, datashader)
- maybe an IDE (spyder, jupyter)
- a compiler tool-chain, or two (gcc, g++, gfortran, ifort, absoft, make, auto-conf, CMake, python, perl, bash)
- fragmentation: open-source ecosystems are composed of many independent projects developing in parallel
- dependencies: some packages depend on libraries written in MANY languages (e.g. numpy and LAPACK/MKL)
- versions: Thing1 and Thing2 depend on incompatible versions of Thing3; dependency graph hell, collaboration hell
- build: maintaining/updating compiler tool-chains probably not the best use of a data analyst
- install: maintaining/updating an install process probably not the best use of a data analyst
- DIY
- platform-specific package managers
- pip+virtualenv
- conda
Do It Yourself: Build Everything from Source, Cross Fingers, then Link!
- total control! Mwahahaha.
- build everything yourself from source
- maintain build and install toolchains/frameworks
- manually solve "dependency hell", aka "trial-and-error"
- universe: "Oh, you needed that stack for another version/platform? LOL, start over!"
Platform-specific package managers
- linux: apt, yum
- osx: macports, homebrew
- win: virtualbox? (gratuitous recursion joke?)
- VMs:
GOTO 2
(cf. bad joke: sure, why not?) - not good if you need cross-platform consistency
- pip installs python packages within any environment
- does not precisely track non-python dependencies well
- great if you have an existing system Python installation and you want to install packages in or on it.
- not so great for multi-language dependency management and cross-platform binary installations.
- Features
- Conda is not Anaconda
- Environment Manager
- Package Manager
- Package Builder
- Package Distributor
- Documentation
- does both: a tool to manage packages and environments
- multi-language: any language, many languages
- cross-platform: "OS-agnostic", Linux, OSX, Windows
- open-source: github.com, BSD License, source is python
- supported: Continuum Analytics, NumFOCUS, conda-forge
- adopted: most of the SciPy/PyData communities
- multi-channel: can install pkgs from many "channels"
- works offline: package repo can be local, use file URL
- roll-your-own: building your own conda pkgs from source
- package-anything: not just for source, can pkg/version data
Conda is not Anaconda
- Conda is an open-source tool for creating and managing packages and environments.
- Anaconda is a distribution of conda packages.
- Conda is a conda pkg included in the Anaconda distribution.
Conda Myths
- google search
"conda myths"
- read the first hit: https://jakevdp.github.io/blog/2016/08/25/conda-myths-and-misconceptions/
- Conda environments contain a collection of conda or pip packages
- Conda envs isolated from system installs, other user installs
- You can have many envs, each with different versions
- Modify one env, other envs NOT affected.
- Easily switch between environments
- Share/document/recreate your envs with
export
cmd
- Command line tool (CLI vs GUI) as primary interface
- Query and search the package index and user installation
- Add, update, and remove packages from Conda environments
- Installs pkgs under user home path (default), no admin needed
- Installs and updates packages and their dependencies
- You tell conda the high level package names you want
- Conda solves dependency problem for mutual compatibility
- repo.continuum.io is a large repository of conda packages, curated and distributed by Continuum Analytics, used as the default channel for installs and updates by all conda distributions.
- Anaconda is a complete conda package distribution for science, with hundreds of python packages for science.
- MiniConda is a minimum conda package distribution, with about 10 packages.
- conda-forge is a "community led collection of recipes, build infrastructure and distributions for the conda package manager."
For more on using conda see the following resources: