Merge pull request #111 from akleeman/prepare-v0.1
Prepare v0.1
shoyer committed May 3, 2014
2 parents 9d09b43 + 9f15916 commit d7f4e96
Showing 13 changed files with 1,212 additions and 194 deletions.
51 changes: 17 additions & 34 deletions README.md
@@ -22,7 +22,7 @@ makes many powerful array operations possible:
dimensions (known in numpy as "broadcasting") based on dimension names,
regardless of their original order.
- Flexible split-apply-combine operations with groupby:
`x.groupby('time.dayofyear').apply(lambda y: y - y.mean())`.
`x.groupby('time.dayofyear').mean()`.
- Database-like alignment based on coordinate labels that smoothly
handles missing values: `x, y = xray.align(x, y, join='outer')`.
- Keep track of arbitrary metadata in the form of a Python dictionary:
@@ -38,9 +38,10 @@ Because **xray** implements the same data model as the NetCDF file format,
xray datasets have a natural and portable serialization format. But it's
also easy to robustly convert an xray `DataArray` to and from a numpy
`ndarray` or a pandas `DataFrame` or `Series`, providing compatibility with
the full [scientific-python ecosystem][scipy].
the full [PyData ecosystem][pydata].
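
As a rough illustration of that round trip (a sketch only: `'temperature'` and
`'time'` are made-up names, using `Dataset.from_dataframe`, `Dataset.to_dataframe`
and `DataArray.values`):

    import pandas as pd
    import xray

    # Build a small pandas DataFrame, convert it to an xray Dataset, then
    # round-trip back; 'temperature' and 'time' are illustrative names only.
    df = pd.DataFrame({'temperature': [10.0, 11.5, 9.8]},
                      index=pd.date_range('2000-01-01', periods=3, name='time'))

    ds = xray.Dataset.from_dataframe(df)   # pandas -> xray
    arr = ds['temperature'].values         # the underlying numpy ndarray
    df2 = ds.to_dataframe()                # xray -> pandas, indexed by 'time'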

[pandas]: http://pandas.pydata.org/
[pydata]: http://pydata.org/
[scipy]: http://scipy.org/
[ndarray]: http://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html

@@ -143,43 +144,34 @@ labeled numpy arrays that provided some guidance for the design of xray.
- Be fast. There shouldn't be a significant overhead for metadata-aware
manipulation of n-dimensional arrays, as long as the arrays are large
enough. The goal is to be as fast as pandas or raw numpy.
- Provide a uniform API for loading and saving scientific data in a variety
of formats (including streaming data).
- Take a pragmatic approach to metadata (attributes), and be very cautious
before implementing any functionality that relies on it. Automatically
maintaining attributes is tricky and very hard to get right (see
discussion about Iris above).
- Support loading and saving labeled scientific data in a variety of formats
(including streaming data).

## Getting started

For more details, see the **[full documentation][docs]** (still a work in
progress) or the source code. **xray** is rapidly maturing, but it is still in
its early development phase. ***Expect the API to change.***
For more details, see the **[full documentation][docs]**, particularly the
**[tutorial][tutorial]**.

xray requires Python 2.7 and recent versions of [numpy][numpy] (1.8.0 or
later) and [pandas][pandas] (0.13.1 or later). [netCDF4-python][nc4],
[pydap][pydap] and [scipy][scipy] are optional: they add support for reading
and writing netCDF files and/or accessing OpenDAP datasets. We plan to
eventually support Python 3 but aren't there yet. The easiest way to get any
of these dependencies installed from scratch is to use [Anaconda][anaconda].
eventually support Python 3 but aren't there yet.

xray is not yet available on the Python package index (prior to its initial
release). For now, you need to install it from source:
You can install xray from PyPI with pip:

git clone https://github.com/akleeman/xray.git
# WARNING: this will automatically upgrade numpy & pandas if necessary!
pip install -e xray

Don't forget to `git fetch` regular updates!
pip install xray

[docs]: http://xray.readthedocs.org/
[tutorial]: http://xray.readthedocs.org/en/latest/tutorial.html
[numpy]: http://www.numpy.org/
[pydap]: http://www.pydap.org/
[anaconda]: https://store.continuum.io/cshop/anaconda/

## Anticipated API changes

Aspects of the API that we currently intend to change:
Aspects of the API that we currently intend to change in future versions of
xray:

- The constructor for `DataArray` objects will probably change, so that it
is possible to create new `DataArray` objects without putting them into a
@@ -192,19 +184,10 @@
dimensional arrays.
- Future versions of xray will add better support for working with datasets
too big to fit into memory, probably by wrapping libraries like
[blaze][blaze]/[blz][blz] or [biggus][biggus]. More immediately:
- Array indexing will be made lazy, instead of immediately creating an
ndarray. This will make it easier to subsample from very large Datasets
incrementally using the `indexed` and `labeled` methods. We might need to
add a special method to allow for explicitly caching values in memory.
- We intend to support `Dataset` objects linked to NetCDF or HDF5 files on
disk to allow for incremental writing of data.

Once we get the API in a state we're comfortable with and improve the
documentation, we intend to release version 0.1. Our target is to do so before
the xray talk on May 3, 2014 at [PyData Silicon Valley][pydata].

[pydata]: http://pydata.org/sv2014/
[blaze][blaze]/[blz][blz] or [biggus][biggus]. More immediately, we intend
to support `Dataset` objects linked to NetCDF or HDF5 files on disk to
allow for incremental writing of data.

[blaze]: https://github.com/ContinuumIO/blaze/
[blz]: https://github.com/ContinuumIO/blz
[biggus]: https://github.com/SciTools/biggus
Binary file added doc/_static/opendap-prism-tmax.png
Binary file removed doc/_static/series_plot_example.png
47 changes: 39 additions & 8 deletions doc/api.rst
@@ -7,7 +7,7 @@ Dataset
-------

Creating a dataset
~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~
.. autosummary::
:toctree: generated/

@@ -20,8 +20,6 @@ Attributes and underlying data
.. autosummary::
:toctree: generated/

Dataset.variables
Dataset.virtual_variables
Dataset.coordinates
Dataset.noncoordinates
Dataset.dimensions
@@ -45,10 +43,14 @@ and values given by ``DataArray`` objects.
Dataset.copy
Dataset.iteritems
Dataset.itervalues
Dataset.virtual_variables

Comparisons
~~~~~~~~~~~

.. autosummary::
:toctree: generated/

Dataset.equals
Dataset.identical

@@ -58,8 +60,8 @@ Selecting
.. autosummary::
:toctree: generated/

Dataset.indexed_by
Dataset.labeled_by
Dataset.indexed
Dataset.labeled
Dataset.reindex
Dataset.reindex_like
Dataset.rename
@@ -74,12 +76,26 @@ IO / Conversion
.. autosummary::
:toctree: generated/

Dataset.dump
Dataset.to_netcdf
Dataset.dumps
Dataset.dump_to_store
Dataset.to_dataframe
Dataset.from_dataframe

Dataset internals
~~~~~~~~~~~~~~~~~

These attributes and classes provide a low-level interface for working
with Dataset variables. In general you should use the Dataset dictionary-like
interface instead and work with DataArray objects:

.. autosummary::
:toctree: generated/

Dataset.variables
Variable
Coordinate
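
As a rough sketch of the difference (assuming a Dataset ``ds`` with a variable
named ``'temperature'``; the name is illustrative only):

    da = ds['temperature']              # dictionary-like access returns a DataArray
    var = ds.variables['temperature']   # low-level access returns the underlying Variable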

Backends (experimental)
~~~~~~~~~~~~~~~~~~~~~~~

@@ -109,10 +125,24 @@ Attributes and underlying data
:toctree: generated/

DataArray.values
DataArray.as_index
DataArray.coordinates
DataArray.name
DataArray.dataset
DataArray.attrs
DataArray.encoding
DataArray.variable

NDArray attributes
~~~~~~~~~~~~~~~~~~

.. autosummary::
:toctree: generated/

DataArray.ndim
DataArray.shape
DataArray.size
DataArray.dtype

Selecting
~~~~~~~~~
@@ -123,8 +153,8 @@
DataArray.__getitem__
DataArray.__setitem__
DataArray.loc
DataArray.indexed_by
DataArray.labeled_by
DataArray.indexed
DataArray.labeled
DataArray.reindex
DataArray.reindex_like
DataArray.rename
@@ -150,6 +180,7 @@ Computations
DataArray.transpose
DataArray.T
DataArray.reduce
DataArray.get_axis_num
DataArray.all
DataArray.any
DataArray.argmax
5 changes: 3 additions & 2 deletions doc/conf.py
@@ -85,9 +85,10 @@ def __getattr__(cls, name):
extensions = [
'sphinx.ext.autodoc',
'sphinx.ext.autosummary',
'sphinx.ext.intersphinx',
'numpydoc',
'ipython_directive',
'ipython_console_highlighting'
'IPython.sphinxext.ipython_directive',
'IPython.sphinxext.ipython_console_highlighting',
]

autosummary_generate = True
64 changes: 42 additions & 22 deletions doc/data-structures.rst
@@ -1,54 +1,74 @@
Data structures
===============

``xray``'s core data structures are the ``Dataset``, ``Variable`` and
``DataArray``.
xray's core data structures are the :py:class:`~xray.Dataset`,
the :py:class:`~xray.Variable` (including its subclass
:py:class:`~xray.Coordinate`) and the :py:class:`~xray.DataArray`.

This document is intended as a technical summary of the xray data model. It
should be mostly of interest to advanced users interested in extending or
contributing to xray internals.

Dataset
-------

``Dataset`` is a netcdf-like object consisting of **variables** (a dictionary of
Variable objects) and **attributes** (an ordered dictionary) which together
form a self-describing data set.
:py:class:`~xray.Dataset` is a Python object representing a fully
self-described dataset of labeled N-dimensional arrays. It consists of:

1. **variables**: A dictionary of Variable objects.
2. **dimensions**: A dictionary giving the lengths of shared dimensions, which
are required to be consistent across all variables in a Dataset.
3. **attributes**: An ordered dictionary of metadata.

The design of the Dataset is based on the
`NetCDF <http://www.unidata.ucar.edu/software/netcdf/>`__ file format for
self-described scientific data. This is a data model that has become very
successful and widely used in the geosciences.

The Dataset is an intelligent container. It allows for simultaneous integer
or label-based indexing of all of its variables, supports split-apply-combine
operations with groupby, and can be converted to and from
:py:class:`pandas.DataFrame` objects.
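
A minimal sketch of that container behaviour, assuming a Dataset ``ds`` with a
``'time'`` dimension labeled by dates (the keyword-argument form of ``indexed``
and ``labeled`` shown here is an assumption based on the API reference):

    subset = ds.indexed(time=0)              # integer-based selection of every variable
    subset = ds.labeled(time='2000-01-01')   # label-based selection of every variable
    df = ds.to_dataframe()                   # convert to a pandas.DataFrame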

Variable
--------

``Variable`` implements **xray's** basic extended array object. It supports the
numpy ndarray interface, but is extended to support and use metadata. It
consists of:
:py:class:`~xray.Variable` implements xray's basic extended array object. It
supports the numpy ndarray interface, but is extended to support and use
basic metadata (not including coordinate values). It consists of:

1. **dimensions**: A tuple of dimension names.
2. **data**: The n-dimensional array (typically, of type ``numpy.ndarray``)
storing the array's data. It must have the same number of dimensions as the
length of the "dimensions" attribute.
2. **data**: The N-dimensional array (for example, of type
:py:class:`numpy.ndarray`) storing the array's data. It must have the same
number of dimensions as the length of the "dimensions" attribute.
3. **attributes**: An ordered dictionary of additional metadata to associate
with this array.

The main functional difference between Variables and numpy.ndarrays is that
The main functional difference between Variables and numpy arrays is that
numerical operations on Variables implement array broadcasting by dimension
name. For example, adding a Variable with dimensions `('time',)` to another
Variable with dimensions `('space',)` results in a new Variable with dimensions
`('time', 'space')`. Furthermore, numpy reduce operations like ``mean`` or
``sum`` are overridden to take a "dimension" argument instead of an "axis".
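
For example (a sketch only: the ``Variable(dimensions, data)`` constructor
signature used here is an assumption based on the component list above):

    import numpy as np
    import xray

    t = xray.Variable(('time',), np.arange(3))
    s = xray.Variable(('space',), np.arange(4))

    result = t + s                        # broadcasts by name: dimensions ('time', 'space')
    total = result.sum(dimension='time')  # reductions take a dimension name, not an axis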

Variables are light-weight objects used as the building block for datasets.
However, usually manipulating data in the form of a DataArray should be
preferred (see below), because they can use more complete metadata in the full
context of other dataset variables.
**However, manipulating data in the form of a Dataset or DataArray should
almost always be preferred** (see below), because they can use more complete
metadata in the context of coordinate labels.

DataArray
---------

``DataArray`` is a flexible hybrid of Dataset and Variable that attempts to
provide the best of both in a single object. Under the covers, DataArrays
are simply pointers to a dataset (the ``dataset`` attribute) and the name of a
"focus variable" in the dataset (the ``focus`` attribute), which indicates to
which variable array operations should be applied.
A :py:class:`~xray.DataArray` object is a multi-dimensional array with labeled
dimensions and coordinates. Coordinate labels give it additional power over the
Variable object, so it should be preferred for all high-level use.

Under the covers, DataArrays are simply pointers to a dataset (the ``dataset``
attribute) and the name of a variable in the dataset (the ``name`` attribute),
which indicates to which variable array operations should be applied.

DataArray objects implement the broadcasting rules of Variable objects, but
also use and maintain coordinates (aka "indices"). This means you can do
intelligent (and fast!) label-based indexing on DataArrays (via the
``.loc`` attribute), do flexible split-apply-combine operations with
``groupby`` and also easily export them to ``pandas.DataFrame`` or
``pandas.Series`` objects.
``groupby`` and convert them to or from :py:class:`pandas.Series` objects.
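
A short sketch of that high-level interface, assuming a DataArray ``temps``
with a datetime ``'time'`` coordinate (the variable name and labels are
illustrative):

    point = temps.loc['2000-01-05']                        # label-based indexing via .loc
    climatology = temps.groupby('time.dayofyear').mean()   # split-apply-combine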
