Merge pull request #111 from akleeman/prepare-v0.1
Prepare v0.1
shoyer committed May 3, 2014
2 parents 9d09b43 + 9f15916 commit d7f4e96
Showing 13 changed files with 1,212 additions and 194 deletions.
51 changes: 17 additions & 34 deletions README.md
@@ -22,7 +22,7 @@ makes many powerful array operations possible:
dimensions (known in numpy as "broadcasting") based on dimension names,
regardless of their original order.
- Flexible split-apply-combine operations with groupby:
`x.groupby('time.dayofyear').apply(lambda y: y - y.mean())`.
`x.groupby('time.dayofyear').mean()`.
- Database-like alignment based on coordinate labels that smoothly
handles missing values: `x, y = xray.align(x, y, join='outer')`.
- Keep track of arbitrary metadata in the form of a Python dictionary:
@@ -38,9 +38,10 @@ Because **xray** implements the same data model as the NetCDF file format,
xray datasets have a natural and portable serialization format. But it's
also easy to robustly convert an xray `DataArray` to and from a numpy
`ndarray` or a pandas `DataFrame` or `Series`, providing compatibility with
the full [scientific-python ecosystem][scipy].
the full [PyData ecosystem][pydata].
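
As a rough illustration of that round trip (a sketch only: `'temperature'` and
`'time'` are made-up names, using `Dataset.from_dataframe`, `Dataset.to_dataframe`
and `DataArray.values`):

    import pandas as pd
    import xray

    # Build a small pandas DataFrame, convert it to an xray Dataset, then
    # round-trip back; 'temperature' and 'time' are illustrative names only.
    df = pd.DataFrame({'temperature': [10.0, 11.5, 9.8]},
                      index=pd.date_range('2000-01-01', periods=3, name='time'))

    ds = xray.Dataset.from_dataframe(df)   # pandas -> xray
    arr = ds['temperature'].values         # the underlying numpy ndarray
    df2 = ds.to_dataframe()                # xray -> pandas, indexed by 'time'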

[pandas]: http://pandas.pydata.org/
[pydata]: http://pydata.org/
[scipy]: http://scipy.org/
[ndarray]: http://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html

@@ -143,43 +144,34 @@ labeled numpy arrays that provided some guidance for the design of xray.
- Be fast. There shouldn't be a significant overhead for metadata-aware
manipulation of n-dimensional arrays, as long as the arrays are large
enough. The goal is to be as fast as pandas or raw numpy.
- Provide a uniform API for loading and saving scientific data in a variety
of formats (including streaming data).
- Take a pragmatic approach to metadata (attributes), and be very cautious
before implementing any functionality that relies on it. Automatically
maintaining attributes is tricky and very hard to get right (see
discussion about Iris above).
- Support loading and saving labeled scientific data in a variety of formats
(including streaming data).

## Getting started

For more details, see the **[full documentation][docs]** (still a work in
progress) or the source code. **xray** is rapidly maturing, but it is still in
its early development phase. ***Expect the API to change.***
For more details, see the **[full documentation][docs]**, particularly the
**[tutorial][tutorial]**.

xray requires Python 2.7 and recent versions of [numpy][numpy] (1.8.0 or
later) and [pandas][pandas] (0.13.1 or later). [netCDF4-python][nc4],
[pydap][pydap] and [scipy][scipy] are optional: they add support for reading
and writing netCDF files and/or accessing OpenDAP datasets. We plan to
eventually support Python 3 but aren't there yet. The easiest way to get any
of these dependencies installed from scratch is to use [Anaconda][anaconda].
eventually support Python 3 but aren't there yet.

xray is not yet available on the Python package index (prior to its initial
release). For now, you need to install it from source:
You can install xray from PyPI with pip:

git clone https://github.com/akleeman/xray.git
# WARNING: this will automatically upgrade numpy & pandas if necessary!
pip install -e xray

Don't forget to `git fetch` regular updates!
pip install xray

[docs]: http://xray.readthedocs.org/
[tutorial]: http://xray.readthedocs.org/en/latest/tutorial.html
[numpy]: http://www.numpy.org/
[pydap]: http://www.pydap.org/
[anaconda]: https://store.continuum.io/cshop/anaconda/

## Anticipated API changes

Aspects of the API that we currently intend to change:
Aspects of the API that we currently intend to change in future versions of
xray:

- The constructor for `DataArray` objects will probably change, so that it
is possible to create new `DataArray` objects without putting them into a
@@ -192,19 +184,10 @@
dimensional arrays.
- Future versions of xray will add better support for working with datasets
too big to fit into memory, probably by wrapping libraries like
[blaze][blaze]/[blz][blz] or [biggus][biggus]. More immediately:
- Array indexing will be made lazy, instead of immediately creating an
ndarray. This will make it easier to subsample from very large Datasets
incrementally using the `indexed` and `labeled` methods. We might need to
add a special method to allow for explicitly caching values in memory.
- We intend to support `Dataset` objects linked to NetCDF or HDF5 files on
disk to allow for incremental writing of data.

Once we get the API in a state we're comfortable with and improve the
documentation, we intend to release version 0.1. Our target is to do so before
the xray talk on May 3, 2014 at [PyData Silicon Valley][pydata].

[pydata]: http://pydata.org/sv2014/
[blaze][blaze]/[blz][blz] or [biggus][biggus]. More immediately, we intend
to support `Dataset` objects linked to NetCDF or HDF5 files on disk to
allow for incremental writing of data.

[blaze]: https://github.com/ContinuumIO/blaze/
[blz]: https://github.com/ContinuumIO/blz
[biggus]: https://github.com/SciTools/biggus
Binary file added doc/_static/opendap-prism-tmax.png
Binary file removed doc/_static/series_plot_example.png
47 changes: 39 additions & 8 deletions doc/api.rst
@@ -7,7 +7,7 @@ Dataset
-------

Creating a dataset
~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~
.. autosummary::
:toctree: generated/

@@ -20,8 +20,6 @@ Attributes and underlying data
.. autosummary::
:toctree: generated/

Dataset.variables
Dataset.virtual_variables
Dataset.coordinates
Dataset.noncoordinates
Dataset.dimensions
@@ -45,10 +43,14 @@ and values given by ``DataArray`` objects.
Dataset.copy
Dataset.iteritems
Dataset.itervalues
Dataset.virtual_variables

Comparisons
~~~~~~~~~~~

.. autosummary::
:toctree: generated/

Dataset.equals
Dataset.identical

@@ -58,8 +60,8 @@ Selecting
.. autosummary::
:toctree: generated/

Dataset.indexed_by
Dataset.labeled_by
Dataset.indexed
Dataset.labeled
Dataset.reindex
Dataset.reindex_like
Dataset.rename
@@ -74,12 +76,26 @@ IO / Conversion
.. autosummary::
:toctree: generated/

Dataset.dump
Dataset.to_netcdf
Dataset.dumps
Dataset.dump_to_store
Dataset.to_dataframe
Dataset.from_dataframe

Dataset internals
~~~~~~~~~~~~~~~~~

These attributes and classes provide a low-level interface for working
with Dataset variables. In general you should use the Dataset dictionary-like
interface instead and work with DataArray objects:

.. autosummary::
:toctree: generated/

Dataset.variables
Variable
Coordinate
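
As a rough sketch of the difference (assuming a Dataset ``ds`` with a variable
named ``'temperature'``; the name is illustrative only):

    da = ds['temperature']              # dictionary-like access returns a DataArray
    var = ds.variables['temperature']   # low-level access returns the underlying Variable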

Backends (experimental)
~~~~~~~~~~~~~~~~~~~~~~~

@@ -109,10 +125,24 @@ Attributes and underlying data
:toctree: generated/

DataArray.values
DataArray.as_index
DataArray.coordinates
DataArray.name
DataArray.dataset
DataArray.attrs
DataArray.encoding
DataArray.variable

NDArray attributes
~~~~~~~~~~~~~~~~~~

.. autosummary::
:toctree: generated/

DataArray.ndim
DataArray.shape
DataArray.size
DataArray.dtype

Selecting
~~~~~~~~~
@@ -123,8 +153,8 @@
DataArray.__getitem__
DataArray.__setitem__
DataArray.loc
DataArray.indexed_by
DataArray.labeled_by
DataArray.indexed
DataArray.labeled
DataArray.reindex
DataArray.reindex_like
DataArray.rename
@@ -150,6 +180,7 @@ Computations
DataArray.transpose
DataArray.T
DataArray.reduce
DataArray.get_axis_num
DataArray.all
DataArray.any
DataArray.argmax
5 changes: 3 additions & 2 deletions doc/conf.py
@@ -85,9 +85,10 @@ def __getattr__(cls, name):
extensions = [
'sphinx.ext.autodoc',
'sphinx.ext.autosummary',
'sphinx.ext.intersphinx',
'numpydoc',
'ipython_directive',
'ipython_console_highlighting'
'IPython.sphinxext.ipython_directive',
'IPython.sphinxext.ipython_console_highlighting',
]

autosummary_generate = True
64 changes: 42 additions & 22 deletions doc/data-structures.rst
@@ -1,54 +1,74 @@
Data structures
===============

``xray``'s core data structures are the ``Dataset``, ``Variable`` and
``DataArray``.
xray's core data structures are the :py:class:`~xray.Dataset`,
the :py:class:`~xray.Variable` (including its subclass
:py:class:`~xray.Coordinate`) and the :py:class:`~xray.DataArray`.

This document is intended as a technical summary of the xray data model. It
should be mostly of interest to advanced users interested in extending or
contributing to xray internals.

Dataset
-------

``Dataset`` is a netcdf-like object consisting of **variables** (a dictionary of
Variable objects) and **attributes** (an ordered dictionary) which together
form a self-describing data set.
:py:class:`~xray.Dataset` is a Python object representing a fully
self-described dataset of labeled N-dimensional arrays. It consists of:

1. **variables**: A dictionary of Variable objects.
2. **dimensions**: A dictionary giving the lengths of shared dimensions, which
are required to be consistent across all variables in a Dataset.
3. **attributes**: An ordered dictionary of metadata.

The design of the Dataset is based on the
`NetCDF <http://www.unidata.ucar.edu/software/netcdf/>`__ file format for
self-described scientific data. This is a data model that has become very
successful and widely used in the geosciences.

The Dataset is an intelligent container. It allows for simultaneous integer
or label-based indexing of all of its variables, supports split-apply-combine
operations with groupby, and can be converted to and from
:py:class:`pandas.DataFrame` objects.
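
A minimal sketch of that container behaviour, assuming a Dataset ``ds`` with a
``'time'`` dimension labeled by dates (the keyword-argument form of ``indexed``
and ``labeled`` shown here is an assumption based on the API reference):

    subset = ds.indexed(time=0)              # integer-based selection of every variable
    subset = ds.labeled(time='2000-01-01')   # label-based selection of every variable
    df = ds.to_dataframe()                   # convert to a pandas.DataFrame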

Variable
--------

``Variable`` implements **xray's** basic extended array object. It supports the
numpy ndarray interface, but is extended to support and use metadata. It
consists of:
:py:class:`~xray.Variable` implements xray's basic extended array object. It
supports the numpy ndarray interface, but is extended to support and use
basic metadata (not including coordinate values). It consists of:

1. **dimensions**: A tuple of dimension names.
2. **data**: The n-dimensional array (typically, of type ``numpy.ndarray``)
storing the array's data. It must have the same number of dimensions as the
length of the "dimensions" attribute.
2. **data**: The N-dimensional array (for example, of type
:py:class:`numpy.ndarray`) storing the array's data. It must have the same
number of dimensions as the length of the "dimensions" attribute.
3. **attributes**: An ordered dictionary of additional metadata to associate
with this array.

The main functional difference between Variables and numpy.ndarrays is that
The main functional difference between Variables and numpy arrays is that
numerical operations on Variables implement array broadcasting by dimension
name. For example, adding a Variable with dimensions `('time',)` to another
Variable with dimensions `('space',)` results in a new Variable with dimensions
`('time', 'space')`. Furthermore, numpy reduce operations like ``mean`` or
``sum`` are overridden to take a "dimension" argument instead of an "axis".
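
For example (a sketch only: the ``Variable(dimensions, data)`` constructor
signature used here is an assumption based on the component list above):

    import numpy as np
    import xray

    t = xray.Variable(('time',), np.arange(3))
    s = xray.Variable(('space',), np.arange(4))

    result = t + s                        # broadcasts by name: dimensions ('time', 'space')
    total = result.sum(dimension='time')  # reductions take a dimension name, not an axis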

Variables are light-weight objects used as the building block for datasets.
However, usually manipulating data in the form of a DataArray should be
preferred (see below), because they can use more complete metadata in the full
context of other dataset variables.
**However, manipulating data in the form of a Dataset or DataArray should
almost always be preferred** (see below), because they can use more complete
metadata in the context of coordinate labels.

DataArray
---------

``DataArray`` is a flexible hybrid of Dataset and Variable that attempts to
provide the best of both in a single object. Under the covers, DataArrays
are simply pointers to a dataset (the ``dataset`` attribute) and the name of a
"focus variable" in the dataset (the ``focus`` attribute), which indicates to
which variable array operations should be applied.
A :py:class:`~xray.DataArray` object is a multi-dimensional array with labeled
dimensions and coordinates. Coordinate labels give it additional power over the
Variable object, so it should be preferred for all high-level use.

Under the covers, DataArrays are simply pointers to a dataset (the ``dataset``
attribute) and the name of a variable in the dataset (the ``name`` attribute),
which indicates to which variable array operations should be applied.

DataArray objects implement the broadcasting rules of Variable objects, but
also use and maintain coordinates (aka "indices"). This means you can do
intelligent (and fast!) label-based indexing on DataArrays (via the
``.loc`` attribute), do flexible split-apply-combine operations with
``groupby`` and also easily export them to ``pandas.DataFrame`` or
``pandas.Series`` objects.
``groupby`` and convert them to or from :py:class:`pandas.Series` objects.
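
A short sketch of that high-level interface, assuming a DataArray ``temps``
with a datetime ``'time'`` coordinate (the variable name and labels are
illustrative):

    point = temps.loc['2000-01-05']                        # label-based indexing via .loc
    climatology = temps.groupby('time.dayofyear').mean()   # split-apply-combine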
