diff --git a/README.md b/README.md
index 5c9d2732a82..63d84bfc10f 100644
--- a/README.md
+++ b/README.md
@@ -22,7 +22,7 @@ makes many powerful array operations possible:
   dimensions (known in numpy as "broadcasting") based on dimension names,
   regardless of their original order.
 - Flexible split-apply-combine operations with groupby:
-  `x.groupby('time.dayofyear').apply(lambda y: y - y.mean())`.
+  `x.groupby('time.dayofyear').mean()`.
 - Database like aligment based on coordinate labels that smoothly handles
   missing values: `x, y = xray.align(x, y, join='outer')`.
 - Keep track of arbitrary metadata in the form of a Python dictionary:
@@ -38,9 +38,10 @@ Because **xray** implements the same data model as the NetCDF file format,
 xray datasets have a natural and portable serialization format. But it's also
 easy to robustly convert an xray `DataArray` to and from a numpy `ndarray` or
 a pandas `DataFrame` or `Series`, providing compatibility with
-the full [scientific-python ecosystem][scipy].
+the full [PyData ecosystem][pydata].
 
 [pandas]: http://pandas.pydata.org/
+[pydata]: http://pydata.org/
 [scipy]: http://scipy.org/
 [ndarray]: http://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html
 
@@ -143,43 +144,34 @@ labeled numpy arrays that provided some guidance for the design of xray.
 - Be fast. There shouldn't be a significant overhead for metadata aware
   manipulation of n-dimensional arrays, as long as the arrays are large
   enough. The goal is to be as fast as pandas or raw numpy.
- - Provide a uniform API for loading and saving scientific data in a variety
-   of formats (including streaming data).
- - Take a pragmatic approach to metadata (attributes), and be very cautious
-   before implementing any functionality that relies on it. Automatically
-   maintaining attributes is a tricky and very hard to get right (see
-   discussion about Iris above).
+ - Support loading and saving labeled scientific data in a variety of formats
+   (including streaming data).
 
 ## Getting started
 
-For more details, see the **[full documentation][docs]** (still a work in
-progress) or the source code. **xray** is rapidly maturing, but it is still in
-its early development phase. ***Expect the API to change.***
+For more details, see the **[full documentation][docs]**, particularly the
+**[tutorial][tutorial]**.
 
 xray requires Python 2.7 and recent versions of [numpy][numpy] (1.8.0 or
 later) and [pandas][pandas] (0.13.1 or later). [netCDF4-python][nc4],
 [pydap][pydap] and [scipy][scipy] are optional: they add support for reading
 and writing netCDF files and/or accessing OpenDAP datasets. We plan to
-eventually support Python 3 but aren't there yet. The easiest way to get any
-of these dependencies installed from scratch is to use [Anaconda][anaconda].
+eventually support Python 3 but aren't there yet.
 
-xray is not yet available on the Python package index (prior to its initial
-release). For now, you need to install it from source:
+You can install xray from PyPI with pip:
 
-    git clone https://github.com/akleeman/xray.git
-    # WARNING: this will automatically upgrade numpy & pandas if necessary!
-    pip install -e xray
-
-Don't forget to `git fetch` regular updates!
+ pip install xray [docs]: http://xray.readthedocs.org/ +[tutorial]: http://xray.readthedocs.org/en/latest/tutorial.html [numpy]: http://www.numpy.org/ [pydap]: http://www.pydap.org/ [anaconda]: https://store.continuum.io/cshop/anaconda/ ## Anticipated API changes -Aspects of the API that we currently intend to change: +Aspects of the API that we currently intend to change in future versions of +xray: - The constructor for `DataArray` objects will probably change, so that it is possible to create new `DataArray` objects without putting them into a @@ -192,19 +184,10 @@ Aspects of the API that we currently intend to change: dimensional arrays. - Future versions of xray will add better support for working with datasets too big to fit into memory, probably by wrapping libraries like - [blaze][blaze]/[blz][blz] or [biggus][biggus]. More immediately: - - Array indexing will be made lazy, instead of immediately creating an - ndarray. This will make it easier to subsample from very large Datasets - incrementally using the `indexed` and `labeled` methods. We might need to - add a special method to allow for explicitly caching values in memory. - - We intend to support `Dataset` objects linked to NetCDF or HDF5 files on - disk to allow for incremental writing of data. - -Once we get the API in a state we're comfortable with and improve the -documentation, we intend to release version 0.1. Our target is to do so before -the xray talk on May 3, 2014 at [PyData Silicon Valley][pydata]. - -[pydata]: http://pydata.org/sv2014/ + [blaze][blaze]/[blz][blz] or [biggus][biggus]. More immediately, we intend + to support `Dataset` objects linked to NetCDF or HDF5 files on disk to + allow for incremental writing of data. + [blaze]: https://github.com/ContinuumIO/blaze/ [blz]: https://github.com/ContinuumIO/blz [biggus]: https://github.com/SciTools/biggus diff --git a/doc/_static/opendap-prism-tmax.png b/doc/_static/opendap-prism-tmax.png new file mode 100644 index 00000000000..3c98b35b82c Binary files /dev/null and b/doc/_static/opendap-prism-tmax.png differ diff --git a/doc/_static/series_plot_example.png b/doc/_static/series_plot_example.png deleted file mode 100644 index a789d088c8f..00000000000 Binary files a/doc/_static/series_plot_example.png and /dev/null differ diff --git a/doc/api.rst b/doc/api.rst index 6dca39f0b6f..189384feffb 100644 --- a/doc/api.rst +++ b/doc/api.rst @@ -7,7 +7,7 @@ Dataset ------- Creating a dataset -~~~~~~~~~~~~~~~~~ +~~~~~~~~~~~~~~~~~~ .. autosummary:: :toctree: generated/ @@ -20,8 +20,6 @@ Attributes and underlying data .. autosummary:: :toctree: generated/ - Dataset.variables - Dataset.virtual_variables Dataset.coordinates Dataset.noncoordinates Dataset.dimensions @@ -45,10 +43,14 @@ and values given by ``DataArray`` objects. Dataset.copy Dataset.iteritems Dataset.itervalues + Dataset.virtual_variables Comparisons ~~~~~~~~~~~ +.. autosummary:: + :toctree: generated/ + Dataset.equals Dataset.identical @@ -58,8 +60,8 @@ Selecting .. autosummary:: :toctree: generated/ - Dataset.indexed_by - Dataset.labeled_by + Dataset.indexed + Dataset.labeled Dataset.reindex Dataset.reindex_like Dataset.rename @@ -74,12 +76,26 @@ IO / Conversion .. autosummary:: :toctree: generated/ - Dataset.dump + Dataset.to_netcdf Dataset.dumps Dataset.dump_to_store Dataset.to_dataframe Dataset.from_dataframe +Dataset internals +~~~~~~~~~~~~~~~~~ + +These attributes and classes provide a low-level interface for working +with Dataset variables. 
In general you should use the Dataset dictionary-
+like interface instead and work with DataArray objects:
+
+.. autosummary::
+   :toctree: generated/
+
+   Dataset.variables
+   Variable
+   Coordinate
+
+
 Backends (experimental)
 ~~~~~~~~~~~~~~~~~~~~~~~
 
@@ -109,10 +125,24 @@ Attributes and underlying data
    :toctree: generated/
 
    DataArray.values
+   DataArray.as_index
    DataArray.coordinates
    DataArray.name
    DataArray.dataset
    DataArray.attrs
+   DataArray.encoding
+   DataArray.variable
+
+NDArray attributes
+~~~~~~~~~~~~~~~~~~
+
+.. autosummary::
+   :toctree: generated/
+
+   DataArray.ndim
+   DataArray.shape
+   DataArray.size
+   DataArray.dtype
 
 Selecting
 ~~~~~~~~~
@@ -123,8 +153,8 @@ Selecting
    DataArray.__getitem__
    DataArray.__setitem__
    DataArray.loc
-   DataArray.indexed_by
-   DataArray.labeled_by
+   DataArray.indexed
+   DataArray.labeled
    DataArray.reindex
    DataArray.reindex_like
    DataArray.rename
@@ -150,6 +180,7 @@ Computations
    DataArray.transpose
    DataArray.T
    DataArray.reduce
+   DataArray.get_axis_num
    DataArray.all
    DataArray.any
    DataArray.argmax
diff --git a/doc/conf.py b/doc/conf.py
index 243cd6a3122..4d94e81f46f 100644
--- a/doc/conf.py
+++ b/doc/conf.py
@@ -85,9 +85,10 @@ def __getattr__(cls, name):
 extensions = [
     'sphinx.ext.autodoc',
     'sphinx.ext.autosummary',
+    'sphinx.ext.intersphinx',
     'numpydoc',
-    'ipython_directive',
-    'ipython_console_highlighting'
+    'IPython.sphinxext.ipython_directive',
+    'IPython.sphinxext.ipython_console_highlighting',
 ]
 
 autosummary_generate = True
diff --git a/doc/data-structures.rst b/doc/data-structures.rst
index 959d4215737..6b723a12947 100644
--- a/doc/data-structures.rst
+++ b/doc/data-structures.rst
@@ -1,31 +1,50 @@
 Data structures
 ===============
 
-``xray``'s core data structures are the ``Dataset``, ``Variable`` and
-``DataArray``.
+xray's core data structures are the :py:class:`~xray.Dataset`,
+the :py:class:`~xray.Variable` (including its subclass
+:py:class:`~xray.Coordinate`) and the :py:class:`~xray.DataArray`.
+
+This document is intended as a technical summary of the xray data model. It
+should be mostly of interest to advanced users interested in extending or
+contributing to xray internals.
 
 Dataset
 -------
 
-``Dataset`` is netcdf-like object consisting of **variables** (a dictionary of
-Variable objects) and **attributes** (an ordered dictionary) which together
-form a self-describing data set.
+:py:class:`~xray.Dataset` is a Python object representing a fully self-
+described dataset of labeled N-dimensional arrays. It consists of:
+
+1. **variables**: A dictionary of Variable objects.
+2. **dimensions**: A dictionary giving the lengths of shared dimensions, which
+   are required to be consistent across all variables in a Dataset.
+3. **attributes**: An ordered dictionary of metadata.
+
+The design of the Dataset is based on the
+`NetCDF <http://www.unidata.ucar.edu/software/netcdf/>`__ file format for
+self-described scientific data. This is a data model that has become very
+successful and widely used in the geosciences.
+
+The Dataset is an intelligent container. It allows for simultaneous integer
+or label based indexing of all of its variables, supports split-apply-combine
+operations with groupby, and can be converted to and from
+:py:class:`pandas.DataFrame` objects.
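+
+As a minimal sketch of how these three components surface on a Dataset (it
+uses the tuple-based construction API from the tutorial; the commented values
+are what we expect, not verified output):
+
+::
+
+    import numpy as np
+    import pandas as pd
+    import xray
+
+    ds = xray.Dataset({'time': ('time', pd.date_range('2000-01-01', periods=3)),
+                       'foo': (['time', 'space'], np.zeros((3, 4)))})
+    ds.dimensions                   # expect {'time': 3, 'space': 4}
+    ds.variables['foo'].dimensions  # expect ('time', 'space')
+    ds.attrs                        # ordered dictionary of metadata, empty here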
 
 Variable
 --------
 
-``Variable`` implements **xray's** basic extended array object. It supports the
-numpy ndarray interface, but is extended to support and use metadata. It
-consists of:
+:py:class:`~xray.Variable` implements xray's basic extended array object. It
+supports the numpy ndarray interface, but is extended to support and use
+basic metadata (not including coordinate values). It consists of:
 
 1. **dimensions**: A tuple of dimension names.
-2. **data**: The n-dimensional array (typically, of type ``numpy.ndarray``)
-   storing the array's data. It must have the same number of dimensions as the
-   length of the "dimensions" attribute.
+2. **data**: The N-dimensional array (for example, of type
+   :py:class:`numpy.ndarray`) storing the array's data. It must have the same
+   number of dimensions as the length of the "dimensions" attribute.
 3. **attributes**: An ordered dictionary of additional metadata to associate
    with this array.
 
-The main functional difference between Variables and numpy.ndarrays is that
+The main functional difference between Variables and numpy arrays is that
 numerical operations on Variables implement array broadcasting by dimension
 name. For example, adding an Variable with dimensions `('time',)` to another
 Variable with dimensions `('space',)` results in a new Variable with dimensions
@@ -33,22 +52,23 @@ Variable with dimensions `('space',)` results in a new Variable with dimensions
 ``sum`` are overwritten to take a "dimension" argument instead of an "axis".
 
 Variables are light-weight objects used as the building block for datasets.
-However, usually manipulating data in the form of a DataArray should be
-preferred (see below), because they can use more complete metadata in the full
-of other dataset variables.
+**However, manipulating data in the form of a Dataset or DataArray should
+almost always be preferred** (see below), because they can use more complete
+metadata in the context of coordinate labels.
 
 DataArray
 ---------
 
-``DataArray`` is a flexible hybrid of Dataset and Variable that attempts to
-provide the best of both in a single object. Under the covers, DataArrays
-are simply pointers to a dataset (the ``dataset`` attribute) and the name of a
-"focus variable" in the dataset (the ``focus`` attribute), which indicates to
-which variable array operations should be applied.
+A :py:class:`~xray.DataArray` object is a multi-dimensional array with labeled
+dimensions and coordinates. Coordinate labels give it additional power over the
+Variable object, so it should be preferred for all high-level use.
+
+Under the covers, DataArrays are simply pointers to a dataset (the ``dataset``
+attribute) and the name of a variable in the dataset (the ``name`` attribute),
+which indicates to which variable array operations should be applied.
 
 DataArray objects implement the broadcasting rules of Variable objects, but
 also use and maintain coordinates (aka "indices"). This means you can do
 intelligent (and fast!) label based indexing on DataArrays (via the ``.loc``
 attribute), do flexibly split-apply-combine operations with
-``groupby`` and also easily export them to ``pandas.DataFrame`` or
-``pandas.Series`` objects.
\ No newline at end of file
+``groupby`` and convert them to or from :py:class:`pandas.Series` objects.
diff --git a/doc/getting-started.rst b/doc/getting-started.rst
deleted file mode 100644
index b68d01561be..00000000000
--- a/doc/getting-started.rst
+++ /dev/null
@@ -1,114 +0,0 @@
-Getting Started
-===============
-
-.. ipython:: python
-   :suppress:
-
-    import numpy as np
-    np.random.seed(123456)
-
-Creating a ``Dataset``
-----------------------
-
-Let's create some ``XArray`` objects and put them in a ``Dataset``:
-
-.. 
ipython:: python - - import xray - import numpy as np - import pandas as pd - time = xray.XArray('time', pd.date_range('2010-01-01', periods=365)) - us_state = xray.XArray('us_state', ['WA', 'OR', 'CA', 'NV']) - temp_data = (30 * np.cos(np.pi * np.linspace(-1, 1, 365).reshape(-1, 1)) - + 5 * np.arange(5, 9).reshape(1, -1) - + 3 * np.random.randn(365, 4)) - temperature = xray.XArray( - ['time', 'us_state'], temp_data, attributes={'units': 'degrees_F'}) - avg_rain = xray.XArray( - 'us_state', [27.66, 37.39, 17.28, 7.87], {'units': 'inches/year'}) - ds = xray.Dataset({'time': time, 'temperature': temperature, - 'us_state': us_state, 'avg_rain': avg_rain}, - attributes={'title': 'example dataset'}) - ds - -This dataset contains two non-coordinate variables, ``temperature`` and -``avg_rain``, as well as the coordinates ``time`` and ``us_state``. - -We can now access the contents of ``ds`` as self-described ``DataArray`` -objects: - -.. ipython:: python - - ds['temperature'] - -As you might guess, ``Dataset`` acts like a dictionary of variables. We -dictionary syntax to modify dataset variables in-place: - -.. ipython:: python - - ds['foo'] = ('us_state', 0.1 * np.random.rand(4)) - ds - del ds['foo'] - ds - -On the first line, we used a shortcut: we specified the 'foo' variable by -a tuple of the arguments for ``XArray`` instead of an ``XArray`` object. -This works, because a dataset can contain only ``XArray`` objects. - -We can also access some derived variables from time dimensions without -actually needing to put them in our dataset: - -.. ipython:: python - - ds['time.dayofyear'] - -Dataset math ------------- - -We can manipulate variables in a dataset like numpy arrays, while still -keeping track of their metadata: - -.. ipython:: python - - np.tan((ds['temperature'] + 10) ** 2) - -Sometimes, we really want just the plain numpy array. That's easy, too: - -.. ipython:: python - - ds['temperature'].data - -An advantage of sticking with dataset arrays is that we can use dimension -based broadcasting instead of numpy's shape based broadcasting: - -.. ipython:: python - - # this wouldn't work in numpy, because both these variables are 1d: - ds['time.month'] * ds['avg_rain'] - -We can also apply operations across dimesions by name instead of using -axis numbers: - -.. ipython:: python - - ds['temperature'].mean('time') - -Integration with ``pandas`` ---------------------------- - -Turning a dataset into a ``pandas.DataFrame`` broadcasts all the variables -over all dimensions: - -.. ipython:: python - - df = ds.to_dataframe() - df.head() - -Using the ``plot`` method on pandas objects is almost certainly the easiest way -to plot xray objects: - -.. ipython:: python - - # ds['temperature'].to_series() would work in place of df['temperature'] here - @savefig series_plot_example.png width=6in - df['temperature'].unstack('us_state').plot() diff --git a/doc/index.rst b/doc/index.rst index 62a5e2114ac..f3fa377dea9 100644 --- a/doc/index.rst +++ b/doc/index.rst @@ -1,14 +1,61 @@ xray: extended arrays for working with scientific datasets in Python ==================================================================== -**xray** is a Python package for working with aligned sets of homogeneous, -n-dimensional arrays. It implements flexible array operations and dataset -manipulation for in-memory datasets within the Common Data Model widely -used for self-describing scientific data (e.g., the NetCDF file format). +**xray** is a Python package for working with aligned sets of +homogeneous, n-dimensional arrays. 
It implements flexible array
+operations and dataset manipulation for in-memory datasets within the
+`Common Data
+Model <http://www.unidata.ucar.edu/software/thredds/current/netcdf-java/CDM/>`__
+widely used for self-describing scientific data (e.g., the
+`NetCDF <http://www.unidata.ucar.edu/software/netcdf/>`__ file
+format).
 
-For a longer introduction to **xray**, see the project's README on GitHub_.
+Why xray?
+---------
 
-.. _GitHub: https://github.com/akleeman/xray
+Adding dimension names and coordinate values to numpy's
+`ndarray <http://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html>`__
+makes many powerful array operations possible:
+
+- Apply operations over dimensions by name: ``x.sum('time')``.
+- Select values by label instead of integer location:
+  ``x.loc['2014-01-01']`` or ``x.labeled(time='2014-01-01')``.
+- Mathematical operations (e.g., ``x - y``) vectorize across multiple
+  dimensions (known in numpy as "broadcasting") based on dimension
+  names, regardless of their original order.
+- Flexible split-apply-combine operations with groupby:
+  ``x.groupby('time.dayofyear').mean()``.
+- Database-like alignment based on coordinate labels that smoothly
+  handles missing values: ``x, y = xray.align(x, y, join='outer')``.
+- Keep track of arbitrary metadata in the form of a Python dictionary:
+  ``x.attrs``.
+
+**xray** aims to provide a data analysis toolkit as powerful as
+`pandas <http://pandas.pydata.org/>`__ but designed for working with
+homogeneous N-dimensional arrays instead of tabular data. Indeed, much
+of its design and internal functionality (in particular, fast indexing)
+is shamelessly borrowed from pandas.
+
+Because **xray** implements the same data model as the NetCDF file
+format, xray datasets have a natural and portable serialization format.
+But it's also easy to robustly convert an xray ``DataArray`` to and from
+a numpy ``ndarray`` or a pandas ``DataFrame`` or ``Series``, providing
+compatibility with the full `PyData ecosystem <http://pydata.org/>`__.
+
+For a longer introduction to **xray** and its design goals, see
+`the project's GitHub page <http://github.com/akleeman/xray>`__. The GitHub
+page is where to go to look at the code, report a bug or make your own
+contribution. You can also get in touch via `Twitter
+<http://twitter.com/shoyer>`__.
+
+.. note::
+
+   **xray** is still very new -- it is on its first release and is only a few
+   months old. Although we will make a best effort to maintain the current
+   API, it is likely that the API will change in future versions as xray
+   matures. Some changes are already anticipated, as called out in the
+   `Tutorial <http://xray.readthedocs.org/en/latest/tutorial.html>`_ and the
+   project `README <http://github.com/akleeman/xray>`__.
 
 Contents
 --------
@@ -16,6 +63,7 @@ Contents
 .. toctree::
    :maxdepth: 1
 
+   installing
+   tutorial
    data-structures
-   getting-started
    api
diff --git a/doc/installing.rst b/doc/installing.rst
new file mode 100644
index 00000000000..8a67a4d6cc1
--- /dev/null
+++ b/doc/installing.rst
@@ -0,0 +1,24 @@
+Installing xray
+===============
+
+xray requires Python 2.7 and recent versions of
+`numpy <http://www.numpy.org/>`__ (1.8.0 or later) and
+`pandas <http://pandas.pydata.org/>`__ (0.13.1 or later).
+`netCDF4-python <https://github.com/Unidata/netcdf4-python>`__,
+`pydap <http://www.pydap.org/>`__ and `scipy <http://scipy.org/>`__ are
+optional: they add support for reading and writing netCDF files and/or
+accessing OpenDAP datasets.
+
+The easiest way to get all these dependencies installed is to use the
+`Anaconda python distribution <https://store.continuum.io/cshop/anaconda/>`__.
+
+To install xray, use pip:
+
+::
+
+    pip install xray
+
+.. warning::
+
+    If you don't already have recent versions of numpy and pandas installed,
+    installing xray will automatically update them.
diff --git a/doc/tutorial.rst b/doc/tutorial.rst
new file mode 100644
index 00000000000..d8339cef927
--- /dev/null
+++ b/doc/tutorial.rst
@@ -0,0 +1,954 @@
+Tutorial
+========
+
+.. 
ipython:: python
+   :suppress:
+
+    import numpy as np
+    np.random.seed(123456)
+
+To get started, we will import numpy, pandas and xray:
+
+.. ipython:: python
+
+    import numpy as np
+    import pandas as pd
+    import xray
+
+``Dataset`` objects
+-------------------
+
+:py:class:`xray.Dataset` is xray's primary data structure. It is a dict-like
+container of labeled arrays (:py:class:`xray.DataArray` objects) with aligned
+dimensions. It is designed as an in-memory representation of the data model
+from the `NetCDF`__ file format.
+
+__ http://www.unidata.ucar.edu/software/netcdf/
+
+Creating a ``Dataset``
+~~~~~~~~~~~~~~~~~~~~~~
+
+To make an :py:class:`xray.Dataset` from scratch, pass in a dictionary with
+values in the form ``(dimensions, data[, attributes])``.
+
+- `dimensions` should be a sequence of strings.
+- `data` should be a numpy.ndarray (or array-like object) that has a
+  dimensionality equal to the length of the dimensions list.
+
+.. ipython:: python
+
+    foo_values = np.random.RandomState(0).rand(3, 4)
+    times = pd.date_range('2000-01-01', periods=3)
+
+    ds = xray.Dataset({'time': ('time', times),
+                       'foo': (['time', 'space'], foo_values)})
+    ds
+
+You can also insert :py:class:`xray.Variable` or :py:class:`xray.DataArray`
+objects directly into a ``Dataset``, or create a dataset from a
+:py:class:`pandas.DataFrame` with
+:py:meth:`Dataset.from_dataframe <xray.Dataset.from_dataframe>` or from a
+NetCDF file on disk with :py:func:`~xray.open_dataset`. See
+`Working with pandas`_ and `Serialization and IO`_.
+
+``Dataset`` contents
+~~~~~~~~~~~~~~~~~~~~
+
+:py:class:`~xray.Dataset` implements the Python dictionary interface, with
+values given by :py:class:`xray.DataArray` objects:
+
+.. ipython:: python
+
+    'foo' in ds
+
+    ds.keys()
+
+    ds['time']
+
+The valid keys include each listed "coordinate" and "noncoordinate".
+Coordinates are arrays that label values along a particular dimension; they
+do so by keeping track of a :py:class:`pandas.Index` object. They
+are created automatically from dataset arrays whose name is equal to the
+single item in their list of dimensions.
+
+Noncoordinates include all arrays in a ``Dataset`` other than its coordinates.
+These arrays can exist along multiple dimensions. The numbers in the columns in
+the ``Dataset`` representation indicate the order in which dimensions appear
+for each array (on a ``Dataset``, the dimensions are always listed in
+alphabetical order).
+
+We didn't explicitly include a coordinate for the "space" dimension, so it
+was filled with an array of ascending integers of the proper length:
+
+.. ipython:: python
+
+    ds['space']
+
+    ds['foo']
+
+Noncoordinates and coordinates are listed explicitly by the
+:py:attr:`~xray.Dataset.noncoordinates` and
+:py:attr:`~xray.Dataset.coordinates` attributes.
+
+There are also a few derived variables based on datetime coordinates that you
+can access from a dataset (e.g., "year", "month" and "day"), even if you didn't
+explicitly add them. These are known as
+":py:attr:`~xray.Dataset.virtual_variables`":
+
+.. ipython:: python
+
+    ds['time.dayofyear']
+
+Finally, datasets also store arbitrary metadata in the form of `attributes`:
+
+.. ipython:: python
+
+    ds.attrs
+
+    ds.attrs['title'] = 'example attribute'
+    ds
+
+xray does not enforce any restrictions on attributes, but serialization to
+some file formats may fail if you put in objects that aren't strings, numbers
+or arrays.
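+
+For instance, a quick sketch of listing both groups for this dataset
+(assuming, as the API reference suggests, that these attributes are dict-like
+mappings from names to variables):
+
+::
+
+    list(ds.coordinates)     # expect ['time', 'space']
+    list(ds.noncoordinates)  # expect ['foo']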
+
+Modifying datasets
+~~~~~~~~~~~~~~~~~~
+
+We can update a dataset in-place using Python's standard dictionary syntax:
+
+.. ipython:: python
+
+    ds['numbers'] = ('space', [10, 10, 20, 20])
+    ds['abc'] = ('time', ['A', 'B', 'C'])
+    ds
+
+It should be evident now how a ``Dataset`` lets you store many arrays along a
+(partially) shared set of common dimensions and coordinates.
+
+To change the variables in a ``Dataset``, you can use all the standard
+dictionary methods, including ``values``, ``items``, ``__del__``, ``get`` and
+``update``.
+
+You also can select and unselect an explicit list of variables by using the
+:py:meth:`~xray.Dataset.select` and :py:meth:`~xray.Dataset.unselect` methods
+to return a new ``Dataset``. `select` automatically includes the relevant
+coordinate values:
+
+.. ipython:: python
+
+    ds.select('abc')
+
+If a coordinate is given as an argument to `unselect`, it also unselects all
+variables that use that coordinate:
+
+.. ipython:: python
+
+    ds.unselect('time', 'space')
+
+You can copy a ``Dataset`` by using the :py:meth:`~xray.Dataset.copy` method:
+
+.. ipython:: python
+
+    ds2 = ds.copy()
+    del ds2['time']
+    ds2
+
+By default, the copy is shallow, so only the container will be copied: the
+contents of the ``Dataset`` will still be the same underlying
+:py:class:`xray.Variable` objects. You can copy all data by supplying the
+argument ``deep=True``.
+
+``DataArray`` objects
+---------------------
+
+The contents of a :py:class:`~xray.Dataset` are :py:class:`~xray.DataArray`
+objects, xray's version of a labeled multi-dimensional array.
+``DataArray`` supports metadata aware array operations based on their
+labeled dimensions (axis names) and labeled coordinates (tick values).
+
+The idea of the DataArray is to provide an alternative to
+:py:class:`pandas.Series` and :py:class:`pandas.DataFrame` with functionality
+much closer to a standard numpy N-dimensional array. Unlike pandas objects,
+slicing or manipulating a DataArray always returns another DataArray, and all
+items in a DataArray must have a single (homogeneous) data type. (To work
+with heterogeneous data in xray, put separate DataArrays in the same Dataset.)
+
+You create a DataArray by getting an item from a Dataset:
+
+.. ipython:: python
+
+    foo = ds['foo']
+    foo
+
+.. note::
+
+    You currently cannot make a DataArray without putting objects into a
+    Dataset first, unless you use the :py:meth:`DataArray.from_series
+    <xray.DataArray.from_series>` class method to convert an existing
+    :py:class:`pandas.Series`. We do intend to define a constructor for
+    making DataArray objects directly in a future version of xray.
+
+Internally, data arrays are uniquely defined by only two attributes:
+
+- :py:attr:`~xray.DataArray.dataset`: a dataset object.
+- :py:attr:`~xray.DataArray.name`: the name of a variable in the array's
+  dataset.
+
+Like pandas objects, they can be thought of as a fancy wrapper around a
+numpy array:
+
+.. ipython:: python
+
+    foo.values
+
+They also have a tuple of dimension labels:
+
+.. ipython:: python
+
+    foo.dimensions
+
+They keep track of their coordinates (tick labels) in a read-only ordered
+dictionary mapping from dimension names to :py:class:`~xray.Coordinate`
+objects:
+
+.. ipython:: python
+
+    foo.coordinates
+
+They also keep track of their own attributes:
+
+.. ipython:: python
+
+    foo.attrs
+
+You can pull out other variables (including coordinates) from a DataArray's
+dataset by indexing the data array with a string:
+
+.. ipython:: python
+
+    foo['time']
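+
+To make the two defining attributes mentioned above concrete, a short sketch
+(expected values based on this example):
+
+::
+
+    foo.dataset is ds  # expect True: the dataset this array points to
+    foo.name           # expect 'foo': the variable the array describes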
+
+Usually, xray automatically manages the `Dataset` objects that data arrays
+point to in a satisfactory fashion. For example, it will keep around other
+dataset variables when possible until there are potential conflicts, such as
+when you apply a mathematical operation.
+
+However, in some cases, particularly for performance reasons, you may want to
+explicitly ensure that the dataset only includes the variables you are
+interested in. For these cases, use the :py:meth:`xray.DataArray.select`
+method to select the names of variables you want to keep around, by default
+including the name of only the DataArray itself:
+
+.. ipython:: python
+
+    foo2 = foo.select()
+
+    foo2
+
+`foo2` is generally an equivalent labeled array to `foo`, but we dropped the
+dataset variables that are no longer relevant:
+
+.. ipython:: python
+
+    foo.dataset.keys()
+
+    foo2.dataset.keys()
+
+Array indexing
+--------------
+
+Indexing a :py:class:`~xray.DataArray` works (mostly) just like it does for
+numpy arrays, except that the returned object is always another DataArray:
+
+.. ipython:: python
+
+    foo[:2]
+
+    foo[0, 0]
+
+    foo[:, [2, 1]]
+
+xray also supports label based indexing like pandas. Because
+:py:class:`~xray.Coordinate` is a thin wrapper around a
+:py:class:`pandas.Index`, label indexing is very fast. To do
+label based indexing, use the :py:attr:`~xray.DataArray.loc` attribute:
+
+.. ipython:: python
+
+    foo.loc['2000-01-01':'2000-01-02', 0]
+
+You can do any of the label indexing operations `supported by pandas`__ with
+the exception of boolean arrays, including looking up particular labels, using
+slice syntax and using arrays of labels. Like pandas, label based indexing is
+*inclusive* of both the start and stop bounds.
+
+__ http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-label
+
+Setting values with label based indexing is also supported:
+
+.. ipython:: python
+
+    foo.loc['2000-01-01', [1, 2]] = -10
+    foo
+
+With labeled dimension names, we do not have to rely on dimension order and can
+use them explicitly to slice data with the :py:meth:`~xray.DataArray.indexed`
+and :py:meth:`~xray.DataArray.labeled` methods:
+
+.. ipython:: python
+
+    # index by array indices
+    foo.indexed(space=0, time=slice(0, 2))
+
+    # index by coordinate labels
+    foo.labeled(time=slice('2000-01-01', '2000-01-02'))
+
+The arguments to these methods can be any objects that could index the array
+along that dimension, e.g., labels for an individual value, Python ``slice``
+objects or 1-dimensional arrays.
+
+We can also use these methods to index all variables in a dataset
+simultaneously, returning a new dataset:
+
+.. ipython:: python
+
+    ds.indexed(space=[0], time=[0])
+
+.. ipython:: python
+
+    ds.labeled(time='2000-01-01')
+
+Indexing with xray objects has one important difference from indexing numpy
+arrays: you can only use one-dimensional arrays to index xray objects, and
+each indexer is applied "orthogonally" along independent axes, instead of
+using numpy's array broadcasting. This means you can do indexing like this,
+which wouldn't work with numpy arrays:
+
+.. ipython:: python
+
+    foo[ds['time.day'] > 1, ds['space'] <= 3]
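+
+As noted above, the arguments to ``indexed`` and ``labeled`` can mix argument
+types; a sketch combining a slice and a 1-dimensional list of positions:
+
+::
+
+    # a slice for 'time', a list of integer positions for 'space'
+    foo.indexed(time=slice(0, 2), space=[0, 2])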
+
+This is a much simpler model than numpy's `advanced indexing`__,
+and is basically the only model that works for labeled arrays. If you would
+like to do advanced indexing, you should index ``.values`` directly instead:
+
+__ http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html
+
+.. ipython:: python
+
+    foo.values[foo.values > 0.5]
+
+``DataArray`` math
+------------------
+
+The metadata of :py:class:`~xray.DataArray` objects enables particularly nice
+features for doing mathematical operations.
+
+Basic math
+~~~~~~~~~~
+
+Basic math works just as you would expect:
+
+.. ipython:: python
+
+    foo - 3
+
+You can also use any of numpy's or scipy's many `ufunc`__ functions directly on
+a DataArray:
+
+__ http://docs.scipy.org/doc/numpy/reference/ufuncs.html
+
+.. ipython:: python
+
+    np.sin(foo)
+
+Aggregation
+~~~~~~~~~~~
+
+Whenever feasible, DataArrays have metadata aware versions of standard methods
+and properties from numpy arrays. For example, we can easily take a metadata
+aware :py:attr:`xray.DataArray.transpose`:
+
+.. ipython:: python
+
+    foo.T
+
+Most of these methods have been updated to take a `dimension` argument instead
+of `axis`. This allows for very intuitive syntax for aggregation methods that
+are applied along particular dimension(s):
+
+.. ipython:: python
+
+    foo.sum('time')
+
+.. ipython:: python
+
+    foo.std(['time', 'space'])
+
+Currently, these are the standard numpy array methods, which do not
+automatically skip missing values, but we expect to switch to NA skipping
+versions (like pandas) in the future. For now, you can do NA-skipping
+aggregation by passing NA-aware numpy functions to the
+:py:meth:`~xray.DataArray.reduce` method:
+
+.. ipython:: python
+
+    foo.reduce(np.nanmean, 'time')
+
+If you ever need to figure out the axis number for a dimension yourself (say,
+for wrapping library code designed to work with numpy arrays), you can use the
+:py:meth:`~xray.DataArray.get_axis_num` method:
+
+.. ipython:: python
+
+    foo.get_axis_num('space')
+
+Broadcasting
+~~~~~~~~~~~~
+
+With dimension names, we automatically align dimensions ("broadcasting" in
+the numpy parlance) by name instead of order. This means that you should never
+need to bother inserting dimensions of length 1 with operations like
+:py:func:`np.reshape` or :py:const:`np.newaxis`, which is pretty routinely
+required when working with standard numpy arrays.
+
+This is best illustrated by a few examples. Consider two one-dimensional
+arrays with different sizes aligned along different dimensions:
+
+.. ipython:: python
+
+    foo[:, 0]
+
+    foo[0, :]
+
+With xray, we can apply binary mathematical operations to arrays, and their
+dimensions are expanded automatically:
+
+.. ipython:: python
+
+    foo[:, 0] - foo[0, :]
+
+Moreover, dimensions are always reordered to the order in which they first
+appeared. That means you can always subtract an array from its transpose!
+
+.. ipython:: python
+
+    foo - foo.T
+
+Coordinate based alignment
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+You can also align arrays based on their coordinate values, very similarly
+to how pandas handles alignment. This can be done with the
+:py:meth:`~xray.DataArray.reindex` or :py:meth:`~xray.DataArray.reindex_like`
+methods, or the :py:func:`~xray.align` top-level function. All these work
+interchangeably with both DataArray and Dataset objects with any number of
+overlapping dimensions.
+
+To demonstrate, we'll make a subset DataArray with new values:
+
+.. ipython:: python
+
+    bar = (10 * foo[:2, :2]).rename('bar')
+    bar
+
+Reindexing ``foo`` with ``bar`` selects out the first two values along each
+dimension:
+
+.. ipython:: python
+
+    foo.reindex_like(bar)
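+
+The top-level :py:func:`~xray.align` function, demonstrated just below, does
+the same alignment for any number of objects at once; as a sketch, using the
+``join='outer'`` option from the README example:
+
+::
+
+    # outer join: union of coordinate labels, missing values filled with NaN
+    aligned_foo, aligned_bar = xray.align(foo, bar, join='outer')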
+
+The opposite operation asks us to reindex to a larger shape, so we fill in
+the missing values with `NaN`:
+
+.. ipython:: python
+
+    bar.reindex_like(foo)
+
+The :py:func:`~xray.align` function is even more flexible:
+
+.. ipython:: python
+
+    xray.align(ds, bar, join='inner')
+
+Pandas does this sort of index based alignment automatically when doing math,
+using a `join='outer'`. This is an intended feature for xray, too, but we
+haven't turned it on yet, because it is not clear that an outer join (which
+preserves all missing values) is the best choice for working with
+high-dimensional arrays. Arguably, an inner join makes more sense, because
+that is less likely to result in memory blow-ups. Hopefully, this point will
+eventually become moot when python libraries better support working with
+arrays that cannot be directly represented in a block of memory.
+
+GroupBy: split-apply-combine
+----------------------------
+
+Pandas has very convenient support for `"group by"`__ operations, which
+implement the `split-apply-combine`__ strategy for crunching data:
+
+__ http://pandas.pydata.org/pandas-docs/stable/groupby.html
+__ http://www.jstatsoft.org/v40/i01/paper
+
+- Split your data into multiple independent groups.
+- Apply some function to each group.
+- Combine your groups back into a single data object.
+
+xray implements this same pattern using very similar syntax to pandas. Group by
+operations work on both :py:class:`~xray.Dataset` and
+:py:class:`~xray.DataArray` objects. Note that currently, you can only group
+by a single one-dimensional variable (eventually, we hope to remove this
+limitation).
+
+Split
+~~~~~
+
+Recall the "numbers" variable in our dataset:
+
+.. ipython:: python
+
+    ds['numbers']
+
+If we groupby the name of a variable in a dataset (we can also use a DataArray
+directly), we get back a :py:class:`xray.GroupBy` object:
+
+.. ipython:: python
+
+    ds.groupby('numbers')
+
+This object works very similarly to a pandas GroupBy object. You can view
+the group indices with the ``groups`` attribute:
+
+.. ipython:: python
+
+    ds.groupby('numbers').groups
+
+You can also iterate over groups in ``(label, group)`` pairs:
+
+.. ipython:: python
+
+    list(ds.groupby('numbers'))
+
+Just like in pandas, creating a GroupBy object doesn't actually split the data
+until you want to access particular values.
+
+Apply
+~~~~~
+
+To apply a function to each group, you can use the flexible
+:py:meth:`xray.GroupBy.apply` method. The resulting objects are automatically
+concatenated back together along the group axis:
+
+.. ipython:: python
+
+    def standardize(x):
+        return (x - x.mean()) / x.std()
+
+    foo.groupby('numbers').apply(standardize)
+
+Group by objects resulting from DataArrays also have shortcuts that aggregate
+over each group:
+
+.. ipython:: python
+
+    foo.groupby('numbers').mean()
+
+Squeezing
+~~~~~~~~~
+
+When grouping over a dimension, you can control whether the dimension is
+squeezed out or if it should remain with length one on each group by using
+the ``squeeze`` parameter:
+
+.. ipython:: python
+
+    list(foo.groupby('space'))[0][1]
+
+.. ipython:: python
+
+    list(foo.groupby('space', squeeze=False))[0][1]
+
+Although xray will attempt to automatically
+:py:attr:`~xray.DataArray.transpose` dimensions back into their original order
+when you use apply, it is sometimes useful to set ``squeeze=False`` to
+guarantee that all original dimensions remain unchanged.
+
+You can always squeeze explicitly later with the Dataset or DataArray
+:py:meth:`~xray.DataArray.squeeze` methods.
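+
+As a fuller split-apply-combine sketch, here is the seasonal-anomaly pattern
+from the project README (illustrative only; our example ``foo`` spans just
+three days):
+
+::
+
+    # subtract the mean over all values that share a day-of-year
+    anomalies = foo.groupby('time.dayofyear').apply(lambda y: y - y.mean())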
+
+Combining data
+--------------
+
+Concatenate
+~~~~~~~~~~~
+
+To combine arrays along a dimension into a larger array, you can use the
+:py:meth:`DataArray.concat <xray.DataArray.concat>` and
+:py:meth:`Dataset.concat <xray.Dataset.concat>` class methods:
+
+.. ipython:: python
+
+    xray.DataArray.concat([foo[0], foo[1]], 'new_dim')
+
+    xray.Dataset.concat([ds.labeled(time='2000-01-01'),
+                         ds.labeled(time='2000-01-03')],
+                        'new_dim')
+
+:py:meth:`Dataset.concat <xray.Dataset.concat>` has a number of options which
+control how it combines data, and in particular, how it handles conflicting
+variables between datasets.
+
+Merge and update
+~~~~~~~~~~~~~~~~
+
+To combine multiple Datasets, you can use the
+:py:meth:`~xray.Dataset.merge` and :py:meth:`~xray.Dataset.update` methods.
+Merge checks for conflicting variables before merging and by
+default it returns a new Dataset:
+
+.. ipython:: python
+
+    ds.merge({'hello': ('space', np.arange(4) + 10)})
+
+In contrast, update modifies a dataset in-place without checking for conflicts,
+and will overwrite any existing variables with new values:
+
+.. ipython:: python
+
+    ds.update({'space': ('space', [10.2, 9.4, 6.4, 3.9])})
+
+However, dimensions are still required to be consistent between different
+Dataset variables, so you cannot change the size of a dimension unless you
+replace all dataset variables that use it.
+
+Equals and identical
+~~~~~~~~~~~~~~~~~~~~
+
+xray objects can be compared by using the :py:meth:`~xray.DataArray.equals`
+and :py:meth:`~xray.DataArray.identical` methods.
+
+``equals`` checks dimension names, coordinate labels and array values:
+
+.. ipython:: python
+
+    foo.equals(foo.copy())
+
+``identical`` also checks attributes, and the name of each object:
+
+.. ipython:: python
+
+    foo.identical(foo.rename('bar'))
+
+In contrast, the ``==`` for ``DataArray`` objects performs element-wise
+comparison (like numpy):
+
+.. ipython:: python
+
+    foo == foo.copy()
+
+Like pandas objects, two xray objects are still equal or identical if they have
+missing values marked by `NaN`, as long as the missing values are in the same
+locations in both objects. This is not true for `NaN` in general, which usually
+compares `False` to everything, including itself:
+
+.. ipython:: python
+
+    np.nan == np.nan
+
+Working with ``pandas``
+-----------------------
+
+One of the most important features of xray is the ability to convert to and
+from :py:mod:`pandas` objects to interact with the rest of the PyData
+ecosystem. For example, for plotting labeled data, we highly recommend
+using the visualization `built in to pandas itself`__ or provided by the pandas
+aware libraries such as `Seaborn`__ and `ggplot`__.
+
+__ http://pandas.pydata.org/pandas-docs/stable/visualization.html
+__ http://stanford.edu/~mwaskom/software/seaborn/
+__ http://ggplot.yhathq.com/
+
+Fortunately, there are straightforward representations of
+:py:class:`~xray.Dataset` and :py:class:`~xray.DataArray` in terms of
+:py:class:`pandas.DataFrame` and :py:class:`pandas.Series`, respectively.
+The representation works by flattening noncoordinates to 1D, and turning the
+tensor product of coordinates into a :py:class:`pandas.MultiIndex`.
+
+``pandas.DataFrame``
+~~~~~~~~~~~~~~~~~~~~
+
+To convert to a ``DataFrame``, use the :py:meth:`Dataset.to_dataframe()
+<xray.Dataset.to_dataframe>` method:
+
+.. ipython:: python
+
+    df = ds.to_dataframe()
+    df
+
+We see that each noncoordinate in the Dataset is now a column in the DataFrame.
+The ``DataFrame`` representation is reminiscent of Hadley Wickham's notion of
+`tidy data`__.
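+
+Since the result is an ordinary ``DataFrame``, all of pandas' tools apply,
+including plotting. As a sketch, in the spirit of the pandas-based
+visualization libraries mentioned above (assumes matplotlib is installed):
+
+::
+
+    # one line per 'space' value, plotted over 'time'
+    df['foo'].unstack('space').plot()
+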
+To convert the ``DataFrame`` to any other convenient representation,
+use ``DataFrame`` methods like :py:meth:`~pandas.DataFrame.reset_index`,
+:py:meth:`~pandas.DataFrame.stack` and :py:meth:`~pandas.DataFrame.unstack`.
+
+__ http://vita.had.co.nz/papers/tidy-data.pdf
+
+To create a ``Dataset`` from a ``DataFrame``, use the
+:py:meth:`~xray.Dataset.from_dataframe` class method:
+
+.. ipython:: python
+
+    xray.Dataset.from_dataframe(df)
+
+Notice that the dimensions of noncoordinates in the ``Dataset`` have now
+expanded after the round-trip conversion to a ``DataFrame``. This is because
+every object in a ``DataFrame`` must have the same indices, so the data of
+each array had to be broadcast to the full size of the new ``MultiIndex``.
+
+``pandas.Series``
+~~~~~~~~~~~~~~~~~
+
+``DataArray`` objects have a complementary representation in terms of a
+:py:class:`pandas.Series`. Using a Series preserves the ``Dataset`` to
+``DataArray`` relationship, because ``DataFrames`` are dict-like containers
+of ``Series``. The methods are very similar to those for working with
+DataFrames:
+
+.. ipython:: python
+
+    s = foo.to_series()
+    s
+
+    xray.DataArray.from_series(s)
+
+Serialization and IO
+--------------------
+
+xray supports direct serialization and IO to several file formats. For more
+options, consider exporting your objects to pandas (see the preceding section)
+and using its broad range of `IO tools`__.
+
+__ http://pandas.pydata.org/pandas-docs/stable/io.html
+
+Pickle
+~~~~~~
+
+The simplest way to serialize an xray object is to use Python's built-in pickle
+module:
+
+.. ipython:: python
+
+    import cPickle as pickle
+
+    pkl = pickle.dumps(ds)
+
+    pickle.loads(pkl)
+
+Pickle support is important because it doesn't require any external libraries
+and lets you use xray objects with Python modules like ``multiprocessing``.
+However, there are two important caveats:
+
+1. To simplify serialization, xray's support for pickle currently loads all
+   array values into memory before dumping an object. This means it is not
+   suitable for serializing datasets too big to load into memory (e.g., from
+   NetCDF or OpenDAP).
+2. Pickle will only work as long as the internal data structure of xray objects
+   remains unchanged. Because the internal design of xray is still being
+   refined, we make no guarantees (at this point) that objects pickled with
+   this version of xray will work in future versions.
+
+Reading and writing to disk (NetCDF)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Currently, the only external serialization format that xray supports is
+`NetCDF`__. NetCDF is a file format for fully self-described datasets that is
+widely used in the geosciences and supported on almost all platforms. We use
+NetCDF because xray was based on the NetCDF data model, so NetCDF files on disk
+directly correspond to :py:class:`~xray.Dataset` objects. Recent versions of
+NetCDF are based on the even more widely used HDF5 file format.
+
+__ http://www.unidata.ucar.edu/software/netcdf/
+
+Reading and writing NetCDF files with xray requires the
+`Python-NetCDF4`__ library.
+
+__ https://github.com/Unidata/netcdf4-python
+
+We can save a Dataset to disk using the
+:py:meth:`Dataset.to_netcdf <xray.Dataset.to_netcdf>` method:
+
+.. use verbatim because readthedocs doesn't have netCDF4 support
+
+.. ipython::
+   :verbatim:
+
+   In [1]: ds.to_netcdf('saved_on_disk.nc')
+
+By default, the file is saved as NetCDF4.
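+
+A quick round-trip sketch (not executed here, since it requires netCDF4-python
+and write access to the working directory):
+
+::
+
+    ds.to_netcdf('saved_on_disk.nc')
+    ds_roundtrip = xray.open_dataset('saved_on_disk.nc')
+    ds.equals(ds_roundtrip)  # expect True once values are loaded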
+
+We can load NetCDF files to create a new Dataset using the
+:py:func:`~xray.open_dataset` function:
+
+.. ipython::
+   :verbatim:
+
+   In [1]: ds_disk = xray.open_dataset('saved_on_disk.nc')
+
+   In [2]: ds_disk
+   <xray.Dataset>
+   Dimensions:     (space: 4, time: 3)
+   Coordinates:
+       space            X
+       time                      X
+   Noncoordinates:
+       foo              1        0
+       numbers          0
+       abc                       0
+   Attributes:
+       title: example attribute
+
+Data is loaded lazily from NetCDF files. You can manipulate, slice and subset
+Dataset and DataArray objects, and no array values are loaded into memory until
+necessary. For an example of how these lazy arrays work, see the OpenDAP
+section below.
+
+.. note::
+
+    Although xray provides reasonable support for incremental reads of files on
+    disk, it does not yet support incremental writes, which is important for
+    dealing with datasets that do not fit into memory. This is a significant
+    shortcoming which is on the roadmap for fixing in the next major version,
+    which will include the ability to create ``Dataset`` objects directly
+    linked to a NetCDF file on disk.
+
+NetCDF files follow some conventions for encoding datetime arrays (as numbers
+with a "units" attribute) and for packing and unpacking data (as
+described by the "scale_factor" and "_FillValue" attributes). If the argument
+``decode_cf=True`` (default) is given to ``open_dataset``, xray will attempt
+to automatically decode the values in the NetCDF objects according to
+`CF conventions`__. Sometimes this will fail, for example, if a variable
+has an invalid "units" or "calendar" attribute. For these cases, you can
+turn this decoding off manually.
+
+__ http://cfconventions.org/
+
+You can view this encoding information and control the details of how xray
+serializes objects, by viewing and manipulating the
+:py:attr:`DataArray.encoding <xray.DataArray.encoding>` attribute:
+
+.. ipython::
+   :verbatim:
+
+   In [1]: ds_disk['time'].encoding
+   {'calendar': u'proleptic_gregorian',
+    'chunksizes': None,
+    'complevel': 0,
+    'contiguous': True,
+    'dtype': dtype('float64'),
+    'fletcher32': False,
+    'least_significant_digit': None,
+    'shuffle': False,
+    'units': u'days since 2000-01-01 00:00:00',
+    'zlib': False}
+
+Working with remote datasets (OpenDAP)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+xray includes support for `OpenDAP`__ (via the NetCDF4 library or Pydap), which
+lets us access large datasets over HTTP.
+
+__ http://www.opendap.org/
+
+For example, we can open a connection to GBs of weather data produced by the
+`PRISM`__ project, and hosted by the
+`International Research Institute for Climate and Society`__ at Columbia:
+
+__ http://www.prism.oregonstate.edu/
+__ http://iri.columbia.edu/
+
+.. ipython::
+   :verbatim:
+
+   In [3]: remote_data = xray.open_dataset(
+      ...:     'http://iridl.ldeo.columbia.edu/SOURCES/.OSU/.PRISM/.monthly/dods')
+
+   In [4]: remote_data
+   <xray.Dataset>
+   Dimensions:     (T: 1432, X: 1405, Y: 621)
+   Coordinates:
+       T               X
+       X                          X
+       Y                                     X
+   Noncoordinates:
+       ppt             0          2          1
+       tdmean          0          2          1
+       tmax            0          2          1
+       tmin            0          2          1
+   Attributes:
+       Conventions: IRIDL
+       expires: 1401580800
+
+   In [5]: remote_data['tmax']
+   <xray.DataArray 'tmax' (T: 1432, Y: 621, X: 1405)>
+   [1249427160 values with dtype=float64]
+   Attributes:
+       pointwidth: 120
+       units: Celsius_scale
+       missing_value: -9999
+       standard_name: air_temperature
+       expires: 1401580800
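+
+If CF decoding fails for a dataset, or is unwanted, it can be disabled with
+the ``decode_cf`` argument described earlier (a sketch, reusing the same URL):
+
+::
+
+    remote_raw = xray.open_dataset(
+        'http://iridl.ldeo.columbia.edu/SOURCES/.OSU/.PRISM/.monthly/dods',
+        decode_cf=False)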
+
+We can select and slice this data any number of times, and nothing is loaded
+over the network until we look at particular values:
+
+.. ipython::
+   :verbatim:
+
+   In [4]: tmax = remote_data['tmax'][:500, ::3, ::3]
+
+   In [5]: tmax
+   <xray.DataArray 'tmax' (T: 500, Y: 207, X: 469)>
+   [48541500 values with dtype=float64]
+   Attributes:
+       pointwidth: 120
+       units: Celsius_scale
+       missing_value: -9999
+       standard_name: air_temperature
+       expires: 1401580800
+
+Now, let's access and plot a small subset:
+
+.. ipython::
+   :verbatim:
+
+   In [6]: tmax_ss = tmax[0]
+
+For this dataset, we still need to manually fill in some of the values with
+`NaN` to indicate that they are missing. As soon as we access
+``tmax_ss.values``, the values are loaded over the network and cached on the
+DataArray so they can be manipulated:
+
+.. ipython::
+   :verbatim:
+
+   In [7]: tmax_ss.values[tmax_ss.values < -99] = np.nan
+
+Finally, we can plot the values with matplotlib:
+
+.. ipython::
+   :verbatim:
+
+   In [8]: import matplotlib.pyplot as plt
+
+   In [9]: from matplotlib.cm import get_cmap
+
+   In [10]: plt.figure(figsize=(9, 5))
+
+   In [11]: plt.gca().patch.set_color('0')
+
+   In [12]: plt.contourf(tmax_ss['X'], tmax_ss['Y'], tmax_ss.values, 20,
+      ....:              cmap=get_cmap('RdBu_r'))
+
+   In [13]: plt.colorbar()
+
+.. image:: _static/opendap-prism-tmax.png
diff --git a/setup.py b/setup.py
index bb4c61b08f8..d315eeb5eec 100644
--- a/setup.py
+++ b/setup.py
@@ -15,6 +15,69 @@
 VERSION = '%d.%d.%d' % (MAJOR, MINOR, MICRO)
 QUALIFIER = ''
 
+
+DISTNAME = 'xray'
+LICENSE = 'Apache'
+AUTHOR = 'Stephan Hoyer, Alex Kleeman, Eugene Brevdo'
+AUTHOR_EMAIL = 'shoyer@climate.com'
+URL = 'https://github.com/akleeman/xray'
+CLASSIFIERS = [
+    'Development Status :: 3 - Alpha',
+    'License :: OSI Approved :: Apache Software License',
+    'Operating System :: OS Independent',
+    'Intended Audience :: Science/Research',
+    'Programming Language :: Python :: 2.7',
+    'Topic :: Scientific/Engineering',
+]
+
+
+DESCRIPTION = "Extended arrays for working with scientific datasets in Python"
+LONG_DESCRIPTION = """
+**xray** is a Python package for working with aligned sets of
+homogeneous, n-dimensional arrays. It implements flexible array
+operations and dataset manipulation for in-memory datasets within the
+`Common Data
+Model <http://www.unidata.ucar.edu/software/thredds/current/netcdf-java/CDM/>`__
+widely used for self-describing scientific data (e.g., the NetCDF file
+format).
+
+Why xray?
+---------
+
+Adding dimension names and coordinate values to numpy's
+`ndarray <http://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html>`__
+makes many powerful array operations possible:
+
+- Apply operations over dimensions by name: ``x.sum('time')``.
+- Select values by label instead of integer location:
+  ``x.loc['2014-01-01']`` or ``x.labeled(time='2014-01-01')``.
+- Mathematical operations (e.g., ``x - y``) vectorize across multiple
+  dimensions (known in numpy as "broadcasting") based on dimension
+  names, regardless of their original order.
+- Flexible split-apply-combine operations with groupby:
+  ``x.groupby('time.dayofyear').mean()``.
+- Database-like alignment based on coordinate labels that smoothly
+  handles missing values: ``x, y = xray.align(x, y, join='outer')``.
+- Keep track of arbitrary metadata in the form of a Python dictionary:
+  ``x.attrs``.
+
+**xray** aims to provide a data analysis toolkit as powerful as
+`pandas <http://pandas.pydata.org/>`__ but designed for working with
+homogeneous N-dimensional arrays instead of tabular data. Indeed, much
+of its design and internal functionality (in particular, fast indexing)
+is shamelessly borrowed from pandas.
+
+Because **xray** implements the same data model as the NetCDF file
+format, xray datasets have a natural and portable serialization format.
+But it's also easy to robustly convert an xray ``DataArray`` to and from
+a numpy ``ndarray`` or a pandas ``DataFrame`` or ``Series``, providing
+compatibility with the full `PyData ecosystem <http://pydata.org/>`__.
+
+For more about **xray**, see the project's `GitHub page
+<http://github.com/akleeman/xray>`__ and `documentation
+<http://xray.readthedocs.org/>`__.
+"""
+
 # code to extract and write the version copied from pandas, which is available
 # under the BSD license:
 FULLVERSION = VERSION
@@ -82,13 +145,16 @@ def write_version_py(filename=None):
 
 write_version_py()
 
-setup(name='xray',
+setup(name=DISTNAME,
       version=FULLVERSION,
-      description='Extended arrays for working with scientific datasets',
-      author='Stephan Hoyer, Alex Kleeman, Eugene Brevdo',
-      author_email='TODO',
+      license=LICENSE,
+      author=AUTHOR,
+      author_email=AUTHOR_EMAIL,
+      classifiers=CLASSIFIERS,
+      description=DESCRIPTION,
+      long_description=LONG_DESCRIPTION,
       install_requires=['numpy >= 1.8', 'pandas >= 0.13.1'],
       tests_require=['mock >= 1.0.1', 'nose >= 1.0'],
-      url='https://github.com/akleeman/xray',
+      url=URL,
       test_suite='nose.collector',
       packages=['xray', 'xray.backends'])
diff --git a/test/test_data_array.py b/test/test_data_array.py
index 9ad85dd243b..8876a7a07a5 100644
--- a/test/test_data_array.py
+++ b/test/test_data_array.py
@@ -1,4 +1,5 @@
 import numpy as np
+import pandas as pd
 from copy import deepcopy
 from textwrap import dedent
 
@@ -41,6 +42,9 @@ def test_properties(self):
         self.dv.name = 'bar'
         with self.assertRaises(AttributeError):
             self.dv.dataset = self.ds
+        self.assertIsInstance(self.ds['x'].as_index, pd.Index)
+        with self.assertRaisesRegexp(ValueError, 'must be 1-dimensional'):
+            self.ds['foo'].as_index
 
     def test_equals_and_identical(self):
         da2 = self.dv.copy()
diff --git a/xray/data_array.py b/xray/data_array.py
index cd8cbc36701..ff083f65c08 100644
--- a/xray/data_array.py
+++ b/xray/data_array.py
@@ -157,8 +157,9 @@ def _in_memory(self):
 
     @property
     def as_index(self):
-        """The variable's data as a pandas.Index"""
-        return self.variable.as_index
+        """The variable's data as a pandas.Index. Only possible for 1D arrays.
+        """
+        return self.variable.to_coord().as_index
 
     @property
     def dimensions(self):