Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Converter.to_info to customize info and search #1884

Open
wants to merge 15 commits into
base: main
Choose a base branch
from

Conversation

braingram
Copy link
Contributor

@braingram braingram commented Dec 20, 2024

Description

This PR adds support for a new Converter method to_info. This new method is for converting a custom object to an "info" container (list, tuple, dict) to provide additional information during info (and as a side effect during search since this uses common code in _node_info.py).

To provide an example where this is useful. Currently an AsdfFile with an astropy Table renders with largely unhelpful info:

>>> from astropy.table import Table
>>> import asdf
>>> af = asdf.AsdfFile({"my_table": Table({"col_a": [1, 2, 3], "col_b": [4, 5, 6]})})
>>> af.info()
root (AsdfObject)
└─my_table (Table)

With this PR the Table (and Column) Converters in asdf-astropy can be updated (only the TableConverter update shown here).

    def to_info(self, obj):
        return {
            "columns": {n: obj[n] for n in obj.colnames},
            "meta": dict(obj.meta),
            "n_rows": len(obj),
        }

to improve the output:

└─my_table (Table)
  ├─columns (dict)
  │ ├─col_a (Column): shape=(3,), dtype=int64
  │ │ ├─name (str): col_a
  │ │ └─dtype (dtype[int64]): int64
  │ └─col_b (Column): shape=(3,), dtype=int64
  │   ├─name (str): col_b
  │   └─dtype (dtype[int64]): int64
  ├─meta (dict): {}
  └─n_rows (int): 3

This has the added benefit of allowing search to find column names for the above example:

>>> af.search("col_a")
root (AsdfObject)
└─my_table (Table)
  └─columns (dict)
    └─col_a (Column): shape=(3,), dtype=int64

This PR uses the Converter.to_info feature to implement custom information of ndarray/NDArrayType. On main the following:

import asdf, numpy as np
asdf.AsdfFile({"arr": np.arange(42)}).info()

produces:

root (AsdfObject)
└─arr (ndarray): shape=(42,), dtype=int64

with this PR the output is different:

root (AsdfObject)
└─arr (ndarray)
  ├─shape (tuple)
  │ └─[0] (int): 42
  └─dtype (Int64DType): int64

During working on this feature several bugs were uncovered and a few were fixed:

Fixes #1882 #1889 #1890

Tasks

  • run pre-commit on your machine
  • run pytest on your machine
  • Does this PR add new features and / or change user-facing code / API? (if not, label with no-changelog-entry-needed)
    • write news fragment(s) in changes/: echo "changed something" > changes/<PR#>.<changetype>.rst (see below for change types)
    • update relevant docstrings and / or docs/ page
    • for any new features, add unit tests
news fragment change types...
  • changes/<PR#>.feature.rst: new feature
  • changes/<PR#>.bugfix.rst: bug fix
  • changes/<PR#>.doc.rst: documentation change
  • changes/<PR#>.removal.rst: deprecation or removal of public API
  • changes/<PR#>.general.rst: infrastructure or miscellaneous change

@braingram braingram changed the title WIP: info/seach/schema_info fixes Add Converter.to_info to customize info and search Dec 24, 2024
@braingram
Copy link
Contributor Author

The stdatamodels downstream failure is unrelated: spacetelescope/stdatamodels#371

@braingram braingram marked this pull request as ready for review December 24, 2024 19:21
@braingram braingram requested a review from a team as a code owner December 24, 2024 19:21
@perrygreenfield
Copy link
Contributor

This is a nice feature. I wonder if there should be a keyword option to suppress this option in the cases it will really blow up the output (tables with lots of columns, or a very complex GWCS).

@braingram
Copy link
Contributor Author

This is a nice feature. I wonder if there should be a keyword option to suppress this option in the cases it will really blow up the output (tables with lots of columns, or a very complex GWCS).

Thanks for giving it a look.

At the moment the output is subject to the max_rows (and max_cols) options. For the example output if the total number of rows exceedsmax_rows (if max_rows is an integer) the output will begin to be truncated starting with the most distal branches of the tree, pruning the output to:

└─my_table (Table)
  ├─columns (dict)
  │ ├─col_a (Column) ...
  │ └─col_b (Column) ...
  ├─meta (dict): {}
  └─n_rows (int): 3

and then to:

└─my_table (Table) ...

There is also an existing show_values option. If that's disabled (and max_rows=None):

└─my_table (Table)
  ├─columns (dict)
  │ ├─col_a (Column)
  │ │ ├─name (str)
  │ │ └─dtype (dtype[int64])
  │ └─col_b (Column)
  │   ├─name (str)
  │   └─dtype (dtype[int64])
  ├─meta (dict)
  └─n_rows (int)

That output doesn't look too helpful so one option might be to not use to_info when show_values=False. That seems a little hacky and may require some additional code refactoring (as to_info is used during the NodeInfo collection whereas show_values is used in the post-collection display of the NodeInfo instance).

We could add a flag use_to_info to NodeInfo to control if to_info is used to expand nodes and then expose it for control via:

  • AsdfFile.info
  • asdf.info
  • asdftool info
  • etc...

At the moment I'm leaning towards relying on max_rows as the way to control the verbosity of the output but don't object to adding more API to control this output. Let me know what works best for you.

@perrygreenfield
Copy link
Contributor

I worry that if the very lengthy item comes before items of interest, max_rows may not be that much help. For your example, while the column info is attenuated somewhat, there may still be a lot of lines for big tables. This may work better for GWCS if that shows a deep expression tree (though some representations are more text based; it depends on how the to_info is implemented). I wonder if another approach is truncate a node's output based on how many lines it contains (max_node_info?), though I would probably exclude meta from that.

@braingram
Copy link
Contributor Author

I worry that if the very lengthy item comes before items of interest, max_rows may not be that much help. For your example, while the column info is attenuated somewhat, there may still be a lot of lines for big tables. This may work better for GWCS if that shows a deep expression tree (though some representations are more text based; it depends on how the to_info is implemented). I wonder if another approach is truncate a node's output based on how many lines it contains (max_node_info?), though I would probably exclude meta from that.

While working on this and #1875 I noticed that max_rows is rather more complicated than I expected. It starts hiding the deepest/most-distal rows/branches first. Since the to_info generated subtree will be 1 deeper than the item this can help (for some trees) to keep things tidy when limiting the number of rows. If I modify the above example adding a nested structure prior to my_table and call af.info (with the default max_rows=24) the output includes:

─aaa (dict)
│ ├─bbb (dict)
│ │ ├─ccc (dict) ...
│ │ └─ccc2 (int): 1
│ └─bbb2 (list)
│   ├─[0] (int): 5
│   ├─[1] (int): 6
│   └─[2] (int): 7
└─my_table (Table)
  ├─columns (dict) ...
  ├─meta (dict): {}
  └─n_rows (int): 3

In this case the columns subtree was hidden (because of it's depth) but so was ccc in the nested aaa structure. If I use max_rows=12 the output condenses to:

├─aaa (dict) ...
└─my_table (Table) ...
Some nodes not shown.

Because aaa and my_table are the most shallow structure they will be hidden last (put another way is that info has a preference for displaying more root-level objects). max_rows can also be a tuple of rows per depth. Calling af.info(max_rows=(24, None, 3)) outputs:

├─aaa (dict)
│ ├─bbb (dict)
│ │ ├─ccc (dict) ...
│ │ └─ccc2 (int): 1
│ └─bbb2 (list)
│   ├─[0] (int): 5
│   ├─[1] (int): 6
│   └─[2] (int): 7
└─my_table (Table)
  ├─columns (dict)
  │ ├─col_a (Column) ...
  │ └─col_b (Column) ...
  ├─meta (dict): {}
  └─n_rows (int): 3
Some nodes not shown.

I think the existing max_rows options are pretty useful here and if a user wants further filtering they can always select an item/branch/etc from the tree asdf.info(af['aaa']):

root (AsdfObject)
├─bbb (dict)
│ ├─ccc (dict)
│ │ └─ddd (list)
│ │   ├─[0] (int): 1
│ │   ├─[1] (int): 2
│ │   ├─[2] (int): 3
│ │   └─[3] (int): 4
│ └─ccc2 (int): 1
└─bbb2 (list)
  ├─[0] (int): 5
  ├─[1] (int): 6
  └─[2] (int): 7

As you noted the details about how to_info is implemented for Table, Column, WCS etc will help to inform where and how we may want to control the output. I suggest that we hold off adding any other options to info until after we have some experience with to_info.

@perrygreenfield
Copy link
Contributor

Yes, it (max_rows) may already doing mostly what we want.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

info search and schema_info do not use the AsdfFile.extension_manager
2 participants