Add aggregate param to tree_diff #323

Merged
merged 3 commits into from
Nov 6, 2024
9 changes: 6 additions & 3 deletions CHANGELOG.md
@@ -10,13 +10,16 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### Added:
- Tree Export: Print tree to allow alias.
- Tree Export: Mermaid diagram to include theme.
### Fixed:
- Misc: Doctest for docstrings, docstring to indicate usage prefers `node_name` to `name`.
- Tree Helper: Get tree diff to take in `aggregate` parameter to indicate differences at the top-level node.
- Misc: Documentation to include tips and tricks on working with custom classes.
### Changed:
- Misc: Docstring to indicate usage prefers `node_name` to `name`.
- Misc: Standardise testing fixtures.
### Fixed:
- Misc: Polars set up to work on laptop with M1 chip.
- Tree Export: Mermaid diagram title to add newline.
- Tree Helper: Get tree diff string replacement bug when the path change is substring of another path.
- Tree Export: Polars unit test to work with old (<=1.9.0) and new polars version.
- Tree Helper: Get tree diff string replacement bug when the path change is substring of another path.

## [0.22.1] - 2024-11-03
### Added:
160 changes: 105 additions & 55 deletions bigtree/tree/helper.py
@@ -250,6 +250,7 @@ def get_tree_diff(
other_tree: node.Node,
only_diff: bool = True,
detail: bool = False,
aggregate: bool = False,
attr_list: List[str] = [],
fallback_sep: str = "/",
) -> node.Node:
@@ -267,6 +268,9 @@ def get_tree_diff(
If `detail=True`, (added) and (moved to) will be used instead of (+), (removed) and (moved from)
will be used instead of (-).

If `aggregate=True`, differences (+)/(added)/(moved to) and (-)/(removed)/(moved from) will only be indicated at
the parent-level. This is useful when a subtree is shifted and we want the differences to be shown only at the top node.

!!! note

- tree and other_tree must have the same `sep` symbol, otherwise this will raise ValueError
@@ -276,50 +280,79 @@ def get_tree_diff(
Examples:
>>> # Create original tree
>>> from bigtree import Node, get_tree_diff, list_to_tree
>>> root = list_to_tree(["Downloads/Pictures/photo1.jpg", "Downloads/file1.doc", "Downloads/photo2.jpg"])
>>> root = list_to_tree(["Downloads/Pictures/photo1.jpg", "Downloads/file1.doc", "Downloads/Trip/photo2.jpg"])
>>> root.show()
Downloads
├── Pictures
│ └── photo1.jpg
├── file1.doc
└── photo2.jpg
└── Trip
└── photo2.jpg

>>> # Create other tree
>>> root_other = list_to_tree(["Downloads/Pictures/photo1.jpg", "Downloads/Pictures/photo2.jpg", "Downloads/file1.doc"])
>>> root_other = list_to_tree(
... ["Downloads/Pictures/photo1.jpg", "Downloads/Pictures/Trip/photo2.jpg", "Downloads/file1.doc", "Downloads/file2.doc"]
... )
>>> root_other.show()
Downloads
├── Pictures
│ ├── photo1.jpg
│ └── photo2.jpg
└── file1.doc
│ └── Trip
│ └── photo2.jpg
├── file1.doc
└── file2.doc

>>> # Get tree differences
# Get tree differences
>>> tree_diff = get_tree_diff(root, root_other)
>>> tree_diff.show()
Downloads
├── Pictures
│ └── photo2.jpg (+)
└── photo2.jpg (-)
│ └── Trip (+)
│ └── photo2.jpg (+)
├── Trip (-)
│ └── photo2.jpg (-)
└── file2.doc (+)

>>> # Get tree differences - all differences
>>> tree_diff = get_tree_diff(root, root_other, only_diff=False)
>>> tree_diff.show()
Downloads
├── Pictures
│ ├── photo1.jpg
│ └── photo2.jpg (+)
│ ├── Trip (+)
│ │ └── photo2.jpg (+)
│ └── photo1.jpg
├── Trip (-)
│ └── photo2.jpg (-)
├── file1.doc
└── photo2.jpg (-)
└── file2.doc (+)

>>> # Get tree differences - all differences with details
>>> tree_diff = get_tree_diff(root, root_other, only_diff=False, detail=True)
>>> tree_diff.show()
Downloads
├── Pictures
│ ├── photo1.jpg
│ └── photo2.jpg (moved to)
│ ├── Trip (moved to)
│ │ └── photo2.jpg (moved to)
│ └── photo1.jpg
├── Trip (moved from)
│ └── photo2.jpg (moved from)
├── file1.doc
└── photo2.jpg (moved from)
└── file2.doc (added)

Comparing tree attributes
>>> # Get tree differences - all differences with details on aggregated level
>>> tree_diff = get_tree_diff(root, root_other, only_diff=False, detail=True, aggregate=True)
>>> tree_diff.show()
Downloads
├── Pictures
│ ├── Trip (moved to)
│ │ └── photo2.jpg
│ └── photo1.jpg
├── Trip (moved from)
│ └── photo2.jpg
├── file1.doc
└── file2.doc (added)

# Comparing tree attributes

- (~) will be added to node name if there are differences in tree attributes defined in `attr_list`.
- The node's attributes will be a list of [value in `tree`, value in `other_tree`]
@@ -361,6 +394,7 @@ def get_tree_diff(
other_tree (Node): tree to be compared with
only_diff (bool): indicator to show all nodes or only nodes that are different (+/-), defaults to True
detail (bool): indicator to differentiate between different types of diff e.g., added or removed or moved
aggregate (bool): indicator to only add difference indicator to parent-level e.g., when shifting subtrees
attr_list (List[str]): tree attributes to check for difference, defaults to empty list
fallback_sep (str): sep to fall back to if tree and other_tree have a sep that clashes with symbols "+" / "-" / "~".
All node names in tree and other_tree should not contain this fallback_sep, defaults to "/"
@@ -383,6 +417,7 @@ def get_tree_diff(

name_col = "name"
path_col = "PATH"
parent_col = "PARENT"
indicator_col = "Exists"
tree_sep = tree.sep

@@ -391,26 +426,34 @@ def get_tree_diff(
_tree,
name_col=name_col,
path_col=path_col,
parent_col=parent_col,
attr_dict={k: k for k in attr_list},
)
for _tree in (tree, other_tree)
)

# Check tree structure difference
data_both = data[[path_col, name_col] + attr_list].merge(
data_other[[path_col, name_col] + attr_list],
data_both = data[[path_col, name_col, parent_col] + attr_list].merge(
data_other[[path_col, name_col, parent_col] + attr_list],
how="outer",
on=[path_col, name_col],
on=[path_col, name_col, parent_col],
indicator=indicator_col,
)
if aggregate:
data_both_agg = data_both[
(data_both[indicator_col] == "left_only")
| (data_both[indicator_col] == "right_only")
].drop_duplicates(subset=[name_col, parent_col], keep=False)
else:
data_both_agg = data_both

# Handle tree structure difference
nodes_removed = list(data_both[data_both[indicator_col] == "left_only"][path_col])[
::-1
]
nodes_added = list(data_both[data_both[indicator_col] == "right_only"][path_col])[
::-1
]
nodes_removed = list(
data_both_agg[data_both_agg[indicator_col] == "left_only"][path_col]
)[::-1]
nodes_added = list(
data_both_agg[data_both_agg[indicator_col] == "right_only"][path_col]
)[::-1]

moved_from_indicator: List[bool] = [True for _ in range(len(nodes_removed))]
moved_to_indicator: List[bool] = [True for _ in range(len(nodes_added))]
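
The `aggregate` branch in the hunk above keeps only the `left_only` / `right_only` rows and then drops every row whose (name, parent) pair occurs more than once. The following is a dependency-free sketch of that idea, not the PR's pandas code: the row tuples, the `aggregate_changes` helper, and the sample paths are hypothetical, and it assumes the parent column holds the parent node's name.

```python
from collections import Counter

# Hypothetical rows: (path, name, parent_name, side). `side` mirrors the
# "left_only" / "right_only" values of the merge indicator column.
changed_rows = [
    ("/Downloads/Trip", "Trip", "Downloads", "left_only"),
    ("/Downloads/Trip/photo2.jpg", "photo2.jpg", "Trip", "left_only"),
    ("/Downloads/Pictures/Trip", "Trip", "Pictures", "right_only"),
    ("/Downloads/Pictures/Trip/photo2.jpg", "photo2.jpg", "Trip", "right_only"),
    ("/Downloads/file2.doc", "file2.doc", "Downloads", "right_only"),
]


def aggregate_changes(rows):
    """Keep rows whose (name, parent) pair is unique among the changed rows.

    This imitates drop_duplicates(subset=[name, parent], keep=False): a child
    that moved together with its parent shows up once as removed and once as
    added with the same (name, parent) pair, so both copies are dropped.
    """
    counts = Counter((name, parent) for _, name, parent, _ in rows)
    return [row for row in rows if counts[(row[1], row[2])] == 1]


top_level = aggregate_changes(changed_rows)
nodes_removed = [path for path, _, _, side in top_level if side == "left_only"]
nodes_added = [path for path, _, _, side in top_level if side == "right_only"]

print(nodes_removed)  # ['/Downloads/Trip']
print(nodes_added)    # ['/Downloads/Pictures/Trip', '/Downloads/file2.doc']
```

Only the top of the shifted `Trip` subtree survives the filter, which is why `photo2.jpg` carries no indicator in the aggregated docstring example.
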
@@ -432,8 +475,8 @@ def get_tree_diff(

def add_suffix_to_path(
_data: pd.DataFrame, _condition: pd.Series, _original_name: str, _suffix: str
) -> pd.DataFrame:
"""Add suffix to path string
) -> None:
"""Add suffix to path string, in-place

Args:
_data (pd.DataFrame): original data with path column
@@ -446,35 +489,42 @@ def add_suffix_to_path(
"""
_data.iloc[_condition.values, _data.columns.get_loc(path_col)] = _data.iloc[
_condition.values, _data.columns.get_loc(path_col)
].str.replace(_original_name, f"{_original_name} ({suffix})", regex=True)
return _data

for node_removed, move_indicator in zip(nodes_removed, moved_from_indicator):
if not detail:
suffix = "-"
elif move_indicator:
suffix = "moved from"
else:
suffix = "removed"
condition_node_removed = data_both[path_col].str.endswith(
node_removed
) | data_both[path_col].str.contains(node_removed + tree_sep)
data_both = add_suffix_to_path(
data_both, condition_node_removed, node_removed, suffix
)
for node_added, move_indicator in zip(nodes_added, moved_to_indicator):
if not detail:
suffix = "+"
elif move_indicator:
suffix = "moved to"
else:
suffix = "added"
condition_node_added = data_both[path_col].str.endswith(node_added) | data_both[
path_col
].str.contains(node_added + tree_sep)
data_both = add_suffix_to_path(
data_both, condition_node_added, node_added, suffix
)
].str.replace(_original_name, f"{_original_name} ({_suffix})", regex=True)

def add_suffix_to_data(
_data: pd.DataFrame,
nodes_diff: List[str],
move_indicator: List[bool],
suffix_general: str,
suffix_move: str,
suffix_not_moved: str,
) -> None:
"""Add suffix to data, in-place

Args:
_data (pd.DataFrame): original data with path column
nodes_diff (List[str]): list of paths that were modified (e.g., added/removed)
move_indicator (List[bool]): move indicator to indicate path was moved instead of added/removed
suffix_general (str): path suffix for general case
suffix_move (str): path suffix if path was moved
suffix_not_moved (str): path suffix if path is not moved (e.g., added/removed)
"""
for _node_diff, _move_indicator in zip(nodes_diff, move_indicator):
if not detail:
suffix = suffix_general
else:
suffix = suffix_move if _move_indicator else suffix_not_moved
condition_node_modified = data_both[path_col].str.endswith(
_node_diff
) | data_both[path_col].str.contains(_node_diff + tree_sep)
add_suffix_to_path(data_both, condition_node_modified, _node_diff, suffix)

add_suffix_to_data(
data_both, nodes_removed, moved_from_indicator, "-", "moved from", "removed"
)
add_suffix_to_data(
data_both, nodes_added, moved_to_indicator, "+", "moved to", "added"
)

# Check tree attribute difference
path_changes_list_of_dict: List[Dict[str, Dict[str, Any]]] = []
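
The suffix helpers above select every path that ends with a changed path or contains it followed by the separator, then rewrite that segment in place. A rough plain-Python sketch of the same selection and replacement is below. The `add_suffix` function and the sample paths are hypothetical, and the `re.escape` call is an added precaution in this sketch rather than something the PR does.

```python
import re


def add_suffix(paths, changed_path, suffix, sep="/"):
    """Return a copy of `paths` with ` (suffix)` appended to `changed_path`.

    A path is rewritten when it ends with `changed_path` or contains
    `changed_path + sep`, i.e. the changed node itself and its descendants.
    """
    # Escaping is an assumption of this sketch, so that characters such as "."
    # in a node name are matched literally.
    pattern = re.escape(changed_path)
    result = []
    for path in paths:
        if path.endswith(changed_path) or (changed_path + sep) in path:
            # A function replacement avoids backslash handling in the new text.
            path = re.sub(pattern, lambda m: f"{m.group(0)} ({suffix})", path)
        result.append(path)
    return result


paths = ["/Downloads/Trip", "/Downloads/Trip/photo2.jpg", "/Downloads/file1.doc"]
print(add_suffix(paths, "/Downloads/Trip", "-"))
# ['/Downloads/Trip (-)', '/Downloads/Trip (-)/photo2.jpg', '/Downloads/file1.doc']
```

Because descendants are rewritten along with the changed node, a child stays attached to its suffixed parent even when, under `aggregate=True`, the child itself receives no indicator.
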
52 changes: 48 additions & 4 deletions docs/gettingstarted/demo/tree.md
@@ -965,7 +965,11 @@ To compare tree attributes:
- `(~)`: Node has different attributes, only available when comparing attributes

For more details, `(moved from)`, `(moved to)`, `(added)`, and `(removed)` can
be indicated instead if `(+)` and `(-)`.
be indicated instead of `(+)` and `(-)` by passing `detail=True`.

To aggregate the differences at the parent-level instead of having `(+)` and
`(-)` at every child node, pass in `aggregate=True`. This is useful when
subtrees are shifted and you want to view the shift at the parent-level.

=== "Only differences"
```python hl_lines="20"
@@ -1029,13 +1033,14 @@ be indicated instead if `(+)` and `(-)`.
# └── g (+)
```
=== "With details"
```python hl_lines="21"
```python hl_lines="23"
from bigtree import str_to_tree, get_tree_diff

root = str_to_tree("""
a
├── b
│ ├── d
│ │ └── g
│ └── e
└── c
└── f
@@ -1044,9 +1049,10 @@ be indicated instead if `(+)` and `(-)`.
root_other = str_to_tree("""
a
├── b
│ └── g
│ └── h
└── c
├── d
│ └── g
└── f
""")

@@ -1055,10 +1061,48 @@ be indicated instead if `(+)` and `(-)`.
# a
# ├── b
# │ ├── d (moved from)
# │ │ └── g (moved from)
# │ ├── e (removed)
# │ └── h (added)
# └── c
# └── d (moved to)
# └── g (moved to)
```
=== "With aggregated differences"
```python hl_lines="23"
from bigtree import str_to_tree, get_tree_diff

root = str_to_tree("""
a
├── b
│ ├── d
│ │ └── g
│ └── e
└── c
└── f
""")

root_other = str_to_tree("""
a
├── b
│ └── h
└── c
├── d
│ └── g
└── f
""")

tree_diff = get_tree_diff(root, root_other, detail=True, aggregate=True)
tree_diff.show()
# a
# ├── b
# │ ├── d (moved from)
# │ │ └── g
# │ ├── e (removed)
# │ └── g (added)
# │ └── h (added)
# └── c
# └── d (moved to)
# └── g
```
=== "Attribute difference"
```python hl_lines="25"