Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decide and apply file rename/modify strategy #76

Closed
yeraydavidrodriguez opened this issue Jan 18, 2022 · 6 comments
Closed

Decide and apply file rename/modify strategy #76

yeraydavidrodriguez opened this issue Jan 18, 2022 · 6 comments
Assignees
Labels
help wanted Extra attention is needed refactor

Comments

@yeraydavidrodriguez
Copy link
Collaborator

yeraydavidrodriguez commented Jan 18, 2022

When a gold image is renamed, there is a new element in the DVC Diff structure:

        {
            "renamed": [
                {
                    "path": {
                        "old": "path/original_file.txt",
                        "new": "path/renamed_file.txt",
                    }
                },
            ],
        }
  

This renaming can be applied to the corresponding base image and the git repository with two different approaches:

  1. Perform a file rename using the OS functions, committing the renaming action.
  2. Copy the old file using the new name and remove the old one, committing the file deletion and the file addition.

Those two approaches fulfill the basic requirement (a file identical to the old one remains in the repo, but with the new name), but they have specific pros and cons:

Simplicity: The second approach allows us to implement all file management (add/delete/modify/rename) with only two operations (add and delete), therefore maintaining less complexity and performing fewer types of git operations. All events linked with file add and delete actions will be also triggered automatically with the rename operation, simplifying the workflow.

Performance: However, the first approach is much faster and independent of file sizes. This can be irrelevant for a single rename of a small file, but a batch rename of thousands of files (i.e. a name schema change or adding a prefix), the difference can be dramatic.

The same occurs with the modify operations: are we going to support it with a "modify" commit, or shall we delete the old file and add the new one?

In this second scenario, both alternatives have similar performance, so if we decide to use a "delete + add" strategy for the rename operation, the same could be advisable for the modify operation.

A decision about the most convenient approach should be taken and implemented, if applicable.

@josecelano
Copy link
Member

We have a previous discussion here.

@da2ce7
Copy link

da2ce7 commented Jan 18, 2022

Hello @yeraydavidrodriguez

Thank-you for the clear write-up.

Since we have a clean rename section of the DVC Diff, I think that we should take advantage of it and use the OS's built-in mv command.

I think that this is the fastest, and most reliable for at this stage of the development.

While I previously recommend that we use the most minimal, methodology:

1. Delete,
2. Add.

We can update it to:

1. Rename,
2. Delete,
3. Add.

This optimization should be safe and should work in all cases.

@josecelano
Copy link
Member

Hello @yeraydavidrodriguez

Thank-you for the clear write-up.

Since we have a clean rename section of the DVC Diff, I think that we should take advantage of it and use the OS's built-in mv command.

I think that this is the fastest, and most reliable for at this stage of the development.

While I previously recommend that we use the most minimal, methodology:

1. Delete,
2. Add.

We can update it to:

1. Rename,
2. Delete,
3. Add.

This optimization should be safe and should work in all cases.

and what about "Modify" @da2ce7 ? (Same filename different content)

{
  "added":[
    {
      "path":"data/000001/32/000001-32.600.2.tif"
    }
  ],
  "deleted":[
    {
      "path":"data/000002/32/000002-32.600.2.tif"
    }
  ],
  "modified":[
    {
      "path":"data/000003/32/000003-32.600.2.tif",
      "hash": {
        "old": "2b5e0fdd787b569b4fcb4bafa543d270",
        "new": "f237c73d43f87f84081796e3c0e22e37"
      }
    }
  ],
  "renamed": [
    {
      "path": {
        "old": "data/00000444/32/00000444-32.600.2.tif",
        "new": "data/000004/32/000004-32.600.2.tif"
      }
    }
  ]
}  

@da2ce7
Copy link

da2ce7 commented Jan 19, 2022

Hello @josecelano and @yeraydavidrodriguez

I think that modify can be interpreted as a Delete followed by an Add.

@josecelano
Copy link
Member

Hello @josecelano and @yeraydavidrodriguez

I think that modify can be interpreted as a Delete followed by an Add.

OK, thank you for the clarification

@yeraydavidrodriguez
Copy link
Collaborator Author

Thank you both, everything is clear to implementation. I close this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed refactor
Projects
None yet
Development

No branches or pull requests

3 participants