Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix file name parse error (#132) #133

Merged
merged 5 commits into from
Oct 24, 2024
Merged

Conversation

annakrystalli
Copy link
Member

This PR resolves #132 by bolstering the parsing of file names in parse_file_name() to ensure they match the specific pattern required rather than just checking patterns of interest are extracted successfully.

I also added more fine-grained check with more informative error messages and fixed a bug in which the compression extensions were not being successfully recorded and returned.

Copy link

@annakrystalli annakrystalli requested a review from elray1 October 24, 2024 12:33
Copy link
Contributor

@nickreich nickreich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick turn-around on this!

Copy link
Member

@zkamvar zkamvar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🏃🏽‍♀️⚡ THAT WAS FAST WORK! Splitting out the concerns of the valid structure and the extension, giving an informative response is 💯

@annakrystalli
Copy link
Member Author

annakrystalli commented Oct 24, 2024

Thanks for the reviews @nickreich and @zkamvar .

I'm going to merge but am wondering if we can put to bed the following issue as well #81

Offically this PR allows for round IDs that are not strictly ISO dates, so long as they match the pattern [A-Za-z0-9_]. Overall the regex pattern for round IDs is:
^(\\d{4}-\\d{2}-\\d{2})|[A-Za-z0-9_]+$

Should I go ahead and open issues to:

@annakrystalli annakrystalli merged commit 25416ba into main Oct 24, 2024
8 checks passed
@annakrystalli annakrystalli deleted the ak/fix-filename-parse-error/132 branch October 24, 2024 14:37
@zkamvar
Copy link
Member

zkamvar commented Oct 24, 2024

I'm going to merge but am wondering if we can put to bed the following issue as well #81

I think closing that issue should require specific tests for this. At the moment, all tests assume ISO dates as round IDs?

@annakrystalli
Copy link
Member Author

There is a test for round_1 round ID here:

file_path <- "model-output/team1-goodmodel/round_1-team1-good_model.parquet"
file_meta <- parse_file_name(file_path)
expect_equal(
basename(file_path),
paste(file_meta$round_id, file_meta$team_abbr, file_meta$model_abbr,
sep = "-"
) %>%
paste(file_meta$ext, sep = ".")
)
})

@zkamvar
Copy link
Member

zkamvar commented Oct 25, 2024

There is a test for round_1 round ID here:

file_path <- "model-output/team1-goodmodel/round_1-team1-good_model.parquet"
file_meta <- parse_file_name(file_path)
expect_equal(
basename(file_path),
paste(file_meta$round_id, file_meta$team_abbr, file_meta$model_abbr,
sep = "-"
) %>%
paste(file_meta$ext, sep = ".")
)
})

Oh good point! Sorry about that. I'll close #81 as completed here.

@nickreich
Copy link
Contributor

Regarding the suggestion to

Add a regex pattern to the tasks.json round_id property

I have forgotten where we are currently on this, but don't round IDs have to be YYYY-MM-DDs? I have a small worry about adding a complex regex expression as a field to a config file. Regex is a whole thing that is hard to understand for someone who hasn't interacted with them before. E.g. could we expect a hub admin to know what to put in here?

@annakrystalli
Copy link
Member Author

Totally agreed that regex are complex and confusing. Hence I'm suggesting a clarification in hubDocs that makes the expectation clear in plain language that doesn't require interpretating regex.

Having said that we have always accepted round IDs that are not ISO dates which is exactly why I'm wanting to formalise this so there aren't further misconceptions and having to answer this question over and over again.

@nickreich
Copy link
Contributor

Maybe we should migrate further discussion to a specific issue? I think basically I'm on board, and I like the idea of being more specific about round_id formatting. I guess as long as we provide several examples of commonly used formats for regex that hub admins could include in their config file it would be ok.

@annakrystalli
Copy link
Member Author

annakrystalli commented Oct 25, 2024

Maybe we should migrate further discussion to a specific issue?

Can do but we've had multiple issues and discussions about this already:

and we've established that the pattern above is what is accepted (and in truth has always been accepted. It just hasn't been documented well enough hence the lingering confusion about this).

What I'm proposing is to open specific issues that codify what has effectively always been the case in the appropriate locations rather than open another discussion issue.

I guess as long as we provide several examples of commonly used formats for regex that hub admins could include in their config file it would be ok.

The hub admins would not need to provide a regex,. In the config it's the actual values that are encoded and the schema will perform the validation of the config, ensuring round id values match the expected pattern. Hence hub admins only need to ensure their proposed round IDs match the proposed pattern. Otherwise validate_config() will complain (once we actually formally codify it).

We do already provide valid examples of what round IDs should look like in the schema here:
https://github.com/hubverse-org/schemas/blob/1b1b9b2e0810483c37faa386649e3083d5fcb05e/v3.0.1/tasks-schema.json#L28-L31

  • the idea is to bolster hubDocs documentation also,

@annakrystalli
Copy link
Member Author

annakrystalli commented Oct 25, 2024

Just to be crystal clear, hub admins just need to know that round IDs must be either:

  • ISO dates (YYYY-MM-DD)
  • Any other round identifier that contains letters, numbers or _

We can verbally explain this in hubDocs while the pattern suggested for the schema and dynamic validation of round ID values in the round ID column happens behing the scenes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

space in filename doesn't throw validation error
3 participants