Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bi-Directional Transcoding of Invalid Identifiers #3

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

miguelh-nvidia
Copy link

Description of Proposal

TfMakeValidIdentifier, was used in OpenUSD to convert any identifier into a valid identifier. However, it creates a
non-bidirectional relationship, for example, something like カーテンウォール would be transformed into ________________.

The objective of this proposal is to provide an alternative to TfMakeValidIdentifier that can take any identifier
(potentially with invalid characters) and transform it into a OpenUSD valid identifier.

Contributing

@miguelh-nvidia miguelh-nvidia requested review from asluk and nvmkuruc March 4, 2024 12:25
Copy link

@nvmkuruc nvmkuruc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proposed implementation should touch on whether or not invalid UTF-8 code points are allowed to be encoded or decoded, specifically code points that are greater than the max code point, the replacement code point, and the surrogate ranges.

proposals/transcoding_invalid_identifiers/README.md Outdated Show resolved Hide resolved
@miguelh-nvidia
Copy link
Author

The proposed implementation should touch on whether or not invalid UTF-8 code points are allowed to be encoded or decoded, specifically code points that are greater than the max code point, the replacement code point, and the surrogate ranges.

Agree, it is not clear in the proposed API. I changed to reflect on that.

  • input should be valid UTF-8
  • output with invalid UTF-8
  • reference to TfUtf8CodePoint for the other aspects (i.e. max code point, replacement code point and surrogate ranges).

);
```

Existing valid identifiers with `tn__` prefix will produce no changes.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At one point we had discussed that the right behavior is to attempt to decode and then reencode. Motivation-- Let's say I have "tn__MünchenGermany_rEi5, an identifier previously encoded with SdfBoostringEncodeIdentifier, and I want to ensure it's ASCII encode with SdfBoostringEncodeAsciiIdentiifer. It also can function as a validator. Run SdfBootrstringEncodeIdentifier as a way of ensuring that an identifier is properly encoded.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or actually vice versa-- I'm leaving a domain that required Ascii identifiers and now I want to "upgrade" to Utf8.

Copy link
Author

@miguelh-nvidia miguelh-nvidia Mar 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the input for encode should by any UTF-8 string and we do not really have any interpretation of what that string is. The output of encoding should be:

  • empty string: in case of invalid UTF-8 string
  • the same string: if the UTF-8 string is already in the domain of valid characters
  • encoded string (i.e. with tn__ prefix): if the UTF-8 string is not in the domain of valid characters.

I think the 3 proposed methods pose the minimum set of operations to fix the problem mentioned above:

std::optional<std::string> SdfBootstringReencodeIdentifier(const std::string& identifier) {
   std::string originalIdentifier = SdfBootstringDecodeIdentifier(identifier).value_or(identifier);
   return SdfBootstringEncodeIdentifier(originalIdentifier);
}

That can be added into the set of methods to the API, and I think the intent is clear: it will attempt to check the passed identifier is a valid encoded identifier and it will encode it again.

@miguelh-nvidia miguelh-nvidia force-pushed the bidirectional_transcoding_identifiers branch from 4b2a201 to 376b8e8 Compare April 26, 2024 17:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants