-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bi-Directional Transcoding of Invalid Identifiers #3
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The proposed implementation should touch on whether or not invalid UTF-8 code points are allowed to be encoded or decoded, specifically code points that are greater than the max code point, the replacement code point, and the surrogate ranges.
Agree, it is not clear in the proposed API. I changed to reflect on that.
|
); | ||
``` | ||
|
||
Existing valid identifiers with `tn__` prefix will produce no changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At one point we had discussed that the right behavior is to attempt to decode and then reencode. Motivation-- Let's say I have "tn__MünchenGermany_rEi5
, an identifier previously encoded with SdfBoostringEncodeIdentifier
, and I want to ensure it's ASCII encode with SdfBoostringEncodeAsciiIdentiifer
. It also can function as a validator. Run SdfBootrstringEncodeIdentifier
as a way of ensuring that an identifier is properly encoded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or actually vice versa-- I'm leaving a domain that required Ascii identifiers and now I want to "upgrade" to Utf8.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the input for encode should by any UTF-8 string and we do not really have any interpretation of what that string is. The output of encoding should be:
- empty string: in case of invalid UTF-8 string
- the same string: if the UTF-8 string is already in the domain of valid characters
- encoded string (i.e. with
tn__
prefix): if the UTF-8 string is not in the domain of valid characters.
I think the 3 proposed methods pose the minimum set of operations to fix the problem mentioned above:
std::optional<std::string> SdfBootstringReencodeIdentifier(const std::string& identifier) {
std::string originalIdentifier = SdfBootstringDecodeIdentifier(identifier).value_or(identifier);
return SdfBootstringEncodeIdentifier(originalIdentifier);
}
That can be added into the set of methods to the API, and I think the intent is clear: it will attempt to check the passed identifier is a valid encoded identifier and it will encode it again.
4b2a201
to
376b8e8
Compare
Description of Proposal
TfMakeValidIdentifier
, was used in OpenUSD to convert any identifier into a valid identifier. However, it creates anon-bidirectional relationship, for example, something like
カーテンウォール
would be transformed into________________
.The objective of this proposal is to provide an alternative to
TfMakeValidIdentifier
that can take any identifier(potentially with invalid characters) and transform it into a OpenUSD valid identifier.
Contributing