How do you represent bencoded strings that contain a <hex> tag to avoid ambiguity? #7

Open
josecelano opened this issue Nov 1, 2024 · 12 comments · May be fixed by #13
Labels
bug Something isn't working

Comments

@josecelano
Member

josecelano commented Nov 1, 2024

Relates to: Chocobo1/bencode_online#3

How do you represent bencoded strings that contain a <hex> tag, so that they are not ambiguous with the <hex> tags introduced for non-UTF-8 bencoded strings?

Submitted on reddit by Icarium-Lifestealer

UPDATE

Example, two bencoded values producing the same output:

printf "13:<hex>ff</hex>" | cargo run -> "<hex>ff</hex>"  (13 bytes)
printf "1:\xff"           | cargo run -> "<hex>ff</hex>"  (1 byte)
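
The collision above can be reproduced with a minimal sketch. It assumes the converter emits valid UTF-8 strings as-is and wraps everything else in a <hex> tag; `naive_repr` and `to_hex` are hypothetical names for illustration, not functions from this crate.

```rust
/// Hypothetical helper: lowercase hex dump of a byte slice.
fn to_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}

/// Assumed naive behaviour: valid UTF-8 is emitted unchanged,
/// anything else is wrapped in a <hex> tag.
fn naive_repr(bytes: &[u8]) -> String {
    match std::str::from_utf8(bytes) {
        Ok(text) => text.to_string(),
        Err(_) => format!("<hex>{}</hex>", to_hex(bytes)),
    }
}
```

With this behaviour, the 13-byte string `<hex>ff</hex>` and the single byte 0xff map to the same text, so the original bytes cannot be recovered from the output.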

List of special characters used in JSON that need to be escaped in string values:

  • \b Backspace (ascii code 08)
  • \f Form feed (ascii code 0C)
  • \n New line
  • \r Carriage return
  • \t Tab
  • \" Double quote
  • \\ Backslash character
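
As a rough sketch of how the characters listed above could be escaped when writing a JSON string value: `escape_json` is a hypothetical helper, not the crate's actual code. It also escapes the remaining control characters below U+0020 as \uXXXX, which JSON requires in addition to the list above.

```rust
/// Hypothetical helper: escape a string for use inside a JSON string value
/// (without the surrounding double quotes).
fn escape_json(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for ch in text.chars() {
        match ch {
            '\u{08}' => out.push_str("\\b"), // backspace
            '\u{0C}' => out.push_str("\\f"), // form feed
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            // Any other control character must be written as \uXXXX.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            // Everything else, including '/', '<' and '>', is emitted unchanged.
            other => out.push(other),
        }
    }
    out
}
```

Note that the forward slash is not escaped here; JSON allows escaping it but does not require it.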
@josecelano
Member Author

josecelano commented Nov 1, 2024

Proposal 1

Maybe we can always include some tags. For byte sequences containing valid UTF-8:

"<utf8>spam</utf8>"

And for non-UTF-8 sequences:

"<bytes>fffe</bytes>"

The encoded value 6:<utf8> would be "<utf8><utf8></utf8>".

@josecelano
Member Author

josecelano commented Nov 7, 2024

Proposal 2

"<string>spam</string>"

And for non-UTF-8 sequences:

"<hex>fffe</hex>"

This would be half-compatible with the other implementation, since non-UTF-8 sequences keep the <hex> tag.
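
A minimal sketch of what Proposal 2 could look like, assuming every bencoded string is wrapped in exactly one root tag. `tagged_repr` and `to_hex` are hypothetical names, and JSON escaping of the result is left out.

```rust
/// Hypothetical helper: lowercase hex dump of a byte slice.
fn to_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}

/// Sketch of Proposal 2: always wrap the value in a tag, so the tag itself
/// says whether the content is text or a hex dump.
fn tagged_repr(bytes: &[u8]) -> String {
    match std::str::from_utf8(bytes) {
        Ok(text) => format!("<string>{}</string>", text),
        Err(_) => format!("<hex>{}</hex>", to_hex(bytes)),
    }
}
```

Under this scheme the 13-byte string `<hex>ff</hex>` becomes `<string><hex>ff</hex></string>`, while the single byte 0xff becomes `<hex>ff</hex>`, so the two inputs no longer collide.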

@josecelano
Member Author

josecelano commented Nov 7, 2024

Proposal 3

Maybe we can even simplify the metadata by using just a prefix instead of an HTML-style tag:

"string:spam"

And for non-UTF-8 sequences:

"hex:fffe"

@josecelano
Member Author

Hey, @da2ce7, can you add an example of what we discussed today?

I mean one example like this:

printf "13:<hex>ff</hex>" | cargo run -> "<hex>ff</hex>"  (13 bytes)
printf "1:\xff"           | cargo run -> "<hex>ff</hex>"  (1 byte)

It would be for the problem with escaped chars you mentioned in the meeting. I can't come up with an example myself.

@josecelano
Member Author

Hi @da2ce7, if you have a concrete example for this ☝🏼 that would be awesome! Thanks.

@da2ce7
Contributor

da2ce7 commented Nov 12, 2024

The problem is that it is a lossy transformation.

There are multiple bencode inputs that can produce identical JSON results.

The issue is that if we escape XML-style tags, then there are two inputs that produce the same output.

The <\/> and </> inputs produce the same <\/> output.

So we don't know if the bencode input was "pre-escaped" or not.

@josecelano
Member Author

Hi @da2ce7, the forward slash does not need to be escaped in JSON. The two examples you mention, <\/> and </>, produce different JSON values:

printf "3:</>"  | cargo run -> "</>"
printf "4:<\/>" | cargo run -> "<\\/>"

The backslash (hex 5C) is escaped, but I guess we can revert that without ambiguity. We only need to unescape the JSON value before converting it into bytes, right?

printf "1:\x5c" | cargo run -> "\\"
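
The reverse step could look roughly like the sketch below. `unescape_json` is a hypothetical helper; it only handles the single-character escapes discussed in this issue and rejects anything else, including \uXXXX escapes, by returning `None`.

```rust
/// Hypothetical helper: undo the single-character JSON escapes in a string
/// value (given without the surrounding double quotes).
fn unescape_json(escaped: &str) -> Option<String> {
    let mut out = String::with_capacity(escaped.len());
    let mut chars = escaped.chars();
    while let Some(ch) = chars.next() {
        if ch != '\\' {
            out.push(ch);
            continue;
        }
        // A backslash must be followed by a known escape character.
        match chars.next()? {
            'b' => out.push('\u{08}'),
            'f' => out.push('\u{0C}'),
            'n' => out.push('\n'),
            'r' => out.push('\r'),
            't' => out.push('\t'),
            '"' => out.push('"'),
            '\\' => out.push('\\'),
            '/' => out.push('/'), // JSON allows, but does not require, escaping '/'
            _ => return None,
        }
    }
    Some(out)
}
```

Because the encoder always escapes a literal backslash as `\\`, the JSON value `<\\/>` unescapes to the 4-byte input `<\/>` and the JSON value `</>` unescapes to the 3-byte input `</>`, so the two cases stay distinguishable.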

@da2ce7
Contributor

da2ce7 commented Nov 13, 2024

I think that you are correct. If we always insert the escaping character, then it will have a reversible encoding. 👍

@josecelano
Member Author

Hey @da2ce7, which proposal do you prefer? I prefer proposal 2:

"<string>spam</string>"
"<hex>ff</hex>"

Because:

  • It's half-compatible with the other implementation.
  • It's extensible: we can add more tags later, for example:

"<hex>ff</hex><metadata>...</metadata>"

as long as we include <hex>...</hex> at the "root" level of this kind of HTML document.

With proposal 3, we can only add new prefixes.
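
To illustrate why proposal 2 is reversible, here is a minimal decoding sketch. It assumes the value has already been JSON-unescaped and contains exactly one root tag; `decode_tagged` and `decode_hex` are hypothetical names, and additional tags after the root one are not handled.

```rust
/// Hypothetical helper: parse a lowercase or uppercase hex dump into bytes.
fn decode_hex(hex: &str) -> Option<Vec<u8>> {
    if hex.len() % 2 != 0 || !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    (0..hex.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&hex[i..i + 2], 16).ok())
        .collect()
}

/// Sketch: recover the original bencoded string bytes from a tagged value.
fn decode_tagged(value: &str) -> Option<Vec<u8>> {
    if let Some(inner) = value
        .strip_prefix("<string>")
        .and_then(|rest| rest.strip_suffix("</string>"))
    {
        // Text content is taken verbatim, even if it looks like a tag itself.
        return Some(inner.as_bytes().to_vec());
    }
    if let Some(inner) = value
        .strip_prefix("<hex>")
        .and_then(|rest| rest.strip_suffix("</hex>"))
    {
        return decode_hex(inner);
    }
    None
}
```

Only the outermost tag is interpreted, so `<string><hex>ff</hex></string>` decodes to the 13 bytes of the text `<hex>ff</hex>`, while `<hex>ff</hex>` decodes to the single byte 0xff.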

@josecelano
Member Author

@da2ce7 has just told me that proposal 2 is fine.

@josecelano
Member Author

After merging this change, we have to update the draft TEP.

@josecelano
Member Author

And also bump the version to 0.2.0 and publish it on crates.io.
