How should we encode text? #2

Open
NuckChorris opened this issue Jan 20, 2013 · 4 comments

@NuckChorris
Member

If we work from Unicode, we could theoretically create a variable-length encoding based on base 5 instead of simply tossing UTF-8 into the block encoding. This would make text storage more efficient while retaining the entire Unicode character set.

Any ideas on how to spec this out?

@dsamarin
Member

Shouldn't we separate the format of data that we have in our code from the format of the code itself? We should have an encoding standard for FLIC the way SPARQCode is for the QR code.

Okay back to Unicode storage. Each character is basically one of 1,112,064 numbers. This is 241041224₅. The most common characters will be in the range 0 to 127 or 1002₅. If we want to do it similar to UTF-8, each "byte" can have a bit of information saying, "Hey brah, there's also more bits in the next byte." But representing a bit of information in base 5 is strange.
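For anyone who wants to double-check the base-5 figures, here is a minimal sketch of the conversion. The helper name `to_base5` is made up for illustration and is not part of any spec discussed here.

```python
def to_base5(n: int) -> str:
    """Return the base-5 digits of a non-negative integer as a string."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return "0"
    digits = []
    while n:
        n, remainder = divmod(n, 5)   # peel off the least significant digit
        digits.append(str(remainder))
    return "".join(reversed(digits))


print(to_base5(1_112_064))  # 241041224
print(to_base5(127))        # 1002
```

Python's built-in `int("241041224", 5)` goes the other way and gives back 1112064.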

Let's say we have 4 dots in a dotbyte here. We can represent 625 numbers (4444₅ + 1). The bit of information we get can be whether a value is between 0-312 or 313-624. Then we extract the information of having more characters and then normalize it back by subtracting 313. There's a sticky issue caused by the uneven base where you can't represent the number 312 but also say that there is another dotbyte coming because 312 + 313 is 625 and is out of our range 😢 .... so there's still a way around it but I'll work on this idea more later.
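A rough sketch of that flag idea, assuming a dotbyte is handled as a plain integer from 0 to 624. The names `pack_dotbyte`, `unpack_dotbyte` and `THRESHOLD` are hypothetical, and this only illustrates the split at 313, including the sticky case where payload 312 cannot carry the "more" flag.

```python
from typing import Tuple

DOTBYTE_SIZE = 5 ** 4   # 625 distinct values in a 4-dot dotbyte
THRESHOLD = 313         # values >= 313 signal "another dotbyte follows"


def pack_dotbyte(payload: int, more: bool) -> int:
    """Fold a payload and a continuation flag into one dotbyte value."""
    # With the flag set only 625 - 313 = 312 payloads fit (0..311);
    # with the flag clear there is room for 313 payloads (0..312).
    limit = DOTBYTE_SIZE - THRESHOLD if more else THRESHOLD
    if not 0 <= payload < limit:
        raise ValueError(f"payload {payload} does not fit (more={more})")
    return payload + THRESHOLD if more else payload


def unpack_dotbyte(value: int) -> Tuple[int, bool]:
    """Split a dotbyte value back into (payload, more)."""
    if not 0 <= value < DOTBYTE_SIZE:
        raise ValueError("dotbyte value must be in 0..624")
    if value >= THRESHOLD:
        return value - THRESHOLD, True
    return value, False


print(unpack_dotbyte(pack_dotbyte(311, True)))   # (311, True)
print(unpack_dotbyte(pack_dotbyte(312, False)))  # (312, False)
```

Calling `pack_dotbyte(312, True)` raises `ValueError`, which is exactly the uneven-base problem described above.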

@dsamarin
Member

Okay I thought of something. Given 4 dots in a dotbyte, the first dot can be the digit signifying how many more dotbytes are part of this character.

So for a character from 0 to 444₅ (124), we can just encode it as a 0 dot followed by 3 dots of representation.
For a character from 1000₅ (125) to 4444444₅ (78,124), we encode it as a 1 dot followed by 7 dots of representation.
For a character from 10000000₅ (78,125) to 44444444444₅ (48,828,124), we encode it as a 2 dot followed by 11 dots of representation.
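The three cases can be sketched like this, assuming dots are represented as a list of integers 0 to 4 with the most significant dot first. The function names are made up for illustration; nothing here is a settled format.

```python
from typing import List


def encode_char(code_point: int) -> List[int]:
    """Encode one code point as a count dot plus 3, 7 or 11 payload dots."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    for extra in range(3):            # leading dot: 0, 1 or 2 extra dotbytes
        width = 3 + 4 * extra         # payload dots: 3, 7 or 11
        if code_point < 5 ** width:
            digits = []
            for _ in range(width):
                code_point, remainder = divmod(code_point, 5)
                digits.append(remainder)
            return [extra] + digits[::-1]
    raise ValueError("code point too large for 11 payload dots")


def decode_char(dots: List[int]) -> int:
    """Decode the character that starts at the front of a dot sequence."""
    width = 3 + 4 * dots[0]
    value = 0
    for dot in dots[1:1 + width]:
        value = value * 5 + dot
    return value


print(encode_char(65))    # [0, 2, 3, 0]
print(encode_char(125))   # [1, 0, 0, 0, 1, 0, 0, 0]
```

In this sketch every character takes 4, 8 or 12 dots, so the highest code point U+10FFFF needs the full 12.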

This is more than enough... so there are a few things to consider.

  1. The common characters of text are within the range 0-127 or about 3 dots. We shouldn't use less than 3 dots to represent a character.
  2. We need 9 dots of representation for the biggest Unicode character.
  3. The first dot should represent an integer length from 3 to 9 (that's 7 values, more than the 5 a single dot can hold)

So by this we should probably map the dot values 0 to 4 onto lengths between 3 and 9. Just kinda eyeballing a character frequency chart, I propose this arbitrary table:

| Value | Dots in character |
| --- | --- |
| 0 | 3 |
| 1 | 4 |
| 2 | 6 |
| 3 | 8 |
| 4 | 9 |

So we just have our length dot, and then the dots that make up the character.
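Here is a minimal sketch of the length-dot scheme with the proposed mapping (0, 1, 2, 3, 4 to 3, 4, 6, 8, 9 dots), assuming each character is stored in the shortest width that fits and dots are plain integers 0 to 4. `LENGTH_TABLE`, `encode_text` and `decode_text` are hypothetical names.

```python
from typing import List

# Length dot value -> number of payload dots (the proposed, arbitrary mapping).
LENGTH_TABLE = [3, 4, 6, 8, 9]


def encode_text(text: str) -> List[int]:
    """Encode a string as a flat list of base-5 dots."""
    dots = []
    for ch in text:
        code_point = ord(ch)
        # Pick the first width that fits; 9 dots always fits since
        # 5**9 = 1,953,125 exceeds the largest code point 0x10FFFF.
        for length_dot, width in enumerate(LENGTH_TABLE):
            if code_point < 5 ** width:
                break
        digits = []
        for _ in range(width):
            code_point, remainder = divmod(code_point, 5)
            digits.append(remainder)
        dots.append(length_dot)
        dots.extend(reversed(digits))
    return dots


def decode_text(dots: List[int]) -> str:
    """Decode a flat list of dots produced by encode_text."""
    chars = []
    i = 0
    while i < len(dots):
        width = LENGTH_TABLE[dots[i]]
        value = 0
        for dot in dots[i + 1:i + 1 + width]:
            value = value * 5 + dot
        chars.append(chr(value))
        i += 1 + width
    return "".join(chars)


print(encode_text("A"))                      # [0, 2, 3, 0]
print(decode_text(encode_text("FLIC")))      # FLIC
```

Under these assumptions plain ASCII up to code point 124 costs 4 dots per character, and anything beyond 5⁸ - 1 = 390,624 costs 10.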

Out of all possible representations with this format, 57% are Unicode values (including restricted/private use).

@NuckChorris
Member Author

SPARQCode does not handle the text encoding, and that article is wrong: QR codes actually use a byte mode that defaults to the Latin-1 charset (with numeric, alphanumeric, and Kanji modes available too).

@dsamarin
Member

Okay I'm just saying I think we should totally separate any type of encoding from the set of dots that is the barcode.
