How should we encode text? #2
Comments
Shouldn't we separate the format of the data we store in our code from the format of the code itself? We should have an encoding standard for FLIC the way SPARQCode is one for the QR code.

Okay, back to Unicode storage. Each character is basically one of 1,112,064 numbers. This is 241041224₅. The most common characters will be in the range 0 to 127, or 0 to 1002₅. If we want to do it similar to UTF-8, each "byte" can carry a bit of information saying, "Hey brah, there's also more bits in the next byte." But representing a bit of information in base 5 is strange.

Let's say we have 4 dots in a dotbyte here. We can represent 625 numbers (4444₅ + 1). The bit of information we get can be whether a value is between 0-312 or 313-624. Then we extract the "more dotbytes follow" information and normalize the value back by subtracting 313. There's a sticky issue caused by the uneven base: you can't represent the number 312 and also say that there is another dotbyte coming, because 312 + 313 is 625, which is out of our range 😢 .... There's still a way around it, but I'll work on this idea more later.
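To make the arithmetic above concrete, here is a rough Python sketch of the split-range idea. The names `to_base5`, `pack_dotbyte`, and `unpack_dotbyte` are made up for illustration and are not part of any FLIC spec; it simply rejects the awkward case (payload 312 with the continuation flag set) instead of working around it.

```python
BASE = 5
DOTS_PER_DOTBYTE = 4
DOTBYTE_RANGE = BASE ** DOTS_PER_DOTBYTE  # 625 values: 0..624
HALF = 313  # 0..312 means "last dotbyte", 313..624 means "more follows"


def to_base5(n, width=1):
    """Return the base-5 digits of n, most significant first, zero-padded to width."""
    digits = []
    while n > 0:
        n, r = divmod(n, BASE)
        digits.append(r)
    digits += [0] * (width - len(digits))  # pad (no-op when already wide enough)
    return digits[::-1]


def pack_dotbyte(payload, more):
    """Fold the continuation flag into a single dotbyte value (hypothetical helper)."""
    limit = DOTBYTE_RANGE - HALF if more else HALF  # 312 payloads with flag, 313 without
    if not 0 <= payload < limit:
        raise ValueError(f"payload {payload} does not fit (more={more})")
    return payload + HALF if more else payload


def unpack_dotbyte(value):
    """Split a dotbyte value back into (payload, more)."""
    if not 0 <= value < DOTBYTE_RANGE:
        raise ValueError(f"{value} is not a valid dotbyte value")
    more = value >= HALF
    return (value - HALF if more else value), more


print(to_base5(1_112_064))      # [2, 4, 1, 0, 4, 1, 2, 2, 4]
print(to_base5(127))            # [1, 0, 0, 2]
print(pack_dotbyte(311, True))  # 624, the largest payload that can carry the flag
```

The uneven split shows up directly in `limit`: a final dotbyte can carry 313 distinct payloads, but a continued one only 312.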
Okay, I thought of something. Given 4 dots in a dotbyte, the first dot can be the digit signifying how many more dotbytes are part of this character. So for a character from 0 to 444₅ (124), we can just encode it as a 0 dot followed by 3 dots of representation. This is more than enough... so there are a few things to consider.
So by this we should probably map the number from 0 to 4 to numbers from 3 to 9. Just kinda eyeballing a character frequency chart, I propose this arbitrary table:
So we just have our length dot, and then the dots that make up the character. Out of all possible representations with this format, 57% are Unicode values (including restricted/private use).
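A minimal Python sketch of the length-dot layout, for anyone who wants to poke at it. It does not use the proposed frequency table; it assumes the simpler rule from the start of this comment, where the length dot is literally the number of extra 4-dot dotbytes (so the payload is 3, 7, or 11 dots wide). The function names are hypothetical.

```python
BASE = 5
DOTS_PER_DOTBYTE = 4
FIRST_PAYLOAD_DOTS = DOTS_PER_DOTBYTE - 1  # first dotbyte loses one dot to the length
MAX_CODE_POINT = 0x10FFFF


def encode_char(code_point):
    """Encode one code point as [length dot] + payload dots (each dot is 0..4)."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"{code_point} is outside the Unicode range")
    digits = []
    n = code_point
    while n > 0:
        n, r = divmod(n, BASE)
        digits.append(r)
    digits.reverse()
    # Assumption: the length dot counts extra dotbytes directly (no remapping table).
    extra = 0
    while FIRST_PAYLOAD_DOTS + DOTS_PER_DOTBYTE * extra < len(digits):
        extra += 1
    width = FIRST_PAYLOAD_DOTS + DOTS_PER_DOTBYTE * extra
    return [extra] + [0] * (width - len(digits)) + digits


def decode_char(dots):
    """Decode one character from the front of dots; return (code_point, dots_used)."""
    extra = dots[0]
    width = FIRST_PAYLOAD_DOTS + DOTS_PER_DOTBYTE * extra
    payload = dots[1:1 + width]
    if len(payload) != width:
        raise ValueError("truncated character")
    value = 0
    for d in payload:
        value = value * BASE + d
    return value, 1 + width


print(encode_char(ord("A")))  # [0, 2, 3, 0]
print(encode_char(125))       # [1, 0, 0, 0, 1, 0, 0, 0]
```

Under this unmapped rule only length values 0 to 2 are ever needed for Unicode (11 payload dots cover the whole range), which is one reason a remapping like the one proposed above could use the length dot more efficiently.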
SPARQCode does not handle the text encoding, and that article is wrong: QR Codes actually use a text mode representing the Latin-1 charset (with Hiragana and binary modes available too).
Okay, I'm just saying I think we should totally separate any type of encoding from the set of dots that is the barcode.
If we work from Unicode, we could theoretically create a variable-length encoding based in base-5 instead of simply tossing UTF-8 into the block encoding. This would provide for greater efficiency for text storage while retaining the entire Unicode charset.
Any ideas on how to spec this out?