How should we encode text? #2

NuckChorris · 2013-01-20T01:32:48Z

If we work from Unicode, we could theoretically create a variable-length encoding in base 5 instead of simply tossing UTF-8 into the block encoding. This would store text more efficiently while retaining the entire Unicode charset.

Any ideas on how to spec this out?

dsamarin · 2013-01-20T20:24:57Z

Shouldn't we separate the format of data that we have in our code from the format of the code itself? We should have an encoding standard for FLIC the way SPARQCode is for the QR code.

Okay, back to Unicode storage. Each character is basically one of 1,112,064 numbers, and 1,112,064 is 241041224₅. The most common characters will be in the range 0 to 127, or up to 1002₅. If we want to do it similar to UTF-8, each "byte" can carry a bit of information saying, "Hey brah, there are more bits in the next byte." But representing a bit of information in base 5 is strange.
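The base-5 figures above can be checked with a few lines of Python. This is only a sanity-check sketch; `to_base5` and `SCALAR_VALUES` are names made up for the illustration.

```python
def to_base5(n: int) -> str:
    """Return the base-5 digits of a non-negative integer, most significant first."""
    if n < 0:
        raise ValueError("n must be non-negative")
    digits = []
    while True:
        n, r = divmod(n, 5)
        digits.append(str(r))
        if n == 0:
            break
    return "".join(reversed(digits))

# 0x110000 code points minus the 2,048 surrogates gives the 1,112,064 figure.
SCALAR_VALUES = 0x110000 - 0x800

print(SCALAR_VALUES)                  # 1112064
print(to_base5(SCALAR_VALUES))        # 241041224
print(to_base5(127))                  # 1002
print(len(to_base5(0x10FFFF)))        # 9 dots for the largest code point
```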

Let's say we have 4 dots in a dotbyte here. We can represent 625 numbers (4444₅+1). The bit of information we get can be whether a value is in 0-312 or 313-624. We extract that flag (whether more dotbytes follow) and then normalize the value back by subtracting 313. There's a sticky issue caused by the uneven base: you can't represent the number 312 and also say that there is another dotbyte coming, because 312 + 313 is 625, which is out of our range 😢 .... There's still a way around it, but I'll work on this idea more later.
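A rough sketch of that flag idea, assuming the upper range (313-624) is the one that means "another dotbyte follows". The names `pack`, `unpack`, and `THRESHOLD` are hypothetical. It shows the uneven split: a final dotbyte can carry 0-312, but a continuing one only 0-311.

```python
DOTBYTE_VALUES = 5 ** 4   # a 4-dot dotbyte holds 625 values (0..624)
THRESHOLD = 313           # assumption: values >= 313 flag a following dotbyte

def pack(payload: int, more: bool) -> int:
    """Combine a payload and a continuation flag into one dotbyte value."""
    # Final dotbytes get 313 payload values, continuing ones only 312.
    max_payload = (DOTBYTE_VALUES - THRESHOLD - 1) if more else (THRESHOLD - 1)
    if not 0 <= payload <= max_payload:
        raise ValueError(f"payload {payload} does not fit (max {max_payload})")
    return payload + THRESHOLD if more else payload

def unpack(value: int) -> tuple[int, bool]:
    """Split a dotbyte value back into (payload, more)."""
    if not 0 <= value < DOTBYTE_VALUES:
        raise ValueError("not a valid 4-dot value")
    if value >= THRESHOLD:
        return value - THRESHOLD, True
    return value, False

print(pack(312, more=False))  # 312
print(pack(311, more=True))   # 624
try:
    pack(312, more=True)      # 312 + 313 = 625, out of range
except ValueError as err:
    print("error:", err)
```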

dsamarin · 2013-01-20T22:02:16Z

Okay I thought of something. Given 4 dots in a dotbyte, the first dot can be the digit signifying how many more dotbytes are part of this character.

So for a character from 0 to 444₅ (124), we can just encode it as a 0 dot followed by 3 dots of representation.
For a character from 1000₅ (125) to 4444444₅ (78,124), we encode it as a 1 dot followed by 7 dots of representation.
For a character from 10000000₅ (78,125) to 44444444444₅ (48,828,124), we encode it as a 2 dot followed by 11 dots of representation.
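The three cases above could be sketched like this, treating the first dot as the count of extra dotbytes so that a character takes 3, 7, or 11 value dots. The function names are invented for the example and dots are written as a string of base-5 digits; nothing here is a settled format.

```python
DOTS_PER_DOTBYTE = 4

def to_base5(n: int, width: int) -> str:
    """Base-5 digits of n, zero-padded to exactly `width` dots."""
    digits = []
    for _ in range(width):
        n, r = divmod(n, 5)
        digits.append(str(r))
    if n:
        raise ValueError("value does not fit in the requested width")
    return "".join(reversed(digits))

def encode_char(code_point: int) -> str:
    """First dot = number of extra dotbytes, then the value dots."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    for extra in range(5):  # a single dot can only hold 0..4
        width = DOTS_PER_DOTBYTE * (extra + 1) - 1  # 3, 7, 11, ...
        if code_point < 5 ** width:
            return str(extra) + to_base5(code_point, width)
    raise ValueError("code point too large")

def decode_char(dots: str) -> int:
    """Inverse of encode_char for one encoded character."""
    extra = int(dots[0])
    width = DOTS_PER_DOTBYTE * (extra + 1) - 1
    return int(dots[1:1 + width], 5)

print(encode_char(ord("A")))        # 0230
print(encode_char(125))             # 10001000
print(len(encode_char(0x10FFFF)))   # 12 dots, i.e. three dotbytes
```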

This is more than enough... so there are a few things to consider.

The common characters of text are within the range 0-127, which is about 3 dots (3 dots cover 0-124). We shouldn't use fewer than 3 dots to represent a character.
We need 9 dots of representation for the biggest Unicode character.
The first dot should represent an integer length from 3 to 9 (that's 7 values, over what we can fit in a dot)

So we should probably map the length dot's value (0 to 4) to a dot count between 3 and 9. Just kinda eyeballing a character frequency chart, I propose this arbitrary table:

Value	Dots in character
0	3
1	4
2	6
3	8
4	9

So we just have our length dot, and then the dots that make up the character.
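Here is a minimal sketch of the length-dot scheme using the table above. The function names and the string-of-digits representation are assumptions for the example, and it always picks the shortest length that fits.

```python
LENGTHS = (3, 4, 6, 8, 9)  # length-dot value (0..4) -> number of value dots

def to_dots(n: int, width: int) -> str:
    """Base-5 digits of n, zero-padded to `width` dots."""
    digits = []
    for _ in range(width):
        n, r = divmod(n, 5)
        digits.append(str(r))
    return "".join(reversed(digits))

def encode_char(code_point: int) -> str:
    """Length dot followed by the shortest run of value dots that fits."""
    for length_dot, width in enumerate(LENGTHS):
        if 0 <= code_point < 5 ** width:
            return str(length_dot) + to_dots(code_point, width)
    raise ValueError("code point out of range")

def encode_text(text: str) -> str:
    return "".join(encode_char(ord(ch)) for ch in text)

def decode_text(dots: str) -> str:
    """Read a length dot, then that many value dots, until the input ends."""
    chars = []
    i = 0
    while i < len(dots):
        width = LENGTHS[int(dots[i])]
        chars.append(chr(int(dots[i + 1:i + 1 + width], 5)))
        i += 1 + width
    return "".join(chars)

sample = "Hi €😀"
encoded = encode_text(sample)
print(encode_char(ord("A")))           # 0230
print(encode_char(127))                # 11002
print(len(encoded))                    # 28 dots: 4 + 4 + 4 + 7 + 9
print(decode_text(encoded) == sample)  # True
```

Because each character starts with its own length dot, the decoder never needs a separator between characters.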

Of the 5^9 = 1,953,125 values that the longest (9-dot) form can hold, about 57% are Unicode values (including restricted/private use).
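The percentage depends on what gets counted, so here is a quick check under two conventions. Which one the 57% figure was meant to use is a guess on my part: the first ratio compares code points against the 9-dot patterns alone, the second against every pattern the five lengths allow (shorter forms overlap with longer ones).

```python
LENGTHS = (3, 4, 6, 8, 9)   # value dots allowed by the length table
CODE_POINTS = 0x110000      # 1,114,112 code points, surrogates included

nine_dot_patterns = 5 ** 9
all_patterns = sum(5 ** width for width in LENGTHS)

share_of_nine_dot = CODE_POINTS / nine_dot_patterns
share_of_all = CODE_POINTS / all_patterns

print(nine_dot_patterns)            # 1953125
print(all_patterns)                 # 2360125
print(f"{share_of_nine_dot:.1%}")   # 57.0%
print(f"{share_of_all:.1%}")        # 47.2%
```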

NuckChorris · 2013-01-20T23:29:22Z

SPARQCode does not handle the text encoding, and that article is wrong: QR Codes themselves define the text modes, with a byte mode that defaults to the Latin-1 charset (numeric, alphanumeric, and Kanji modes are available too).

dsamarin · 2013-01-21T01:00:07Z

Okay I'm just saying I think we should totally separate any type of encoding from the set of dots that is the barcode.
