Use the Tor DB as GeoIP database #807
base: next
import java.util.Map;
import java.util.zip.ZipException;
These are needed despite no other changes to this file?
> These are needed despite no other changes to this file?
It appears Arne forgot to include the other changes in IPConverter, which you can see are explained in a way that it is already done. I figure he intends to come back to paste the remaining changes later.
@ArneBab Are these changes incomplete? Is this a draft?
38cd230 to 89fdffe
a) The DB file size is bigger. The current IpToCountry.dat is 1.2 MiB, the Tor DB is 4 MiB, and the optimized Tor DB is 2 MiB. "Optimized" means I used 'sed' to remove an unnecessary column in the DB. If you really want to go for size, you can zip it to ~700 KiB. This increases runtime a bit, but it is still ~15x faster than the old method.
b) Tor uses "??" instead of "ZZ" for unknown codes, and still uses "CS", which stands for Serbia and Montenegro, a country that stopped existing in 2006. Maybe someone could ask them why their DB uses "CS"... This can be solved easily by replacing both codes with 'sed'.
c) The new DB is sorted in ascending order, which means that the binary search function would have to be changed. Right now I simply reverse the array; changing the search instead would save another ~6 ms, but I don't know how to do this.

human-readable text file in a zip.

Advantages:
- It's human readable
- It's easy to update because we can use Tor geoip
- It's a lot faster than the base85 approach
- It has a smaller file size

==== New zip file ====
The code no longer uses the IpToCountry.dat file and instead uses a zip file called IpToCountry.zip. This zip file is expected to contain exactly one text file in the Tor geoip format according to the spec below. The zip file should be compressed to save space (~2 MiB uncompressed -> 0.7 MiB compressed).

==== New IpToCountry.txt file ====
Format for each line: <fromIP>,<ISO 3166-1 alpha-2 country code>
Example: 16781312,JP
This is like the old format, but not base85 encoded.
Empty lines are allowed.
Comments may start with any symbol other than a number.
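The file format above is simple enough to illustrate with a short reader. The following is a minimal sketch, not the IPConverter code from this PR: the class name `GeoIpZipSketch`, the `Entry` type, and both method names are hypothetical, and it assumes the text file is the first entry in the zip and is UTF-8 encoded. It skips empty lines and lines that do not start with a digit, and casts the parsed long to int as described in the commit message.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipInputStream;

/** Hypothetical sketch; not the IPConverter code from this PR. */
final class GeoIpZipSketch {

    /** One range start and its two-letter country code. */
    static final class Entry {
        final int fromIp;       // IPv4 value cast from long to int, so it may be negative
        final String country;

        Entry(int fromIp, String country) {
            this.fromIp = fromIp;
            this.country = country;
        }
    }

    /** Parses "<fromIP>,<country>"; returns null for empty, comment, or malformed lines. */
    static Entry parseLine(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return null;
        }
        char first = trimmed.charAt(0);
        if (first < '0' || first > '9') {
            return null; // anything not starting with a digit is treated as a comment
        }
        int comma = trimmed.indexOf(',');
        if (comma < 0) {
            return null;
        }
        String country = trimmed.substring(comma + 1).trim();
        if (country.length() != 2) {
            return null;
        }
        try {
            long ip = Long.parseLong(trimmed.substring(0, comma));
            return new Entry((int) ip, country);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    /** Reads the first entry of the zip stream and parses every line in it. */
    static List<Entry> readZip(InputStream in) throws IOException {
        List<Entry> entries = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            if (zip.getNextEntry() == null) {
                throw new IOException("zip archive contains no entries");
            }
            BufferedReader reader =
                    new BufferedReader(new InputStreamReader(zip, StandardCharsets.UTF_8));
            String line;
            while ((line = reader.readLine()) != null) {
                Entry entry = parseLine(line);
                if (entry != null) {
                    entries.add(entry);
                }
            }
        }
        return entries;
    }
}
```

Returning null for malformed lines keeps the reader tolerant of stray content; whether the real code should instead fail loudly on a damaged database is a design choice this sketch does not settle.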
----------------------------------
Get the raw .txt file here: https://github.com/torproject/tor/raw/main/src/config/geoip
The file has to be processed with the following three 'sed' commands:

sed -E -i 's/([0-9]*),[0-9]*,([A-Z]*)/\1,\2/g' IpToCountry.txt &&
sed -E -i 's/,\?\?/,ZZ/g' IpToCountry.txt &&
sed -E -i 's/,CS/,RS/g' IpToCountry.txt

1) Remove the toIP column, because the Tor geoip format is: fromIP,toIP,countryCode. Freenet does not need the toIP value; the binary search algorithm takes care of this.
2) Replace '??' with 'ZZ' for unknown countries, because '??' is not in the ISO 3166 standard.
3) Replace 'CS' with 'RS' because the country 'CS' is not in the ISO 3166 standard.

Zip this text file into IpToCountry.zip and place it in the main Freenet folder.

==== Code changes ====
The base85 code is left in the source, as well as the file reader for the old format.
- src/freenet/clients/http/geoip/IPConverter.java
-- Zip reader to save space.
-- ArrayList is allocated with 180000 slots so that it does not resize many times (this does not matter for speed anyway).
-- Ignore empty lines and lines that start with anything but a number (comments).
-- Cast the Long value to (int), exactly like the old code did.
-- Get country, identical to old code.
-- Reverse the List, because the binary search expects the list to be in descending order. Takes <10 ms.
-- Convert the List<Integer/Short> to int[]/short[] to save lots of memory. See below for explanation. Takes <10 ms.
-- Catch all possible errors.
-- I did not feel confident in messing with the binary search because I might overlook some edge case where indexes would no longer match, so I left it alone. Reversing both arrays takes less than 10 ms combined.
- src/freenet/node/NodeFile.java
-- Changed default location from 'IpToCountry.dat' to 'IpToCountry.zip'.
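The list above leaves the existing binary search untouched and reverses the data instead. For comparison, here is a minimal sketch of what a lookup over the ascending data could look like. It is an illustration only, not code from this PR or from the existing IPConverter: the class `AscendingLookupSketch` and both methods are hypothetical, and the sketch assumes that range starts should be compared as unsigned 32-bit values, since the cast to int makes addresses at or above 128.0.0.0 negative. It also shows the unboxing step from List<Integer> to int[].

```java
import java.util.List;

/** Hypothetical sketch; the PR itself keeps the existing descending-order search. */
final class AscendingLookupSketch {

    /** Copies boxed Integers into a primitive array to avoid per-element object overhead. */
    static int[] toIntArray(List<Integer> values) {
        int[] out = new int[values.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = values.get(i);
        }
        return out;
    }

    /**
     * Returns the index of the last range start that is <= ip (unsigned comparison),
     * or -1 if ip lies below the first range start. fromIps must be sorted ascending
     * by unsigned value.
     */
    static int floorIndex(int[] fromIps, int ip) {
        int lo = 0;
        int hi = fromIps.length - 1;
        int result = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (Integer.compareUnsigned(fromIps[mid], ip) <= 0) {
                result = mid;   // candidate; look for a later one
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return result;
    }
}
```

The returned index would then be used to read the country code from a parallel array at the same position, which is why both arrays must stay in the same order.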
Memory from heap dump according to VisualVM:
List<Integer> vs int[]: 3.3 MiB vs 660 KiB
List<Short> vs short[]: 2.0 MiB vs 330 KiB

==== Further changes (aka 'more stuff to do for Arne' :) ) ====
https://github.com/freenet/scripts#releasing-stable-freenet-builds
The FAQ link has to be removed as the old IP DB site is no longer used.
/scripts/setup-release-environment
Has to be adjusted. How did it work in the past few years, Arne, given that the website has been offline for a while?
The new zip file has to be added to the insert/release script.
89fdffe to 8465bc5