Use the Tor DB as GeoIP database #807

Open: wants to merge 1 commit into base branch next


ArneBab (Contributor) commented Nov 27, 2022

a) The uncompressed DB file is bigger. The current IpToCountry.dat is 1.2 MiB, the Tor DB is 4 MiB, and the optimized Tor DB is 2 MiB. "Optimized" means I used 'sed' to remove an unnecessary column in the DB. If you really want to go for size, you can zip it to ~700 KiB; this increases runtime a bit, but it's still ~15x faster than the old method.

b) Tor uses "??" instead of "ZZ" for unknown codes, and still uses "CS", which stands for Serbia and Montenegro, a country that stopped existing in 2006. Maybe someone could ask them why their DB uses "CS"... Both can be fixed easily by replacing them with 'sed'.

c) The new DB is sorted in ascending order, so the binary search function would have to be changed to work on ascending data. Right now I simply reverse the array instead. Changing the search itself would save another ~6 ms, but I don't know how to do this.
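
For point c), one possible way to search the ascending data directly is a "floor" binary search: find the last range start that is less than or equal to the address. The following is only a sketch under assumptions, not the existing Freenet lookup code: the class and method names are hypothetical, and it assumes range starts are stored as int values that should be compared as unsigned 32-bit numbers.

```java
final class AscendingRangeLookup {
    /**
     * Returns the index of the last range start that is <= ip, comparing
     * both as unsigned 32-bit values, or -1 if ip lies before the first
     * range. rangeStarts must be sorted ascending by unsigned value.
     * (Hypothetical helper, not part of the existing code base.)
     */
    static int floorIndex(int[] rangeStarts, int ip) {
        int lo = 0;
        int hi = rangeStarts.length - 1;
        int result = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1; // midpoint without int overflow
            if (Integer.compareUnsigned(rangeStarts[mid], ip) <= 0) {
                result = mid;  // candidate found; a later one may also fit
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return result;
    }
}
```

The returned index could then be used to read the country from a parallel array, so no reversing step would be needed.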

The new format is a human-readable text file in a zip.

Advantages:

  • It's human readable
  • It's easy to update because we can use Tor geoip
  • It's a lot faster than the base85 approach
  • It has a smaller file size when zipped (~0.7 MiB instead of 1.2 MiB)

==== New zip file ====

The code no longer uses the IpToCountry.dat file and instead uses a zip file called IpToCountry.zip. This zip file is expected to contain exactly one text file in the Tor geoip format according to the spec below. The zip file should be compressed to save space (~2 MiB uncompressed -> 0.7 MiB compressed).

==== New IpToCountry.txt file ====

Format for each line: <fromIP>,<ISO 3166-1 alpha-2 country code>
Example: 16781312,JP
This is like the old format, but not base85 encoded.

Empty lines are allowed.
Comments may start with any character other than a digit.
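
As an illustration of the format rules above, here is a minimal sketch of how a single line could be parsed. The class GeoIpLine and its parse method are hypothetical names chosen for this example, not the code in IPConverter.java; malformed data lines are simply skipped here, which is an assumption.

```java
import java.util.Optional;

final class GeoIpLine {
    final long fromIp;     // start of the range as an unsigned 32-bit value
    final String country;  // two-letter country code, e.g. "JP"

    GeoIpLine(long fromIp, String country) {
        this.fromIp = fromIp;
        this.country = country;
    }

    /**
     * Parses "<fromIP>,<country code>". Returns empty for blank lines,
     * for comments (anything not starting with a digit) and for lines
     * that cannot be parsed.
     */
    static Optional<GeoIpLine> parse(String line) {
        String s = line.trim();
        if (s.isEmpty() || s.charAt(0) < '0' || s.charAt(0) > '9') {
            return Optional.empty();
        }
        int comma = s.indexOf(',');
        if (comma < 0) {
            return Optional.empty();
        }
        try {
            long ip = Long.parseLong(s.substring(0, comma));
            return Optional.of(new GeoIpLine(ip, s.substring(comma + 1).trim()));
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }
}
```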


Get the raw .txt file here: https://github.com/torproject/tor/raw/main/src/config/geoip

The file has to be processed with the following three 'sed' commands:

sed -E -i 's/([0-9]*),[0-9]*,([A-Z]*)/\1,\2/g' IpToCountry.txt && sed -E -i 's/,\?\?/,ZZ/g' IpToCountry.txt && sed -E -i 's/,CS/,RS/g' IpToCountry.txt

  1. Remove the toIP column, because the Tor geoip format is: fromIP,toIP,countryCode. Freenet does not need the toIP value; the binary search algorithm will take care of this.
  2. Replace '??' with 'ZZ' for unknown countries, because '??' is not in the ISO 3166 standard.
  3. Replace 'CS' with 'RS' because the country 'CS' is not in the ISO 3166 standard.

Zip this text file into IpToCountry.zip and place it in the main Freenet folder.
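
For anyone who prefers not to depend on 'sed', the three steps can be sketched in Java as a per-line conversion. This is an assumption-laden illustration rather than part of the patch: TorGeoIpConverter and convertLine are hypothetical names, and the pattern is stricter than the 'sed' expressions (it only rewrites lines that look exactly like fromIP,toIP,code and leaves everything else, such as comments, unchanged).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TorGeoIpConverter {
    // A Tor data line: fromIP,toIP,countryCode (code is "??" or two capitals).
    private static final Pattern TOR_LINE =
            Pattern.compile("^([0-9]+),[0-9]+,(\\?\\?|[A-Z]{2})$");

    /** Converts one Tor geoip line to "fromIP,countryCode". */
    static String convertLine(String line) {
        Matcher m = TOR_LINE.matcher(line.trim());
        if (!m.matches()) {
            return line; // comments and anything unexpected stay untouched
        }
        String code = m.group(2);
        if (code.equals("??")) {
            code = "ZZ"; // step 2: unknown country
        } else if (code.equals("CS")) {
            code = "RS"; // step 3: Serbia and Montenegro -> Serbia
        }
        return m.group(1) + "," + code; // step 1: toIP column dropped
    }
}
```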

==== Code changes ====

The base85 code is left in the source as well as the file reader for the old format.

  • src/freenet/clients/http/geoip/IPConverter.java
    -- Added a zip reader to save space.
    -- The ArrayList is allocated with 180000 slots so that it does not resize many times (this does not matter for speed anyway).
    -- Ignore empty lines and lines that start with anything but a number (comments).
    -- Cast the Long value to (int), exactly like the old code did.
    -- Get the country, identical to the old code.
    -- Reverse the List, because the binary search expects the list to be in descending order. Takes <10 ms.
    -- Convert the List<Integer>/List<Short> to int[]/short[] to save lots of memory. See below for explanation. Takes <10 ms.
    -- Catch all possible errors.
    -- I did not feel confident in messing with the binary search because I might overlook some edge case where indexes would no longer match, so I left it alone. Reversing both arrays takes less than 10 ms combined.

  • src/freenet/node/NodeFile.java
    -- Changed default location from 'IpToCountry.dat' to 'IpToCountry.zip'.
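
The loading steps listed for IPConverter.java could look roughly like the sketch below. It is not the code from this pull request: ZipGeoIpLoader, load and packCountry are hypothetical names, and packing the two letters of the country code into a short is an assumption about the encoding made only for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipInputStream;

final class ZipGeoIpLoader {
    final int[] rangeStarts;  // descending, as the existing search expects
    final short[] countries;  // parallel to rangeStarts

    ZipGeoIpLoader(int[] rangeStarts, short[] countries) {
        this.rangeStarts = rangeStarts;
        this.countries = countries;
    }

    /** Assumed encoding: two ASCII letters packed into one short. */
    static short packCountry(String code) {
        return (short) ((code.charAt(0) << 8) | code.charAt(1));
    }

    /** Reads the first entry of the zip and builds the primitive arrays. */
    static ZipGeoIpLoader load(InputStream zipStream) throws IOException {
        List<Integer> starts = new ArrayList<>(180000);
        List<Short> codes = new ArrayList<>(180000);
        try (ZipInputStream zin = new ZipInputStream(zipStream)) {
            if (zin.getNextEntry() == null) {
                throw new IOException("zip file contains no entry");
            }
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(zin, StandardCharsets.UTF_8));
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                // Skip empty lines and comments (no leading digit).
                if (line.isEmpty() || line.charAt(0) < '0' || line.charAt(0) > '9') {
                    continue;
                }
                int comma = line.indexOf(',');
                if (comma < 0 || line.length() < comma + 3) {
                    continue; // malformed: no two-letter code after the comma
                }
                long ip;
                try {
                    ip = Long.parseLong(line.substring(0, comma));
                } catch (NumberFormatException e) {
                    continue;
                }
                starts.add((int) ip); // same narrowing cast as described above
                codes.add(packCountry(line.substring(comma + 1)));
            }
        }
        // Copy into primitive arrays in reverse (descending) order.
        int n = starts.size();
        int[] s = new int[n];
        short[] c = new short[n];
        for (int i = 0; i < n; i++) {
            s[i] = starts.get(n - 1 - i);
            c[i] = codes.get(n - 1 - i);
        }
        return new ZipGeoIpLoader(s, c);
    }
}
```

Reversing while copying into the primitive arrays folds the "reverse" and "convert" steps into one pass; whether that matters at <10 ms per step is doubtful.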

Memory from heap dump according to VisualVM:
List<Integer> vs int[]: 3.3 MiB vs 660 KiB
List<Short> vs short[]: 2.0 MiB vs 330 KiB

==== Further changes (aka 'more stuff to do for Arne' :) ) ====

https://github.com/freenet/scripts#releasing-stable-freenet-builds
The FAQ link has to be removed as the old IP DB site is no longer used.

/scripts/setup-release-environment
Has to be adjusted. Arne, how did it work in the past few years, given that the website has been offline for a while?

The new zip file has to be added to the insert/release script.

import java.util.Map;
import java.util.zip.ZipException;
These are needed despite no other changes to this file?


It appears Arne forgot to include the other changes to IPConverter, which the description explains as if they were already done. I figure he intends to come back and push the remaining changes later.

@ArneBab Are these changes incomplete? Is this a draft?

ArneBab force-pushed the use-new-iptocountry-source-zip branch from 38cd230 to 89fdffe on September 9, 2023 03:12
ArneBab force-pushed the use-new-iptocountry-source-zip branch from 89fdffe to 8465bc5 on April 26, 2024 21:48
3 participants