From 38cd2306d8e02652a2d65f2080d11d760c6ead47 Mon Sep 17 00:00:00 2001
From: naejadu
Date: Sun, 27 Nov 2022 15:55:39 +0100
Subject: [PATCH] Use the Tor DB as GeoIP database

a) The DB file size is bigger. The current IpToCountry.dat is 1.2 MiB,
   the Tor DB is 4 MiB, and the optimized Tor DB is 2 MiB. "Optimized"
   means I used 'sed' to remove an unnecessary column from the DB. If
   you really want to go for size, you can zip it to ~700 KiB. This
   increases runtime a bit, but it is still ~15x faster than the old
   method.
b) Tor uses "??" instead of "ZZ" for unknown codes, and still uses "CS",
   which stands for Serbia and Montenegro, a country that stopped
   existing in 2006. Maybe someone could ask them why their DB uses
   "CS"... This can be solved easily by replacing both codes with 'sed'.
c) The new DB is sorted in ascending order, which means that the binary
   search function has to be changed. Right now I simply reverse the
   array; changing the search instead would save another ~6 ms, but I
   don't know how to do this.

Human-readable text file in a zip. Advantages:
- It's human readable
- It's easy to update because we can use Tor geoip
- It's a lot faster than the base85 approach
- It has a smaller file size (when zipped)

==== New zip file ====
The code no longer uses the IpToCountry.dat file and instead uses a zip
file called IpToCountry.zip. This zip file is expected to contain
exactly one text file in the Tor geoip format according to the spec
below. The zip file should be compressed to save space (~2 MiB
uncompressed -> 0.7 MiB compressed).

==== New IpToCountry.txt file ====
Format for each line:
fromIP,countryCode
Example:
16781312,JP
This is like the old format, but not base85-encoded.
Empty lines are allowed.
Comments may start with any symbol other than a number.
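
For illustration only, here is a minimal sketch of how lines in this
format could be parsed. It is not the code in this patch: the class
name GeoIpLineParser, the Entry type and the decision to skip malformed
lines are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of this patch: parses lines in the
// "fromIP,countryCode" text format described above.
final class GeoIpLineParser {

    /** One parsed range start: the IP as a signed int plus the country code. */
    static final class Entry {
        final int ip;
        final String country;

        Entry(int ip, String country) {
            this.ip = ip;
            this.country = country;
        }
    }

    static List<Entry> parse(List<String> lines) {
        List<Entry> entries = new ArrayList<>();
        for (String line : lines) {
            // Skip empty lines and comments (anything not starting with 0-9).
            if (line.isEmpty() || line.charAt(0) < '0' || line.charAt(0) > '9') {
                continue;
            }
            int comma = line.indexOf(',');
            if (comma < 0) {
                continue; // assumption: malformed lines are silently ignored
            }
            try {
                long ip = Long.parseLong(line.substring(0, comma));
                String country = line.substring(comma + 1).trim();
                // Cast to int as described in this message; values above
                // Integer.MAX_VALUE wrap around to negative numbers.
                entries.add(new Entry((int) ip, country));
            } catch (NumberFormatException e) {
                // assumption: a non-numeric IP column means "skip this line"
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        List<Entry> entries = parse(Arrays.asList(
                "# comment", "", "16781312,JP", "3758096384,ZZ"));
        for (Entry e : entries) {
            System.out.println(e.ip + " " + e.country);
        }
    }
}
```

Note that because of the (int) cast, addresses from 128.0.0.0 upwards
end up as negative ints, so any later comparison has to treat the
values as unsigned.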
----------------------------------
Get the raw .txt file here:
https://github.com/torproject/tor/raw/main/src/config/geoip

The file has to be processed with the following three 'sed' commands:

sed -E -i 's/([0-9]*),[0-9]*,([A-Z]*)/\1,\2/g' IpToCountry.txt &&
sed -E -i 's/,\?\?/,ZZ/g' IpToCountry.txt &&
sed -E -i 's/,CS/,RS/g' IpToCountry.txt

1) Remove the middle column, because the Tor geoip format is
   fromIP,toIP,countryCode. Freenet does not need the toIP value; the
   binary search algorithm takes care of this.
2) Replace '??' with 'ZZ' for unknown countries, because '??' is not in
   the ISO 3166 standard.
3) Replace 'CS' with 'RS' because the country 'CS' is not in the
   ISO 3166 standard.

Zip this text file into IpToCountry.zip and place it in the main Freenet
folder.

==== Code changes ====
The base85 code is left in the source as well as the file reader for the
old format.

- src/freenet/clients/http/geoip/IPConverter.java
-- Use a zip reader to save space.
-- The ArrayList is allocated with 180000 slots so that it does not have
   to resize as often (this does not matter for speed, though).
-- Ignore empty lines and lines that start with anything but a number
   (comments).
-- Cast the Long value to (int), exactly like the old code did.
-- Get country, identical to old code.
-- Reverse the List, because the binary search expects the list to be in
   descending order. Takes <10 ms.
-- Convert the List to int[]/short[] to save lots of memory. See below
   for explanation. Takes <10 ms.
-- Catch all possible errors.
-- I did not feel confident in messing with the binary search because I
   might overlook some edge case where indexes would no longer match, so
   I left it alone. Reversing both arrays takes less than 10 ms combined.

- src/freenet/node/NodeFile.java
-- Changed default location from 'IpToCountry.dat' to 'IpToCountry.zip'.
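
Regarding the reversal: as a starting point for anyone who wants to drop
it later, here is a rough sketch of a binary search that works directly
on the ascending data. It is not the existing search in IPConverter and
not a drop-in replacement. The class AscendingGeoIpLookup, the use of
String[] instead of short[] for country codes, and the sample ranges
are assumptions made for the example. It uses Integer.compareUnsigned
because the (int) cast turns high addresses into negative values.

```java
// Hypothetical sketch, not part of this patch: looks up a country in
// range starts that are sorted ascending (as in the Tor geoip file).
final class AscendingGeoIpLookup {

    /**
     * Returns the index of the last range start that is <= ip, or -1 if
     * ip lies before the first range. The starts array must be sorted
     * ascending when its values are compared as unsigned 32-bit numbers.
     */
    static int floorIndex(int[] starts, int ip) {
        int lo = 0;
        int hi = starts.length - 1;
        int result = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (Integer.compareUnsigned(starts[mid], ip) <= 0) {
                result = mid;  // candidate; look for a later start
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return result;
    }

    /** Returns the country code for ip, or "ZZ" if no range matches. */
    static String lookup(int[] starts, String[] countries, int ip) {
        int i = floorIndex(starts, ip);
        return i < 0 ? "ZZ" : countries[i];
    }

    public static void main(String[] args) {
        // Made-up sample data in the fromIP,countryCode layout.
        int[] starts = {16777216, 16781312, (int) 3758096384L};
        String[] countries = {"AU", "JP", "ZZ"};
        System.out.println(lookup(starts, countries, 16781313));
        System.out.println(lookup(starts, countries, (int) 4000000000L));
    }
}
```

Whether this really matches the old behavior at the edges (first and
last range, addresses above 127.255.255.255) would need to be checked
against the existing lookup before replacing it.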

Memory from heap dump according to VisualVM:
List vs int[]: 3.3 MiB vs 660 KiB
List vs short[]: 2.0 MiB vs 330 KiB

==== Further changes (aka 'more stuff to do for Arne' :) ) ====
https://github.com/freenet/scripts#releasing-stable-freenet-builds
The FAQ link has to be removed as the old IP DB site is no longer used.

/scripts/setup-release-environment
Has to be adjusted. How did it work in the past few years Arne, because
the website has been offline for a while?

The new zip file has to be added to the insert/release script.
---
 src/freenet/clients/http/geoip/IPConverter.java | 11 +++++++++--
 src/freenet/node/NodeFile.java                  |  2 +-
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/src/freenet/clients/http/geoip/IPConverter.java b/src/freenet/clients/http/geoip/IPConverter.java
index f0d4861027c..63463f15956 100644
--- a/src/freenet/clients/http/geoip/IPConverter.java
+++ b/src/freenet/clients/http/geoip/IPConverter.java
@@ -1,15 +1,22 @@
 package freenet.clients.http.geoip;
 
+import java.io.BufferedReader;
 import java.io.File;
-import java.io.FileNotFoundException;
 import java.io.IOException;
-import java.io.RandomAccessFile;
+import java.io.InputStream;
+import java.io.InputStreamReader;
 import java.lang.ref.SoftReference;
 import java.lang.ref.WeakReference;
+import java.nio.file.NoSuchFileException;
+import java.util.ArrayList;
 import java.util.Arrays;
+import java.util.Collections;
 import java.util.HashMap;
 import java.util.LinkedHashMap;
+import java.util.List;
 import java.util.Map;
+import java.util.zip.ZipException;
+import java.util.zip.ZipFile;
 
 import freenet.clients.http.StaticToadlet;
 import freenet.node.Node;
diff --git a/src/freenet/node/NodeFile.java b/src/freenet/node/NodeFile.java
index 8ec2be80fd0..f5df0267966 100644
--- a/src/freenet/node/NodeFile.java
+++ b/src/freenet/node/NodeFile.java
@@ -9,7 +9,7 @@ public enum NodeFile {
     Seednodes(InstallDirectory.Node, "seednodes.fref"),
     InstallerWindows(InstallDirectory.Run, "freenet-latest-installer-windows.exe"),
     InstallerNonWindows(InstallDirectory.Run, "freenet-latest-installer-nonwindows.jar"),
-    IPv4ToCountry(InstallDirectory.Run, "IpToCountry.dat");
+    IPv4ToCountry(InstallDirectory.Run, "IpToCountry.zip");
 
     private final InstallDirectory dir;
     private final String filename;