Potential Memory Leak in libpostal_parse_address #676

Open
BookGin opened this issue Nov 13, 2024 · 1 comment

Comments

BookGin commented Nov 13, 2024

Hi!

There seems to be a memory leak in libpostal_parse_address: memory usage keeps increasing when the same address is parsed repeatedly.


My country is

This issue is not specific to any country or address. I tried using other addresses or random strings, but the issue still remains.


Here's how I'm using libpostal

The program parses the example address 10M times and uses Linux pmap to record its memory usage every 100,000 iterations.

// gcc -o app app.c $(pkg-config --cflags --libs libpostal)
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <libpostal/libpostal.h>

int main(int argc, char **argv) {
    if (!libpostal_setup() || !libpostal_setup_parser()) {
        exit(EXIT_FAILURE);
    }

    libpostal_address_parser_options_t options = libpostal_get_address_parser_default_options();

    int count = 10000000;
    int batch = 100000;
    for (int i = 0; i < count; i++) {
        libpostal_address_parser_response_t *parsed = libpostal_parse_address("781 Franklin Ave Crown Heights Brooklyn NYC NY 11216 USA", options);
        libpostal_address_parser_response_destroy(parsed);
        if (i % batch == 0)
        {
          char command[256];
          sprintf(command, "pmap -x %d > %d.txt", getpid(), i / batch + 1);
          puts(command);
          system(command);
        }
    }

    libpostal_teardown();
    libpostal_teardown_parser();
}

Here's what I did

See above.


Here's what I got

The memory usage increases over time.

echo "File                     Kbytes   RSS    Dirty"; for i in {5..100..5}; do echo -n "$i.txt: " && cat $i.txt | grep total; done
File                     Kbytes   RSS    Dirty
5.txt: total kB         1942360 1924872 1921816
10.txt: total kB         2007900 1960788 1957732
15.txt: total kB         2007900 1980316 1977260
20.txt: total kB         2073436 1999848 1996792
25.txt: total kB         2073436 2019380 2016324
30.txt: total kB         2073436 2038912 2035856
35.txt: total kB         2204508 2058444 2055388
40.txt: total kB         2204508 2077972 2074916
45.txt: total kB         2204508 2097504 2094448
50.txt: total kB         2204508 2117036 2113980
55.txt: total kB         2204508 2136568 2133512
60.txt: total kB         2204508 2156100 2153044
65.txt: total kB         2204508 2175632 2172576
70.txt: total kB         2466652 2195160 2192104
75.txt: total kB         2466652 2214692 2211636
80.txt: total kB         2466652 2234224 2231168
85.txt: total kB         2466652 2253756 2250700
90.txt: total kB         2466652 2273288 2270232
95.txt: total kB         2466652 2292816 2289760
100.txt: total kB         2466652 2312348 2309292
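
As a rough reading of the table above, the steady-state RSS growth works out to about 40 bytes per call. This assumes pmap's kB are 1024-byte units and that consecutive rows are 500,000 parses apart (one file per 100,000 iterations, every fifth file shown). Using the 15.txt and 20.txt rows:

```latex
\frac{(1{,}999{,}848 - 1{,}980{,}316)\ \text{kB} \times 1024\ \text{B/kB}}{500{,}000\ \text{calls}}
= \frac{20{,}000{,}768\ \text{B}}{500{,}000\ \text{calls}}
\approx 40\ \text{B/call}
```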

I also ran 1M iterations under valgrind, but it does not report a memory leak.

valgrind ./app2
==3615986== Memcheck, a memory error detector
==3615986== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==3615986== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==3615986== Command: ./app2
==3615986==
==3615986== Warning: set address range perms: large range [0x2f85a040, 0x3fa385f0) (undefined)
==3615986== Warning: set address range perms: large range [0x3fa39040, 0x4fc175f0) (undefined)
==3615986== Warning: set address range perms: large range [0x3fa391ca, 0x4fc171ca) (defined)
==3615986== Warning: set address range perms: large range [0x3fa39028, 0x4fc17608) (noaccess)
==3615986== Warning: set address range perms: large range [0x6577c040, 0x82a05c8c) (undefined)
==3615986== Warning: set address range perms: large range [0x2f85a028, 0x3fa38608) (noaccess)
==3615986== Warning: set address range perms: large range [0x6577c028, 0x82a05ca4) (noaccess)
==3615986==
==3615986== HEAP SUMMARY:
==3615986==     in use at exit: 0 bytes in 0 blocks
==3615986==   total heap usage: 71,539,052 allocs, 71,539,052 frees, 7,820,286,857 bytes allocated
==3615986==
==3615986== All heap blocks were freed -- no leaks are possible
==3615986==
==3615986== For lists of detected and suppressed errors, rerun with: -s
==3615986== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Here's what I was expecting

The memory usage should not increase over time.


For parsing issues, please answer "yes" or "no" to all that apply.

This is not a parsing issue.


Here's what I think could be improved

See above.

More information:

  1. libpostal git version: 8f2066b1d30f4290adf59cacc429980f139b8545
  2. OS: Ubuntu 20.04.6 LTS 5.4.0-192-generic

albarrentine (Contributor) commented Nov 13, 2024

That's likely memory fragmentation. Each of the ten million parses does some mallocs and frees. Most of our stuff internally uses dynamic arrays, and even for arrays of strings we keep a data structure that packs a bunch of C strings into one contiguous array. But there might be a lot of small strings created by strdup for the purposes of the API call, since we wanted to keep a simple API for the higher-level bindings that didn't require knowing any fancier data structures.

If valgrind doesn't report a leak, it's unlikely there truly is one (false positives are more common than false negatives), especially not something that would leak 50+ bytes per call.
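
To make the contrast above concrete (one contiguous buffer holding many strings versus one strdup per string), here is a minimal sketch. It is not libpostal's actual implementation: the type `packed_strings_t` and its functions are hypothetical names invented for this illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical packed string array: all strings sit back to back in one
 * buffer, and offsets[i] records where string i starts. */
typedef struct {
    char *data;          /* concatenated NUL-terminated strings */
    size_t data_len;     /* bytes used in data */
    size_t data_cap;     /* bytes allocated for data */
    size_t *offsets;     /* start offset of each string */
    size_t count;        /* number of strings stored */
    size_t offsets_cap;  /* slots allocated for offsets */
} packed_strings_t;

/* Append a copy of s. Returns 1 on success, 0 on allocation failure. */
int packed_strings_add(packed_strings_t *ps, const char *s) {
    assert(ps != NULL && s != NULL);
    size_t n = strlen(s) + 1;
    if (ps->data_len + n > ps->data_cap) {
        size_t cap = ps->data_cap ? ps->data_cap * 2 : 64;
        while (cap < ps->data_len + n) cap *= 2;
        char *p = realloc(ps->data, cap);
        if (p == NULL) return 0;
        ps->data = p;
        ps->data_cap = cap;
    }
    if (ps->count == ps->offsets_cap) {
        size_t cap = ps->offsets_cap ? ps->offsets_cap * 2 : 8;
        size_t *o = realloc(ps->offsets, cap * sizeof *o);
        if (o == NULL) return 0;
        ps->offsets = o;
        ps->offsets_cap = cap;
    }
    memcpy(ps->data + ps->data_len, s, n);
    ps->offsets[ps->count++] = ps->data_len;
    ps->data_len += n;
    return 1;
}

/* Return string i, or NULL if i is out of range. */
const char *packed_strings_get(const packed_strings_t *ps, size_t i) {
    assert(ps != NULL);
    return i < ps->count ? ps->data + ps->offsets[i] : NULL;
}

/* Forget the contents but keep both buffers for the next call. */
void packed_strings_clear(packed_strings_t *ps) {
    assert(ps != NULL);
    ps->data_len = 0;
    ps->count = 0;
}

/* Release both buffers and reset the struct to its empty state. */
void packed_strings_free(packed_strings_t *ps) {
    assert(ps != NULL);
    free(ps->data);
    free(ps->offsets);
    memset(ps, 0, sizeof *ps);
}
```

With this layout, storing N strings takes a handful of reallocs, and the two buffers can be reused across calls via `packed_strings_clear`. Handing each string to the caller as its own `char *` instead needs N separate small allocations per call, which is the kind of traffic that can fragment the heap.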

You might check something like mtrace or malloc_stats, and/or you could probably duplicate the pattern by simply doing some strtok/strdup/free combo 10 million times on the input string without using libpostal at all. Reducing the memory growth is maybe possible with some kind of memory pooling or small-string caching a la what Python does, but the API contract is that you get some char *s that you have to free, so at some point that ropes in the system malloc. If you can confirm there's something leaking memory in libpostal itself, I'm happy to fix it, but I'm not sure there's much to be done there otherwise.
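
The strtok/strdup/free experiment suggested above could look roughly like the sketch below. It assumes Linux (it reads `/proc/self/statm`), and the function names `tokenize_and_free`, `rss_kb`, and `run_pattern` are made up for this example. It only approximates the allocation pattern of a parse call, so whether it reproduces the same growth depends on the allocator and on how closely the sizes match.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define MAX_TOKENS 64

/* Hypothetical stand-in for one parse call: split a copy of the input on
 * spaces, strdup each token (as an API returning char * might), then free
 * everything. Returns the token count, or -1 on allocation failure. */
int tokenize_and_free(const char *input) {
    assert(input != NULL);
    char *copy = strdup(input);
    if (copy == NULL) return -1;

    char *tokens[MAX_TOKENS];
    int n = 0;
    int ok = 1;
    for (char *tok = strtok(copy, " "); tok != NULL && n < MAX_TOKENS;
         tok = strtok(NULL, " ")) {
        tokens[n] = strdup(tok);
        if (tokens[n] == NULL) { ok = 0; break; }
        n++;
    }
    for (int i = 0; i < n; i++) free(tokens[i]);
    free(copy);
    return ok ? n : -1;
}

/* Resident set size in kB from /proc/self/statm (Linux only); -1 on error. */
long rss_kb(void) {
    FILE *f = fopen("/proc/self/statm", "r");
    if (f == NULL) return -1;
    long size_pages = 0, resident_pages = 0;
    int got = fscanf(f, "%ld %ld", &size_pages, &resident_pages);
    fclose(f);
    if (got != 2) return -1;
    return resident_pages * (sysconf(_SC_PAGESIZE) / 1024);
}

/* Repeat the pattern, printing RSS every `batch` calls (batch <= 0 disables
 * printing). Returns the total number of tokens seen, or -1 on failure. */
long run_pattern(const char *input, long iterations, long batch) {
    long total_tokens = 0;
    for (long i = 0; i < iterations; i++) {
        int n = tokenize_and_free(input);
        if (n < 0) return -1;
        total_tokens += n;
        if (batch > 0 && i % batch == 0)
            printf("%ld\t%ld kB\n", i, rss_kb());
    }
    return total_tokens;
}
```

To mirror the original test, call `run_pattern("781 Franklin Ave Crown Heights Brooklyn NYC NY 11216 USA", 10000000, 100000)` from `main` and compare the printed RSS values with the pmap totals. On glibc, calling `malloc_stats()` from `<malloc.h>` at the same points prints the allocator's own view of arena size versus bytes in use to stderr.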
