
Convert GPX/KML to geojson #61

Merged: 26 commits merged into master from togeojson on May 4, 2016

Conversation

@GretaCB (Contributor) commented Mar 7, 2016

Per #19

  • Convert each layer of the KML/GPX into its own geojson file (max 15 layers; a minimal layer-enumeration sketch follows this list)
  • Add an index file for each new file/layer > 10MB
  • Validate layer names
  • Include a metadata file for the original KML/GPX (run through mapnik-omnivore)
  • Drop GPX track_points. Found that these are automatically inserted by gdal (most of the time). @BergWerkGIS did we decide whether to allow track_points produced by an actual GPS device?
  • Tests (anything we're not testing that we should?)
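For illustration, a minimal sketch of the per-layer enumeration with node-gdal; the input path, variable names, and the 15-layer cap check are illustrative, not the PR's actual implementation:

var gdal = require('gdal');

// Sketch only: list the layers gdal sees in a GPX/KML file, skipping the
// track_points layer that gdal auto-generates, capped at 15 layers.
var ds = gdal.open('input.gpx');                    // placeholder input path
var layerCount = Math.min(ds.layers.count(), 15);   // max 15 layers, per the list above

for (var i = 0; i < layerCount; i++) {
  var layer = ds.layers.get(i);
  if (layer.name === 'track_points') continue;      // dropped, see bullet above
  console.log(layer.name + ': ' + layer.features.count() + ' features');
  // ...convert this layer into its own geojson file here...
}

ds.close();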

@GretaCB (Contributor, Author) commented Mar 14, 2016

Per https://github.com/mapbox/unpacker/issues/907, this PR just needs to run mapnik-omnivore and add the output as a JSON file to the bundle (for both the KML/GPX preprocessor and the geojson-index preprocessor). Happy to jump on this tomorrow.
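For context, a minimal sketch of that step, assuming mapnik-omnivore's digest() API; the writeMetadata name, the infile/outdir parameters, and the metadata.json filename are illustrative:

var fs = require('fs');
var path = require('path');
var mapnikOmnivore = require('mapnik-omnivore');

// Sketch only: run mapnik-omnivore on the original KML/GPX and write the
// resulting metadata into the bundle as a JSON file.
function writeMetadata(infile, outdir, callback) {
  mapnikOmnivore.digest(infile, function(err, metadata) {
    if (err) return callback(err);
    fs.writeFile(path.join(outdir, 'metadata.json'), JSON.stringify(metadata), callback);
  });
}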

@GretaCB (Contributor, Author) commented Mar 18, 2016

The togeojson KML tests fail 50% of the time locally with the following error:

not ok 9 .end() called twice
  ---
    operator: fail
  ...

/Users/mapbox/dev/preprocessorcerer/node_modules/queue-async/build/queue.js:41
        if (!tasks[i]) throw new Error; // detect multiple callbacks

Going to split each fixture into its own tape test as a first step. I wonder if a previous test in the for loop is erroring and an assert.end() is lingering.
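A minimal sketch of that split, assuming tape and a hypothetical convert() helper (fixture names are illustrative):

var test = require('tape');
var path = require('path');

// One tape test per fixture, so a failure in one fixture cannot leave a
// lingering assert.end() that bleeds into the next.
// convert() is a hypothetical stand-in for the preprocessor under test.
['fixture-a.kml', 'fixture-b.kml'].forEach(function(fixture) {
  test('togeojson converts ' + fixture, function(assert) {
    convert(path.join(__dirname, 'fixtures', fixture), function(err) {
      assert.ifError(err, 'no error');
      assert.end();   // called exactly once per test
    });
  });
});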

@GretaCB (Contributor, Author) commented Mar 18, 2016

Looks like the original error that was causing the .end() callback error is:

max tree depth:8
split ratio:0.55
processing '/var/folders/2p/m7jcvfp90wngshgr9j0cd9tc0000gn/T/f32881494358414c/empty-features-only.geojson' as GeoJSON
Mapnik LOG> 2016-03-18 14:39:26: Error loading mapped memory file: '/var/folders/2p/m7jcvfp90wngshgr9j0cd9tc0000gn/T/f32881494358414c/empty-features-only.geojson' (Invalid argument)
Error : cannot memory map /var/folders/2p/m7jcvfp90wngshgr9j0cd9tc0000gn/T/f32881494358414c/empty-features-only.geojson
Error: failed to process /var/folders/2p/m7jcvfp90wngshgr9j0cd9tc0000gn/T/f32881494358414c/empty-features-only.geojson
done!

I console.logged the above error from within the indexing logic.

I suspect mapnik doesn't like some of the geojson files we're feeding it for indexing, or there's some kind of race condition when indexing each layer.

cc @artemp @springmeyer

@GretaCB (Contributor, Author) commented Mar 18, 2016

Per chat with @springmeyer, it looks like the indexing error points to something else going on in the converted geojson layers.

I took a look at the converted geojson layers for the failing tests, and it looks as if they are producing incomplete objects (invalid JSON).

Note: not all resulting geojson layers are invalid.

Going to keep 👀

@wilhelmberg (Contributor) commented

One step further with debugging the failing test.

I now have proof that something is still holding on to the tmpfile when it should be deleted:

[screenshot: four open references to the tmpfile dataset]

Unfortunately it's a pain to track down: running the test from the console, the exception occurs every time.
Slowly stepping through the test in a debug session works.
Jumping between different breakpoints in a debug session fails.
I've already tried different things (without success):

  • delay before delete
  • nulling src in wmtiff
  • ...

@GretaCB (Contributor, Author) commented Mar 21, 2016

I think we have here a good ole fashioned control flow problem (when I remove createIndex locally, the tests no longer fail). It looks like createIndex can trigger before the forEach loops are finished, because createIndex is called outside of the forEach loops and does not rely on any callback or other signal to determine when the layers are actually ready to be indexed.

I like your idea from yesterday that we should index only after all layers are converted. We can also refactor the main function a bit to better manage flow and modularize it.

Found a great example of how we can try untangling some of this:

var fs = require('fs');
var path = require('path');

// Read every file in a directory; call next(results) once all reads finish.
function read_directory(dir, next) {
  fs.readdir(dir, function (err, files) {
    var count = files.length,
        results = {};
    files.forEach(function (filename) {
      fs.readFile(path.join(dir, filename), function (err, data) {
        results[filename] = data;
        count--;
        if (count <= 0) {
          next(results);
        }
      });
    });
  });
}

// Read several directories in parallel; call next(data) once every one is done.
function read_directories(paths, next) {
  var count = paths.length,
      data = {};
  paths.forEach(function (dir) {
    read_directory(dir, function (results) {
      data[dir] = results;
      count--;
      if (count <= 0) {
        next(data);
      }
    });
  });
}

read_directories(['articles', 'authors', 'skin'], function (data) {
  // Do something with the collected file contents
});

wilhelmberg added a commit to mapbox/node-wmtiff that referenced this pull request Mar 22, 2016
@wilhelmberg (Contributor) commented

Nothing wrong with the control flow; gdal just has to be disposed of manually to release the file reference sooner:

The 4 references to the dataset in the screenshot above came from

  • wmtiff
  • tif-reproject.preprocessor.js
  • tif-toBytes.preprocessor.js
  • preprocessor.criteria

as each of them opens the dataset (maybe room for improvement?) and ds is still holding on to the file (despite ds.close()).
Maybe because everything is done in callbacks and node hasn't yet let go of the objects.
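A minimal sketch of the disposal pattern with node-gdal (tmpfile is an illustrative path variable):

var fs = require('fs');
var gdal = require('gdal');

// Sketch: explicitly release the native handle before deleting the file,
// instead of waiting for node's garbage collector to do it.
var ds = gdal.open(tmpfile);   // tmpfile: illustrative path
// ...read whatever is needed from ds...
ds.close();                    // dispose the native dataset
ds = null;                     // drop the JS reference as well
fs.unlinkSync(tmpfile);        // delete no longer fails with EBUSY on Windows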


Working on the next failing test now:

# [parts mbtiles] correct number of parts for mbtiles without grids table
ok 40 no error
ok 41 four parts

fs.js:772
  return binding.unlink(pathModule._makeLong(path));
                 ^
Error: EBUSY, resource busy or locked 'C:\Users\ADMINI~1\AppData\Local\Temp\2\79c7b0a1613d996b'
    at Object.fs.unlinkSync (fs.js:772:18)
    at c:\mb\preprocessorcerer\test\mbtiles-byTiles.test.js:17:10
    at Statement.<anonymous> (c:\mb\preprocessorcerer\parts\mbtiles-byTiles.js:25:9)
npm ERR! Test failed.  See above for more details.

@wilhelmberg (Contributor) commented

Same with the failing mbtiles test: resources are not properly disposed. In this case the connection to the database is not closed.
Fixed: abf63bc
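A minimal sketch of the idea, assuming the test talks to the mbtiles file via node-sqlite3 (tmpfile and the query are illustrative):

var fs = require('fs');
var sqlite3 = require('sqlite3');

// Sketch: close the database connection before unlinking the file; with the
// connection still open, Windows reports EBUSY on the unlink.
var db = new sqlite3.Database(tmpfile);   // tmpfile: illustrative path
db.get('SELECT count(1) AS count FROM tiles', function(err, row) {
  db.close(function(closeErr) {
    if (!closeErr) fs.unlinkSync(tmpfile);
  });
});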


Now all tests run through again, but 8 are failing.
Continuing the test quest ... 😏

@wilhelmberg (Contributor) commented

> I think we have here a good ole fashioned control flow problem

Yes, you are right. This is a control flow problem.

I didn't realize that you were talking about the GPX test while I was talking about the tiff test.

Here createIndex is called async, but at the same time the control flow continues to here.

@wilhelmberg (Contributor) commented

That's also why .end() is called twice: once from createIndex() and once from the exported function itself.

not ok 4 expected error message
  ---
    operator: equal
    expected: 'GPX does not contain any valid features.'
    actual:   undefined
  ...
not ok 5 .end() called twice
  ---
    operator: fail
  ...
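A minimal sketch of one way to straighten the flow with queue-async (already in the dependency tree per the stack trace above); processLayers, convertLayer, and createIndex usage here are illustrative:

var queue = require('queue-async');

// Sketch: convert every layer first, then index, and let the queue invoke the
// final callback exactly once, so assert.end() cannot fire twice.
function processLayers(layers, callback) {
  var q = queue();
  layers.forEach(function(layer) {
    q.defer(convertLayer, layer);            // convertLayer: hypothetical async task
  });
  q.awaitAll(function(err, convertedFiles) {
    if (err) return callback(err);
    createIndex(convertedFiles, callback);   // index only after all conversions finish
  });
}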

@coveralls commented

Coverage decreased (-1.5%) to 95.238% when pulling c9ca030 on togeojson into 455fc99 on master.

@wilhelmberg (Contributor) commented

npm test works for me now - no failed tests.
But I have to use mapbox/node-wmtiff@6c80973.

@GretaCB (Contributor, Author) commented Mar 23, 2016

Node v0.10.40 worked fine on OSX as well. I took a look around preprocessorcerer and did a little research, but didn't find much that explains the immortal tmpfile. I did find this:

> Datasets get closed automatically when node garbage collects them, but there is no guarantee that node will garbage collect before exiting.

This led me down a Google search of node GC, but I'm not sure that's the proper route at this point. Will keep pondering.

@wilhelmberg (Contributor) commented

> I did find this

Great find. Thanks.
This was exactly my hunch:
"Making sure the dataset is closed and flushed before the process exits fixes the problem for me. Datasets get closed automatically when node garbage collects them, but there is no guarantee that node will garbage collect before exiting."

but there is no guarantee that node will garbage collect before exiting, or in our case, before we try to delete the dataset.

> but didn't find much that explains the immortal tmpfile

The explanation above was the reason for not being able to delete the tmpfile, as node had not yet done its garbage collection.

Sorry if I caused any confusion, but this should be fixed now and the file should not be immortal anymore.

The ds.close() and ds = null calls I've put into the code at various places should take care of disposing the objects (references to the file).

> This led me down a Google search of node GC, but I'm not sure that's the proper route at this point. Will keep pondering.

Not necessary to spend more time on this.
I've literally spent weeks on node GC when investigating strange node-mapnik crashes some time ago.

There are GC parameters you can pass to the node executable to fiddle with the internals of its GC,
e.g. --expose-gc, or others to change memory usage before GC kicks in, or to prevent GC altogether.
But none of them really helped and we decided to stick with the defaults.


The takeaway here is: always dispose of resources properly as soon as you are done with them.

Be it a file (fs.close()), a database connection (db.close()) or another native wrapper (like gdal).

@springmeyer (Contributor) commented

> The takeaway here is: always dispose of resources properly as soon as you are done with them.

Awesome work. Also agree, @BergWerkGIS. This is a pretty classic case of Windows behavior being more strict, which helps highlight core issues in code that should be written more robustly. 💯

> The ds.close() and ds = null calls I've put into the code at various places should take care of disposing the objects (references to the file).

That is neat/great that setting to null triggers the actual close (I presume because it triggers the garbage collection sooner). I'm left however thinking that it is probably worth bringing this up again with Brian and Brandon. I'm sure the current close() behavior of not releasing until garbage collection makes sense given the design and object inheritance in node-gdal. But nevertheless it seems like odd behavior. Perhaps they would reconsider having ds.close() actually close the underlying gdal file handle directly or perhaps adding a ds.closeNow() method to force it. I've not looked at the code enough to know if this is feasible, but I think it would be great @BergWerkGIS if you would raise this for them in a ticket to flesh this out further.

@wilhelmberg (Contributor) commented

@springmeyer Already looked at the code, and I think this is purely on node's side (and our implementation); there is nothing else gdal can do about it (although I'm not familiar with node's GC internals).

ds.close() calls dispose() to null the internal resources:

    if (this_dataset) {
        LOG("Disposing Dataset [%p]", this_dataset);
        ptr_manager.dispose(uid);
        LOG("Disposed Dataset [%p]", this_dataset);
        this_dataset = NULL;
    }

Pseudo code of our implementation looks like this:

function one()     { var ds = gdal.open(file); return two(ds); }
function two(ds)   { return three(ds); }
function three(ds) { return four(ds); }
function four(ds)  { fs.unlink(file); }  // delete the file backing ds

And with this implementation it absolutely makes sense to me that node hasn't fully released ds yet, as in method four() we are essentially still inside method one().
Or am I missing something about node/js callbacks?

@coveralls commented

Coverage decreased (-1.7%) to 95.111% when pulling f781e3c on togeojson into 455fc99 on master.

@GretaCB (Contributor, Author) commented Mar 24, 2016

UPDATE on funky test failure

Per chat with Jake, node-mapnik no longer supports builds for node 0.12

@coveralls commented

Coverage decreased (-1.7%) to 95.111% when pulling cd5474e on togeojson into 455fc99 on master.

@coveralls commented

Coverage decreased (-1.7%) to 95.111% when pulling a61209a on togeojson into 455fc99 on master.

@coveralls commented

Coverage decreased (-1.7%) to 95.111% when pulling d6af91f on togeojson into 455fc99 on master.

@wilhelmberg (Contributor) commented

Cannot reproduce the unlink problem locally; Travis also didn't have a problem on the last run:
https://travis-ci.org/mapbox/preprocessorcerer/builds/118480437


The highlighted files are the ones left over after npm test.
Will take a look at where they are coming from.

[screenshot: leftover files after npm test]

@coveralls commented

Coverage decreased (-1.7%) to 95.111% when pulling 2a20b12 on togeojson into 455fc99 on master.

@coveralls commented

Coverage decreased (-1.7%) to 95.133% when pulling 338d5a2 on togeojson into 455fc99 on master.

@GretaCB (Contributor, Author) commented Apr 11, 2016

Hmmm, looking into why tests are still failing here, so we can get this ready to rock.

@coveralls commented

Coverage decreased (-1.6%) to 95.143% when pulling 5aad818 on togeojson into 455fc99 on master.

@coveralls commented

Coverage decreased (-1.6%) to 95.143% when pulling 4cc9d80 on togeojson into 059ebf9 on master.

@@ -34,6 +34,7 @@ module.exports = function(infile, outdir, callback) {
    .on('exit', function() {
      // If error printed to --validate-features log
      if (data.indexOf('Error') != -1) {
        console.log(data);
Contributor:

debug logging?

Contributor Author:

👍 thank you

Contributor:

@GretaCB what's the plan of attack?
This has changed here (https://github.com/mapbox/preprocessorcerer/compare/stderr-fix#diff-39694ce4a586caad3e6ad813c295b454R32) too, as the latest mapnik-index now returns an error code if --validate-features fails.
Do we need to wait to update to the latest node-mapnik?
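For reference, a minimal sketch of relying on the exit code rather than scraping the log; mapnikIndexPath, infile, and callback are illustrative names, and the exact mapnik-index argument order may differ:

var spawn = require('child_process').spawn;

// Sketch: let mapnik-index's exit code under --validate-features decide,
// instead of searching the captured output for the string 'Error'.
var child = spawn(mapnikIndexPath, ['--validate-features', infile]);
child.on('exit', function(code) {
  if (code !== 0) return callback(new Error('mapnik-index failed to process ' + infile));
  callback();
});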

Contributor Author:

Plan of attack re: mapbox/mapnik-swoop#14
cc @mapsam

Contributor:

Working through a node-mapnik dev build so we can test this right now.

@coveralls commented

Coverage decreased (-1.7%) to 95.133% when pulling 18c3a26 on togeojson into 059ebf9 on master.


@GretaCB merged commit 4f8d380 into master May 4, 2016
@GretaCB deleted the togeojson branch May 4, 2016 21:42