
Parquet support #10

Draft · wants to merge 5 commits into base: main
Conversation

vkhodygo
Member

@vkhodygo vkhodygo commented Oct 2, 2023

This should close #5.

I'd like to know what you think about this.

@vkhodygo vkhodygo added the enhancement New feature or request label Oct 2, 2023
@andrewbaxter439
Member

This looks great! I'll aim to test it out ASAP, and it can likely be pulled. In previous attempts at converting to parquet, the three problems I faced were:

  • Integer columns containing "null" values being wrongly coded as "character"
  • Specifying the schema for such columns as integer causing problems with the "null" character values
  • Trailing commas being interpreted as an empty column

As far as I can see your code seems to address all the above.

@andrewbaxter439
Member

andrewbaxter439 commented Oct 4, 2023

Suggestions:

f.base <- scenario_id |>
  paste("output", sep = "/") |>
  paste(path_string, sep = "/") |>
  paste("csv", sep = "/")

  • replace with file.path(scenario_id, "output", path_string, "csv") as a safer, system-agnostic, path specification?
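
A minimal sketch of the difference (the scenario_id and path_string values here are hypothetical):

```r
# Hypothetical inputs, for illustration only
scenario_id <- "baseline"
path_string <- "run_001"

# Current approach: hard-codes "/" as the separator at every step
f.base <- scenario_id |>
  paste("output", sep = "/") |>
  paste(path_string, sep = "/") |>
  paste("csv", sep = "/")

# Suggested: file.path() builds the same path in one call
f.base2 <- file.path(scenario_id, "output", path_string, "csv")

identical(f.base, f.base2)  # both yield "baseline/output/run_001/csv"
```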

paths_pattern = "^2023\\d+\\_([1-9]|[1234][0-9]|50)00$") {

  • This will stop working in 2024!
  • A less specific pattern, say "^20\\d+_\\d+$", will still catch all run directories (until we run it in the year 2100) without any danger of missing runs with higher seeds or seeds not ending in "00"
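
To illustrate, a quick check of both patterns against some made-up directory names:

```r
# Made-up run directory names, for illustration only
dirs <- c("20231002_100", "20231002_5000", "20240115_2100", "notes")

old_pattern <- "^2023\\d+\\_([1-9]|[1234][0-9]|50)00$"
new_pattern <- "^20\\d+_\\d+$"

grepl(old_pattern, dirs)  # the 2024 run no longer matches
grepl(new_pattern, dirs)  # all run directories match; "notes" is still excluded
```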

run = run + (batch_id - 1) * 20,

  • This makes sense, and if I'm following it correctly it will renumber the runs 1-1000 (rather than my old system of 100-119, 200-219, ..., 5000-5019, etc.)
  • A minor future-compatibility issue: if the batch size -b passed to do_full_run.sh ever changes from 20, this hard-coded numbering will break. Might be too niche a problem to worry about for now.
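
The renumbering can be sketched as a small helper (the batch size of 20 is assumed, matching the current -b):

```r
# Sketch: map (run-within-batch, batch_id) to a global run number,
# assuming a hard-coded batch size of 20
renumber_run <- function(run, batch_id, batch_size = 20) {
  run + (batch_id - 1) * batch_size
}

renumber_run(run = 1,  batch_id = 1)   # first run of the first batch -> 1
renumber_run(run = 20, batch_id = 50)  # last run of the last batch   -> 1000
```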

map(\(x) convert.single_simulation(x, sid), .progress = TRUE)

  • furrr::future_map might be useful here to parallelise? I haven't tried running this yet. My own implementations used the annoying read_csv_arrow, which loaded the whole dataset (but didn't need schemas); open_csv_arrow should, I presume, be faster/smoother, so parallelisation might not be possible/needed.
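
If parallelisation does turn out to help, a hedged sketch of the furrr variant (sim_dirs is a hypothetical vector of run directories; convert.single_simulation and sid are from the PR, untested here):

```r
library(furrr)

# Spawn a small pool of background R sessions
plan(multisession, workers = 4)

results <- future_map(
  sim_dirs,
  \(x) convert.single_simulation(x, sid),
  .progress = TRUE
)
```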

source("./schemas/schemas.R")

  • will check how this runs, but I would have thought it should be source("R/schemas/schemas.R") if running from the project root?

I've noted some of the above suggestions as tick-boxes. Check off any you don't think need to be actioned and we can discuss/patch the others as needed.

@vkhodygo
Member Author

vkhodygo commented Oct 31, 2023

I've fixed some of the issues and replaced a hardcoded postfix with a parameter; however, the other methods do not incorporate this change yet.

  • TODO: finish this bit

We could employ furrr::future_map, but I have a few concerns. arrow is inherently parallel itself, and I don't want to create a cluster of threads that spawn threads of their own: that approach leads to potential bottlenecks and crashes, so better safe than sorry. Besides, this code is reasonably performant as it is; at the moment it takes about an hour per scenario.
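
If we ever did layer furrr on top, one mitigation (a sketch, not tested here) would be capping arrow's own thread pools so each worker doesn't oversubscribe the machine:

```r
library(arrow)

# Limit arrow's compute and I/O threads in each R session;
# the counts here are illustrative, tune them to the hardware
arrow::set_cpu_count(2)
arrow::set_io_thread_count(2)
```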

Regarding your last point, I'm not sure how it works in R. Any suggestions?

P.S. I also would like you to take a look at convert.single_simulation. It's in need of code consolidation (switch?), could you please give it a go?
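
For the consolidation idea, a hypothetical shape it could take (the table names and schema objects below are placeholders, not the PR's actual ones):

```r
# Hypothetical sketch: pick a schema per table name with switch(),
# so every case shares a single reading path
convert_table <- function(path, table) {
  schema <- switch(table,
    persons    = persons_schema,    # placeholder schema objects
    households = households_schema,
    stop("unknown table: ", table)
  )
  arrow::open_csv_dataset(path, schema = schema)
}
```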

@andrewbaxter439
Member

Hi Vlad, thanks for the update. I was working on this the other week and hit some barriers to progress, at the same time as my simpaths parallel runs stopped working all over again while I attempted to generate a test dataset to work on. I spent hours watching it repeatedly crash after a tiny number of runs, but then noticed that the home drive is 99% full?? I ran out of time that week to solve that. Will pick this up again in the coming weeks.

@vkhodygo
Member Author

vkhodygo commented Nov 1, 2023

but then noticed that the home drive is 99% full

That's my fault, I apologize. For some reason the quotas don't work, and I also need a lot of space to store ML models and the synthetic results produced. Meanwhile, try using /storage for that.

@andrewbaxter439
Member

Finally got round to testing this after a lot of problems. I think that with the tweaks in #12 it works perfectly, giving usable parquet files for baseline/reform runs.
