Developer Diary
Spent last weekend moving, so didn't do much, and this morning was spent on Julia's JDF.jl and FastGroupBy.jl, so not much got done on disk.frame. But I need to continue at this pace and only work on open source on Sunday mornings to prevent burnout.
Someone reported difficulty in getting rid of the progress bar. I think I want to spend some time on that. The 5th of Nov is Melbourne Cup, so it's a holiday. I will see if I get time to fix it then.
Did a search of "disk.frame" on Twitter and LinkedIn and found that so many people have tweeted and posted about disk.frame already and have found it useful. One person did say disk.frame was "messy" with too many dependencies though. Which is kinda true, but given a tool with no dependencies that doesn't solve my problems vs one with lots of dependencies that does, I know which one I will choose. So I shall begin the journey to minimise dependencies at some point.
I have asked Jacky to be included as a contributor on CRAN for his amazing contribution in hard sorting.
Performed a review of Jacky Poon's PR on `hard_arrange`, which is technically quite excellent.
Started the boring journey of updating all tests to write to `tempdir()`. Up to `test-group-by.R`.
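The pattern I'm converting them to, roughly (a sketch, assuming the usual {testthat} setup; the test body itself is illustrative):

```r
library(testthat)
library(disk.frame)

test_that("group_by works on a disk.frame", {
  # write the fixture under tempdir() so checks leave no files behind
  df <- as.disk.frame(iris, outdir = file.path(tempdir(), "iris.df"),
                      overwrite = TRUE)
  expect_true(is_disk.frame(df))
  delete(df)  # clean up the chunk folder
})
```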
I broke the master branch, and Bruno couldn't install the package to check if a bug had been fixed! It had. Now I need to set up Travis etc. properly!
Started work on introducing `chunk_group_by` so that `group_by` is reserved for a grander group-by framework. This might be a lot of work, so I've started a new branch. I recall Dask had a framework for this, but it's difficult to find the exact page. I will do the same.
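Roughly what I have in mind, sketched with {dplyr} verbs (the `chunk_*` names are the plan, not shipped API yet; `nycflights13::flights` just provides example data):

```r
library(disk.frame)
library(dplyr)

flights.df <- as.disk.frame(nycflights13::flights,
                            outdir = file.path(tempdir(), "flights.df"),
                            overwrite = TRUE)

# chunk_group_by() groups within each chunk only, so a second, global
# aggregation after collect() is needed for correct overall results
flights.df %>%
  chunk_group_by(carrier) %>%
  chunk_summarize(n = n()) %>%  # per-chunk counts
  collect() %>%
  group_by(carrier) %>%
  summarize(n = sum(n))         # combine the per-chunk counts
```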
Also did some thinking around how to incorporate disk.frame with drake, which is a fantastic tool. I actually wanted to use drake in the backend to enable restartable operations. In the early days of {disk.frame}, I found that {future} often failed, so restartable operations were kind of important. Now that future is much more stable, it's less urgent to implement restartability, and I also want to focus on bug fixes, documentation, and usage guides rather than new features.
There is the pesky `sample_frac` which is causing warnings in the CRAN checks. So I need to look at it next week.
Also found that the author of drake starred {disk.frame} today!
I wanted to find a way to make some cash for {disk.frame} and wanted to research options. Decided to take a look at the NumFOCUS shop and found that they are using Spreadshirt, so I've done the same. The interface is often confusing and it will take a while to customise the designs, so I will need to cut down on the number of products before I go on there again. The prices are crazy too. I will not put it on the funding page for now.
The CRAN submission is really quick if your package is already on CRAN. I just submitted and it gave an error that I think was caused by the vignette cache, so I deleted the cache and resubmitted. When I woke up in the morning, it was on CRAN! I still have to wait a few days for the Windows binary to compile though, so I will publicise it after that, because most users are on Windows.
After I got home today I started writing the GLM functions, and I found out just how unreliable both {biglm} and {speedglm} are. Neither is managed on an open-source platform, so it's hard to contribute. I once contacted the author of {speedglm} to suggest a feature of some sort, and he replied but wanted me to hold his hand on how to use GitHub over email. I tried my best, but I don't think he is using GitHub.
This also reminds me of the time I emailed the author of {biglm} to tell him how `default` as a column name will cause issues: the word default is a reserved word in SQLite, so if your package talks to SQLite you must make sure that `'default'` is passed instead of just `default`. That never got anywhere either, and I ended up hosting a fixed version of biglm just for that.
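A minimal reproduction of the SQLite issue via {DBI} and {RSQLite} (the `loans` table is made up for illustration):

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# `default` is a reserved word in SQLite, so the bare identifier fails to parse
try(dbExecute(con, "CREATE TABLE loans (default INTEGER)"))

# quoting the identifier makes it valid
quoted <- dbQuoteIdentifier(con, "default")
dbExecute(con, sprintf("CREATE TABLE loans (%s INTEGER)", quoted))

dbDisconnect(con)
```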
I think it's the right time to submit 0.1.1, given that there's a bit more documentation, some new functions, and a fix for a pretty big bug with reading files bigger than available RAM.
However, I ran into some pretty serious issues with building the vignette. After trying many things, I accidentally found `devtools::clean_vignettes()`. Hopefully, it will work out! And nope, it didn't work!
Turns out my `.Rbuildignore` was interfering with the vignette in a weird way by excluding some of the vignette files. So I reverted `.Rbuildignore` and the check passed with 1 NOTE, which was that some folders are non-standard and appear in the build folder! OK, so now I just ignore those! I created a `misc` folder to keep random stuff that I don't need.
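As an aside, {usethis} can take care of the regex escaping when adding entries like this (assuming you use it):

```r
# add misc/ to .Rbuildignore with the pattern properly escaped
usethis::use_build_ignore("misc")
```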
Woke to a couple of issues submitted by a user. The way he uses the package is quite surprising, but I think it represents what a normal user would do. So I fixed the issues. I was working on a Julia dataframe serialiser, but I had to put that down and focus on disk.frame. Very happy that I was able to resolve the issues and made some changes to my code.
Finally came up with ways to fix the 30G load bugs! Turns out data.table was greedy and tries to mmap the whole file when it is bigger than RAM! So I had to use some other chunk reader to bring in the lines and let `fread` parse them. Also checked out LaF. Initially, I couldn't figure out how it worked, but I remembered chunked, took a look at its source, and then I understood how to use LaF. The LaF docs are really bare-bones. Too bare-bones. I am sure they would have more users if the documentation was better and if they had bothered to implement some function that makes it straightforward to use. But currently, you have to create this dm from a file and open the dm with `laf_open` before you can process it with `process_blocks`!?! Why not just make a function called `process_csv_file_by_chunk` and let the user input a file?
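For my own reference, the three-step LaF dance described above (the file name, chunk size, and row-counting callback are made up for illustration):

```r
library(LaF)

# step 1: build the data model (the "dm") by sniffing column types
dm <- detect_dm_csv("big.csv", header = TRUE)

# step 2: open the file through the model
laf <- laf_open(dm)

# step 3: process block by block; the callback receives each chunk plus the
# result accumulated so far (NULL on the first call)
n_rows <- process_blocks(laf, function(chunk, result) {
  if (is.null(result)) result <- 0
  result + nrow(chunk)
}, nrows = 100000)

close(laf)
```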
I also looked at `iotools`. Too complicated. Same issue as `LaF`, so I won't be using it.
disk.frame beats Spark too! Anyone else wanna challenge?
Also found `bigreadr`, which has a similar design to `disk.frame`, but the author is too busy to work on it. Would have been a good collaboration project though.
Got a couple of people asking disk.frame questions! One I was able to solve: the person added chunks manually, with the chunks named 1.fst etc. This use-case I had catered for but decided to remove; now I have added it back in. Another person asked about reading in a 30G file. He wrote a blog post on doing it with sparklyr. Spark required him to combine the data into one big CSV before loading it, whereas with disk.frame you can just load the files individually!
Decided to download some NYC taxi data and ran it through conversion to disk.frame. It ran remarkably well with no issues. But then I realized the `map` actually had parallelism disabled, as it was using `lapply` only, so I switched it back to `future_lapply`.
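The shape of the fix, roughly (assuming {future.apply}; the per-chunk body is a stand-in for the user's function, not the real internals):

```r
library(disk.frame)
library(future.apply)

future::plan(future::multisession)  # normally done by setup_disk.frame()

df <- as.disk.frame(iris, outdir = tempfile(fileext = ".df"))  # toy disk.frame

# lapply() runs sequentially; future_lapply() farms the chunks out to workers
res <- future_lapply(seq_len(nchunks(df)), function(i) {
  nrow(get_chunk(df, i))  # stand-in for the user's per-chunk function
})
```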
I then decided to double-check that the Fannie Mae tutorials are still working. And then I realized that I can save so much code now that my `csv_to_disk.frame` can naturally handle multiple CSVs in one go, instead of having to manually do a `rbindlist.disk.frame`. As I was running through the code I realized how a lack of progress reporting was making the user experience less than ideal, so I tried to use the `.progress` option implemented by `furrr` to display basic progress bars, like this:
Converting CSV to disk.frame: Progress: ────────────────────────────────────── 100%
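Put together, the Fannie Mae conversion shrinks to something like this (a hedged sketch: the folder, the file pattern, and the `.progress` pass-through are my assumptions):

```r
library(disk.frame)
setup_disk.frame()  # spin up the parallel workers

# one call handles all the quarterly files; no manual rbindlist.disk.frame
files <- list.files("fannie_mae", pattern = "\\.txt$", full.names = TRUE)
fm.df <- csv_to_disk.frame(files,
                           outdir = file.path(tempdir(), "fm.df"),
                           .progress = TRUE)  # the furrr progress bar above
```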
Fixed more issues with the CRAN submission, including converting all cases of T/F to TRUE/FALSE and making sure that I use `message()`, which I had never known about, instead of `print()` for printing user messages. There were some other fixes, including not writing to the user's filespace and writing to `tempdir()` only. Also, I cannot use more than 2 cores, so I had to comment out `setup_disk.frame()` as that may use more than 2 cores.
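For the record, the CRAN-friendly patterns boil down to this (capping `workers = 2` is one way to stay inside the limit; commenting the call out, as I did, is another):

```r
message("Converting CSVs...")                 # message(), not print(): it goes to
                                              # stderr and users can suppress it
outdir <- file.path(tempdir(), "example.df")  # examples write under tempdir() only
setup_disk.frame(workers = 2)                 # respect CRAN's 2-core limit
```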
After fixing all that, I went to submit but was greeted with:
Submission server vacation from on Aug. 9, 2019 to Aug 18, 2019
During this time, the submission of packages is not possible.
Sorry for any inconvenience
This is OK. I will just submit one week later.
Got a maintainer email from Martina Schmirl: I need to reduce the title to fewer than 65 characters, and there were a number of requirements around running the examples.
Fixing them now.
Adding examples to every function. BORING!!! But I think it will be super-useful for other users.
Submitted to CRAN for the fourth time. I feel sorry for the amount of volunteer time I have already taken up. The "sin" I committed was writing to the user's directory instead of writing to `tempdir()`. I have fixed this. Along the way I have made the default `outdir` point to somewhere in `tempdir()`, so the user doesn't have to specify `outdir =` every time.
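What the new default boils down to, roughly (`tempfile()` only generates a unique path under `tempdir()`; it creates nothing on disk):

```r
# hypothetical shape of the default outdir: a fresh path under tempdir()
outdir_default <- tempfile(pattern = "disk.frame_", fileext = ".df")
outdir_default
#> something like "/tmp/RtmpXXXXXX/disk.frame_1a2b3c.df"
```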
I am also tempted to make `overwrite = getOption("disk.frame.overwrite.default")` and set the option to `TRUE` by default, as I find setting `overwrite =` annoying. But then again, people use `disk.frame` to manipulate large amounts of data, so it's better to be safe and default to `overwrite = FALSE`.
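The option-driven pattern would have looked roughly like this ("disk.frame.overwrite.default" is my own made-up option name, and I've kept the safe `FALSE` fallback I settled on):

```r
# hypothetical: read the overwrite default from an option, falling back to FALSE
default_overwrite <- function() {
  getOption("disk.frame.overwrite.default", default = FALSE)
}

# a user who accepts the risk could opt in once per session
options(disk.frame.overwrite.default = TRUE)
```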
I have submitted to CRAN but then realised that setup_disk.frame(gui=T)
does NOT work if you started a fresh session of R