-
Dear Polkas, thanks for your suggestion. It would be wonderful if we could speed up matcher(). A faster method should not affect any of the statistical properties. Could your code be considered a drop-in replacement? The current matcher() calculates the differences between the predicted values, breaks ties by adding a small amount of noise, and selects the closest donors.
Is each of these done in your solution? If so, I suggest that you open a pull request that adds your method.
-
Hi, there is no need to calculate differences and choose the smallest distance. This step is omitted by finding the closest match directly for every missing value. The only thing that I did not implement is adding small noise to break ties, although it is not clear to me that this step is needed; this method is recommended for continuous variables. Still, I could add a random vector of the same length as obs to the obs vector, but that forces us to sort again at every new missing-value iteration. Yes, I can open a pull request if you are satisfied with the new solution.
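To illustrate the idea in plain R, here is a minimal sketch (a hypothetical helper, not the actual miceFast code, and ignoring the k-donor sampling): sort the observed values once, then locate the closest match for every missing case by binary search.

# Sketch: direct closest-value matching via one sort plus binary search
nearest_value <- function(yobs, yhat_mis) {
  s  <- sort(yobs)
  lo <- pmax(findInterval(yhat_mis, s), 1L)  # last sorted value <= target
  hi <- pmin(lo + 1L, length(s))             # next sorted value
  ifelse(abs(s[lo] - yhat_mis) <= abs(s[hi] - yhat_mis), s[lo], s[hi])
}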
-
It is important to break ties explicitly. If we don't do that, we end up with the same set of matches for cases that have the same predicted value. The random generator does not need to be fancy, and may be fast and cheap. The catch is that you may need to re-sort for every case. Perhaps there's a smarter way to break ties?
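For instance, a minimal sketch of such a cheap tie-breaker, assuming the predictions are perturbed once before matching:

# Add tiny uniform noise so exact ties in yhat become distinct;
# eps is negligible relative to the scale of yhat.
break_ties <- function(yhat) {
  eps <- 1e-8 * max(diff(range(yhat)), .Machine$double.eps)
  yhat + runif(length(yhat), -eps, eps)
}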
-
OK, I will think about how to solve this, as it is an important step. It might be easy to implement; we will see.
-
My solution seems superior in terms of time efficiency, but it returns values and does not take ties into account, like the majority of solutions (I do not want to change this behaviour at the moment). My friend @bbaranow (https://github.com/bbaranow) recommended another solution to you: https://cran.r-project.org/web/packages/RANN/RANN.pdf. nn2() gives much better performance and has an eps option (randomness for ties); be careful with the default k. I will stick with my solution and implement it in my package, as it is at least a few times faster than the other best solutions. It is mainly much faster than nn2 as k increases, and k will be much bigger for longer datasets; in my opinion, k should be close to sqrt(nobs). PS: another nice solution is knnx.index, https://cran.r-project.org/web/packages/FNN/FNN.pdf.
-
Some benchmarks; a x500, or even a x50, speedup seems a huge improvement:

> vals = rnorm(100000)
>
> ss = rnorm(10000)
>
> neibo(vals,ss,200)[1:10]
[1] 0.81831921 0.08398758 -1.51465258 0.63023247 -1.49508517 -0.44044937 0.23820550 -0.35262505 -0.05485046 -0.07017419
>
> vals[mice:::matcher(vals,ss,200)][1:10]
[1] 0.82230085 0.08458459 -1.50485773 0.63139331 -1.50042102 -0.44243047 0.23895351 -0.35322907 -0.05334500 -0.07054163
>
> tt <- t(rmultinom(10000, 1L, rep(1L, 200)))
>
> vals[rowSums(nn2(vals,ss,200)$nn.idx*tt)][1:10]
[1] 0.82414926 0.08634572 -1.50921362 0.62780761 -1.49780378 -0.43951530 0.24146793 -0.35572542 -0.05485046 -0.07178673
>
> vals[knnx.index(vals,ss,200)][1:10]
[1] 0.82116125 0.08442159 -1.50936181 0.63020304 -1.50124389 -0.44015239 0.23931200 -0.35350583 -0.05536845 -0.07103484
>
> microbenchmark::microbenchmark(neibo(vals,ss,200),
+ mice:::matcher(vals,ss,200),
+ rowSums(nn2(vals,ss,200)$nn.idx*tt),
+ rowSums(nn2(vals,ss,200,searchtype='radius',eps=0.0001)$nn.idx*tt),
+ knnx.index(vals,ss,200,algorithm=c("kd_tree", "cover_tree","CR", "brute")[1]),
+ knnx.index(vals,ss,200,algorithm=c("kd_tree", "cover_tree","CR", "brute")[2]),
+ knnx.index(vals,ss,200,algorithm=c("kd_tree", "cover_tree","CR", "brute")[3]),
+ knnx.index(vals,ss,200,algorithm=c("kd_tree", "cover_tree","CR", "brute")[4]),
+ times=3)
Unit: milliseconds
expr min lq mean median uq max neval
neibo(vals, ss, 200) 28.21266 29.02797 29.52209 29.84328 30.1768 30.51031 3
mice:::matcher(vals, ss, 200) 15075.24839 15351.60940 15701.57446 15627.97041 16014.7375 16401.50457 3
rowSums(nn2(vals, ss, 200)$nn.idx * tt) 248.59326 260.20204 266.20996 271.81081 275.0183 278.22580 3
rowSums(nn2(vals, ss, 200, searchtype = "radius", eps = 1e-04)$nn.idx * tt) 91.31625 96.96789 132.15978 102.61954 152.5815 202.54354 3
knnx.index(vals, ss, 200, algorithm = c("kd_tree", "cover_tree", "CR", "brute")[1]) 241.95104 247.53542 254.26468 253.11980 260.4215 267.72321 3
knnx.index(vals, ss, 200, algorithm = c("kd_tree", "cover_tree", "CR", "brute")[2]) 1155.50126 1157.54041 1185.72831 1159.57957 1200.8418 1242.10411 3
knnx.index(vals, ss, 200, algorithm = c("kd_tree", "cover_tree", "CR", "brute")[3]) 6060.24203 6177.12370 6224.06726 6294.00536 6305.9799 6317.95439 3
knnx.index(vals, ss, 200, algorithm = c("kd_tree", "cover_tree", "CR", "brute")[4]) 4616.75115 4633.58245 4642.22515 4650.41375 4654.9621 4659.51054 3
-
Thanks for your comparisons. I had a look at your suggestions.
-
I updated the previous benchmarks. As I understand it, eps can break ties if the other closest value is bigger/smaller by at least eps. However, I see that you are interested in the case of an integer vector in which some values are repeated; in such cases PMM mostly should not be used. PMM, as I understand it, profits from the variability around a data point. You want to have some variability in terms of indexes, but I think that is not so useful; we only need the values. What difference does it make whether the user gets the value 10 from index 10 or the value 10 from index 98?
-
Thanks for the addition. Predictive mean matching is a versatile imputation method, and - when properly done - works for continuous and discrete outcomes, as well as for continuous and discrete predictors. I have written a section on its strengths and limitations in my book: https://stefvanbuuren.name/fimd/sec-pmm.html.
-
Dear all, as a side note I would like to add that the idea of sorting has also been around here for a while: https://github.com/alexanderrobitzsch/miceadds/blob/master/src/miceadds_rcpp_pmm6.cpp. Even a pure R version can be made relatively fast. Best,
-
Nice to meet you. Thanks for your response. I read this code and the R version too. However, this code contains a lot of copy operations. What could be an error is looking across the -k,+k closest values, not the k closest. I understand that this bigger window is an effect of the algorithm: we do not know in which direction to go, so an approximation was accepted. Another major thing is that the data are sorted by y_hat (although y values are imputed): y_hat (for obs) is compared with y_hat (for miss), not y (what we observed) with y_hat (for miss). This loses information about what we observed, which is very unexpected. I found that it can easily be changed to use yobs in the code. If I replace everything after the regression with these five lines, I get what I/we want:
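The five replacement lines themselves did not survive the extraction; purely as a hypothetical sketch of the described change (matching the predictions for the missing cases against the observed y rather than against the predictions for the observed cases):

# Hypothetical sketch, not the actual five lines. x, y, ry (TRUE =
# observed) and beta.star are assumed to exist as in mice's pmm code;
# neibo() is the matcher benchmarked above and returns matched values.
yhat_mis <- x[!ry, , drop = FALSE] %*% beta.star  # predictions for missing cases
y_imp    <- neibo(y[ry], yhat_mis, 5L)            # match against observed y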
neibo is the function presented in this issue - the matcher replacement.
-
A few adjustments were made to the previous message.
-
Thank you for the response. I do not see why one should not sort y_hat, because we are matching on y_hat and not on the observed y. Agreed that there is a lot of copying, but I doubt that this increases computation time substantially. I also agree that one does not know the direction of searching, but I think that I handled it somehow (I wrote the function 3 or 4 years ago, so I would have to dive into the code again).
-
You assessed y_hat for the whole dataset, even for observations that are not missing. We have the vector c(y_hat_miss, y_hat_obs) and then you sort it. When the vector is sorted, you make one loop over it using the -k,+k window and sample from it every time. What I expected, precisely, was to sort c(y_hat_miss, y_obs (not hat)) and then use this window, even though this method is only an approximation. I spent an hour on this code; it is very clever and tricky.
-
Alexander, thanks for the pointer to your approach. Do you know whether your method makes any special provisions for handling ties in y_hat?
-
@stefvanbuuren: Yes, ties are broken. I have to say that my intention was not to translate the original matching function into faster code. First, I sort the vector (y_hat(obs), y_hat(mis)). I then look for the nearest neighbors of y_hat(mis),i for a particular case i. There can be k nearest neighbors in y_hat(obs) to y_hat(mis),i that are smaller and k nearest neighbors that are larger (if k smaller or larger neighbors exist). I am aware that this approach is not the original pmm approach. For "typical" datasets, I do not expect dramatic differences between the implemented approach and the mice approach. But, in any case, presorting the data in advance and conducting the case-wise matching will decrease computation time. @Polkas: I still do not see why I should sort y_obs, because the y_hat values are matched. You correctly described the implemented approach, which uses a window with smaller and larger y_hat values of the neighbors. When I originally wrote the R code, the aim was to avoid the loop over cases with missing observations.
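A rough sketch of this windowed matching in plain R (an illustration of the idea only, not the miceadds_rcpp_pmm6.cpp code):

# Sort donors by yhat once; for each missing case take up to k donors
# on either side of its insertion point and sample one of them.
window_match <- function(yhat_obs, y_obs, yhat_mis, k = 5L) {
  o      <- order(yhat_obs)
  yhat_s <- yhat_obs[o]
  y_s    <- y_obs[o]
  pos    <- findInterval(yhat_mis, yhat_s)
  vapply(pos, function(p) {
    idx <- max(1L, p - k + 1L):min(length(y_s), p + k)
    y_s[idx[sample.int(length(idx), 1L)]]  # random draw also breaks ties
  }, numeric(1))
}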
-
hey @alexanderrobitzsch if you hilite @stef you hilite the wrong guy, just like if I'd hilite @alex I'd piss off someone unrelated....
-
@stef: Agreed. I recognized it too late.
-
Here's a little contrived problem just to make the problem with ties more explicit. Suppose we have the following donors (i.e., cases with observed data):

data <- data.frame(
  yhat = c(9, rep(10, 10), 11),
  y = c(8, 6:15, 12))
> data
yhat y
1 9 8
2 10 6
3 10 7
4 10 8
5 10 9
6 10 10
7 10 11
8 10 12
9 10 13
10 10 14
11 10 15
12 11 12

Also, suppose we want to find the five closest donors for a target with yhat = 10. However, if we're coming from below, we will always select the same five donors, with y-values 6, 7, 8, 9 and 10. Randomly breaking the ties would solve it.
-
Maybe I could add a suggestion. If one has an appropriate matching procedure available that does not break ties, one can simply handle it by reshuffling the input data before matching.
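A minimal sketch of that reshuffling trick, assuming match_fun() is any deterministic matcher that returns indices into its donor argument:

# Permute the donors before matching so a deterministic matcher
# no longer returns the same donor for tied predictions.
shuffled_match <- function(yhat_obs, y_obs, yhat_mis, match_fun) {
  perm <- sample.int(length(yhat_obs))
  idx  <- match_fun(yhat_obs[perm], yhat_mis)  # indices into permuted donors
  y_obs[perm][idx]
}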
-
@alexanderrobitzsch Thanks, I think reshuffling will help to get the tied donors unsorted. Perhaps it depends on the nature of the data whether the "same set" issue would be a problem. Can we trust reshuffling if we have a simple imputation model with just one categorical predictor? The answer is probably yes, but I will need a little more thinking.
-
New matchindex() function

Notwithstanding my rusty knowledge of C++, I was able to create a speedy matchindex() function that incorporates many of the suggestions and enhancements noted above.

All in all, use the new matchindex() function in place of the internal matcher. For datasets under 1000 rows, you won't notice a big difference in speed. Speed gains are good, e.g. by a factor of 2-3. However, for datasets with over 5000 records, the speed gains can be spectacular, for example when overimputing a dataset with 10,000 rows with predictive mean matching.
Thanks @Polkas for bringing these large gains to my attention. If you have any suggestions for further improvements, please let me know. I expect that the new algorithm will have the same statistical properties as the old one, but we still need to look into this and confirm proper behaviour.
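For testing, a usage sketch along these lines should work; treat the exact interface as an assumption here, namely that matchindex(d, t, k) takes donor values d, target values t, and the number of candidate donors k, and returns one donor index per target:

library(mice)
set.seed(1)
d <- rnorm(10000)                # e.g., yhat under the observed data
t <- rnorm(100)                  # e.g., yhat for the missing cases
idx <- matchindex(d, t, k = 5L)  # assumed signature: one index per target
# in pmm one would then impute the observed y of the selected donors:
# imputed <- y_obs[idx]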
-
Great to see this improvement. This puts mice PMM on another level. I will try to test it and provide some feedback shortly.
-
Awesome! Will do some testing.
-
I calculated part of Table 3.3 from FIMD for the new matchindex() function, paying particular attention to small samples.
-
Commit 4df1f3d incorporates the pmm speedup in mice.
-
I am the creator of the miceFast package. Currently I am developing a faster PMM solution, more precisely a faster version of your Rcpp matcher function.
Here you can find test code; it is even x1000 times faster or more:
Polkas/miceFast#10
As your package is much more popular than mine, and this function is very important in your project, it is crucial to take a potential improvement into account.