Add reference & rewrite readme and vignette

Moran79 · Oct 31, 2024 · 23260e3
1 parent 2429305
Showing 15 changed files with 175 additions and 121 deletions.
diff --git a/R/Treee.R b/R/Treee.R
@@ -71,9 +71,13 @@
 #'
 #' @export
 #'
-#' @references Wang, S. (2024). A New Forward Discriminant Analysis Framework
-#'   Based On Pillai's Trace and ULDA. \emph{arXiv preprint arXiv:2409.03136}.
-#'   Available at \url{https://arxiv.org/abs/2409.03136}.
+#' @references Wang, S. (2024). FoLDTree: A ULDA-Based Decision Tree Framework
+#'   for Efficient Oblique Splits and Feature Selection. \emph{arXiv preprint
+#'   arXiv:2410.23147}. Available at \url{https://arxiv.org/abs/2410.23147}.
+#'
+#'   Wang, S. (2024). A New Forward Discriminant Analysis Framework Based On
+#'   Pillai's Trace and ULDA. \emph{arXiv preprint arXiv:2409.03136}. Available
+#'   at \url{https://arxiv.org/abs/2409.03136}.
 #'
 #' @examples
 #' fit <- Treee(datX = iris[, -5], response = iris[, 5], verbose = FALSE)
@@ -95,7 +99,7 @@ Treee <- function(datX,
                   misClassCost = NULL,
                   missingMethod = c("medianFlag", "newLevel"),
                   kSample = -1,
-                  verbose = TRUE){ # Change verbose to FALSE before CRAN submission
+                  verbose = TRUE){
 
   # Standardize the Arguments -----------------------------------------------
 

diff --git a/R/plot.R b/R/plot.R
@@ -6,7 +6,7 @@
 #'
 #' @section Overall Tree Structure:
 #'
-#'   A full tree diagram is displayed using [visNetwork] when `node` is not
+#'   A full tree diagram is displayed using \link[visNetwork]{visNetwork} when `node` is not
 #'   specified (the default is `-1`). The color represents the most common
 #'   (plurality) class within each node, and the size of each terminal node
 #'   reflects its relative sample size. Below each node, the fraction of

diff --git a/README.Rmd b/README.Rmd
@@ -21,47 +21,59 @@ knitr::opts_chunk$set(
 ![CRAN Downloads](https://cranlogs.r-pkg.org/badges/grand-total/LDATree)
 <!-- badges: end -->
 
-`LDATree` is an R modeling package for fitting classification trees. If you are unfamiliar with classification trees, here is a [tutorial](http://www.sthda.com/english/articles/35-statistical-machine-learning-essentials/141-cart-model-decision-tree-essentials/) about the traditional CART and its R implementation `rpart`.
+`LDATree` is an R modeling package for fitting classification trees with oblique splits.
+
+* If you are unfamiliar with classification trees, here is a [tutorial](http://www.sthda.com/english/articles/35-statistical-machine-learning-essentials/141-cart-model-decision-tree-essentials/) about the traditional CART and its R implementation `rpart`.
+
+* More details about the LDATree can be found in Wang, S. (2024). *FoLDTree: A ULDA-Based Decision Tree Framework for Efficient Oblique Splits and Feature Selection*. arXiv preprint arXiv:2410.23147. [Link](https://arxiv.org/abs/2410.23147).
 
 ## Overview
 
-Compared to other similar trees, `LDATree` sets itself apart in the following ways:
+Compared to other similar trees, `LDATree` distinguishes itself in the following ways:
+
+* Using Uncorrelated Linear Discriminant Analysis (ULDA) from the `folda` package, it can **efficiently find oblique splits**.
+
+* It provides both ULDA and forward ULDA as the splitting rule and node model. Forward ULDA has intrinsic **variable selection**, which helps mitigate the influence of noise variables.
 
-* It applies the idea of LDA (Linear Discriminant Analysis) when selecting variables, finding splits, and fitting models in terminal nodes.
+* It automatically **handles missing values**.
 
-* It addresses certain limitations of the R implementation of LDA (`MASS::lda`), such as handling missing values, dealing with more features than samples, and constant values within groups.
+* It can output both predicted class and **class probability**.
 
-* Re-implement LDA using the Generalized Singular Value Decomposition (GSVD), LDATree offers quick response, particularly with large datasets.
+* It supports **downsampling**, which can be used to balance classes or accelerate the model fitting process.
+
+* It includes several **visualization** tools to provide deeper insights into the data.
 
-* The package also includes several visualization tools to provide deeper insights into the data.
 
 ## Installation
 
 ``` r
 install.packages("LDATree")
 ```
 
-The CRAN version is an outdated one from 08/2023. Please stay tune for the latest version, which will be released around 10/2024. Meanwhile, feel free to try the undocumented version bellow.
+You can install the development version of `LDATree` from [GitHub](https://github.com/) with:
 
 ```{r,fig.asp=0.618,out.width = "80%",fig.align = "center", eval=FALSE}
-library(devtools)
-install_github('Moran79/LDATree')
+# install.packages("devtools")
+devtools::install_github('Moran79/LDATree')
 ```
 
-## Usage
+## Basic Usage
+
+We offer two main tree types in the `LDATree` package: LDATree and FoLDTree. For the splitting rule and node model, LDATree uses ULDA, while FoLDTree uses forward ULDA.
 
-To build an LDATree:
+To build an LDATree (or FoLDTree):
 
 ```{r,fig.asp=0.618,out.width = "100%",fig.align = "center"}
 library(LDATree)
 set.seed(443)
 diamonds <- as.data.frame(ggplot2::diamonds)[sample(53940, 2000),]
 datX <- diamonds[, -2]
 response <- diamonds[, 2] # we try to predict "cut"
-fit <- Treee(datX = datX, response = response, verbose = FALSE)
+fit <- Treee(datX = datX, response = response, verbose = FALSE) # by default, it is a pre-stopping FoLDTree
+# fit <- Treee(datX = datX, response = response, verbose = FALSE, ldaType = "all", pruneMethod = "post") # if you want to fit a post-pruned LDATree.
 ```
 
-To plot the LDATree:
+To plot the LDATree (or FoLDTree):
 
 ```{r,fig.asp=0.618,out.width = "80%",fig.align = "center", eval=FALSE}
 # View the overall tree.
@@ -76,7 +88,7 @@ plot(fit)
 plot(fit, datX = datX, response = response, node = 1)
 
 # 2. Density plot on the first LD score
-plot(fit, datX = datX, response = response, node = 3)
+plot(fit, datX = datX, response = response, node = 7)
 
 # 3. A message
 plot(fit, datX = datX, response = response, node = 2)
@@ -96,6 +108,11 @@ predictions <- predict(fit, datX, type = "all")
 head(predictions)
 ```
 
+More examples can be found in the [vignette](https://iamwangsiyu.com/LDATree/articles/LDATree.html).
+
+## References
+
+* Wang, S. (2024). FoLDTree: A ULDA-based decision tree framework for efficient oblique splits and feature selection. *arXiv preprint*, arXiv:2410.23147. Retrieved from https://arxiv.org/abs/2410.23147.
 
 ## Getting help
 

diff --git a/README.md b/README.md
@@ -11,60 +11,73 @@ status](https://www.r-pkg.org/badges/version/LDATree)](https://CRAN.R-project.or
 ![CRAN Downloads](https://cranlogs.r-pkg.org/badges/grand-total/LDATree)
 <!-- badges: end -->
 
-`LDATree` is an R modeling package for fitting classification trees. If
-you are unfamiliar with classification trees, here is a
-[tutorial](http://www.sthda.com/english/articles/35-statistical-machine-learning-essentials/141-cart-model-decision-tree-essentials/)
-about the traditional CART and its R implementation `rpart`.
+`LDATree` is an R modeling package for fitting classification trees with
+oblique splits.
+
+- If you are unfamiliar with classification trees, here is a
+  [tutorial](http://www.sthda.com/english/articles/35-statistical-machine-learning-essentials/141-cart-model-decision-tree-essentials/)
+  about the traditional CART and its R implementation `rpart`.
+
+- More details about the LDATree can be found in Wang, S. (2024).
+  *FoLDTree: A ULDA-Based Decision Tree Framework for Efficient Oblique
+  Splits and Feature Selection*. arXiv preprint arXiv:2410.23147.
+  [Link](https://arxiv.org/abs/2410.23147).
 
 ## Overview
 
-Compared to other similar trees, `LDATree` sets itself apart in the
+Compared to other similar trees, `LDATree` distinguishes itself in the
 following ways:
 
-- It applies the idea of LDA (Linear Discriminant Analysis) when
-  selecting variables, finding splits, and fitting models in terminal
-  nodes.
+- Using Uncorrelated Linear Discriminant Analysis (ULDA) from the
+  `folda` package, it can **efficiently find oblique splits**.
+
+- It provides both ULDA and forward ULDA as the splitting rule and node
+  model. Forward ULDA has intrinsic **variable selection**, which helps
+  mitigate the influence of noise variables.
+
+- It automatically **handles missing values**.
 
-- It addresses certain limitations of the R implementation of LDA
-  (`MASS::lda`), such as handling missing values, dealing with more
-  features than samples, and constant values within groups.
+- It can output both predicted class and **class probability**.
 
-- Re-implement LDA using the Generalized Singular Value Decomposition
-  (GSVD), LDATree offers quick response, particularly with large
-  datasets.
+- It supports **downsampling**, which can be used to balance classes or
+  accelerate the model fitting process.
 
-- The package also includes several visualization tools to provide
-  deeper insights into the data.
+- It includes several **visualization** tools to provide deeper insights
+  into the data.
 
 ## Installation
 
 ``` r
 install.packages("LDATree")
 ```
 
-The CRAN version is an outdated one from 08/2023. Please stay tune for
-the latest version, which will be released around 10/2024. Meanwhile,
-feel free to try the undocumented version bellow.
+You can install the development version of `LDATree` from
+[GitHub](https://github.com/) with:
 
 ``` r
-library(devtools)
-install_github('Moran79/LDATree')
+# install.packages("devtools")
+devtools::install_github('Moran79/LDATree')
 ```
 
-## Usage
+## Basic Usage
 
-To build an LDATree:
+We offer two main tree types in the `LDATree` package: LDATree and
+FoLDTree. For the splitting rule and node model, LDATree uses ULDA,
+while FoLDTree uses forward ULDA.
+
+To build an LDATree (or FoLDTree):
 
 ``` r
 library(LDATree)
 set.seed(443)
 diamonds <- as.data.frame(ggplot2::diamonds)[sample(53940, 2000),]
 datX <- diamonds[, -2]
 response <- diamonds[, 2] # we try to predict "cut"
-fit <- Treee(datX = datX, response = response, verbose = FALSE)
+fit <- Treee(datX = datX, response = response, verbose = FALSE) # by default, it is a pre-stopping FoLDTree
+# fit <- Treee(datX = datX, response = response, verbose = FALSE, ldaType = "all", pruneMethod = "post") # if you want to fit a post-pruned LDATree.
 ```
 
-To plot the LDATree:
+To plot the LDATree (or FoLDTree):
 
 ``` r
 # View the overall tree.
@@ -84,7 +97,7 @@ plot(fit, datX = datX, response = response, node = 1)
 ``` r
 
 # 2. Density plot on the first LD score
-plot(fit, datX = datX, response = response, node = 3)
+plot(fit, datX = datX, response = response, node = 7)
 ```
 
 <img src="man/figures/README-plot2-2.png" width="80%" style="display: block; margin: auto;" />
@@ -118,6 +131,15 @@ head(predictions)
 #> 6    Ideal    6 4.827312e-03 0.061274797 0.1978061 0.027410359 0.7086815
 ```
 
+More examples can be found in the
+[vignette](https://iamwangsiyu.com/LDATree/articles/LDATree.html).
+
+## References
+
+- Wang, S. (2024). FoLDTree: A ULDA-based decision tree framework for
+  efficient oblique splits and feature selection. *arXiv preprint*,
+  arXiv:2410.23147. Retrieved from <https://arxiv.org/abs/2410.23147>.
+
 ## Getting help
 
 If you encounter a clear bug, please file an issue with a minimal

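Reviewer note on the README changes above: the new "Basic Usage" text distinguishes the default fit (a pre-stopping FoLDTree, per the inline comment) from a post-pruned LDATree selected with `ldaType = "all"` and `pruneMethod = "post"`. A minimal sketch of fitting both side by side, assuming the `LDATree` package from this commit is installed and using `iris` as in the roxygen example. The object names (`fit_fold`, `fit_lda`, `pred_fold`, `pred_lda`) are illustrative only, and the accuracies are in-sample, so they say nothing about generalization.

```r
library(LDATree)

set.seed(443)  # same seed as the README example; pruning may involve resampling
datX <- iris[, -5]
response <- iris[, 5]

# Default call: described in the README comment as a pre-stopping FoLDTree
fit_fold <- Treee(datX = datX, response = response, verbose = FALSE)

# Arguments copied from the commented-out README line: a post-pruned LDATree
fit_lda <- Treee(datX = datX, response = response, verbose = FALSE,
                 ldaType = "all", pruneMethod = "post")

# Predicted classes from each tree, compared with the true labels (in-sample)
pred_fold <- predict(fit_fold, datX)
pred_lda <- predict(fit_lda, datX)
c(FoLDTree = mean(pred_fold == response),
  LDATree = mean(pred_lda == response))
```

No expected output is shown because the resulting trees depend on the installed package version.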
diff --git a/cran-comments.md b/cran-comments.md
@@ -1,9 +1,11 @@
-## Resubmission (08/25/2023)
+# R CMD Check Results
 
-Sorry for the inconvenience!
+>   New maintainer:
+>     Siyu Wang <[email protected]>
+>   Old maintainer(s):
+>     Siyu Wang <[email protected]>
+>
+> 0 errors ✔ | 0 warnings ✔ | 1 note ✖
 
-This is a resubmission. In this version I have:
+I have changed my email. Thanks for your time!
 
-* increment the version number from 0.1.1 to 0.1.2
-
-* Fixed one HTML plot in the vignette due to a CRAN check error for flavor r-release-macos-x86_64. The error message is *Pandoc is required to build R Markdown vignettes but not available. Please make sure it is installed.*
diff --git a/man/LDATree-package.Rd b/man/LDATree-package.Rd
diff --git a/man/Treee.Rd b/man/Treee.Rd
diff --git a/man/figures/README-plot1-1.png b/man/figures/README-plot1-1.png
diff --git a/man/figures/README-plot2-2.png b/man/figures/README-plot2-2.png
diff --git a/man/plot.Treee.Rd b/man/plot.Treee.Rd
diff --git a/tests/testthat/test-Treee.R b/tests/testthat/test-Treee.R
@@ -3,7 +3,7 @@ test_that("folda: work on tibble", {
   dat <- ggplot2::diamonds[1:100,]
   fit <- Treee(dat[, -2], response = dat[[2]], verbose = FALSE)
   result <- predict(fit, dat)
-  expect_equal(result[1:4], c("Very Good", "Ideal", "Ideal", "Premium"))
+  expect_equal(result[1:4], c("Ideal", "Premium", "Premium", "Very Good"))
 })
 
 test_that("folda: all columns are constant", {
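Reviewer note on the test change above: the expected labels in `expect_equal()` had to be rewritten in this commit, which suggests the exact predictions shift when the fitting behavior changes. As a hedged sketch, not the author's test, a companion check could assert only structural properties that should hold for any fitted tree. It assumes `testthat`, `ggplot2`, and this version of `LDATree` are installed, and that `predict()` returns one class label per row, as the existing test implies.

```r
library(testthat)
library(LDATree)

test_that("folda: predictions on a tibble are valid class labels", {
  dat <- ggplot2::diamonds[1:100, ]
  fit <- Treee(dat[, -2], response = dat[[2]], verbose = FALSE)
  result <- predict(fit, dat)

  # One prediction per input row
  expect_length(result, nrow(dat))
  # Every prediction is one of the levels of the response ("cut")
  expect_true(all(result %in% levels(dat[[2]])))
})
```

This complements rather than replaces the exact-value test: it will not catch a regression in prediction quality, but it also does not need updating whenever the default tree changes.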