Commit

Add reference & rewrite readme and vignette
Moran79 committed Oct 31, 2024
1 parent 2429305 commit 23260e3
Showing 15 changed files with 175 additions and 121 deletions.
12 changes: 8 additions & 4 deletions R/Treee.R
@@ -71,9 +71,13 @@
#'
#' @export
#'
#' @references Wang, S. (2024). A New Forward Discriminant Analysis Framework
#' Based On Pillai's Trace and ULDA. \emph{arXiv preprint arXiv:2409.03136}.
#' Available at \url{https://arxiv.org/abs/2409.03136}.
#' @references Wang, S. (2024). FoLDTree: A ULDA-Based Decision Tree Framework
#' for Efficient Oblique Splits and Feature Selection. \emph{arXiv preprint
#' arXiv:2410.23147}. Available at \url{https://arxiv.org/abs/2410.23147}.
#'
#' Wang, S. (2024). A New Forward Discriminant Analysis Framework Based On
#' Pillai's Trace and ULDA. \emph{arXiv preprint arXiv:2409.03136}. Available
#' at \url{https://arxiv.org/abs/2409.03136}.
#'
#' @examples
#' fit <- Treee(datX = iris[, -5], response = iris[, 5], verbose = FALSE)
@@ -95,7 +99,7 @@ Treee <- function(datX,
misClassCost = NULL,
missingMethod = c("medianFlag", "newLevel"),
kSample = -1,
verbose = TRUE){ # Change verbose to FALSE before CRAN submission
verbose = TRUE){

# Standardize the Arguments -----------------------------------------------

2 changes: 1 addition & 1 deletion R/plot.R
@@ -6,7 +6,7 @@
#'
#' @section Overall Tree Structure:
#'
#' A full tree diagram is displayed using [visNetwork] when `node` is not
#' A full tree diagram is displayed using \link[visNetwork]{visNetwork} when `node` is not
#' specified (the default is `-1`). The color represents the most common
#' (plurality) class within each node, and the size of each terminal node
#' reflects its relative sample size. Below each node, the fraction of
45 changes: 31 additions & 14 deletions README.Rmd
@@ -21,47 +21,59 @@ knitr::opts_chunk$set(
![CRAN Downloads](https://cranlogs.r-pkg.org/badges/grand-total/LDATree)
<!-- badges: end -->

`LDATree` is an R modeling package for fitting classification trees. If you are unfamiliar with classification trees, here is a [tutorial](http://www.sthda.com/english/articles/35-statistical-machine-learning-essentials/141-cart-model-decision-tree-essentials/) about the traditional CART and its R implementation `rpart`.
`LDATree` is an R modeling package for fitting classification trees with oblique splits.

* If you are unfamiliar with classification trees, here is a [tutorial](http://www.sthda.com/english/articles/35-statistical-machine-learning-essentials/141-cart-model-decision-tree-essentials/) about the traditional CART and its R implementation `rpart`.

* More details about the LDATree can be found in Wang, S. (2024). *FoLDTree: A ULDA-Based Decision Tree Framework for Efficient Oblique Splits and Feature Selection*. arXiv preprint arXiv:2410.23147. [Link](https://arxiv.org/abs/2410.23147).

## Overview

Compared to other similar trees, `LDATree` sets itself apart in the following ways:
Compared to other similar trees, `LDATree` distinguishes itself in the following ways:

* Using Uncorrelated Linear Discriminant Analysis (ULDA) from the `folda` package, it can **efficiently find oblique splits**.

* It provides both ULDA and forward ULDA as the splitting rule and node model. Forward ULDA has intrinsic **variable selection**, which helps mitigate the influence of noise variables.

* It applies the idea of LDA (Linear Discriminant Analysis) when selecting variables, finding splits, and fitting models in terminal nodes.
* It automatically **handles missing values**.

* It addresses certain limitations of the R implementation of LDA (`MASS::lda`), such as handling missing values, dealing with more features than samples, and constant values within groups.
* It can output both predicted class and **class probability**.

* Re-implement LDA using the Generalized Singular Value Decomposition (GSVD), LDATree offers quick response, particularly with large datasets.
* It supports **downsampling**, which can be used to balance classes or accelerate the model fitting process.

* It includes several **visualization** tools to provide deeper insights into the data.

* The package also includes several visualization tools to provide deeper insights into the data.

## Installation

``` r
install.packages("LDATree")
```

The CRAN version is an outdated one from 08/2023. Please stay tune for the latest version, which will be released around 10/2024. Meanwhile, feel free to try the undocumented version bellow.
You can install the development version of `LDATree` from [GitHub](https://github.com/) with:

```{r,fig.asp=0.618,out.width = "80%",fig.align = "center", eval=FALSE}
library(devtools)
install_github('Moran79/LDATree')
# install.packages("devtools")
devtools::install_github('Moran79/LDATree')
```

## Usage
## Basic Usage

We offer two main tree types in the `LDATree` package: LDATree and FoLDTree. For the splitting rule and node model, LDATree uses ULDA, while FoLDTree uses forward ULDA.

To build an LDATree:
To build an LDATree (or FoLDTree):

```{r,fig.asp=0.618,out.width = "100%",fig.align = "center"}
library(LDATree)
set.seed(443)
diamonds <- as.data.frame(ggplot2::diamonds)[sample(53940, 2000),]
datX <- diamonds[, -2]
response <- diamonds[, 2] # we try to predict "cut"
fit <- Treee(datX = datX, response = response, verbose = FALSE)
fit <- Treee(datX = datX, response = response, verbose = FALSE) # by default, it is a pre-stopping FoLDTree
# fit <- Treee(datX = datX, response = response, verbose = FALSE, ldaType = "all", pruneMethod = "post") # if you want to fit a post-pruned LDATree.
```

To plot the LDATree:
To plot the LDATree (or FoLDTree):

```{r,fig.asp=0.618,out.width = "80%",fig.align = "center", eval=FALSE}
# View the overall tree.
@@ -76,7 +88,7 @@ plot(fit)
plot(fit, datX = datX, response = response, node = 1)
# 2. Density plot on the first LD score
plot(fit, datX = datX, response = response, node = 3)
plot(fit, datX = datX, response = response, node = 7)
# 3. A message
plot(fit, datX = datX, response = response, node = 2)
@@ -96,6 +108,11 @@ predictions <- predict(fit, datX, type = "all")
head(predictions)
```

More examples can be found in the [vignette](https://iamwangsiyu.com/LDATree/articles/LDATree.html).

## References

* Wang, S. (2024). FoLDTree: A ULDA-based decision tree framework for efficient oblique splits and feature selection. *arXiv preprint*, arXiv:2410.23147. Retrieved from https://arxiv.org/abs/2410.23147.

## Getting help

74 changes: 48 additions & 26 deletions README.md
@@ -11,60 +11,73 @@ status](https://www.r-pkg.org/badges/version/LDATree)](https://CRAN.R-project.or
![CRAN Downloads](https://cranlogs.r-pkg.org/badges/grand-total/LDATree)
<!-- badges: end -->

`LDATree` is an R modeling package for fitting classification trees. If
you are unfamiliar with classification trees, here is a
[tutorial](http://www.sthda.com/english/articles/35-statistical-machine-learning-essentials/141-cart-model-decision-tree-essentials/)
about the traditional CART and its R implementation `rpart`.
`LDATree` is an R modeling package for fitting classification trees with
oblique splits.

- If you are unfamiliar with classification trees, here is a
[tutorial](http://www.sthda.com/english/articles/35-statistical-machine-learning-essentials/141-cart-model-decision-tree-essentials/)
about the traditional CART and its R implementation `rpart`.

- More details about the LDATree can be found in Wang, S. (2024).
*FoLDTree: A ULDA-Based Decision Tree Framework for Efficient Oblique
Splits and Feature Selection*. arXiv preprint arXiv:2410.23147.
[Link](https://arxiv.org/abs/2410.23147).

## Overview

Compared to other similar trees, `LDATree` sets itself apart in the
Compared to other similar trees, `LDATree` distinguishes itself in the
following ways:

- It applies the idea of LDA (Linear Discriminant Analysis) when
selecting variables, finding splits, and fitting models in terminal
nodes.
- Using Uncorrelated Linear Discriminant Analysis (ULDA) from the
`folda` package, it can **efficiently find oblique splits**.

- It provides both ULDA and forward ULDA as the splitting rule and node
model. Forward ULDA has intrinsic **variable selection**, which helps
mitigate the influence of noise variables.

- It automatically **handles missing values**.

- It addresses certain limitations of the R implementation of LDA
(`MASS::lda`), such as handling missing values, dealing with more
features than samples, and constant values within groups.
- It can output both predicted class and **class probability**.

- Re-implement LDA using the Generalized Singular Value Decomposition
(GSVD), LDATree offers quick response, particularly with large
datasets.
- It supports **downsampling**, which can be used to balance classes or
accelerate the model fitting process.

- The package also includes several visualization tools to provide
deeper insights into the data.
- It includes several **visualization** tools to provide deeper insights
into the data.

## Installation

``` r
install.packages("LDATree")
```

The CRAN version is an outdated one from 08/2023. Please stay tune for
the latest version, which will be released around 10/2024. Meanwhile,
feel free to try the undocumented version bellow.
You can install the development version of `LDATree` from
[GitHub](https://github.com/) with:

``` r
library(devtools)
install_github('Moran79/LDATree')
# install.packages("devtools")
devtools::install_github('Moran79/LDATree')
```

## Usage
## Basic Usage

To build an LDATree:
We offer two main tree types in the `LDATree` package: LDATree and
FoLDTree. For the splitting rule and node model, LDATree uses ULDA,
while FoLDTree uses forward ULDA.

To build an LDATree (or FoLDTree):

``` r
library(LDATree)
set.seed(443)
diamonds <- as.data.frame(ggplot2::diamonds)[sample(53940, 2000),]
datX <- diamonds[, -2]
response <- diamonds[, 2] # we try to predict "cut"
fit <- Treee(datX = datX, response = response, verbose = FALSE)
fit <- Treee(datX = datX, response = response, verbose = FALSE) # by default, it is a pre-stopping FoLDTree
# fit <- Treee(datX = datX, response = response, verbose = FALSE, ldaType = "all", pruneMethod = "post") # if you want to fit a post-pruned LDATree.
```

To plot the LDATree:
To plot the LDATree (or FoLDTree):

``` r
# View the overall tree.
@@ -84,7 +97,7 @@ plot(fit, datX = datX, response = response, node = 1)
``` r

# 2. Density plot on the first LD score
plot(fit, datX = datX, response = response, node = 3)
plot(fit, datX = datX, response = response, node = 7)
```

<img src="man/figures/README-plot2-2.png" width="80%" style="display: block; margin: auto;" />
@@ -118,6 +131,15 @@ head(predictions)
#> 6 Ideal 6 4.827312e-03 0.061274797 0.1978061 0.027410359 0.7086815
```

More examples can be found in the
[vignette](https://iamwangsiyu.com/LDATree/articles/LDATree.html).

## References

- Wang, S. (2024). FoLDTree: A ULDA-based decision tree framework for
efficient oblique splits and feature selection. *arXiv preprint*,
arXiv:2410.23147. Retrieved from <https://arxiv.org/abs/2410.23147>.

## Getting help

If you encounter a clear bug, please file an issue with a minimal
14 changes: 8 additions & 6 deletions cran-comments.md
@@ -1,9 +1,11 @@
## Resubmission (08/25/2023)
# R CMD Check Results

Sorry for the inconvenience!
> New maintainer:
> Siyu Wang <[email protected]>
> Old maintainer(s):
> Siyu Wang <[email protected]>
>
> 0 errors ✔ | 0 warnings ✔ | 1 note ✖
This is a resubmission. In this version I have:
I have changed my email. Thanks for your time!

* increment the version number from 0.1.1 to 0.1.2

* Fixed one HTML plot in the vignette due to a CRAN check error for flavor r-release-macos-x86_64. The error message is *Pandoc is required to build R Markdown vignettes but not available. Please make sure it is installed.*
6 changes: 3 additions & 3 deletions man/LDATree-package.Rd

10 changes: 7 additions & 3 deletions man/Treee.Rd

Binary file modified man/figures/README-plot1-1.png
Binary file modified man/figures/README-plot2-2.png
2 changes: 1 addition & 1 deletion man/plot.Treee.Rd

2 changes: 1 addition & 1 deletion tests/testthat/test-Treee.R
@@ -3,7 +3,7 @@ test_that("folda: work on tibble", {
dat <- ggplot2::diamonds[1:100,]
fit <- Treee(dat[, -2], response = dat[[2]], verbose = FALSE)
result <- predict(fit, dat)
expect_equal(result[1:4], c("Very Good", "Ideal", "Ideal", "Premium"))
expect_equal(result[1:4], c("Ideal", "Premium", "Premium", "Very Good"))
})

test_that("folda: all columns are constant", {