Percent class for working with percentages #127

Open
Nic-Chr opened this issue Mar 7, 2024 · 3 comments

Comments

@Nic-Chr
Contributor

Nic-Chr commented Mar 7, 2024

Motivation

Working with percentages in R can be annoying, to say the least, and in day-to-day analyses I tend to find myself in this general workflow:

  • Create one vector of proportions
  • Create another vector of formatted percentages
  • Make sure to use the proportions for math operations and the percentages for pretty outputs

Having a percent class object could reduce this workflow by combining the two vectors into one, reducing the work needed to manage independent vectors.

Describe the solution you'd like
It would be nice to see a percent class that represents proportions without losing precision and simply prints them as percentages.
This would help analysts across PHS spend less time thinking about how to format percentages.
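
To illustrate the idea, here is a minimal S3 sketch that stores proportions unchanged and only converts to percentages when formatting. The class name "pct" and the functions new_pct(), format.pct() and print.pct() are hypothetical names chosen for this illustration; this is not the code from the percent package.

```r
# Hypothetical constructor: tag a numeric vector of proportions with a class.
new_pct <- function(x = double()) {
  stopifnot(is.numeric(x))
  structure(x, class = "pct")
}

# format() multiplies by 100 for display only; the stored values are untouched.
format.pct <- function(x, digits = 2, ...) {
  out <- paste0(formatC(unclass(x) * 100, format = "f", digits = digits), "%")
  out[is.na(x)] <- NA_character_
  names(out) <- names(x)
  out
}

# print() shows the formatted strings and returns the object invisibly.
print.pct <- function(x, ...) {
  print(format(x, ...), quote = FALSE)
  invisible(x)
}

p <- new_pct(c(a = 0.0689655, b = 0.5))
format(p)        # formatted as percentages with 2 decimal places
unclass(p) * 87  # the underlying proportions are still available for math
```

Because only the class attribute is added, the full precision of the proportions is kept and rounding happens at display time.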

Describe alternatives you've considered
I have made a small package that does this, see: github.com/NicChr/percent
I'm aware of scales::percent(), but it returns a character vector, whereas as_percent() applies no transformation at all: it returns an object of class "percent" that preserves the underlying proportions and prints as a "percent" vector in tibbles.

@Tina815 @Moohan Let me know if you think this would be a good fit for phsmethods and if so I'd be happy to assist in future implementations.

If this was deemed to be a good fit, I would be happy for the code to be copied over, reducing the need for another package dependency.

Below I've included some basic examples.

Basic usage

library(remotes)
install_github("NicChr/percent", force = FALSE)
library(percent)

# Motivation --------------------------------------------------------------

### Percentage of NAs by column

## Normal workflow might look like this

library(dplyr)
na_counts <- colSums(is.na(starwars))
prop <- na_counts / nrow(starwars)
perc <- round(prop * 100, 2)
perc <- paste0(perc, "%")
names(perc) <- names(prop)
perc
#>       name     height       mass hair_color skin_color  eye_color birth_year 
#>       "0%"     "6.9%"   "32.18%"    "5.75%"       "0%"       "0%"   "50.57%" 
#>        sex     gender  homeworld    species      films   vehicles  starships 
#>     "4.6%"     "4.6%"   "11.49%"     "4.6%"       "0%"       "0%"       "0%"

## With `as_percent` it's a bit easier


perc2 <- as_percent(prop)
perc2
#>       name     height       mass hair_color skin_color  eye_color birth_year 
#>   "0.000%"   "6.897%"  "32.184%"   "5.747%"   "0.000%"   "0.000%"  "50.575%" 
#>        sex     gender  homeworld    species      films   vehicles  starships 
#>   "4.598%"   "4.598%"  "11.494%"   "4.598%"   "0.000%"   "0.000%"   "0.000%"
class(perc2)
#> [1] "percent"
unclass(perc2) # Under the hood it is just the proportions
#>       name     height       mass hair_color skin_color  eye_color birth_year 
#> 0.00000000 0.06896552 0.32183908 0.05747126 0.00000000 0.00000000 0.50574713 
#>        sex     gender  homeworld    species      films   vehicles  starships 
#> 0.04597701 0.04597701 0.11494253 0.04597701 0.00000000 0.00000000 0.00000000

### We can then work with the perc vector without ever needing to use prop


round(perc2, 0)
#>       name     height       mass hair_color skin_color  eye_color birth_year 
#>       "0%"       "7%"      "32%"       "6%"       "0%"       "0%"      "51%" 
#>        sex     gender  homeworld    species      films   vehicles  starships 
#>       "5%"       "5%"      "11%"       "5%"       "0%"       "0%"       "0%"
round(perc2, 1)
#>       name     height       mass hair_color skin_color  eye_color birth_year 
#>     "0.0%"     "6.9%"    "32.2%"     "5.7%"     "0.0%"     "0.0%"    "50.6%" 
#>        sex     gender  homeworld    species      films   vehicles  starships 
#>     "4.6%"     "4.6%"    "11.5%"     "4.6%"     "0.0%"     "0.0%"     "0.0%"
round(perc2, 2)
#>       name     height       mass hair_color skin_color  eye_color birth_year 
#>    "0.00%"    "6.90%"   "32.18%"    "5.75%"    "0.00%"    "0.00%"   "50.57%" 
#>        sex     gender  homeworld    species      films   vehicles  starships 
#>    "4.60%"    "4.60%"   "11.49%"    "4.60%"    "0.00%"    "0.00%"    "0.00%"

### halves are rounded up


round(percent(14.5))
#> [1] "15%"

### We can use math operations as well


# Number of NAs
nrow(starwars) * perc # Fails: perc is a character vector
#> Error in nrow(starwars) * perc: non-numeric argument to binary operator
nrow(starwars) * perc2 # Works: perc2 stores the numeric proportions
#>       name     height       mass hair_color skin_color  eye_color birth_year 
#>          0          6         28          5          0          0         44 
#>        sex     gender  homeworld    species      films   vehicles  starships 
#>          4          4         10          4          0          0          0

### Usage in ggplot


library(ggplot2)
df <- starwars %>%
  count(homeworld, sort = TRUE) %>%
  mutate(homeworld = if_else(row_number() %in% 1:5, homeworld, "Other"),
         homeworld = if_else(is.na(homeworld), "Other", homeworld)) %>%
  filter(homeworld != "Other") %>%
  count(homeworld, wt = n, sort = TRUE) %>%
  mutate(homeworld = factor(homeworld, levels = unique(homeworld))) %>%
  mutate(perc = as_percent(n/sum(n))) %>%
  arrange(desc(perc))
df
#> # A tibble: 4 × 3
#>   homeworld     n perc     
#>   <fct>     <int> <percent>
#> 1 Naboo        11 40.741%  
#> 2 Tatooine     10 37.037%  
#> 3 Alderaan      3 11.111%  
#> 4 Coruscant     3 11.111%

### Pie chart


df %>%
  ggplot(aes(x = 1, y = perc, fill = homeworld)) + 
  geom_col() +
  scale_y_continuous(labels = as_percent) +
  coord_polar(theta = "y") +
  geom_text(aes(label = perc),
            position = position_stack(vjust = 0.5),
            size = 3) +
  theme_void(base_size = 12) + 
  labs(title = "Pie-chart of top 5 most common starwars planets")
#> Don't know how to automatically pick scale for object of type <percent>.
#> Defaulting to continuous.

Created on 2024-03-07 with reprex v2.0.2
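
A note on the "halves are rounded up" example: base R's round() follows the IEC 60559 convention of rounding halves to the even digit, so round(14.5) gives 14. Below is a minimal sketch of half-up rounding. The helper round_half_up() is a hypothetical name used here for illustration (not necessarily how the percent package does it), and the small tolerance is an assumption to absorb floating-point error.

```r
# Hypothetical helper: round halves away from zero instead of to even.
round_half_up <- function(x, digits = 0) {
  scale <- 10^digits
  # Small tolerance so values like 0.145 * 100 (slightly below 14.5 in
  # floating point) are still treated as halves.
  eps <- sqrt(.Machine$double.eps)
  sign(x) * floor(abs(x) * scale + 0.5 + eps) / scale
}

round(14.5)          # 14: base R rounds halves to the even digit
round_half_up(14.5)  # 15
round_half_up(-2.5)  # -3
```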

@Moohan
Member

Moohan commented Jul 2, 2024

It was agreed to take this forward as a PR.

@Nic-Chr
Contributor Author

Nic-Chr commented Jul 3, 2024

There are a few things I'm not sure about regarding the implementation from a user perspective.

  1. Right now I have 2 functions, percent() and as_percent().
    percent() simply creates a percent vector from percentage inputs, e.g. 100 becomes 100%.
    as_percent() converts proportions to percentages.
    It's not clear to me which is more intuitive for users, and whether we should keep just one, both, or something a bit different.
  2. When doing math with percent vectors, what is the most logical or expected outcome? To me it seems sensible to return a percent vector when two percent vectors are multiplied. When one operand is a percent vector and the other is a numeric vector, the outcome is less obvious. Right now my implementation always returns a percent vector in this case, but it might make more sense to depend on the order of the classes: if the LHS is a percent and the RHS is not, the result is a percent; likewise, if the LHS is not a percent and the RHS is, the result is a numeric.
  3. Should as.character.percent() apply rounding by default? The reason I opted for this is because it plays nicely with ggplot2 which relies on calling as.character in plots, which makes things easier to read. On the other hand a user might expect to see all the underlying digits when using as.character.percent().
  4. A similar concern to 3: format.percent() by default applies decimal-place rounding instead of the significant-digit rounding that format() usually uses. This is because, in my opinion, decimal rounding is generally much nicer for percentages and hence more useful for users. One solution would be to make this distinction clear in the documentation.
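
To make points 1, 2 and 4 concrete, here is a rough sketch of the order-dependent option. The class name "pct" and the functions pct(), as_pct() and Ops.pct() are hypothetical names for illustration only; they are not the package's current behaviour, which always returns a percent vector.

```r
# Point 1: two hypothetical constructors; values are stored as proportions.
pct <- function(x) structure(x / 100, class = "pct")  # percentage input: 50 -> 50%
as_pct <- function(x) structure(x, class = "pct")     # proportion input: 0.5 -> 50%

# Point 2: group generic where the result class follows the left-hand side.
Ops.pct <- function(e1, e2) {
  if (missing(e2)) stop("unary operators are not handled in this sketch")
  lhs_is_pct <- inherits(e1, "pct")
  out <- get(.Generic)(unclass(e1), unclass(e2))
  is_arith <- .Generic %in% c("+", "-", "*", "/")
  # Keep the class only for arithmetic with a percent on the left.
  if (lhs_is_pct && is_arith) class(out) <- "pct"
  out
}

class(as_pct(0.5) * 2)  # "pct": LHS is a percent
class(2 * as_pct(0.5))  # "numeric": LHS is a plain number
unclass(pct(50))        # 0.5: both constructors store proportions

# Point 4: decimal-place vs. significant-digit rounding of 12.3456 (%).
round(12.3456, 2)   # 12.35
signif(12.3456, 2)  # 12
```

One consequence of this design is that multiplication stops being symmetric in its result type, which would need to be documented clearly for users.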

@Nic-Chr
Contributor Author

Nic-Chr commented Jul 22, 2024

I have now opened a PR here: #130
