Merge branch 'development' into update_doc
Jennit07 authored Aug 19, 2024
2 parents 382c377 + 0e4adc5 commit bcc074f
Showing 14 changed files with 747 additions and 48 deletions.
9 changes: 6 additions & 3 deletions DESCRIPTION
@@ -6,7 +6,10 @@ Authors@R: c(
person("Public Health Scotland", , , "[email protected]", role = "cph"),
person("James", "McMahon", , "[email protected]", role = c("aut"),
comment = c(ORCID = "0000-0002-5380-2029")),
person("Megan", "McNicol", , "[email protected]", role = c("cre", "aut"))
person("Megan", "McNicol", , "[email protected]", role = c("cre", "aut")),
person("Zihao", "Li", , "[email protected]", role = c("aut"),
comment = c(ORCID = "0000-0002-5178-2124")),
person("Jennifer", "Thom", , "[email protected]", role = c("aut"))
)
Description: This package provides helper functions for working with
the Source Linkage Files (SLFs). The functions are mainly focused on
@@ -17,7 +20,7 @@ URL: https://public-health-scotland.github.io/slfhelper/,
https://github.com/Public-Health-Scotland/slfhelper
BugReports: https://github.com/Public-Health-Scotland/slfhelper/issues
Depends:
R (>= 4.0)
R (>= 4.0.0)
Imports:
arrow (>= 12.0.1),
cli (>= 3.6.1),
@@ -53,4 +56,4 @@ Language: en-GB
LazyData: true
Roxygen: list(markdown = TRUE, roclets = c("collate","namespace", "rd",
"vignette" ))
RoxygenNote: 7.3.1
RoxygenNote: 7.3.2
82 changes: 54 additions & 28 deletions R/read_slf.R
@@ -53,15 +53,19 @@ read_slf <- function(
# but the column wasn't selected we need to add it (and remove later)
remove_partnership_var <- FALSE
remove_recid_var <- FALSE
if (!is.null(col_select)) {
if (!is.null(partnerships) &
!("hscp2018" %in% col_select)) {
col_select <- c(col_select, "hscp2018")
if (!rlang::quo_is_null(rlang::enquo(col_select))) {
if (!is.null(partnerships) &&
stringr::str_detect(rlang::quo_text(rlang::enquo(col_select)),
stringr::coll("hscp2018"),
negate = TRUE
)) {
remove_partnership_var <- TRUE
}
if (!is.null(recids) & file_version == "episode" &
!("recid" %in% col_select)) {
col_select <- c(col_select, "recid")
if (!is.null(recids) && file_version == "episode" &&
stringr::str_detect(rlang::quo_text(rlang::enquo(col_select)),
stringr::coll("recid"),
negate = TRUE
)) {
remove_recid_var <- TRUE
}
}
@@ -71,27 +75,48 @@ read_slf <- function(
function(file_path) {
slf_table <- arrow::read_parquet(
file = file_path,
col_select = !!col_select,
col_select = {{ col_select }},
as_data_frame = FALSE
)

if (!is.null(recids)) {
if (!is.null(partnerships)) {
if (remove_partnership_var) {
slf_table <- cbind(
slf_table,
arrow::read_parquet(
file = file_path,
col_select = "hscp2018",
as_data_frame = FALSE
)
)
}
slf_table <- dplyr::filter(
slf_table,
.data$recid %in% recids
.data$hscp2018 %in% partnerships
)
if (remove_partnership_var) {
slf_table <- dplyr::select(slf_table, -"hscp2018")
}
}
if (!is.null(partnerships)) {

if (!is.null(recids)) {
if (remove_recid_var) {
slf_table <- cbind(
slf_table,
arrow::read_parquet(
file = file_path,
col_select = "recid",
as_data_frame = FALSE
)
)
}
slf_table <- dplyr::filter(
slf_table,
.data$hscp2018 %in% partnerships
.data$recid %in% recids
)
}
if (remove_partnership_var) {
slf_table <- dplyr::select(slf_table, -"hscp2018")
}
if (remove_recid_var) {
slf_table <- dplyr::select(slf_table, -"recid")
if (remove_recid_var) {
slf_table <- dplyr::select(slf_table, -"recid")
}
}

return(slf_table)
@@ -146,15 +171,16 @@ read_slf_episode <- function(
}
# TODO add option to drop blank CHIs?
# TODO add a filter by recid option

data <- read_slf(
year = year,
col_select = unique(col_select),
file_version = "episode",
partnerships = unique(partnerships),
recids = unique(recids),
as_data_frame = as_data_frame,
dev = dev
return(
read_slf(
year = year,
col_select = {{ col_select }},
file_version = "episode",
partnerships = unique(partnerships),
recids = unique(recids),
as_data_frame = as_data_frame,
dev = dev
)
)

if ("keytime1" %in% colnames(data)) {
@@ -203,7 +229,7 @@ read_slf_individual <- function(
return(
read_slf(
year = year,
col_select = unique(col_select),
col_select = {{ col_select }},
file_version = "individual",
partnerships = unique(partnerships),
as_data_frame = as_data_frame,
Binary file modified data/ep_file_vars.rda
Binary file not shown.
Binary file modified data/indiv_file_vars.rda
Binary file not shown.
2 changes: 1 addition & 1 deletion man/ep_file_vars.Rd


2 changes: 1 addition & 1 deletion man/indiv_file_vars.Rd


8 changes: 6 additions & 2 deletions tests/testthat/test-multiple_years.R
@@ -8,7 +8,9 @@ test_that("read multiple years works for individual file", {
indiv <- read_slf_individual(c("1718", "1819"),
col_select = c("year", "anon_chi")
) %>%
dplyr::slice_sample(n = 100)
dplyr::group_by(year) %>%
dplyr::slice_sample(n = 50) %>%
dplyr::ungroup()

# Test for anything odd
expect_s3_class(indiv, "tbl_df")
@@ -35,7 +37,9 @@ test_that("read multiple years works for episode file", {
ep <- read_slf_episode(c("1718", "1819"),
col_select = c("year", "anon_chi")
) %>%
dplyr::slice_sample(n = 100)
dplyr::group_by(year) %>%
dplyr::slice_sample(n = 50) %>%
dplyr::ungroup()

# Test for anything odd
expect_s3_class(ep, "tbl_df")
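
The switch from a single `slice_sample(n = 100)` to a grouped sample can be illustrated on toy data. This is a minimal sketch, assuming `dplyr` is installed; the `toy` table and its row counts are made up for illustration and are not SLF data. Sampling within each `year` group fixes the number of rows drawn per year, whereas an ungrouped sample draws from all years at once.

```r
library(dplyr)

# Hypothetical stand-in for two years of data, where one year
# has far more rows than the other.
toy <- tibble(
  year = rep(c("1718", "1819"), times = c(1000, 60)),
  anon_chi = as.character(seq_len(1060))
)

# Grouped sampling: exactly 50 rows are drawn from each year.
per_year <- toy %>%
  group_by(year) %>%
  slice_sample(n = 50) %>%
  ungroup()

# One row per year, each with n = 50.
count(per_year, year)
```
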
9 changes: 4 additions & 5 deletions tests/testthat/test-read_slf_episode.R
@@ -28,9 +28,8 @@ for (year in years) {
expect_equal(nrow(ep_file), 110)
})

# Need to come back to this test - some files have different lengths
# test_that("Episode file has the expected number of variables", {
# # Test for correct number of variables (will need updating)
# expect_length(ep_file, 241)
# })
test_that("Episode file has the expected number of variables", {
# Test for correct number of variables (will need updating)
expect_length(ep_file, 251)
})
}
5 changes: 2 additions & 3 deletions tests/testthat/test-read_slf_individual.R
@@ -15,9 +15,8 @@ test_that("Reads individual file correctly", {
# Test for the correct number of rows
expect_equal(nrow(indiv_file), 100)

# Need to come back to this test - some files have different lengths
# # Test for correct number of variables (will need updating)
# expect_length(indiv_file, 184)
# Test for correct number of variables (will need updating)
expect_length(indiv_file, 193)
}
})

55 changes: 55 additions & 0 deletions tests/testthat/test-tidyselect_columns.R
@@ -0,0 +1,55 @@
skip_on_ci()


test_that("tidyselect helpers work for column selection in the episode file", {
expect_named(
read_slf_episode("1920", col_select = dplyr::starts_with("dd")),
c("dd_responsible_lca", "dd_quality")
)
expect_named(
read_slf_episode("1920", col_select = c("year", dplyr::starts_with("dd"))),
c("year", "dd_responsible_lca", "dd_quality")
)
expect_named(
read_slf_episode("1920", col_select = !dplyr::matches("[aeiou]"))
)
})

test_that("col_select works when columns are added", {
expect_named(
read_slf_episode("1920", col_select = "year", recids = "DD"),
"year"
)
expect_named(
read_slf_episode("1920", col_select = "year", partnerships = "S37000001"),
"year"
)
expect_named(
read_slf_episode(
"1920",
col_select = c("year", dplyr::contains("dd")),
recids = "DD"
)
)
expect_named(
read_slf_episode(
"1920",
col_select = c("year", dplyr::contains("cij")),
partnerships = "S37000001"
)
)
})

test_that("tidyselect helpers work for column selection in the individual file", {
expect_named(
read_slf_individual("1920", col_select = dplyr::starts_with("dd")),
c("dd_noncode9_episodes", "dd_noncode9_beddays", "dd_code9_episodes", "dd_code9_beddays")
)
expect_named(
read_slf_individual("1920", col_select = c("year", dplyr::starts_with("dd"))),
c("year", "dd_noncode9_episodes", "dd_noncode9_beddays", "dd_code9_episodes", "dd_code9_beddays")
)
expect_named(
read_slf_individual("1920", col_select = !dplyr::matches("[aeiou]"))
)
})
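
The text-based check that `read_slf()` applies to `col_select` in this commit can be tried in isolation. The sketch below is not the package's code: `needs_extra_column()` is a hypothetical helper written for illustration, and it assumes `rlang` and `stringr` are installed. It captures the unevaluated selection, turns it into text, and reports whether a given column name is missing from that text.

```r
library(rlang)
library(stringr)

# Hypothetical helper (not part of slfhelper): returns TRUE when
# `column` does not appear in the text of the `col_select` expression.
needs_extra_column <- function(col_select = NULL, column) {
  selection <- rlang::enquo(col_select)

  # No selection supplied means all columns are read, so nothing is missing.
  if (rlang::quo_is_null(selection)) {
    return(FALSE)
  }

  stringr::str_detect(
    rlang::quo_text(selection),
    stringr::coll(column),
    negate = TRUE
  )
}

# The selection is never evaluated, so tidyselect helpers can be
# written here without being attached.
needs_extra_column(c("year", starts_with("dd")), column = "recid")
needs_extra_column(c("year", "recid"), column = "recid")
```

Because the check works on the text of the expression, any selection whose text contains the column name, for example `dplyr::contains("recid")`, is treated as already mentioning that column.
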
98 changes: 98 additions & 0 deletions vignettes/slf-documentation.Rmd
@@ -0,0 +1,98 @@
---
title: "slf-documentation"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{slf-documentation}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```

```{r setup, include = FALSE}
library(slfhelper)
```

## SLFhelper

`SLFhelper` contains easy-to-use functions designed to make working with the Source Linkage Files (SLFs) as efficient as possible.

### Filter functions:

- `year` selects the financial year of interest. You can also select multiple years using `c("1718", "1819", "1920")`.
- `recids` selects the recids of interest. This is useful when an analysis only needs specific recids.
- `partnerships` selects the partnerships of interest. Selecting certain partnerships reduces the size of the SLF data read in.
- `col_select` selects the columns of interest. This is the best way to reduce the size of the SLF data read in.

### Data snippets:

- `ep_file_vars` returns a list of all variables in the episode files.
- `indiv_file_vars` returns a list of all variables in the individual files.
- `partnerships` returns a list of partnership names (HSCP_2018 codes)
- `recid` returns a list of all recids available in the SLFs.
- `ep_file_bedday_vars` returns a list of all bedday related variables in the SLFs.
- `ep_file_cost_vars` returns a list of all cost related variables in the SLFs.

### Anon CHI

- Use the function `get_chi()` to easily switch `anon_chi` to `chi`.
- Use the function `get_anon_chi()` to easily switch `chi` to `anon_chi`.

### Memory usage in SLFs

When working with the Source Linkage Files (SLFs), use the selection features of the SLFhelper package to keep memory usage in Posit Workbench as low as possible. See the [PHS Data Science Knowledge Base](https://public-health-scotland.github.io/knowledge-base/docs/Posit%20Infrastructure?doc=Memory%20Usage%20in%20SMR01.md) for further guidance on memory usage in Posit Workbench.

Reading a full SLF can be time-consuming and uses a lot of resources on Posit Workbench. The episode file has `r length(slfhelper::ep_file_vars)` variables and around 12 million rows in each file, while the individual file has `r length(slfhelper::indiv_file_vars)` variables and around 6 million rows in each file. Using the selections available in SLFhelper reduces the size of the data read in for analysis and frees up resources in Posit Workbench.

The tables below show the memory usage of each full size SLF.

#### Episode File

| Year | Memory Usage (GiB) |
|------|:------------------:|
| 1718 | 22 |
| 1819 | 22 |
| 1920 | 22 |
| 2021 | 19 |
| 2122 | 21 |
| 2223 | 21 |
| 2324 | 18 |

#### Individual File

| Year | Memory Usage (GiB) |
|------|:------------------:|
| 1718 | 6.8 |
| 1819 | 6.8 |
| 1920 | 7.0 |
| 2021 | 7.0 |
| 2122 | 7.0 |
| 2223 | 7.1 |
| 2324 | 5.1 |

Using the selection features in SLFhelper reduces the session memory required. A one-year episode file has `r length(slfhelper::ep_file_vars)` columns and takes around 20 GiB, so on average a column with all rows takes around 0.1 GiB, which gives a rough estimate of the session memory needed. Taking year 1920 as an example, the following tables show the memory used by extracts of various sizes, from 5 columns to all columns, along with the session memory recommended for working with that data. Keep in mind that these are only recommendations: actual memory usage depends on how the data is handled and how well the data pipeline is optimised.


#### Episode File

| Column Number | Memory usage (GiB) | Session Memory Recommendation |
|---------------|:------------------:|---------------------------------------------------|
| 5 | 0.5 | 4 GiB (4096 MiB) |
| 10 | 1.4 | between 4 GiB (4096 MiB) and 8 GiB (8192 MiB) |
| 50 | 5.1 | between 8 GiB (8192 MiB) and 16 GiB (16384 MiB) |
| 150 | 13 | between 20 GiB (20480 MiB) and 38 GiB (38912 MiB) |
| 251 | 22 | between 32 GiB (32768 MiB) and 64 GiB (65536 MiB) |

#### Individual File

| Column Number | Memory usage (GiB) | Session Memory Recommendation |
|---------------|:------------------:|---------------------------------------------------|
| 5 | 0.7 | 4 GiB (4096 MiB) |
| 10 | 0.8 | 4 GiB (4096 MiB) |
| 50 | 2.2 | between 4 GiB (4096 MiB) and 8 GiB (8192 MiB) |
| 150 | 5.5 | between 8 GiB (8192 MiB) and 16 GiB (16384 MiB) |
| 193 | 7.0 | between 11 GiB (11264 MiB) and 21 GiB (21504 MiB) |
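
The per-column rule of thumb described above can be written as a small calculation. This is a rough sketch only: `estimate_extract_gib()` is a hypothetical helper, not part of SLFhelper, and it assumes memory scales linearly with the number of columns read, which ignores differences between column types.

```r
# Hypothetical helper: rough in-memory size (GiB) of an extract,
# assuming every column takes an equal share of the full file.
estimate_extract_gib <- function(n_cols, full_file_gib, total_cols) {
  stopifnot(n_cols >= 0, n_cols <= total_cols)
  n_cols * full_file_gib / total_cols
}

# Figures for the 1920 episode file taken from the tables above:
# 251 columns and about 22 GiB when read in full.
estimate_extract_gib(n_cols = 50, full_file_gib = 22, total_cols = 251)
```

For a 50-column extract this gives roughly 4.4 GiB, somewhat below the 5.1 GiB measured in the episode file table, so the estimate is best treated as a starting point and the session memory recommendations in the tables take precedence.
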