Commit

small vignette updates

TNRiley committed Aug 23, 2024
1 parent 7bb6b90 commit 5f1adcf
Showing 1 changed file with 16 additions and 13 deletions.
29 changes: 16 additions & 13 deletions vignettes/citesource_new_benchmark_testing.Rmd
@@ -28,7 +28,7 @@ When estimating the comprehensiveness of a search, researchers often compile a l

This vignette will provide an example of how CiteSource can be used to speed up the process of benchmarking especially when comparing variations of search strings or search strategies.

-## 1. Installation of packages and loading libraries
+## 1. Install and load CiteSource

Use the following code to install CiteSource. Currently, CiteSource lives on GitHub, so you may need to first install the remotes package.
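
If CiteSource is not yet installed, a minimal install sketch looks like this (the GitHub repository path is an assumption; check the CiteSource README for the current location):

```{r, eval=FALSE}
# Install remotes if needed, then install CiteSource from GitHub
# (repository path is assumed; see the CiteSource README)
if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
remotes::install_github("ESHackathon/CiteSource")
```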

@@ -43,7 +43,7 @@ Use the following code to install CiteSource. Currently, CiteSource lives on Git
#Load the CiteSource
library(CiteSource)
```
-## 2. Import and tag citation files with custom metadata
+## 2. Import citation files

Users can import multiple .ris or .bib files into CiteSource and label each file with source information, such as the database or platform it was retrieved from. In this case we are uploading the results from five different search strings, which were run in Web of Science.

@@ -56,7 +56,7 @@ citation_files <- list.files(path = file_path, pattern = "\\.ris", full.names =
citation_files
```
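
A self-contained sketch of this step might look like the following; the folder path is illustrative and should point at wherever your exported .ris files live:

```{r, eval=FALSE}
# Folder holding the exported .ris files (path is illustrative)
file_path <- "data/benchmark_testing/"

# Collect the full paths of every .ris file in that folder
citation_files <- list.files(path = file_path, pattern = "\\.ris", full.names = TRUE)
citation_files
```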

-### Tagging citation files
+## 3. Assign custom metadata
In this example the benchmark file takes an NA for cite_source, while the search files are tagged with search 1, search 2, etc. The cite_label for the search files is tagged as search, while benchmark is used for the benchmark file. In later vignettes you will see how the label can also be used to tag records retained after screening and citations that were included in the final synthesis.
```{r}
# Create a tibble that contains metadata about the citation files
@@ -73,13 +73,13 @@ imported_tbl <- tibble::tribble(
dplyr::mutate(files = paste0(file_path, files))
# Save the imported citations as raw_citations
-raw_citations <- read_citations(metadata = imported_tbl)
+raw_citations <- read_citations(metadata = imported_tbl, verbose = FALSE)
```
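
As a self-contained sketch, a metadata table for this kind of setup might look like the one below; the file names and source tags are made up, and the column names follow the pattern used elsewhere in CiteSource (adjust them if your installed version expects different names):

```{r, eval=FALSE}
library(dplyr) # for the pipe and mutate()

# Illustrative metadata: file names, source tags, and labels are made up
imported_tbl <- tibble::tribble(
  ~files,          ~cite_sources, ~cite_labels,
  "search1.ris",   "search 1",    "search",
  "search2.ris",   "search 2",    "search",
  "benchmark.ris", NA,            "benchmark" # cite_source could instead be tagged "benchmark" if you want it shown as its own source
) %>%
  mutate(files = paste0(file_path, files))

# Read the files and attach the metadata to each record
raw_citations <- read_citations(metadata = imported_tbl, verbose = FALSE)
```
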
-## 3. Deduplication and source information
+## 4. Deduplicate & create data tables

CiteSource allows users to merge duplicates while maintaining information in the cite_source metadata field. Thus, information about the origin of the records is not lost in the deduplication process. The next few steps produce the dataframes that we can use in subsequent analyses, along with a summary of records from each source.

-```{r, results = FALSE, message=FALSE, warning=FALSE}
+```{r}
#Deduplicate citations. This yields a dataframe of all records with duplicates merged, but the originating source information maintained in a new variable called cite_source.
unique_citations <- dedup_citations(raw_citations)
@@ -88,15 +88,21 @@ n_unique <- count_unique(unique_citations)
#For each unique citation, determine which sources were present
source_comparison <- compare_sources(unique_citations, comp_type = "sources")
```
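
As an optional sanity check at this point, you can compare record counts before and after deduplication; this is a sketch, and it assumes the merged cite_source field is a comma-separated string of source tags:

```{r, eval=FALSE}
# Records before and after merging duplicates
nrow(raw_citations)
nrow(unique_citations)

# Which source combinations occur, and how often (cite_source holds the merged tags)
table(unique_citations$cite_source)
```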

## 5. Review internal duplication

Once we have imported the files, added custom metadata, and identified duplicates, it can be helpful to review the initial record counts to make sure everything looks right. As part of the deduplication process, duplicate records may have been identified within individual sources. The initial record table provides a count of how many records were initially in each source file, along with a count of distinct records, which will differ if any duplicates were found within that source file.

```{r}
#Initial upload/post internal deduplication table creation
initial_records_search <- calculate_initial_records(unique_citations)
initial_record_table_search <- create_initial_record_table(initial_records_search)
initial_record_table_search
```

-## 4. Upset plot to compare discovery of benchmarking articles
+## 6. Compare overlap with an upset plot

An upset plot is useful for visualizing overlap across multiple sources and provides detail about the number of shared and unique records. Using this data, we'll outline a few potential uses for benchmark testing a search.

@@ -105,7 +111,6 @@ We have uploaded 15 benchmarking articles. Of these 15 articles, the upset plot
```{r, fig.alt="An upset plot visualizing the overlap of benchmarking articles found across five search strategies. The plot highlights that nine articles were identified by all five searches, while four benchmarking articles were missed entirely. Additional columns show the number of articles shared across different combinations of search strategies."}
#Generate a source comparison upset plot.
plot_source_overlap_upset(source_comparison, decreasing = c(TRUE, TRUE))
```

Looking at the first column, we see that 9 benchmarking articles were found by every search. One may hypothesize that the 157 citations in the second column contain a high number of relevant articles, because they too were discovered by all five searches. A researcher interested in building a larger group of benchmarking articles may want to review these articles first.
@@ -116,18 +121,16 @@ Another decision in this case may be to drop search #2 and #3 as each of these s

Finally, as we'll see in the next step, we can closely examine the four articles that weren't found by any search approach. This will help us adjust our search to better capture relevant studies.

-## 5. Reviewing the record table
+## 7. Compare overlap with a record-level table
The record-level table is helpful for reviewing which citations were found in each database, as well as for quickly checking which benchmarking articles were not found by any of the searches.

```{r}
unique_citations %>%
dplyr::filter(stringr::str_detect(cite_source, "benchmark")) %>%
record_level_table(return = "DT")
```
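
To isolate just the benchmark articles that no search retrieved, one option is to filter the merged cite_source field twice; this sketch follows the filter used in the chunk above and assumes the search records are tagged "search 1", "search 2", etc.:

```{r, eval=FALSE}
# Benchmark records whose merged cite_source contains no search tag were missed by every search
unique_citations %>%
  dplyr::filter(stringr::str_detect(cite_source, "benchmark")) %>%
  dplyr::filter(!stringr::str_detect(cite_source, "search"))
```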

-## 6. Exporting for further analysis
+## 8. Exporting for further analysis

We may want to export our deduplicated set of results (or any of our dataframes) for further analysis, or to save them in a convenient format for subsequent use. CiteSource offers a set of export functions, `export_csv()`, `export_ris()`, and `export_bib()`, that save a dataframe as a .csv, .ris, or .bib file, respectively.

@@ -137,7 +140,7 @@ You can then reimport exported files to pick up a project or analysis without ha
The `separate` argument can be used to create separate columns for `cite_source`, `cite_label`, or `cite_string` to facilitate analysis.

```{r}
-#export_csv(unique_citations, filename = "citesource_export.csv", separate = "cite_source")
+#export_csv(unique_citations, filename = "citesource_export.csv")
```

### Generate a .ris file
