Commit

Merge pull request #114 from beniaminogreen/documentation_fixes
Update documentation and NEWS.md
beniaminogreen authored Feb 14, 2024
2 parents aeb3eac + 55651ad commit fe41c2b
Showing 18 changed files with 156 additions and 290 deletions.
1 change: 1 addition & 0 deletions NAMESPACE
@@ -29,4 +29,5 @@ importFrom(dplyr,pull)
importFrom(stats,pnorm)
importFrom(stats,runif)
importFrom(utils,installed.packages)
+importFrom(utils,packageVersion)
useDynLib(zoomerjoin, .registration = TRUE)
3 changes: 2 additions & 1 deletion NEWS.md
@@ -3,10 +3,11 @@
## New features

* Several performance improvements (#101, #104).
+* Added support for joining based on hamming distance (#100).

## Bug fixes

-* When `clean = TRUE`, strings were not coerced to lower case. This is now the
+* When `clean = TRUE`, strings were not coerced to lower case. This is now the
case (#105).
* Fix argument `progress`, which didn't print anything when it was `TRUE` (#107).

2 changes: 1 addition & 1 deletion R/euclidean_logical_joins.R
@@ -1,4 +1,4 @@
-#' Spatial joins Using LSH
+#' Fuzzy joins for Euclidean distance using Locality Sensitive Hashing
#'
#' @inheritParams jaccard_left_join
#' @param threshold The distance threshold below which units should be
70 changes: 37 additions & 33 deletions R/hamming_logical_joins.R
@@ -1,4 +1,4 @@
-#' Fuzzy inner-join using minihashing
+#' Fuzzy joins for Hamming distance using Locality Sensitive Hashing
#'
#' Find similar rows between two tables using the Hamming distance. The Hamming
#' distance is equal to the number of characters two strings differ by, or is
@@ -39,7 +39,39 @@
#' to adhere to the same standards as the dplyr-joins, and uses the same
#' logical joining patterns (i.e. inner-join joins and keeps only observations in both datasets).
#'
+#' @rdname hamming-joins
#' @export
+#' @examples
+#' # load baby names data
+#' # install.packages("babynames")
+#' library(babynames)
+#'
+#' baby_names <- data.frame(name = tolower(unique(babynames$name))[1:500])
+#' baby_names_mispelled <- data.frame(
+#' name_mispelled = gsub("[aeiouy]", "x", baby_names$name)
+#' )
+#'
+#' # Run the join and only keep rows that have a match:
+#' hamming_inner_join(
+#' baby_names,
+#' baby_names_mispelled,
+#' by = c("name" = "name_mispelled"),
+#' threshold = 3,
+#' n_bands = 150,
+#' band_width = 10,
+#' clean = FALSE # default
+#' )
+#'
+#' # Run the join and keep all rows from the first dataset, regardless of whether
+#' # they have a match:
+#' hamming_left_join(
+#' baby_names,
+#' baby_names_mispelled,
+#' by = c("name" = "name_mispelled"),
+#' threshold = 3,
+#' n_bands = 150,
+#' band_width = 10
+#' )
hamming_inner_join <- function(a, b,
by = NULL,
n_bands = 100,
@@ -58,14 +90,7 @@ hamming_inner_join <- function(a, b,
clean=clean)
}

-#' Fuzzy anti-join using minihashing
-#'
-#' @inheritParams hamming_inner_join
-#'
-#' @return a tibble fuzzily-joined on the basis of the variables in `by.` Tries
-#' to adhere to the same standards as the dplyr-joins, and uses the same
-#' logical joining patterns (i.e. inner-join joins and keeps only observations in both datasets).
-#'
+#' @rdname hamming-joins
#' @export
hamming_anti_join <- function(a, b,
by = NULL,
@@ -85,14 +110,7 @@ hamming_anti_join <- function(a, b,
clean=clean)
}

-#' Fuzzy left-join using minihashing
-#'
-#' @inheritParams hamming_inner_join
-#'
-#' @return a tibble fuzzily-joined on the basis of the variables in `by.` Tries
-#' to adhere to the same standards as the dplyr-joins, and uses the same
-#' logical joining patterns (i.e. inner-join joins and keeps only observations in both datasets).
-#'
+#' @rdname hamming-joins
#' @export
hamming_left_join <- function(a, b,
by = NULL,
@@ -112,14 +130,7 @@ hamming_left_join <- function(a, b,
clean=clean)
}

-#' Fuzzy left-join using minihashing
-#'
-#' @inheritParams hamming_inner_join
-#'
-#' @return a tibble fuzzily-joined on the basis of the variables in `by.` Tries
-#' to adhere to the same standards as the dplyr-joins, and uses the same
-#' logical joining patterns (i.e. inner-join joins and keeps only observations in both datasets).
-#'
+#' @rdname hamming-joins
#' @export
hamming_right_join <- function(a, b,
by = NULL,
@@ -140,14 +151,7 @@ hamming_right_join <- function(a, b,
}


-#' Fuzzy full-join using minihashing
-#'
-#' @inheritParams hamming_inner_join
-#'
-#' @return a tibble fuzzily-joined on the basis of the variables in `by.` Tries
-#' to adhere to the same standards as the dplyr-joins, and uses the same
-#' logical joining patterns (i.e. inner-join joins and keeps only observations in both datasets).
-#'
+#' @rdname hamming-joins
#' @export
hamming_full_join <- function(a, b,
by = NULL,
2 changes: 1 addition & 1 deletion R/jaccard_logical_joins.R
@@ -1,4 +1,4 @@
-#' Fuzzy joins using minihashing
+#' Fuzzy joins for Jaccard distance using MinHash
#'
#' @param a,b The two dataframes to join.
#'
2 changes: 1 addition & 1 deletion R/string_group.R
@@ -50,7 +50,7 @@
#'
#' @export
#' @importFrom stats runif
-#' @importFrom utils installed.packages
+#' @importFrom utils installed.packages packageVersion
jaccard_string_group <- function(string, n_gram_width = 2, n_bands = 45, band_width = 8, threshold = .7, progress = FALSE) {
if (system.file(package = "igraph") == "") {
stop("library 'igraph' must be installed to run this function")
26 changes: 13 additions & 13 deletions man/em_link.Rd

4 changes: 2 additions & 2 deletions man/euclidean-joins.Rd

86 changes: 85 additions & 1 deletion man/hamming_inner_join.Rd → man/hamming-joins.Rd

57 changes: 0 additions & 57 deletions man/hamming_anti_join.Rd

This file was deleted.
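
The documentation in R/hamming_logical_joins.R describes the Hamming distance as the number of characters by which two strings differ. As a rough illustration of that definition only, here is a minimal base-R sketch. The helper `hamming_distance` is hypothetical and not part of zoomerjoin, and stopping on unequal-length input is an assumption of this sketch rather than the package's documented behaviour.

```r
# Hypothetical helper (not part of zoomerjoin): counts the positions at which
# two single strings of equal length differ.
hamming_distance <- function(x, y) {
  stopifnot(
    is.character(x), is.character(y),
    length(x) == 1, length(y) == 1,
    nchar(x) == nchar(y) # assumption: this sketch only handles equal lengths
  )
  # Split each string into individual characters
  x_chars <- strsplit(x, "")[[1]]
  y_chars <- strsplit(y, "")[[1]]
  # Count mismatching positions
  sum(x_chars != y_chars)
}

# "mary" with its vowels (and "y") replaced by "x", as in the roxygen example
hamming_distance("mary", "mxrx") # 2
```

Under this definition, the `threshold = 3` used in the roxygen examples would correspond to pairs that differ in at most a few character positions. The joins themselves rely on Locality Sensitive Hashing rather than on pairwise comparison like this.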
