Update documentation and NEWS.md #114

Merged
merged 4 commits into from
Feb 14, 2024
1 change: 1 addition & 0 deletions NAMESPACE
@@ -29,4 +29,5 @@ importFrom(dplyr,pull)
importFrom(stats,pnorm)
importFrom(stats,runif)
importFrom(utils,installed.packages)
importFrom(utils,packageVersion)
useDynLib(zoomerjoin, .registration = TRUE)
3 changes: 2 additions & 1 deletion NEWS.md
@@ -3,10 +3,11 @@
## New features

* Several performance improvements (#101, #104).
* Added support for joining based on hamming distance (#100).

## Bug fixes

* When `clean = TRUE`, strings were not coerced to lower case. This is now the
* When `clean = TRUE`, strings were not coerced to lower case. This is now the
case (#105).
* Fix argument `progress`, which didn't print anything when it was `TRUE` (#107).

2 changes: 1 addition & 1 deletion R/euclidean_logical_joins.R
@@ -1,4 +1,4 @@
#' Spatial joins Using LSH
#' Fuzzy joins for Euclidean distance using Locality Sensitive Hashing
#'
#' @inheritParams jaccard_left_join
#' @param threshold The distance threshold below which units should be
70 changes: 37 additions & 33 deletions R/hamming_logical_joins.R
@@ -1,4 +1,4 @@
#' Fuzzy inner-join using minihashing
#' Fuzzy joins for Hamming distance using Locality Sensitive Hashing
#'
#' Find similar rows between two tables using the hamming distance. The hamming
#' distance is equal to the number of characters two strings differ by, or is
@@ -39,7 +39,39 @@
#' to adhere to the same standards as the dplyr-joins, and uses the same
#' logical joining patterns (i.e. inner-join joins and keeps only observations in both datasets).
#'
#' @rdname hamming-joins
#' @export
#' @examples
#' # load baby names data
#' # install.packages("babynames")
#' library(babynames)
#'
#' baby_names <- data.frame(name = tolower(unique(babynames$name))[1:500])
#' baby_names_mispelled <- data.frame(
#' name_mispelled = gsub("[aeiouy]", "x", baby_names$name)
#' )
#'
#' # Run the join and only keep rows that have a match:
#' hamming_inner_join(
#' baby_names,
#' baby_names_mispelled,
#' by = c("name" = "name_mispelled"),
#' threshold = 3,
#' n_bands = 150,
#' band_width = 10,
#' clean = FALSE # default
#' )
#'
#' # Run the join and keep all rows from the first dataset, regardless of whether
#' # they have a match:
#' hamming_left_join(
#' baby_names,
#' baby_names_mispelled,
#' by = c("name" = "name_mispelled"),
#' threshold = 3,
#' n_bands = 150,
#' band_width = 10
#' )
hamming_inner_join <- function(a, b,
by = NULL,
n_bands = 100,
@@ -58,14 +90,7 @@ hamming_inner_join <- function(a, b,
clean=clean)
}
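
For readers unfamiliar with the metric, the Hamming distance described in the roxygen block above can be illustrated in base R. This is a minimal sketch, not zoomerjoin's implementation: `hamming_distance()` is a hypothetical helper, and it assumes two single strings of equal length, since the visible part of the diff does not show how strings of different lengths are treated.

```r
# Hypothetical helper for illustration only; not part of zoomerjoin.
# Counts the positions at which two equal-length strings differ.
hamming_distance <- function(x, y) {
  stopifnot(
    is.character(x), length(x) == 1,
    is.character(y), length(y) == 1,
    nchar(x) == nchar(y) # assumption: only equal-length inputs are handled
  )
  # Split each string into single characters and compare position by position.
  x_chars <- strsplit(x, "")[[1]]
  y_chars <- strsplit(y, "")[[1]]
  sum(x_chars != y_chars)
}

# Same vowel substitution as in the roxygen example: "mary" becomes "mxrx".
misspelled <- gsub("[aeiouy]", "x", "mary")
hamming_distance("mary", misspelled) # 2
```

A pair like this one, at distance 2, falls below the `threshold = 3` used in the examples above.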

#' Fuzzy anti-join using minihashing
#'
#' @inheritParams hamming_inner_join
#'
#' @return a tibble fuzzily-joined on the basis of the variables in `by.` Tries
#' to adhere to the same standards as the dplyr-joins, and uses the same
#' logical joining patterns (i.e. inner-join joins and keeps only observations in both datasets).
#'
#' @rdname hamming-joins
#' @export
hamming_anti_join <- function(a, b,
by = NULL,
@@ -85,14 +110,7 @@ hamming_anti_join <- function(a, b,
clean=clean)
}

#' Fuzzy left-join using minihashing
#'
#' @inheritParams hamming_inner_join
#'
#' @return a tibble fuzzily-joined on the basis of the variables in `by.` Tries
#' to adhere to the same standards as the dplyr-joins, and uses the same
#' logical joining patterns (i.e. inner-join joins and keeps only observations in both datasets).
#'
#' @rdname hamming-joins
#' @export
hamming_left_join <- function(a, b,
by = NULL,
@@ -112,14 +130,7 @@ hamming_left_join <- function(a, b,
clean=clean)
}

#' Fuzzy left-join using minihashing
#'
#' @inheritParams hamming_inner_join
#'
#' @return a tibble fuzzily-joined on the basis of the variables in `by.` Tries
#' to adhere to the same standards as the dplyr-joins, and uses the same
#' logical joining patterns (i.e. inner-join joins and keeps only observations in both datasets).
#'
#' @rdname hamming-joins
#' @export
hamming_right_join <- function(a, b,
by = NULL,
@@ -140,14 +151,7 @@ hamming_right_join <- function(a, b,
}


#' Fuzzy full-join using minihashing
#'
#' @inheritParams hamming_inner_join
#'
#' @return a tibble fuzzily-joined on the basis of the variables in `by.` Tries
#' to adhere to the same standards as the dplyr-joins, and uses the same
#' logical joining patterns (i.e. inner-join joins and keeps only observations in both datasets).
#'
#' @rdname hamming-joins
#' @export
hamming_full_join <- function(a, b,
by = NULL,
2 changes: 1 addition & 1 deletion R/jaccard_logical_joins.R
@@ -1,4 +1,4 @@
#' Fuzzy joins using minihashing
#' Fuzzy joins for Jaccard distance using MinHash
#'
#' @param a,b The two dataframes to join.
#'
2 changes: 1 addition & 1 deletion R/string_group.R
@@ -50,7 +50,7 @@
#'
#' @export
#' @importFrom stats runif
#' @importFrom utils installed.packages
#' @importFrom utils installed.packages packageVersion
jaccard_string_group <- function(string, n_gram_width = 2, n_bands = 45, band_width = 8, threshold = .7, progress = FALSE) {
if (system.file(package = "igraph") == "") {
stop("library 'igraph' must be installed to run this function")
26 changes: 13 additions & 13 deletions man/em_link.Rd

4 changes: 2 additions & 2 deletions man/euclidean-joins.Rd

86 changes: 85 additions & 1 deletion man/hamming_inner_join.Rd → man/hamming-joins.Rd

57 changes: 0 additions & 57 deletions man/hamming_anti_join.Rd

This file was deleted.
