diff --git a/articles/guided_tour.html b/articles/guided_tour.html
index a93edd6..0d04954 100644
--- a/articles/guided_tour.html
+++ b/articles/guided_tour.html
@@ -190,20 +190,20 @@ <h2 id="basic-syntax">Basic Syntax:<a class="anchor" aria-label="anchor" href="#
 <span>  n_bands <span class="op">=</span> <span class="fl">20</span>, band_width <span class="op">=</span> <span class="fl">6</span>, threshold <span class="op">=</span> <span class="fl">.8</span></span>
 <span><span class="op">)</span></span>
 <span><span class="fu"><a href="https://rdrr.io/r/base/print.html" class="external-link">print</a></span><span class="op">(</span><span class="fu"><a href="https://rdrr.io/r/base/Sys.time.html" class="external-link">Sys.time</a></span><span class="op">(</span><span class="op">)</span> <span class="op">-</span> <span class="va">start_time</span><span class="op">)</span></span></code></pre></div>
-<pre><code><span><span class="co">## Time difference of 0.01206851 secs</span></span></code></pre>
+<pre><code><span><span class="co">## Time difference of 0.01161695 secs</span></span></code></pre>
 <div class="sourceCode" id="cb9"><pre class="downlit sourceCode r">
 <code class="sourceCode R"><span><span class="fu"><a href="https://rdrr.io/r/base/print.html" class="external-link">print</a></span><span class="op">(</span><span class="va">join_out</span><span class="op">)</span></span></code></pre></div>
 <pre><code><span><span class="co">## <span style="color: #949494;"># A tibble: 8 × 4</span></span></span>
 <span><span class="co">##       a field.x                                                      b field.y  </span></span>
 <span><span class="co">##   <span style="color: #949494; font-style: italic;">&lt;dbl&gt;</span> <span style="color: #949494; font-style: italic;">&lt;chr&gt;</span>                                                    <span style="color: #949494; font-style: italic;">&lt;dbl&gt;</span> <span style="color: #949494; font-style: italic;">&lt;chr&gt;</span>    </span></span>
-<span><span class="co">## <span style="color: #BCBCBC;">1</span>    88 scheuer for congress 1980                                  667 scheuer …</span></span>
-<span><span class="co">## <span style="color: #BCBCBC;">2</span>   292 bill bradley for u s senate '84                            913 bill bra…</span></span>
-<span><span class="co">## <span style="color: #BCBCBC;">3</span>   378 guarini for congress 1982                                  883 guarini …</span></span>
-<span><span class="co">## <span style="color: #BCBCBC;">4</span>   238 4th congressional district democratic party                518 16th con…</span></span>
-<span><span class="co">## <span style="color: #BCBCBC;">5</span>   302 americans for good government inc                          910 american…</span></span>
-<span><span class="co">## <span style="color: #BCBCBC;">6</span>   230 pipefitters local union 524                                998 pipefitt…</span></span>
-<span><span class="co">## <span style="color: #BCBCBC;">7</span>   319 7th congressional district democratic party of wisconsin   792 8th cong…</span></span>
-<span><span class="co">## <span style="color: #BCBCBC;">8</span>   378 guarini for congress 1982                                  606 guarini …</span></span></code></pre>
+<span><span class="co">## <span style="color: #BCBCBC;">1</span>   378 guarini for congress 1982                                  606 guarini …</span></span>
+<span><span class="co">## <span style="color: #BCBCBC;">2</span>   378 guarini for congress 1982                                  883 guarini …</span></span>
+<span><span class="co">## <span style="color: #BCBCBC;">3</span>   238 4th congressional district democratic party                518 16th con…</span></span>
+<span><span class="co">## <span style="color: #BCBCBC;">4</span>    88 scheuer for congress 1980                                  667 scheuer …</span></span>
+<span><span class="co">## <span style="color: #BCBCBC;">5</span>   230 pipefitters local union 524                                998 pipefitt…</span></span>
+<span><span class="co">## <span style="color: #BCBCBC;">6</span>   302 americans for good government inc                          910 american…</span></span>
+<span><span class="co">## <span style="color: #BCBCBC;">7</span>   292 bill bradley for u s senate '84                            913 bill bra…</span></span>
+<span><span class="co">## <span style="color: #BCBCBC;">8</span>   319 7th congressional district democratic party of wisconsin   792 8th cong…</span></span></code></pre>
 <p>The first two arguments, <code>a</code>, and <code>b</code>, are
 direct analogues of the <code>dplyr</code> arguments, and are the two
 data frames you want to join. The <code>by</code> field also acts the
diff --git a/articles/matching_vectors.html b/articles/matching_vectors.html
index af0d7a0..f92e658 100644
--- a/articles/matching_vectors.html
+++ b/articles/matching_vectors.html
@@ -169,7 +169,7 @@ <h2 id="demonstration">Demonstration<a class="anchor" aria-label="anchor" href="
 <span><span class="va">n_matches</span> <span class="op">&lt;-</span> <span class="fu"><a href="https://rdrr.io/r/base/nrow.html" class="external-link">nrow</a></span><span class="op">(</span><span class="va">joined_out</span><span class="op">)</span></span>
 <span><span class="va">time_taken</span> <span class="op">&lt;-</span> <span class="fu"><a href="https://rdrr.io/r/base/Sys.time.html" class="external-link">Sys.time</a></span><span class="op">(</span><span class="op">)</span> <span class="op">-</span> <span class="va">start</span></span>
 <span><span class="fu"><a href="https://rdrr.io/r/base/print.html" class="external-link">print</a></span><span class="op">(</span><span class="fu"><a href="https://rdrr.io/r/base/paste.html" class="external-link">paste</a></span><span class="op">(</span><span class="st">"found"</span>, <span class="va">n_matches</span>, <span class="st">"matches in"</span>, <span class="fu"><a href="https://rdrr.io/r/base/Round.html" class="external-link">round</a></span><span class="op">(</span><span class="va">time_taken</span><span class="op">)</span>, <span class="st">"seconds"</span><span class="op">)</span><span class="op">)</span></span>
-<span><span class="co">#&gt; [1] "found 100000 matches in 20 seconds"</span></span></code></pre></div>
+<span><span class="co">#&gt; [1] "found 100000 matches in 16 seconds"</span></span></code></pre></div>
 <p>Zoomerjoin is able to easily find all pairs in just under 30s
 (perhaps longer on the runner that renders the website), even though the
 points lie in high-dimensional (d=100) space. This makes zoomerjoin a
diff --git a/pkgdown.yml b/pkgdown.yml
index a2517a7..0d8a58c 100644
--- a/pkgdown.yml
+++ b/pkgdown.yml
@@ -5,7 +5,7 @@ articles:
   benchmarks: benchmarks.html
   guided_tour: guided_tour.html
   matching_vectors: matching_vectors.html
-last_built: 2024-06-03T14:59Z
+last_built: 2024-07-01T01:36Z
 urls:
   reference: https://beniaminogreen.github.io/zoomerjoin/reference
   article: https://beniaminogreen.github.io/zoomerjoin/articles
diff --git a/reference/euclidean-joins.html b/reference/euclidean-joins.html
index 5bb3029..782406d 100644
--- a/reference/euclidean-joins.html
+++ b/reference/euclidean-joins.html
@@ -202,29 +202,29 @@ <h2 id="ref-examples">Examples<a class="anchor" aria-label="anchor" href="#ref-e
 <span class="r-in"><span><span class="fu">euclidean_inner_join</span><span class="op">(</span><span class="va">X_1</span>, <span class="va">X_2</span>, by <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/c.html" class="external-link">c</a></span><span class="op">(</span><span class="st">"V1"</span>, <span class="st">"V2"</span><span class="op">)</span>, threshold <span class="op">=</span> <span class="fl">.00005</span><span class="op">)</span></span></span>
 <span class="r-out co"><span class="r-pr">#&gt;</span>         V1.x      V2.x id_1      V1.y      V2.y id_2</span>
 <span class="r-out co"><span class="r-pr">#&gt;</span> 1  0.7777778 0.7777778    8 0.7777779 0.7777779    8</span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> 2  0.3333333 0.3333333    4 0.3333334 0.3333334    4</span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> 3  0.6666667 0.6666667    7 0.6666668 0.6666668    7</span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> 4  0.1111111 0.1111111    2 0.1111112 0.1111112    2</span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> 5  1.0000000 1.0000000   10 1.0000001 1.0000001   10</span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> 6  0.2222222 0.2222222    3 0.2222223 0.2222223    3</span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> 7  0.4444444 0.4444444    5 0.4444445 0.4444445    5</span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> 8  0.8888889 0.8888889    9 0.8888890 0.8888890    9</span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> 9  0.5555556 0.5555556    6 0.5555557 0.5555557    6</span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> 10 0.0000000 0.0000000    1 0.0000001 0.0000001    1</span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> 2  0.2222222 0.2222222    3 0.2222223 0.2222223    3</span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> 3  0.4444444 0.4444444    5 0.4444445 0.4444445    5</span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> 4  0.5555556 0.5555556    6 0.5555557 0.5555557    6</span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> 5  0.3333333 0.3333333    4 0.3333334 0.3333334    4</span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> 6  0.8888889 0.8888889    9 0.8888890 0.8888890    9</span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> 7  1.0000000 1.0000000   10 1.0000001 1.0000001   10</span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> 8  0.0000000 0.0000000    1 0.0000001 0.0000001    1</span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> 9  0.1111111 0.1111111    2 0.1111112 0.1111112    2</span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> 10 0.6666667 0.6666667    7 0.6666668 0.6666668    7</span>
 <span class="r-in"><span></span></span>
 <span class="r-in"><span><span class="co"># keep all observations from X_1, regardless of whether they have a match</span></span></span>
 <span class="r-in"><span><span class="fu">euclidean_inner_join</span><span class="op">(</span><span class="va">X_1</span>, <span class="va">X_2</span>, by <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/c.html" class="external-link">c</a></span><span class="op">(</span><span class="st">"V1"</span>, <span class="st">"V2"</span><span class="op">)</span>, threshold <span class="op">=</span> <span class="fl">.00005</span><span class="op">)</span></span></span>
 <span class="r-out co"><span class="r-pr">#&gt;</span>         V1.x      V2.x id_1      V1.y      V2.y id_2</span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> 1  0.3333333 0.3333333    4 0.3333334 0.3333334    4</span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> 2  1.0000000 1.0000000   10 1.0000001 1.0000001   10</span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> 3  0.6666667 0.6666667    7 0.6666668 0.6666668    7</span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> 4  0.4444444 0.4444444    5 0.4444445 0.4444445    5</span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> 1  1.0000000 1.0000000   10 1.0000001 1.0000001   10</span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> 2  0.1111111 0.1111111    2 0.1111112 0.1111112    2</span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> 3  0.5555556 0.5555556    6 0.5555557 0.5555557    6</span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> 4  0.7777778 0.7777778    8 0.7777779 0.7777779    8</span>
 <span class="r-out co"><span class="r-pr">#&gt;</span> 5  0.0000000 0.0000000    1 0.0000001 0.0000001    1</span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> 6  0.8888889 0.8888889    9 0.8888890 0.8888890    9</span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> 7  0.5555556 0.5555556    6 0.5555557 0.5555557    6</span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> 6  0.3333333 0.3333333    4 0.3333334 0.3333334    4</span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> 7  0.4444444 0.4444444    5 0.4444445 0.4444445    5</span>
 <span class="r-out co"><span class="r-pr">#&gt;</span> 8  0.2222222 0.2222222    3 0.2222223 0.2222223    3</span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> 9  0.7777778 0.7777778    8 0.7777779 0.7777779    8</span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> 10 0.1111111 0.1111111    2 0.1111112 0.1111112    2</span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> 9  0.6666667 0.6666667    7 0.6666668 0.6666668    7</span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> 10 0.8888889 0.8888889    9 0.8888890 0.8888890    9</span>
 </code></pre></div>
     </div>
   </main><aside class="col-md-3"><nav id="toc"><h2>On this page</h2>
diff --git a/reference/hamming-joins.html b/reference/hamming-joins.html
index 929e441..548318a 100644
--- a/reference/hamming-joins.html
+++ b/reference/hamming-joins.html
@@ -209,18 +209,18 @@ <h2 id="ref-examples">Examples<a class="anchor" aria-label="anchor" href="#ref-e
 <span class="r-in"><span>  clean <span class="op">=</span> <span class="cn">FALSE</span> <span class="co"># default</span></span></span>
 <span class="r-in"><span><span class="op">)</span></span></span>
 <span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #949494;"># A tibble: 2,664 × 2</span></span>
-<span class="r-out co"><span class="r-pr">#&gt;</span>    name  name_mispelled</span>
-<span class="r-out co"><span class="r-pr">#&gt;</span>    <span style="color: #949494; font-style: italic;">&lt;chr&gt;</span> <span style="color: #949494; font-style: italic;">&lt;chr&gt;</span>         </span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 1</span> alva  xlgx          </span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 2</span> elma  clxx          </span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 3</span> eda   lxx           </span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 4</span> lila  lxzx          </span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 5</span> luna  lxnx          </span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 6</span> eula  lxlx          </span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 7</span> metta mxntx         </span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 8</span> ida   xvx           </span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 9</span> sue   xnx           </span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;">10</span> sue   sxx           </span>
+<span class="r-out co"><span class="r-pr">#&gt;</span>    name   name_mispelled</span>
+<span class="r-out co"><span class="r-pr">#&gt;</span>    <span style="color: #949494; font-style: italic;">&lt;chr&gt;</span>  <span style="color: #949494; font-style: italic;">&lt;chr&gt;</span>         </span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 1</span> oma    sxx           </span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 2</span> anna   lxnx          </span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 3</span> may    mxx           </span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 4</span> liza   lxdx          </span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 5</span> iola   xllx          </span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 6</span> helena hxlxnx        </span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 7</span> rosa   rxtx          </span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 8</span> olga   clxx          </span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 9</span> elsa   xlsx          </span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;">10</span> cecil  cxcxl         </span>
 <span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #949494;"># ℹ 2,654 more rows</span></span>
 <span class="r-in"><span></span></span>
 <span class="r-in"><span><span class="co"># Run the join and keep all rows from the first dataset, regardless of whether</span></span></span>
@@ -234,18 +234,18 @@ <h2 id="ref-examples">Examples<a class="anchor" aria-label="anchor" href="#ref-e
 <span class="r-in"><span>  band_width <span class="op">=</span> <span class="fl">10</span>,</span></span>
 <span class="r-in"><span><span class="op">)</span></span></span>
 <span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #949494;"># A tibble: 2,746 × 2</span></span>
-<span class="r-out co"><span class="r-pr">#&gt;</span>    name    name_mispelled</span>
-<span class="r-out co"><span class="r-pr">#&gt;</span>    <span style="color: #949494; font-style: italic;">&lt;chr&gt;</span>   <span style="color: #949494; font-style: italic;">&lt;chr&gt;</span>         </span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 1</span> bertha  mxrthx        </span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 2</span> ellie   xllxx         </span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 3</span> vesta   vxstx         </span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 4</span> lulu    nxll          </span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 5</span> lue     xrx           </span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 6</span> cleo    xltx          </span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 7</span> rosetta rxsxttx       </span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 8</span> evie    xvxx          </span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 9</span> lou     xnx           </span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;">10</span> ann     fxx           </span>
+<span class="r-out co"><span class="r-pr">#&gt;</span>    name      name_mispelled</span>
+<span class="r-out co"><span class="r-pr">#&gt;</span>    <span style="color: #949494; font-style: italic;">&lt;chr&gt;</span>     <span style="color: #949494; font-style: italic;">&lt;chr&gt;</span>         </span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 1</span> vina      nxnx          </span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 2</span> lenna     lxndx         </span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 3</span> belle     mxllx         </span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 4</span> lois      lxxh          </span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 5</span> betty     mxttx         </span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 6</span> christina chrxstxnx     </span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 7</span> deborah   dxbxrxh       </span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 8</span> lona      lxdx          </span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 9</span> mabelle   mxbxllx       </span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;">10</span> rena      jxnx          </span>
 <span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #949494;"># ℹ 2,736 more rows</span></span>
 </code></pre></div>
     </div>
diff --git a/reference/jaccard-joins.html b/reference/jaccard-joins.html
index eb4a56f..0e240b7 100644
--- a/reference/jaccard-joins.html
+++ b/reference/jaccard-joins.html
@@ -237,19 +237,19 @@ <h2 id="ref-examples">Examples<a class="anchor" aria-label="anchor" href="#ref-e
 <span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #949494;"># A tibble: 13 × 2</span></span>
 <span class="r-out co"><span class="r-pr">#&gt;</span>    name     name_wo_vowels</span>
 <span class="r-out co"><span class="r-pr">#&gt;</span>    <span style="color: #949494; font-style: italic;">&lt;chr&gt;</span>    <span style="color: #949494; font-style: italic;">&lt;chr&gt;</span>         </span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 1</span> savannah svnnh         </span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 2</span> esther   sthr          </span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 3</span> martha   mrth          </span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 1</span> hester   hstr          </span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 2</span> blanch   blnch         </span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 3</span> frank    frnk          </span>
 <span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 4</span> esther   thrs          </span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 5</span> frank    frnk          </span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 6</span> samantha smnth         </span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 7</span> esther   hstr          </span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 8</span> hester   hstr          </span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 9</span> blanch   blnch         </span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;">10</span> frank    frnk          </span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;">11</span> hester   thrs          </span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;">12</span> hester   sthr          </span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;">13</span> blanch   blnch         </span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 5</span> hester   sthr          </span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 6</span> frank    frnk          </span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 7</span> blanch   blnch         </span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 8</span> esther   sthr          </span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 9</span> martha   mrth          </span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;">10</span> samantha smnth         </span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;">11</span> esther   hstr          </span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;">12</span> hester   thrs          </span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;">13</span> savannah svnnh         </span>
 <span class="r-in"><span></span></span>
 <span class="r-in"><span><span class="co"># Run the join and keep all rows from the first dataset, regardless of whether</span></span></span>
 <span class="r-in"><span><span class="co"># they have a match:</span></span></span>
@@ -265,16 +265,16 @@ <h2 id="ref-examples">Examples<a class="anchor" aria-label="anchor" href="#ref-e
 <span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #949494;"># A tibble: 506 × 2</span></span>
 <span class="r-out co"><span class="r-pr">#&gt;</span>    name     name_wo_vowels</span>
 <span class="r-out co"><span class="r-pr">#&gt;</span>    <span style="color: #949494; font-style: italic;">&lt;chr&gt;</span>    <span style="color: #949494; font-style: italic;">&lt;chr&gt;</span>         </span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 1</span> savannah svnnh         </span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 1</span> esther   hstr          </span>
 <span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 2</span> samantha smnth         </span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 3</span> esther   hstr          </span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 4</span> hester   sthr          </span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 5</span> hester   hstr          </span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 6</span> martha   mrth          </span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 7</span> hester   thrs          </span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 8</span> blanch   blnch         </span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 9</span> blanch   blnch         </span>
-<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;">10</span> esther   thrs          </span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 3</span> hester   hstr          </span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 4</span> frank    frnk          </span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 5</span> blanch   blnch         </span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 6</span> frank    frnk          </span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 7</span> martha   mrth          </span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 8</span> hester   thrs          </span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;"> 9</span> esther   thrs          </span>
+<span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #BCBCBC;">10</span> hester   sthr          </span>
 <span class="r-out co"><span class="r-pr">#&gt;</span> <span style="color: #949494;"># ℹ 496 more rows</span></span>
 </code></pre></div>
     </div>
diff --git a/search.json b/search.json
index 8a6e6d5..3f6a405 100644
--- a/search.json
+++ b/search.json
@@ -1 +1 @@
-[{"path":[]},{"path":"https://beniaminogreen.github.io/zoomerjoin/CONTRIBUTING.html","id":"our-pledge","dir":"","previous_headings":"","what":"Our Pledge","title":"Contributor Covenant Code of Conduct","text":"members, contributors, leaders pledge make participation community harassment-free experience everyone, regardless age, body size, visible invisible disability, ethnicity, sex characteristics, gender identity expression, level experience, education, socio-economic status, nationality, personal appearance, race, caste, color, religion, sexual identity orientation. pledge act interact ways contribute open, welcoming, diverse, inclusive, healthy community.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/CONTRIBUTING.html","id":"our-standards","dir":"","previous_headings":"","what":"Our Standards","title":"Contributor Covenant Code of Conduct","text":"Examples behavior contributes positive environment community include: Demonstrating empathy kindness toward people respectful differing opinions, viewpoints, experiences Giving gracefully accepting constructive feedback Accepting responsibility apologizing affected mistakes, learning experience Focusing best just us individuals, overall community Examples unacceptable behavior include: use sexualized language imagery, sexual attention advances kind Trolling, insulting derogatory comments, personal political attacks Public private harassment Publishing others’ private information, physical email address, without explicit permission conduct reasonably considered inappropriate professional setting","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/CONTRIBUTING.html","id":"enforcement-responsibilities","dir":"","previous_headings":"","what":"Enforcement Responsibilities","title":"Contributor Covenant Code of Conduct","text":"Community leaders responsible clarifying enforcing standards acceptable behavior take appropriate fair corrective action response behavior deem inappropriate, 
threatening, offensive, harmful. Community leaders right responsibility remove, edit, reject comments, commits, code, wiki edits, issues, contributions aligned Code Conduct, communicate reasons moderation decisions appropriate.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/CONTRIBUTING.html","id":"scope","dir":"","previous_headings":"","what":"Scope","title":"Contributor Covenant Code of Conduct","text":"Code Conduct applies within community spaces, also applies individual officially representing community public spaces. Examples representing community include using official e-mail address, posting via official social media account, acting appointed representative online offline event.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/CONTRIBUTING.html","id":"enforcement","dir":"","previous_headings":"","what":"Enforcement","title":"Contributor Covenant Code of Conduct","text":"Instances abusive, harassing, otherwise unacceptable behavior may reported community leaders responsible enforcement beniamino.green@tutanota.com. complaints reviewed investigated promptly fairly. community leaders obligated respect privacy security reporter incident.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/CONTRIBUTING.html","id":"enforcement-guidelines","dir":"","previous_headings":"","what":"Enforcement Guidelines","title":"Contributor Covenant Code of Conduct","text":"Community leaders follow Community Impact Guidelines determining consequences action deem violation Code Conduct:","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/CONTRIBUTING.html","id":"id_1-correction","dir":"","previous_headings":"Enforcement Guidelines","what":"1. Correction","title":"Contributor Covenant Code of Conduct","text":"Community Impact: Use inappropriate language behavior deemed unprofessional unwelcome community. 
Consequence: private, written warning community leaders, providing clarity around nature violation explanation behavior inappropriate. public apology may requested.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/CONTRIBUTING.html","id":"id_2-warning","dir":"","previous_headings":"Enforcement Guidelines","what":"2. Warning","title":"Contributor Covenant Code of Conduct","text":"Community Impact: violation single incident series actions. Consequence: warning consequences continued behavior. interaction people involved, including unsolicited interaction enforcing Code Conduct, specified period time. includes avoiding interactions community spaces well external channels like social media. Violating terms may lead temporary permanent ban.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/CONTRIBUTING.html","id":"id_3-temporary-ban","dir":"","previous_headings":"Enforcement Guidelines","what":"3. Temporary Ban","title":"Contributor Covenant Code of Conduct","text":"Community Impact: serious violation community standards, including sustained inappropriate behavior. Consequence: temporary ban sort interaction public communication community specified period time. public private interaction people involved, including unsolicited interaction enforcing Code Conduct, allowed period. Violating terms may lead permanent ban.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/CONTRIBUTING.html","id":"id_4-permanent-ban","dir":"","previous_headings":"Enforcement Guidelines","what":"4. Permanent Ban","title":"Contributor Covenant Code of Conduct","text":"Community Impact: Demonstrating pattern violation community standards, including sustained inappropriate behavior, harassment individual, aggression toward disparagement classes individuals. 
Consequence: permanent ban sort public interaction within community.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/CONTRIBUTING.html","id":"attribution","dir":"","previous_headings":"","what":"Attribution","title":"Contributor Covenant Code of Conduct","text":"Code Conduct adapted Contributor Covenant, version 2.1, available https://www.contributor-covenant.org/version/2/1/code_of_conduct.html. Community Impact Guidelines inspired Mozilla’s code conduct enforcement ladder. answers common questions code conduct, see FAQ https://www.contributor-covenant.org/faq. Translations available https://www.contributor-covenant.org/translations.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/LICENSE.html","id":null,"dir":"","previous_headings":"","what":"GNU General Public License","title":"GNU General Public License","text":"Version 3, 29 June 2007Copyright © 2007 Free Software Foundation, Inc. <http://fsf.org/> Everyone permitted copy distribute verbatim copies license document, changing allowed.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/LICENSE.html","id":"preamble","dir":"","previous_headings":"","what":"Preamble","title":"GNU General Public License","text":"GNU General Public License free, copyleft license software kinds works. licenses software practical works designed take away freedom share change works. contrast, GNU General Public License intended guarantee freedom share change versions program–make sure remains free software users. , Free Software Foundation, use GNU General Public License software; applies also work released way authors. can apply programs, . speak free software, referring freedom, price. General Public Licenses designed make sure freedom distribute copies free software (charge wish), receive source code can get want , can change software use pieces new free programs, know can things. protect rights, need prevent others denying rights asking surrender rights. 
Therefore, certain responsibilities distribute copies software, modify : responsibilities respect freedom others. example, distribute copies program, whether gratis fee, must pass recipients freedoms received. must make sure , , receive can get source code. must show terms know rights. Developers use GNU GPL protect rights two steps: (1) assert copyright software, (2) offer License giving legal permission copy, distribute /modify . developers’ authors’ protection, GPL clearly explains warranty free software. users’ authors’ sake, GPL requires modified versions marked changed, problems attributed erroneously authors previous versions. devices designed deny users access install run modified versions software inside , although manufacturer can . fundamentally incompatible aim protecting users’ freedom change software. systematic pattern abuse occurs area products individuals use, precisely unacceptable. Therefore, designed version GPL prohibit practice products. problems arise substantially domains, stand ready extend provision domains future versions GPL, needed protect freedom users. Finally, every program threatened constantly software patents. States allow patents restrict development use software general-purpose computers, , wish avoid special danger patents applied free program make effectively proprietary. prevent , GPL assures patents used render program non-free. precise terms conditions copying, distribution modification follow.","code":""},{"path":[]},{"path":"https://beniaminogreen.github.io/zoomerjoin/LICENSE.html","id":"id_0-definitions","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"0. Definitions","title":"GNU General Public License","text":"“License” refers version 3 GNU General Public License. “Copyright” also means copyright-like laws apply kinds works, semiconductor masks. “Program” refers copyrightable work licensed License. licensee addressed “”. “Licensees” “recipients” may individuals organizations. 
“modify” work means copy adapt part work fashion requiring copyright permission, making exact copy. resulting work called “modified version” earlier work work “based ” earlier work. “covered work” means either unmodified Program work based Program. “propagate” work means anything , without permission, make directly secondarily liable infringement applicable copyright law, except executing computer modifying private copy. Propagation includes copying, distribution (without modification), making available public, countries activities well. “convey” work means kind propagation enables parties make receive copies. Mere interaction user computer network, transfer copy, conveying. interactive user interface displays “Appropriate Legal Notices” extent includes convenient prominently visible feature (1) displays appropriate copyright notice, (2) tells user warranty work (except extent warranties provided), licensees may convey work License, view copy License. interface presents list user commands options, menu, prominent item list meets criterion.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/LICENSE.html","id":"id_1-source-code","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"1. Source Code","title":"GNU General Public License","text":"“source code” work means preferred form work making modifications . “Object code” means non-source form work. “Standard Interface” means interface either official standard defined recognized standards body, , case interfaces specified particular programming language, one widely used among developers working language. “System Libraries” executable work include anything, work whole, () included normal form packaging Major Component, part Major Component, (b) serves enable use work Major Component, implement Standard Interface implementation available public source code form. 
“Major Component”, context, means major essential component (kernel, window system, ) specific operating system () executable work runs, compiler used produce work, object code interpreter used run . “Corresponding Source” work object code form means source code needed generate, install, (executable work) run object code modify work, including scripts control activities. However, include work’s System Libraries, general-purpose tools generally available free programs used unmodified performing activities part work. example, Corresponding Source includes interface definition files associated source files work, source code shared libraries dynamically linked subprograms work specifically designed require, intimate data communication control flow subprograms parts work. Corresponding Source need include anything users can regenerate automatically parts Corresponding Source. Corresponding Source work source code form work.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/LICENSE.html","id":"id_2-basic-permissions","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"2. Basic Permissions","title":"GNU General Public License","text":"rights granted License granted term copyright Program, irrevocable provided stated conditions met. License explicitly affirms unlimited permission run unmodified Program. output running covered work covered License output, given content, constitutes covered work. License acknowledges rights fair use equivalent, provided copyright law. may make, run propagate covered works convey, without conditions long license otherwise remains force. may convey covered works others sole purpose make modifications exclusively , provide facilities running works, provided comply terms License conveying material control copyright. thus making running covered works must exclusively behalf, direction control, terms prohibit making copies copyrighted material outside relationship . 
Conveying circumstances permitted solely conditions stated . Sublicensing allowed; section 10 makes unnecessary.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/LICENSE.html","id":"id_3-protecting-users-legal-rights-from-anti-circumvention-law","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"3. Protecting Users’ Legal Rights From Anti-Circumvention Law","title":"GNU General Public License","text":"covered work shall deemed part effective technological measure applicable law fulfilling obligations article 11 WIPO copyright treaty adopted 20 December 1996, similar laws prohibiting restricting circumvention measures. convey covered work, waive legal power forbid circumvention technological measures extent circumvention effected exercising rights License respect covered work, disclaim intention limit operation modification work means enforcing, work’s users, third parties’ legal rights forbid circumvention technological measures.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/LICENSE.html","id":"id_4-conveying-verbatim-copies","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"4. Conveying Verbatim Copies","title":"GNU General Public License","text":"may convey verbatim copies Program’s source code receive , medium, provided conspicuously appropriately publish copy appropriate copyright notice; keep intact notices stating License non-permissive terms added accord section 7 apply code; keep intact notices absence warranty; give recipients copy License along Program. may charge price price copy convey, may offer support warranty protection fee.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/LICENSE.html","id":"id_5-conveying-modified-source-versions","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"5. 
Conveying Modified Source Versions","title":"GNU General Public License","text":"may convey work based Program, modifications produce Program, form source code terms section 4, provided also meet conditions: ) work must carry prominent notices stating modified , giving relevant date. b) work must carry prominent notices stating released License conditions added section 7. requirement modifies requirement section 4 “keep intact notices”. c) must license entire work, whole, License anyone comes possession copy. License therefore apply, along applicable section 7 additional terms, whole work, parts, regardless packaged. License gives permission license work way, invalidate permission separately received . d) work interactive user interfaces, must display Appropriate Legal Notices; however, Program interactive interfaces display Appropriate Legal Notices, work need make . compilation covered work separate independent works, nature extensions covered work, combined form larger program, volume storage distribution medium, called “aggregate” compilation resulting copyright used limit access legal rights compilation’s users beyond individual works permit. Inclusion covered work aggregate cause License apply parts aggregate.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/LICENSE.html","id":"id_6-conveying-non-source-forms","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"6. Conveying Non-Source Forms","title":"GNU General Public License","text":"may convey covered work object code form terms sections 4 5, provided also convey machine-readable Corresponding Source terms License, one ways: ) Convey object code , embodied , physical product (including physical distribution medium), accompanied Corresponding Source fixed durable physical medium customarily used software interchange. 
b) Convey object code , embodied , physical product (including physical distribution medium), accompanied written offer, valid least three years valid long offer spare parts customer support product model, give anyone possesses object code either (1) copy Corresponding Source software product covered License, durable physical medium customarily used software interchange, price reasonable cost physically performing conveying source, (2) access copy Corresponding Source network server charge. c) Convey individual copies object code copy written offer provide Corresponding Source. alternative allowed occasionally noncommercially, received object code offer, accord subsection 6b. d) Convey object code offering access designated place (gratis charge), offer equivalent access Corresponding Source way place charge. need require recipients copy Corresponding Source along object code. place copy object code network server, Corresponding Source may different server (operated third party) supports equivalent copying facilities, provided maintain clear directions next object code saying find Corresponding Source. Regardless server hosts Corresponding Source, remain obligated ensure available long needed satisfy requirements. e) Convey object code using peer--peer transmission, provided inform peers object code Corresponding Source work offered general public charge subsection 6d. separable portion object code, whose source code excluded Corresponding Source System Library, need included conveying object code work. “User Product” either (1) “consumer product”, means tangible personal property normally used personal, family, household purposes, (2) anything designed sold incorporation dwelling. determining whether product consumer product, doubtful cases shall resolved favor coverage. 
particular product received particular user, “normally used” refers typical common use class product, regardless status particular user way particular user actually uses, expects expected use, product. product consumer product regardless whether product substantial commercial, industrial non-consumer uses, unless uses represent significant mode use product. “Installation Information” User Product means methods, procedures, authorization keys, information required install execute modified versions covered work User Product modified version Corresponding Source. information must suffice ensure continued functioning modified object code case prevented interfered solely modification made. convey object code work section , , specifically use , User Product, conveying occurs part transaction right possession use User Product transferred recipient perpetuity fixed term (regardless transaction characterized), Corresponding Source conveyed section must accompanied Installation Information. requirement apply neither third party retains ability install modified object code User Product (example, work installed ROM). requirement provide Installation Information include requirement continue provide support service, warranty, updates work modified installed recipient, User Product modified installed. Access network may denied modification materially adversely affects operation network violates rules protocols communication across network. Corresponding Source conveyed, Installation Information provided, accord section must format publicly documented (implementation available public source code form), must require special password key unpacking, reading copying.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/LICENSE.html","id":"id_7-additional-terms","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"7. Additional Terms","title":"GNU General Public License","text":"“Additional permissions” terms supplement terms License making exceptions one conditions. 
Additional permissions applicable entire Program shall treated though included License, extent valid applicable law. additional permissions apply part Program, part may used separately permissions, entire Program remains governed License without regard additional permissions. convey copy covered work, may option remove additional permissions copy, part . (Additional permissions may written require removal certain cases modify work.) may place additional permissions material, added covered work, can give appropriate copyright permission. Notwithstanding provision License, material add covered work, may (authorized copyright holders material) supplement terms License terms: ) Disclaiming warranty limiting liability differently terms sections 15 16 License; b) Requiring preservation specified reasonable legal notices author attributions material Appropriate Legal Notices displayed works containing ; c) Prohibiting misrepresentation origin material, requiring modified versions material marked reasonable ways different original version; d) Limiting use publicity purposes names licensors authors material; e) Declining grant rights trademark law use trade names, trademarks, service marks; f) Requiring indemnification licensors authors material anyone conveys material (modified versions ) contractual assumptions liability recipient, liability contractual assumptions directly impose licensors authors. non-permissive additional terms considered “restrictions” within meaning section 10. Program received , part , contains notice stating governed License along term restriction, may remove term. license document contains restriction permits relicensing conveying License, may add covered work material governed terms license document, provided restriction survive relicensing conveying. add terms covered work accord section, must place, relevant source files, statement additional terms apply files, notice indicating find applicable terms. 
Additional terms, permissive non-permissive, may stated form separately written license, stated exceptions; requirements apply either way.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/LICENSE.html","id":"id_8-termination","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"8. Termination","title":"GNU General Public License","text":"may propagate modify covered work except expressly provided License. attempt otherwise propagate modify void, automatically terminate rights License (including patent licenses granted third paragraph section 11). However, cease violation License, license particular copyright holder reinstated () provisionally, unless copyright holder explicitly finally terminates license, (b) permanently, copyright holder fails notify violation reasonable means prior 60 days cessation. Moreover, license particular copyright holder reinstated permanently copyright holder notifies violation reasonable means, first time received notice violation License (work) copyright holder, cure violation prior 30 days receipt notice. Termination rights section terminate licenses parties received copies rights License. rights terminated permanently reinstated, qualify receive new licenses material section 10.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/LICENSE.html","id":"id_9-acceptance-not-required-for-having-copies","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"9. Acceptance Not Required for Having Copies","title":"GNU General Public License","text":"required accept License order receive run copy Program. Ancillary propagation covered work occurring solely consequence using peer--peer transmission receive copy likewise require acceptance. However, nothing License grants permission propagate modify covered work. actions infringe copyright accept License. 
Therefore, modifying propagating covered work, indicate acceptance License .","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/LICENSE.html","id":"id_10-automatic-licensing-of-downstream-recipients","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"10. Automatic Licensing of Downstream Recipients","title":"GNU General Public License","text":"time convey covered work, recipient automatically receives license original licensors, run, modify propagate work, subject License. responsible enforcing compliance third parties License. “entity transaction” transaction transferring control organization, substantially assets one, subdividing organization, merging organizations. propagation covered work results entity transaction, party transaction receives copy work also receives whatever licenses work party’s predecessor interest give previous paragraph, plus right possession Corresponding Source work predecessor interest, predecessor can get reasonable efforts. may impose restrictions exercise rights granted affirmed License. example, may impose license fee, royalty, charge exercise rights granted License, may initiate litigation (including cross-claim counterclaim lawsuit) alleging patent claim infringed making, using, selling, offering sale, importing Program portion .","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/LICENSE.html","id":"id_11-patents","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"11. Patents","title":"GNU General Public License","text":"“contributor” copyright holder authorizes use License Program work Program based. work thus licensed called contributor’s “contributor version”. contributor’s “essential patent claims” patent claims owned controlled contributor, whether already acquired hereafter acquired, infringed manner, permitted License, making, using, selling contributor version, include claims infringed consequence modification contributor version. 
purposes definition, “control” includes right grant patent sublicenses manner consistent requirements License. contributor grants non-exclusive, worldwide, royalty-free patent license contributor’s essential patent claims, make, use, sell, offer sale, import otherwise run, modify propagate contents contributor version. following three paragraphs, “patent license” express agreement commitment, however denominated, enforce patent (express permission practice patent covenant sue patent infringement). “grant” patent license party means make agreement commitment enforce patent party. convey covered work, knowingly relying patent license, Corresponding Source work available anyone copy, free charge terms License, publicly available network server readily accessible means, must either (1) cause Corresponding Source available, (2) arrange deprive benefit patent license particular work, (3) arrange, manner consistent requirements License, extend patent license downstream recipients. “Knowingly relying” means actual knowledge , patent license, conveying covered work country, recipient’s use covered work country, infringe one identifiable patents country reason believe valid. , pursuant connection single transaction arrangement, convey, propagate procuring conveyance , covered work, grant patent license parties receiving covered work authorizing use, propagate, modify convey specific copy covered work, patent license grant automatically extended recipients covered work works based . patent license “discriminatory” include within scope coverage, prohibits exercise , conditioned non-exercise one rights specifically granted License. 
may convey covered work party arrangement third party business distributing software, make payment third party based extent activity conveying work, third party grants, parties receive covered work , discriminatory patent license () connection copies covered work conveyed (copies made copies), (b) primarily connection specific products compilations contain covered work, unless entered arrangement, patent license granted, prior 28 March 2007. Nothing License shall construed excluding limiting implied license defenses infringement may otherwise available applicable patent law.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/LICENSE.html","id":"id_12-no-surrender-of-others-freedom","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"12. No Surrender of Others’ Freedom","title":"GNU General Public License","text":"conditions imposed (whether court order, agreement otherwise) contradict conditions License, excuse conditions License. convey covered work satisfy simultaneously obligations License pertinent obligations, consequence may convey . example, agree terms obligate collect royalty conveying convey Program, way satisfy terms License refrain entirely conveying Program.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/LICENSE.html","id":"id_13-use-with-the-gnu-affero-general-public-license","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"13. Use with the GNU Affero General Public License","title":"GNU General Public License","text":"Notwithstanding provision License, permission link combine covered work work licensed version 3 GNU Affero General Public License single combined work, convey resulting work. 
terms License continue apply part covered work, special requirements GNU Affero General Public License, section 13, concerning interaction network apply combination .","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/LICENSE.html","id":"id_14-revised-versions-of-this-license","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"14. Revised Versions of this License","title":"GNU General Public License","text":"Free Software Foundation may publish revised /new versions GNU General Public License time time. new versions similar spirit present version, may differ detail address new problems concerns. version given distinguishing version number. Program specifies certain numbered version GNU General Public License “later version” applies , option following terms conditions either numbered version later version published Free Software Foundation. Program specify version number GNU General Public License, may choose version ever published Free Software Foundation. Program specifies proxy can decide future versions GNU General Public License can used, proxy’s public statement acceptance version permanently authorizes choose version Program. Later license versions may give additional different permissions. However, additional obligations imposed author copyright holder result choosing follow later version.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/LICENSE.html","id":"id_15-disclaimer-of-warranty","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"15. Disclaimer of Warranty","title":"GNU General Public License","text":"WARRANTY PROGRAM, EXTENT PERMITTED APPLICABLE LAW. EXCEPT OTHERWISE STATED WRITING COPYRIGHT HOLDERS /PARTIES PROVIDE PROGRAM “” WITHOUT WARRANTY KIND, EITHER EXPRESSED IMPLIED, INCLUDING, LIMITED , IMPLIED WARRANTIES MERCHANTABILITY FITNESS PARTICULAR PURPOSE. ENTIRE RISK QUALITY PERFORMANCE PROGRAM . 
PROGRAM PROVE DEFECTIVE, ASSUME COST NECESSARY SERVICING, REPAIR CORRECTION.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/LICENSE.html","id":"id_16-limitation-of-liability","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"16. Limitation of Liability","title":"GNU General Public License","text":"EVENT UNLESS REQUIRED APPLICABLE LAW AGREED WRITING COPYRIGHT HOLDER, PARTY MODIFIES /CONVEYS PROGRAM PERMITTED , LIABLE DAMAGES, INCLUDING GENERAL, SPECIAL, INCIDENTAL CONSEQUENTIAL DAMAGES ARISING USE INABILITY USE PROGRAM (INCLUDING LIMITED LOSS DATA DATA RENDERED INACCURATE LOSSES SUSTAINED THIRD PARTIES FAILURE PROGRAM OPERATE PROGRAMS), EVEN HOLDER PARTY ADVISED POSSIBILITY DAMAGES.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/LICENSE.html","id":"id_17-interpretation-of-sections-15-and-16","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"17. Interpretation of Sections 15 and 16","title":"GNU General Public License","text":"disclaimer warranty limitation liability provided given local legal effect according terms, reviewing courts shall apply local law closely approximates absolute waiver civil liability connection Program, unless warranty assumption liability accompanies copy Program return fee. END TERMS CONDITIONS","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/LICENSE.html","id":"how-to-apply-these-terms-to-your-new-programs","dir":"","previous_headings":"","what":"How to Apply These Terms to Your New Programs","title":"GNU General Public License","text":"develop new program, want greatest possible use public, best way achieve make free software everyone can redistribute change terms. , attach following notices program. safest attach start source file effectively state exclusion warranty; file least “copyright” line pointer full notice found. Also add information contact electronic paper mail. 
program terminal interaction, make output short notice like starts interactive mode: hypothetical commands show w show c show appropriate parts General Public License. course, program’s commands might different; GUI interface, use “box”. also get employer (work programmer) school, , sign “copyright disclaimer” program, necessary. information , apply follow GNU GPL, see <http://www.gnu.org/licenses/>. GNU General Public License permit incorporating program proprietary programs. program subroutine library, may consider useful permit linking proprietary applications library. want , use GNU Lesser General Public License instead License. first, please read <http://www.gnu.org/philosophy/--lgpl.html>.","code":"<one line to give the program's name and a brief idea of what it does.> Copyright (C) <year>  <name of author>  This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.  This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for more details.  You should have received a copy of the GNU General Public License along with this program.  If not, see <http://www.gnu.org/licenses/>. <program>  Copyright (C) <year>  <name of author> This program comes with ABSOLUTELY NO WARRANTY; for details type 'show w'. This is free software, and you are welcome to redistribute it under certain conditions; type 'show c' for details."},{"path":"https://beniaminogreen.github.io/zoomerjoin/articles/benchmarks.html","id":"introduction","dir":"Articles","previous_headings":"","what":"Introduction","title":"Benchmarks","text":"short vignette, show benchmarks zoomerjoin package, comparing excellent fuzzyjoin package. 
two packages designed different things - fuzzyjoin package fast, provides distance functions (well joining modes) - ’s useful comparison shows time can saved using LSH relative pairwise comparisons, long okay using Jaccard similarity. future, hoping expand package implement LSH method edit distance, add benchmarks / feature completed.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/articles/benchmarks.html","id":"benchmarks","dir":"Articles","previous_headings":"","what":"Benchmarks","title":"Benchmarks","text":", show time takes fuzzyjoin zoomerjoin fuzzily join two datasets size dataset increases. Fuzzyjoin initially quick, runtime scales square input size. Zoomerjoin slower small datasets less memory-intensive, scales sum rows dataset, becomes quicker larger datasets.","code":"#> Rows: 60 Columns: 5 #> ── Column specification ──────────────────────────────────────────────────────── #> Delimiter: \",\" #> chr (3): package, join_type, name #> dbl (2): n, value #>  #> ℹ Use `spec()` to retrieve the full column specification for this data. 
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message."},{"path":"https://beniaminogreen.github.io/zoomerjoin/articles/benchmarks.html","id":"benchmarking-code","dir":"Articles","previous_headings":"","what":"Benchmarking Code:","title":"Benchmarks","text":", include code used generate benchmarks:","code":"library(zoomerjoin) library(fuzzyjoin) library(tidyverse) library(microbenchmark) library(profmem)   # Sample million rows from DIME dataset data_1 <- as.data.frame(sample_n(dime_data, 10^6)) names(data_1) <- c(\"id_1\", \"name\") data_2 <- as.data.frame(sample_n(dime_data, 10^6)) names(data_2) <- c(\"id_2\", \"name\")  # Generate datasets for euclidean join benchmarking n <- 10^5 p <- 50 X <- matrix(rnorm(n * p), n, p) X_1 <- as.data.frame(X) X_2 <- as.data.frame(X + .000000001)  # Get time and memory use statistics for fuzzyjoin when performing jaccard join fuzzy_jaccard_bench <- function(n) {   time <- microbenchmark(     stringdist_inner_join(data_1[1:n, ],       data_2[1:n, ],       method = \"jaccard\",       max_dist = .6,       q = 4     ),     times = 10   )$time %>%     median()    mem <- profmem(stringdist_inner_join(data_1[1:n, ],     data_2[1:n, ],     method = \"jaccard\",     max_dist = .6,     q = 4   )) %>%     total()    return(c(time = time, memory = mem)) }   # Get time and memory use statistics for zoomerjoin when performing jaccard join zoomer_jaccard_bench <- function(n) {   time <- microbenchmark(     jaccard_inner_join(data_1[1:n, ], data_2[1:n, ],       by = \"name\", band_width = 11,       n_bands = 350, threshold = .7,       n_gram_width = 4     ),     times = 50   )$time %>%     median()    mem <- profmem(     jaccard_inner_join(data_1[1:n, ], data_2[1:n, ],       by = \"name\", band_width = 11,       n_bands = 350, threshold = .7,       n_gram_width = 4     )   ) %>%     total()    return(c(time = time, memory = mem)) }  # Get time and memory use statistics for fuzzyjoin when performing Euclidean join 
fuzzy_euclid_bench <- function(n) {   time <- microbenchmark(     distance_join(X_1[1:n, ], X_2[1:n, ], max_dist = .1, method = \"euclidean\"),     times = 10   )$time %>%     median()    mem <- total(profmem(     distance_join(X_1[1:n, ], X_2[1:n, ], max_dist = .1, method = \"euclidean\")   ))    return(c(time = time, memory = mem)) }  # Get time and memory use statistics for zoomerjoin when performing Euclidean join zoomer_euclid_bench <- function(n) {   time <- microbenchmark(     euclidean_inner_join(X_1[1:n, ], X_2[1:n, ],       threshold = .1, n_bands = 90,       band_width = 2, r = .1     ),     times = 50   )$time %>%     median()    mem <- profmem(euclidean_inner_join(X_1[1:n, ], X_2[1:n, ],     threshold = .1, n_bands = 90,     band_width = 2, r = .1   )) %>%     total()    return(c(time = time, memory = mem)) }   # Run Grid of Jaccard Benchmarks, Collect results into DF n <- seq(500, 4000, 250) names(n) <- n fuzzy_jacard_benches <- map_df(n, fuzzy_jaccard_bench, .id = \"n\") zoomer_jacard_benches <- map_df(n, zoomer_jaccard_bench, .id = \"n\") fuzzy_jacard_benches$package <- \"fuzzyjoin\" zoomer_jacard_benches$package <- \"zoomerjoin\" jaccard_benches <- bind_rows(fuzzy_jacard_benches, zoomer_jacard_benches) jaccard_benches$join_type <- \"Jaccard Distance\"  # Run Grid of Euclidean Benchmarks, Collect results into DF n <- seq(250, 4000, 250) names(n) <- n fuzzy_euclid_benches <- map_df(n, fuzzy_euclid_bench, .id = \"n\") zoomer_euclid_benches <- map_df(n, zoomer_euclid_bench, .id = \"n\") fuzzy_euclid_benches$package <- \"fuzzyjoin\" zoomer_euclid_benches$package <- \"zoomerjoin\" euclid_benches <- bind_rows(fuzzy_euclid_benches, zoomer_euclid_benches) euclid_benches$join_type <- \"Euclidean Distance\"  sim_data <- bind_rows(euclid_benches, jaccard_benches) %>%   pivot_longer(c(time, memory)) %>%   mutate(value = ifelse(name == \"time\", value / 10^9, value / 10^6)) # convert ns to s and bytes to MB.  
write_csv(sim_data, \"sim_data.csv\")"},{"path":"https://beniaminogreen.github.io/zoomerjoin/articles/guided_tour.html","id":"introduction","dir":"Articles","previous_headings":"","what":"Introduction:","title":"A Zoomerjoin Guided Tour","text":"vignette gives basic overview core functionality zoomerjoin package. Zoomerjoin empowers fuzzily-match datasets millions rows seconds, staying light memory usage. makes feasible perform fuzzy-joins datasets hundreds millions observations matter minutes.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/articles/guided_tour.html","id":"how-does-it-work","dir":"Articles","previous_headings":"","what":"How Does it Work?","title":"A Zoomerjoin Guided Tour","text":"Zoomerjoin’s blazingly fast joins string distance made possible optimized, performant implementation MinHash algorithm written Rust. conventional joining packages compare pairs records two datasets wish join, MinHash algorithm manages compare similar records . results matches orders magnitudes faster matching software packages: zoomerjoin takes hours minutes join datasets taken centuries join using matching methods.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/articles/guided_tour.html","id":"basic-syntax","dir":"Articles","previous_headings":"","what":"Basic Syntax:","title":"A Zoomerjoin Guided Tour","text":"’re familiar logical-join syntax dplyr, already know use fuzzy join join two datasets. Zoomerjoin provides jaccard_inner_join() jaccard_full_join() (among others), fuzzy-joining analogues corresponding dplyr functions. demonstrate syntax using package join corpuses, formed entries Database Ideology, Money Politics, Elections (DIME) (Bonica 2016). first corpus looks follows: second looks follows: two Corpuses can’t directly joined misspellings. means must use fuzzy-matching capabilities zoomerjoin: first two arguments, , b, direct analogues dplyr arguments, two data frames want join. 
field also acts ‘dplyr’ (provides function columns want match ). n_gram_width parameter determines wide n-grams used similarity evaluation , threshold argument determines similar pair strings (Jaccard similarity) considered match. Users stringdist fuzzyjoin package familiar arguments, bear mind packages measure string distance (distance 0 indicates complete similarity), package operates string similarity, threshold .8 keep matches 80% Jaccard similarity. n_bands band_width parameters govern performance LSH. default parameters perform well medium-size (n < 10^7) datasets matches somewhat similar (similarity > .8), may require tuning settings. jaccard_hyper_grid_search(), jaccard_curve() functions can help select parameters given properties LSH desire. example, can use jaccard_curve() function plot probability pair records compared possible Jaccard distance, \\(d\\) zero one:  looking plot produced, can see using hyperparameters, comparisons almost never made pairs records Jaccard similarity less .2 (saving time), pairs records Jaccard similarity greater .8 almost always compared (giving low false-negative rate). 
details hyperparameters, textreuse package excellent vignette, zoomerjoin provides re-implementation profiling tools, jaccard_probability, jaccard_bandwidth (although implementations differ slightly hyperparameters package different).","code":"library(tidyverse) ## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ── ## ✔ dplyr     1.1.4     ✔ readr     2.1.5 ## ✔ forcats   1.0.0     ✔ stringr   1.5.1 ## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1 ## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1 ## ✔ purrr     1.0.2      ## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ── ## ✖ dplyr::filter() masks stats::filter() ## ✖ dplyr::lag()    masks stats::lag() ## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors library(microbenchmark) library(fuzzyjoin) library(zoomerjoin)  corpus_1 <- dime_data %>% # dime data is packaged with zoomerjoin   head(500) names(corpus_1) <- c(\"a\", \"field\") corpus_1 ## # A tibble: 500 × 2 ##        a field                                                                   ##    <dbl> <chr>                                                                   ##  1     1 ufwa cope committee                                                     ##  2     2 committee to re elect charles e. 
bennett                                ##  3     3 montana democratic party non federal account                            ##  4     4 mississippi power & light company management political action and educ… ##  5     5 napus pac for postmasters                                               ##  6     6 aminoil good government fund                                            ##  7     7 national women's political caucus of california                         ##  8     8 minnesota gun owners' political victory fund                            ##  9     9 metropolitan detroit afl cio cope committee                             ## 10    10 carpenters legislative improvement committee united brotherhood of car… ## # ℹ 490 more rows corpus_2 <- dime_data %>% # dime data is packaged with zoomerjoin   tail(500) names(corpus_2) <- c(\"b\", \"field\") corpus_2 ## # A tibble: 500 × 2 ##        b field                                                                   ##    <dbl> <chr>                                                                   ##  1   501 citizens for derwinski                                                  ##  2   502 progressive victory fund greater washington americans for democratic a… ##  3   503 ingham county democratic party federal campaign fund                    ##  4   504 committee for a stronger future                                         ##  5   505 atoka country supper committee                                          ##  6   506 friends of democracy pac inc                                            ##  7   507 baypac                                                                  ##  8   508 international brotherhood of electrical workers local union 278 cope/p… ##  9   509 louisville & jefferson county republican executive committee            ## 10   510 democratic party of virginia                                            ## # ℹ 490 more rows set.seed(1) start_time <- Sys.time() join_out <- jaccard_inner_join(corpus_1, corpus_2,   
by = \"field\", n_gram_width = 6,   n_bands = 20, band_width = 6, threshold = .8 ) print(Sys.time() - start_time) ## Time difference of 0.01206851 secs print(join_out) ## # A tibble: 8 × 4 ##       a field.x                                                      b field.y   ##   <dbl> <chr>                                                    <dbl> <chr>     ## 1    88 scheuer for congress 1980                                  667 scheuer … ## 2   292 bill bradley for u s senate '84                            913 bill bra… ## 3   378 guarini for congress 1982                                  883 guarini … ## 4   238 4th congressional district democratic party                518 16th con… ## 5   302 americans for good government inc                          910 american… ## 6   230 pipefitters local union 524                                998 pipefitt… ## 7   319 7th congressional district democratic party of wisconsin   792 8th cong… ## 8   378 guarini for congress 1982                                  606 guarini … jaccard_curve(20, 6)"},{"path":"https://beniaminogreen.github.io/zoomerjoin/articles/guided_tour.html","id":"standardizing-string-names-after-a-merge","dir":"Articles","previous_headings":"","what":"Standardizing String Names After A Merge","title":"A Zoomerjoin Guided Tour","text":"Often merging, can help standardize names fields joined . way, can assign unique label identifying key observations similar value merging variable. jaccard_string_group() function makes possible. first performs locality sensitive hashing identify similar pairs observations within dataset, runs community detection algorithm identify clusters similar observations, assigned label. community-detection algorithm, fastgreedy.community() igraph package runs log-linear time, entire algorithm completes linearithmic time. 
’s short snippet showing can use jaccard_string_group() standardize set organization names.","code":"organization_names <- c(   \"American Civil Liberties Union\",   \"American Civil Liberties Union (ACLU)\",   \"NRA National Rifle Association\",   \"National Rifle Association NRA\",   \"National Rifle Association\",   \"Planned Parenthood\",   \"Blue Cross\" ) standardized_organization_names <- jaccard_string_group(organization_names, threshold = .5, band_width = 3) ## Loading required namespace: igraph print(standardized_organization_names) ## [1] \"American Civil Liberties Union\" \"American Civil Liberties Union\" ## [3] \"NRA National Rifle Association\" \"NRA National Rifle Association\" ## [5] \"NRA National Rifle Association\" \"Planned Parenthood\"             ## [7] \"Blue Cross\""},{"path":"https://beniaminogreen.github.io/zoomerjoin/articles/guided_tour.html","id":"references","dir":"Articles","previous_headings":"Standardizing String Names After A Merge","what":"References:","title":"A Zoomerjoin Guided Tour","text":"Bonica, Adam. 2016. Database Ideology, Money Politics, Elections: Public version 2.0 [Computer file]. Stanford, CA: Stanford University Libraries.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/articles/matching_vectors.html","id":"introduction","dir":"Articles","previous_headings":"","what":"Introduction","title":"Matching Vectors Based on Euclidean Distance","text":"flagship feature zoomerjoin tidy joins strings using Jaccard distance, zoomerjoin also allows join vectors using Euclidean distance. can useful joining addresses coordinates space. 
Unlike nearest-neighbor methods KD-trees, joins slow dimension coordinates increases, zoomerjoin can used find close points high-dimensional space (word embeddings).","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/articles/matching_vectors.html","id":"demonstration","dir":"Articles","previous_headings":"","what":"Demonstration","title":"Matching Vectors Based on Euclidean Distance","text":"demonstration, create simulated dataset 10^5 points distributed uniformly within 100-dimensional hypercube. join another dataset copy first point shifted tiny random amount. now want join two datasets together. Euclidean joins take 3 hyperparameters: n_bands, band_width, r. chosen problem domain (although defaults generally sensible). use euclidean_probability function package understand probability two observations distance .01 identified match variety hyperparameter configurations. Using n_bands=40, band_width=8, r=.15 seems provide good balance identifying true matches (pairs less .01 apart almost certainly found) reducing number un-promising comparisons (pairs greater .1 apart unlikely compared). use euclidean_inner_join find matching pairs across two datasets: Zoomerjoin able easily find pairs just 30s (perhaps longer runner renders website), even though points lie high-dimensional (d=100) space. 
makes zoomerjoin useful tool trying join find matches datasets word document embeddings.","code":"n <- 10^5 # number of data points d <- 10^2 # dimension  # Create a matrix of 10^5 observations in R^100 X <- matrix(runif(n * d), n, d) # Second Dataset is a copy of the first with points shifted an infinitesimal # amount X_2 <- as.data.frame(X + matrix(rnorm(n * d, 0, .0001), n, d)) X <- as.data.frame(X) euclidean_probability(.01, n_bands = 5, band_width = 8, r = .25) #> [1] 0.9993764 euclidean_probability(.1, n_bands = 5, band_width = 8, r = .25) #> [1] 0.2141322 euclidean_probability(.01, n_bands = 10, band_width = 4, r = .15) #> [1] 0.9999999 euclidean_probability(.1, n_bands = 10, band_width = 4, r = .15) #> [1] 0.4956251 euclidean_probability(.01, n_bands = 40, band_width = 8, r = .15) #> [1] 1 euclidean_probability(.1, n_bands = 40, band_width = 8, r = .15) #> [1] 0.16091 set.seed(1) start <- Sys.time() joined_out <- euclidean_inner_join(   X,   X_2,   threshold = .01,   n_bands = 40,   band_width = 8,   r = .15 ) n_matches <- nrow(joined_out) time_taken <- Sys.time() - start print(paste(\"found\", n_matches, \"matches in\", round(time_taken), \"seconds\")) #> [1] \"found 100000 matches in 20 seconds\""},{"path":"https://beniaminogreen.github.io/zoomerjoin/authors.html","id":null,"dir":"","previous_headings":"","what":"Authors","title":"Authors and Citation","text":"Beniamino Green. Author, maintainer, copyright holder. Etienne Bacher. Contributor. authors dependency Rust crates. Contributor, copyright holder.            see inst/AUTHORS file details","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/authors.html","id":"citation","dir":"","previous_headings":"","what":"Citation","title":"Authors and Citation","text":"Green B (2024). zoomerjoin: Superlatively Fast Fuzzy Joins. 
R package version 0.1.2.9000, https://beniamino.org/zoomerjoin/.","code":"@Manual{,   title = {zoomerjoin: Superlatively Fast Fuzzy Joins},   author = {Beniamino Green},   year = {2024},   note = {R package version 0.1.2.9000},   url = {https://beniamino.org/zoomerjoin/}, }"},{"path":"https://beniaminogreen.github.io/zoomerjoin/index.html","id":"zoomerjoin-","dir":"","previous_headings":"","what":"Superlatively Fast Fuzzy Joins","title":"Superlatively Fast Fuzzy Joins","text":"zoomerjoin R package empowers fuzzy-join massive datasets rapidly, little memory consumption. powered high-performance implementations Locality Sensitive Hashing, algorithm finds matches records two datasets without compare possible pairs observations. practice, means zoomerjoin can fuzzily-join datasets days, even years faster matching packages. zoomerjoin used -production join datasets hundreds millions names vectors matter hours.","code":""},{"path":[]},{"path":"https://beniaminogreen.github.io/zoomerjoin/index.html","id":"installing-from-cran","dir":"","previous_headings":"Installation","what":"Installing from CRAN:","title":"Superlatively Fast Fuzzy Joins","text":"can install CRAN package. Please aware Cargo (rust toolchain compiler) installed build package source.","code":"install.packages('zoomerjoin')"},{"path":"https://beniaminogreen.github.io/zoomerjoin/index.html","id":"installing-from-r-universe","dir":"","previous_headings":"Installation","what":"Installing from R-Universe:","title":"Superlatively Fast Fuzzy Joins","text":"package distributed using r-universe, provides pre-compiled binaries common operating systems recent versions R. 
install r-universe, can use following command R:","code":"install.packages(   'zoomerjoin',   repos = c('https://beniaminogreen.r-universe.dev', getOption(\"repos\")) )"},{"path":"https://beniaminogreen.github.io/zoomerjoin/index.html","id":"installing-rust","dir":"","previous_headings":"Installation","what":"Installing Rust","title":"Superlatively Fast Fuzzy Joins","text":"operating system version R installed, must Rust compiler installed compile package sources. package compiled, Rust longer required, can safely uninstalled.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/index.html","id":"installing-rust-on-linux-or-mac","dir":"","previous_headings":"Installation > Installing Rust","what":"Installing Rust on Linux or Mac:","title":"Superlatively Fast Fuzzy Joins","text":"install Rust Linux Mac, can simply run following snippet terminal.","code":"curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh"},{"path":"https://beniaminogreen.github.io/zoomerjoin/index.html","id":"installing-rust-on-windows","dir":"","previous_headings":"Installation > Installing Rust","what":"Installing Rust on Windows:","title":"Superlatively Fast Fuzzy Joins","text":"install Rust windows, can use Rust installation wizard, rustup-init.exe, found site. Depending version Windows, may see error looks something like : case, run rustup install stable-x86_64-pc_windows-gnu install missing toolchain. 
’re missing another toolchain, simply type place stable-x86_64-pc_windows-gnu command .","code":"error: toolchain 'stable-x86_64-pc-windows-gnu' is not installed"},{"path":"https://beniaminogreen.github.io/zoomerjoin/index.html","id":"installing-package-from-github","dir":"","previous_headings":"Installation","what":"Installing Package from Github:","title":"Superlatively Fast Fuzzy Joins","text":"rust installed Rust, able install package either install.packages function , using install_github function devtools package pkg_install function pak package.","code":"## Install with devtools # install.packages(\"devtools\") devtools::install_github(\"beniaminogreen/zoomerjoin\")  ## Install with pak # install.packages(\"pak\") pak::pkg_install(\"beniaminogreen/zoomerjoin\")"},{"path":"https://beniaminogreen.github.io/zoomerjoin/index.html","id":"loading-the-package","dir":"","previous_headings":"Installation","what":"Loading The Package","title":"Superlatively Fast Fuzzy Joins","text":"package installed, can load memory usual typing:","code":"library(zoomerjoin)"},{"path":"https://beniaminogreen.github.io/zoomerjoin/index.html","id":"usage","dir":"","previous_headings":"","what":"Usage:","title":"Superlatively Fast Fuzzy Joins","text":"flagship feature zoomerjoins jaccard_join euclidean family functions, designed near drop-ins corresponding dplyr/fuzzyjoin commands: jaccard_left_join() jaccard_right_join() jaccard_inner_join() jaccard_full_join() euclidean_left_join() euclidean_right_join() euclidean_inner_join() euclidean_full_join() jaccard_join family functions provide fast fuzzy-joins strings using Jaccard distance euclidean_join family provides fuzzy-joins points vectors using Euclidean distance.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/index.html","id":"example-joining-rows-of-the-database-on-ideology-money-in-politics-and-elections","dir":"","previous_headings":"Usage:","what":"Example: Joining rows of the Database on Ideology, Money in 
Politics, and Elections","title":"Superlatively Fast Fuzzy Joins","text":"(DIME) ’s snippet showing use jaccard_inner_join() merge two lists political donors Database Ideology, Money Politics, Elections (DIME). can see detailed example vignette introductory vignette. start two corpuses like combine, corpus_1: corpus_2: corpuses observation ID column, donor name column. like join two datasets donor names column, two can’t directly joined misspellings. , use jaccard_inner_join function fuzzily join two donor name column. Importantly, Locality Sensitive Hashing probabilistic algorithm, may fail identify matches random chance. adjust hyperparameters n_bands band_width chance true matches dropped negligible. default, package issue warning chance true match discovered less 95%. can use jaccard_probability jaccard_hyper_grid_search help understand probability true matches discarded algorithm. details thorough description tune hyperparameters can found guided tour vignette. Zoomerjoin able quickly find matching columns without comparing pairs records. saves time size list increases, can scale join datasets millions hundreds millions rows.","code":"corpus_1 <- dime_data %>%     head(500) names(corpus_1) <- c(\"a\", \"field\") corpus_1 ## # A tibble: 500 × 2 ##        a field                                                                   ##    <dbl> <chr>                                                                   ##  1     1 ufwa cope committee                                                     ##  2     2 committee to re elect charles e. 
bennett                                ##  3     3 montana democratic party non federal account                            ##  4     4 mississippi power & light company management political action and educ… ##  5     5 napus pac for postmasters                                               ##  6     6 aminoil good government fund                                            ##  7     7 national women's political caucus of california                         ##  8     8 minnesota gun owners' political victory fund                            ##  9     9 metropolitan detroit afl cio cope committee                             ## 10    10 carpenters legislative improvement committee united brotherhood of car… ## # ℹ 490 more rows corpus_2 <- dime_data %>%     tail(500) names(corpus_2) <- c(\"b\", \"field\") corpus_2 ## # A tibble: 500 × 2 ##        b field                                                                   ##    <dbl> <chr>                                                                   ##  1   501 citizens for derwinski                                                  ##  2   502 progressive victory fund greater washington americans for democratic a… ##  3   503 ingham county democratic party federal campaign fund                    ##  4   504 committee for a stronger future                                         ##  5   505 atoka country supper committee                                          ##  6   506 friends of democracy pac inc                                            ##  7   507 baypac                                                                  ##  8   508 international brotherhood of electrical workers local union 278 cope/p… ##  9   509 louisville & jefferson county republican executive committee            ## 10   510 democratic party of virginia                                            ## # ℹ 490 more rows set.seed(1) start_time <- Sys.time() join_out <- jaccard_inner_join(corpus_1, corpus_2, n_gram_width=6, n_bands=20, 
band_width=6) ## Warning in jaccard_join(a, b, mode = \"inner\", by = by, salt_by = block_by, : A pair of records at the threshold (0.7) have only a 92% chance of being compared. ## Please consider changing `n_bands` and `band_width`.  ## Joining by 'field' print(Sys.time() - start_time) ## Time difference of 0.01455116 secs print(join_out) ## # A tibble: 19 × 4 ##        a field.x                                                      b field.y  ##    <dbl> <chr>                                                    <dbl> <chr>    ##  1   216 kent county republican finance committee                   607 lake co… ##  2   238 4th congressional district democratic party                518 16th co… ##  3   292 bill bradley for u s senate '84                            913 bill br… ##  4   378 guarini for congress 1982                                  606 guarini… ##  5   232 republican county committee of chester county              710 republi… ##  6   387 committee to re elect congressman staton                   805 committ… ##  7   122 tarrant county republican victory fund                     761 lake co… ##  8   378 guarini for congress 1982                                  883 guarini… ##  9   238 4th congressional district democratic party                792 8th con… ## 10    88 scheuer for congress 1980                                  667 scheuer… ## 11    45 dole for senate committee                                  623 riegle … ## 12    87 kentucky state democratic central executive committee      639 arizona… ## 13   319 7th congressional district democratic party of wisconsin   792 8th con… ## 14   478 united democrats for better government                     642 democra… ## 15   163 davies county republican executive committee               852 warren … ## 16   230 pipefitters local union 524                                998 pipefit… ## 17   216 kent county republican finance committee                   719 harford… ## 18   302 americans for good 
government inc                          910 america… ## 19    35 solarz for congress 82                                     671 solarz …"},{"path":"https://beniaminogreen.github.io/zoomerjoin/index.html","id":"contributing","dir":"","previous_headings":"","what":"Contributing","title":"Superlatively Fast Fuzzy Joins","text":"Thanks interest contributing Zoomerjoin! using GitHub-centric workflow manage package; can file bug report, request new feature, ask question package filing issue issues page, also find range templates help . ’d like make changes code, can write file pull request page. ’ll try respond timely manner (within week) although occasionally may take longer respond complicated question issue. Please also aware contributor code conduct contributing repository.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/index.html","id":"acknowledgments","dir":"","previous_headings":"","what":"Acknowledgments:","title":"Superlatively Fast Fuzzy Joins","text":"Zoomerjoin made using SQL join illustration Germanx speed limit sign Federal Highway Administration - MUTCD.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/index.html","id":"references","dir":"","previous_headings":"","what":"References:","title":"Superlatively Fast Fuzzy Joins","text":"Bonica, Adam. 2016. Database Ideology, Money Politics, Elections: Public version 2.0 [Computer file]. Stanford, CA: Stanford University Libraries. Jure Leskovec, Anand Rajaraman, Jeffrey David Ullman. 2014. Mining Massive Datasets (2nd. ed.). Cambridge University Press, USA. Broder, Andrei Z. (1997), “resemblance containment documents”, Compression Complexity Sequences: Proceedings. 
Positano, Salerno, Italy","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/dime_data.html","id":null,"dir":"Reference","previous_headings":"","what":"Donors from DIME Database — dime_data","title":"Donors from DIME Database — dime_data","text":"set donor names Database Ideology, Money Politics, Elections (DIME).  dataset used benchmark 2021 APSR paper Adaptive Fuzzy String Matching: Merge Datasets One (Messy) Identifying Field Aaron R. Kaufman Aja Klevs, dataset package subset data replication archive paper. full dataset can found paper's replication materials : doi:10.7910/DVN/4031UL .","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/dime_data.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Donors from DIME Database — dime_data","text":"","code":"dime_data"},{"path":[]},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/dime_data.html","id":"dime-data","dir":"Reference","previous_headings":"","what":"dime_data","title":"Donors from DIME Database — dime_data","text":"data frame 10,000 rows 2 columns: id Numeric ID / Row Number x Donor Name #' @source https://www..int/teams/global-tuberculosis-programme/data","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/dime_data.html","id":"references","dir":"Reference","previous_headings":"","what":"References","title":"Donors from DIME Database — dime_data","text":"doi:10.7910/DVN/4031UL","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/dime_data.html","id":"author","dir":"Reference","previous_headings":"","what":"Author","title":"Donors from DIME Database — dime_data","text":"Adam Bonica","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/em_link.html","id":null,"dir":"Reference","previous_headings":"","what":"Fit a Probabilistic Matching Model using Naive Bayes + E.M. — em_link","title":"Fit a Probabilistic Matching Model using Naive Bayes + E.M. 
— em_link","text":"Rust implementation Naive Bayes / Fellegi-Sunter model record linkage detailed article \"Using Probabilistic Model Assist Merging Large-Scale Administrative Records\" Enamorado, Fifield Imai (2019). Takes integer matrix describing similarities possible pair observations, vector initial guesses probability pair match (can either set domain knowledge, one can hand-label subset data leave rest p=.5). Iteratively refines guesses using Expectation Maximization algorithm optima reached. details, see doi:10.1017/S0003055418000783 .","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/em_link.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Fit a Probabilistic Matching Model using Naive Bayes + E.M. — em_link","text":"","code":"em_link(X, g, tol = 10^-6, max_iter = 10^3)"},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/em_link.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Fit a Probabilistic Matching Model using Naive Bayes + E.M. — em_link","text":"X integer matrix similarities. Must go 0 (disagreement) maximum without \"gaps\" unused levels. example, column values 0,1,2,3 valid column, 0,1,2,4 three omitted g vector initial guesses iteratively improved using EM algorithm (personal approach guess logistic regression coefficients use create initial probability guesses). chosen avoid model getting stuck local optimum, avoid problem label-switching, labels matches non-matches reversed. tol tolerance sense infinity norm. .e. close parameters iterations EM algorithm terminates. max_iter iterations algorithm error converged.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/em_link.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Fit a Probabilistic Matching Model using Naive Bayes + E.M. 
— em_link","text":"vector probabilities representing posterior probability record pair match.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/em_link.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Fit a Probabilistic Matching Model using Naive Bayes + E.M. — em_link","text":"","code":"inv_logit <- function(x) {   exp(x) / (1 + exp(x)) } n <- 10^6 d <- 1:n %% 5 == 0 X <- cbind(   as.integer(ifelse(d, runif(n) < .8, runif(n) < .2)),   as.integer(ifelse(d, runif(n) < .9, runif(n) < .2)),   as.integer(ifelse(d, runif(n) < .7, runif(n) < .2)),   as.integer(ifelse(d, runif(n) < .6, runif(n) < .2)),   as.integer(ifelse(d, runif(n) < .5, runif(n) < .2)),   as.integer(ifelse(d, runif(n) < .1, runif(n) < .9)),   as.integer(ifelse(d, runif(n) < .1, runif(n) < .9)),   as.integer(ifelse(d, runif(n) < .8, runif(n) < .01)) )  # initial guess at class assignments based on # a hypothetical logistic # regression. Should be based on domain knowledge, or a handful of hand-coded # observations.  
x_sum <- rowSums(X) g <- inv_logit((x_sum - mean(x_sum)) / sd(x_sum))  out <- em_link(X, g, tol = .0001, max_iter = 100)"},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/euclidean-joins.html","id":null,"dir":"Reference","previous_headings":"","what":"Fuzzy joins for Euclidean distance using Locality Sensitive Hashing — euclidean_anti_join","title":"Fuzzy joins for Euclidean distance using Locality Sensitive Hashing — euclidean_anti_join","text":"Fuzzy joins Euclidean distance using Locality Sensitive Hashing","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/euclidean-joins.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Fuzzy joins for Euclidean distance using Locality Sensitive Hashing — euclidean_anti_join","text":"","code":"euclidean_anti_join(   a,   b,   by = NULL,   threshold = 1,   n_bands = 30,   band_width = 5,   r = 0.5,   progress = FALSE )  euclidean_inner_join(   a,   b,   by = NULL,   threshold = 1,   n_bands = 30,   band_width = 5,   r = 0.5,   progress = FALSE )  euclidean_left_join(   a,   b,   by = NULL,   threshold = 1,   n_bands = 30,   band_width = 5,   r = 0.5,   progress = FALSE )  euclidean_right_join(   a,   b,   by = NULL,   threshold = 1,   n_bands = 30,   band_width = 5,   r = 0.5,   progress = FALSE )  euclidean_full_join(   a,   b,   by = NULL,   threshold = 1,   n_bands = 30,   band_width = 5,   r = 0.5,   progress = FALSE )"},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/euclidean-joins.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Fuzzy joins for Euclidean distance using Locality Sensitive Hashing — euclidean_anti_join","text":", b two dataframes join. named vector indicating columns join . Format dplyr: = c(\"column_name_in_df_a\" = \"column_name_in_df_b\"), two columns must specified dataset (x column y column). Specification made dplyr::join_by() also accepted. 
threshold distance threshold units considered match. Note contrary Jaccard joins, value distance similarity. Therefore, lower value means higher similarity. n_bands number bands used minihash algorithm (default 30). Use conjunction band_width determine performance hashing. default settings (.2, .8, .001, .999)-sensitive hash .e. pairs similarity less .2 >.1% chance compared, pairs similarity greater .8 >99.9% chance compared. band_width length band used minihashing algorithm (default 5) Use conjunction n_bands determine performance hashing. default settings (.2, .8, .001, .999)-sensitive hash .e. pairs similarity less .2 >.1% chance compared, pairs similarity greater .8 >99.9% chance compared. r Hyperparameter used govern sensitivity locality sensitive hash. Corresponds width hash bucket LSH algorithm. Increasing values r mean hash collisions higher sensitivity (fewer false-negatives) cost lower specificity (false-positives longer run time). information, see description doi:10.1145/997817.997857 . progress Set TRUE print progress.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/euclidean-joins.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Fuzzy joins for Euclidean distance using Locality Sensitive Hashing — euclidean_anti_join","text":"tibble fuzzily-joined basis variables . Tries adhere standards dplyr-joins, uses logical joining patterns (.e. inner-join joins keeps observations datasets).","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/euclidean-joins.html","id":"references","dir":"Reference","previous_headings":"","what":"References","title":"Fuzzy joins for Euclidean distance using Locality Sensitive Hashing — euclidean_anti_join","text":"Datar, Mayur, Nicole Immorlica, Piotr Indyk, Vahab Mirrokni. 
\"Locality-Sensitive Hashing Scheme Based p-Stable Distributions\" SCG '04: Proceedings twentieth annual symposium Computational geometry (2004): 253-262","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/euclidean-joins.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Fuzzy joins for Euclidean distance using Locality Sensitive Hashing — euclidean_anti_join","text":"","code":"n <- 10  # Build two matrices that have close values X_1 <- matrix(c(seq(0, 1, 1 / (n - 1)), seq(0, 1, 1 / (n - 1))), nrow = n) X_2 <- X_1 + .0000001  X_1 <- as.data.frame(X_1) X_2 <- as.data.frame(X_2)  X_1$id_1 <- 1:n X_2$id_2 <- 1:n  # only keep observations that have a match euclidean_inner_join(X_1, X_2, by = c(\"V1\", \"V2\"), threshold = .00005) #>         V1.x      V2.x id_1      V1.y      V2.y id_2 #> 1  0.7777778 0.7777778    8 0.7777779 0.7777779    8 #> 2  0.3333333 0.3333333    4 0.3333334 0.3333334    4 #> 3  0.6666667 0.6666667    7 0.6666668 0.6666668    7 #> 4  0.1111111 0.1111111    2 0.1111112 0.1111112    2 #> 5  1.0000000 1.0000000   10 1.0000001 1.0000001   10 #> 6  0.2222222 0.2222222    3 0.2222223 0.2222223    3 #> 7  0.4444444 0.4444444    5 0.4444445 0.4444445    5 #> 8  0.8888889 0.8888889    9 0.8888890 0.8888890    9 #> 9  0.5555556 0.5555556    6 0.5555557 0.5555557    6 #> 10 0.0000000 0.0000000    1 0.0000001 0.0000001    1  # keep all observations from X_1, regardless of whether they have a match euclidean_inner_join(X_1, X_2, by = c(\"V1\", \"V2\"), threshold = .00005) #>         V1.x      V2.x id_1      V1.y      V2.y id_2 #> 1  0.3333333 0.3333333    4 0.3333334 0.3333334    4 #> 2  1.0000000 1.0000000   10 1.0000001 1.0000001   10 #> 3  0.6666667 0.6666667    7 0.6666668 0.6666668    7 #> 4  0.4444444 0.4444444    5 0.4444445 0.4444445    5 #> 5  0.0000000 0.0000000    1 0.0000001 0.0000001    1 #> 6  0.8888889 0.8888889    9 0.8888890 0.8888890    9 #> 7  0.5555556 0.5555556    6 
0.5555557 0.5555557    6 #> 8  0.2222222 0.2222222    3 0.2222223 0.2222223    3 #> 9  0.7777778 0.7777778    8 0.7777779 0.7777779    8 #> 10 0.1111111 0.1111111    2 0.1111112 0.1111112    2"},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/euclidean_curve.html","id":null,"dir":"Reference","previous_headings":"","what":"Plot S-Curve for a LSH with given hyperparameters — euclidean_curve","title":"Plot S-Curve for a LSH with given hyperparameters — euclidean_curve","text":"Plot S-Curve LSH given hyperparameters","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/euclidean_curve.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Plot S-Curve for a LSH with given hyperparameters — euclidean_curve","text":"","code":"euclidean_curve(n_bands, band_width, r, up_to = 100)"},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/euclidean_curve.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Plot S-Curve for a LSH with given hyperparameters — euclidean_curve","text":"n_bands number LSH bands calculated band_width number hashes band r \"r\" hyperparameter used govern sensitivity hash. 
up_to right extent x axis.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/euclidean_curve.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Plot S-Curve for a LSH with given hyperparameters — euclidean_curve","text":"plot showing probability pair proposed match, given Euclidean distance two items.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/euclidean_probability.html","id":null,"dir":"Reference","previous_headings":"","what":"Find Probability of Match Based on Similarity — euclidean_probability","title":"Find Probability of Match Based on Similarity — euclidean_probability","text":"Find Probability Match Based Similarity","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/euclidean_probability.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Find Probability of Match Based on Similarity — euclidean_probability","text":"","code":"euclidean_probability(distance, n_bands, band_width, r)"},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/euclidean_probability.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Find Probability of Match Based on Similarity — euclidean_probability","text":"distance Euclidean distance two vectors want compare. n_bands number LSH bands used hashing. band_width number hashes band. 
r \"r\" hyperparameter used govern sensitivity hash.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/euclidean_probability.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Find Probability of Match Based on Similarity — euclidean_probability","text":"decimal number giving probability two items returned candidate pair minihash algorithm.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/hamming-joins.html","id":null,"dir":"Reference","previous_headings":"","what":"Fuzzy joins for Hamming distance using Locality Sensitive Hashing — hamming_inner_join","title":"Fuzzy joins for Hamming distance using Locality Sensitive Hashing — hamming_inner_join","text":"Find similar rows two tables using hamming distance. hamming distance equal number characters two strings differ , equal infinity two strings different lengths","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/hamming-joins.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Fuzzy joins for Hamming distance using Locality Sensitive Hashing — hamming_inner_join","text":"","code":"hamming_inner_join(   a,   b,   by = NULL,   n_bands = 100,   band_width = 8,   threshold = 2,   progress = FALSE,   clean = FALSE,   similarity_column = NULL )  hamming_anti_join(   a,   b,   by = NULL,   n_bands = 100,   band_width = 100,   threshold = 2,   progress = FALSE,   clean = FALSE,   similarity_column = NULL )  hamming_left_join(   a,   b,   by = NULL,   n_bands = 100,   band_width = 100,   threshold = 2,   progress = FALSE,   clean = FALSE,   similarity_column = NULL )  hamming_right_join(   a,   b,   by = NULL,   n_bands = 100,   band_width = 100,   threshold = 2,   progress = FALSE,   clean = FALSE,   similarity_column = NULL )  hamming_full_join(   a,   b,   by = NULL,   n_bands = 100,   band_width = 100,   threshold = 2,   progress = FALSE,   clean = FALSE,   similarity_column = 
NULL )"},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/hamming-joins.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Fuzzy joins for Hamming distance using Locality Sensitive Hashing — hamming_inner_join","text":", b two dataframes join. named vector indicating columns join . Format dplyr: = c(\"column_name_in_df_a\" = \"column_name_in_df_b\"), two columns must specified dataset (x column y column). Specification made dplyr::join_by() also accepted. n_bands number bands used locality sensitive hashing algorithm (default 100). Use conjunction band_width determine performance hashing. Generally speaking, higher number bands leads greater recall cost higher runtime. band_width length band used minihashing algorithm (default 8). Use conjunction n_bands determine performance hashing. Generally speaking wider number bands decreases number false positives, decreasing runtime cost lower sensitivity (true matches less likely found). threshold Hamming distance threshold two strings considered match. distance zero corresponds complete equality strings, distance 'x' two strings means 'x' substitutions must made transform one string . progress Set TRUE print progress. clean strings fuzzy join cleaned (coerced lower-case, stripped punctuation spaces)? Default FALSE. similarity_column optional character vector. provided, data frame contain column name giving Hamming distance two fields. Extra column present anti-joining.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/hamming-joins.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Fuzzy joins for Hamming distance using Locality Sensitive Hashing — hamming_inner_join","text":"tibble fuzzily-joined basis variables . Tries adhere standards dplyr-joins, uses logical joining patterns (.e. 
inner-join joins keeps observations datasets).","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/hamming-joins.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Fuzzy joins for Hamming distance using Locality Sensitive Hashing — hamming_inner_join","text":"","code":"# load baby names data # install.packages(\"babynames\") library(babynames)  baby_names <- data.frame(name = tolower(unique(babynames$name))[1:500]) baby_names_mispelled <- data.frame(   name_mispelled = gsub(\"[aeiouy]\", \"x\", baby_names$name) )  # Run the join and only keep rows that have a match: hamming_inner_join(   baby_names,   baby_names_mispelled,   by = c(\"name\" = \"name_mispelled\"),   threshold = 3,   n_bands = 150,   band_width = 10,   clean = FALSE # default ) #> # A tibble: 2,664 × 2 #>    name  name_mispelled #>    <chr> <chr>          #>  1 alva  xlgx           #>  2 elma  clxx           #>  3 eda   lxx            #>  4 lila  lxzx           #>  5 luna  lxnx           #>  6 eula  lxlx           #>  7 metta mxntx          #>  8 ida   xvx            #>  9 sue   xnx            #> 10 sue   sxx            #> # ℹ 2,654 more rows  # Run the join and keep all rows from the first dataset, regardless of whether # they have a match: hamming_left_join(   baby_names,   baby_names_mispelled,   by = c(\"name\" = \"name_mispelled\"),   threshold = 3,   n_bands = 150,   band_width = 10, ) #> # A tibble: 2,746 × 2 #>    name    name_mispelled #>    <chr>   <chr>          #>  1 bertha  mxrthx         #>  2 ellie   xllxx          #>  3 vesta   vxstx          #>  4 lulu    nxll           #>  5 lue     xrx            #>  6 cleo    xltx           #>  7 rosetta rxsxttx        #>  8 evie    xvxx           #>  9 lou     xnx            #> 10 ann     fxx            #> # ℹ 2,736 more rows"},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/hamming_distance.html","id":null,"dir":"Reference","previous_headings":"","what":"Calculate 
Hamming distance of two character vectors — hamming_distance","title":"Calculate Hamming distance of two character vectors — hamming_distance","text":"Calculate Hamming distance two character vectors","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/hamming_distance.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Calculate Hamming distance of two character vectors — hamming_distance","text":"","code":"hamming_distance(a, b)"},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/hamming_distance.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Calculate Hamming distance of two character vectors — hamming_distance","text":"first character vector b first character vector","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/hamming_distance.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Calculate Hamming distance of two character vectors — hamming_distance","text":"vector hamming similarities strings","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/hamming_distance.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Calculate Hamming distance of two character vectors — hamming_distance","text":"","code":"hamming_distance(   c(\"ACGTCGATGACGTGATGCGTAGCGTA\", \"ACGTCGATGTGCTCTCGTCGATCTAC\"),   c(\"ACGTCGACGACGTGATGCGCAGCGTA\", \"ACGTCGATGGGGTCTCGTCGATCTAC\") ) #> [1] 2 2"},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/hamming_probability.html","id":null,"dir":"Reference","previous_headings":"","what":"Find Probability of Match Based on Similarity — hamming_probability","title":"Find Probability of Match Based on Similarity — hamming_probability","text":"Find Probability Match Based 
Similarity","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/hamming_probability.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Find Probability of Match Based on Similarity — hamming_probability","text":"","code":"hamming_probability(distance, input_length, n_bands, band_width)"},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/hamming_probability.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Find Probability of Match Based on Similarity — hamming_probability","text":"distance hamming distance two strings want compare input_length length (number characters) input strings want calculate. n_bands number LSH bands used hashing. band_width number hashes band.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/hamming_probability.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Find Probability of Match Based on Similarity — hamming_probability","text":"decimal number giving probability two items returned candidate pair LSH algorithm.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard-joins.html","id":null,"dir":"Reference","previous_headings":"","what":"Fuzzy joins for Jaccard distance using MinHash — jaccard_inner_join","title":"Fuzzy joins for Jaccard distance using MinHash — jaccard_inner_join","text":"Fuzzy joins Jaccard distance using MinHash","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard-joins.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Fuzzy joins for Jaccard distance using MinHash — jaccard_inner_join","text":"","code":"jaccard_inner_join(   a,   b,   by = NULL,   block_by = NULL,   n_gram_width = 2,   n_bands = 50,   band_width = 8,   threshold = 0.7,   progress = FALSE,   clean = FALSE,   similarity_column = NULL )  jaccard_anti_join(   a,   b,   by = NULL,   
block_by = NULL,   n_gram_width = 2,   n_bands = 50,   band_width = 8,   threshold = 0.7,   progress = FALSE,   clean = FALSE,   similarity_column = NULL )  jaccard_left_join(   a,   b,   by = NULL,   block_by = NULL,   n_gram_width = 2,   n_bands = 50,   band_width = 8,   threshold = 0.7,   progress = FALSE,   clean = FALSE,   similarity_column = NULL )  jaccard_right_join(   a,   b,   by = NULL,   block_by = NULL,   n_gram_width = 2,   n_bands = 50,   band_width = 8,   threshold = 0.7,   progress = FALSE,   clean = FALSE,   similarity_column = NULL )  jaccard_full_join(   a,   b,   by = NULL,   block_by = NULL,   n_gram_width = 2,   n_bands = 50,   band_width = 8,   threshold = 0.7,   progress = FALSE,   clean = FALSE,   similarity_column = NULL )"},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard-joins.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Fuzzy joins for Jaccard distance using MinHash — jaccard_inner_join","text":", b two dataframes join. named vector indicating columns join . Format dplyr: = c(\"column_name_in_df_a\" = \"column_name_in_df_b\"), two columns must specified dataset (x column y column). Specification made dplyr::join_by() also accepted. block_by named vector indicating column block , rows disagree field considered match. Format dplyr: = c(\"column_name_in_df_a\" = \"column_name_in_df_b\") n_gram_width length n_grams used calculating Jaccard similarity. best performance, set large enough chance string specific n_gram low (.e. n_gram_width = 2 3 matching first names, 5 6 matching entire sentences). n_bands number bands used minihash algorithm (default 40). Use conjunction band_width determine performance hashing. default settings (.2, .8, .001, .999)-sensitive hash .e. pairs similarity less .2 >.1% chance compared, pairs similarity greater .8 >99.9% chance compared. 
band_width length band used minihashing algorithm (default 8) Use conjunction n_bands determine performance hashing. default settings (.2, .8, .001, .999)-sensitive hash .e. pairs similarity less .2 >.1% chance compared, pairs similarity greater .8 >99.9% chance compared. threshold Jaccard similarity threshold two strings considered match (default .95). similarity equal 1 - Jaccard distance two strings, 1 implies strings identical, similarity zero implies strings completely dissimilar. progress Set TRUE print progress. clean strings fuzzy join cleaned (coerced lower-case, stripped punctuation spaces)? Default FALSE. similarity_column optional character vector. provided, data frame contain column name giving Jaccard similarity two fields. Extra column present anti-joining.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard-joins.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Fuzzy joins for Jaccard distance using MinHash — jaccard_inner_join","text":"tibble fuzzily-joined basis variables . Tries adhere standards dplyr-joins, uses logical joining patterns (.e. 
inner-join joins keeps observations datasets).","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard-joins.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Fuzzy joins for Jaccard distance using MinHash — jaccard_inner_join","text":"","code":"# load baby names data # install.packages(\"babynames\") library(babynames)  baby_names <- data.frame(name = tolower(unique(babynames$name))[1:500]) baby_names_sans_vowels <- data.frame(   name_wo_vowels = gsub(\"[aeiouy]\", \"\", baby_names$name) ) # Check the probability two pairs of strings with similarity .8 will be # matched with a band width of 8 and 30 bands using the `jaccard_probability()` # function: jaccard_probability(.8, 30, 8) #> [1] 0.9959518  # Run the join and only keep rows that have a match: jaccard_inner_join(   baby_names,   baby_names_sans_vowels,   by = c(\"name\" = \"name_wo_vowels\"),   threshold = .8,   n_bands = 20,   band_width = 6,   n_gram_width = 1,   clean = FALSE # default ) #> # A tibble: 13 × 2 #>    name     name_wo_vowels #>    <chr>    <chr>          #>  1 savannah svnnh          #>  2 esther   sthr           #>  3 martha   mrth           #>  4 esther   thrs           #>  5 frank    frnk           #>  6 samantha smnth          #>  7 esther   hstr           #>  8 hester   hstr           #>  9 blanch   blnch          #> 10 frank    frnk           #> 11 hester   thrs           #> 12 hester   sthr           #> 13 blanch   blnch           # Run the join and keep all rows from the first dataset, regardless of whether # they have a match: jaccard_left_join(   baby_names,   baby_names_sans_vowels,   by = c(\"name\" = \"name_wo_vowels\"),   threshold = .8,   n_bands = 20,   band_width = 6,   n_gram_width = 1 ) #> # A tibble: 506 × 2 #>    name     name_wo_vowels #>    <chr>    <chr>          #>  1 savannah svnnh          #>  2 samantha smnth          #>  3 esther   hstr           #>  4 hester   sthr           #>  5 hester   
hstr           #>  6 martha   mrth           #>  7 hester   thrs           #>  8 blanch   blnch          #>  9 blanch   blnch          #> 10 esther   thrs           #> # ℹ 496 more rows"},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_curve.html","id":null,"dir":"Reference","previous_headings":"","what":"Plot S-Curve for a LSH with given hyperparameters — jaccard_curve","title":"Plot S-Curve for a LSH with given hyperparameters — jaccard_curve","text":"Plot S-Curve LSH given hyperparameters","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_curve.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Plot S-Curve for a LSH with given hyperparameters — jaccard_curve","text":"","code":"jaccard_curve(n_bands, band_width)"},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_curve.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Plot S-Curve for a LSH with given hyperparameters — jaccard_curve","text":"n_bands number LSH bands calculated band_width number hashes band","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_curve.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Plot S-Curve for a LSH with given hyperparameters — jaccard_curve","text":"plot showing probability pair proposed match, given Jaccard similarity two items.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_curve.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Plot S-Curve for a LSH with given hyperparameters — jaccard_curve","text":"","code":"# Plot the probability two pairs will be matched as a function of their # jaccard similarity, given the hyperparameters n_bands and band_width. 
jaccard_curve(40, 6)"},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_hyper_grid_search.html","id":null,"dir":"Reference","previous_headings":"","what":"Help Choose the Appropriate LSH Hyperparameters — jaccard_hyper_grid_search","title":"Help Choose the Appropriate LSH Hyperparameters — jaccard_hyper_grid_search","text":"Runs grid search find hyperparameters achieve (s1,s2,p1,p2)-sensitive locality sensitive hash. locality sensitive hash can called (s1,s2,p1,p2)-sensitive strings similarity less s1 less p1 chance compared, two strings similarity s2 greater p2 chance compared. example, (.1,.7,.001,.999)-sensitive LSH means strings similarity less .1 .1% chance compared, strings .7 similarity 99.9% chance compared.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_hyper_grid_search.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Help Choose the Appropriate LSH Hyperparameters — jaccard_hyper_grid_search","text":"","code":"jaccard_hyper_grid_search(s1 = 0.1, s2 = 0.7, p1 = 0.001, p2 = 0.999)"},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_hyper_grid_search.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Help Choose the Appropriate LSH Hyperparameters — jaccard_hyper_grid_search","text":"s1 s1 parameter (first similarity). s2 s2 parameter (second similarity, must greater s1). p1 p1 parameter (first probability). 
p2 p2 parameter (second probability, must greater p1).","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_hyper_grid_search.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Help Choose the Appropriate LSH Hyperparameters — jaccard_hyper_grid_search","text":"named vector hyperparameters meet LSH criteria, reducing runtime.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_hyper_grid_search.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Help Choose the Appropriate LSH Hyperparameters — jaccard_hyper_grid_search","text":"","code":"# Help me find the parameters that will minimize runtime while ensuring that # two strings with similarity .1 will be compared less than .1% of the time, # strings with .9 similarity will have a 99.5% chance of being compared: jaccard_hyper_grid_search(.1, .9, .001, .995) #> band_width    n_bands  #>          4          5"},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_probability.html","id":null,"dir":"Reference","previous_headings":"","what":"Find Probability of Match Based on Similarity — jaccard_probability","title":"Find Probability of Match Based on Similarity — jaccard_probability","text":"port lsh_probability function textreuse package, arguments changed reflect hyperparameters package. 
gives probability two strings jaccard similarity similarity matched, given chosen bandwidth number bands.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_probability.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Find Probability of Match Based on Similarity — jaccard_probability","text":"","code":"jaccard_probability(similarity, n_bands, band_width)"},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_probability.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Find Probability of Match Based on Similarity — jaccard_probability","text":"similarity similarity two strings want compare n_bands number LSH bands used hashing. band_width number hashes band.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_probability.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Find Probability of Match Based on Similarity — jaccard_probability","text":"decimal number giving probability two items returned candidate pair minhash algorithm.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_probability.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Find Probability of Match Based on Similarity — jaccard_probability","text":"","code":"# Find the probability two pairs will be matched given they have a # jaccard_similarity of .8, band width of 5, and 50 bands: jaccard_probability(.8, n_bands = 50, band_width = 5) #> [1] 1"},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_similarity.html","id":null,"dir":"Reference","previous_headings":"","what":"Calculate Jaccard Similarity of two character vectors — jaccard_similarity","title":"Calculate Jaccard Similarity of two character vectors — jaccard_similarity","text":"Calculate Jaccard Similarity two character 
vectors","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_similarity.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Calculate Jaccard Similarity of two character vectors — jaccard_similarity","text":"","code":"jaccard_similarity(a, b, ngram_width = 2)"},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_similarity.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Calculate Jaccard Similarity of two character vectors — jaccard_similarity","text":"first character vector b first character vector ngram_width length shingles / ngrams used similarity calculation","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_similarity.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Calculate Jaccard Similarity of two character vectors — jaccard_similarity","text":"vector jaccard similarities strings","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_similarity.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Calculate Jaccard Similarity of two character vectors — jaccard_similarity","text":"","code":"jaccard_similarity(   c(\"the quick brown fox\", \"jumped over the lazy dog\"),   c(\"the quck bron fx\", \"jumped over hte lazy dog\") ) #> [1] 0.5714286 0.7692308"},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_string_group.html","id":null,"dir":"Reference","previous_headings":"","what":"Fuzzy String Grouping Using Minhashing — jaccard_string_group","title":"Fuzzy String Grouping Using Minhashing — jaccard_string_group","text":"Performs fuzzy string grouping similar strings assigned group. Uses cluster_fast_greedy() community detection algorithm igraph package create groups. 
Must igraph installed order use function.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_string_group.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Fuzzy String Grouping Using Minhashing — jaccard_string_group","text":"","code":"jaccard_string_group(   string,   n_gram_width = 2,   n_bands = 45,   band_width = 8,   threshold = 0.7,   progress = FALSE )"},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_string_group.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Fuzzy String Grouping Using Minhashing — jaccard_string_group","text":"string character wish perform entity resolution . n_gram_width length n_grams used calculating jaccard similarity. best performance, set large enough chance string specific n_gram low (.e. n_gram_width = 2 3 matching first names, 5 6 matching entire sentences). n_bands number bands used minhash algorithm (default 45). Use conjunction band_width determine performance hashing. default settings (.2,.8,.001,.999)-sensitive hash .e. pairs similarity less .2 <.1% chance compared, pairs similarity greater .8 >99.9% chance compared. band_width length band used minhashing algorithm (default 8). Use conjunction n_bands determine performance hashing. default settings (.2,.8,.001,.999)-sensitive hash .e. pairs similarity less .2 <.1% chance compared, pairs similarity greater .8 >99.9% chance compared. threshold jaccard similarity threshold two strings considered match (default .7). similarity equal 1 jaccard distance two strings, 1 implies strings identical, similarity zero implies strings completely dissimilar. 
progress set true report progress algorithm","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_string_group.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Fuzzy String Grouping Using Minhashing — jaccard_string_group","text":"string vector storing group element original input strings. input vector grouped similar strings belong group, given standardized name.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_string_group.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Fuzzy String Grouping Using Minhashing — jaccard_string_group","text":"","code":"string <- c(   \"beniamino\", \"jack\", \"benjamin\", \"beniamin\",   \"jacky\", \"giacomo\", \"gaicomo\" ) jaccard_string_group(string, threshold = .2, n_bands = 90, n_gram_width = 1) #> Loading required namespace: igraph #> [1] \"beniamino\" \"jack\"      \"beniamino\" \"beniamino\" \"jack\"      \"giacomo\"   #> [7] \"giacomo\""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/zoomerjoin-package.html","id":null,"dir":"Reference","previous_headings":"","what":"zoomerjoin: Superlatively Fast Fuzzy Joins — zoomerjoin-package","title":"zoomerjoin: Superlatively Fast Fuzzy Joins — zoomerjoin-package","text":"Empowers users fuzzily-merge data frames millions tens millions rows minutes low memory usage. 
package uses locality sensitive hashing algorithms developed Datar, Immorlica, Indyk Mirrokni (2004) doi:10.1145/997817.997857 , Broder (1998) doi:10.1109/SEQUEN.1997.666900  avoid compare every pair records dataset, resulting fuzzy-merges finish linear time.","code":""},{"path":[]},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/zoomerjoin-package.html","id":"author","dir":"Reference","previous_headings":"","what":"Author","title":"zoomerjoin: Superlatively Fast Fuzzy Joins — zoomerjoin-package","text":"Maintainer: Beniamino Green beniamino.green@yale.edu [copyright holder] contributors: Etienne Bacher etienne.bacher@protonmail.com (ORCID) [contributor] authors dependency Rust crates (see inst/AUTHORS file details) [contributor, copyright holder]","code":""},{"path":[]},{"path":"https://beniaminogreen.github.io/zoomerjoin/news/index.html","id":"new-features-development-version","dir":"Changelog","previous_headings":"","what":"New features","title":"zoomerjoin (development version)","text":"Several performance improvements (#101, #104). Added support joining based hamming distance (#100).","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/news/index.html","id":"bug-fixes-development-version","dir":"Changelog","previous_headings":"","what":"Bug fixes","title":"zoomerjoin (development version)","text":"clean = TRUE, strings coerced lower case. now case (#105). Fix argument progress, didn’t print anything TRUE (#107).","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/news/index.html","id":"zoomerjoin-012","dir":"Changelog","previous_headings":"","what":"zoomerjoin 0.1.2","title":"zoomerjoin 0.1.2","text":"Submitted Package CRAN Add support new join_by() syntax Added NEWS.md file track changes package.","code":""}]
+[{"path":[]},{"path":"https://beniaminogreen.github.io/zoomerjoin/CONTRIBUTING.html","id":"our-pledge","dir":"","previous_headings":"","what":"Our Pledge","title":"Contributor Covenant Code of Conduct","text":"members, contributors, leaders pledge make participation community harassment-free experience everyone, regardless age, body size, visible invisible disability, ethnicity, sex characteristics, gender identity expression, level experience, education, socio-economic status, nationality, personal appearance, race, caste, color, religion, sexual identity orientation. pledge act interact ways contribute open, welcoming, diverse, inclusive, healthy community.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/CONTRIBUTING.html","id":"our-standards","dir":"","previous_headings":"","what":"Our Standards","title":"Contributor Covenant Code of Conduct","text":"Examples behavior contributes positive environment community include: Demonstrating empathy kindness toward people respectful differing opinions, viewpoints, experiences Giving gracefully accepting constructive feedback Accepting responsibility apologizing affected mistakes, learning experience Focusing best just us individuals, overall community Examples unacceptable behavior include: use sexualized language imagery, sexual attention advances kind Trolling, insulting derogatory comments, personal political attacks Public private harassment Publishing others’ private information, physical email address, without explicit permission conduct reasonably considered inappropriate professional setting","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/CONTRIBUTING.html","id":"enforcement-responsibilities","dir":"","previous_headings":"","what":"Enforcement Responsibilities","title":"Contributor Covenant Code of Conduct","text":"Community leaders responsible clarifying enforcing standards acceptable behavior take appropriate fair corrective action response behavior deem inappropriate, 
threatening, offensive, harmful. Community leaders right responsibility remove, edit, reject comments, commits, code, wiki edits, issues, contributions aligned Code Conduct, communicate reasons moderation decisions appropriate.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/CONTRIBUTING.html","id":"scope","dir":"","previous_headings":"","what":"Scope","title":"Contributor Covenant Code of Conduct","text":"Code Conduct applies within community spaces, also applies individual officially representing community public spaces. Examples representing community include using official e-mail address, posting via official social media account, acting appointed representative online offline event.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/CONTRIBUTING.html","id":"enforcement","dir":"","previous_headings":"","what":"Enforcement","title":"Contributor Covenant Code of Conduct","text":"Instances abusive, harassing, otherwise unacceptable behavior may reported community leaders responsible enforcement beniamino.green@tutanota.com. complaints reviewed investigated promptly fairly. community leaders obligated respect privacy security reporter incident.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/CONTRIBUTING.html","id":"enforcement-guidelines","dir":"","previous_headings":"","what":"Enforcement Guidelines","title":"Contributor Covenant Code of Conduct","text":"Community leaders follow Community Impact Guidelines determining consequences action deem violation Code Conduct:","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/CONTRIBUTING.html","id":"id_1-correction","dir":"","previous_headings":"Enforcement Guidelines","what":"1. Correction","title":"Contributor Covenant Code of Conduct","text":"Community Impact: Use inappropriate language behavior deemed unprofessional unwelcome community. 
Consequence: private, written warning community leaders, providing clarity around nature violation explanation behavior inappropriate. public apology may requested.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/CONTRIBUTING.html","id":"id_2-warning","dir":"","previous_headings":"Enforcement Guidelines","what":"2. Warning","title":"Contributor Covenant Code of Conduct","text":"Community Impact: violation single incident series actions. Consequence: warning consequences continued behavior. interaction people involved, including unsolicited interaction enforcing Code Conduct, specified period time. includes avoiding interactions community spaces well external channels like social media. Violating terms may lead temporary permanent ban.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/CONTRIBUTING.html","id":"id_3-temporary-ban","dir":"","previous_headings":"Enforcement Guidelines","what":"3. Temporary Ban","title":"Contributor Covenant Code of Conduct","text":"Community Impact: serious violation community standards, including sustained inappropriate behavior. Consequence: temporary ban sort interaction public communication community specified period time. public private interaction people involved, including unsolicited interaction enforcing Code Conduct, allowed period. Violating terms may lead permanent ban.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/CONTRIBUTING.html","id":"id_4-permanent-ban","dir":"","previous_headings":"Enforcement Guidelines","what":"4. Permanent Ban","title":"Contributor Covenant Code of Conduct","text":"Community Impact: Demonstrating pattern violation community standards, including sustained inappropriate behavior, harassment individual, aggression toward disparagement classes individuals. 
Consequence: permanent ban sort public interaction within community.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/CONTRIBUTING.html","id":"attribution","dir":"","previous_headings":"","what":"Attribution","title":"Contributor Covenant Code of Conduct","text":"Code Conduct adapted Contributor Covenant, version 2.1, available https://www.contributor-covenant.org/version/2/1/code_of_conduct.html. Community Impact Guidelines inspired Mozilla’s code conduct enforcement ladder. answers common questions code conduct, see FAQ https://www.contributor-covenant.org/faq. Translations available https://www.contributor-covenant.org/translations.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/LICENSE.html","id":null,"dir":"","previous_headings":"","what":"GNU General Public License","title":"GNU General Public License","text":"Version 3, 29 June 2007Copyright © 2007 Free Software Foundation, Inc. <http://fsf.org/> Everyone permitted copy distribute verbatim copies license document, changing allowed.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/LICENSE.html","id":"preamble","dir":"","previous_headings":"","what":"Preamble","title":"GNU General Public License","text":"GNU General Public License free, copyleft license software kinds works. licenses software practical works designed take away freedom share change works. contrast, GNU General Public License intended guarantee freedom share change versions program–make sure remains free software users. , Free Software Foundation, use GNU General Public License software; applies also work released way authors. can apply programs, . speak free software, referring freedom, price. General Public Licenses designed make sure freedom distribute copies free software (charge wish), receive source code can get want , can change software use pieces new free programs, know can things. protect rights, need prevent others denying rights asking surrender rights. 
Therefore, certain responsibilities distribute copies software, modify : responsibilities respect freedom others. example, distribute copies program, whether gratis fee, must pass recipients freedoms received. must make sure , , receive can get source code. must show terms know rights. Developers use GNU GPL protect rights two steps: (1) assert copyright software, (2) offer License giving legal permission copy, distribute /modify . developers’ authors’ protection, GPL clearly explains warranty free software. users’ authors’ sake, GPL requires modified versions marked changed, problems attributed erroneously authors previous versions. devices designed deny users access install run modified versions software inside , although manufacturer can . fundamentally incompatible aim protecting users’ freedom change software. systematic pattern abuse occurs area products individuals use, precisely unacceptable. Therefore, designed version GPL prohibit practice products. problems arise substantially domains, stand ready extend provision domains future versions GPL, needed protect freedom users. Finally, every program threatened constantly software patents. States allow patents restrict development use software general-purpose computers, , wish avoid special danger patents applied free program make effectively proprietary. prevent , GPL assures patents used render program non-free. precise terms conditions copying, distribution modification follow.","code":""},{"path":[]},{"path":"https://beniaminogreen.github.io/zoomerjoin/LICENSE.html","id":"id_0-definitions","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"0. Definitions","title":"GNU General Public License","text":"“License” refers version 3 GNU General Public License. “Copyright” also means copyright-like laws apply kinds works, semiconductor masks. “Program” refers copyrightable work licensed License. licensee addressed “”. “Licensees” “recipients” may individuals organizations. 
“modify” work means copy adapt part work fashion requiring copyright permission, making exact copy. resulting work called “modified version” earlier work work “based ” earlier work. “covered work” means either unmodified Program work based Program. “propagate” work means anything , without permission, make directly secondarily liable infringement applicable copyright law, except executing computer modifying private copy. Propagation includes copying, distribution (without modification), making available public, countries activities well. “convey” work means kind propagation enables parties make receive copies. Mere interaction user computer network, transfer copy, conveying. interactive user interface displays “Appropriate Legal Notices” extent includes convenient prominently visible feature (1) displays appropriate copyright notice, (2) tells user warranty work (except extent warranties provided), licensees may convey work License, view copy License. interface presents list user commands options, menu, prominent item list meets criterion.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/LICENSE.html","id":"id_1-source-code","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"1. Source Code","title":"GNU General Public License","text":"“source code” work means preferred form work making modifications . “Object code” means non-source form work. “Standard Interface” means interface either official standard defined recognized standards body, , case interfaces specified particular programming language, one widely used among developers working language. “System Libraries” executable work include anything, work whole, () included normal form packaging Major Component, part Major Component, (b) serves enable use work Major Component, implement Standard Interface implementation available public source code form. 
“Major Component”, context, means major essential component (kernel, window system, ) specific operating system () executable work runs, compiler used produce work, object code interpreter used run . “Corresponding Source” work object code form means source code needed generate, install, (executable work) run object code modify work, including scripts control activities. However, include work’s System Libraries, general-purpose tools generally available free programs used unmodified performing activities part work. example, Corresponding Source includes interface definition files associated source files work, source code shared libraries dynamically linked subprograms work specifically designed require, intimate data communication control flow subprograms parts work. Corresponding Source need include anything users can regenerate automatically parts Corresponding Source. Corresponding Source work source code form work.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/LICENSE.html","id":"id_2-basic-permissions","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"2. Basic Permissions","title":"GNU General Public License","text":"rights granted License granted term copyright Program, irrevocable provided stated conditions met. License explicitly affirms unlimited permission run unmodified Program. output running covered work covered License output, given content, constitutes covered work. License acknowledges rights fair use equivalent, provided copyright law. may make, run propagate covered works convey, without conditions long license otherwise remains force. may convey covered works others sole purpose make modifications exclusively , provide facilities running works, provided comply terms License conveying material control copyright. thus making running covered works must exclusively behalf, direction control, terms prohibit making copies copyrighted material outside relationship . 
Conveying circumstances permitted solely conditions stated . Sublicensing allowed; section 10 makes unnecessary.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/LICENSE.html","id":"id_3-protecting-users-legal-rights-from-anti-circumvention-law","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"3. Protecting Users’ Legal Rights From Anti-Circumvention Law","title":"GNU General Public License","text":"covered work shall deemed part effective technological measure applicable law fulfilling obligations article 11 WIPO copyright treaty adopted 20 December 1996, similar laws prohibiting restricting circumvention measures. convey covered work, waive legal power forbid circumvention technological measures extent circumvention effected exercising rights License respect covered work, disclaim intention limit operation modification work means enforcing, work’s users, third parties’ legal rights forbid circumvention technological measures.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/LICENSE.html","id":"id_4-conveying-verbatim-copies","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"4. Conveying Verbatim Copies","title":"GNU General Public License","text":"may convey verbatim copies Program’s source code receive , medium, provided conspicuously appropriately publish copy appropriate copyright notice; keep intact notices stating License non-permissive terms added accord section 7 apply code; keep intact notices absence warranty; give recipients copy License along Program. may charge price price copy convey, may offer support warranty protection fee.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/LICENSE.html","id":"id_5-conveying-modified-source-versions","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"5. 
Conveying Modified Source Versions","title":"GNU General Public License","text":"may convey work based Program, modifications produce Program, form source code terms section 4, provided also meet conditions: ) work must carry prominent notices stating modified , giving relevant date. b) work must carry prominent notices stating released License conditions added section 7. requirement modifies requirement section 4 “keep intact notices”. c) must license entire work, whole, License anyone comes possession copy. License therefore apply, along applicable section 7 additional terms, whole work, parts, regardless packaged. License gives permission license work way, invalidate permission separately received . d) work interactive user interfaces, must display Appropriate Legal Notices; however, Program interactive interfaces display Appropriate Legal Notices, work need make . compilation covered work separate independent works, nature extensions covered work, combined form larger program, volume storage distribution medium, called “aggregate” compilation resulting copyright used limit access legal rights compilation’s users beyond individual works permit. Inclusion covered work aggregate cause License apply parts aggregate.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/LICENSE.html","id":"id_6-conveying-non-source-forms","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"6. Conveying Non-Source Forms","title":"GNU General Public License","text":"may convey covered work object code form terms sections 4 5, provided also convey machine-readable Corresponding Source terms License, one ways: ) Convey object code , embodied , physical product (including physical distribution medium), accompanied Corresponding Source fixed durable physical medium customarily used software interchange. 
b) Convey object code , embodied , physical product (including physical distribution medium), accompanied written offer, valid least three years valid long offer spare parts customer support product model, give anyone possesses object code either (1) copy Corresponding Source software product covered License, durable physical medium customarily used software interchange, price reasonable cost physically performing conveying source, (2) access copy Corresponding Source network server charge. c) Convey individual copies object code copy written offer provide Corresponding Source. alternative allowed occasionally noncommercially, received object code offer, accord subsection 6b. d) Convey object code offering access designated place (gratis charge), offer equivalent access Corresponding Source way place charge. need require recipients copy Corresponding Source along object code. place copy object code network server, Corresponding Source may different server (operated third party) supports equivalent copying facilities, provided maintain clear directions next object code saying find Corresponding Source. Regardless server hosts Corresponding Source, remain obligated ensure available long needed satisfy requirements. e) Convey object code using peer--peer transmission, provided inform peers object code Corresponding Source work offered general public charge subsection 6d. separable portion object code, whose source code excluded Corresponding Source System Library, need included conveying object code work. “User Product” either (1) “consumer product”, means tangible personal property normally used personal, family, household purposes, (2) anything designed sold incorporation dwelling. determining whether product consumer product, doubtful cases shall resolved favor coverage. 
particular product received particular user, “normally used” refers typical common use class product, regardless status particular user way particular user actually uses, expects expected use, product. product consumer product regardless whether product substantial commercial, industrial non-consumer uses, unless uses represent significant mode use product. “Installation Information” User Product means methods, procedures, authorization keys, information required install execute modified versions covered work User Product modified version Corresponding Source. information must suffice ensure continued functioning modified object code case prevented interfered solely modification made. convey object code work section , , specifically use , User Product, conveying occurs part transaction right possession use User Product transferred recipient perpetuity fixed term (regardless transaction characterized), Corresponding Source conveyed section must accompanied Installation Information. requirement apply neither third party retains ability install modified object code User Product (example, work installed ROM). requirement provide Installation Information include requirement continue provide support service, warranty, updates work modified installed recipient, User Product modified installed. Access network may denied modification materially adversely affects operation network violates rules protocols communication across network. Corresponding Source conveyed, Installation Information provided, accord section must format publicly documented (implementation available public source code form), must require special password key unpacking, reading copying.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/LICENSE.html","id":"id_7-additional-terms","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"7. Additional Terms","title":"GNU General Public License","text":"“Additional permissions” terms supplement terms License making exceptions one conditions. 
Additional permissions applicable entire Program shall treated though included License, extent valid applicable law. additional permissions apply part Program, part may used separately permissions, entire Program remains governed License without regard additional permissions. convey copy covered work, may option remove additional permissions copy, part . (Additional permissions may written require removal certain cases modify work.) may place additional permissions material, added covered work, can give appropriate copyright permission. Notwithstanding provision License, material add covered work, may (authorized copyright holders material) supplement terms License terms: ) Disclaiming warranty limiting liability differently terms sections 15 16 License; b) Requiring preservation specified reasonable legal notices author attributions material Appropriate Legal Notices displayed works containing ; c) Prohibiting misrepresentation origin material, requiring modified versions material marked reasonable ways different original version; d) Limiting use publicity purposes names licensors authors material; e) Declining grant rights trademark law use trade names, trademarks, service marks; f) Requiring indemnification licensors authors material anyone conveys material (modified versions ) contractual assumptions liability recipient, liability contractual assumptions directly impose licensors authors. non-permissive additional terms considered “restrictions” within meaning section 10. Program received , part , contains notice stating governed License along term restriction, may remove term. license document contains restriction permits relicensing conveying License, may add covered work material governed terms license document, provided restriction survive relicensing conveying. add terms covered work accord section, must place, relevant source files, statement additional terms apply files, notice indicating find applicable terms. 
Additional terms, permissive non-permissive, may stated form separately written license, stated exceptions; requirements apply either way.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/LICENSE.html","id":"id_8-termination","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"8. Termination","title":"GNU General Public License","text":"may propagate modify covered work except expressly provided License. attempt otherwise propagate modify void, automatically terminate rights License (including patent licenses granted third paragraph section 11). However, cease violation License, license particular copyright holder reinstated () provisionally, unless copyright holder explicitly finally terminates license, (b) permanently, copyright holder fails notify violation reasonable means prior 60 days cessation. Moreover, license particular copyright holder reinstated permanently copyright holder notifies violation reasonable means, first time received notice violation License (work) copyright holder, cure violation prior 30 days receipt notice. Termination rights section terminate licenses parties received copies rights License. rights terminated permanently reinstated, qualify receive new licenses material section 10.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/LICENSE.html","id":"id_9-acceptance-not-required-for-having-copies","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"9. Acceptance Not Required for Having Copies","title":"GNU General Public License","text":"required accept License order receive run copy Program. Ancillary propagation covered work occurring solely consequence using peer--peer transmission receive copy likewise require acceptance. However, nothing License grants permission propagate modify covered work. actions infringe copyright accept License. 
Therefore, modifying propagating covered work, indicate acceptance License .","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/LICENSE.html","id":"id_10-automatic-licensing-of-downstream-recipients","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"10. Automatic Licensing of Downstream Recipients","title":"GNU General Public License","text":"time convey covered work, recipient automatically receives license original licensors, run, modify propagate work, subject License. responsible enforcing compliance third parties License. “entity transaction” transaction transferring control organization, substantially assets one, subdividing organization, merging organizations. propagation covered work results entity transaction, party transaction receives copy work also receives whatever licenses work party’s predecessor interest give previous paragraph, plus right possession Corresponding Source work predecessor interest, predecessor can get reasonable efforts. may impose restrictions exercise rights granted affirmed License. example, may impose license fee, royalty, charge exercise rights granted License, may initiate litigation (including cross-claim counterclaim lawsuit) alleging patent claim infringed making, using, selling, offering sale, importing Program portion .","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/LICENSE.html","id":"id_11-patents","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"11. Patents","title":"GNU General Public License","text":"“contributor” copyright holder authorizes use License Program work Program based. work thus licensed called contributor’s “contributor version”. contributor’s “essential patent claims” patent claims owned controlled contributor, whether already acquired hereafter acquired, infringed manner, permitted License, making, using, selling contributor version, include claims infringed consequence modification contributor version. 
purposes definition, “control” includes right grant patent sublicenses manner consistent requirements License. contributor grants non-exclusive, worldwide, royalty-free patent license contributor’s essential patent claims, make, use, sell, offer sale, import otherwise run, modify propagate contents contributor version. following three paragraphs, “patent license” express agreement commitment, however denominated, enforce patent (express permission practice patent covenant sue patent infringement). “grant” patent license party means make agreement commitment enforce patent party. convey covered work, knowingly relying patent license, Corresponding Source work available anyone copy, free charge terms License, publicly available network server readily accessible means, must either (1) cause Corresponding Source available, (2) arrange deprive benefit patent license particular work, (3) arrange, manner consistent requirements License, extend patent license downstream recipients. “Knowingly relying” means actual knowledge , patent license, conveying covered work country, recipient’s use covered work country, infringe one identifiable patents country reason believe valid. , pursuant connection single transaction arrangement, convey, propagate procuring conveyance , covered work, grant patent license parties receiving covered work authorizing use, propagate, modify convey specific copy covered work, patent license grant automatically extended recipients covered work works based . patent license “discriminatory” include within scope coverage, prohibits exercise , conditioned non-exercise one rights specifically granted License. 
may convey covered work party arrangement third party business distributing software, make payment third party based extent activity conveying work, third party grants, parties receive covered work , discriminatory patent license () connection copies covered work conveyed (copies made copies), (b) primarily connection specific products compilations contain covered work, unless entered arrangement, patent license granted, prior 28 March 2007. Nothing License shall construed excluding limiting implied license defenses infringement may otherwise available applicable patent law.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/LICENSE.html","id":"id_12-no-surrender-of-others-freedom","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"12. No Surrender of Others’ Freedom","title":"GNU General Public License","text":"conditions imposed (whether court order, agreement otherwise) contradict conditions License, excuse conditions License. convey covered work satisfy simultaneously obligations License pertinent obligations, consequence may convey . example, agree terms obligate collect royalty conveying convey Program, way satisfy terms License refrain entirely conveying Program.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/LICENSE.html","id":"id_13-use-with-the-gnu-affero-general-public-license","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"13. Use with the GNU Affero General Public License","title":"GNU General Public License","text":"Notwithstanding provision License, permission link combine covered work work licensed version 3 GNU Affero General Public License single combined work, convey resulting work. 
terms License continue apply part covered work, special requirements GNU Affero General Public License, section 13, concerning interaction network apply combination .","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/LICENSE.html","id":"id_14-revised-versions-of-this-license","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"14. Revised Versions of this License","title":"GNU General Public License","text":"Free Software Foundation may publish revised /new versions GNU General Public License time time. new versions similar spirit present version, may differ detail address new problems concerns. version given distinguishing version number. Program specifies certain numbered version GNU General Public License “later version” applies , option following terms conditions either numbered version later version published Free Software Foundation. Program specify version number GNU General Public License, may choose version ever published Free Software Foundation. Program specifies proxy can decide future versions GNU General Public License can used, proxy’s public statement acceptance version permanently authorizes choose version Program. Later license versions may give additional different permissions. However, additional obligations imposed author copyright holder result choosing follow later version.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/LICENSE.html","id":"id_15-disclaimer-of-warranty","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"15. Disclaimer of Warranty","title":"GNU General Public License","text":"WARRANTY PROGRAM, EXTENT PERMITTED APPLICABLE LAW. EXCEPT OTHERWISE STATED WRITING COPYRIGHT HOLDERS /PARTIES PROVIDE PROGRAM “” WITHOUT WARRANTY KIND, EITHER EXPRESSED IMPLIED, INCLUDING, LIMITED , IMPLIED WARRANTIES MERCHANTABILITY FITNESS PARTICULAR PURPOSE. ENTIRE RISK QUALITY PERFORMANCE PROGRAM . 
PROGRAM PROVE DEFECTIVE, ASSUME COST NECESSARY SERVICING, REPAIR CORRECTION.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/LICENSE.html","id":"id_16-limitation-of-liability","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"16. Limitation of Liability","title":"GNU General Public License","text":"EVENT UNLESS REQUIRED APPLICABLE LAW AGREED WRITING COPYRIGHT HOLDER, PARTY MODIFIES /CONVEYS PROGRAM PERMITTED , LIABLE DAMAGES, INCLUDING GENERAL, SPECIAL, INCIDENTAL CONSEQUENTIAL DAMAGES ARISING USE INABILITY USE PROGRAM (INCLUDING LIMITED LOSS DATA DATA RENDERED INACCURATE LOSSES SUSTAINED THIRD PARTIES FAILURE PROGRAM OPERATE PROGRAMS), EVEN HOLDER PARTY ADVISED POSSIBILITY DAMAGES.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/LICENSE.html","id":"id_17-interpretation-of-sections-15-and-16","dir":"","previous_headings":"TERMS AND CONDITIONS","what":"17. Interpretation of Sections 15 and 16","title":"GNU General Public License","text":"disclaimer warranty limitation liability provided given local legal effect according terms, reviewing courts shall apply local law closely approximates absolute waiver civil liability connection Program, unless warranty assumption liability accompanies copy Program return fee. END TERMS CONDITIONS","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/LICENSE.html","id":"how-to-apply-these-terms-to-your-new-programs","dir":"","previous_headings":"","what":"How to Apply These Terms to Your New Programs","title":"GNU General Public License","text":"develop new program, want greatest possible use public, best way achieve make free software everyone can redistribute change terms. , attach following notices program. safest attach start source file effectively state exclusion warranty; file least “copyright” line pointer full notice found. Also add information contact electronic paper mail. 
program terminal interaction, make output short notice like starts interactive mode: hypothetical commands show w show c show appropriate parts General Public License. course, program’s commands might different; GUI interface, use “box”. also get employer (work programmer) school, , sign “copyright disclaimer” program, necessary. information , apply follow GNU GPL, see <http://www.gnu.org/licenses/>. GNU General Public License permit incorporating program proprietary programs. program subroutine library, may consider useful permit linking proprietary applications library. want , use GNU Lesser General Public License instead License. first, please read <http://www.gnu.org/philosophy/--lgpl.html>.","code":"<one line to give the program's name and a brief idea of what it does.> Copyright (C) <year>  <name of author>  This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.  This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for more details.  You should have received a copy of the GNU General Public License along with this program.  If not, see <http://www.gnu.org/licenses/>. <program>  Copyright (C) <year>  <name of author> This program comes with ABSOLUTELY NO WARRANTY; for details type 'show w'. This is free software, and you are welcome to redistribute it under certain conditions; type 'show c' for details."},{"path":"https://beniaminogreen.github.io/zoomerjoin/articles/benchmarks.html","id":"introduction","dir":"Articles","previous_headings":"","what":"Introduction","title":"Benchmarks","text":"short vignette, show benchmarks zoomerjoin package, comparing excellent fuzzyjoin package. 
two packages designed different things - fuzzyjoin package fast, provides distance functions (well joining modes) - ’s useful comparison shows time can saved using LSH relative pairwise comparisons, long okay using Jaccard similarity. future, hoping expand package implement LSH method edit distance, add benchmarks / feature completed.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/articles/benchmarks.html","id":"benchmarks","dir":"Articles","previous_headings":"","what":"Benchmarks","title":"Benchmarks","text":", show time takes fuzzyjoin zoomerjoin fuzzily join two datasets size dataset increases. Fuzzyjoin initially quick, runtime scales square input size. Zoomerjoin slower small datasets less memory-intensive, scales sum rows dataset, becomes quicker larger datasets.","code":"#> Rows: 60 Columns: 5 #> ── Column specification ──────────────────────────────────────────────────────── #> Delimiter: \",\" #> chr (3): package, join_type, name #> dbl (2): n, value #>  #> ℹ Use `spec()` to retrieve the full column specification for this data. 
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message."},{"path":"https://beniaminogreen.github.io/zoomerjoin/articles/benchmarks.html","id":"benchmarking-code","dir":"Articles","previous_headings":"","what":"Benchmarking Code:","title":"Benchmarks","text":", include code used generate benchmarks:","code":"library(zoomerjoin) library(fuzzyjoin) library(tidyverse) library(microbenchmark) library(profmem)   # Sample million rows from DIME dataset data_1 <- as.data.frame(sample_n(dime_data, 10^6)) names(data_1) <- c(\"id_1\", \"name\") data_2 <- as.data.frame(sample_n(dime_data, 10^6)) names(data_2) <- c(\"id_2\", \"name\")  # Generate datasets for euclidean join benchmarking n <- 10^5 p <- 50 X <- matrix(rnorm(n * p), n, p) X_1 <- as.data.frame(X) X_2 <- as.data.frame(X + .000000001)  # Get time and memory use statistics for fuzzyjoin when performing jaccard join fuzzy_jaccard_bench <- function(n) {   time <- microbenchmark(     stringdist_inner_join(data_1[1:n, ],       data_2[1:n, ],       method = \"jaccard\",       max_dist = .6,       q = 4     ),     times = 10   )$time %>%     median()    mem <- profmem(stringdist_inner_join(data_1[1:n, ],     data_2[1:n, ],     method = \"jaccard\",     max_dist = .6,     q = 4   )) %>%     total()    return(c(time = time, memory = mem)) }   # Get time and memory use statistics for zoomerjoin when performing jaccard join zoomer_jaccard_bench <- function(n) {   time <- microbenchmark(     jaccard_inner_join(data_1[1:n, ], data_2[1:n, ],       by = \"name\", band_width = 11,       n_bands = 350, threshold = .7,       n_gram_width = 4     ),     times = 50   )$time %>%     median()    mem <- profmem(     jaccard_inner_join(data_1[1:n, ], data_2[1:n, ],       by = \"name\", band_width = 11,       n_bands = 350, threshold = .7,       n_gram_width = 4     )   ) %>%     total()    return(c(time = time, memory = mem)) }  # Get time and memory use statistics for fuzzyjoin when performing Euclidean join 
fuzzy_euclid_bench <- function(n) {   time <- microbenchmark(     distance_join(X_1[1:n, ], X_2[1:n, ], max_dist = .1, method = \"euclidean\"),     times = 10   )$time %>%     median()    mem <- total(profmem(     distance_join(X_1[1:n, ], X_2[1:n, ], max_dist = .1, method = \"euclidean\")   ))    return(c(time = time, memory = mem)) }  # Get time and memory use statistics for zoomerjoin when performing Euclidean join zoomer_euclid_bench <- function(n) {   time <- microbenchmark(     euclidean_inner_join(X_1[1:n, ], X_2[1:n, ],       threshold = .1, n_bands = 90,       band_width = 2, r = .1     ),     times = 50   )$time %>%     median()    mem <- profmem(euclidean_inner_join(X_1[1:n, ], X_2[1:n, ],     threshold = .1, n_bands = 90,     band_width = 2, r = .1   )) %>%     total()    return(c(time = time, memory = mem)) }   # Run Grid of Jaccard Benchmarks, Collect results into DF n <- seq(500, 4000, 250) names(n) <- n fuzzy_jacard_benches <- map_df(n, fuzzy_jaccard_bench, .id = \"n\") zoomer_jacard_benches <- map_df(n, zoomer_jaccard_bench, .id = \"n\") fuzzy_jacard_benches$package <- \"fuzzyjoin\" zoomer_jacard_benches$package <- \"zoomerjoin\" jaccard_benches <- bind_rows(fuzzy_jacard_benches, zoomer_jacard_benches) jaccard_benches$join_type <- \"Jaccard Distance\"  # Run Grid of Euclidean Benchmarks, Collect results into DF n <- seq(250, 4000, 250) names(n) <- n fuzzy_euclid_benches <- map_df(n, fuzzy_euclid_bench, .id = \"n\") zoomer_euclid_benches <- map_df(n, zoomer_euclid_bench, .id = \"n\") fuzzy_euclid_benches$package <- \"fuzzyjoin\" zoomer_euclid_benches$package <- \"zoomerjoin\" euclid_benches <- bind_rows(fuzzy_euclid_benches, zoomer_euclid_benches) euclid_benches$join_type <- \"Euclidean Distance\"  sim_data <- bind_rows(euclid_benches, jaccard_benches) %>%   pivot_longer(c(time, memory)) %>%   mutate(value = ifelse(name == \"time\", value / 10^9, value / 10^6)) # convert ns to s and bytes to MB (10^6 bytes).  
write_csv(sim_data, \"sim_data.csv\")"},{"path":"https://beniaminogreen.github.io/zoomerjoin/articles/guided_tour.html","id":"introduction","dir":"Articles","previous_headings":"","what":"Introduction:","title":"A Zoomerjoin Guided Tour","text":"vignette gives basic overview core functionality zoomerjoin package. Zoomerjoin empowers fuzzily-match datasets millions rows seconds, staying light memory usage. makes feasible perform fuzzy-joins datasets hundreds millions observations matter minutes.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/articles/guided_tour.html","id":"how-does-it-work","dir":"Articles","previous_headings":"","what":"How Does it Work?","title":"A Zoomerjoin Guided Tour","text":"Zoomerjoin’s blazingly fast joins string distance made possible optimized, performant implementation MinHash algorithm written Rust. conventional joining packages compare pairs records two datasets wish join, MinHash algorithm manages compare similar records . results matches orders magnitudes faster matching software packages: zoomerjoin takes hours minutes join datasets taken centuries join using matching methods.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/articles/guided_tour.html","id":"basic-syntax","dir":"Articles","previous_headings":"","what":"Basic Syntax:","title":"A Zoomerjoin Guided Tour","text":"’re familiar logical-join syntax dplyr, already know use fuzzy join join two datasets. Zoomerjoin provides jaccard_inner_join() jaccard_full_join() (among others), fuzzy-joining analogues corresponding dplyr functions. demonstrate syntax using package join corpuses, formed entries Database Ideology, Money Politics, Elections (DIME) (Bonica 2016). first corpus looks follows: second looks follows: two Corpuses can’t directly joined misspellings. means must use fuzzy-matching capabilities zoomerjoin: first two arguments, , b, direct analogues dplyr arguments, two data frames want join. 
field also acts ‘dplyr’ (provides function columns want match ). n_gram_width parameter determines wide n-grams used similarity evaluation , threshold argument determines similar pair strings (Jaccard similarity) considered match. Users stringdist fuzzyjoin package familiar arguments, bear mind packages measure string distance (distance 0 indicates complete similarity), package operates string similarity, threshold .8 keep matches 80% Jaccard similarity. n_bands band_width parameters govern performance LSH. default parameters perform well medium-size (n < 10^7) datasets matches somewhat similar (similarity > .8), may require tuning settings. jaccard_hyper_grid_search(), jaccard_curve() functions can help select parameters given properties LSH desire. example, can use jaccard_curve() function plot probability pair records compared possible Jaccard distance, \\(d\\) zero one:  looking plot produced, can see using hyperparameters, comparisons almost never made pairs records Jaccard similarity less .2 (saving time), pairs records Jaccard similarity greater .8 almost always compared (giving low false-negative rate). 
details hyperparameters, textreuse package excellent vignette, zoomerjoin provides re-implementation profiling tools, jaccard_probability, jaccard_bandwidth (although implementations differ slightly hyperparameters package different).","code":"library(tidyverse) ## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ── ## ✔ dplyr     1.1.4     ✔ readr     2.1.5 ## ✔ forcats   1.0.0     ✔ stringr   1.5.1 ## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1 ## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1 ## ✔ purrr     1.0.2      ## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ── ## ✖ dplyr::filter() masks stats::filter() ## ✖ dplyr::lag()    masks stats::lag() ## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors library(microbenchmark) library(fuzzyjoin) library(zoomerjoin)  corpus_1 <- dime_data %>% # dime data is packaged with zoomerjoin   head(500) names(corpus_1) <- c(\"a\", \"field\") corpus_1 ## # A tibble: 500 × 2 ##        a field                                                                   ##    <dbl> <chr>                                                                   ##  1     1 ufwa cope committee                                                     ##  2     2 committee to re elect charles e. 
bennett                                ##  3     3 montana democratic party non federal account                            ##  4     4 mississippi power & light company management political action and educ… ##  5     5 napus pac for postmasters                                               ##  6     6 aminoil good government fund                                            ##  7     7 national women's political caucus of california                         ##  8     8 minnesota gun owners' political victory fund                            ##  9     9 metropolitan detroit afl cio cope committee                             ## 10    10 carpenters legislative improvement committee united brotherhood of car… ## # ℹ 490 more rows corpus_2 <- dime_data %>% # dime data is packaged with zoomerjoin   tail(500) names(corpus_2) <- c(\"b\", \"field\") corpus_2 ## # A tibble: 500 × 2 ##        b field                                                                   ##    <dbl> <chr>                                                                   ##  1   501 citizens for derwinski                                                  ##  2   502 progressive victory fund greater washington americans for democratic a… ##  3   503 ingham county democratic party federal campaign fund                    ##  4   504 committee for a stronger future                                         ##  5   505 atoka country supper committee                                          ##  6   506 friends of democracy pac inc                                            ##  7   507 baypac                                                                  ##  8   508 international brotherhood of electrical workers local union 278 cope/p… ##  9   509 louisville & jefferson county republican executive committee            ## 10   510 democratic party of virginia                                            ## # ℹ 490 more rows set.seed(1) start_time <- Sys.time() join_out <- jaccard_inner_join(corpus_1, corpus_2,   
by = \"field\", n_gram_width = 6,   n_bands = 20, band_width = 6, threshold = .8 ) print(Sys.time() - start_time) ## Time difference of 0.01161695 secs print(join_out) ## # A tibble: 8 × 4 ##       a field.x                                                      b field.y   ##   <dbl> <chr>                                                    <dbl> <chr>     ## 1   378 guarini for congress 1982                                  606 guarini … ## 2   378 guarini for congress 1982                                  883 guarini … ## 3   238 4th congressional district democratic party                518 16th con… ## 4    88 scheuer for congress 1980                                  667 scheuer … ## 5   230 pipefitters local union 524                                998 pipefitt… ## 6   302 americans for good government inc                          910 american… ## 7   292 bill bradley for u s senate '84                            913 bill bra… ## 8   319 7th congressional district democratic party of wisconsin   792 8th cong… jaccard_curve(20, 6)"},{"path":"https://beniaminogreen.github.io/zoomerjoin/articles/guided_tour.html","id":"standardizing-string-names-after-a-merge","dir":"Articles","previous_headings":"","what":"Standardizing String Names After A Merge","title":"A Zoomerjoin Guided Tour","text":"Often merging, can help standardize names fields joined . way, can assign unique label identifying key observations similar value merging variable. jaccard_string_group() function makes possible. first performs locality sensitive hashing identify similar pairs observations within dataset, runs community detection algorithm identify clusters similar observations, assigned label. community-detection algorithm, fastgreedy.community() igraph package runs log-linear time, entire algorithm completes linearithmic time. 
’s short snippet showing can use jaccard_string_group() standardize set organization names.","code":"organization_names <- c(   \"American Civil Liberties Union\",   \"American Civil Liberties Union (ACLU)\",   \"NRA National Rifle Association\",   \"National Rifle Association NRA\",   \"National Rifle Association\",   \"Planned Parenthood\",   \"Blue Cross\" ) standardized_organization_names <- jaccard_string_group(organization_names, threshold = .5, band_width = 3) ## Loading required namespace: igraph print(standardized_organization_names) ## [1] \"American Civil Liberties Union\" \"American Civil Liberties Union\" ## [3] \"NRA National Rifle Association\" \"NRA National Rifle Association\" ## [5] \"NRA National Rifle Association\" \"Planned Parenthood\"             ## [7] \"Blue Cross\""},{"path":"https://beniaminogreen.github.io/zoomerjoin/articles/guided_tour.html","id":"references","dir":"Articles","previous_headings":"Standardizing String Names After A Merge","what":"References:","title":"A Zoomerjoin Guided Tour","text":"Bonica, Adam. 2016. Database Ideology, Money Politics, Elections: Public version 2.0 [Computer file]. Stanford, CA: Stanford University Libraries.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/articles/matching_vectors.html","id":"introduction","dir":"Articles","previous_headings":"","what":"Introduction","title":"Matching Vectors Based on Euclidean Distance","text":"flagship feature zoomerjoin tidy joins strings using Jaccard distance, zoomerjoin also allows join vectors using Euclidean distance. can useful joining addresses coordinates space. 
Unlike nearest-neighbor methods KD-trees, joins slow dimension coordinates increases, zoomerjoin can used find close points high-dimensional space (word embeddings).","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/articles/matching_vectors.html","id":"demonstration","dir":"Articles","previous_headings":"","what":"Demonstration","title":"Matching Vectors Based on Euclidean Distance","text":"demonstration, create simulated dataset 10^5 points distributed uniformly within 100-dimensional hypercube. join another dataset copy first point shifted tiny random amount. now want join two datasets together. Euclidean joins take 3 hyperparameters: n_bands, band_width, r. chosen problem domain (although defaults generally sensible). use euclidean_probability function package understand probability two observations distance .01 identified match variety hyperparameter configurations. Using n_bands=40, band_width=8, r=.15 seems provide good balance identifying true matches (pairs less .01 apart guaranteed found) reducing number un-promising comparisons (pairs greater .1 apart unlikely compared). use euclidean_inner_join find matching pairs across two datasets: Zoomerjoin able easily find pairs just 30s (perhaps longer runner renders website), even though points lie high-dimensional (d=100) space. 
makes zoomerjoin useful tool trying join find matches datasets word document embeddings.","code":"n <- 10^5 # number of data points d <- 10^2 # dimension  # Create a matrix of 10^5 observations in R^100 X <- matrix(runif(n * d), n, d) # Second Dataset is a copy of the first with points shifted an infinitesimal # amount X_2 <- as.data.frame(X + matrix(rnorm(n * d, 0, .0001), n, d)) X <- as.data.frame(X) euclidean_probability(.01, n_bands = 5, band_width = 8, r = .25) #> [1] 0.9993764 euclidean_probability(.1, n_bands = 5, band_width = 8, r = .25) #> [1] 0.2141322 euclidean_probability(.01, n_bands = 10, band_width = 4, r = .15) #> [1] 0.9999999 euclidean_probability(.1, n_bands = 10, band_width = 4, r = .15) #> [1] 0.4956251 euclidean_probability(.01, n_bands = 40, band_width = 8, r = .15) #> [1] 1 euclidean_probability(.1, n_bands = 40, band_width = 8, r = .15) #> [1] 0.16091 set.seed(1) start <- Sys.time() joined_out <- euclidean_inner_join(   X,   X_2,   threshold = .01,   n_bands = 40,   band_width = 8,   r = .15 ) n_matches <- nrow(joined_out) time_taken <- Sys.time() - start print(paste(\"found\", n_matches, \"matches in\", round(time_taken), \"seconds\")) #> [1] \"found 100000 matches in 16 seconds\""},{"path":"https://beniaminogreen.github.io/zoomerjoin/authors.html","id":null,"dir":"","previous_headings":"","what":"Authors","title":"Authors and Citation","text":"Beniamino Green. Author, maintainer, copyright holder. Etienne Bacher. Contributor. authors dependency Rust crates. Contributor, copyright holder.            see inst/AUTHORS file details","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/authors.html","id":"citation","dir":"","previous_headings":"","what":"Citation","title":"Authors and Citation","text":"Green B (2024). zoomerjoin: Superlatively Fast Fuzzy Joins. 
R package version 0.1.2.9000, https://beniamino.org/zoomerjoin/.","code":"@Manual{,   title = {zoomerjoin: Superlatively Fast Fuzzy Joins},   author = {Beniamino Green},   year = {2024},   note = {R package version 0.1.2.9000},   url = {https://beniamino.org/zoomerjoin/}, }"},{"path":"https://beniaminogreen.github.io/zoomerjoin/index.html","id":"zoomerjoin-","dir":"","previous_headings":"","what":"Superlatively Fast Fuzzy Joins","title":"Superlatively Fast Fuzzy Joins","text":"zoomerjoin R package empowers fuzzy-join massive datasets rapidly, little memory consumption. powered high-performance implementations Locality Sensitive Hashing, algorithm finds matches records two datasets without compare possible pairs observations. practice, means zoomerjoin can fuzzily-join datasets days, even years faster matching packages. zoomerjoin used -production join datasets hundreds millions names vectors matter hours.","code":""},{"path":[]},{"path":"https://beniaminogreen.github.io/zoomerjoin/index.html","id":"installing-from-cran","dir":"","previous_headings":"Installation","what":"Installing from CRAN:","title":"Superlatively Fast Fuzzy Joins","text":"can install CRAN package. Please aware Cargo (rust toolchain compiler) installed build package source.","code":"install.packages('zoomerjoin')"},{"path":"https://beniaminogreen.github.io/zoomerjoin/index.html","id":"installing-from-r-universe","dir":"","previous_headings":"Installation","what":"Installing from R-Universe:","title":"Superlatively Fast Fuzzy Joins","text":"package distributed using r-universe, provides pre-compiled binaries common operating systems recent versions R. 
install r-universe, can use following command R:","code":"install.packages(   'zoomerjoin',   repos = c('https://beniaminogreen.r-universe.dev', getOption(\"repos\")) )"},{"path":"https://beniaminogreen.github.io/zoomerjoin/index.html","id":"installing-rust","dir":"","previous_headings":"Installation","what":"Installing Rust","title":"Superlatively Fast Fuzzy Joins","text":"operating system version R installed, must Rust compiler installed compile package sources. package compiled, Rust longer required, can safely uninstalled.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/index.html","id":"installing-rust-on-linux-or-mac","dir":"","previous_headings":"Installation > Installing Rust","what":"Installing Rust on Linux or Mac:","title":"Superlatively Fast Fuzzy Joins","text":"install Rust Linux Mac, can simply run following snippet terminal.","code":"curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh"},{"path":"https://beniaminogreen.github.io/zoomerjoin/index.html","id":"installing-rust-on-windows","dir":"","previous_headings":"Installation > Installing Rust","what":"Installing Rust on Windows:","title":"Superlatively Fast Fuzzy Joins","text":"install Rust windows, can use Rust installation wizard, rustup-init.exe, found site. Depending version Windows, may see error looks something like : case, run rustup install stable-x86_64-pc-windows-gnu install missing toolchain. 
’re missing another toolchain, simply type place stable-x86_64-pc-windows-gnu command .","code":"error: toolchain 'stable-x86_64-pc-windows-gnu' is not installed"},{"path":"https://beniaminogreen.github.io/zoomerjoin/index.html","id":"installing-package-from-github","dir":"","previous_headings":"Installation","what":"Installing Package from Github:","title":"Superlatively Fast Fuzzy Joins","text":"Rust installed, able install package either install.packages function , using install_github function devtools package pkg_install function pak package.","code":"## Install with devtools # install.packages(\"devtools\") devtools::install_github(\"beniaminogreen/zoomerjoin\")  ## Install with pak # install.packages(\"pak\") pak::pkg_install(\"beniaminogreen/zoomerjoin\")"},{"path":"https://beniaminogreen.github.io/zoomerjoin/index.html","id":"loading-the-package","dir":"","previous_headings":"Installation","what":"Loading The Package","title":"Superlatively Fast Fuzzy Joins","text":"package installed, can load memory usual typing:","code":"library(zoomerjoin)"},{"path":"https://beniaminogreen.github.io/zoomerjoin/index.html","id":"usage","dir":"","previous_headings":"","what":"Usage:","title":"Superlatively Fast Fuzzy Joins","text":"flagship feature zoomerjoins jaccard_join euclidean family functions, designed near drop-ins corresponding dplyr/fuzzyjoin commands: jaccard_left_join() jaccard_right_join() jaccard_inner_join() jaccard_full_join() euclidean_left_join() euclidean_right_join() euclidean_inner_join() euclidean_full_join() jaccard_join family functions provide fast fuzzy-joins strings using Jaccard distance euclidean_join family provides fuzzy-joins points vectors using Euclidean distance.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/index.html","id":"example-joining-rows-of-the-database-on-ideology-money-in-politics-and-elections","dir":"","previous_headings":"Usage:","what":"Example: Joining rows of the Database on Ideology, Money in 
Politics, and Elections","title":"Superlatively Fast Fuzzy Joins","text":"(DIME) ’s snippet showing use jaccard_inner_join() merge two lists political donors Database Ideology, Money Politics, Elections (DIME). can see detailed example vignette introductory vignette. start two corpuses like combine, corpus_1: corpus_2: corpuses observation ID column, donor name column. like join two datasets donor names column, two can’t directly joined misspellings. , use jaccard_inner_join function fuzzily join two donor name column. Importantly, Locality Sensitive Hashing probabilistic algorithm, may fail identify matches random chance. adjust hyperparameters n_bands band_width chance true matches dropped negligible. default, package issue warning chance true match discovered less 95%. can use jaccard_probability jaccard_hyper_grid_search help understand probability true matches discarded algorithm. details thorough description tune hyperparameters can found guided tour vignette. Zoomerjoin able quickly find matching columns without comparing pairs records. saves time size list increases, can scale join datasets millions hundreds millions rows.","code":"corpus_1 <- dime_data %>%     head(500) names(corpus_1) <- c(\"a\", \"field\") corpus_1 ## # A tibble: 500 × 2 ##        a field                                                                   ##    <dbl> <chr>                                                                   ##  1     1 ufwa cope committee                                                     ##  2     2 committee to re elect charles e. 
bennett                                ##  3     3 montana democratic party non federal account                            ##  4     4 mississippi power & light company management political action and educ… ##  5     5 napus pac for postmasters                                               ##  6     6 aminoil good government fund                                            ##  7     7 national women's political caucus of california                         ##  8     8 minnesota gun owners' political victory fund                            ##  9     9 metropolitan detroit afl cio cope committee                             ## 10    10 carpenters legislative improvement committee united brotherhood of car… ## # ℹ 490 more rows corpus_2 <- dime_data %>%     tail(500) names(corpus_2) <- c(\"b\", \"field\") corpus_2 ## # A tibble: 500 × 2 ##        b field                                                                   ##    <dbl> <chr>                                                                   ##  1   501 citizens for derwinski                                                  ##  2   502 progressive victory fund greater washington americans for democratic a… ##  3   503 ingham county democratic party federal campaign fund                    ##  4   504 committee for a stronger future                                         ##  5   505 atoka country supper committee                                          ##  6   506 friends of democracy pac inc                                            ##  7   507 baypac                                                                  ##  8   508 international brotherhood of electrical workers local union 278 cope/p… ##  9   509 louisville & jefferson county republican executive committee            ## 10   510 democratic party of virginia                                            ## # ℹ 490 more rows set.seed(1) start_time <- Sys.time() join_out <- jaccard_inner_join(corpus_1, corpus_2, n_gram_width=6, n_bands=20, 
band_width=6) ## Warning in jaccard_join(a, b, mode = \"inner\", by = by, salt_by = block_by, : A pair of records at the threshold (0.7) have only a 92% chance of being compared. ## Please consider changing `n_bands` and `band_width`.  ## Joining by 'field' print(Sys.time() - start_time) ## Time difference of 0.01455116 secs print(join_out) ## # A tibble: 19 × 4 ##        a field.x                                                      b field.y  ##    <dbl> <chr>                                                    <dbl> <chr>    ##  1   216 kent county republican finance committee                   607 lake co… ##  2   238 4th congressional district democratic party                518 16th co… ##  3   292 bill bradley for u s senate '84                            913 bill br… ##  4   378 guarini for congress 1982                                  606 guarini… ##  5   232 republican county committee of chester county              710 republi… ##  6   387 committee to re elect congressman staton                   805 committ… ##  7   122 tarrant county republican victory fund                     761 lake co… ##  8   378 guarini for congress 1982                                  883 guarini… ##  9   238 4th congressional district democratic party                792 8th con… ## 10    88 scheuer for congress 1980                                  667 scheuer… ## 11    45 dole for senate committee                                  623 riegle … ## 12    87 kentucky state democratic central executive committee      639 arizona… ## 13   319 7th congressional district democratic party of wisconsin   792 8th con… ## 14   478 united democrats for better government                     642 democra… ## 15   163 davies county republican executive committee               852 warren … ## 16   230 pipefitters local union 524                                998 pipefit… ## 17   216 kent county republican finance committee                   719 harford… ## 18   302 americans for good 
government inc                          910 america… ## 19    35 solarz for congress 82                                     671 solarz …"},{"path":"https://beniaminogreen.github.io/zoomerjoin/index.html","id":"contributing","dir":"","previous_headings":"","what":"Contributing","title":"Superlatively Fast Fuzzy Joins","text":"Thanks interest contributing Zoomerjoin! using GitHub-centric workflow manage package; can file bug report, request new feature, ask question package filing issue issues page, also find range templates help . ’d like make changes code, can write file pull request page. ’ll try respond timely manner (within week) although occasionally may take longer respond complicated question issue. Please also aware contributor code conduct contributing repository.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/index.html","id":"acknowledgments","dir":"","previous_headings":"","what":"Acknowledgments:","title":"Superlatively Fast Fuzzy Joins","text":"Zoomerjoin made using SQL join illustration Germanx speed limit sign Federal Highway Administration - MUTCD.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/index.html","id":"references","dir":"","previous_headings":"","what":"References:","title":"Superlatively Fast Fuzzy Joins","text":"Bonica, Adam. 2016. Database Ideology, Money Politics, Elections: Public version 2.0 [Computer file]. Stanford, CA: Stanford University Libraries. Jure Leskovec, Anand Rajaraman, Jeffrey David Ullman. 2014. Mining Massive Datasets (2nd. ed.). Cambridge University Press, USA. Broder, Andrei Z. (1997), “resemblance containment documents”, Compression Complexity Sequences: Proceedings. 
Positano, Salerno, Italy","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/dime_data.html","id":null,"dir":"Reference","previous_headings":"","what":"Donors from DIME Database — dime_data","title":"Donors from DIME Database — dime_data","text":"set donor names Database Ideology, Money Politics, Elections (DIME).  dataset used benchmark 2021 APSR paper Adaptive Fuzzy String Matching: Merge Datasets One (Messy) Identifying Field Aaron R. Kaufman Aja Klevs, dataset package subset data replication archive paper. full dataset can found paper's replication materials : doi:10.7910/DVN/4031UL .","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/dime_data.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Donors from DIME Database — dime_data","text":"","code":"dime_data"},{"path":[]},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/dime_data.html","id":"dime-data","dir":"Reference","previous_headings":"","what":"dime_data","title":"Donors from DIME Database — dime_data","text":"data frame 10,000 rows 2 columns: id Numeric ID / Row Number x Donor Name","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/dime_data.html","id":"references","dir":"Reference","previous_headings":"","what":"References","title":"Donors from DIME Database — dime_data","text":"doi:10.7910/DVN/4031UL","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/dime_data.html","id":"author","dir":"Reference","previous_headings":"","what":"Author","title":"Donors from DIME Database — dime_data","text":"Adam Bonica","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/em_link.html","id":null,"dir":"Reference","previous_headings":"","what":"Fit a Probabilistic Matching Model using Naive Bayes + E.M. — em_link","title":"Fit a Probabilistic Matching Model using Naive Bayes + E.M. 
— em_link","text":"Rust implementation Naive Bayes / Fellegi-Sunter model record linkage detailed article \"Using Probabilistic Model Assist Merging Large-Scale Administrative Records\" Enamorado, Fifield Imai (2019). Takes integer matrix describing similarities possible pair observations, vector initial guesses probability pair match (can either set domain knowledge, one can hand-label subset data leave rest p=.5). Iteratively refines guesses using Expectation Maximization algorithm optimum reached. details, see doi:10.1017/S0003055418000783 .","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/em_link.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Fit a Probabilistic Matching Model using Naive Bayes + E.M. — em_link","text":"","code":"em_link(X, g, tol = 10^-6, max_iter = 10^3)"},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/em_link.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Fit a Probabilistic Matching Model using Naive Bayes + E.M. — em_link","text":"X integer matrix similarities. Must go 0 (disagreement) maximum without \"gaps\" unused levels. example, column values 0,1,2,3 valid column, 0,1,2,4 three omitted g vector initial guesses iteratively improved using EM algorithm (personal approach guess logistic regression coefficients use create initial probability guesses). chosen avoid model getting stuck local optimum, avoid problem label-switching, labels matches non-matches reversed. tol tolerance sense infinity norm. .e. close parameters iterations EM algorithm terminates. max_iter iterations algorithm error converged.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/em_link.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Fit a Probabilistic Matching Model using Naive Bayes + E.M. 
— em_link","text":"vector probabilities representing posterior probability record pair match.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/em_link.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Fit a Probabilistic Matching Model using Naive Bayes + E.M. — em_link","text":"","code":"inv_logit <- function(x) {   exp(x) / (1 + exp(x)) } n <- 10^6 d <- 1:n %% 5 == 0 X <- cbind(   as.integer(ifelse(d, runif(n) < .8, runif(n) < .2)),   as.integer(ifelse(d, runif(n) < .9, runif(n) < .2)),   as.integer(ifelse(d, runif(n) < .7, runif(n) < .2)),   as.integer(ifelse(d, runif(n) < .6, runif(n) < .2)),   as.integer(ifelse(d, runif(n) < .5, runif(n) < .2)),   as.integer(ifelse(d, runif(n) < .1, runif(n) < .9)),   as.integer(ifelse(d, runif(n) < .1, runif(n) < .9)),   as.integer(ifelse(d, runif(n) < .8, runif(n) < .01)) )  # initial guess at class assignments based on # a hypothetical logistic # regression. Should be based on domain knowledge, or a handful of hand-coded # observations.  
x_sum <- rowSums(X) g <- inv_logit((x_sum - mean(x_sum)) / sd(x_sum))  out <- em_link(X, g, tol = .0001, max_iter = 100)"},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/euclidean-joins.html","id":null,"dir":"Reference","previous_headings":"","what":"Fuzzy joins for Euclidean distance using Locality Sensitive Hashing — euclidean_anti_join","title":"Fuzzy joins for Euclidean distance using Locality Sensitive Hashing — euclidean_anti_join","text":"Fuzzy joins Euclidean distance using Locality Sensitive Hashing","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/euclidean-joins.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Fuzzy joins for Euclidean distance using Locality Sensitive Hashing — euclidean_anti_join","text":"","code":"euclidean_anti_join(   a,   b,   by = NULL,   threshold = 1,   n_bands = 30,   band_width = 5,   r = 0.5,   progress = FALSE )  euclidean_inner_join(   a,   b,   by = NULL,   threshold = 1,   n_bands = 30,   band_width = 5,   r = 0.5,   progress = FALSE )  euclidean_left_join(   a,   b,   by = NULL,   threshold = 1,   n_bands = 30,   band_width = 5,   r = 0.5,   progress = FALSE )  euclidean_right_join(   a,   b,   by = NULL,   threshold = 1,   n_bands = 30,   band_width = 5,   r = 0.5,   progress = FALSE )  euclidean_full_join(   a,   b,   by = NULL,   threshold = 1,   n_bands = 30,   band_width = 5,   r = 0.5,   progress = FALSE )"},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/euclidean-joins.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Fuzzy joins for Euclidean distance using Locality Sensitive Hashing — euclidean_anti_join","text":", b two dataframes join. named vector indicating columns join . Format dplyr: = c(\"column_name_in_df_a\" = \"column_name_in_df_b\"), two columns must specified dataset (x column y column). Specification made dplyr::join_by() also accepted. 
threshold distance threshold units considered match. Note contrary Jaccard joins, value distance similarity. Therefore, lower value means higher similarity. n_bands number bands used locality sensitive hashing algorithm (default 30). Use conjunction band_width determine performance hashing. default settings (.2, .8, .001, .999)-sensitive hash .e. pairs similarity less .2 <.1% chance compared, pairs similarity greater .8 >99.9% chance compared. band_width length band used locality sensitive hashing algorithm (default 5) Use conjunction n_bands determine performance hashing. default settings (.2, .8, .001, .999)-sensitive hash .e. pairs similarity less .2 <.1% chance compared, pairs similarity greater .8 >99.9% chance compared. r Hyperparameter used govern sensitivity locality sensitive hash. Corresponds width hash bucket LSH algorithm. Increasing values r mean hash collisions higher sensitivity (fewer false-negatives) cost lower specificity (false-positives longer run time). information, see description doi:10.1145/997817.997857 . progress Set TRUE print progress.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/euclidean-joins.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Fuzzy joins for Euclidean distance using Locality Sensitive Hashing — euclidean_anti_join","text":"tibble fuzzily-joined basis variables . Tries adhere standards dplyr-joins, uses logical joining patterns (.e. inner-join joins keeps observations datasets).","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/euclidean-joins.html","id":"references","dir":"Reference","previous_headings":"","what":"References","title":"Fuzzy joins for Euclidean distance using Locality Sensitive Hashing — euclidean_anti_join","text":"Datar, Mayur, Nicole Immorlica, Piotr Indyk, Vahab Mirrokni. 
\"Locality-Sensitive Hashing Scheme Based p-Stable Distributions\" SCG '04: Proceedings twentieth annual symposium Computational geometry (2004): 253-262","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/euclidean-joins.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Fuzzy joins for Euclidean distance using Locality Sensitive Hashing — euclidean_anti_join","text":"","code":"n <- 10  # Build two matrices that have close values X_1 <- matrix(c(seq(0, 1, 1 / (n - 1)), seq(0, 1, 1 / (n - 1))), nrow = n) X_2 <- X_1 + .0000001  X_1 <- as.data.frame(X_1) X_2 <- as.data.frame(X_2)  X_1$id_1 <- 1:n X_2$id_2 <- 1:n  # only keep observations that have a match euclidean_inner_join(X_1, X_2, by = c(\"V1\", \"V2\"), threshold = .00005) #>         V1.x      V2.x id_1      V1.y      V2.y id_2 #> 1  0.7777778 0.7777778    8 0.7777779 0.7777779    8 #> 2  0.2222222 0.2222222    3 0.2222223 0.2222223    3 #> 3  0.4444444 0.4444444    5 0.4444445 0.4444445    5 #> 4  0.5555556 0.5555556    6 0.5555557 0.5555557    6 #> 5  0.3333333 0.3333333    4 0.3333334 0.3333334    4 #> 6  0.8888889 0.8888889    9 0.8888890 0.8888890    9 #> 7  1.0000000 1.0000000   10 1.0000001 1.0000001   10 #> 8  0.0000000 0.0000000    1 0.0000001 0.0000001    1 #> 9  0.1111111 0.1111111    2 0.1111112 0.1111112    2 #> 10 0.6666667 0.6666667    7 0.6666668 0.6666668    7  # keep all observations from X_1, regardless of whether they have a match euclidean_inner_join(X_1, X_2, by = c(\"V1\", \"V2\"), threshold = .00005) #>         V1.x      V2.x id_1      V1.y      V2.y id_2 #> 1  1.0000000 1.0000000   10 1.0000001 1.0000001   10 #> 2  0.1111111 0.1111111    2 0.1111112 0.1111112    2 #> 3  0.5555556 0.5555556    6 0.5555557 0.5555557    6 #> 4  0.7777778 0.7777778    8 0.7777779 0.7777779    8 #> 5  0.0000000 0.0000000    1 0.0000001 0.0000001    1 #> 6  0.3333333 0.3333333    4 0.3333334 0.3333334    4 #> 7  0.4444444 0.4444444    5 
0.4444445 0.4444445    5 #> 8  0.2222222 0.2222222    3 0.2222223 0.2222223    3 #> 9  0.6666667 0.6666667    7 0.6666668 0.6666668    7 #> 10 0.8888889 0.8888889    9 0.8888890 0.8888890    9"},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/euclidean_curve.html","id":null,"dir":"Reference","previous_headings":"","what":"Plot S-Curve for a LSH with given hyperparameters — euclidean_curve","title":"Plot S-Curve for a LSH with given hyperparameters — euclidean_curve","text":"Plot S-Curve LSH given hyperparameters","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/euclidean_curve.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Plot S-Curve for a LSH with given hyperparameters — euclidean_curve","text":"","code":"euclidean_curve(n_bands, band_width, r, up_to = 100)"},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/euclidean_curve.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Plot S-Curve for a LSH with given hyperparameters — euclidean_curve","text":"n_bands number LSH bands calculated band_width number hashes band r \"r\" hyperparameter used govern sensitivity hash. 
up_to right extent x axis.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/euclidean_curve.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Plot S-Curve for a LSH with given hyperparameters — euclidean_curve","text":"plot showing probability pair proposed match, given Euclidean distance two items.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/euclidean_probability.html","id":null,"dir":"Reference","previous_headings":"","what":"Find Probability of Match Based on Similarity — euclidean_probability","title":"Find Probability of Match Based on Similarity — euclidean_probability","text":"Find Probability Match Based Similarity","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/euclidean_probability.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Find Probability of Match Based on Similarity — euclidean_probability","text":"","code":"euclidean_probability(distance, n_bands, band_width, r)"},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/euclidean_probability.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Find Probability of Match Based on Similarity — euclidean_probability","text":"distance Euclidean distance two vectors want compare. n_bands number LSH bands used hashing. band_width number hashes band. 
r \"r\" hyperparameter used govern sensitivity hash.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/euclidean_probability.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Find Probability of Match Based on Similarity — euclidean_probability","text":"decimal number giving probability two items returned candidate pair LSH algorithm.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/hamming-joins.html","id":null,"dir":"Reference","previous_headings":"","what":"Fuzzy joins for Hamming distance using Locality Sensitive Hashing — hamming_inner_join","title":"Fuzzy joins for Hamming distance using Locality Sensitive Hashing — hamming_inner_join","text":"Find similar rows two tables using Hamming distance. Hamming distance equal number characters two strings differ , equal infinity two strings different lengths","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/hamming-joins.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Fuzzy joins for Hamming distance using Locality Sensitive Hashing — hamming_inner_join","text":"","code":"hamming_inner_join(   a,   b,   by = NULL,   n_bands = 100,   band_width = 8,   threshold = 2,   progress = FALSE,   clean = FALSE,   similarity_column = NULL )  hamming_anti_join(   a,   b,   by = NULL,   n_bands = 100,   band_width = 100,   threshold = 2,   progress = FALSE,   clean = FALSE,   similarity_column = NULL )  hamming_left_join(   a,   b,   by = NULL,   n_bands = 100,   band_width = 100,   threshold = 2,   progress = FALSE,   clean = FALSE,   similarity_column = NULL )  hamming_right_join(   a,   b,   by = NULL,   n_bands = 100,   band_width = 100,   threshold = 2,   progress = FALSE,   clean = FALSE,   similarity_column = NULL )  hamming_full_join(   a,   b,   by = NULL,   n_bands = 100,   band_width = 100,   threshold = 2,   progress = FALSE,   clean = FALSE,   similarity_column = 
NULL )"},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/hamming-joins.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Fuzzy joins for Hamming distance using Locality Sensitive Hashing — hamming_inner_join","text":", b two dataframes join. named vector indicating columns join . Format dplyr: = c(\"column_name_in_df_a\" = \"column_name_in_df_b\"), two columns must specified dataset (x column y column). Specification made dplyr::join_by() also accepted. n_bands number bands used locality sensitive hashing algorithm (default 100). Use conjunction band_width determine performance hashing. Generally speaking, higher number bands leads greater recall cost higher runtime. band_width length band used minihashing algorithm (default 8). Use conjunction n_bands determine performance hashing. Generally speaking wider number bands decreases number false positives, decreasing runtime cost lower sensitivity (true matches less likely found). threshold Hamming distance threshold two strings considered match. distance zero corresponds complete equality strings, distance 'x' two strings means 'x' substitutions must made transform one string . progress Set TRUE print progress. clean strings fuzzy join cleaned (coerced lower-case, stripped punctuation spaces)? Default FALSE. similarity_column optional character vector. provided, data frame contain column name giving Hamming distance two fields. Extra column present anti-joining.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/hamming-joins.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Fuzzy joins for Hamming distance using Locality Sensitive Hashing — hamming_inner_join","text":"tibble fuzzily-joined basis variables . Tries adhere standards dplyr-joins, uses logical joining patterns (.e. 
inner-join joins keeps observations datasets).","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/hamming-joins.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Fuzzy joins for Hamming distance using Locality Sensitive Hashing — hamming_inner_join","text":"","code":"# load baby names data # install.packages(\"babynames\") library(babynames)  baby_names <- data.frame(name = tolower(unique(babynames$name))[1:500]) baby_names_mispelled <- data.frame(   name_mispelled = gsub(\"[aeiouy]\", \"x\", baby_names$name) )  # Run the join and only keep rows that have a match: hamming_inner_join(   baby_names,   baby_names_mispelled,   by = c(\"name\" = \"name_mispelled\"),   threshold = 3,   n_bands = 150,   band_width = 10,   clean = FALSE # default ) #> # A tibble: 2,664 × 2 #>    name   name_mispelled #>    <chr>  <chr>          #>  1 oma    sxx            #>  2 anna   lxnx           #>  3 may    mxx            #>  4 liza   lxdx           #>  5 iola   xllx           #>  6 helena hxlxnx         #>  7 rosa   rxtx           #>  8 olga   clxx           #>  9 elsa   xlsx           #> 10 cecil  cxcxl          #> # ℹ 2,654 more rows  # Run the join and keep all rows from the first dataset, regardless of whether # they have a match: hamming_left_join(   baby_names,   baby_names_mispelled,   by = c(\"name\" = \"name_mispelled\"),   threshold = 3,   n_bands = 150,   band_width = 10, ) #> # A tibble: 2,746 × 2 #>    name      name_mispelled #>    <chr>     <chr>          #>  1 vina      nxnx           #>  2 lenna     lxndx          #>  3 belle     mxllx          #>  4 lois      lxxh           #>  5 betty     mxttx          #>  6 christina chrxstxnx      #>  7 deborah   dxbxrxh        #>  8 lona      lxdx           #>  9 mabelle   mxbxllx        #> 10 rena      jxnx           #> # ℹ 2,736 more 
rows"},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/hamming_distance.html","id":null,"dir":"Reference","previous_headings":"","what":"Calculate Hamming distance of two character vectors — hamming_distance","title":"Calculate Hamming distance of two character vectors — hamming_distance","text":"Calculate Hamming distance two character vectors","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/hamming_distance.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Calculate Hamming distance of two character vectors — hamming_distance","text":"","code":"hamming_distance(a, b)"},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/hamming_distance.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Calculate Hamming distance of two character vectors — hamming_distance","text":"first character vector b second character vector","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/hamming_distance.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Calculate Hamming distance of two character vectors — hamming_distance","text":"vector Hamming distances strings","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/hamming_distance.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Calculate Hamming distance of two character vectors — hamming_distance","text":"","code":"hamming_distance(   c(\"ACGTCGATGACGTGATGCGTAGCGTA\", \"ACGTCGATGTGCTCTCGTCGATCTAC\"),   c(\"ACGTCGACGACGTGATGCGCAGCGTA\", \"ACGTCGATGGGGTCTCGTCGATCTAC\") ) #> [1] 2 2"},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/hamming_probability.html","id":null,"dir":"Reference","previous_headings":"","what":"Find Probability of Match Based on Similarity — hamming_probability","title":"Find Probability of Match Based on Similarity — 
hamming_probability","text":"Find Probability Match Based Similarity","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/hamming_probability.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Find Probability of Match Based on Similarity — hamming_probability","text":"","code":"hamming_probability(distance, input_length, n_bands, band_width)"},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/hamming_probability.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Find Probability of Match Based on Similarity — hamming_probability","text":"distance Hamming distance two strings want compare input_length length (number characters) input strings want calculate. n_bands number LSH bands used hashing. band_width number hashes band.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/hamming_probability.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Find Probability of Match Based on Similarity — hamming_probability","text":"decimal number giving probability two items returned candidate pair LSH algorithm.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard-joins.html","id":null,"dir":"Reference","previous_headings":"","what":"Fuzzy joins for Jaccard distance using MinHash — jaccard_inner_join","title":"Fuzzy joins for Jaccard distance using MinHash — jaccard_inner_join","text":"Fuzzy joins Jaccard distance using MinHash","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard-joins.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Fuzzy joins for Jaccard distance using MinHash — jaccard_inner_join","text":"","code":"jaccard_inner_join(   a,   b,   by = NULL,   block_by = NULL,   n_gram_width = 2,   n_bands = 50,   band_width = 8,   threshold = 0.7,   progress = FALSE,   clean = FALSE,   similarity_column = 
NULL )  jaccard_anti_join(   a,   b,   by = NULL,   block_by = NULL,   n_gram_width = 2,   n_bands = 50,   band_width = 8,   threshold = 0.7,   progress = FALSE,   clean = FALSE,   similarity_column = NULL )  jaccard_left_join(   a,   b,   by = NULL,   block_by = NULL,   n_gram_width = 2,   n_bands = 50,   band_width = 8,   threshold = 0.7,   progress = FALSE,   clean = FALSE,   similarity_column = NULL )  jaccard_right_join(   a,   b,   by = NULL,   block_by = NULL,   n_gram_width = 2,   n_bands = 50,   band_width = 8,   threshold = 0.7,   progress = FALSE,   clean = FALSE,   similarity_column = NULL )  jaccard_full_join(   a,   b,   by = NULL,   block_by = NULL,   n_gram_width = 2,   n_bands = 50,   band_width = 8,   threshold = 0.7,   progress = FALSE,   clean = FALSE,   similarity_column = NULL )"},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard-joins.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Fuzzy joins for Jaccard distance using MinHash — jaccard_inner_join","text":", b two dataframes join. named vector indicating columns join . Format dplyr: = c(\"column_name_in_df_a\" = \"column_name_in_df_b\"), two columns must specified dataset (x column y column). Specification made dplyr::join_by() also accepted. block_by named vector indicating column block , rows disagree field considered match. Format dplyr: = c(\"column_name_in_df_a\" = \"column_name_in_df_b\") n_gram_width length n_grams used calculating Jaccard similarity. best performance, set large enough chance string specific n_gram low (.e. n_gram_width = 2 3 matching first names, 5 6 matching entire sentences). n_bands number bands used MinHash algorithm (default 50). Use conjunction band_width determine performance hashing. default settings (.2, .8, .001, .999)-sensitive hash .e. pairs similarity less .2 <.1% chance compared, pairs similarity greater .8 >99.9% chance compared. 
band_width length band used MinHash algorithm (default 8) Use conjunction n_bands determine performance hashing. default settings (.2, .8, .001, .999)-sensitive hash .e. pairs similarity less .2 <.1% chance compared, pairs similarity greater .8 >99.9% chance compared. threshold Jaccard similarity threshold two strings considered match (default .7). similarity equal 1 - Jaccard distance two strings, 1 implies strings identical, similarity zero implies strings completely dissimilar. progress Set TRUE print progress. clean strings fuzzy join cleaned (coerced lower-case, stripped punctuation spaces)? Default FALSE. similarity_column optional character vector. provided, data frame contain column name giving Jaccard similarity two fields. Extra column present anti-joining.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard-joins.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Fuzzy joins for Jaccard distance using MinHash — jaccard_inner_join","text":"tibble fuzzily-joined basis variables . Tries adhere standards dplyr-joins, uses logical joining patterns (.e. 
inner-join joins keeps observations datasets).","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard-joins.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Fuzzy joins for Jaccard distance using MinHash — jaccard_inner_join","text":"","code":"# load baby names data # install.packages(\"babynames\") library(babynames)  baby_names <- data.frame(name = tolower(unique(babynames$name))[1:500]) baby_names_sans_vowels <- data.frame(   name_wo_vowels = gsub(\"[aeiouy]\", \"\", baby_names$name) ) # Check the probability two pairs of strings with similarity .8 will be # matched with a band width of 8 and 30 bands using the `jaccard_probability()` # function: jaccard_probability(.8, 30, 8) #> [1] 0.9959518  # Run the join and only keep rows that have a match: jaccard_inner_join(   baby_names,   baby_names_sans_vowels,   by = c(\"name\" = \"name_wo_vowels\"),   threshold = .8,   n_bands = 20,   band_width = 6,   n_gram_width = 1,   clean = FALSE # default ) #> # A tibble: 13 × 2 #>    name     name_wo_vowels #>    <chr>    <chr>          #>  1 hester   hstr           #>  2 blanch   blnch          #>  3 frank    frnk           #>  4 esther   thrs           #>  5 hester   sthr           #>  6 frank    frnk           #>  7 blanch   blnch          #>  8 esther   sthr           #>  9 martha   mrth           #> 10 samantha smnth          #> 11 esther   hstr           #> 12 hester   thrs           #> 13 savannah svnnh           # Run the join and keep all rows from the first dataset, regardless of whether # they have a match: jaccard_left_join(   baby_names,   baby_names_sans_vowels,   by = c(\"name\" = \"name_wo_vowels\"),   threshold = .8,   n_bands = 20,   band_width = 6,   n_gram_width = 1 ) #> # A tibble: 506 × 2 #>    name     name_wo_vowels #>    <chr>    <chr>          #>  1 esther   hstr           #>  2 samantha smnth          #>  3 hester   hstr           #>  4 frank    frnk           #>  5 blanch   
blnch          #>  6 frank    frnk           #>  7 martha   mrth           #>  8 hester   thrs           #>  9 esther   thrs           #> 10 hester   sthr           #> # ℹ 496 more rows"},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_curve.html","id":null,"dir":"Reference","previous_headings":"","what":"Plot S-Curve for a LSH with given hyperparameters — jaccard_curve","title":"Plot S-Curve for a LSH with given hyperparameters — jaccard_curve","text":"Plot S-Curve LSH given hyperparameters","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_curve.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Plot S-Curve for a LSH with given hyperparameters — jaccard_curve","text":"","code":"jaccard_curve(n_bands, band_width)"},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_curve.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Plot S-Curve for a LSH with given hyperparameters — jaccard_curve","text":"n_bands number LSH bands calculated band_width number hashes band","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_curve.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Plot S-Curve for a LSH with given hyperparameters — jaccard_curve","text":"plot showing probability pair proposed match, given Jaccard similarity two items.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_curve.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Plot S-Curve for a LSH with given hyperparameters — jaccard_curve","text":"","code":"# Plot the probability two pairs will be matched as a function of their # jaccard similarity, given the hyperparameters n_bands and band_width. 
jaccard_curve(40, 6)"},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_hyper_grid_search.html","id":null,"dir":"Reference","previous_headings":"","what":"Help Choose the Appropriate LSH Hyperparameters — jaccard_hyper_grid_search","title":"Help Choose the Appropriate LSH Hyperparameters — jaccard_hyper_grid_search","text":"Runs grid search find hyperparameters achieve (s1,s2,p1,p2)-sensitive locality sensitive hash. locality sensitive hash can called (s1,s2,p1,p2)-sensitive strings similarity less s1 less p1 chance compared, two strings similarity s2 greater p2 chance compared. example, (.1,.7,.001,.999)-sensitive LSH means strings similarity less .1 .1% chance compared, strings .7 similarity 99.9% chance compared.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_hyper_grid_search.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Help Choose the Appropriate LSH Hyperparameters — jaccard_hyper_grid_search","text":"","code":"jaccard_hyper_grid_search(s1 = 0.1, s2 = 0.7, p1 = 0.001, p2 = 0.999)"},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_hyper_grid_search.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Help Choose the Appropriate LSH Hyperparameters — jaccard_hyper_grid_search","text":"s1 s1 parameter (first similarity). s2 s2 parameter (second similarity, must greater s1). p1 p1 parameter (first probability). 
p2 p2 parameter (second probability, must greater p1).","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_hyper_grid_search.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Help Choose the Appropriate LSH Hyperparameters — jaccard_hyper_grid_search","text":"named vector hyperparameters meet LSH criteria, reducing runtime.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_hyper_grid_search.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Help Choose the Appropriate LSH Hyperparameters — jaccard_hyper_grid_search","text":"","code":"# Help me find the parameters that will minimize runtime while ensuring that # two strings with similarity .1 will be compared less than .1% of the time, # strings with .9 similarity will have a 99.5% chance of being compared: jaccard_hyper_grid_search(.1, .9, .001, .995) #> band_width    n_bands  #>          4          5"},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_probability.html","id":null,"dir":"Reference","previous_headings":"","what":"Find Probability of Match Based on Similarity — jaccard_probability","title":"Find Probability of Match Based on Similarity — jaccard_probability","text":"port lsh_probability function textreuse package, arguments changed reflect hyperparameters package. 
gives probability two strings jaccard similarity similarity matched, given chosen bandwidth number bands.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_probability.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Find Probability of Match Based on Similarity — jaccard_probability","text":"","code":"jaccard_probability(similarity, n_bands, band_width)"},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_probability.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Find Probability of Match Based on Similarity — jaccard_probability","text":"similarity similarity two strings want compare n_bands number LSH bands used hashing. band_width number hashes band.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_probability.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Find Probability of Match Based on Similarity — jaccard_probability","text":"decimal number giving probability two items returned candidate pair minhash algorithm.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_probability.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Find Probability of Match Based on Similarity — jaccard_probability","text":"","code":"# Find the probability two pairs will be matched given they have a # jaccard_similarity of .8, band width of 5, and 50 bands: jaccard_probability(.8, n_bands = 50, band_width = 5) #> [1] 1"},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_similarity.html","id":null,"dir":"Reference","previous_headings":"","what":"Calculate Jaccard Similarity of two character vectors — jaccard_similarity","title":"Calculate Jaccard Similarity of two character vectors — jaccard_similarity","text":"Calculate Jaccard Similarity two character 
vectors","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_similarity.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Calculate Jaccard Similarity of two character vectors — jaccard_similarity","text":"","code":"jaccard_similarity(a, b, ngram_width = 2)"},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_similarity.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Calculate Jaccard Similarity of two character vectors — jaccard_similarity","text":"first character vector b second character vector ngram_width length shingles / ngrams used similarity calculation","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_similarity.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Calculate Jaccard Similarity of two character vectors — jaccard_similarity","text":"vector jaccard similarities strings","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_similarity.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Calculate Jaccard Similarity of two character vectors — jaccard_similarity","text":"","code":"jaccard_similarity(   c(\"the quick brown fox\", \"jumped over the lazy dog\"),   c(\"the quck bron fx\", \"jumped over hte lazy dog\") ) #> [1] 0.5714286 0.7692308"},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_string_group.html","id":null,"dir":"Reference","previous_headings":"","what":"Fuzzy String Grouping Using Minhashing — jaccard_string_group","title":"Fuzzy String Grouping Using Minhashing — jaccard_string_group","text":"Performs fuzzy string grouping similar strings assigned group. Uses cluster_fast_greedy() community detection algorithm igraph package create groups. 
Must igraph installed order use function.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_string_group.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Fuzzy String Grouping Using Minhashing — jaccard_string_group","text":"","code":"jaccard_string_group(   string,   n_gram_width = 2,   n_bands = 45,   band_width = 8,   threshold = 0.7,   progress = FALSE )"},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_string_group.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Fuzzy String Grouping Using Minhashing — jaccard_string_group","text":"string character wish perform entity resolution . n_gram_width length n_grams used calculating jaccard similarity. best performance, set large enough chance string specific n_gram low (.e. n_gram_width = 2 3 matching first names, 5 6 matching entire sentences). n_bands number bands used minhash algorithm (default 45). Use conjunction band_width determine performance hashing. default settings (.2,.8,.001,.999)-sensitive hash .e. pairs similarity less .2 <.1% chance compared, pairs similarity greater .8 >99.9% chance compared. band_width length band used minhashing algorithm (default 8) Use conjunction n_bands determine performance hashing. default settings (.2,.8,.001,.999)-sensitive hash .e. pairs similarity less .2 <.1% chance compared, pairs similarity greater .8 >99.9% chance compared. threshold jaccard similarity threshold two strings considered match (default .7). similarity equal 1 - jaccard distance two strings, 1 implies strings identical, similarity zero implies strings completely dissimilar. 
progress set true report progress algorithm","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_string_group.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Fuzzy String Grouping Using Minhashing — jaccard_string_group","text":"string vector storing group element original input strings. input vector grouped similar strings belong group, given standardized name.","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/jaccard_string_group.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Fuzzy String Grouping Using Minhashing — jaccard_string_group","text":"","code":"string <- c(   \"beniamino\", \"jack\", \"benjamin\", \"beniamin\",   \"jacky\", \"giacomo\", \"gaicomo\" ) jaccard_string_group(string, threshold = .2, n_bands = 90, n_gram_width = 1) #> Loading required namespace: igraph #> [1] \"beniamino\" \"jack\"      \"beniamino\" \"beniamino\" \"jack\"      \"giacomo\"   #> [7] \"giacomo\""},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/zoomerjoin-package.html","id":null,"dir":"Reference","previous_headings":"","what":"zoomerjoin: Superlatively Fast Fuzzy Joins — zoomerjoin-package","title":"zoomerjoin: Superlatively Fast Fuzzy Joins — zoomerjoin-package","text":"Empowers users fuzzily-merge data frames millions tens millions rows minutes low memory usage. 
package uses locality sensitive hashing algorithms developed Datar, Immorlica, Indyk Mirrokni (2004) doi:10.1145/997817.997857 , Broder (1998) doi:10.1109/SEQUEN.1997.666900  avoid compare every pair records dataset, resulting fuzzy-merges finish linear time.","code":""},{"path":[]},{"path":"https://beniaminogreen.github.io/zoomerjoin/reference/zoomerjoin-package.html","id":"author","dir":"Reference","previous_headings":"","what":"Author","title":"zoomerjoin: Superlatively Fast Fuzzy Joins — zoomerjoin-package","text":"Maintainer: Beniamino Green beniamino.green@yale.edu [copyright holder] contributors: Etienne Bacher etienne.bacher@protonmail.com (ORCID) [contributor] authors dependency Rust crates (see inst/AUTHORS file details) [contributor, copyright holder]","code":""},{"path":[]},{"path":"https://beniaminogreen.github.io/zoomerjoin/news/index.html","id":"new-features-development-version","dir":"Changelog","previous_headings":"","what":"New features","title":"zoomerjoin (development version)","text":"Several performance improvements (#101, #104). Added support joining based hamming distance (#100).","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/news/index.html","id":"bug-fixes-development-version","dir":"Changelog","previous_headings":"","what":"Bug fixes","title":"zoomerjoin (development version)","text":"clean = TRUE, strings coerced lower case. now case (#105). Fix argument progress, didn’t print anything TRUE (#107).","code":""},{"path":"https://beniaminogreen.github.io/zoomerjoin/news/index.html","id":"zoomerjoin-012","dir":"Changelog","previous_headings":"","what":"zoomerjoin 0.1.2","title":"zoomerjoin 0.1.2","text":"Submitted Package CRAN Add support new join_by() syntax Added NEWS.md file track changes package.","code":""}]