updated readme and codemeta
beniaminogreen committed Jan 26, 2024
1 parent 23afc9d commit 8d7e4d4
Showing 3 changed files with 67 additions and 57 deletions.
80 changes: 45 additions & 35 deletions README.html
@@ -616,7 +616,7 @@ <h1 id="zoomerjoin-">zoomerjoin
observations. In practice, this means zoomerjoin can fuzzily-join
datasets days, or even years faster than other matching packages.
zoomerjoin has been used in-production to join datasets of hundreds of
-millions of names in a few hours.</p>
+millions of names or vectors in a matter of hours.</p>
<h2 id="installation">Installation</h2>
<h3 id="installing-from-cran">Installing from CRAN:</h3>
<p>You can install from the CRAN as you would with any other package.
@@ -695,18 +695,18 @@ <h3 id="example-joining-rows-of-the-database-on-ideology-money-in-politics-and-e
Elections</h3>
<p>(DIME)</p>
<p>Here’s a snippet showing off how to use the
-<code>lhs_inner_join()</code> merge two datasets of political donors in
+<code>jaccard_inner_join()</code> merge two lists of political donors in
the <a href="https://data.stanford.edu/dime">Database on Ideology, Money
in Politics, and Elections (DIME)</a>. You can see a more detailed
example of this vignette in the <a href="https://beniamino.org/zoomerjoin/articles/guided_tour.html">introductory
vignette</a>.</p>
<p>I start with two corpuses I would like to combine,
<code>corpus_1</code>:</p>
<div class="sourceCode" id="cb7"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb7-1"><a href="#cb7-1" tabindex="-1"></a>corpus_1 <span class="ot">&lt;-</span> dime_data <span class="sc">%&gt;%</span></span>
-<span id="cb7-2"><a href="#cb7-2" tabindex="-1"></a> <span class="fu">head</span>(<span class="dv">500000</span>)</span>
+<span id="cb7-2"><a href="#cb7-2" tabindex="-1"></a> <span class="fu">head</span>(<span class="dv">500</span>)</span>
<span id="cb7-3"><a href="#cb7-3" tabindex="-1"></a><span class="fu">names</span>(corpus_1) <span class="ot">&lt;-</span> <span class="fu">c</span>(<span class="st">&quot;a&quot;</span>, <span class="st">&quot;field&quot;</span>)</span>
<span id="cb7-4"><a href="#cb7-4" tabindex="-1"></a>corpus_1</span></code></pre></div>
-<pre><code>## # A tibble: 100,000 × 2
+<pre><code>## # A tibble: 500 × 2
## a field
## &lt;dbl&gt; &lt;chr&gt;
## 1 1 ufwa cope committee
@@ -719,26 +719,26 @@ <h3 id="example-joining-rows-of-the-database-on-ideology-money-in-politics-and-e
## 8 8 minnesota gun owners&#39; political victory fund
## 9 9 metropolitan detroit afl cio cope committee
## 10 10 carpenters legislative improvement committee united brotherhood of car…
-## # ℹ 99,990 more rows</code></pre>
+## # ℹ 490 more rows</code></pre>
<p>And <code>corpus_2</code>:</p>
<div class="sourceCode" id="cb9"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb9-1"><a href="#cb9-1" tabindex="-1"></a>corpus_2 <span class="ot">&lt;-</span> dime_data <span class="sc">%&gt;%</span></span>
-<span id="cb9-2"><a href="#cb9-2" tabindex="-1"></a> <span class="fu">tail</span>(<span class="dv">500000</span>)</span>
+<span id="cb9-2"><a href="#cb9-2" tabindex="-1"></a> <span class="fu">tail</span>(<span class="dv">500</span>)</span>
<span id="cb9-3"><a href="#cb9-3" tabindex="-1"></a><span class="fu">names</span>(corpus_2) <span class="ot">&lt;-</span> <span class="fu">c</span>(<span class="st">&quot;b&quot;</span>, <span class="st">&quot;field&quot;</span>)</span>
<span id="cb9-4"><a href="#cb9-4" tabindex="-1"></a>corpus_2</span></code></pre></div>
-<pre><code>## # A tibble: 100,000 × 2
+<pre><code>## # A tibble: 500 × 2
## b field
## &lt;dbl&gt; &lt;chr&gt;
-## 1 1 ufwa cope committee
-## 2 2 committee to re elect charles e. bennett
-## 3 3 montana democratic party non federal account
-## 4 4 mississippi power &amp; light company management political action and educ…
-## 5 5 napus pac for postmasters
-## 6 6 aminoil good government fund
-## 7 7 national women&#39;s political caucus of california
-## 8 8 minnesota gun owners&#39; political victory fund
-## 9 9 metropolitan detroit afl cio cope committee
-## 10 10 carpenters legislative improvement committee united brotherhood of car…
-## # ℹ 99,990 more rows</code></pre>
+## 1 501 citizens for derwinski
+## 2 502 progressive victory fund greater washington americans for democratic a…
+## 3 503 ingham county democratic party federal campaign fund
+## 4 504 committee for a stronger future
+## 5 505 atoka country supper committee
+## 6 506 friends of democracy pac inc
+## 7 507 baypac
+## 8 508 international brotherhood of electrical workers local union 278 cope/p…
+## 9 509 louisville &amp; jefferson county republican executive committee
+## 10 510 democratic party of virginia
+## # ℹ 490 more rows</code></pre>
<p>Both corpuses have an observation ID column, and a donor name column.
We would like to join the two datasets on the donor names column, but
the two can’t be directly joined because of misspellings. Because of
@@ -764,24 +764,34 @@ <h3 id="example-joining-rows-of-the-database-on-ideology-money-in-politics-and-e

## Joining by &#39;field&#39;</code></pre>
<div class="sourceCode" id="cb13"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb13-1"><a href="#cb13-1" tabindex="-1"></a><span class="fu">print</span>(<span class="fu">Sys.time</span>() <span class="sc">-</span> start_time)</span></code></pre></div>
-<pre><code>## Time difference of 0.9200411 secs</code></pre>
+<pre><code>## Time difference of 0.01455116 secs</code></pre>
<div class="sourceCode" id="cb15"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb15-1"><a href="#cb15-1" tabindex="-1"></a><span class="fu">print</span>(join_out)</span></code></pre></div>
-<pre><code>## # A tibble: 213,218 × 4
-## a field.x b field.y
-## &lt;dbl&gt; &lt;chr&gt; &lt;dbl&gt; &lt;chr&gt;
-## 1 1733 nussle for congress 1733 nussle…
-## 2 35539 big apple wrecking &amp; construct 35539 big ap…
-## 3 30729 fit development lp 30729 fit de…
-## 4 53451 tom ammiano for assembly 2010 52053 tom am…
-## 5 84615 electrical workers local 716 94432 electr…
-## 6 39228 electrical workers local 363 96173 electr…
-## 7 99572 casey for treasurer cmte 99572 casey …
-## 8 50990 afscme local 3634 18086 afscme…
-## 9 74858 bolt farrell, kevin anthony 74858 bolt f…
-## 10 71895 international brotherhood of electrical workers local un… 54279 intern…
-## # ℹ 213,208 more rows</code></pre>
-<p>ZoomerJoin finds and joins on the matching rows in just a few
-seconds.</p>
+<pre><code>## # A tibble: 19 × 4
+## a field.x b field.y
+## &lt;dbl&gt; &lt;chr&gt; &lt;dbl&gt; &lt;chr&gt;
+## 1 216 kent county republican finance committee 607 lake co…
+## 2 238 4th congressional district democratic party 518 16th co…
+## 3 292 bill bradley for u s senate &#39;84 913 bill br…
+## 4 378 guarini for congress 1982 606 guarini…
+## 5 232 republican county committee of chester county 710 republi…
+## 6 387 committee to re elect congressman staton 805 committ…
+## 7 122 tarrant county republican victory fund 761 lake co…
+## 8 378 guarini for congress 1982 883 guarini…
+## 9 238 4th congressional district democratic party 792 8th con…
+## 10 88 scheuer for congress 1980 667 scheuer…
+## 11 45 dole for senate committee 623 riegle …
+## 12 87 kentucky state democratic central executive committee 639 arizona…
+## 13 319 7th congressional district democratic party of wisconsin 792 8th con…
+## 14 478 united democrats for better government 642 democra…
+## 15 163 davies county republican executive committee 852 warren …
+## 16 230 pipefitters local union 524 998 pipefit…
+## 17 216 kent county republican finance committee 719 harford…
+## 18 302 americans for good government inc 910 america…
+## 19 35 solarz for congress 82 671 solarz …</code></pre>
+<p>Zoomerjoin is able to quickly find the matching columns without
+comparing all pairs of records. This saves more and more time as the
+size of each list increases, so it can scale to join datasets with
+millions or hundreds of millions of rows.</p>
<h1 id="contributing">Contributing</h1>
<p>Thanks for your interest in contributing to Zoomerjoin!</p>
<p>I am using a gitub-centric workflow to manage the package; You can
40 changes: 20 additions & 20 deletions README.md
@@ -217,7 +217,7 @@ join_out <- jaccard_inner_join(corpus_1, corpus_2, n_gram_width=6, n_bands=20, b
print(Sys.time() - start_time)
```

-## Time difference of 0.01413631 secs
+## Time difference of 0.01455116 secs

``` r
print(join_out)
@@ -226,25 +226,25 @@ print(join_out)
## # A tibble: 19 × 4
## a field.x b field.y
## <dbl> <chr> <dbl> <chr>
-## 1 302 americans for good government inc 910 america
-## 2 87 kentucky state democratic central executive committee 639 arizona
-## 3 232 republican county committee of chester county 710 republi
-## 4 378 guarini for congress 1982 883 guarini…
-## 5 35 solarz for congress 82 671 solarz
-## 6 230 pipefitters local union 524 998 pipefit
-## 7 378 guarini for congress 1982 606 guarini
-## 8 238 4th congressional district democratic party 518 16th co
-## 9 122 tarrant county republican victory fund 761 lake co
-## 10 163 davies county republican executive committee 852 warren
-## 11 88 scheuer for congress 1980 667 scheuer
-## 12 387 committee to re elect congressman staton 805 committ
-## 13 45 dole for senate committee 623 riegle
-## 14 319 7th congressional district democratic party of wisconsin 792 8th con
-## 15 216 kent county republican finance committee 719 harford
-## 16 292 bill bradley for u s senate '84 913 bill br
-## 17 478 united democrats for better government 642 democra
-## 18 238 4th congressional district democratic party 792 8th con
-## 19 216 kent county republican finance committee 607 lake co
+## 1 216 kent county republican finance committee 607 lake co
+## 2 238 4th congressional district democratic party 518 16th co
+## 3 292 bill bradley for u s senate '84 913 bill br
+## 4 378 guarini for congress 1982 606 guarini…
+## 5 232 republican county committee of chester county 710 republi
+## 6 387 committee to re elect congressman staton 805 committ
+## 7 122 tarrant county republican victory fund 761 lake co
+## 8 378 guarini for congress 1982 883 guarini
+## 9 238 4th congressional district democratic party 792 8th con
+## 10 88 scheuer for congress 1980 667 scheuer
+## 11 45 dole for senate committee 623 riegle
+## 12 87 kentucky state democratic central executive committee 639 arizona
+## 13 319 7th congressional district democratic party of wisconsin 792 8th con
+## 14 478 united democrats for better government 642 democra
+## 15 163 davies county republican executive committee 852 warren
+## 16 230 pipefitters local union 524 998 pipefit
+## 17 216 kent county republican finance committee 719 harford
+## 18 302 americans for good government inc 910 america
+## 19 35 solarz for congress 82 671 solarz

Zoomerjoin is able to quickly find the matching columns without
comparing all pairs of records. This saves more and more time as the
4 changes: 2 additions & 2 deletions codemeta.json
@@ -7,7 +7,7 @@
"codeRepository": "https://beniamino.org/zoomerjoin/",
"issueTracker": "https://github.com/beniaminogreen/zoomerjoin/issues/",
"license": "https://spdx.org/licenses/GPL-3.0",
-"version": "0.1.2",
+"version": "0.1.3",
"programmingLanguage": {
"@type": "ComputerLanguage",
"name": "R",
@@ -242,7 +242,7 @@
},
"SystemRequirements": "Cargo (>= 1.56) (Rust's package manager), rustc"
},
-"fileSize": "225421.347KB",
+"fileSize": "350841.431KB",
"contIntegration": "https://app.codecov.io/gh/beniaminogreen/zoomerjoin?branch=main",
"developmentStatus": "https://lifecycle.r-lib.org/articles/stages.html#experimental"
}
