Improved performance for the Raw container (#162)

* change intersection algorithm in Raw computations
* fix
* update version
* update changelog
* update cpp bench
* changelog
* changelog again
* update docs
* comments
* update comments
* update docs
* license formatting
* fix test rounding
* fix cpp rounding
* changelog
* golint
OpenMined · Feb 17, 2023 · 20952d9
1 parent 3228d01

Showing 19 changed files with 412 additions and 121 deletions.
diff --git a/CHANGES.md b/CHANGES.md
@@ -1,3 +1,24 @@
+# Version 2.0.1
+
+Feat:
+
+- The complexity of the underlying `Raw` intersection computation has improved
+  from `O(nmlog(m))` -> `O(nlog(n) + max(n, m))`; however, internal protobuf
+  deserialization remains as the dominant performance inhibitor for the
+  `client->GetIntersection*` methods.
+
+Fix:
+
+- The `go` integration tests were not using the datastructure param properly.
+  The fix did not result in any regression.
+
+Chore:
+
+- Update `C++` benchmarks to include the new `Raw` enum variant
+- Misc fixes to tests which were not rounding correctly and causing CI to fail
+  randomly
+- Update the main README to include a description of the protocol
+
 # Version 2.0.0
 
 Breaking:

diff --git a/README.md b/README.md
@@ -6,8 +6,72 @@
 
 # PSI
 
-Private Set Intersection protocol based on ECDH, Bloom Filters, and Golomb
-Compressed Sets.
+Private Set Intersection protocol based on ECDH and Golomb Compressed Sets or
+Bloom Filters.
+
+## Protocol
+
+The Private Set Intersection (PSI) protocol involves two parties, a client and a
+server, each holding a dataset. The goal of the protocol is for the client to
+determine the intersection between their dataset and the server's dataset,
+without revealing any information about their respective datasets to each other.
+
+The protocol proceeds as follows:
+
+1. Setup (server)
+
+The server encrypts all its elements `x` under a commutative encryption scheme,
+computing `H(x)^s` where `s` is its secret key. The encrypted elements are then
+inserted into a container and sent to the client as a serialized protobuf that
+resembles the following:
+
+```
+[ H(x_1)^(s), H(x_2)^(s), ... , H(x_n)^(s) ]
+```
+
+2. Request (client)
+
+The client encrypts all their elements `x` using the commutative encryption
+scheme, computing `H(x)^c`, where `c` is its secret key. The client sends its
+encrypted elements to the server along with a boolean flag,
+`reveal_intersection`, indicating whether the client wants to learn the elements
+in the intersection or only its size (cardinality). The payload is sent as a
+serialized protobuf and resembles the following:
+
+```
+[ H(x_1)^(c), H(x_2)^(c), ... , H(x_n)^(c) ]
+```
+
+3. Response (server)
+
+For each encrypted element `H(x)^c` received from the client, the server
+encrypts it again under the commutative encryption scheme with its secret key
+`s`, computing `(H(x)^c)^s = H(x)^(cs)`. The result is sent back to the client
+in a serialized protobuf and resembles the following:
+
+```
+[ H(x_1)^(cs), H(x_2)^(cs), ... , H(x_n)^(cs) ]
+```
+
+4. Compute intersection (client)
+
+The client decrypts each element received from the server's response using its
+secret key `c`, computing `(H(x)^(cs))^(1/c) = H(x)^s`. It then checks whether
+each decrypted element is present in the container received from the server, and
+reports the number of matches as the intersection size.
+
+It's worth noting that the protocol has several variants, some of which
+introduce a small false-positive rate, while others do not generate false
+positives. The variant is selectable, and the false-positive rate can be tuned.
+
+The protocol has configurable **containers**. Golomb Compressed Sets (`Gcs`) is
+the default container but it can be overridden to be `BloomFilter` or `Raw`
+encrypted strings. `Gcs` and `BloomFilter` will have false positives whereas
+`Raw` will not.
+
+## Security
+
+See [SECURITY.md](SECURITY.md).
 
 ## Requirements
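
The four protocol steps described in the README hunk above can be illustrated with a toy sketch. This is not the library's implementation: the library is based on ECDH, whereas the sketch assumes modular exponentiation over the small prime `2^31 - 1`, uses an FNV-1a hash as a stand-in for `H`, and runs both parties inside one function. The names `ToyHash`, `PowMod`, `InverseMod`, and `ToyPsi` are hypothetical, and the parameters offer no security.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <string>
#include <vector>

// Toy modulus (assumption for illustration only): the Mersenne prime 2^31 - 1.
constexpr uint64_t kP = 2147483647ULL;

// Stand-in for H(x): FNV-1a mapped into [1, kP - 1].
uint64_t ToyHash(const std::string& x) {
  uint64_t h = 14695981039346656037ULL;
  for (unsigned char ch : x) {
    h ^= ch;
    h *= 1099511628211ULL;
  }
  return h % (kP - 1) + 1;
}

// Computes base^exp mod kP by square-and-multiply.
uint64_t PowMod(uint64_t base, uint64_t exp) {
  uint64_t result = 1;
  base %= kP;
  while (exp > 0) {
    if (exp & 1) result = result * base % kP;
    base = base * base % kP;
    exp >>= 1;
  }
  return result;
}

// Inverse of a modulo m (extended Euclid); requires gcd(a, m) == 1.
uint64_t InverseMod(uint64_t a, uint64_t m) {
  int64_t old_r = static_cast<int64_t>(a), r = static_cast<int64_t>(m);
  int64_t old_s = 1, s = 0;
  while (r != 0) {
    const int64_t q = old_r / r;
    int64_t t = old_r - q * r;
    old_r = r;
    r = t;
    t = old_s - q * s;
    old_s = s;
    s = t;
  }
  assert(old_r == 1);  // a must be invertible modulo m
  const int64_t mod = static_cast<int64_t>(m);
  return static_cast<uint64_t>(((old_s % mod) + mod) % mod);
}

// Runs all four steps in one place and returns the matching client indices.
// `c` and `s` are the client and server secrets, both coprime to kP - 1.
std::vector<size_t> ToyPsi(const std::vector<std::string>& client_items,
                           const std::vector<std::string>& server_items,
                           uint64_t c, uint64_t s) {
  // 1. Setup (server): H(x)^s for every server element.
  std::set<uint64_t> setup;
  for (const auto& x : server_items) setup.insert(PowMod(ToyHash(x), s));

  // 2. Request (client): H(x)^c for every client element.
  std::vector<uint64_t> request;
  for (const auto& x : client_items) request.push_back(PowMod(ToyHash(x), c));

  // 3. Response (server): (H(x)^c)^s = H(x)^(cs).
  std::vector<uint64_t> response;
  for (uint64_t v : request) response.push_back(PowMod(v, s));

  // 4. Compute intersection (client): strip c, then look up H(x)^s.
  const uint64_t c_inv = InverseMod(c, kP - 1);
  std::vector<size_t> matches;
  for (size_t i = 0; i < response.size(); ++i) {
    if (setup.count(PowMod(response[i], c_inv)) > 0) matches.push_back(i);
  }
  return matches;
}
```

With secrets such as `c = 65537` and `s = 1000003`, a client holding `{"alice", "bob", "carol", "dave"}` and a server holding `{"bob", "dave", "erin"}` should get client indices 1 and 3 back, barring a hash collision in the toy group.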
 

diff --git a/SECURITY.md b/SECURITY.md
@@ -5,41 +5,47 @@ Several caveats should be carefully considered before using PSI.
 ### Information assumed public
 
 1. Server set size
-2. Client set size
-   (Note that Each of these can be turned into upper bounds by adding dummy elements.)
+2. Client set size (Note that each of these can be turned into upper bounds by
+   adding dummy elements.)
 
 ### Security Limitations for the PSI protocol
 
-There are two configurations for instantiating a new client/server pair by passing in a boolean switch into their respective constructors.
+There are two configurations for instantiating a new client/server pair by
+passing in a boolean switch into their respective constructors.
 
-1. One that reveals only the **size** (cardinality) of the intersection to the client.
+1. One that reveals only the **size** (cardinality) of the intersection to the
+   client.
 2. One that reveals the actual **intersection** to the client.
 
-In the case of #1, coordinated clients could get the actual intersection. However, server set items not
-in any of the client sets will never be uncovered.
-Situations where it’s feasible for clients to send one request per element in the domain -
-there is a possbility that coordinated clients could uncover server set.
-
-Presence of new client set members or absence of former client set members can be
-detected by server/eavesdroppers if client secret is reused.
-
-In the absence of any rate limiting and assuming the client and server have enough
-computing power and bandwidth, small domains may be brute-forceable. However, a query
-needs to be performed for each brute-force attempt.
-An example for this situation would be suppose you were trying to limit sending antibody
-tests to people based on whether they’d been in an infected location, so that people would
-have to share their location history to prove they’d been somewhere infected, and you were
-using PSI so people wouldn’t have to share their location history without good reason. If
-your health authority only covers 10 possible geohashes, people could sidestep the PSI step
-entirely and submit location histories which unlock tests by brute force.
+In the case of #1, coordinated clients could get the actual intersection.
+However, server set items not in any of the client sets will never be uncovered.
+In situations where it’s feasible for clients to send one request per element in
+the domain, there is a possibility that coordinated clients could uncover the
+server set.
+
+Presence of new client set members or absence of former client set members can
+be detected by server/eavesdroppers if client secret is reused.
+
+In the absence of any rate limiting and assuming the client and server have
+enough computing power and bandwidth, small domains may be brute-forceable.
+However, a query needs to be performed for each brute-force attempt. For
+example, suppose you were trying to limit sending antibody tests to people based
+on whether they’d been in an infected location, so that people would have to
+share their location history to prove they’d been somewhere infected, and you
+were using PSI so people wouldn’t have to share their location history without
+good reason. If your health authority only covers 10 possible geohashes, people
+could sidestep the PSI step entirely and submit location histories which unlock
+tests by brute force.
 
 A potential limitation with the PSI approach is the communication complexity,
-which scales linearly with the size of the larger set. This is of particular concern
-when performing PSI between a constrained device (cellphone) holding a small set, and a
-large service provider (e.g. WhatsApp), such as in the Private Contact Discovery application.
-Assuming a bloom filter is used, the Client set size affects the algorithmic complexity in
-linear time O(n), with a constant number of lookups. The bloom filter has linear size
-in the server's set, hence the algorithmic complexity of our protocol is O(n). However,
-a bloom filter requires a large number of lookups on each query, if the false positive rate
-is low. An alternative is the Golomb Compressed Set, which requires O(n log n) time due to sorting
-operations, but in practice takes around 25-30% less space than a bloom filter.
+which scales linearly with the size of the larger set. This is of particular
+concern when performing PSI between a constrained device (cellphone) holding a
+small set, and a large service provider (e.g. WhatsApp), such as in the Private
+Contact Discovery application. Assuming a bloom filter is used, the Client set
+size affects the algorithmic complexity in linear time O(n), with a constant
+number of lookups. The bloom filter has linear size in the server's set, hence
+the algorithmic complexity of our protocol is O(n). However, a bloom filter
+requires a large number of lookups on each query, if the false positive rate is
+low. An alternative is the Golomb Compressed Set, which requires O(n log n) time
+due to sorting operations, but in practice takes around 25-30% less space than a
+bloom filter.
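
To make the remark about lookups concrete, the sketch below applies the textbook Bloom filter sizing formulas, `m = ceil(-n ln p / (ln 2)^2)` bits and `k = round((m / n) ln 2)` hash functions, for `n` elements and a target false-positive rate `p`. `SizeBloomFilter` and `BloomParams` are hypothetical names; the parameters the library actually chooses are not shown in this diff and may differ.

```cpp
#include <cmath>
#include <cstdint>

struct BloomParams {
  int64_t bits;    // total filter size m, in bits
  int64_t hashes;  // number of hash functions k, i.e. lookups per query
};

// Textbook sizing for n elements and target false-positive rate p (0 < p < 1).
BloomParams SizeBloomFilter(int64_t n, double p) {
  const double ln2 = std::log(2.0);
  const double m =
      std::ceil(-static_cast<double>(n) * std::log(p) / (ln2 * ln2));
  const double k = std::round(m / static_cast<double>(n) * ln2);
  return {static_cast<int64_t>(m), static_cast<int64_t>(k)};
}
```

Under these formulas, lowering `p` from `0.01` to `0.000001` for 1000 elements raises the number of lookups per query from 7 to 20, which is the cost the paragraph above refers to.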
diff --git a/package-lock.json b/package-lock.json
diff --git a/package.json b/package.json
@@ -1,6 +1,6 @@
 {
   "name": "@openmined/psi.js",
-  "version": "2.0.0",
+  "version": "2.0.1",
   "description": "Private Set Intersection for JavaScript",
   "repository": {
     "type": "git",

diff --git a/private_set_intersection/c/integration_test.cpp b/private_set_intersection/c/integration_test.cpp
@@ -14,6 +14,8 @@
 // limitations under the License.
 //
 
+#include <math.h>
+
 #include "absl/container/flat_hash_set.h"
 #include "absl/strings/str_cat.h"
 #include "gtest/gtest.h"
@@ -223,7 +225,8 @@ TEST_P(Correctness, intersection) {
 
     // Test if size is approximately as expected (up to 10%).
     EXPECT_GE(intersection_size, num_client_inputs / 2);
-    EXPECT_LT(intersection_size, (num_client_inputs / 2) * 1.1);
+    EXPECT_LT((double)intersection_size,
+              ceil((double(num_client_inputs) / 2.0) * 1.1));
   }
   free(server_setup);
   free(client_request);
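
The two forms of the upper bound in this hunk can be compared directly. The sketch below assumes `num_client_inputs` is an integer type (taken here as `int64_t`); `OldUpperBound` and `NewUpperBound` are hypothetical helper names. The diff does not say which inputs made CI fail, so this only shows how the two expressions differ.

```cpp
#include <cmath>
#include <cstdint>

// Bound as written before the change: integer division happens first,
// so odd inputs lose their half before the 10% margin is applied.
double OldUpperBound(int64_t num_client_inputs) {
  return (num_client_inputs / 2) * 1.1;
}

// Bound as written after the change: floating-point division, then ceil.
double NewUpperBound(int64_t num_client_inputs) {
  return std::ceil((static_cast<double>(num_client_inputs) / 2.0) * 1.1);
}
```

For a single client input the old bound is `0`, so a strict `EXPECT_LT` against it cannot pass for any non-negative size, while the new bound is `1`.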

diff --git a/private_set_intersection/cpp/datastructure/raw.cpp b/private_set_intersection/cpp/datastructure/raw.cpp
@@ -25,16 +25,34 @@
 
 namespace private_set_intersection {
 
+// Computes the intersection of two collections. The first collection must be a
+// `pair<T, int64_t>`. The `T` must be the same in the second collection.
+//
+// Requires both collections to be sorted.
+//
+// Complexity:
+// - O(max(n, m))
+template <class InputIt1, class InputIt2, class OutputIt>
+void custom_set_intersection(InputIt1 first1, InputIt1 last1, InputIt2 first2,
+                             InputIt2 last2, OutputIt d_first) {
+  while (first1 != last1 && first2 != last2) {
+    if ((*first1).first < *first2)
+      ++first1;
+    else {
+      // Here (*first1).first >= *first2; the two are equivalent only if
+      // *first2 is not less than (*first1).first.
+      if (!(*first2 < (*first1).first)) {
+        *d_first++ = (*first1++).second;
+      }
+      ++first2;
+    }
+  }
+}
+
 Raw::Raw(std::vector<std::string> elements) : encrypted_(std::move(elements)) {}
 
 StatusOr<std::unique_ptr<Raw>> Raw::Create(int64_t num_client_inputs,
                                            std::vector<std::string> elements) {
-  auto num_server_inputs = static_cast<int64_t>(elements.size());
-
-  // If server inputs < client inputs, add random encrypted values
-  // ...
-
-  // Then we perform a sort to make intersections easier to find
+  // We sort to make intersections easier to find later
   std::sort(elements.begin(), elements.end());
 
   return absl::WrapUnique(new Raw(elements));
@@ -55,14 +73,24 @@ StatusOr<std::unique_ptr<Raw>> Raw::CreateFromProtobuf(
 
 std::vector<int64_t> Raw::Intersect(
     absl::Span<const std::string> elements) const {
-  std::vector<int64_t> res;
-
-  for (size_t i = 0; i < elements.size(); i++) {
-    if (std::binary_search(encrypted_.begin(), encrypted_.end(), elements[i])) {
-      res.push_back(i);
-    }
+  // This implementation creates a copy of `elements`, but the tradeoff is that
+  // we can compute the intersection in O(nlog(n) + max(n, m)) where `n` and `m`
+  // correspond to the number of client and server elements respectively.
+  std::vector<std::pair<std::string, int64_t>> vp(elements.size());
+
+  // Collect a pair with the index to track the original index after sorting.
+  for (size_t i = 0; i < elements.size(); ++i) {
+    vp[i] = make_pair(elements[i], (int64_t)i);
   }
 
+  // Next, we sort the collection. O(nlog(n))
+  std::sort(vp.begin(), vp.end());
+
+  std::vector<int64_t> res;
+  // Compute intersection. O(max(m, n))
+  custom_set_intersection(vp.begin(), vp.end(), encrypted_.begin(),
+                          encrypted_.end(), std::back_inserter(res));
+
   return res;
 }
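
The approach in this hunk (tag each client element with its index, sort, then merge against the already sorted server elements) can be tried in isolation with the sketch below. It uses only the standard library. `IntersectIndices` is a hypothetical name, not part of the library, and plain `std::vector` stands in for `absl::Span`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Returns the original positions of the client elements that also occur in
// `sorted_server`. Requires `sorted_server` to be sorted ascending.
std::vector<int64_t> IntersectIndices(
    const std::vector<std::string>& client,
    const std::vector<std::string>& sorted_server) {
  assert(std::is_sorted(sorted_server.begin(), sorted_server.end()));

  // Pair every client element with its original index, then sort by value.
  std::vector<std::pair<std::string, int64_t>> tagged;
  tagged.reserve(client.size());
  for (size_t i = 0; i < client.size(); ++i) {
    tagged.emplace_back(client[i], static_cast<int64_t>(i));
  }
  std::sort(tagged.begin(), tagged.end());

  // Single merge pass over both sorted sequences.
  std::vector<int64_t> out;
  auto c = tagged.begin();
  auto s = sorted_server.begin();
  while (c != tagged.end() && s != sorted_server.end()) {
    if (c->first < *s) {
      ++c;
    } else if (*s < c->first) {
      ++s;
    } else {
      out.push_back(c->second);  // equal values: record the client index
      ++c;
      ++s;
    }
  }
  return out;
}
```

One consequence of sorting the client side is that indices come back ordered by element value rather than by position: a client `{"d", "b"}` against a server `{"b", "d"}` yields `{1, 0}`.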
 

diff --git a/private_set_intersection/cpp/datastructure/raw.h b/private_set_intersection/cpp/datastructure/raw.h
@@ -41,6 +41,7 @@ class Raw {
   static StatusOr<std::unique_ptr<Raw>> CreateFromProtobuf(
       const psi_proto::ServerSetup& encoded_filter);
 
+  // Calculates the intersection
   std::vector<int64_t> Intersect(absl::Span<const std::string> elements) const;
 
   // Returns the size of the encrypted elements

diff --git a/private_set_intersection/cpp/datastructure/raw_test.cpp b/private_set_intersection/cpp/datastructure/raw_test.cpp
@@ -36,20 +36,31 @@ class RawTest : public ::testing::Test {
   std::unique_ptr<Raw> container_;
 };
 
-TEST_F(RawTest, TestAdd) {
+TEST_F(RawTest, TestIntersection) {
   std::vector<std::string> server = {"a", "b", "c", "d", "e"};
-  std::vector<std::string> client = {"z", "b", "c", "d"};
+  std::vector<std::string> client = {"b", "c", "d", "z"};
 
   SetUp(static_cast<int64_t>(client.size()), server);
-  std::vector<int64_t> results = container_->Intersect(client);
+  std::vector<int64_t> results = container_->Intersect(absl::MakeSpan(client));
+  std::vector<int64_t> expected{0, 1, 2};
+  EXPECT_EQ(results.size(), 3);
+  EXPECT_EQ(results, expected);
+}
+
+TEST_F(RawTest, TestIntersectionLargerClient) {
+  std::vector<std::string> server = {"b", "c", "d", "z"};
+  std::vector<std::string> client = {"a", "b", "c", "d", "e"};
+
+  SetUp(static_cast<int64_t>(client.size()), server);
+  std::vector<int64_t> results = container_->Intersect(absl::MakeSpan(client));
   std::vector<int64_t> expected{1, 2, 3};
   EXPECT_EQ(results.size(), 3);
   EXPECT_EQ(results, expected);
 }
 
 TEST_F(RawTest, TestToProtobuf) {
   std::vector<std::string> server = {"b", "a", "c", "d", "e"};
-  std::vector<std::string> client = {"z", "b", "c", "d"};
+  std::vector<std::string> client = {"b", "c", "d", "z"};
 
   SetUp(static_cast<int64_t>(client.size()), server);
 
@@ -63,17 +74,17 @@ TEST_F(RawTest, TestToProtobuf) {
   EXPECT_EQ(encoded_filter.raw().encrypted_elements()[0], "a");
 }
 
-TEST_F(RawTest, TestCreateFromProtobuf) {
+TEST_F(RawTest, TestIntersectionFromProtobuf) {
   std::vector<std::string> server = {"a", "b", "c", "d", "e"};
-  std::vector<std::string> client = {"z", "b", "c", "d"};
+  std::vector<std::string> client = {"b", "c", "d", "z"};
 
   SetUp(static_cast<int64_t>(client.size()), server);
 
   // Create the protobuf from the Raw container and check if it matches.
   PSI_ASSERT_OK_AND_ASSIGN(auto container2,
                            Raw::CreateFromProtobuf(container_->ToProtobuf()));
-  std::vector<int64_t> results = container2->Intersect(client);
-  std::vector<int64_t> expected{1, 2, 3};
+  std::vector<int64_t> results = container2->Intersect(absl::MakeSpan(client));
+  std::vector<int64_t> expected{0, 1, 2};
   EXPECT_EQ(results.size(), 3);
   EXPECT_EQ(results, expected);
 }