Improved performance for the Raw container (#162)
* change intersection algorithm in Raw computations

* fix

* update version

* update changelog

* update cpp bench

* changelog

* changelog again

* update docs

* comments

* update comments

* update docs

* license formatting

* fix test rounding

* fix cpp rounding

* changelog

* golint
s0l0ist authored Feb 17, 2023
1 parent 3228d01 commit 20952d9
Showing 19 changed files with 412 additions and 121 deletions.
21 changes: 21 additions & 0 deletions CHANGES.md
@@ -1,3 +1,24 @@
# Version 2.0.1

Feat:

- The complexity of the underlying `Raw` intersection computation has improved
from `O(nmlog(m))` -> `O(nlog(n) + max(n, m))`; however, internal protobuf
deserialization remains the dominant performance bottleneck for the
`client->GetIntersection*` methods.

Fix:

- The `go` integration tests were not using the datastructure param properly.
The fix did not result in any regression.

Chore:

- Update `C++` benchmarks to include the new `Raw` enum variant
- Misc fixes to tests which were not rounding correctly and causing CI to fail
randomly
- Update the main README to include a description of the protocol

# Version 2.0.0

Breaking:
68 changes: 66 additions & 2 deletions README.md
@@ -6,8 +6,72 @@

# PSI

Private Set Intersection protocol based on ECDH, Bloom Filters, and Golomb
Compressed Sets.
Private Set Intersection protocol based on ECDH and Golomb Compressed Sets or
Bloom Filters.

## Protocol

The Private Set Intersection (PSI) protocol involves two parties, a client and a
server, each holding a dataset. The goal of the protocol is for the client to
determine the intersection between their dataset and the server's dataset,
without revealing any information about their respective datasets to each other.

The protocol proceeds as follows:

1. Setup (server)

The server encrypts all its elements `x` under a commutative encryption scheme,
computing `H(x)^s` where `s` is its secret key. The encrypted elements are then
inserted into a container, which is sent to the client as a serialized protobuf
resembling the following:

```
[ H(x_1)^(s), H(x_2)^(s), ... , H(x_n)^(s) ]
```

2. Request (client)

The client encrypts all their elements `x` using the commutative encryption
scheme, computing `H(x)^c`, where `c` is its secret key. The client sends its
encrypted elements to the server along with a boolean flag,
`reveal_intersection`, indicating whether the client wants to learn the elements
in the intersection or only its size (cardinality). The payload is sent as a
serialized protobuf and resembles the following:

```
[ H(x_1)^(c), H(x_2)^(c), ... , H(x_n)^(c) ]
```

3. Response (server)

For each encrypted element `H(x)^c` received from the client, the server
encrypts it again under the commutative encryption scheme with its secret key
`s`, computing `(H(x)^c)^s = H(x)^(cs)`. The result is sent back to the client
in a serialized protobuf and resembles the following:

```
[ H(x_1)^(cs), H(x_2)^(cs), ... , H(x_n)^(cs) ]
```

4. Compute intersection (client)

The client decrypts each element received from the server's response using its
secret key `c`, computing `(H(x)^(cs))^(1/c) = H(x)^s`. It then checks whether
each decrypted element is present in the container received from the server, and
reports the number of matches as the intersection size.
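
The four steps above rely on the encryption being commutative and on the client
being able to strip its own layer. The following is a toy sketch of that
algebra, assuming modular exponentiation in the group of squares modulo a small
safe prime as a stand-in for the ECDH operations. The names `PowMod`, `ToyHash`,
`Encrypt`, and `Decrypt` and the parameters `kP` and `kQ` are hypothetical and
not part of this library, and the parameters are far too small to be secure.

```cpp
#include <cassert>
#include <cstdint>

// Toy parameters (illustration only, NOT secure): kP = 2 * kQ + 1 is a safe
// prime, so the nonzero squares modulo kP form a group of prime order kQ.
constexpr std::uint64_t kP = 2039;
constexpr std::uint64_t kQ = 1019;

// Computes base^exp mod m by square-and-multiply.
std::uint64_t PowMod(std::uint64_t base, std::uint64_t exp, std::uint64_t m) {
  std::uint64_t result = 1 % m;
  base %= m;
  while (exp > 0) {
    if (exp & 1) result = result * base % m;
    base = base * base % m;
    exp >>= 1;
  }
  return result;
}

// Hypothetical stand-in for H(x): maps an integer to a nonzero square mod kP.
std::uint64_t ToyHash(std::uint64_t x) {
  std::uint64_t v = x % (kP - 1) + 1;  // v is in [1, kP - 1]
  return v * v % kP;
}

// "Encrypts" a group element under `key` by exponentiation: element^key.
std::uint64_t Encrypt(std::uint64_t element, std::uint64_t key) {
  return PowMod(element, key, kP);
}

// Removes the layer added by `key` by raising to key^(-1) mod kQ.
// Since kQ is prime, the inverse is key^(kQ - 2) mod kQ (Fermat).
std::uint64_t Decrypt(std::uint64_t element, std::uint64_t key) {
  assert(key % kQ != 0);  // the key must be invertible modulo kQ
  std::uint64_t inverse = PowMod(key, kQ - 2, kQ);
  return PowMod(element, inverse, kP);
}
```

With `h = ToyHash(42)` and any keys `c` and `s` in `[1, kQ - 1]`,
`Decrypt(Encrypt(Encrypt(h, c), s), c)` equals `Encrypt(h, s)`, which is the
value the client looks up in the server's container in step 4.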

It's worth noting that the protocol has several variants, some of which
introduce a small false-positive rate, while others do not generate false
positives. The variant is selectable, and the false-positive rate can be tuned.

The protocol has configurable **containers**. Golomb Compressed Sets (`Gcs`) is
the default container but it can be overridden to be `BloomFilter` or `Raw`
encrypted strings. `Gcs` and `BloomFilter` will have false positives whereas
`Raw` will not.

## Security

See [SECURITY.md](SECURITY.md).

## Requirements

66 changes: 36 additions & 30 deletions SECURITY.md
@@ -5,41 +5,47 @@ Several caveats should be carefully considered before using PSI.
### Information assumed public

1. Server set size
2. Client set size
(Note that each of these can be turned into upper bounds by adding dummy elements.)
2. Client set size (Note that each of these can be turned into upper bounds by
adding dummy elements.)

### Security Limitations for the PSI protocol

There are two configurations for instantiating a new client/server pair by passing in a boolean switch into their respective constructors.
There are two configurations for instantiating a new client/server pair by
passing in a boolean switch into their respective constructors.

1. One that reveals only the **size** (cardinality) of the intersection to the client.
1. One that reveals only the **size** (cardinality) of the intersection to the
client.
2. One that reveals the actual **intersection** to the client.

In the case of #1, coordinated clients could get the actual intersection. However, server set items not
in any of the client sets will never be uncovered.
In situations where it’s feasible for clients to send one request per element in the domain,
there is a possibility that coordinated clients could uncover the server set.

Presence of new client set members or absence of former client set members can be
detected by server/eavesdroppers if client secret is reused.

In the absence of any rate limiting and assuming the client and server have enough
computing power and bandwidth, small domains may be brute-forceable. However, a query
needs to be performed for each brute-force attempt.
As an example of this situation, suppose you were trying to limit sending antibody
tests to people based on whether they’d been in an infected location, so that people would
have to share their location history to prove they’d been somewhere infected, and you were
using PSI so people wouldn’t have to share their location history without good reason. If
your health authority only covers 10 possible geohashes, people could sidestep the PSI step
entirely and submit location histories which unlock tests by brute force.
In the case of #1, coordinated clients could get the actual intersection.
However, server set items not in any of the client sets will never be uncovered.
In situations where it’s feasible for clients to send one request per element in
the domain, there is a possibility that coordinated clients could uncover the
server set.

Presence of new client set members or absence of former client set members can
be detected by server/eavesdroppers if client secret is reused.

In the absence of any rate limiting and assuming the client and server have
enough computing power and bandwidth, small domains may be brute-forceable.
However, a query needs to be performed for each brute-force attempt. As an
example of this situation, suppose you were trying to limit sending antibody
tests to people based on whether they’d been in an infected location, so that
people would have to share their location history to prove they’d been somewhere
infected, and you were using PSI so people wouldn’t have to share their location
history without good reason. If your health authority only covers 10 possible
geohashes, people could sidestep the PSI step entirely and submit location
histories which unlock tests by brute force.

A potential limitation with the PSI approach is the communication complexity,
which scales linearly with the size of the larger set. This is of particular concern
when performing PSI between a constrained device (cellphone) holding a small set, and a
large service provider (e.g. WhatsApp), such as in the Private Contact Discovery application.
Assuming a bloom filter is used, the Client set size affects the algorithmic complexity in
linear time O(n), with a constant number of lookups. The bloom filter has linear size
in the server's set, hence the algorithmic complexity of our protocol is O(n). However,
a bloom filter requires a large number of lookups on each query, if the false positive rate
is low. An alternative is the Golomb Compressed Set, which requires O(n log n) time due to sorting
operations, but in practice takes around 25-30% less space than a bloom filter.
which scales linearly with the size of the larger set. This is of particular
concern when performing PSI between a constrained device (cellphone) holding a
small set, and a large service provider (e.g. WhatsApp), such as in the Private
Contact Discovery application. Assuming a bloom filter is used, the Client set
size affects the algorithmic complexity in linear time O(n), with a constant
number of lookups. The bloom filter has linear size in the server's set, hence
the algorithmic complexity of our protocol is O(n). However, a bloom filter
requires a large number of lookups on each query, if the false positive rate is
low. An alternative is the Golomb Compressed Set, which requires O(n log n) time
due to sorting operations, but in practice takes around 25-30% less space than a
bloom filter.
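
To make the remark about lookups and space more concrete, the sketch below
evaluates the textbook Bloom filter sizing formulas for a target false-positive
rate `p`: about `-ln(p) / (ln 2)^2` bits per element and `log2(1/p)` hash
lookups per query. These are the standard results, not necessarily how this
library sizes its filters, and the function names are hypothetical.

```cpp
#include <cmath>

// Bits per stored element for a Bloom filter tuned to false-positive rate
// `fpr` (standard result: m / n = -ln(p) / (ln 2)^2).
double BloomBitsPerElement(double fpr) {
  const double ln2 = std::log(2.0);
  return -std::log(fpr) / (ln2 * ln2);
}

// Number of hash lookups per query at that optimum: k = log2(1 / p),
// rounded up to a whole number of hash functions.
int BloomLookupsPerQuery(double fpr) {
  return static_cast<int>(std::ceil(-std::log2(fpr)));
}
```

For `p = 0.001` this gives 10 lookups per query and roughly 14.4 bits per
element, about `1 / ln 2 ≈ 1.44` times the `log2(1/p)` bits that an ideal
compressed representation would need, which is in line with the space saving
quoted above for Golomb Compressed Sets.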
4 changes: 2 additions & 2 deletions package-lock.json


2 changes: 1 addition & 1 deletion package.json
@@ -1,6 +1,6 @@
{
"name": "@openmined/psi.js",
"version": "2.0.0",
"version": "2.0.1",
"description": "Private Set Intersection for JavaScript",
"repository": {
"type": "git",
5 changes: 4 additions & 1 deletion private_set_intersection/c/integration_test.cpp
@@ -14,6 +14,8 @@
// limitations under the License.
//

#include <math.h>

#include "absl/container/flat_hash_set.h"
#include "absl/strings/str_cat.h"
#include "gtest/gtest.h"
@@ -223,7 +225,8 @@ TEST_P(Correctness, intersection) {

// Test if size is approximately as expected (up to 10%).
EXPECT_GE(intersection_size, num_client_inputs / 2);
EXPECT_LT(intersection_size, (num_client_inputs / 2) * 1.1);
EXPECT_LT((double)intersection_size,
ceil((double(num_client_inputs) / 2.0) * 1.1));
}
free(server_setup);
free(client_request);
52 changes: 40 additions & 12 deletions private_set_intersection/cpp/datastructure/raw.cpp
@@ -25,16 +25,34 @@

namespace private_set_intersection {

// Computes the intersection of two collections. The first collection must be a
// `pair<T, int64_t>`. The `T` must be the same in the second collection.
//
// Requires both collections to be sorted.
//
// Complexity:
// - O(max(n, m))
template <class InputIt1, class InputIt2, class OutputIt>
void custom_set_intersection(InputIt1 first1, InputIt1 last1, InputIt2 first2,
InputIt2 last2, OutputIt d_first) {
while (first1 != last1 && first2 != last2) {
if ((*first1).first < *first2)
++first1;
else {
// *first1 and *first2 are equivalent.
if (!(*first2 < (*first1).first)) {
*d_first++ = (*first1++).second;
}
++first2;
}
}
}

Raw::Raw(std::vector<std::string> elements) : encrypted_(std::move(elements)) {}

StatusOr<std::unique_ptr<Raw>> Raw::Create(int64_t num_client_inputs,
std::vector<std::string> elements) {
auto num_server_inputs = static_cast<int64_t>(elements.size());

// If server inputs < client inputs, add random encrypted values
// ...

// Then we perform a sort to make intersections easier to find
// We sort to make intersections easier to find later
std::sort(elements.begin(), elements.end());

return absl::WrapUnique(new Raw(elements));
@@ -55,14 +73,24 @@ StatusOr<std::unique_ptr<Raw>> Raw::CreateFromProtobuf(

std::vector<int64_t> Raw::Intersect(
absl::Span<const std::string> elements) const {
std::vector<int64_t> res;

for (size_t i = 0; i < elements.size(); i++) {
if (std::binary_search(encrypted_.begin(), encrypted_.end(), elements[i])) {
res.push_back(i);
}
// This implementation creates a copy of `elements`, but the tradeoff is that
// we can compute the intersection in O(nlog(n) + max(n, m)) where `n` and `m`
// correspond to the number of client and server elements respectively.
std::vector<std::pair<std::string, int64_t>> vp(elements.size());

// Collect a pair with the index to track the original index after sorting.
for (size_t i = 0; i < elements.size(); ++i) {
vp[i] = make_pair(elements[i], (int64_t)i);
}

// Next, we sort the collection. O(nlog(n))
std::sort(vp.begin(), vp.end());

std::vector<int64_t> res;
// Compute intersection. O(max(m, n))
custom_set_intersection(vp.begin(), vp.end(), encrypted_.begin(),
encrypted_.end(), std::back_inserter(res));

return res;
}
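
The sort-then-merge idea in `Raw::Intersect` can be tried in isolation. The
following is a minimal standalone sketch using only the standard library,
assuming the server elements are already sorted (as `Raw::Create` arranges).
`IntersectSorted` is a hypothetical name and is not the library's API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Returns the original indices of the client elements that also appear in
// `sorted_server`. Sorting the client copy costs O(n log n); the merge pass
// then visits each element of either collection at most once.
std::vector<std::int64_t> IntersectSorted(
    const std::vector<std::string>& sorted_server,
    const std::vector<std::string>& client) {
  assert(std::is_sorted(sorted_server.begin(), sorted_server.end()));

  // Pair each client element with its index so the index survives sorting.
  std::vector<std::pair<std::string, std::int64_t>> indexed;
  indexed.reserve(client.size());
  for (std::size_t i = 0; i < client.size(); ++i) {
    indexed.emplace_back(client[i], static_cast<std::int64_t>(i));
  }
  std::sort(indexed.begin(), indexed.end());

  // Walk both sorted sequences in lockstep, advancing whichever is smaller.
  std::vector<std::int64_t> result;
  auto c = indexed.begin();
  auto s = sorted_server.begin();
  while (c != indexed.end() && s != sorted_server.end()) {
    if (c->first < *s) {
      ++c;
    } else if (*s < c->first) {
      ++s;
    } else {
      result.push_back(c->second);  // equal values: record the client index
      ++c;
      ++s;
    }
  }
  return result;
}
```

Note that the indices come back in the order of the sorted client values, not
in ascending index order: with a server of `{"a", "b", "c", "d", "e"}` and a
client of `{"z", "b", "c", "d"}`, the result is `{1, 2, 3}`.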

1 change: 1 addition & 0 deletions private_set_intersection/cpp/datastructure/raw.h
@@ -41,6 +41,7 @@ class Raw {
static StatusOr<std::unique_ptr<Raw>> CreateFromProtobuf(
const psi_proto::ServerSetup& encoded_filter);

// Calculates the intersection
std::vector<int64_t> Intersect(absl::Span<const std::string> elements) const;

// Returns the size of the encrypted elements
27 changes: 19 additions & 8 deletions private_set_intersection/cpp/datastructure/raw_test.cpp
@@ -36,20 +36,31 @@ class RawTest : public ::testing::Test {
std::unique_ptr<Raw> container_;
};

TEST_F(RawTest, TestAdd) {
TEST_F(RawTest, TestIntersection) {
std::vector<std::string> server = {"a", "b", "c", "d", "e"};
std::vector<std::string> client = {"z", "b", "c", "d"};
std::vector<std::string> client = {"b", "c", "d", "z"};

SetUp(static_cast<int64_t>(client.size()), server);
std::vector<int64_t> results = container_->Intersect(client);
std::vector<int64_t> results = container_->Intersect(absl::MakeSpan(client));
std::vector<int64_t> expected{0, 1, 2};
EXPECT_EQ(results.size(), 3);
EXPECT_EQ(results, expected);
}

TEST_F(RawTest, TestIntersectionLargerClient) {
std::vector<std::string> server = {"b", "c", "d", "z"};
std::vector<std::string> client = {"a", "b", "c", "d", "e"};

SetUp(static_cast<int64_t>(client.size()), server);
std::vector<int64_t> results = container_->Intersect(absl::MakeSpan(client));
std::vector<int64_t> expected{1, 2, 3};
EXPECT_EQ(results.size(), 3);
EXPECT_EQ(results, expected);
}

TEST_F(RawTest, TestToProtobuf) {
std::vector<std::string> server = {"b", "a", "c", "d", "e"};
std::vector<std::string> client = {"z", "b", "c", "d"};
std::vector<std::string> client = {"b", "c", "d", "z"};

SetUp(static_cast<int64_t>(client.size()), server);

@@ -63,17 +74,17 @@ TEST_F(RawTest, TestToProtobuf) {
EXPECT_EQ(encoded_filter.raw().encrypted_elements()[0], "a");
}

TEST_F(RawTest, TestCreateFromProtobuf) {
TEST_F(RawTest, TestIntersectionFromProtobuf) {
std::vector<std::string> server = {"a", "b", "c", "d", "e"};
std::vector<std::string> client = {"z", "b", "c", "d"};
std::vector<std::string> client = {"b", "c", "d", "z"};

SetUp(static_cast<int64_t>(client.size()), server);

// Create the protobuf from the Raw container and check if it matches.
PSI_ASSERT_OK_AND_ASSIGN(auto container2,
Raw::CreateFromProtobuf(container_->ToProtobuf()));
std::vector<int64_t> results = container2->Intersect(client);
std::vector<int64_t> expected{1, 2, 3};
std::vector<int64_t> results = container2->Intersect(absl::MakeSpan(client));
std::vector<int64_t> expected{0, 1, 2};
EXPECT_EQ(results.size(), 3);
EXPECT_EQ(results, expected);
}