palomar (search) iteration #263

Merged: 41 commits, Sep 15, 2023
Changes from 38 commits
Commits
af1829a
palomar: add second README with ES ops stuff
bnewbold May 3, 2023
d01531d
palomar: proposed post and profile schema iteration
bnewbold Aug 2, 2023
ca73914
palomar: tweak proposed schemas
bnewbold Aug 2, 2023
2c782e3
search: search doc transform helpers
bnewbold Aug 14, 2023
42078d8
search: incorporate transforms
bnewbold Aug 14, 2023
4cdbdb0
palomar: update post+profile index schemas
bnewbold Aug 14, 2023
e382b9e
palomar: update README and dev setup
bnewbold Aug 14, 2023
550fc57
palomar: more progress
bnewbold Aug 14, 2023
ae34ff3
search: lint fixes
bnewbold Aug 15, 2023
adc05ba
palomar: more tweaks to schema
bnewbold Sep 1, 2023
a9e1c05
palomar: bit of progress
bnewbold Sep 1, 2023
cafef6e
palomar: basic query parsing, handle 'from:'
bnewbold Sep 1, 2023
8e78c4b
Merge branch 'main' into bnewbold/palomar-iterate
bnewbold Sep 12, 2023
10fd2dd
Makefile: build palomar (search)
bnewbold Sep 12, 2023
89c0ff4
gitignore: add more executables
bnewbold Sep 12, 2023
b5055aa
palomar: construct subscribeRepos URL using struct
bnewbold Sep 13, 2023
2cc193a
palomar: switch to slog; add prometheus and other common middleware
bnewbold Sep 13, 2023
35f09f3
palomar: don't force refresh for most indexing ops
bnewbold Sep 13, 2023
41f3ad3
palomar: progress on removing user+record database tables
bnewbold Sep 13, 2023
4489a89
identity: handle errors when doing LookupDID should not error, just i…
bnewbold Sep 13, 2023
5251558
util: allow non-fractional-second timestamps
bnewbold Sep 13, 2023
8c914b9
palomar: more cleanup
bnewbold Sep 13, 2023
f8ad174
palomar: switch HTTP API to skeleton
bnewbold Sep 14, 2023
d842aa9
palomar: fix bug in query marshal
bnewbold Sep 14, 2023
31a7212
palomar: clarify weird double-marshal
bnewbold Sep 14, 2023
cc587df
palomar: auto-create indices if needed; check existence
bnewbold Sep 14, 2023
1812e1e
palomar: logging, var names
bnewbold Sep 14, 2023
299f640
palomar: update READMEs
bnewbold Sep 14, 2023
66fef0a
palomar: hitsTotal
bnewbold Sep 14, 2023
ee3f286
identity: support skipping DNS resolution for some hosts (like bsky.s…
bnewbold Sep 14, 2023
fff34f1
palomar: skip DNS resolution on bsky.social; do try authoritative DNS
bnewbold Sep 14, 2023
25e1e95
palomar: fix unclosed HTTP connections
bnewbold Sep 14, 2023
4f7f097
palomar: default index shard sizes
bnewbold Sep 14, 2023
2ead460
palomar: tune backfill a bit
bnewbold Sep 14, 2023
c4bcc0e
palomar: clear createdAt on error, not skip record
bnewbold Sep 14, 2023
bcec416
palomar: fix go:embed schemas
bnewbold Sep 14, 2023
cddee8f
make lint
bnewbold Sep 14, 2023
8b8ab88
palomar: fix bad slog invocation
bnewbold Sep 14, 2023
8a7e7fd
palomar: feedback from review
bnewbold Sep 15, 2023
a7c32e3
palomar: better checkParams sanity check
bnewbold Sep 15, 2023
32a9856
palomar: RawQuery typo
bnewbold Sep 15, 2023
3 changes: 3 additions & 0 deletions .gitignore
@@ -29,6 +29,9 @@ test-coverage.out
/lexgen
/stress
/labelmaker
/palomar
/sonar-cli
/supercollider

# Don't ignore this file itself, or other specific dotfiles
!.gitignore
1 change: 1 addition & 0 deletions Makefile
@@ -26,6 +26,7 @@ build: ## Build all executables
go build ./cmd/labelmaker
go build ./cmd/supercollider
go build -o ./sonar-cli ./cmd/sonar
go build ./cmd/palomar

.PHONY: all
all: build
9 changes: 6 additions & 3 deletions atproto/identity/base_directory.go
@@ -19,6 +19,8 @@ type BaseDirectory struct {
Resolver net.Resolver
// when doing DNS handle resolution, should this resolver attempt re-try against an authoritative nameserver if the first TXT lookup fails?
TryAuthoritativeDNS bool
// set of handle domain suffixes for which DNS handle resolution will be skipped
SkipDNSDomainSuffixes []string
}

var _ Directory = (*BaseDirectory)(nil)
@@ -61,10 +63,11 @@ func (d *BaseDirectory) LookupDID(ctx context.Context, did syntax.DID) (*Identit
 		return nil, err
 	}
 	resolvedDID, err := d.ResolveHandle(ctx, declared)
-	if err != nil {
+	if err != nil && err != ErrHandleNotFound {
 		return nil, err
-	}
-	if resolvedDID == did {
+	} else if ErrHandleNotFound == err || resolvedDID != did {
+		ident.Handle = syntax.Handle("handle.invalid")
+	} else {
 		ident.Handle = declared
 	}

41 changes: 27 additions & 14 deletions atproto/identity/handle.go
@@ -124,24 +124,37 @@ func (d *BaseDirectory) ResolveHandleWellKnown(ctx context.Context, handle synta

 func (d *BaseDirectory) ResolveHandle(ctx context.Context, handle syntax.Handle) (syntax.DID, error) {
 	// TODO: *could* do resolution in parallel, but expecting that sequential is sufficient to start
-	start := time.Now()
-	triedAuthoritative := false
-	did, dnsErr := d.ResolveHandleDNS(ctx, handle)
-	if dnsErr == ErrHandleNotFound && d.TryAuthoritativeDNS {
-		slog.Info("attempting authoritative handle DNS resolution", "handle", handle)
-		triedAuthoritative = true
-		// try harder with authoritative lookup
-		did, dnsErr = d.ResolveHandleDNSAuthoritative(ctx, handle)
+	var dnsErr error
+	var did syntax.DID
+
+	tryDNS := true
+	for _, suffix := range d.SkipDNSDomainSuffixes {
+		if strings.HasSuffix(handle.String(), suffix) {
+			tryDNS = false
+			break
+		}
 	}
-	elapsed := time.Since(start)
-	slog.Debug("resolve handle DNS", "handle", handle, "err", dnsErr, "did", did, "authoritative", triedAuthoritative, "duration_ms", elapsed.Milliseconds())
-	if nil == dnsErr { // if *not* an error
-		return did, nil
+
+	if tryDNS {
+		start := time.Now()
+		triedAuthoritative := false
+		did, dnsErr = d.ResolveHandleDNS(ctx, handle)
+		if dnsErr == ErrHandleNotFound && d.TryAuthoritativeDNS {
+			slog.Info("attempting authoritative handle DNS resolution", "handle", handle)
+			triedAuthoritative = true
+			// try harder with authoritative lookup
+			did, dnsErr = d.ResolveHandleDNSAuthoritative(ctx, handle)
+		}
+		elapsed := time.Since(start)
+		slog.Debug("resolve handle DNS", "handle", handle, "err", dnsErr, "did", did, "authoritative", triedAuthoritative, "duration_ms", elapsed.Milliseconds())
+		if nil == dnsErr { // if *not* an error
+			return did, nil
+		}
 	}

-	start = time.Now()
+	start := time.Now()
 	did, httpErr := d.ResolveHandleWellKnown(ctx, handle)
-	elapsed = time.Since(start)
+	elapsed := time.Since(start)
 	slog.Debug("resolve handle HTTP well-known", "handle", handle, "err", httpErr, "did", did, "duration_ms", elapsed.Milliseconds())
 	if nil == httpErr { // if *not* an error
 		return did, nil
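
The suffix check that gates DNS resolution in `ResolveHandle` can be isolated as a small predicate. The sketch below is only an illustration: `shouldTryDNS` is a hypothetical helper, and it takes a plain string rather than a `syntax.Handle`.

```go
package main

import (
	"fmt"
	"strings"
)

// shouldTryDNS reports whether DNS handle resolution should be attempted.
// It returns false as soon as the handle ends with any configured suffix,
// matching the loop-and-break in the change to ResolveHandle.
func shouldTryDNS(handle string, skipSuffixes []string) bool {
	for _, suffix := range skipSuffixes {
		if strings.HasSuffix(handle, suffix) {
			return false
		}
	}
	return true
}

func main() {
	skip := []string{".bsky.social"}
	for _, h := range []string{"alice.bsky.social", "alice.example.com", "notbsky.social"} {
		fmt.Printf("%s tryDNS=%v\n", h, shouldTryDNS(h, skip))
	}
}
```

Because the configured suffix starts with a dot, the match falls on a label boundary: `notbsky.social` still goes through DNS, and so would the bare domain `bsky.social`.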
2 changes: 2 additions & 0 deletions atproto/identity/identity.go
@@ -61,6 +61,8 @@ func DefaultDirectory() Directory {
},
},
TryAuthoritativeDNS: true,
// primary Bluesky PDS instance only supports HTTP resolution method
SkipDNSDomainSuffixes: []string{".bsky.social"},
}
cached := NewCacheDirectory(&base, 10000, time.Hour*24, time.Minute*2)
return &cached
2 changes: 2 additions & 0 deletions cmd/palomar/Dockerfile.opensearch
@@ -0,0 +1,2 @@
FROM opensearchproject/opensearch:2.5.0
RUN /usr/share/opensearch/bin/opensearch-plugin install --batch analysis-icu
98 changes: 71 additions & 27 deletions cmd/palomar/README.md
@@ -1,48 +1,92 @@
# Palomar

Palomar is an Elasticsearch/OpenSearch frontend and ATP (AT Protocol) repository crawler designed to provide search services for the Bluesky network.
Palomar is a backend search service for atproto, specifically the `bsky.app` post and profile record types. It works by consuming a repo event stream ("firehose") and updating an OpenSearch cluster (fork of Elasticsearch) with docs.

## Prerequisites
Almost all the code for this service is actually in the `search/` directory at the top of this repo.

- GoLang (version 1.21)
- Running instance of Elasticsearch or OpenSearch for indexing.
In September 2023, this service was substantially re-written. It no longer stores records in a local database; it now returns only "skeleton" results (a list of AT-URIs or DIDs) via the HTTP API, and it defines its own index mappings.

## Building

```
go build
```
## Query String Syntax

Currently only a simple query string syntax is supported. Double-quotes can surround phrases, `-` prefix negates a single keyword, and the following initial filters are supported:

- `from:<handle>` will filter to results from that account, based on current (cached) identity resolution
- entire DIDs as an un-quoted keyword will result in filtering to results from that account
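
The filter rules above can be sketched as a small tokenizer. This is an illustration only, not palomar's parser: `parseQuery`, `splitTokens`, and the `parsedQuery` struct are hypothetical names, and the sketch leaves the `from:` handle unresolved, whereas the service resolves it through its (cached) identity directory.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// parsedQuery is a hypothetical container for the pieces of a query string.
type parsedQuery struct {
	Terms []string // keywords, "-negated" keywords, and quoted phrases (quotes kept)
	From  string   // handle taken from a from:<handle> filter, if any
	DID   string   // DID given as an un-quoted keyword, if any
}

// splitTokens splits on whitespace but keeps double-quoted phrases together.
func splitTokens(raw string) []string {
	var toks []string
	var cur strings.Builder
	inQuote := false
	for _, r := range raw {
		switch {
		case r == '"':
			inQuote = !inQuote
			cur.WriteRune(r)
		case unicode.IsSpace(r) && !inQuote:
			if cur.Len() > 0 {
				toks = append(toks, cur.String())
				cur.Reset()
			}
		default:
			cur.WriteRune(r)
		}
	}
	if cur.Len() > 0 {
		toks = append(toks, cur.String())
	}
	return toks
}

// parseQuery pulls out the from: filter and any bare DID, leaving the rest as terms.
func parseQuery(raw string) parsedQuery {
	var pq parsedQuery
	for _, tok := range splitTokens(raw) {
		switch {
		case strings.HasPrefix(tok, "from:") && len(tok) > len("from:"):
			pq.From = strings.TrimPrefix(tok, "from:")
		case strings.HasPrefix(tok, "did:"):
			pq.DID = tok
		default:
			pq.Terms = append(pq.Terms, tok)
		}
	}
	return pq
}

func main() {
	fmt.Printf("%+v\n", parseQuery(`from:alice.example.com "hello world" -spam did:plc:abc123`))
}
```

A quoted token such as `"did:plc:abc123"` starts with a quote character, so it stays a search term, in line with the rule that only un-quoted DIDs act as filters.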


## Configuration

Palomar uses environment variables for configuration.

- `ATP_BGS_HOST`: URL of the Bluesky BGS (e.g., `https://bgs.staging.bsky.dev`).
- `ELASTIC_HTTPS_FINGERPRINT`: Required if using a self-signed cert for your Elasticsearch deployment.
- `ELASTIC_USERNAME`: Elasticsearch username (default: `admin`).
- `ELASTIC_PASSWORD`: Password for Elasticsearch authentication.
- `ELASTIC_HOSTS`: Comma-separated list of Elasticsearch endpoints.
- `READONLY`: Set this if the instance should act as a readonly HTTP server (no indexing).
- `ATP_BGS_HOST`: URL of firehose to subscribe to, either global BGS or individual PDS (default: `wss://bsky.social`)
- `ATP_PLC_HOST`: PLC directory for identity lookups (default: `https://plc.directory`)
- `DATABASE_URL`: connection string for database to persist firehose cursor subscription state
- `PALOMAR_BIND`: IP/port to have HTTP API listen on (default: `:3999`)
- `ES_USERNAME`: Elasticsearch username (default: `admin`)
- `ES_PASSWORD`: Password for Elasticsearch authentication
- `ES_CERT_FILE`: Optional, for TLS connections
- `ES_HOSTS`: Comma-separated list of Elasticsearch endpoints
- `ES_POST_INDEX`: name of index for post docs (default: `palomar_post`)
- `ES_PROFILE_INDEX`: name of index for profile docs (default: `palomar_profile`)
- `PALOMAR_READONLY`: Set this if the instance should act as a readonly HTTP server (no indexing)
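
As a rough illustration of how the variables and defaults listed above fit together, the sketch below loads them into a struct. The `config` type and `loadConfig` function are hypothetical and are not how palomar itself reads its settings. Treating any non-empty `PALOMAR_READONLY` value as "set" is also an assumption.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// config is a hypothetical holder for the documented settings.
type config struct {
	BGSHost      string
	PLCHost      string
	DatabaseURL  string
	Bind         string
	ESUsername   string
	ESPassword   string
	ESCertFile   string
	ESHosts      []string
	PostIndex    string
	ProfileIndex string
	ReadOnly     bool
}

// loadConfig reads settings through getenv (pass os.Getenv in real use),
// applying the defaults stated in the README.
func loadConfig(getenv func(string) string) config {
	orDefault := func(key, def string) string {
		if v := getenv(key); v != "" {
			return v
		}
		return def
	}
	// ES_HOSTS is a comma-separated list; empty entries are dropped.
	var hosts []string
	for _, h := range strings.Split(getenv("ES_HOSTS"), ",") {
		if h = strings.TrimSpace(h); h != "" {
			hosts = append(hosts, h)
		}
	}
	return config{
		BGSHost:      orDefault("ATP_BGS_HOST", "wss://bsky.social"),
		PLCHost:      orDefault("ATP_PLC_HOST", "https://plc.directory"),
		DatabaseURL:  getenv("DATABASE_URL"),
		Bind:         orDefault("PALOMAR_BIND", ":3999"),
		ESUsername:   orDefault("ES_USERNAME", "admin"),
		ESPassword:   getenv("ES_PASSWORD"),
		ESCertFile:   getenv("ES_CERT_FILE"),
		ESHosts:      hosts,
		PostIndex:    orDefault("ES_POST_INDEX", "palomar_post"),
		ProfileIndex: orDefault("ES_PROFILE_INDEX", "palomar_profile"),
		ReadOnly:     getenv("PALOMAR_READONLY") != "", // assumption: any non-empty value enables it
	}
}

func main() {
	fmt.Printf("%+v\n", loadConfig(os.Getenv))
}
```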

## HTTP API

### Query Posts: `/xrpc/app.bsky.unspecced.searchPostsSkeleton`

HTTP Query Params:

- `q`: query string, required
- `limit`: integer, default 25
- `cursor`: string, for partial pagination (uses offset, not a scroll)

Response:

- `posts`: array of AT-URI strings
- `hits_total`: integer; optional number of search hits (may not be populated for large result sets, eg over 10k hits)
- `cursor`: string; optionally included if there are more results that can be paginated

### Query Profiles: `/xrpc/app.bsky.unspecced.searchActorsSkeleton`

HTTP Query Params:

- `q`: query string, required
- `limit`: integer, default 25
- `cursor`: string, for partial pagination (uses offset, not a scroll)
- `typeahead`: boolean, for typeahead behavior (vs. full search)

Response:

- `actors`: array of AT-URI strings
- `hits_total`: integer; optional number of search hits (may not be populated for large result sets, eg over 10k hits)
- `cursor`: string; optionally included if there are more results that can be paginated
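
A minimal client sketch for the post search endpoint follows. The helper names (`searchPostsURL`, `searchPosts`, `postsSkeleton`) are hypothetical, the base URL assumes the default `:3999` bind on localhost, and the response field names are copied from the README text above rather than checked against a running service.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"strconv"
)

// postsSkeleton mirrors the response fields as the README describes them.
type postsSkeleton struct {
	Posts     []string `json:"posts"`
	HitsTotal *int     `json:"hits_total,omitempty"`
	Cursor    string   `json:"cursor,omitempty"`
}

// searchPostsURL builds the request URL; limit and cursor are optional.
func searchPostsURL(base, q string, limit int, cursor string) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = "/xrpc/app.bsky.unspecced.searchPostsSkeleton"
	params := url.Values{}
	params.Set("q", q)
	if limit > 0 {
		params.Set("limit", strconv.Itoa(limit))
	}
	if cursor != "" {
		params.Set("cursor", cursor)
	}
	u.RawQuery = params.Encode()
	return u.String(), nil
}

// searchPosts performs one GET and decodes the skeleton response.
func searchPosts(ctx context.Context, client *http.Client, base, q string, limit int, cursor string) (*postsSkeleton, error) {
	endpoint, err := searchPostsURL(base, q, limit, cursor)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, endpoint, nil)
	if err != nil {
		return nil, err
	}
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("search request failed: %s", resp.Status)
	}
	var out postsSkeleton
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	return &out, nil
}

func main() {
	u, err := searchPostsURL("http://localhost:3999", "hello world", 10, "")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(u)
}
```

To page through results, a caller would pass the `cursor` from one response into the next call to `searchPosts` and stop when no cursor is returned.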

## Development Quickstart

Run an ephemeral OpenSearch instance on local port 9200, with SSL disabled and the `analysis-icu` plugin installed, using Docker:

## Running the Application
docker build -f Dockerfile.opensearch . -t opensearch-palomar
docker run -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" -e "plugins.security.disabled=true" opensearch-palomar

Once the environment variables are set properly, you can start Palomar by running:
See [README.opensearch.md]() for more OpenSearch operational tips.

```
./palomar run
```
From the top level of the repository:

## Indexing
# run combined indexing and search service
make run-dev-search

For now, there isnt an easy way to get updates from the PDS, so to keep the
index up to date you will periodcally need to scrape the data.
# run just the search service
READONLY=true make run-dev-search

## API
You'll need to get some content into the index. An easy way to do this is to have palomar consume from the public production firehose.

### `/index/:did`
You can run test queries from the top level of the repository:

Indexes the content in the given user's repository. It keeps track of the last repository update and only fetches incremental changes.
go run ./cmd/palomar search-post "hello"
go run ./cmd/palomar search-profile "hello"
go run ./cmd/palomar search-profile -typeahead "h"

### `/search?q=QUERY`
For more commands and args:

Performs a simple, case-insensitive search across the entire application.
go run ./cmd/palomar --help
90 changes: 90 additions & 0 deletions cmd/palomar/README.opensearch.md
@@ -0,0 +1,90 @@

# Basic OpenSearch Operations

We use OpenSearch version 2.5+, with the `analysis-icu` plugin. The plugin is included automatically on the AWS-hosted version of OpenSearch; otherwise you need to install it:

sudo /usr/share/opensearch/bin/opensearch-plugin install analysis-icu
sudo service opensearch restart

If you are trying to use Elasticsearch 7.10 instead of OpenSearch, you can install the plugin with:

sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install analysis-icu
sudo service elasticsearch restart

## Local Development

The commands below assume OpenSearch is running locally on port 9200.

To manually drop and re-build the indices with new schemas (palomar will create these automatically if they don't exist, but this can be helpful when developing the schema itself):

http delete :9200/palomar_post
http delete :9200/palomar_profile
http put :9200/palomar_post < post_schema.json
http put :9200/palomar_profile < profile_schema.json

Put a single object (good for debugging):

head -n1 examples.json | http post :9200/palomar_post/_doc/0
http get :9200/palomar_post/_doc/0

Bulk insert from a file on disk:

# esbulk is a golang CLI tool which must be installed separately
esbulk -verbose -id ident -index palomar_post -type _doc examples.json

## Index Aliases

To make re-indexing and schema changes easier, we can create versioned (or
time-stamped) elasticsearch indexes, and then point to them using index
aliases. The index alias updates are fast and atomic, so we can slowly build up
a new index and then cut over with no downtime.

http put :9200/palomar_post_v04 < post_schema.json

To atomically swap the alias from one index to a new one ("zero downtime"):

http post :9200/_aliases << EOF
{
"actions": [
{ "remove": { "index": "palomar_post_v05", "alias": "palomar_post" }},
{ "add": { "index": "palomar_post_v06", "alias": "palomar_post" }}
]
}
EOF
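
The same `_aliases` request body can be generated from code, which helps when the index version suffixes are computed. A minimal sketch, with hypothetical type and function names (`aliasTarget`, `aliasAction`, `aliasRequest`, `swapAliasBody`):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// aliasTarget names an index and the alias to add or remove.
type aliasTarget struct {
	Index string `json:"index"`
	Alias string `json:"alias"`
}

// aliasAction holds exactly one of Remove or Add.
type aliasAction struct {
	Remove *aliasTarget `json:"remove,omitempty"`
	Add    *aliasTarget `json:"add,omitempty"`
}

// aliasRequest is the body sent to the _aliases endpoint.
type aliasRequest struct {
	Actions []aliasAction `json:"actions"`
}

// swapAliasBody builds a remove+add pair so the alias moves in one request.
func swapAliasBody(alias, oldIndex, newIndex string) ([]byte, error) {
	return json.Marshal(aliasRequest{Actions: []aliasAction{
		{Remove: &aliasTarget{Index: oldIndex, Alias: alias}},
		{Add: &aliasTarget{Index: newIndex, Alias: alias}},
	}})
}

func main() {
	body, err := swapAliasBody("palomar_post", "palomar_post_v05", "palomar_post_v06")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(body))
}
```

The resulting bytes would be sent as the JSON body of a POST to `/_aliases`, as in the `http post` command above.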

To replace an existing ("real") index with an alias pointer, do two actions
(not truly zero-downtime, but pretty fast):

http delete :9200/palomar_post
http put :9200/palomar_post_v03/_alias/palomar_post

## Full-Text Querying

A generic full-text "query string" query looks like this (replace "blood" with
actual query string, and "size" field with the max results to return):

GET /palomar_post/_search
{
"query": {
"query_string": {
"query": "blood",
"analyzer": "textIcuSearch",
"default_operator": "AND",
"analyze_wildcard": true,
"lenient": true,
"fields": ["handle^5", "text"]
}
},
"size": 3
}

In the results take `.hits.hits[]._source` as the objects; `.hits.total` is the
total number of search hits.


## Index Debugging

Check index size:

http get :9200/palomar_post/_count
http get :9200/palomar_profile/_count