improve logging and add support for the ingest-attachment plugin in es 5
rwynn committed Aug 14, 2016
1 parent c380dd6 commit 4f25a91
Showing 2 changed files with 238 additions and 50 deletions.
119 changes: 111 additions & 8 deletions README.md
@@ -50,6 +50,8 @@ A sample TOML config file looks like this:
namespace-exclude-regex = "^mydb.ignorecollection$"
gtm-channel-size = 200
index-files = true
file-highlighting = true
file-namespaces = ["users.fs.files"]
verbose = true

All options in the config file above also work if passed explicitly by the same name to the monstache command
@@ -73,6 +75,8 @@ The following defaults are used for missing config values:
namespace-exclude-regex -> nil
gtm-channel-size -> 100
index-files -> false
file-highlighting -> false
file-namespaces -> nil
verbose -> false

When `resume` is true, monstache writes the timestamp of mongodb operations it has successfully synced to elasticsearch
@@ -111,6 +115,13 @@ For versions of elasticsearch prior to version 5, you should install the [mapper
of elasticsearch the mapper-attachments plugin is deprecated and you should install the [ingest-attachment](https://www.elastic.co/guide/en/elasticsearch/plugins/master/ingest-attachment.html) plugin instead.
For further information on how to configure monstache to index content from gridfs, see the section [Indexing Gridfs Files](#files).

The `file-namespaces` config must be set when `index-files` is enabled. `file-namespaces` must be set to an array of mongodb
namespace strings. Files uploaded through gridfs to any of the namespaces in `file-namespaces` will be retrieved and their
raw content indexed into elasticsearch via either the mapper-attachments or ingest-attachment plugin.

When `file-highlighting` is true, queries on files which were indexed in elasticsearch from gridfs can return highlighted
keywords from the extracted text of those files.

When `verbose` is true monstache will enable debug logging, including a trace of requests to elasticsearch.

When `elasticsearch-retry-seconds` is greater than 0, a failed request to elasticsearch will be retried after the given number of seconds.
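
The retry behaviour described above can be sketched roughly as follows. This is not monstache's actual code (monstache is written in Go); `send_with_retry` and `flaky_request` are hypothetical names, and retrying exactly once is an assumption, since the text does not say how many attempts are made.

```python
import time


def send_with_retry(send, retry_seconds, sleep=time.sleep):
    """Call send(); on failure wait retry_seconds and try once more.

    Retrying exactly once is an assumption made for this sketch.
    """
    try:
        return send()
    except Exception:
        if retry_seconds <= 0:
            raise  # a value of 0 (or less) means retries are disabled
        sleep(retry_seconds)
        return send()


# Usage with a stand-in request that fails on its first call only.
attempts = []


def flaky_request():
    attempts.append(1)
    if len(attempts) == 1:
        raise ConnectionError("elasticsearch unavailable")
    return "indexed"


# A no-op sleep keeps the example fast; real code would use time.sleep.
print(send_with_retry(flaky_request, retry_seconds=5, sleep=lambda seconds: None))
```

Passing the sleep function as a parameter keeps the delay testable without actually waiting.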
@@ -217,51 +228,143 @@ namespace of all collections which will hold GridFS files. For example in your T

file-namespaces = ["users.fs.files", "posts.fs.files"]

file-highlighting = true

The above configuration tells monstache that you wish to index the raw content of GridFS files in the `users` and `posts`
mongodb databases. By default, mongodb uses a bucket named `fs`, so if you just use the defaults your collection name will
be `fs.files`. However, if you have customized the bucket name, then your file collection would be something like `mybucket.files`
and the entire namespace would be `users.mybucket.files`.
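
To make the namespace layout concrete, here is a minimal Python sketch that splits a namespace string into its database, collection, and bucket parts. The helper name `split_namespace` is hypothetical and is not part of monstache.

```python
def split_namespace(namespace):
    """Split a mongodb namespace into (database, collection, bucket).

    The database is everything before the first dot; the rest is the
    collection. GridFS keeps file metadata in '<bucket>.files'.
    """
    database, _, collection = namespace.partition(".")
    if not database or not collection:
        raise ValueError(f"not a namespace: {namespace!r}")
    suffix = ".files"
    bucket = collection[: -len(suffix)] if collection.endswith(suffix) else None
    return database, collection, bucket


for ns in ["users.fs.files", "users.mybucket.files"]:
    print(split_namespace(ns))
```

With the default bucket, the collection part is `fs.files`; with a custom bucket named `mybucket`, it is `mybucket.files`.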

When you configure monstache this way it will perform an additional operation at startup to ensure the destination indexes in
- elasticsearch have a field named `filecontent` with a type mapping of `attachment`.
+ elasticsearch have a field named `file` with a type mapping of `attachment`.

For the example TOML configuration above, monstache would initialize 2 indices in preparation for indexing into
elasticsearch by issuing the following REST commands:

For elasticsearch versions prior to version 5...

POST /users
{
"mappings": {
"fs.files": {
"properties": {
- "filecontent": { "type": "attachment" }
+ "file": { "type": "attachment" }
}}}}

POST /posts
{
"mappings": {
"fs.files": {
"properties": {
- "filecontent": { "type": "attachment" }
+ "file": { "type": "attachment" }
}}}}

For elasticsearch version 5 and above...

PUT /_ingest/pipeline/attachment
{
"description" : "Extract file information",
"processors" : [
{
"attachment" : {
"field" : "file"
}
}
]
}
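
As an illustration only, the requests above could be derived from the `file-namespaces` setting along these lines. This is a Python sketch with hypothetical helper names, not monstache's implementation; it builds the request bodies but does not send anything to elasticsearch.

```python
import json


def legacy_mapping_requests(file_namespaces):
    """Return (method, path, body) tuples for elasticsearch versions before 5."""
    requests = []
    for namespace in file_namespaces:
        # 'users.fs.files' -> index 'users', type 'fs.files'
        database, _, collection = namespace.partition(".")
        body = {
            "mappings": {
                collection: {"properties": {"file": {"type": "attachment"}}}
            }
        }
        requests.append(("POST", "/" + database, body))
    return requests


def pipeline_request():
    """Return the (method, path, body) tuple for elasticsearch 5 and above."""
    body = {
        "description": "Extract file information",
        "processors": [{"attachment": {"field": "file"}}],
    }
    return ("PUT", "/_ingest/pipeline/attachment", body)


namespaces = ["users.fs.files", "posts.fs.files"]
for method, path, body in legacy_mapping_requests(namespaces) + [pipeline_request()]:
    print(method, path)
    print(json.dumps(body, indent=2))
```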

When a file is inserted into mongodb via GridFS, monstache will detect the new file, use the mongodb api to retrieve the raw
- content, and index a document into elasticsearch with the raw content stored in a `filecontent` field as a base64
+ content, and index a document into elasticsearch with the raw content stored in a `file` field as a base64
encoded string. The elasticsearch plugin will then extract text content from the raw content using
- [Apache Tika](https://tika.apache.org/)
- , tokenize the text content, and allow you to query on the content of the file.
+ [Apache Tika](https://tika.apache.org/), tokenize the text content, and allow you to query on the content of the file.
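
The encoding step can be illustrated with a short Python sketch. The helper `build_file_document` is hypothetical, and which metadata fields accompany the `file` field is an assumption here (`filename` is used purely for illustration).

```python
import base64


def build_file_document(raw_content, metadata=None):
    """Return a document whose 'file' field holds base64 encoded content.

    Which metadata fields are indexed alongside 'file' is an assumption;
    only the 'file' field itself is described in the text above.
    """
    document = dict(metadata or {})
    document["file"] = base64.b64encode(raw_content).decode("ascii")
    return document


doc = build_file_document(b"I like to program in golang.", {"filename": "resume.docx"})
print(doc["file"])

# Decoding the field gives back the original bytes.
assert base64.b64decode(doc["file"]) == b"I like to program in golang."
```

Base64 is used because the raw bytes of a file (for example a `.docx`) cannot be placed directly inside a JSON string.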

To test this feature, use the [mongofiles](https://docs.mongodb.com/manual/reference/program/mongofiles/)
command to add a file to mongodb via GridFS. Continuing the example above, the following command puts a
file named `resume.docx` into GridFS; after a short time the file should be searchable in elasticsearch in the index `users`
under the type `fs.files`.


mongofiles -d users put resume.docx


After a short time you should be able to query the contents of resume.docx in the users index in elasticsearch:

curl -XGET 'http://localhost:9200/users/fs.files/_search?q=golang'

If you would like to see the text extracted by Apache Tika you can project the appropriate sub-field.

For elasticsearch versions prior to version 5...

curl 'localhost:9200/users/fs.files/_search?pretty' -d '{
"fields": [ "file.content" ],
"query": {
"match": {
"_all": "golang"
}
}
}'

For elasticsearch version 5 and above...

curl 'localhost:9200/users/fs.files/_search?pretty' -d '{
"_source": [ "attachment.content" ],
"query": {
"match": {
"_all": "golang"
}
}
}'

When `file-highlighting` is enabled you can add a highlight clause to your query.

For elasticsearch versions prior to version 5...

curl 'localhost:9200/users/fs.files/_search?pretty' -d '{
"fields": ["file.content"],
"query": {
"match": {
"file.content": "golang"
}
},
"highlight": {
"fields": {
"file.content": {
}
}
}
}'

For elasticsearch version 5 and above...

curl 'localhost:9200/users/fs.files/_search?pretty' -d '{
"_source": ["attachment.content"],
"query": {
"match": {
"attachment.content": "golang"
}
},
"highlight": {
"fields": {
"attachment.content": {
}
}
}
}'


The highlight response will contain emphasis on the matching terms.

For elasticsearch versions prior to version 5...

"hits" : [ {
"highlight" : {
"file.content" : [ "I like to program in <em>golang</em>.\n\n" ]
}
} ]

For elasticsearch version 5 and above...

"hits" : [{
"highlight" : {
"attachment.content" : [ "I like to program in <em>golang</em>." ]
}
}]
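
A client could collect the highlighted fragments from a search response roughly as follows. This is a Python sketch with a hypothetical helper name; it assumes the usual elasticsearch response layout in which the array shown above is nested under a top-level `hits` object, and the sample response is hand-written rather than captured output.

```python
import json


def highlight_fragments(response, field):
    """Collect the highlight fragments for one field across all hits."""
    fragments = []
    for hit in response.get("hits", {}).get("hits", []):
        fragments.extend(hit.get("highlight", {}).get(field, []))
    return fragments


# Hand-written sample mirroring the elasticsearch 5 excerpt above.
sample = json.loads("""
{
  "hits": {
    "hits": [
      {"highlight": {"attachment.content": ["I like to program in <em>golang</em>."]}}
    ]
  }
}
""")

for fragment in highlight_fragments(sample, "attachment.content"):
    print(fragment)
```

For elasticsearch versions prior to version 5, the same helper would be called with the field name `file.content`.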
