---

copyright:
  years: 2021

lastupdated: "2021-09-28"

subcollection: discovery-data

---

{{site.data.keyword.attribute-definition-list}}

# Building a crawler plug-in
{: #crawler-plugin-build}

{{site.data.keyword.discoveryshort}} features the option to build your own crawler plug-in with a Java SDK. By using crawler plug-ins, you can quickly develop solutions that are relevant to your use cases. You can download the SDK from your installed {{site.data.keyword.discoveryshort}} cluster. For more information, see [Obtaining the crawler plug-in SDK package](#obtain-sdk).
{: shortdesc}
[IBM Cloud Pak for Data]{: tag-cp4d} **{{site.data.keyword.icp4dfull_notm}} only**

This information applies only to installed deployments.
{: note}

Any custom code that you use with {{site.data.keyword.discoveryfull}} is the responsibility of the developer; IBM Support does not cover any custom code that the developer creates.
{: note}
The crawler plug-ins support the following functions:
- Update the metadata list of a crawled document
- Update the content of a crawled document
- Exclude a crawled document
- Reference crawler configurations, masking password values
- Show notice messages in the {{site.data.keyword.discoveryshort}} user interface
- Output log messages to the crawler pod console

However, the crawler plug-ins cannot support the following functions:
- Split a crawled document into multiple documents
- Combine content from multiple documents into a single document
- Modify access control lists
## Requirements
{: #plugin-reqs}

Make sure that the following items are installed on the development server that you plan to use to develop a crawler plug-in by using this SDK:
- Java SE Development Kit (JDK) 1.8 or higher
- [Gradle](https://gradle.org){: external}
- cURL
- sed (stream editor)
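You can quickly confirm that these prerequisites are on the development server's `PATH` with a short shell loop; note that this checks only for the presence of each command, so you still need to verify the JDK and Gradle versions manually:

```sh
# Check that each development prerequisite is available on PATH.
# A "MISSING" line means that tool must be installed before you continue.
for tool in java gradle curl sed; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: MISSING"
  fi
done
```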
## Obtaining the crawler plug-in SDK package
{: #obtain-sdk}

1. Log in to your {{site.data.keyword.discoveryshort}} cluster.

1. Enter the following command to obtain your crawler pod name:

   ```sh
   oc get pods | grep crawler
   ```
   {: pre}

   The following example shows sample output.

   ```
   wd-discovery-crawler-57985fc5cf-rxk89   1/1   Running   0   85m
   ```
   {: codeblock}

1. Enter the following command to obtain the SDK package name, replacing `{crawler-pod-name}` with the crawler pod name that you obtained in step 2:

   ```sh
   oc exec {crawler-pod-name} -- ls -l /opt/ibm/wex/zing/resources/ | grep wd-crawler-plugin-sdk
   ```
   {: pre}

   The following example shows sample output.

   ```
   -rw-r--r--. 1 dadmin dadmin 35575 Oct 1 16:51 wd-crawler-plugin-sdk-${build-version}.zip
   ```
   {: codeblock}

1. Enter the following command to copy the SDK package to the host server, replacing `{build-version}` with the build version number from the previous step:

   ```sh
   oc cp {crawler-pod-name}:/opt/ibm/wex/zing/resources/wd-crawler-plugin-sdk-${build-version}.zip wd-crawler-plugin-sdk.zip
   ```
   {: pre}

1. If necessary, copy the SDK package to the development server.
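The download steps above can also be scripted. The following sketch is an assumption-laden convenience, not part of the product: it assumes you are already logged in to the cluster with `oc`, and it captures the pod name and SDK file name into variables instead of copying them by hand:

```sh
# Sketch only: assumes an active `oc` login to the Discovery cluster.
if command -v oc >/dev/null 2>&1; then
  # The pod name is the first column of the matching `oc get pods` row.
  CRAWLER_POD=$(oc get pods | grep crawler | head -n 1 | awk '{print $1}')

  # Find the versioned SDK package file name inside the pod.
  SDK_ZIP=$(oc exec "$CRAWLER_POD" -- ls /opt/ibm/wex/zing/resources/ | grep wd-crawler-plugin-sdk)

  # Copy the package out of the pod to the current directory.
  oc cp "$CRAWLER_POD:/opt/ibm/wex/zing/resources/$SDK_ZIP" wd-crawler-plugin-sdk.zip
else
  echo "oc CLI not found; log in to the cluster first"
fi
```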
## Building a crawler plug-in package
{: #build-plugin-pkg}

1. Extract the SDK compressed file.

1. Implement the plug-in logic in `src/`. Ensure that the dependency is written in `build.gradle`.

1. Enter `gradle packageCrawlerPlugin` to create the plug-in package. The package is generated as `build/distributed/wd-crawler-plugin-sample.zip`.
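As a recap, the build steps can be sketched in the shell as follows; the extraction directory name `wd-crawler-plugin-sdk` is an assumption, and any working directory works:

```sh
# Sketch of the build steps; requires the SDK zip from the previous section.
if [ -f wd-crawler-plugin-sdk.zip ]; then
  # Extract the SDK into a working directory (directory name is an assumption).
  unzip wd-crawler-plugin-sdk.zip -d wd-crawler-plugin-sdk
  cd wd-crawler-plugin-sdk

  # ... implement your plug-in logic in src/ and declare any
  # extra dependencies in build.gradle, then build the package:
  gradle packageCrawlerPlugin

  # The resulting plug-in package:
  ls build/distributed/wd-crawler-plugin-sample.zip
else
  echo "wd-crawler-plugin-sdk.zip not found; copy it to this directory first"
fi
```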