---

copyright:
  years: 2021

lastupdated: "2021-09-28"

subcollection: discovery-data

---

{{site.data.keyword.attribute-definition-list}}

# Building a crawler plug-in
{: #crawler-plugin-build}

{{site.data.keyword.discoveryshort}} features the option to build your own crawler plug-in with a Java SDK. By using crawler plug-ins, you can quickly develop solutions that are relevant to your use cases. You can download the SDK from your installed {{site.data.keyword.discoveryshort}} cluster. For more information, see [Obtaining the crawler plug-in SDK package](#obtain-sdk).
{: shortdesc}
[IBM Cloud Pak for Data]{: tag-cp4d} **{{site.data.keyword.icp4dfull_notm}} only**

This information applies only to installed deployments.
{: note}

Any custom code that you use with {{site.data.keyword.discoveryfull}} is the responsibility of the developer; IBM Support does not cover any custom code that the developer creates.
{: note}
The crawler plug-ins support the following functions:
- Update the metadata list of a crawled document
- Update the content of a crawled document
- Exclude a crawled document
- Reference crawler configurations, masking password values
- Show notice messages in the {{site.data.keyword.discoveryshort}} user interface
- Output log messages to the crawler pod console

However, the crawler plug-ins cannot support the following functions:
- Split a crawled document into multiple documents
- Combine content from multiple documents into a single document
- Modify access control lists
## Requirements
{: #plugin-reqs}

Make sure that the following items are installed on the development server that you plan to use to develop a crawler plug-in by using this SDK:
- Java SE Development Kit (JDK) 1.8 or higher
- [Gradle](https://gradle.org){: external}
- cURL
- sed (stream editor)
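You can quickly confirm that these prerequisites are on the development server's `PATH` with a short shell loop; note that this checks only for the presence of each command, so you still need to verify the JDK and Gradle versions manually:

```sh
# Check that each development prerequisite is available on PATH.
# A "MISSING" line means that tool must be installed before you continue.
for tool in java gradle curl sed; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: MISSING"
  fi
done
```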
## Obtaining the crawler plug-in SDK package
{: #obtain-sdk}

1. Log in to your {{site.data.keyword.discoveryshort}} cluster.

1. Enter the following command to obtain your crawler pod name:

   ```sh
   oc get pods | grep crawler
   ```
   {: pre}

   The following example shows sample output.

   ```
   wd-discovery-crawler-57985fc5cf-rxk89   1/1   Running   0   85m
   ```
   {: codeblock}

1. Enter the following command to obtain the SDK package name, replacing `{crawler-pod-name}` with the crawler pod name that you obtained in step 2:

   ```sh
   oc exec {crawler-pod-name} -- ls -l /opt/ibm/wex/zing/resources/ | grep wd-crawler-plugin-sdk
   ```
   {: pre}

   The following example shows sample output.

   ```
   -rw-r--r--. 1 dadmin dadmin 35575 Oct 1 16:51 wd-crawler-plugin-sdk-${build-version}.zip
   ```
   {: codeblock}

1. Enter the following command to copy the SDK package to the host server, replacing `{build-version}` with the build version number from the previous step:

   ```sh
   oc cp {crawler-pod-name}:/opt/ibm/wex/zing/resources/wd-crawler-plugin-sdk-${build-version}.zip wd-crawler-plugin-sdk.zip
   ```
   {: pre}

1. If necessary, copy the SDK package to the development server.
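The download steps above can also be scripted. The following sketch is an assumption-laden convenience, not part of the product: it assumes you are already logged in to the cluster with `oc`, and it captures the pod name and SDK file name into variables instead of copying them by hand:

```sh
# Sketch only: assumes an active `oc` login to the Discovery cluster.
if command -v oc >/dev/null 2>&1; then
  # The pod name is the first column of the matching `oc get pods` row.
  CRAWLER_POD=$(oc get pods | grep crawler | head -n 1 | awk '{print $1}')

  # Find the versioned SDK package file name inside the pod.
  SDK_ZIP=$(oc exec "$CRAWLER_POD" -- ls /opt/ibm/wex/zing/resources/ | grep wd-crawler-plugin-sdk)

  # Copy the package out of the pod to the current directory.
  oc cp "$CRAWLER_POD:/opt/ibm/wex/zing/resources/$SDK_ZIP" wd-crawler-plugin-sdk.zip
else
  echo "oc CLI not found; log in to the cluster first"
fi
```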
## Building a crawler plug-in package
{: #build-plugin-pkg}

1. Extract the SDK compressed file.

1. Implement the plug-in logic in `src/`. Ensure that the dependency is written in `build.gradle`.

1. Enter `gradle packageCrawlerPlugin` to create the plug-in package. The package is generated as `build/distributed/wd-crawler-plugin-sample.zip`.
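As a recap, the build steps can be sketched in the shell as follows; the extraction directory name `wd-crawler-plugin-sdk` is an assumption, and any working directory works:

```sh
# Sketch of the build steps; requires the SDK zip from the previous section.
if [ -f wd-crawler-plugin-sdk.zip ]; then
  # Extract the SDK into a working directory (directory name is an assumption).
  unzip wd-crawler-plugin-sdk.zip -d wd-crawler-plugin-sdk
  cd wd-crawler-plugin-sdk

  # ... implement your plug-in logic in src/ and declare any
  # extra dependencies in build.gradle, then build the package:
  gradle packageCrawlerPlugin

  # The resulting plug-in package:
  ls build/distributed/wd-crawler-plugin-sample.zip
else
  echo "wd-crawler-plugin-sdk.zip not found; copy it to this directory first"
fi
```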