copyright | lastupdated | subcollection |
---|---|---|
 | 2022-09-28 | discovery-data |
{{site.data.keyword.attribute-definition-list}}
{: #connector-dev}
The custom connector example includes a Java package named `com.ibm.es.ama.custom.crawler`. The package includes the following Java interfaces that you can use when you write your own custom connector.
{: shortdesc}
[IBM Cloud Pak for Data]{: tag-cp4d} {{site.data.keyword.icp4dfull_notm}} only
This information applies only to installed deployments.
{: note}
{: #ccs-interfaces-jdoc}
The interfaces that are listed in this document are available in the JAR package file that is included with the custom connector compressed file. After you download and expand the `custom-crawler-docs.zip` file as described in Downloading the `custom-crawler-docs.zip` file in Discovery 2.2.1 and later and Downloading the `custom-crawler-docs.zip` file in Discovery 2.2.0 and earlier, the interface JAR file is available as `wexlib/ama-zing-custom-crawler-{version_numbers}.jar` from the root level of the expanded compressed file. Javadoc for the JAR file is available as `wexlib/ama-zing-custom-crawler-{version_numbers}-javadoc.jar` at the same level.
{: #ccs-init-interface}
{: #customcrawler}
Use the `com.ibm.es.ama.custom.crawler.CustomCrawler` interface to start or stop a custom crawler or to crawl documents from a path. The interface has the following methods:

| Method | Description |
|---|---|
| `init` | Start a custom crawler |
| `term` | Stop a custom crawler |
| `crawl` | Crawl documents from a specified path |
{: caption="CustomCrawler methods" caption-side="top"}
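
The lifecycle that the table implies can be sketched as follows. This is only an illustration: `SketchCrawler` is a hypothetical stand-in with simplified parameters, not the real `CustomCrawler` interface, and the sketch assumes that `init` is called once before crawling and `term` once afterward. See the Javadoc for the actual signatures.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the crawler contract. The real CustomCrawler
// method signatures differ and are documented in the Javadoc.
interface SketchCrawler {
    void init();             // start the crawler, for example open connections
    void crawl(String path); // crawl documents under one path
    void term();             // stop the crawler and release resources
}

final class LifecycleSketch {
    // Calls init, then crawl for each path, and always calls term at the end.
    static void run(SketchCrawler crawler, List<String> paths) {
        crawler.init();
        try {
            for (String path : paths) {
                crawler.crawl(path);
            }
        } finally {
            crawler.term();
        }
    }

    // Recording implementation that makes the call order visible.
    static final class Recording implements SketchCrawler {
        final List<String> calls = new ArrayList<>();
        public void init() { calls.add("init"); }
        public void crawl(String path) { calls.add("crawl:" + path); }
        public void term() { calls.add("term"); }
    }

    public static void main(String[] args) {
        Recording crawler = new Recording();
        run(crawler, Arrays.asList("/docs", "/reports"));
        System.out.println(crawler.calls);
    }
}
```

The `try`/`finally` block is a design choice of the sketch: it releases resources even when a crawl step throws an exception.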
{: #ccs-config-interfaces}
{: #customcrawlerconfiguration}
Use the `com.ibm.es.ama.custom.crawler.CustomCrawlerConfiguration` interface to validate the configuration and to discover available crawl spaces on the data source. The interface has the following methods:

| Method | Description |
|---|---|
| `validate` | Validate configuration |
| `getFieldsFor` | List known fields and their types |
| `discoverySubspaces` | Discover crawl spaces on the data source |
{: caption="CustomCrawlerConfiguration methods" caption-side="top"}
{: #configprovider}
Use the `com.ibm.es.ama.custom.crawler.CustomCrawlerConfiguration.ConfigProvider` interface to map the settings of the data source and to list the crawl-space settings on the data source. The interface has the following methods:

| Method | Description |
|---|---|
| `get` | Get a map of the settings in a section |
| `getCrawlspaceSettings` | Get a list of crawl-space settings |
{: caption="ConfigProvider methods" caption-side="top"}
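
As a rough illustration of how a connector might check a settings map before it connects, the following sketch validates two settings. Everything in it is an assumption made for the example: the `host` and `port` keys are hypothetical, and the real `ConfigProvider.get` return type might not be `Map<String, String>`. Check the Javadoc before you rely on any of it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SettingsValidationSketch {
    // Returns one message per problem. An empty list means the settings
    // passed these checks. The "host" and "port" keys are hypothetical.
    static List<String> validate(Map<String, String> settings) {
        List<String> errors = new ArrayList<>();
        String host = settings.get("host");
        if (host == null || host.trim().isEmpty()) {
            errors.add("host is required");
        }
        String port = settings.get("port");
        try {
            int value = Integer.parseInt(port == null ? "" : port.trim());
            if (value < 1 || value > 65535) {
                errors.add("port must be between 1 and 65535");
            }
        } catch (NumberFormatException e) {
            // A missing or non-numeric port ends up here.
            errors.add("port must be a number");
        }
        return errors;
    }

    public static void main(String[] args) {
        Map<String, String> settings = new HashMap<>();
        settings.put("host", "sftp.example.com");
        settings.put("port", "22");
        System.out.println(validate(settings)); // prints []
    }
}
```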
{: #subspaceconsumer}
Use the `com.ibm.es.ama.custom.crawler.CustomCrawlerConfiguration.SubspaceConsumer` interface to add a path to a crawl space. The interface has the following method:

| Method | Description |
|---|---|
| `add` | Add a path to the crawl space |
{: caption="SubspaceConsumer methods" caption-side="top"}
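
The following sketch shows one way that discovery code might report paths to a consumer. `SketchSubspaceConsumer` is a hypothetical stand-in, and the real `add` method might take different parameters. The directory names are supplied as a plain list here, whereas a real connector would read them from the data source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for SubspaceConsumer. The real add method might
// take different parameters (see the Javadoc).
interface SketchSubspaceConsumer {
    void add(String path);
}

final class DiscoverySketch {
    // Reports each directory under root to the consumer as a crawl-space path.
    static void discover(String root, List<String> directoryNames,
                         SketchSubspaceConsumer consumer) {
        String base = root.endsWith("/") ? root : root + "/";
        for (String name : directoryNames) {
            consumer.add(base + name);
        }
    }

    public static void main(String[] args) {
        List<String> found = new ArrayList<>();
        // The method reference collects every reported path into the list.
        discover("/home/crawl", Arrays.asList("docs", "reports"), found::add);
        System.out.println(found);
    }
}
```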
{: #ccs-crawler-interface}
{: #recordkeeper}
Use the `com.ibm.es.ama.custom.crawler.CustomCrawler.RecordKeeper` interface to keep records of crawls and to publish crawled documents. The interface has the following methods:

| Method | Description |
|---|---|
| `canContinue` | Returns a Boolean value that indicates whether the crawler can continue. The custom crawler must poll this value periodically and terminate if it returns `false`. |
| `check` | Get metadata fields from the last crawled document |
| `upsert` | Publish a document for further processing |
| `delete` | Delete a document |
{: caption="RecordKeeper methods" caption-side="top"}
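
The requirement to poll `canContinue` can be pictured with the following sketch. `SketchRecordKeeper` is a hypothetical, simplified stand-in; the real `RecordKeeper` methods have different signatures, and the document IDs are plain strings only for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified stand-in for RecordKeeper. The real interface
// has different signatures (see the Javadoc).
interface SketchRecordKeeper {
    boolean canContinue();          // false means the crawl must stop
    void upsert(String documentId); // publish a document
}

final class PollingSketch {
    // Publishes documents one at a time and checks canContinue before each
    // one. Returns the number of documents that were published.
    static int crawl(List<String> documentIds, SketchRecordKeeper keeper) {
        int published = 0;
        for (String id : documentIds) {
            if (!keeper.canContinue()) {
                break; // stop promptly when the framework asks for it
            }
            keeper.upsert(id);
            published++;
        }
        return published;
    }

    // Test double that accepts only a fixed number of documents.
    static final class LimitedKeeper implements SketchRecordKeeper {
        final List<String> published = new ArrayList<>();
        private final int limit;
        LimitedKeeper(int limit) { this.limit = limit; }
        public boolean canContinue() { return published.size() < limit; }
        public void upsert(String documentId) { published.add(documentId); }
    }

    public static void main(String[] args) {
        LimitedKeeper keeper = new LimitedKeeper(2);
        int count = crawl(Arrays.asList("a.txt", "b.txt", "c.txt"), keeper);
        System.out.println(count + " " + keeper.published); // prints 2 [a.txt, b.txt]
    }
}
```

Checking before every document keeps the delay between a stop request and the actual stop short. A connector that processes large batches might check less often.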
{: #ccs-security-interface}
{: #customcrawlersecurityhandler}
Use the `com.ibm.es.ama.custom.crawler.CustomCrawlerSecurityHandler` interface to implement security for your custom crawler. The interface has the following methods:

| Method | Description |
|---|---|
| `term` | Terminate a security handler |
| `getUserAndGroups` | Get the ACLs of a given user |
{: caption="CustomCrawlerSecurityHandler methods" caption-side="top"}

When the `getUserAndGroups` logic of a connector is updated, it can take up to 10 minutes after the connector is redeployed for the change to take effect.
{: important}
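
As a loose illustration of what a user and group lookup involves, the following sketch resolves a user name to a list of principals from an in-memory map. The class, its method signatures, and the return type are all hypothetical. The real `getUserAndGroups` signature is documented in the Javadoc, and a real handler would query the data source instead of a map.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch. It does not implement the real
// CustomCrawlerSecurityHandler interface.
final class UserAndGroupsSketch {
    private final Map<String, List<String>> groupsByUser = new HashMap<>();

    void addUser(String user, String... groups) {
        groupsByUser.put(user, Arrays.asList(groups));
    }

    // Returns the user name followed by the user's groups, or an empty
    // list when the user is unknown.
    List<String> getUserAndGroups(String user) {
        List<String> groups = groupsByUser.get(user);
        if (groups == null) {
            return Collections.emptyList();
        }
        List<String> principals = new ArrayList<>();
        principals.add(user);
        principals.addAll(groups);
        return principals;
    }

    public static void main(String[] args) {
        UserAndGroupsSketch handler = new UserAndGroupsSketch();
        handler.addUser("alice", "staff", "sftp-users");
        System.out.println(handler.getUserAndGroups("alice"));   // prints [alice, staff, sftp-users]
        System.out.println(handler.getUserAndGroups("mallory")); // prints []
    }
}
```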
{: #example-connector}
The example connector is a Secure File Transfer Protocol (SFTP) connector that crawls files that are located on an SFTP server.
{: shortdesc}
The example connector includes three components:
- Java source code for the connector
- An XML definition file that defines the parameters that the connector uses to connect to and crawl the data source
- A properties file that defines optional behaviors for the connector
{: #example-connection-requirements}
The Java source code for the example connector has the following dependencies:
- Java SE Development Kit (JDK) 1.8 or higher.
- The `custom-crawler-docs.zip` file from an installed {{site.data.keyword.discoveryshort}} instance, as described in Downloading the `custom-crawler-docs.zip` file in Discovery 2.2.1 and later and Downloading the `custom-crawler-docs.zip` file in Discovery 2.2.0 and earlier.
- The JSch{: external} Java package, as described in Downloading JSch. You can download the package in ZIP format{: external} or JAR format{: external}.
{: #download-ccs-zip}
In {{site.data.keyword.discoveryshort}} version 2.2.1 and later, perform the following steps to download the `custom-crawler-docs.zip` file to your local machine. You need root access to an installed {{site.data.keyword.discoveryshort}} instance:
1.  Log in to your {{site.data.keyword.discoveryshort}} cluster.

2.  Enter the following command to obtain your `crawler` pod name:

        oc get pods | grep crawler
    {: pre}

    You might see output that looks like this:

        wd-discovery-crawler-57985fc5cf-rxk89 1/1 Running 0 85m
    {: codeblock}

3.  Enter the following command to find the `custom-crawler-docs.zip` file and its build version, replacing `{crawler-pod-name}` with the `crawler` pod name that you obtained in step 2:

        oc exec {crawler-pod-name} -- ls -l /opt/ibm/wex/zing/resources/ | grep custom-crawler-docs
    {: pre}

    You might see output that is similar to the following:

        -rw-r--r--. 1 dadmin dadmin 59451 Jan 19 03:50 custom-crawler-docs-{build-version}.zip
    {: codeblock}

4.  Enter the following command to copy the `custom-crawler-docs.zip` file to the host server, replacing `{build-version}` with the build version number from step 3:

        oc cp {crawler-pod-name}:/opt/ibm/wex/zing/resources/custom-crawler-docs-{build-version}.zip custom-crawler-docs.zip
    {: pre}

5.  Enter the following command to expand the `custom-crawler-docs.zip` file:

        unzip custom-crawler-docs.zip -d custom-crawler-docs-primary
    {: pre}

    If necessary, copy the `custom-crawler-docs.zip` file to the development server.
    {: note}

    If your local machine does not have the `unzip` utility, try using the `gunzip` command instead, or see the documentation of the operating system of your local machine for other alternatives to expand compressed files.
    {: note}

    If you are using a version of {{site.data.keyword.discoveryshort}} that is earlier than 2.1.2 and you want to access the `custom-crawler-docs.zip` file, enter the following command: `scp root@{instance_name}:/root/bob/sdk/custom-crawler-docs.zip {local_directory}`.
    {: tip}

For information about downloading the `custom-crawler-docs.zip` file in {{site.data.keyword.discoveryshort}} 2.2.0 and earlier, see Downloading the `custom-crawler-docs.zip` file in Discovery 2.2.0 and earlier.
{: #download-ccs-zip-old}
In {{site.data.keyword.discoveryshort}} version 2.2.0 and earlier, perform the following steps to download the `custom-crawler-docs.zip` file to your local machine. You need root access to an installed {{site.data.keyword.discoveryshort}} instance:
1.  Obtain the entitlement key by navigating to your container software library{: external}.

2.  Enter the following command to log in to the Docker registry where your {{site.data.keyword.discoveryshort}} images are available. Include your entitlement key in the command:

        docker login cp.icr.io -u cp -p {entitlement_key}
    {: pre}

3.  Enter the following command to pull the `custom-crawler-sdk` image:

        docker pull cp.icr.io/cp/watson-discovery/custom-crawler-sdk:2.1.3
    {: pre}

4.  Enter the following command to run the `custom-crawler-sdk` image:

        docker run cp.icr.io/cp/watson-discovery/custom-crawler-sdk:2.1.3
    {: pre}

5.  Enter the following command to copy `custom-crawler-docs.zip` from the container where the image is running:

        docker cp {container_name}:/crawler/custom-crawler-docs.zip .
    {: pre}

    To find the container name, enter `docker ps -a | grep custom-crawler-sdk`.

6.  Expand the `custom-crawler-docs.zip` file:

        cd {local_directory}
    {: pre}

    where `{local_directory}` is the directory on your local machine to which you downloaded the `custom-crawler-docs.zip` file.

        unzip custom-crawler-docs.zip
    {: pre}

    If your local machine does not have the `unzip` utility, try using the `gunzip` command instead, or see the documentation of the operating system of your local machine for other alternatives to expand compressed files.
    {: note}

    If you are using a version of {{site.data.keyword.discoveryshort}} that is earlier than 2.1.2 and you want to access the `custom-crawler-docs.zip` file, enter the following command: `scp root@{instance_name}:/root/bob/sdk/custom-crawler-docs.zip {local_directory}`.
    {: tip}

For information about downloading the `custom-crawler-docs.zip` file on {{site.data.keyword.discoveryshort}} version 2.2.1 and later, see Downloading the `custom-crawler-docs.zip` file in Discovery 2.2.1 and later.
{: #ccs-grok-crawler-zip-file}
The `custom-crawler-docs.zip` file expands into a directory named `custom-crawler-docs-primary` that includes the following contents:

    custom-crawler-docs-primary
    ├── README.md
    ├── build.gradle
    ├── config
    │   ├── README.md
    │   ├── messages.properties
    │   └── template.xml
    ├── scripts
    │   └── manage_custom_crawler.sh
    ├── settings.gradle
    ├── src
    │   └── main
    │       └── java
    │           └── com
    │               └── ibm
    │                   └── es
    │                       └── ama
    │                           └── custom
    │                               └── crawler
    │                                   └── sample
    │                                       └── sftp
    │                                           ├── SftpCrawler.java
    │                                           └── SftpSecurityHandler.java
    └── wexlib
        ├── META-INF
        │   └── MANIFEST.MF
        ├── README.md
        ├── ama-zing-custom-crawler-{version_numbers}-javadoc.jar
        └── ama-zing-custom-crawler-{version_numbers}.jar

    15 directories, 13 files
{: codeblock}
{: #download-jsch}
JSch is a Java implementation of the Secure Shell version 2 (SSH2) protocol and, by extension, `sftp`. It is derived from the Java Cryptography Extension (JCE){: external}. You can find specifications for SSH2 at www.openssh.com/specs.html{: external}.
The current version of JSch is 0.1.55 and is supported by the example connector.
Download JSch to your development directory (`{local_directory}`). You can download the package in ZIP format{: external} or JAR format{: external}. If you download the package in ZIP format, extract it as described in the previous section.
{: #example-files}
The example custom connector includes the following files, which are built together:
- Java source files that are named `SftpCrawler.java` and `SftpSecurityHandler.java`
- An XML definitions file named `template.xml`
- A properties file named `messages.properties`

You can locate and examine these files by referencing the directory tree listing in Understanding the `custom-crawler-docs.zip` file.
{: #see-jdoc}
For detailed documentation of all of the interfaces and methods that are available in the `com.ibm.es.ama.custom.crawler` package, see the Javadoc, which is available as indicated in Interfaces and Javadoc.