Connectors fail to complete sync #2925
Comments
Congrats on having over 100 connectors at once! I'm wondering if this is related to elastic/kibana#195127, and if Kibana is marking syncs as "error".
Great, thanks! Each connector we configure gets a dedicated container; we don't run multiple connectors in the same container. So I don't think it is related to elastic/kibana#195127. It's quite likely related to the scale-up, but a lot of developments happened in parallel: in the meantime we also moved from ES stack 8.14 to 8.15. We have a script to configure multiple connectors at once, which uses the connector APIs (https://www.elastic.co/guide/en/elasticsearch/reference/current/connector-apis.html). As we speak we have 109 connectors configured; I could try to delete 10 and see if the issue still exists.
Hi @sjors101, Is there any chance you can collect the logs from all of your connector hosts in one place and grep by the failed job id there (in your log file that'd be Connectors should not affect each other, but they seem to do it somehow: as if another service is marking the connector sync job as failed. Could it be that you have services running with identical config, so that they attempt to serve the same connector?
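For anyone wanting to script the "collect and grep" step, here is a minimal sketch. It assumes the logs from every connector container have already been copied into one directory as `*.log` files; the function name `find_job_lines` and the directory layout are made up for illustration and are not part of the connector framework.

```python
from pathlib import Path
from typing import Iterator, Tuple


def find_job_lines(log_dir: str, job_id: str) -> Iterator[Tuple[str, int, str]]:
    """Yield (file name, line number, line) for every log line mentioning job_id.

    Assumption: each container's log was saved as a separate *.log file
    somewhere under log_dir. Adjust the glob pattern to your own layout.
    """
    for path in sorted(Path(log_dir).rglob("*.log")):
        # errors="replace" keeps the scan going if a log contains odd bytes.
        with path.open(encoding="utf-8", errors="replace") as handle:
            for lineno, line in enumerate(handle, start=1):
                if job_id in line:
                    yield path.name, lineno, line.rstrip("\n")
```

Calling `find_job_lines("collected-logs", "<failed job id>")` and checking which file names appear would show whether more than one container ever touched the same sync job.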
This got me thinking, what if you had one service, configured to be responsible for more than 100 connectors all at once. Do we correctly fetch all connectors from Elasticsearch to compare against what's configured in YAML? I don't think we do.
@sjors101 you might be able to test this faster than we can set up an env with 100 connectors. Can you change that hardcoded page size to something like 1000 and see if that fixes things? (Obviously not a good long-term fix, just an investigation step.)
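To make the suspicion concrete, here is a small self-contained sketch of the difference between reading only the first page and paging through all results. It does not use the real Elasticsearch client or the framework's code: `fake_search`, `fetch_first_page` and `fetch_all` are hypothetical stand-ins, and the page size of 100 is an assumption based on the discussion above.

```python
from typing import Callable, Dict, Iterator, List

# A search callable takes (offset, size) and returns {"total": int, "hits": list}.
SearchFn = Callable[[int, int], Dict]


def fake_search(docs: List[str], offset: int, size: int) -> Dict:
    """Stand-in for a from/size search; NOT the real Elasticsearch API."""
    return {"total": len(docs), "hits": docs[offset:offset + size]}


def fetch_first_page(search: SearchFn, page_size: int = 100) -> List[str]:
    """Reads a single page, so anything past page_size is silently dropped."""
    return search(0, page_size)["hits"]


def fetch_all(search: SearchFn, page_size: int = 100) -> Iterator[str]:
    """Keeps requesting pages until every matching document has been seen."""
    offset = 0
    while True:
        page = search(offset, page_size)
        hits = page["hits"]
        yield from hits
        offset += len(hits)
        # Stop on an empty page or once the reported total is reached.
        if not hits or offset >= page["total"]:
            break


connectors = [f"connector-{i}" for i in range(109)]


def search(offset: int, size: int) -> Dict:
    return fake_search(connectors, offset, size)


print(len(fetch_first_page(search)))  # 100
print(len(list(fetch_all(search))))   # 109
```

With 109 connectors, as in this report, the single-page read misses nine of them, which is the kind of gap that raising the page size to 1000 would hide rather than fix.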
I think total is independent: it's just the number of documents matching the query, so it's okay to overwrite it. However, we don't really use PIT here, so modification of the collection while paging can cause weird bugs. On the other hand, if indices are not added or removed, it should not be a problem, and this inconsistency should be very rare.
🤦 you're right, I still think the hardcoded
I checked the logs of all our Elastic nodes but found no log records with the job ID or connector ID. The only log messages I saw during a crash are the following, but I don't think they are relevant:
Hi @sjors101, it seems to be a log from Elasticsearch. We're looking for logs from connector containers :)
Bug Description
We have been using the connector framework for a while now, with over 100 connectors configured. For the past few weeks we have been seeing connector jobs fail with the following error:
connectors.sync_job_runner.ConnectorJobNotRunningError: Connector job (ID: Wsl5npIBp9FxXy_8Hx2C) is not running but in status of JobStatus.ERROR.
We can't really pinpoint the issue: some runs fail after a few seconds, others after 90 minutes, and others finish successfully. It seems related to having more than 100 connectors. We did find a workaround: when we run fewer than about 10 active connector containers on our Kubernetes platform, the issue doesn't occur. This makes me wonder if there is some queue on the Elastic side that is full. We also tried increasing DEFAULT_PAGE_SIZE in connectors/es/index.py, but this did not solve the issue.
To Reproduce
Steps to reproduce the behavior:
Environment
Elasticsearch 8.15
ConnectorFW: 8.15.3.0
Logs / config
Attached are the logs of one connector container and the connector config (I replaced the sensitive records). We don't see any logs in Elasticsearch or Enterprise Search. We notice the same behaviour across different connector_types.
container-config.txt
container-logs.txt