Make Pre-processing options work for PreTrainedVectorizer #307

Open
dafajon opened this issue Dec 30, 2021 · 0 comments · May be fixed by #308
dafajon commented Dec 30, 2021

Currently, get_pretrained_embeddings and get_bert_embeddings operate on the raw text of the document. As a result, the preprocessing settings are never applied to the text passed to the transformer-based vectorizers.

  • Add an ignore_preprocess option to the vectorizer so that the raw text can still be used.
  • Build the input string from the filtered Token objects before passing it to the SentenceTransformer.encode method.
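
A rough sketch of how the two items above could fit together, not the project's actual implementation. The `Token` dataclass, its `is_stopword` / `is_punct` flags, the helper names `build_encoder_input` and `embed`, and the `lowercase` / `drop_stopwords` / `drop_punct` options are all hypothetical stand-ins for whatever the library's preprocessing settings expose. Only `ignore_preprocess` and `SentenceTransformer.encode` come from this issue. The encoder is passed in as a callable so the sketch runs without downloading a model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Token:
    """Hypothetical stand-in for the library's Token object."""
    word: str
    is_stopword: bool = False
    is_punct: bool = False


def build_encoder_input(raw: str, tokens: Sequence[Token], *,
                        ignore_preprocess: bool = False,
                        lowercase: bool = True,
                        drop_stopwords: bool = True,
                        drop_punct: bool = True) -> str:
    """Return the string that should be handed to the sentence encoder."""
    if ignore_preprocess:
        # Opt-out: keep the current behaviour and encode the raw text.
        return raw

    kept: List[str] = []
    for tok in tokens:
        if drop_stopwords and tok.is_stopword:
            continue
        if drop_punct and tok.is_punct:
            continue
        kept.append(tok.word.lower() if lowercase else tok.word)
    # Re-join the surviving tokens into one whitespace-separated string.
    return " ".join(kept)


def embed(docs: Sequence[Tuple[str, Sequence[Token]]],
          encode: Callable[[List[str]], object],
          **options):
    """Preprocess each (raw_text, tokens) pair, then encode the batch.

    `encode` is any callable that accepts a list of strings; the bound
    `encode` method of a SentenceTransformer instance is one such callable.
    """
    texts = [build_encoder_input(raw, toks, **options) for raw, toks in docs]
    return encode(texts)
```

With sentence-transformers installed, the callable could be the bound method of a loaded model, e.g. `embed(docs, model.encode)`, where `model` is a `SentenceTransformer` instance. Defaulting `ignore_preprocess` to `False` is a design assumption here. It makes the preprocessing settings take effect unless the caller explicitly asks for raw text.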