Update docker ml #35081

Merged
57 commits from jl-update-docker-ml merged into master on Jul 11, 2024
Changes from 42 commits
Commits
57 commits
d9a5ea3
updated docker
jlevypaloalto Jun 26, 2024
ba9d109
added the rest
jlevypaloalto Jun 27, 2024
985f2e9
Merge branch 'master' into jl-update-docker-ml
jlevypaloalto Jun 28, 2024
2c16da9
Merge branch 'master' into jl-update-docker-ml
jlevypaloalto Jul 1, 2024
cdec46a
devdemisto/ml:1.0.0.100486
jlevypaloalto Jul 2, 2024
50eecda
Merge branch 'master' into jl-update-docker-ml
jlevypaloalto Jul 2, 2024
0f5fe63
fix tpb
jlevypaloalto Jul 2, 2024
a7b8708
Merge branch 'jl-update-docker-ml' of github.com:demisto/content into…
jlevypaloalto Jul 2, 2024
6a40af5
Merge branch 'master' into jl-update-docker-ml
jlevypaloalto Jul 3, 2024
103ce3d
return on no incidents
jlevypaloalto Jul 3, 2024
6cc1c62
remove runonce
jlevypaloalto Jul 3, 2024
2ac96b4
remove space
jlevypaloalto Jul 3, 2024
a4ab8b7
fixed
jlevypaloalto Jul 3, 2024
bee2995
Merge branch 'master' into jl-update-docker-ml
jlevypaloalto Jul 3, 2024
8b03c97
Merge branch 'master' into jl-update-docker-ml
jlevypaloalto Jul 4, 2024
6d0354c
fix create incidents script
jlevypaloalto Jul 4, 2024
f12dc89
new docker
jlevypaloalto Jul 4, 2024
687cd8b
Merge branch 'master' into jl-update-docker-ml
jlevypaloalto Jul 4, 2024
4ea1b9a
revert: fix create incidents script
jlevypaloalto Jul 4, 2024
be0dba3
Merge branch 'jl-update-docker-ml' of github.com:demisto/content into…
jlevypaloalto Jul 5, 2024
ea383a5
add outputs to DBotFindSimilarIncidents
jlevypaloalto Jul 5, 2024
036d9f6
new tpb DBotFindSimilarIncidents-test
jlevypaloalto Jul 5, 2024
df38b60
Merge branch 'master' into jl-update-docker-ml
jlevypaloalto Jul 5, 2024
3d0c6c4
new docker
jlevypaloalto Jul 7, 2024
ec58fe6
Merge branch 'master' into jl-update-docker-ml
jlevypaloalto Jul 7, 2024
1fb8d9e
bump transformers
jlevypaloalto Jul 7, 2024
5db5ab0
Merge branch 'master' into jl-update-docker-ml
jlevypaloalto Jul 7, 2024
bde59b7
Empty-Commit
jlevypaloalto Jul 7, 2024
67207b9
Merge branch 'jl-update-docker-ml' of github.com:demisto/content into…
jlevypaloalto Jul 7, 2024
9f04953
fix conf.json
jlevypaloalto Jul 7, 2024
5fe4534
more fixes
jlevypaloalto Jul 7, 2024
9132ff0
Merge branch 'master' into jl-update-docker-ml
jlevypaloalto Jul 8, 2024
c0aae9e
more fixes
jlevypaloalto Jul 8, 2024
1af391c
Merge branch 'jl-update-docker-ml' of github.com:demisto/content into…
jlevypaloalto Jul 8, 2024
31ee063
new docker
jlevypaloalto Jul 8, 2024
9bec145
RN
jlevypaloalto Jul 8, 2024
acf5aa7
new docker
jlevypaloalto Jul 8, 2024
1220212
revert dockers
jlevypaloalto Jul 8, 2024
a5584e1
more stuff
jlevypaloalto Jul 8, 2024
60c90d6
redirect stderr
jlevypaloalto Jul 9, 2024
5564338
docker
jlevypaloalto Jul 9, 2024
a347300
format
jlevypaloalto Jul 9, 2024
ba1d862
format
jlevypaloalto Jul 9, 2024
dff9f4f
merge master
jlevypaloalto Jul 9, 2024
556c5b3
RN
jlevypaloalto Jul 9, 2024
a0abaa8
more stuff
jlevypaloalto Jul 9, 2024
c7498f7
build fixes
jlevypaloalto Jul 9, 2024
df9038b
build fixes
jlevypaloalto Jul 9, 2024
6731a4e
fix unit-tests
jlevypaloalto Jul 9, 2024
5defa87
more docker changes
jlevypaloalto Jul 9, 2024
b5cc9dd
more docker changes
jlevypaloalto Jul 9, 2024
4ea2ccc
build fixes
jlevypaloalto Jul 9, 2024
5a44061
Merge branch 'master' into jl-update-docker-ml
jlevypaloalto Jul 10, 2024
c0e14c3
suppress logger
jlevypaloalto Jul 10, 2024
bce8abe
build fixes
jlevypaloalto Jul 10, 2024
3d2238a
build fixes
jlevypaloalto Jul 10, 2024
0d1330f
Merge branch 'master' into jl-update-docker-ml
jlevypaloalto Jul 11, 2024
30 changes: 30 additions & 0 deletions Packs/Base/ReleaseNotes/1_34_27.md
@@ -0,0 +1,30 @@

#### Scripts

##### DBotFindSimilarIncidents

- Updated the Docker image to: *demisto/ml:1.0.0.101889*.

##### DBotPredictPhishingWords

- Updated the Docker image to: *demisto/ml:1.0.0.101889*.

##### DBotFindSimilarIncidentsByIndicators

- Updated the Docker image to: *demisto/ml:1.0.0.101889*.

##### DBotBuildPhishingClassifier

- Updated the Docker image to: *demisto/ml:1.0.0.101889*.

##### DBotPreProcessTextData

- Updated the Docker image to: *demisto/ml:1.0.0.101889*.

##### DBotTrainTextClassifierV2

- Updated the Docker image to: *demisto/ml:1.0.0.101889*.

##### GetMLModelEvaluation

- Updated the Docker image to: *demisto/ml:1.0.0.101889*.
@@ -1,19 +1,12 @@
from CommonServerPython import *
import base64
import copy
import gc

from CommonServerPython import *

PREFIXES_TO_REMOVE = ['incident.']
ALL_LABELS = "*"


def preprocess_incidents_field(incidents_field):
incidents_field = incidents_field.strip()
for prefix in PREFIXES_TO_REMOVE:
if incidents_field.startswith(prefix):
incidents_field = incidents_field[len(prefix):]
return incidents_field
return incidents_field.strip().removeprefix('incident.')


def get_phishing_map_labels(comma_values):
@@ -28,7 +21,7 @@ def get_phishing_map_labels(comma_values):
labels_dict[splited[0].strip()] = splited[1].strip()
else:
labels_dict[v] = v
return {k: v for k, v in labels_dict.items()}
return dict(labels_dict.items())


def build_query_in_reepect_to_phishing_labels(args):
@@ -38,17 +31,17 @@ def build_query_in_reepect_to_phishing_labels(args):
return args
mapping_dict = get_phishing_map_labels(mapping)
tag_field = args['tagField']
tags_union = ' '.join(['"{}"'.format(label) for label in mapping_dict])
mapping_query = '{}:({})'.format(tag_field, tags_union)
tags_union = ' '.join([f'"{label}"' for label in mapping_dict])
mapping_query = f'{tag_field}:({tags_union})'
if 'query' not in args or args['query'].strip() == '':
args['query'] = mapping_query
else:
args['query'] = '({}) and ({})'.format(query, mapping_query)
args['query'] = f'({query}) and ({mapping_query})'
return args


def get_incidents(d_args):
get_incidents_by_query_args = copy.deepcopy(d_args)
get_incidents_by_query_args = d_args.copy()
get_incidents_by_query_args['NonEmptyFields'] = d_args['tagField']
fields_names_to_populate = ['tagField', 'emailsubject', 'emailbody', "emailbodyhtml"]
fields_to_populate = [get_incidents_by_query_args.get(x, None) for x in fields_names_to_populate]
@@ -63,15 +56,15 @@ def get_incidents(d_args):


def preprocess_incidents(incidents, d_args):
text_pre_process_args = copy.deepcopy(d_args)
text_pre_process_args = d_args.copy()
text_pre_process_args['inputType'] = 'json_b64_string'
text_pre_process_args['input'] = base64.b64encode(incidents.encode('utf-8')).decode('ascii')
text_pre_process_args['preProcessType'] = 'nlp'
email_body_fields = [text_pre_process_args.get("emailbody"), text_pre_process_args.get("emailbodyhtml")]
email_body = "|".join([x for x in email_body_fields if x])
text_pre_process_args['textFields'] = "%s,%s" % (text_pre_process_args['emailsubject'], email_body)
text_pre_process_args['whitelistFields'] = "{0},{1}".format('dbot_processed_text',
text_pre_process_args['tagField'])
text_pre_process_args['textFields'] = "{},{}".format(text_pre_process_args['emailsubject'], email_body)
text_pre_process_args['whitelistFields'] = "{},{}".format('dbot_processed_text',
text_pre_process_args['tagField'])
res = demisto.executeCommand("DBotPreProcessTextData", text_pre_process_args)
if is_error(res):
return_error(get_error(res))
@@ -81,7 +74,7 @@ def preprocess_incidents(incidents, d_args):


def train_model(processed_text_data, d_args):
train_model_args = copy.deepcopy(d_args)
train_model_args = d_args.copy()
train_model_args['inputType'] = 'json_b64_string'
train_model_args['input'] = base64.b64encode(processed_text_data.encode('utf-8')).decode('ascii')
train_model_args['overrideExistingModel'] = 'true'
@@ -90,7 +83,7 @@ def train_model(processed_text_data, d_args):


def main():
d_args = dict(demisto.args())
d_args = demisto.args()
for arg in ['tagField', 'emailbody', 'emailbodyhtml', 'emailsubject', 'timeField']:
d_args[arg] = preprocess_incidents_field(d_args.get(arg, ''))

@@ -4,9 +4,9 @@ args:
- defaultValue: Phishing
description: A comma-separated list of incident types by which to filter.
name: incidentTypes
- description: 'The start date by which to filter incidents. Date format will be the same as in the incidents query page (valid strings example: "3 days ago", ""2019-01-01T00:00:00 +0200")'
- description: 'The start date by which to filter incidents. Date format will be the same as in the incidents query page (valid strings example: "3 days ago", ""2019-01-01T00:00:00 +0200").'
name: fromDate
- description: 'The end date by which to filter incidents. Date format will be the same as in the incidents query page (valid strings example: "3 days ago", ""2019-01-01T00:00:00 +0200")'
- description: 'The end date by which to filter incidents. Date format will be the same as in the incidents query page (valid strings example: "3 days ago", ""2019-01-01T00:00:00 +0200").'
name: toDate
- defaultValue: '3000'
description: The maximum number of incidents to fetch.
@@ -39,7 +39,7 @@ args:
- description: The model name to store in the system.
name: modelName
- defaultValue: '*'
description: 'A comma-separated list of email tags values and mapping. The script considers only the tags specified in this field. You can map a label to another value by using this format: LABEL:MAPPED_LABEL. For example, for 4 values in email tag: malicious, credentials harvesting, inner communitcation, external legit email, unclassified. While training, we want to ignore "unclassified" tag, and refer to "credentials harvesting" as "malicious" too. Also, we want to merge "inner communitcation" and "external legit email" to one tag called "non-malicious". The input will be: malicious, credentials harvesting:malicious, inner communitcation:non-malicious, external legit email:non-malicious'
description: 'A comma-separated list of email tags values and mapping. The script considers only the tags specified in this field. You can map a label to another value by using this format: LABEL:MAPPED_LABEL. For example, for 4 values in email tag: malicious, credentials harvesting, inner communitcation, external legit email, unclassified. While training, we want to ignore "unclassified" tag, and refer to "credentials harvesting" as "malicious" too. Also, we want to merge "inner communitcation" and "external legit email" to one tag called "non-malicious". The input will be: malicious, credentials harvesting:malicious, inner communitcation:non-malicious, external legit email:non-malicious.'
name: phishingLabels
- defaultValue: emailsubject
description: Incident field name with the email subject.
@@ -83,8 +83,7 @@ tags:
- ml
timeout: 12m0s
type: python
dockerimage: demisto/ml:1.0.0.45981
runonce: true
dockerimage: demisto/python3:3.10.14.101217
tests:
- Create Phishing Classifier V2 ML Test
- DBotCreatePhishingClassifierV2FromFile-Test
@@ -13,7 +13,8 @@ def test_no_mapping_no_query():
def test_no_mapping_with_query():
args = {'phishingLabels': '*', 'query': QUERY}
args = build_query_in_reepect_to_phishing_labels(args)
assert 'query' in args and args['query'] == QUERY
assert 'query' in args
assert args['query'] == QUERY


def test_mapping_no_query():
@@ -27,6 +28,6 @@ def test_mapping_with_query():
args = {'phishingLabels': MAPPING, 'tagField': 'closeReason', 'query': QUERY}
args = build_query_in_reepect_to_phishing_labels(args)
assert 'query' in args
opt1 = args['query'] == '({}) and (closeReason:("spam" "legit"))'.format(QUERY)
opt2 = args['query'] == '({}) and (closeReason:("legit" "spam"))'.format(QUERY)
opt1 = args['query'] == f'({QUERY}) and (closeReason:("spam" "legit"))'
opt2 = args['query'] == f'({QUERY}) and (closeReason:("legit" "spam"))'
assert opt1 or opt2
@@ -86,9 +86,27 @@ script: '-'
subtype: python3
timeout: '0'
type: python
dockerimage: demisto/ml:1.0.0.94241
dockerimage: demisto/ml:1.0.0.101889
runas: DBotWeakRole
runonce: true
tests:
- No tests (auto formatted)
- DBotFindSimilarIncidents-test
fromversion: 5.0.0
outputs:
- contextPath: DBotFindSimilarIncidents.isSimilarIncidentFound
description: Indicates whether similar incidents have been found.
type: boolean
- contextPath: DBotFindSimilarIncidents.similarIncident.created
description: The creation date of the linked incident.
type: date
- contextPath: DBotFindSimilarIncidents.similarIncident.id
description: The ID of the linked incident.
type: string
- contextPath: DBotFindSimilarIncidents.similarIncident.name
description: The name of the linked incident.
type: string
- contextPath: DBotFindSimilarIncidents.similarIncident.similarity incident
description: The similarity of the linked incident represented as a float in the range 0-1.
type: number
- contextPath: DBotFindSimilarIncidents.similarIncident.details
description: The details of the linked incident.
type: string
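Note (commentary, not part of the diff above): the new `outputs` section documents the context paths `DBotFindSimilarIncidents` now populates. A hedged sketch of how another automation might read them, assuming the standard XSOAR script runtime where `demisto` is available and that `similarIncident` may come back as either a single dict or a list:

```python
# Illustration only -- not part of this PR's diff.
ctx = demisto.context()  # standard XSOAR script API
results = ctx.get('DBotFindSimilarIncidents') or {}

if results.get('isSimilarIncidentFound'):
    similar = results.get('similarIncident') or []
    if isinstance(similar, dict):  # a single match may be returned as a dict
        similar = [similar]
    for inc in similar:
        demisto.debug(f"{inc.get('id')} {inc.get('name')}: "
                      f"similarity={inc.get('similarity incident')}")
```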
@@ -42,7 +42,7 @@ script: '-'
subtype: python3
timeout: '0'
type: python
dockerimage: demisto/ml:1.0.0.88591
dockerimage: demisto/ml:1.0.0.101889
runas: DBotWeakRole
tests:
- DBotFindSimilarIncidentsByIndicators - Test
@@ -4,17 +4,62 @@
from string import punctuation
import demisto_ml
import numpy as np
import tempfile

FASTTEXT_MODEL_TYPE = 'FASTTEXT_MODEL_TYPE'
TORCH_TYPE = 'torch'
UNKNOWN_MODEL_TYPE = 'UNKNOWN_MODEL_TYPE'
BERT_TOKENIZER_ERROR = "The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. \nThe tokenizer class you load from this checkpoint is 'BertTokenizer'. \nThe class this function is called from is 'DistilBertTokenizer'.\n"


class StderrRedirect:
'''Context manager to redirect stderr.'''
temp_stderr: Any
old_stderr: int
error: str

def __enter__(self):
demisto.debug('entering StderrRedirect')
self.temp_stderr = tempfile.TemporaryFile()
self.old_stderr = os.dup(sys.stderr.fileno()) # make a copy of stderr
os.dup2(self.temp_stderr.fileno(), sys.stderr.fileno()) # redirect stderr to the temporary file
return self

def __exit__(self, exc_type, exc_value, exc_traceback):
demisto.debug(f'exiting StderrRedirect: {exc_type=}, {exc_value=}, {exc_traceback=}')
self.temp_stderr.seek(0)
self.error = self.temp_stderr.read().decode()
demisto.debug(f'stderr: {self.error}')
os.dup2(self.old_stderr, sys.stderr.fileno()) # restore stderr
os.close(self.old_stderr)
self.temp_stderr.close()


def OrderedSet(iterable):
return list(dict.fromkeys(iterable))

def new_get_model_data(model_name, store_type):
if store_type == "mlModel":
res_model = demisto.executeCommand("getMLModel", {"modelName": model_name})
if is_error(res_model):
return_error(get_error(res_model))
model_data = res_model[0]['Contents']['modelData']
model_type = res_model[0]['Contents']['model']["type"]["type"]
return model_data, model_type
if store_type == "list":
res_model_list = demisto.executeCommand("getList", {"listName": model_name})
if is_error(res_model_list):
return_error(get_error(res_model_list))
return res_model_list[0]["Contents"], UNKNOWN_MODEL_TYPE
return None



def get_model_data(model_name, store_type, is_return_error):
try:
return new_get_model_data(model_name, store_type)
except Exception as e:
demisto.debug(f'new_get_model_data() failed: {e}, {e.args}')
res_model_list = demisto.executeCommand("getList", {"listName": model_name})[0]
res_model = demisto.executeCommand("getMLModel", {"modelName": model_name})[0]
if is_error(res_model_list) and not is_error(res_model):
@@ -35,6 +80,7 @@ def get_model_data(model_name, store_type, is_return_error):
return model_data, model_type
else:
handle_error("error reading model %s from Demisto" % model_name, is_return_error)
return None


def handle_error(message, is_return_error):
@@ -88,6 +134,7 @@ def preprocess_text(text, model_type, is_return_error):
else:
words_to_token_maps = tokenized_text_result['originalWordsToTokens']
return input_text, words_to_token_maps
return None


def predict_phishing_words(model_name, model_store_type, email_subject, email_body, min_text_length, label_threshold,
@@ -97,7 +144,12 @@ def predict_phishing_words(model_name, model_store_type, email_subject, email_bo
model_type = FASTTEXT_MODEL_TYPE
if model_type not in [FASTTEXT_MODEL_TYPE, TORCH_TYPE, UNKNOWN_MODEL_TYPE]:
model_type = UNKNOWN_MODEL_TYPE
phishing_model = demisto_ml.phishing_model_loads_handler(model_data, model_type)

with StderrRedirect() as s:
phishing_model = demisto_ml.phishing_model_loads_handler(model_data, model_type)
if s.error != BERT_TOKENIZER_ERROR:
raise DemistoException(s.error)

is_model_applied_on_a_single_incidents = isinstance(email_subject, str) and isinstance(email_body, str)
if is_model_applied_on_a_single_incidents:
return predict_single_incident_full_output(email_subject, email_body, is_return_error, label_threshold,
@@ -110,7 +162,7 @@ def predict_phishing_words(model_name, model_store_type, email_subject, email_bo


def predict_batch_incidents_light_output(email_subject, email_body, phishing_model, model_type, min_text_length):
text_list = [{'text': "%s \n%s" % (subject, body)} for subject, body in zip(email_subject, email_body)]
text_list = [{'text': f"{subject} \n{body}"} for subject, body in zip(email_subject, email_body)]
preprocessed_text_list = preprocess_text(text_list, model_type, is_return_error=False)
batch_predictions = []
for input_text in preprocessed_text_list:
@@ -132,14 +184,14 @@ def predict_batch_incidents_light_output(email_subject, email_body, phishing_mod
'Type': entryTypes['note'],
'Contents': batch_predictions,
'ContentsFormat': formats['json'],
'HumanReadable': 'Applied predictions on {} incidents.'.format(len(batch_predictions)),
'HumanReadable': f'Applied predictions on {len(batch_predictions)} incidents.',
}


def predict_single_incident_full_output(email_subject, email_body, is_return_error, label_threshold, min_text_length,
model_type, phishing_model, set_incidents_fields, top_word_limit,
word_threshold):
text = "%s \n%s" % (email_subject, email_body)
text = f"{email_subject} \n{email_body}"
input_text, words_to_token_maps = preprocess_text(text, model_type, is_return_error)
filtered_text, filtered_text_number_of_words = phishing_model.filter_model_words(input_text)
if filtered_text_number_of_words == 0:
@@ -163,22 +215,22 @@ def predict_single_incident_full_output(email_subject, email_body, is_return_err
negative_tokens = OrderedSet(explain_result['NegativeWords'])
positive_words = find_words_contain_tokens(positive_tokens, words_to_token_maps)
negative_words = find_words_contain_tokens(negative_tokens, words_to_token_maps)
positive_words = list(OrderedSet([s.strip(punctuation) for s in positive_words]))
negative_words = list(OrderedSet([s.strip(punctuation) for s in negative_words]))
positive_words = OrderedSet([s.strip(punctuation) for s in positive_words])
negative_words = OrderedSet([s.strip(punctuation) for s in negative_words])
positive_words = [w for w in positive_words if w.isalnum()]
negative_words = [w for w in negative_words if w.isalnum()]
highlighted_text_markdown = text.strip()
for word in positive_words:
for cased_word in [word.lower(), word.title(), word.upper()]:
highlighted_text_markdown = re.sub(r'(?<!\w)({})(?!\w)'.format(cased_word), '**{}**'.format(cased_word),
highlighted_text_markdown = re.sub(fr'(?<!\w)({cased_word})(?!\w)', f'**{cased_word}**',
highlighted_text_markdown)
highlighted_text_markdown = re.sub(r'\n+', '\n', highlighted_text_markdown)
explain_result['PositiveWords'] = [w.lower() for w in positive_words]
explain_result['NegativeWords'] = [w.lower() for w in negative_words]
explain_result['OriginalText'] = text.strip()
explain_result['TextTokensHighlighted'] = highlighted_text_markdown
predicted_label = explain_result["Label"]
explain_result_hr = dict()
explain_result_hr = {}
explain_result_hr['TextTokensHighlighted'] = highlighted_text_markdown
explain_result_hr['Label'] = predicted_label
explain_result_hr['Probability'] = "%.2f" % predicted_prob
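Note (commentary, not part of the diff above): the `StderrRedirect` context manager added in this file captures everything written to the process-level stderr file descriptor, so the known BERT tokenizer warning can be swallowed while any other message is raised. A minimal, self-contained sketch of the same `os.dup`/`os.dup2` pattern (the class name and the sample write below are hypothetical):

```python
# Illustration only -- not part of this PR's diff.
import os
import sys
import tempfile


class CaptureStderr:
    """Minimal stand-in for StderrRedirect, without the demisto.debug calls."""

    def __enter__(self):
        self.temp = tempfile.TemporaryFile()
        self.saved_fd = os.dup(sys.stderr.fileno())        # keep a copy of fd 2
        os.dup2(self.temp.fileno(), sys.stderr.fileno())   # point fd 2 at the temp file
        return self

    def __exit__(self, exc_type, exc_value, exc_traceback):
        sys.stderr.flush()
        self.temp.seek(0)
        self.error = self.temp.read().decode()
        os.dup2(self.saved_fd, sys.stderr.fileno())        # restore fd 2
        os.close(self.saved_fd)
        self.temp.close()


with CaptureStderr() as s:
    print("tokenizer warning", file=sys.stderr)

assert "tokenizer warning" in s.error
```

Working at the file-descriptor level, rather than reassigning `sys.stderr`, also captures output written directly to fd 2 by native code.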
@@ -98,8 +98,7 @@ tags:
- phishing
timeout: 60m0s
type: python
dockerimage: demisto/ml:1.0.0.32340
runonce: true
dockerimage: demisto/ml:1.0.0.101889
tests:
- Create Phishing Classifier V2 ML Test
fromversion: 5.0.0