I noticed this bit of code is probably not performing as intended:
def fix_deid_tokens(text, processed_text):
    deid_regex = r"\[\*\*.{0,15}.*?\*\*\]"
    indexes = [m.span() for m in re.finditer(deid_regex, text, flags=re.IGNORECASE)]
Take for example this string with two de-ID'd portions:
[**9-17**] constant pressure/pain over L chest. MD [**First Name (Titles) 14**]
The regex matches the whole string, rather than matching the two de-ID'd tokens separately--and therefore merges the whole string as one token in the spaCy Document.
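The problem is that the .{0,15} is greedy (it has no trailing ?), so it consumes the first **] whenever the first tag is short enough and there is a later **] that the pattern can match. In the example above, it swallows the 15 characters "9-17**] constan", and the lazy .*? then has to extend all the way to the final **]. One fix for this would be to add a ?, like so: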
deid_regex = r"\[\*\*.{0,15}?.*?\*\*\]"
But it seems that dropping the .{0,15} altogether makes more sense, since the .*? follows it...
deid_regex = r"\[\*\*.*?\*\*\]"
though there may be a reason for that block that I'm unaware of. Either way, the current pattern is not working entirely as intended.
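To make the difference concrete, here is a minimal sketch that runs the three patterns against the example string. The helper name `find_deid_tokens` and the dictionary keys are hypothetical and used only for illustration; they are not part of the repository's code.

```python
import re

# Example string from above, containing two de-ID'd portions.
text = "[**9-17**] constant pressure/pain over L chest. MD [**First Name (Titles) 14**]"

# The keys are illustrative labels, not names used in the repository.
patterns = {
    "original": r"\[\*\*.{0,15}.*?\*\*\]",          # greedy .{0,15}
    "lazy_quantifier": r"\[\*\*.{0,15}?.*?\*\*\]",  # lazy .{0,15}?
    "simplified": r"\[\*\*.*?\*\*\]",               # .{0,15} dropped
}


def find_deid_tokens(pattern, text):
    """Hypothetical helper: return every substring matched by `pattern`."""
    return [m.group() for m in re.finditer(pattern, text, flags=re.IGNORECASE)]


results = {name: find_deid_tokens(p, text) for name, p in patterns.items()}

for name, matches in results.items():
    print(f"{name}: {len(matches)} match(es): {matches}")
```

On this string the original pattern returns a single match spanning the whole line, while both proposed patterns return the two tags separately. Since a lazy .{0,15}? followed by .*? can match exactly what .*? alone can, the two fixes should behave the same, which is why dropping the block looks like the simpler option.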
Working on getting the pipeline to work with modern Pandas and spaCy... so far pretty good... will submit a PR if I can get it to work, but this abuts a block that needs changes and affects the original code. Can submit a separate PR just for this, once I'm confident the fix is the right one.