Adding list of actual "registered" domains in result data #75
Currently, when analyzing an email with eml_parser, the domains appearing in the body of the email are given in the output, like this:
However, a possible issue here is that the full domains are listed, including the subdomain part. This can make identifying entities and actors complicated when the domain table contains many subdomains.
This commit takes the opportunity to use publicsuffixlist (already used in eml_parser) to add a table named domain_registered to the data returned by an eml_parser analysis.
The domains in domain_registered are the true registered domains, i.e. the "closest" domains to the TLD. Thanks to publicsuffixlist, public suffixes like co.uk or co.jp are taken into account.
Now, the output looks like this:
Do not hesitate to suggest any improvements, especially regarding the name of the table.
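For reviewers unfamiliar with the idea, here is a minimal sketch of how full domains collapse into registered domains. It is not the code from this commit: it uses a tiny hard-coded suffix set (TOY_SUFFIXES) instead of the real Public Suffix List, and the helper names registered_domain and registered_domains are hypothetical, chosen only for illustration. In eml_parser itself the lookup is delegated to publicsuffixlist.

```python
# Toy suffix set for illustration only (an assumption of this sketch);
# the real Public Suffix List is far larger and has wildcard rules.
TOY_SUFFIXES = {"com", "uk", "co.uk", "jp", "co.jp"}


def registered_domain(domain, suffixes=TOY_SUFFIXES):
    """Return the longest matching public suffix plus one label, or None."""
    labels = domain.lower().rstrip(".").split(".")
    # Try candidates from longest to shortest so that co.uk wins over uk.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            # i == 0 means the whole name is itself a public suffix,
            # so there is no registered domain to report.
            return ".".join(labels[i - 1:]) if i > 0 else None
    # Suffix not in the toy set: this sketch simply gives up.
    return None


def registered_domains(domains, suffixes=TOY_SUFFIXES):
    """Collapse full domains into a sorted list of unique registered domains."""
    found = {registered_domain(d, suffixes) for d in domains}
    found.discard(None)
    return sorted(found)


# Hypothetical domains as they might appear in the existing domain table.
body_domains = [
    "mail.example.co.uk",
    "www.example.co.uk",
    "cdn.static.example.com",
    "co.jp",
]
print(registered_domains(body_domains))  # ['example.co.uk', 'example.com']
```

The two example.co.uk subdomains collapse into a single entry, which is the effect domain_registered is meant to have on a crowded domain table.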