Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Datajson v1.0 #12102

Closed
wants to merge 8 commits into from
Closed

Datajson v1.0 #12102

wants to merge 8 commits into from

Conversation

regit
Copy link
Contributor

@regit regit commented Nov 7, 2024

Indicator of Compromises (IOCs) are a key element in Security Operating Center. Dataset
have been a huge step in getting alert on IOCs from Suricata. But produced alerts are
lacking contextualization. For example, if a IOC management software has a list of host
names, they will be linked to different threat actors but at ingestion in Suricata they will be
a simple list of strings. This list will be used via a rule like

alert tls any any -> any any (msg:"IOC hostname on TLS"; tls.sni; dataset:isset,hostname.lst,...; sid:1664;)

With this an alert will have a subject without information and a mapping will have to be done
at posteriori to see which IOC has hit. The pseudo algorithm to run is:

  • Intercept the signature 1664
  • Extract the tls.sni
  • Check the tls.sni value in the IOC management software

This works but it is not optimal as correlation and external processing has to be done for all the matches.

To fix this issue, we need to be able to ingest the IOCs without loosing the contextual
information contained in the IOC management software.

Datajson is a proposed implementation that addresses this issue. Instead of injecting into
Suricata the value list we can attach to each value a JSON object that will end up into
the alert output.

The following example is alert on source and destination IP in dataset:

alert tls $HOME_NET any -> any any (msg:"Test dataset";
                              ip.src; datajson:isset,ip4list,type ipv6,load ip4-json.lst,key inventory; \
                              ip.dst; datajson:isset,actors_ip,type ipv6, load bad-json.lst, key bad_actors; sid:1;)

In ip4-json.lst, we have data from inventory:

10.7.5.5,{"user":"vjulien","rank": 1}

In bad-json.lst, we have data from the IOC management software:

185.117.73.76,["Bad Panda"]
144.217.50.240,["killer bear","LSD kitten"]

The result is an alert section that looks like:

  "alert": {
    "action": "allowed",
    "gid": 1,
    "signature_id": 1,
    "rev": 0,
    "signature": "Test dataset",
    "category": "",
    "severity": 3,
    "extra": {
      "inventory": {
        "user": "vjulien",
        "rank": 1
      },
      "bad_actors": [
        "Killer Bear",
        "LSD kitten"
      ]
    }
  },

Contribution style:

Our Contribution agreements:

Changes (if applicable):

Link to ticket: https://redmine.openinfosecfoundation.org/issues/7372

Describe changes:

  • Implement datajson keywords for all dataset types
  • Add documentation
  • Add unix socket commands

SV_REPO=https://github.com/regit/suricata-verify/
SV_BRANCH=datajson-v1.0

This patch introduces a new keyword datajson that is similar
to dataset with a twist. Where dataset allows match from sets,
datajson allows the same but also adds JSON data to the alert
event. This data is comint from the set definition it self.
For example, an ipv4 set will look like:

  10.16.1.11,{"test": "success","context":3}

The syntax is value and json data separated by a comma.

The syntax of the keyword is the following:

  datajson:isset,src_ip,type ip,load src.lst,key src_ip;

Compare to dataset, it just have a supplementary option key
that is used to indicate in which subobject the JSON value
should be added.

The information is added in the even under the alert.extra
subobject:

  "alert": {
    "extra": {
      "src_ip": {
        "test": "success",
        "context": 3
      },

The main interest of the feature is to be able to contextualize
a match. For example, if you have an IOC source, you can do

 value1,{"actor":"APT28","Country":"FR"}
 value2,{"actor":"APT32","Country":"NL"}

This way, a single dataset is able to produce context to the
event where it was not possible before and multiple signatures
had to be used.

Ticket: OISF#7372
Previous code was using an array and introducing a limit in the
number of datajson keywords that can be used in a signature.

This patch uses a linked list instead to overcome the limit. By
using a first element of the list that is part of the structure
we limit the cost of the feature to a structure member added to
PacketAlert structure. Only the PacketAlertFree function is
impacted as we need to iterate to find potential allocation.

Ticket: OISF#7372
It was not handling correctly the json values with space as they
were seen as multiple arguments.

Ticket: OISF#7372
Copy link

codecov bot commented Nov 7, 2024

Codecov Report

Attention: Patch coverage is 3.28685% with 971 lines in your changes missing coverage. Please review.

Project coverage is 67.43%. Comparing base (278dc24) to head (6061d01).

❗ There is a different number of reports uploaded between BASE (278dc24) and HEAD (6061d01). Click for more details.

HEAD has 2 uploads less than BASE
Flag BASE (278dc24) HEAD (6061d01)
suricata-verify 1 0
unittests 1 0
Additional details and impacted files
@@             Coverage Diff             @@
##           master   #12102       +/-   ##
===========================================
- Coverage   83.23%   67.43%   -15.81%     
===========================================
  Files         906      846       -60     
  Lines      257647   156207   -101440     
===========================================
- Hits       214458   105332   -109126     
- Misses      43189    50875     +7686     
Flag Coverage Δ
fuzzcorpus 60.88% <2.98%> (-0.33%) ⬇️
livemode 19.32% <2.88%> (-0.11%) ⬇️
pcap 44.16% <2.68%> (-0.27%) ⬇️
suricata-verify ?
unittests ?

Flags with carried forward coverage won't be shown. Click here to find out more.

@suricata-qa
Copy link

Information:

ERROR: QA failed on SURI_TLPW2_autofp_suri_time.

field baseline test %
SURI_TLPR1_stats_chk
.uptime 636 676 106.29%

Pipeline 23289

Copy link
Member

@victorjulien victorjulien left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor comments.

Bigger issue: I like this work, but I would like to understand why this needs a new set type and keyword type? Can we not overload the existing dataset/datarep facilities?

src/datasets-json.h Show resolved Hide resolved
src/datasets.c Show resolved Hide resolved
src/datasets.c Show resolved Hide resolved
src/datasets.c Show resolved Hide resolved
if (set == NULL)
return rrep;

if (data_len != 16 && data_len != 4)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how would we get here with data_len == 4?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetDataDst and GetDataSrc in detect-ipaddr.c are setting a buffer length of 4 or 16 depending of IP type:

static InspectionBuffer *GetDataDst(DetectEngineThreadCtx *det_ctx,
        const DetectEngineTransforms *transforms, Packet *p, const int list_id)
{
    InspectionBuffer *buffer = InspectionBufferGet(det_ctx, list_id);
    if (buffer->inspect == NULL) {
        if (PacketIsIPv4(p)) {
            /* Suricata stores the IPv4 at the beginning of the field */
            InspectionBufferSetup(det_ctx, list_id, buffer, p->dst.address.address_un_data8, 4);
        } else if (PacketIsIPv6(p)) {
            InspectionBufferSetup(det_ctx, list_id, buffer, p->dst.address.address_un_data8, 16);

so we can have the 2 lengths. As Suricata stores the IPv4 at the beginning of the field, this is working.

src/detect.h Show resolved Hide resolved
@suricata-qa
Copy link

Information:

ERROR: QA failed on SURI_TLPW2_autofp_suri_time.

field baseline test %
SURI_TLPW2_autofp_stats_chk
.uptime 139 144 103.6%
SURI_TLPR1_stats_chk
.uptime 636 670 105.35%

Pipeline 23316

@catenacyber
Copy link
Contributor

Did you forget to use OISF/suricata-verify#2123 here ?

@@ -212,6 +212,10 @@
"xff": {
"type": "string"
},
"extra": {
"type": "object",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to get a description on everything new for documentation purposes, can you add one?

@jasonish
Copy link
Member

jasonish commented Nov 19, 2024

I'm a little hung up on the "extra" object? Is this data much different than metadata? For example, if we were to spin a rule for each entry in dataset, we'd add extra context in the message, or metadata, here we're just adding extra metadata from the dataset.

Also, with "extra" being a very generic variable name, should it be available to possible future items that might want to add extra metadata? For example, a Lua rule that wants to add extra metadata? Or should this be restricted to datajson and such be given a more specific name?

@jasonish
Copy link
Member

Is there a real need for a new dataset type of datajson? Could there just be dataset, and then dataset that has metadata?

@regit
Copy link
Contributor Author

regit commented Nov 23, 2024

I'm a little hung up on the "extra" object? Is this data much different than metadata? For example, if we were to spin a rule for each entry in dataset, we'd add extra context in the message, or metadata, here we're just adding extra metadata from the dataset.

I will admit I was kinda expecting to have a discussion on the used keyword. I'm not super happy with the choice but I did not find a better one.

I think we need to avoid conflict, people parsing metadata of signatures are kind of expecting to just have that in metadata. So it should be on its own.

Also, with "extra" being a very generic variable name, should it be available to possible future items that might want to add extra metadata? For example, a Lua rule that wants to add extra metadata? Or should this be restricted to datajson and such be given a more specific name?

The concept of extracting information at detection time seems pretty good. Flowbits is currently used for that (lua, pcre extraction) but it is then propagated to all events in the flowbits object. It may be better to offer a per alert way of doing this.

Maybe we can stay with extra (or something generic) in this case and extend the capabilities to other systems.

@regit
Copy link
Contributor Author

regit commented Nov 23, 2024

Is there a real need for a new dataset type of datajson? Could there just be dataset, and then dataset that has metadata?

I think it is more on educational side. Potentially also a bit keeping the code simple. On the educational/usage side, it allows signature readers to understand really fast what is really done. datajson keyword is rather explicit and different from dataset so user can understand that there will be consequences.

@jasonish
Copy link
Member

Is there a real need for a new dataset type of datajson? Could there just be dataset, and then dataset that has metadata?

I think it is more on educational side. Potentially also a bit keeping the code simple. On the educational/usage side, it allows signature readers to understand really fast what is really done. datajson keyword is rather explicit and different from dataset so user can understand that there will be consequences.

Can you explain how its different than a dataset that contains additional metadata to attach to the alert? Its almost like datadata would be the correct keyword then. JSON is just the encoding of the data :)

@regit
Copy link
Contributor Author

regit commented Nov 25, 2024

Is there a real need for a new dataset type of datajson? Could there just be dataset, and then dataset that has metadata?

I think it is more on educational side. Potentially also a bit keeping the code simple. On the educational/usage side, it allows signature readers to understand really fast what is really done. datajson keyword is rather explicit and different from dataset so user can understand that there will be consequences.

Can you explain how its different than a dataset that contains additional metadata to attach to the alert? Its almost like datadata would be the correct keyword then. JSON is just the encoding of the data :)

Currently, what the code is doing is:

  • parse value,json by separating at ,
  • check if json is valid data
    • if yes, add entry in hash with value -> string(json) (no transformation here)
  • when alert fire, add the string(json) directly under the extra subobject without using json dump

So it is really a datajson as we don't encode and reencode at output.

@regit regit mentioned this pull request Nov 28, 2024
5 tasks
@regit
Copy link
Contributor Author

regit commented Nov 28, 2024

Closing in favor of #12175

@regit regit closed this Nov 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

5 participants