Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change data models, to fix existing issues #206

Closed
sbs2001 opened this issue Jun 20, 2020 · 12 comments
Closed

Change data models, to fix existing issues #206

sbs2001 opened this issue Jun 20, 2020 · 12 comments

Comments

@sbs2001
Copy link
Collaborator

sbs2001 commented Jun 20, 2020

Vulnerability model diff

Old Vulnerability model

class Vulnerability(models.Model):
   cve_id = models.CharField(max_length=50, help_text='CVE ID', unique=True, null=True)
   summary = models.TextField(help_text='Summary of the vulnerability', blank=True)
   cvss = models.FloatField(max_length=100, help_text='CVSS Score', null=True)

New Vulnerability model

In the new Vulnerability model, cvss which is a severity indicator, is removed, the reason being, every data source rates the same vulnerability differently, hence we need some additional context when talking about severity. Hence all severity related fields are moved to VulnerabilityReference.

I also felt moving reference_ids here makes more sense. We would want a way to avoid entries duplicate entries in reference_id which I guess is possible, check https://stackoverflow.com/questions/49252135/how-to-save-arrayfield-as-set-in-django

class Vulnerability(models.Model):
   vuln_id = models.CharField(max_length=50, help_text='CVE ID', unique=True, null=True)
   reference_ids = pgfields.ArrayField(
       models.CharField(max_length=50,help_text='Reference ID, eg:DSA-4465-1'),default=list)

ImpactedPackage

Note: We would delete ResolvedPackage and rename ImpactedPackage to something else. Since their exact funtionality can be mimicked by a more efficient boolean field. This will also reduce the number of db queries

Old ImpactedPackage

class ImpactedPackage(models.Model):
   vulnerability = models.ForeignKey(Vulnerability, on_delete=models.CASCADE)
   package = models.ForeignKey(Package, on_delete=models.CASCADE)

   class Meta:
       unique_together = ('vulnerability', 'package')

New ImpactedPackage

Note: This won't be called ImpactedPackage, please suggest a good name.

The version_range will contain mathematical comparator symbols like '<,>,=' , (This will also support complex ranges like >1.0.0, <9.8.10 basically anything that dephell_range_specifier.RangeSpecifier supports) . This is very useful in cases where the data sources don't provide concrete versions. In the cases where concrete versions are provided we can always have version_range= '=' .

I haven't figured out how to solve #141, need help . Maybe adding Importer FK might help.

class ImpactedPackage(models.Model):
   vulnerability = models.ForeignKey(Vulnerability, on_delete=models.CASCADE)
   package = models.ForeignKey(Package, on_delete=models.CASCADE)
   version_range = models.CharField(max_length=30,default='=')
   is_vulnerable = models.BooleanField()

   class Meta:
       unique_together = ('vulnerability', 'package', 'version_range')

VulnerabilityReference

Old VulnerabilityReference

class VulnerabilityReference(models.Model):
   vulnerability = models.ForeignKey(
       Vulnerability, on_delete=models.CASCADE)
   source = models.CharField(
       max_length=50, help_text='Source(s) name eg:NVD', blank=True)
   reference_id = models.CharField(
       max_length=50, help_text='Reference ID, eg:DSA-4465-1', blank=True)
   url = models.URLField(
       max_length=1024, help_text='URL of Vulnerability data', blank=True)

   class Meta:
       unique_together = ('vulnerability', 'source', 'reference_id', 'url')
      

New VulnerabilityReference

Lots of changes here. We have here a FK of Importer, which makes logical sense since different data sources will say different things about the same vulnerability(hence the new unique_together)
Other changes include addtion of more severity indicators and using pgfields.ArrayField to store urls because we don't want to create a VulnerabilityReference for each url with other fields having same data.

Updated

class VulnerabilityReference(models.Model):
   vulnerability = models.ForeignKey(
       Vulnerability, on_delete=models.CASCADE,required=True)
   source = models.ForeignKey(
       Importer, on_delete=models.CASCADE, required=True)
   urls = pgfields.ArrayField(models.URLField(
       max_length=1024, help_text='URL of Vulnerability data'),default=list)
   summary = models.TextField(help_text='Summary of the vulnerability', blank=True)
   class Meta:
       unique_together = ('vulnerability', 'source')

class VulnerabilityScore(models.Model):
 vulnerability_reference = models.ForeignKey(VulnerabilityReference, on_delete=models.CASCADE, required=True)
 type = models.CharField(max_length=50, help_text='Vulnerability score type', blank=True))
 score = models.FloatField()
 severity_text = models.CharField()

Gets us closer to fix #18, #157, #140, #209 and allows us to collect data from any advisories which don't provide concrete versions(but provide version ranges)

@sbs2001
Copy link
Collaborator Author

sbs2001 commented Jun 20, 2020

I just watched few talks about database indexing and django, looking back at these models, they are not perfect considering the indexing(order in unique_together matters ;) which I didnt think of earlier ) .

@pombredanne
Copy link
Member

pombredanne commented Jun 20, 2020

For VulnerabilityReference, IMHO each URL is a different URL... I do not see the value to use an arrayfield there. That's OK to have the same data with different URLs.

And what about that :

class VulnerabilityScore(models.Model):
   vulnerability_reference = models.ForeignKey(VulnerabilityReference, on_delete=models.CASCADE, required=True)
   type = models.CharField(max_length=50, help_text='Vulnerability score type', blank=True))
   score = models.CharField(max_length=50, help_text='Score value', blank=True))

and type could be any of the various cvss, a severity, or "other"

@sbs2001
Copy link
Collaborator Author

sbs2001 commented Jun 20, 2020

Last time I checked https://github.com/intel/cve-bin-tool , they also used such a model with minor differences. I like the type field which future proofs the db for all other standards.

The only issues I have with this is score = models.CharField(max_length=50, help_text='Score value', blank=True)) , which kind of prevents the use case where the user wants to view vulnerabilities with cvssX > Y . Adding to your model IMHO I like this better

class VulnerabilityScore(models.Model):
  vulnerability_reference = models.ForeignKey(VulnerabilityReference, on_delete=models.CASCADE, required=True)
  type = models.CharField(max_length=50, help_text='Vulnerability score type', blank=True))
  score = models.FloatField()
  severity_text = models.CharField()

We could also get more sophisticated by populating the severity_text by computing the score, but let's get the big things right :)

Last time we discussed about this issue, you also mentioned that we could get sophisticated by storing timestamps of when the scores were given, which is now possible thanks to @haikoschol 's Importer model. If we want timestamps using a separate VulnerabilityScore ,will add up one more sql JOIN IMHO, but that's alright.

@sbs2001
Copy link
Collaborator Author

sbs2001 commented Jun 20, 2020

@pombredanne re

For VulnerabilityReference, IMHO each URL is a different URL... I do not see the value to use an arrayfield there. That's OK to have the same data with different URLs.

The rationale behind using arrayfield is I want to have unique_together = ('vulnerability', 'source') . One reference to look at to know what XYZ data_source thinks about it ;) . Also isn't preventing duplicate data considered a win , I wouldn't say arrayfield being useless here

@haikoschol
Copy link
Collaborator

Regarding the proposed Vulnerability model:

  • I would prefer not using blank=True and null=True for text columns, unless there really is a semantic difference between empty string and NULL.
  • In theory it is possible that two data sources use the same reference ID for two different vulnerabilities that both don't have a CVE ID. I don't see how we could detect that and avoid updating the vulnerability that was imported first, instead of adding a new one. I have no idea whether that is likely to happen though.

Regarding the proposed ImpactedPackage model:

IIRC, one issue that came up when implementing the new import mechanism was related to cascading deletes in ImpactedPackage and ResolvedPackage; When an advisory gets updated and a version that was previously considered fixed is vulnerable. This should be possible with the proposed changes by simply updating the is_vulnerable flag. (BTW, I don't have a good name for the new ImpactedPackage model.)

Regarding #141:

I'm not sure I understand the issue. What would be a concrete scenario that triggers it? Is it something like this:

  1. new vulnerability V is imported, references package P as vulnerable in versions >1.0,<2.0
  2. before the next import run for this data source, the version range for V is updated to >1.1,<2.0

If the data source supports detecting updates to existing advisories, it should work fine. If the data source does not support it, there is nothing we can do automatically, so have to rely on curations.

@sbs2001
Copy link
Collaborator Author

sbs2001 commented Jun 24, 2020

I would prefer not using blank=True and null=True for text columns, unless there really is a semantic difference between empty string and NULL.

Sure I didn't give it much thought , I want to have the big things right first.

I'm not sure I understand the issue.

The scenario described is perfect. Rather it actually happened a while ago when you discovered that wrong version range in safety-db

@sbs2001
Copy link
Collaborator Author

sbs2001 commented Jun 25, 2020

@haikoschol

I would prefer not using blank=True and null=True for text columns, unless there really is a semantic difference between empty string and NULL.

I remember now, a while back I had posted something like this in chat

This doesn't error out , but it should

In [2]: p1 = Package(name='foo',version='1.0.0',type='deb')                                                                                                                         

In [3]: p2 = Package(name='foo',version='1.0.0',type='deb')                                                                                                                         

In [4]: p1.save()                                                                                                                                                                   

In [5]: p2.save()

This errors out fine,

In [9]: p2 = Package(name='foo',version='1.0.0',type='deb',qualifiers={'os':'windows'},namespace='',subpath='')                                                                     

In [10]: p3 = Package(name='foo',version='1.0.0',type='deb',qualifiers={'os':'windows'},namespace='',subpath='')                                                                    

In [11]: p3.save()                                                                                                                                                                  

In [13]: p2.save()

So in the latter case, we were storing empty string in db, while the first case stored NULL in db. The first case creates duplicates. IMHO empty string seems better in this Package context.

@haikoschol
Copy link
Collaborator

This doesn't error out , but it should

In [2]: p1 = Package(name='foo',version='1.0.0',type='deb')                                                                                                                         

In [3]: p2 = Package(name='foo',version='1.0.0',type='deb')                                                                                                                         

In [4]: p1.save()                                                                                                                                                                   

In [5]: p2.save()

This errors out fine,

In [9]: p2 = Package(name='foo',version='1.0.0',type='deb',qualifiers={'os':'windows'},namespace='',subpath='')                                                                     

In [10]: p3 = Package(name='foo',version='1.0.0',type='deb',qualifiers={'os':'windows'},namespace='',subpath='')                                                                    

In [11]: p3.save()                                                                                                                                                                  

In [13]: p2.save()

So in the latter case, we were storing empty string in db, while the first case stored NULL in db. The first case creates duplicates. IMHO empty string seems better in this Package context.

Ah, that's the reason for this behaviour. Makes sense and yet another reason to avoid blank=True, null=True. :)

@pombredanne
Copy link
Member

So in the latter case, we were storing empty string in db, while the first case stored NULL in db. The first case creates duplicates. IMHO empty string seems better in this Package context.

we have to be consistent in all cases... @tdruez what's your take on the blank/null thing on Django fields?

@pombredanne
Copy link
Member

Here are my notes what our live discussion today:

  1. It would be best to have all models changes in a PR (it is OK that nothing will work and test fails, etc.) this is just to better view the changes and comment on them more efficiently
  2. the proposed changes for ImpactedPackage seem fine (minor adjustments may be needed and feedback in the PR will help)
  3. for the Vulnerability/VulnerabiltyReference changes, that's rather major and there are possibly multiple ways to hand these. For now, the PR is going to the best way.
  4. It may be better to use a JSONField than an ArrayField as it is more portable to other DBs: an array is really just a JSON list?
  5. a small diagram of the proposed new models and their relationship would help a lot, but that's secondary
  6. storing the commit as references is fine IMHO, we should agree on a convention on how this fits. I think using the SPDX VCS URL as the URL in a reference may be enough?

@tdruez
Copy link
Contributor

tdruez commented Jun 25, 2020

@pombredanne https://docs.djangoproject.com/en/3.0/ref/models/fields/#null

Avoid using null on string-based fields such as CharField and TextField. If a string-based field has null=True, that means it has two possible values for “no data”: NULL, and the empty string. In most cases, it’s redundant to have two possible values for “no data;” the Django convention is to use the empty string, not NULL.

@sbs2001
Copy link
Collaborator Author

sbs2001 commented Feb 23, 2021

Fixed finally via #259

@sbs2001 sbs2001 closed this as completed Feb 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants