Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Applied Schema Caching Fix | Canidate V0.7.1 #121

Closed
wants to merge 2 commits into from

Conversation

SoundsSerious
Copy link
Contributor

@SoundsSerious SoundsSerious commented Feb 8, 2023

  • use python __{class.__name__} prefix to store schema caches similar to the python double underscore, name mangling approach.
  • added recache option in schema to allow changing schema [ schema(...,recache:bool=False) ]

This PR should address these issues, while preserving the memory overflow fix of 0.7.0
#120
#115
#114

@kashifpk I did run these through the testing system and found that all tests passed. I think it might be worthwhile to check the relation's schema since I don't really use those in my application.

@overbost you may find this useful for your issues.

SoundsSerious added 2 commits February 7, 2023 20:52
commit 2086167
Merge: d13f7eb 91e1a19
Author: Kashif Iftikhar <[email protected]>
Date:   Sat Nov 19 12:26:40 2022 +0500

    Merge pull request threatify#117 from anthony-poddubny/feature/add_full_count_support

    Add full count support to the Query

commit 91e1a19
Author: Anthony Poddubny <[email protected]>
Date:   Sat Nov 19 00:08:11 2022 +0200

    add full count support

commit d13f7eb
Merge: 8de2c46 84acdd3
Author: Kashif Iftikhar <[email protected]>
Date:   Fri Jun 10 14:39:36 2022 +0500

    Merge pull request threatify#110 from valentingregoire/patch-1

    fix: Typo in README fixed.

commit 84acdd3
Author: Valentin Grégoire <[email protected]>
Date:   Wed Jun 8 09:23:43 2022 +0200

    fix: Typo in README fixed.

commit 8de2c46
Merge: 641c2cf 202d4e7
Author: Kashif Iftikhar <[email protected]>
Date:   Thu Apr 21 10:45:50 2022 +0500

    Merge pull request threatify#108 from SoundsSerious/fix_mem_schema

    Added Conditional Checking Of Marshmallow Fields
- use python `__{class}` double underscore behavior to prevent subclassing issues.
- added recache option in schema to allow changing schema [ schema(...,recache:bool=False) ]
@overbost
Copy link
Contributor

overbost commented Feb 8, 2023

Thank you for the contribution.

I belive you tested the solution but i have some doubts about double underscore properties.
Check this samples i run in online-python.com

class A():
  @classmethod
  def schema(cls):
    if not hasattr(cls, "__cls_schema"):
        print("generating schema...")
        cls.__cls_schema = cls.__name__ + "-Schema"
        
print(A.schema())
---
output:

generating schema...
None
Traceback (most recent call last):
  File "main.py", line 13, in <module>
    print(A.__cls_schema)
AttributeError: type object 'A' has no attribute '__cls_schema'

Double score properties is not defined in class.
And there are another issue with class inheritation see here

This is the cause of the issue i reported about class inheritation

class A():
  @classmethod
  def schema(cls):
    if not hasattr(cls, "_cls_schema"):
        print("generating schema...")
        cls._cls_schema = cls.__name__ + "-Schema"
        
print(A.schema())

class B(A):
  pass
    
print(A._cls_schema)

print(B.schema())
print(B._cls_schema)

---
output:
generating schema...
ASchema
ASchema

Thank you for reporting me the solution where i can use CollectionBase, but doesn't fix the issue because _load method from collection is still used and is buggy.
I suggest to recycle the same properties names and simple check if schema name is referable to current class:

class A():
  @classmethod
  def schema(cls):
    if not hasattr(cls, "_cls_schema") or cls._cls_schema != cls.__name__ + "Schema":
        print("generating schema...")
        cls._cls_schema = cls.__name__ + "Schema"
        
A.schema()
A.schema()
class B(A):
  pass
    
print(A._cls_schema)

B.schema()
print(B._cls_schema)

--- 
output:
generating schema...
ASchema
generating schema...
BSchema

This solution allow schema cache, A.schema() is called twice but not generated twice.

simple and clean

@SoundsSerious
Copy link
Contributor Author

SoundsSerious commented Feb 8, 2023

@overbost You will see that I am using the same hasattr/setattr approach with cls.__name__, this is done by g() and f() lambda funtions.

The name mangling approach is purely to apply to the conventions of python.

It would be great if you could checkout this code and evaluate the code in this commit since there are additional factors for partial schema instantiation with the only keyword.

I have edited the text of the PR to reflect this

@overbost
Copy link
Contributor

overbost commented Feb 8, 2023

setattr is a nice workaroud, i have tested your PR and it works.
This solve my issue about class inheritance but as expected there are unused data inherited in derived class, check the code below

from arango_orm import Collection

class A(Collection):
    __collection__ = "A"
    
class B(A):
    __collection__ = "B"

a = A()

b = B._load(a._dump())

print(a.__collection__)
print(a)
print(a.__A__cls_schema)

print(b.__collection__)   
print(b)
print(b.__B__cls_schema)
print(b.schema())
print(b.__A__cls_schema)

---
output

 A
<A(_key=None)>
<class 'marshmallow.schema.ASchema'>
B
<B(_key=None)>
<class 'marshmallow.schema.BSchema'>
<BSchema(many=False)>
<class 'marshmallow.schema.ASchema'>   <-- __A__cls_schema shouldn't exist in B class

of course is not causing issues and the waste memory is not too much, but is technically incorrect.
Fetch a list of collections can be waste more memory.
My comment is only to provide contribution, i have no power here.

@SoundsSerious
Copy link
Contributor Author

SoundsSerious commented Feb 9, 2023

@overbost This is an interesting example. I have not observed this behavior, checking the __dict__ of each class.

I suspect this has to do with accessing any subclass property? Ie. any subclass can access its parent class variables?

How would the solution you're proposing handle this differently?

I don't think this affects memory at all since the variable from the parent class is defined in the parent class dict. Also good to remember that the point of this schema caching is reduce memory consumption. Every time schema() is called a new set of marshmallow objects is created creating a pretty significant bloat per object which affects CPU as well

@overbost
Copy link
Contributor

overbost commented Feb 9, 2023

If you check my first comment here there are the code example of the issue and my proposal of solution.
Is not a problem with _load/_dump process but caching logic in schema() method.
A.schema() is called before B class definition/creation, so B class inherit _cls_schema and _cls_schema_cache properties from A.
This properties are already setted in B, and B schema is not generated because the properties exists and contain A schema.
My solution is simply to reuse the same name properties, and simply check if name of marshmallow schema is cls.__name__ + "Schema" as reported in my first comment.
Last but not least i don't understand why you put a flag to force schema regeneration, since the schema should be constant at runtime, and not modified manually by the user. It must be modified at runtime only in this edge case of class inheritance, but is just a workaround.

I tryed the solution of settattr you use, but there aren't benefit in class inheritance

class A():
  @classmethod
  def schema(cls):
    if not hasattr(cls, "__cls_schema"):
        print("generating schema...")
        setattr(cls, "__cls_schema", cls.__name__ + "-Schema")
        
print(A.schema())

class B(A):
  pass
    
print(A.__cls_schema)

print(B.schema())
print(B.__cls_schema)

---
output

generating schema...
None
A-Schema
None <-- missed schema generation
A-Schema <-- It should be B-Schema

@SoundsSerious
Copy link
Contributor Author

SoundsSerious commented Feb 9, 2023

I think these are essentially the same solution. Yours has a dynamic flag, whereas mine has a dynamic variable. The authoritative schema comes from the schema() call so in the end, it's all the same.

I agree schema should be static at run time, but it's not a stated requirement in the library. The reason I added the regeneration ability was to address one potential issue causing #114, which caused a revert of 0.7.0

There were some issues with the entire concept of schema caching, so I am more concerned about those. I am also concerned about not having it, ie, memory leaks, as per issue #105.

I think it might take some time to figure this out but with @kashifpk's input and yours, I'm sure we can determine a solution :)

@overbost
Copy link
Contributor

overbost commented Feb 9, 2023

I appreciate your work, but i made a PR for my solution, included other changes you made in your PR.
#114 have the same problem with class inheritation.
I also added a simple bulk_delete.
Tests passed.

I apologize if i made some mistake, it's my first PR.

@SoundsSerious
Copy link
Contributor Author

@volodymyrko We are discussing the schema caching issue as it relates to a number of issues including one opened by you #114

Its unclear the steps needed to reproduce this error, would you be able to test this PR or let us know how to setup to test for your issue?

@volodymyrko
Copy link

@SoundsSerious

this is the code from README and two asserts to reproduce the issue

from arango import ArangoClient
from arango_orm import Database

from arango_orm import Collection, Relation, Graph, GraphConnection
from arango_orm.fields import String, Date, Integer, Boolean


client = ArangoClient(hosts='http://localhost:8529')
test_db = client.db('main', username='test_user', password='qwerty')

db = Database(test_db)


class Student(Collection):

    __collection__ = 'students'

    _key = String(required=True)  # registration number
    name = String(required=True, allow_none=False)
    age = Integer()

    def __str__(self):
        return "<Student({})>".format(self.name)


class Teacher(Collection):

    __collection__ = 'teachers'

    _key = String(required=True)  # employee id
    name = String(required=True)

    def __str__(self):
        return "<Teacher({})>".format(self.name)


class Subject(Collection):

    __collection__ = 'subjects'

    _key = String(required=True)  # subject code
    name = String(required=True)
    credit_hours = Integer()
    has_labs = Boolean(missing=True)

    def __str__(self):
        return "<Subject({})>".format(self.name)




class SpecializesIn(Relation):

    __collection__ = 'specializes_in'

    expertise_level = String(required=True, options=["expert", "medium", "basic"])

    def __str__(self):
        return "<SpecializesIn(_key={}, expertise_level={}, _from={}, _to={})>".format(
            self._key, self.expertise_level, self._from, self._to)


class UniversityGraph(Graph):

    __graph__ = 'university_graph'

    graph_connections = [
        # Using general Relation class for relationship
        GraphConnection(Student, Relation("studies"), Subject),
        GraphConnection(Teacher, Relation("teaches"), Subject),

        # Using specific classes for vertex and edges
        GraphConnection(Teacher, SpecializesIn, Subject),
    ]

uni_graph = UniversityGraph(connection=db)
 db.create_graph(uni_graph)

# THESE ARE 2 asserts that reproduce the issue 
assert list(uni_graph.edges['studies'].schema().fields.keys()) == []
# the next line fails, because `[]` is already cached (this is actually the issue)
assert list(uni_graph.edges['specializes_in'].schema().fields.keys()) == ['expertise_level'] 

the order of assert is important ! if you swap them than the error will disappear (but still it is not correct behavior ).

@volodymyrko
Copy link

I've tested PR in terms of the issue I faced - everything seems to be fine.

@SoundsSerious
Copy link
Contributor Author

@volodymyrko Awesome this is great! I think this shows that the class name-based lazy caching can handle edge cases like relationship instantiation.

@kashifpk any other concerns you think this PR might present?

@kashifpk
Copy link
Contributor

@SoundsSerious @overbost @volodymyrko - Thank you for all the work on this. Let me go through this one and #122. I'll report back here. Just want to test out a few scenarios with both approaches.

@SoundsSerious
Copy link
Contributor Author

@kashifpk would you like me to close out this PR since the other PR effectively supports the same features?

@kashifpk
Copy link
Contributor

kashifpk commented Mar 2, 2023

@SoundsSerious - sure, I couldn't find time to look into other caching PRs, I leave it to your best judgement.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants