-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
🐛(backend) adding /stats endpoint for user and document statistics #449
base: main
Are you sure you want to change the base?
Conversation
dcd6b28
to
920d764
Compare
Hey @lindenb1
|
Maybe write tests then, once your tests pass, give a try to this code optimizing queries and proposing compute a distribution of the number of documents: def list(self, request):
"""Returns various statistics in JSON format about documents usage."""
# Time ranges
today = now().date()
one_week_ago = now() - timedelta(days=7)
one_month_ago = now() - timedelta(days=30)
# Total number of users
user_queryset = models.User.objects.all()
user_count = user_queryset.count()
# Active users in different time frames
active_user_counts = user_queryset.filter(
last_login__isnull=False
).aggregate(
active_today=Count("id", filter=Q(last_login__date=today)),
active_7_days=Count("id", filter=Q(last_login__gte=one_week_ago)),
active_30_days=Count("id", filter=Q(last_login__gte=one_month_ago)),
)
# Calculate percentages in Python
percentage_active_users_today = (
round((active_user_counts["active_today"] / user_count) * 100, 1)
if user_count > 0
else 0
)
percentage_active_users_7_days = (
round((active_user_counts["active_7_days"] / user_count) * 100, 1)
if user_count > 0
else 0
)
percentage_active_users_30_days = (
round((active_user_counts["active_30_days"] / user_count) * 100, 1)
if user_count > 0
else 0
)
# Document statistics
doc_queryset = models.Document.objects.all()
doc_stats = doc_queryset.aggregate(
total_documents=Count("id"),
shared_documents=Count("id", filter=Q(accesses__gt=1)),
oldest_doc_date=Min("created_at"),
newest_doc_date=Max("created_at"),
active_today=Count("id", filter=Q(updated_at__date=today)),
active_7_days=Count("id", filter=Q(updated_at__gte=one_week_ago)),
active_30_days=Count("id", filter=Q(updated_at__gte=one_month_ago)),
)
# Calculate average documents per user
user_doc_counts = doc_queryset.values("creator").annotate(
doc_count=Count("id")
)
avg_docs_per_user = (
round(sum(u["doc_count"] for u in user_doc_counts) / user_count, 1)
if user_count > 0
else 0
)
# JSON structure with document statistics
user_doc_distribution = (
doc_queryset.values("creator")
.annotate(num_docs=Count("id"))
.values_list("num_docs", flat=True)
)
doc_distribution = {
count: sum(1 for _ in filter(lambda x: x == count, user_doc_distribution))
for count in set(user_doc_distribution)
}
stats = {
"total_users": user_count,
"active_users_today": active_user_counts["active_today"],
"active_users_7_days": active_user_counts["active_7_days"],
"active_users_30_days": active_user_counts["active_30_days"],
"percentage_active_users_today": percentage_active_users_today,
"percentage_active_users_7_days": percentage_active_users_7_days,
"percentage_active_users_30_days": percentage_active_users_30_days,
"total_documents": doc_stats["total_documents"],
"shared_documents_count": doc_stats["shared_documents"],
"active_docs_today": doc_stats["active_today"],
"active_docs_last_7_days": doc_stats["active_7_days"],
"active_docs_last_30_days": doc_stats["active_30_days"],
"oldest_document_date": doc_stats["oldest_doc_date"],
"newest_document_date": doc_stats["newest_doc_date"],
"average_documents_per_user": avg_docs_per_user,
"user_document_distribution": doc_distribution,
}
return Response(stats) This is untested ChatGPT output but the idea is there 😉 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First light review, I guess you must add some tests on this :)
src/backend/core/api/viewsets.py
Outdated
today = now().date() | ||
one_week_ago = now() - timedelta(days=7) | ||
one_month_ago = now() - timedelta(days=30) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit false, as we are not all using UTC timezone. I guess you should at least use the user timezone
, and maybe (but not mandatory, for simplicity's sake) allow the timezone to be passed as a query argument?
You need to use datetime instead of date:
today = timezone.localtime(timezone=user.timezone).replace(hour=0, minute=0, second=0, microsecond=0)
because, if you filter datetime
using date
in database it uses the UTC datetime.
src/backend/core/api/viewsets.py
Outdated
# Active users in the last 7 days | ||
active_users_today = models.User.objects.filter( | ||
last_login__gte=today | ||
).count() | ||
|
||
# Active users in the last 7 days | ||
active_users_7_days = models.User.objects.filter( | ||
last_login__gte=one_week_ago | ||
).count() | ||
|
||
# Active users in the last 30 days | ||
active_users_30_days = models.User.objects.filter( | ||
last_login__gte=one_month_ago | ||
).count() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Your implementation is ok, one improvement might be to group database queries:
Aggregate active users in a single query
active_users = models.User.objects.aggregate(
active_today=Count(Case(
When(last_login__gte=today, then=1),
output_field=IntegerField()
)),
active_7_days=Count(Case(
When(last_login__gte=one_week_ago, then=1),
output_field=IntegerField()
)),
active_30_days=Count(Case(
When(last_login__gte=one_month_ago, then=1),
output_field=IntegerField()
))
)
# Extract counts
active_users_today = active_users['active_today']
active_users_7_days = active_users['active_7_days']
active_users_30_days = active_users['active_30_days']
From my understanding, yes there is. If we look at the model DocumentAccess, we could use the updated_at field the DocumentAccess model inherits from BaseAccess, which itself inherits from BaseModel. The BaseModel includes the updated_at field, which should automatically update whenever the DocumentAccess instance is saved. To accomplish this, we could use a combined query that includes multiple fields e.g.:
Which to me appears to be working fine so far. |
33f595c
to
03f4a71
Compare
@virgile-dev @sampaccoud @qbey I have implemented all the requirements and suggestions so far and kindly ask you to review them again. Additionally, I would like to highlight the comment at the |
implements various metrics for users and documents like total_users, total_documents and more Signed-off-by: lindenb1 <[email protected]>
03f4a71
to
bfa95d5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not finish the review there is already a lot to rework before.
|
||
def list(self, request): | ||
user_timezone = "UTC" | ||
if request.user.is_authenticated: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should always be true, thanks to permission_classes = [IsAdminUser]
.
if "tz" in request.query_params: | ||
try: | ||
timezone_to_use = ZoneInfo(request.query_params["tz"]) | ||
except ZoneInfoNotFoundError: | ||
return drf_response.Response( | ||
{"detail": "Invalid timezone provided."}, | ||
status=status.HTTP_400_BAD_REQUEST, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need this for the moment, I suggest we keep things simple and don't manage the timezone change from URL now.
class StatsViewSet(viewsets.GenericViewSet): | ||
""" | ||
API endpoint that provides various statistics in JSON format. | ||
""" | ||
|
||
permission_classes = [IsAdminUser] | ||
|
||
def list(self, request): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a viewset which is needed here
class StatsViewSet(viewsets.GenericViewSet): | |
""" | |
API endpoint that provides various statistics in JSON format. | |
""" | |
permission_classes = [IsAdminUser] | |
def list(self, request): | |
class StatsAPIView(views.APIView): | |
""" | |
API endpoint that provides various statistics in JSON format. | |
""" | |
permission_classes = [IsAdminUser] | |
def get(self, request, *args, **kwargs): |
user_timezone = "UTC" | ||
if request.user.is_authenticated: | ||
user_timezone = str(getattr(request.user, "timezone", user_timezone)) | ||
|
||
try: | ||
timezone_to_use = ZoneInfo(user_timezone) | ||
except ZoneInfoNotFoundError: | ||
return drf_response.Response( | ||
{"detail": "Invalid user timezone provided."}, | ||
status=status.HTTP_400_BAD_REQUEST, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keep things simple:
user_timezone = "UTC" | |
if request.user.is_authenticated: | |
user_timezone = str(getattr(request.user, "timezone", user_timezone)) | |
try: | |
timezone_to_use = ZoneInfo(user_timezone) | |
except ZoneInfoNotFoundError: | |
return drf_response.Response( | |
{"detail": "Invalid user timezone provided."}, | |
status=status.HTTP_400_BAD_REQUEST, | |
) | |
user_timezone = request.user.timezone |
now_time = now() | ||
now_in_user_tz = now_time.astimezone(timezone_to_use) | ||
today = now_in_user_tz.replace(hour=0, minute=0, second=0, microsecond=0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
local_now = timezone.localtime(timezone=user_timezone)
today_datetime = local_now.replace(hour=0, minute=0, second=0, microsecond=0)
today = today_datetime.date()
# OR
today = timezone.localdate(timezone=user_timezone)
today_datetime = datetime.datetime.combine(today, datetime.time.min)
now_time = now() | ||
now_in_user_tz = now_time.astimezone(timezone_to_use) | ||
today = now_in_user_tz.replace(hour=0, minute=0, second=0, microsecond=0) | ||
one_week_ago = today - timedelta(days=7) | ||
one_month_ago = today - timedelta(days=30) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See previous comment + this is quite false because of daylight saving time.
one_week_ago = datetime.datetime.combine(today - timedelta(days=7), datetime.time.min)
thirty_days_ago = datetime.datetime.combine(today - timedelta(days=30), datetime.time.min)
# Convert to UTC for database queries | ||
today_start_utc = today_in_user_tz.astimezone(ZoneInfo("UTC")) | ||
tomorrow_start_utc = tomorrow_in_user_tz.astimezone(ZoneInfo("UTC")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't need to do that, Django takes care of it
""" | ||
Non-admin authenticated users should not have access to the stats endpoint. | ||
""" | ||
regular_user = factories.UserFactory(is_staff=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is_staff
is different from is_admin
Implements various metrics for users and documents like total_users, total_documents and more as requested here:
#415
Purpose
Implements various metrics to analyze user and document activity:
User Metrics:
Document Metrics: