Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛(backend) adding /stats endpoint for user and document statistics #449

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

lindenb1
Copy link
Collaborator

@lindenb1 lindenb1 commented Nov 25, 2024

Implements various metrics for users and documents like total_users, total_documents and more as requested here:

#415

Purpose

Implements various metrics to analyze user and document activity:

User Metrics:

Total Users: The total number of users in the system.
Active Users Today: Users who logged in on the current day.
Active Users in the Last 7 Days: Users who logged in during the past week.
Active Users in the Last 30 Days: Users who logged in during the past month.
Percentage of Active Users Today: Percentage of users who logged in on the current day.
Percentage of Active Users in the Last 7 Days: Percentage of users who logged in during the past week.
Percentage of Active Users in the Last 30 Days: Percentage of users who logged in during the past month.

Document Metrics:

Total Documents: The total number of documents in the system.
Documents Edited/Created Today: Documents updated or created on the current day.
Documents Edited/Created in the Last 7 Days: Documents updated or created during the past week.
Documents Edited/Created in the Last 30 Days: Documents updated or created during the past month.
Shared Documents Count: The number of documents shared (based on access count greater than 1).
Oldest Document Creation Date: The creation date of the oldest document in the system.
Newest Document Creation Date: The creation date of the most recently created document in the system.
Average Documents Per User: The average number of documents per user, calculated based on access data.

@lindenb1 lindenb1 mentioned this pull request Nov 25, 2024
@lindenb1 lindenb1 marked this pull request as ready for review November 25, 2024 15:09
@lindenb1 lindenb1 requested review from AntoLC and qbey November 25, 2024 15:11
@virgile-dev
Copy link
Collaborator

Hey @lindenb1
Thanks for going at it.
Here a couple thoughts.

  • Is there another way we can say that user has been active other than login ? Because if the session remains active then we can have an active user who is not detected. Could we have something based on actions the user perform like opening a doc ?
  • I don't have much use for oldest doc and newest doc, so it's a nice to have for me
  • Could we have separate metrics for documents created and documents updated ?
    Thanks in advance for your answers.

@AntoLC AntoLC requested a review from sampaccoud November 26, 2024 09:18
@sampaccoud
Copy link
Member

sampaccoud commented Nov 26, 2024

Maybe write tests then, once your tests pass, give a try to this code optimizing queries and proposing compute a distribution of the number of documents:

def list(self, request):
    """Returns various statistics in JSON format about documents usage."""

    # Time ranges
    today = now().date()
    one_week_ago = now() - timedelta(days=7)
    one_month_ago = now() - timedelta(days=30)

    # Total number of users
    user_queryset = models.User.objects.all()
    user_count = user_queryset.count()

    # Active users in different time frames
    active_user_counts = user_queryset.filter(
        last_login__isnull=False
    ).aggregate(
        active_today=Count("id", filter=Q(last_login__date=today)),
        active_7_days=Count("id", filter=Q(last_login__gte=one_week_ago)),
        active_30_days=Count("id", filter=Q(last_login__gte=one_month_ago)),
    )

    # Calculate percentages in Python
    percentage_active_users_today = (
        round((active_user_counts["active_today"] / user_count) * 100, 1)
        if user_count > 0
        else 0
    )
    percentage_active_users_7_days = (
        round((active_user_counts["active_7_days"] / user_count) * 100, 1)
        if user_count > 0
        else 0
    )
    percentage_active_users_30_days = (
        round((active_user_counts["active_30_days"] / user_count) * 100, 1)
        if user_count > 0
        else 0
    )

    # Document statistics
    doc_queryset = models.Document.objects.all()
    doc_stats = doc_queryset.aggregate(
        total_documents=Count("id"),
        shared_documents=Count("id", filter=Q(accesses__gt=1)),
        oldest_doc_date=Min("created_at"),
        newest_doc_date=Max("created_at"),
        active_today=Count("id", filter=Q(updated_at__date=today)),
        active_7_days=Count("id", filter=Q(updated_at__gte=one_week_ago)),
        active_30_days=Count("id", filter=Q(updated_at__gte=one_month_ago)),
    )

    # Calculate average documents per user
    user_doc_counts = doc_queryset.values("creator").annotate(
        doc_count=Count("id")
    )
    avg_docs_per_user = (
        round(sum(u["doc_count"] for u in user_doc_counts) / user_count, 1)
        if user_count > 0
        else 0
    )

    # JSON structure with document statistics
    user_doc_distribution = (
        doc_queryset.values("creator")
        .annotate(num_docs=Count("id"))
        .values_list("num_docs", flat=True)
    )
    doc_distribution = {
        count: sum(1 for _ in filter(lambda x: x == count, user_doc_distribution))
        for count in set(user_doc_distribution)
    }

    stats = {
        "total_users": user_count,
        "active_users_today": active_user_counts["active_today"],
        "active_users_7_days": active_user_counts["active_7_days"],
        "active_users_30_days": active_user_counts["active_30_days"],
        "percentage_active_users_today": percentage_active_users_today,
        "percentage_active_users_7_days": percentage_active_users_7_days,
        "percentage_active_users_30_days": percentage_active_users_30_days,
        "total_documents": doc_stats["total_documents"],
        "shared_documents_count": doc_stats["shared_documents"],
        "active_docs_today": doc_stats["active_today"],
        "active_docs_last_7_days": doc_stats["active_7_days"],
        "active_docs_last_30_days": doc_stats["active_30_days"],
        "oldest_document_date": doc_stats["oldest_doc_date"],
        "newest_document_date": doc_stats["newest_doc_date"],
        "average_documents_per_user": avg_docs_per_user,
        "user_document_distribution": doc_distribution,
    }
    return Response(stats)

This is untested ChatGPT output but the idea is there 😉

Copy link
Collaborator

@qbey qbey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First light review, I guess you must add some tests on this :)

Comment on lines 206 to 204
today = now().date()
one_week_ago = now() - timedelta(days=7)
one_month_ago = now() - timedelta(days=30)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit false, as we are not all using UTC timezone. I guess you should at least use the user timezone, and maybe (but not mandatory, for simplicity's sake) allow the timezone to be passed as a query argument?

You need to use datetime instead of date:

today = timezone.localtime(timezone=user.timezone).replace(hour=0, minute=0, second=0, microsecond=0)

because, if you filter datetime using date in database it uses the UTC datetime.

Comment on lines 213 to 226
# Active users in the last 7 days
active_users_today = models.User.objects.filter(
last_login__gte=today
).count()

# Active users in the last 7 days
active_users_7_days = models.User.objects.filter(
last_login__gte=one_week_ago
).count()

# Active users in the last 30 days
active_users_30_days = models.User.objects.filter(
last_login__gte=one_month_ago
).count()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Your implementation is ok, one improvement might be to group database queries:

 Aggregate active users in a single query
active_users = models.User.objects.aggregate(
    active_today=Count(Case(
        When(last_login__gte=today, then=1),
        output_field=IntegerField()
    )),
    active_7_days=Count(Case(
        When(last_login__gte=one_week_ago, then=1),
        output_field=IntegerField()
    )),
    active_30_days=Count(Case(
        When(last_login__gte=one_month_ago, then=1),
        output_field=IntegerField()
    ))
)

# Extract counts
active_users_today = active_users['active_today']
active_users_7_days = active_users['active_7_days']
active_users_30_days = active_users['active_30_days']

@lindenb1
Copy link
Collaborator Author

@virgile-dev

  • Is there another way we can say that user has been active other than login ? Because if the session remains active then we can have an active user who is not detected. Could we have something based on actions the user perform like opening a doc ?

From my understanding, yes there is. If we look at the model DocumentAccess, we could use the updated_at field the DocumentAccess model inherits from BaseAccess, which itself inherits from BaseModel. The BaseModel includes the updated_at field, which should automatically update whenever the DocumentAccess instance is saved. To accomplish this, we could use a combined query that includes multiple fields e.g.:

        # Active users based on DocumentAccess, LinkTrace and Last Login
        active_users_today = models.User.objects.filter(
            Q(documentaccess__updated_at__date=today) |
            Q(link_traces__created_at__date=today) |
            Q(last_login__date=today)
        ).distinct().count()

Which to me appears to be working fine so far.

@lindenb1 lindenb1 force-pushed the stats-endpoint branch 3 times, most recently from 33f595c to 03f4a71 Compare November 28, 2024 16:20
@lindenb1
Copy link
Collaborator Author

lindenb1 commented Nov 28, 2024

@virgile-dev @sampaccoud @qbey

I have implemented all the requirements and suggestions so far and kindly ask you to review them again. Additionally, I would like to highlight the comment at the test_stats_view_timezone_impact test function, where I had to apply some workaround adjustment to make it function correctly. But everything is explained in the comment of the test and why it is like that.

implements various metrics for users and documents like total_users,
total_documents and more

Signed-off-by: lindenb1 <[email protected]>
Copy link
Collaborator

@qbey qbey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not finish the review there is already a lot to rework before.


def list(self, request):
user_timezone = "UTC"
if request.user.is_authenticated:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should always be true, thanks to permission_classes = [IsAdminUser].

Comment on lines +221 to +228
if "tz" in request.query_params:
try:
timezone_to_use = ZoneInfo(request.query_params["tz"])
except ZoneInfoNotFoundError:
return drf_response.Response(
{"detail": "Invalid timezone provided."},
status=status.HTTP_400_BAD_REQUEST,
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need this for the moment, I suggest we keep things simple and don't manage the timezone change from URL now.

Comment on lines +201 to +208
class StatsViewSet(viewsets.GenericViewSet):
"""
API endpoint that provides various statistics in JSON format.
"""

permission_classes = [IsAdminUser]

def list(self, request):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a viewset which is needed here

Suggested change
class StatsViewSet(viewsets.GenericViewSet):
"""
API endpoint that provides various statistics in JSON format.
"""
permission_classes = [IsAdminUser]
def list(self, request):
class StatsAPIView(views.APIView):
"""
API endpoint that provides various statistics in JSON format.
"""
permission_classes = [IsAdminUser]
def get(self, request, *args, **kwargs):

Comment on lines +209 to +219
user_timezone = "UTC"
if request.user.is_authenticated:
user_timezone = str(getattr(request.user, "timezone", user_timezone))

try:
timezone_to_use = ZoneInfo(user_timezone)
except ZoneInfoNotFoundError:
return drf_response.Response(
{"detail": "Invalid user timezone provided."},
status=status.HTTP_400_BAD_REQUEST,
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep things simple:

Suggested change
user_timezone = "UTC"
if request.user.is_authenticated:
user_timezone = str(getattr(request.user, "timezone", user_timezone))
try:
timezone_to_use = ZoneInfo(user_timezone)
except ZoneInfoNotFoundError:
return drf_response.Response(
{"detail": "Invalid user timezone provided."},
status=status.HTTP_400_BAD_REQUEST,
)
user_timezone = request.user.timezone

Comment on lines +230 to +232
now_time = now()
now_in_user_tz = now_time.astimezone(timezone_to_use)
today = now_in_user_tz.replace(hour=0, minute=0, second=0, microsecond=0)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

        local_now = timezone.localtime(timezone=user_timezone)
        today_datetime = local_now.replace(hour=0, minute=0, second=0, microsecond=0)
        today = today_datetime.date()
        
        # OR
        today = timezone.localdate(timezone=user_timezone)
        today_datetime = datetime.datetime.combine(today, datetime.time.min)

Comment on lines +230 to +234
now_time = now()
now_in_user_tz = now_time.astimezone(timezone_to_use)
today = now_in_user_tz.replace(hour=0, minute=0, second=0, microsecond=0)
one_week_ago = today - timedelta(days=7)
one_month_ago = today - timedelta(days=30)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See previous comment + this is quite false because of daylight saving time.

one_week_ago = datetime.datetime.combine(today - timedelta(days=7), datetime.time.min)
thirty_days_ago = datetime.datetime.combine(today - timedelta(days=30), datetime.time.min)

Comment on lines +240 to +242
# Convert to UTC for database queries
today_start_utc = today_in_user_tz.astimezone(ZoneInfo("UTC"))
tomorrow_start_utc = tomorrow_in_user_tz.astimezone(ZoneInfo("UTC"))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't need to do that, Django takes care of it

"""
Non-admin authenticated users should not have access to the stats endpoint.
"""
regular_user = factories.UserFactory(is_staff=False)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is_staff is different from is_admin

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants