🚸(backend) sort user search results by proximity with the active user#1802
🚸(backend) sort user search results by proximity with the active user#1802
Conversation
fb5d1bf to
d7cc384
Compare
Allows a user to find more easily the other users they search, with the following order of priority: - users they already share documents with (more recent first) - users that share the same full email domain - users that share the same partial email domain (last two parts) - other users
19706b0 to
768516c
Compare
768516c to
0eb0dc1
Compare
qbey
left a comment
There was a problem hiding this comment.
Interesting:) I like the python instead of SQL approach but I wonder if it will not bring too much overhead. I guess it could be ok for a first version but:
- is the "closer" compute not too much? I mean if you put the user you share something with, wouldn't that be lighter to compute and enough? Could be a first step, before bringing the heavy machine gun.
- we need to monitor the function execution time to see if at some point or for some users it starts to be long (either logger.info or prometheus)
Anyway, this is just a review to help (I know I only ask more questions and did not bring any solution ^^)
| return re.findall(enums.MEDIA_STORAGE_URL_EXTRACT, xml_content) | ||
|
|
||
|
|
||
| def users_sharing_documents_with(user): |
There was a problem hiding this comment.
I'm a bit afraid of this, because on each search query, you will load all user accesses, I don't have a better proposition and maybe we should let it like that but we should at least add a "timer" log to be able to see if this takes too much time. Maybe a cache could also help?
| .values("user") | ||
| .annotate(last_shared=db.Max("created_at")) | ||
| ) | ||
| return {item["user"]: item["last_shared"] for item in shared_qs} |
There was a problem hiding this comment.
| return {item["user"]: item["last_shared"] for item in shared_qs} | |
| return {item["user"]: item["last_shared"] for item in shared_qs.iterator()} |
I would suggest to use an iterator here.
| except ValidationError: | ||
| return "", "" | ||
|
|
||
| domain = email.split("@", 1)[1].lower() |
There was a problem hiding this comment.
Such split is not "safe", the safe way is using email.headerregistry.Address but for performances sake, I guess we should keep it like this: maybe add a comment and rename the function unsafe_extract_email_domain_parts to say you did it this way on purpose could help, and might prevent the use of this method for another context :)
Note: django-lasuite already provides a get_domain_from_email
| "document_id", flat=True | ||
| ) | ||
| shared_qs = ( | ||
| models.DocumentAccess.objects.filter(document_id__in=user_docs_qs) |
There was a problem hiding this comment.
Would using a Subquery be better here? (I mean, to prevent data from being passed to Django for nothing)
|
|
||
| ### Changed | ||
|
|
||
| - 🚸(backend) sort user search results by proximity with the active user #1802 |
There was a problem hiding this comment.
Maybe create an issue now in django-lasuite so we don't forget to upstream it as it will have to be shared with all our apps once it has proven its efficiency?
|
Per a talk with @virgile-dev, we will only keep results for the first two categories (shared doc with and same full email domain) and drop the others to avoid exposing email addresses, for GDPR reason. |
Purpose
Allows a user to find more easily the other users they search, with the following order of priority:
Solves #1521
Proposal
core/utils.py:users_sharing_documents_with()andextract_email_domain_parts()