[DAG metrics] add dataservices metrics #369
Conversation
Great work, much cleaner than before 🧹
dgv/metrics/task_functions.py
"DATAGOUVFR_RGS~" in parsed_line | ||
and '"GET' in parsed_line | ||
and ('302' in parsed_line or '200' in parsed_line) | ||
and ("302" in parsed_line or "200" in parsed_line) |
What happens if one of these appears, for instance, in a date or id? Do we expect it at a specific spot of the parsed_line?
Good question.
Here is an example of a log line:
2024-12-17T00:00:19.439613+01:00 slb-04 haproxy[234222]: X.X.X.X[17/Dec/2024:00:00:19.431] DATAGOUVFR_RGS~ DATAGOUVFR/XX 0/0/1/3/+4 302 +415 - - --NN 123/4/5/0/0 0/0 "GET /fr/datasets/r/1f930c9a-5c15-4aea-a961-baa98f176dcb HTTP/1.1"
This is the original logic. I did not change it, but I think it should be safe: looking at the log structure, it would be very unlikely to find another lone 302 or 200. parsed_line is the split of the log line on spaces.
It also raises a question: should we really include the 302 code?
A 302 will redirect to another page until it reaches a 200 (or an error), right?
So won't we count the traffic twice?
Example:
2024-11-13T23:09:16.940506+01:00 slb-04 haproxy[1234]: X.X.X.X [13/Nov/2024:23:09:16.929] DATAGOUVFR_RGS~ DATAGOUVFR/prod 0/0/1/2/+3 302 +432 - - --NN 313/227/5/3/0 0/0 "GET /datasets/57868ea9a3a7295d371adcfe/ HTTP/1.1"
2024-11-13T23:09:17.517371+01:00 slb-04 haproxy[1234]: X.X.X.X [13/Nov/2024:23:09:17.039] DATAGOUVFR_RGS~ DATAGOUVFR/prod 0/0/2/324/+234 200 +1234 - - --NR 315/229/5/2/0 0/0 "GET /fr/datasets/logements-sociaux-subventionnes-dans-les-communes-ayant-moins-de-25-de-logements-sociaux/ HTTP/1.1"
What do you think?
Regarding the first message: I'm confused, isn't parsed_line the line itself as a string? In which case it's not unlikely to find such substrings within it, right?
About the 302, that's a good question, maybe rather one for the end-users of the metrics. I summon @maudetes 🪄
About the first message:
Amazing catch!
Indeed, I removed the .split() in favour of a more robust regex, to avoid applying the pattern logic to each element of the log. But doing so reduced the robustness of:
"DATAGOUVFR_RGS~" in parsed_line
and '"GET' in parsed_line
and ("302" in parsed_line or "200" in parsed_line)
Edit: I enriched the regex to skip the if:
path = re.search(r" DATAGOUVFR_RGS~ .* (200|302) .* \"GET (/[^\s]+)", parsed_line)
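As a quick sanity check (a standalone sketch, not the DAG code itself), the regex can be applied to the example log line quoted earlier in this thread; group 1 is the status code and group 2 the requested path:

import re

# Example HAProxy log line from the discussion above
log_line = (
    '2024-12-17T00:00:19.439613+01:00 slb-04 haproxy[234222]: '
    'X.X.X.X[17/Dec/2024:00:00:19.431] DATAGOUVFR_RGS~ DATAGOUVFR/XX '
    '0/0/1/3/+4 302 +415 - - --NN 123/4/5/0/0 0/0 '
    '"GET /fr/datasets/r/1f930c9a-5c15-4aea-a961-baa98f176dcb HTTP/1.1"'
)

match = re.search(r" DATAGOUVFR_RGS~ .* (200|302) .* \"GET (/[^\s]+)", log_line)
if match:
    print(match.group(1))  # 302
    print(match.group(2))  # /fr/datasets/r/1f930c9a-5c15-4aea-a961-baa98f176dcb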
Regarding the 302, I don't remember in detail how we've managed it, but I see two cases:
- a 302 leading to an external URL. Example for this resource
$ curl --head https://www.data.gouv.fr/fr/datasets/r/2abb2c99-f953-4ac9-83ce-17963c04dc9f
HTTP/2 302
server: nginx
date: Mon, 06 Jan 2025 15:58:21 GMT
content-type: text/html; charset=utf-8
content-length: 369
location: https://download.data.grandlyon.com/files/rdata/tcl_sytral.tcltarification/Tarification.csv
...
It won't count twice since the redirect points towards download.data.grandlyon.com, and not data.gouv.fr.
- a 302 leading to an internal URL. Example for this resource
$ curl --head https://www.data.gouv.fr/fr/datasets/r/087ec735-74fd-48a7-a82e-0b1cd3ea6fe9
HTTP/2 302
server: nginx
date: Mon, 06 Jan 2025 16:00:53 GMT
content-type: text/html; charset=utf-8
content-length: 387
location: https://static.data.gouv.fr/resources/demandes-de-valeurs-foncieres/20221017-153257/faq-20221017.pdf
In this case, the subsequent request to static.data.gouv.fr will also appear in the logs.
From what I remember, a dedicated logic was supposed to be applied to de-duplicate these logs in particular.
I am not sure how or whether it was implemented though :p We should make sure it works as expected.
Thanks a lot for the additional details @maudetes!
From what I remember, a dedicated logic was supposed to be applied to de-duplicate these logs in particular.
I am not sure how or whether it was implemented though :p We should make sure it works as expected.
Indeed, this has not been done yet.
Thanks to your first example, I understand that including redirects is quite important, so I won't change the current logic yet.
But I will create an issue so that we take the second example into account in the future.
With the current log format this may be hard to implement though, since we don't know where a 302 leads to. We will have to check with the devops team whether we can enrich the logs first, or we will have to use a list of resources redirecting to outbound links.
I'm just pinging @geoffreyaldebert here in case he has some inputs on this when he comes back :)
dgv/metrics/task_functions.py
static_slug_line = (
    f"https://static.data.gouv.fr{url_match}".replace(";", "")
)
static_obj_type = segment
I am not sure I understand: why not static_obj_type = obj_config.type?
Good question! I need to comment this part of the code better.
The static resources are part of the resources object but are treated separately afterward. They are stored in a separate file, found_resources-static.csv, and have their own logic in aggregate_obj_type().
As such, to differentiate them from the "standard" resources, we have to change their type to resources-static instead of the usual resources.
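To illustrate the idea (a hypothetical sketch on my part, classify_hit is not a function from this PR), the tagging could be thought of like this:

def classify_hit(url: str, default_type: str = "resources") -> str:
    # Static hits are tagged "resources-static" so that aggregate_obj_type()
    # can later route them to found_resources-static.csv, separately from the
    # "standard" resources.
    if "static.data.gouv.fr" in url:
        return "resources-static"
    return default_type

print(classify_hit("https://static.data.gouv.fr/resources/demandes-de-valeurs-foncieres/20221017-153257/faq-20221017.pdf"))
# -> resources-static
print(classify_hit("/fr/datasets/r/1f930c9a-5c15-4aea-a961-baa98f176dcb"))
# -> resources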
dgv/metrics/task_functions.py
f"https://static.data.gouv.fr{url_match}".replace(";", "") | ||
) | ||
static_obj_type = segment | ||
static_segment = segment |
Would we like to return here like in the other case?
resources-static entries are a last-resort option, that's why the return for the static variables is only done if no other pattern matches the log line.
It is currently implemented like this in the code and I did not challenge it.
Do you think this is not useful? Like, if we have a match on static.data.gouv.fr, is it very unlikely that any other pattern would match?
I can try to test that assumption on the data I am using (2024-12-17) 🤔
Huge work 👏
utils/utils_test.py
Nice to introduce tests 👏 I am wondering what the best structure is if we want to introduce more: should we have a dedicated tests folder at the root, or rather test files closer to the files that contain the tested features (like you're doing here with dgv/metrics/task_function_test.py)? 🤔
I like the convention where unit tests live close to the tested functions or methods, while E2E and integration tests go in a root tests folder.
It makes it easier to identify the tests of the functions we are working on, and unit tests are also good documentation of how a function works.
What do you think?
utils/filesystem.py
def save_list_of_dict_to_csv(
    records_list: list[dict[str, str]], destination_file: str
Do the keys and values need to be str?
Good point, a value can be anything!
I will leave the keys as str though, since those are the CSV header. So even if something other than a string would technically be OK, it should be done deliberately by converting it to a string first, in my opinion.
csv_writer = csv.DictWriter(csv_file, records_list[0].keys(), delimiter=";")
if not file_exists:
    csv_writer.writeheader()
for row in records_list:
Would we want to make sure all columns match, or do we say we'll use it with full knowledge of the facts?
It currently raises an error if one of the dict keys does not match. I will enrich the docstring to state that all dicts are expected to follow the same format.
But would you prefer it to behave another way?
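For reference, here is a minimal sketch of what such a helper could look like, based on the signature and the DictWriter snippet quoted in this thread (the append mode and the file_exists handling are assumptions on my part, not necessarily the PR's exact implementation):

import csv
import os


def save_list_of_dict_to_csv(
    records_list: list[dict[str, str]], destination_file: str
) -> None:
    # All dicts are expected to share the same keys, which become the CSV header.
    # DictWriter raises a ValueError if a row contains a key not in the header.
    file_exists = os.path.isfile(destination_file)  # assumption: append-friendly helper
    with open(destination_file, "a", newline="") as csv_file:
        csv_writer = csv.DictWriter(csv_file, records_list[0].keys(), delimiter=";")
        if not file_exists:
            csv_writer.writeheader()
        for row in records_list:
            csv_writer.writerow(row)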
df["id"] = df["id"].apply( | ||
lambda x: catalog_dict[x] if x in catalog_dict else None | ||
) |
We could do:
df["id"] = df["id"].apply( | |
lambda x: catalog_dict[x] if x in catalog_dict else None | |
) | |
df["id"] = df["id"].apply( | |
lambda x: catalog_dict[x] if x in catalog_dict.values() else None | |
) |
so that we don't need to store {id: id} in get_catalog_id_mapping.
Related to a previous comment: on top of mapping slugs, we also want to make sure the IDs do exist in the catalog. I will add a small comment on top for clarity!
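To make that concrete, here is a small hypothetical illustration (the slug and the mapping values are made up) of why storing both {slug: id} and {id: id} lets a single lookup resolve slugs and drop IDs that are missing from the catalog:

import pandas as pd

# Hypothetical excerpt of the mapping returned by get_catalog_id_mapping
catalog_dict = {
    "some-dataset-slug": "57868ea9a3a7295d371adcfe",         # slug -> id
    "57868ea9a3a7295d371adcfe": "57868ea9a3a7295d371adcfe",  # id -> id
}

df = pd.DataFrame({"id": ["some-dataset-slug", "57868ea9a3a7295d371adcfe", "unknown-id"]})
df["id"] = df["id"].apply(lambda x: catalog_dict[x] if x in catalog_dict else None)
# -> the slug is resolved to its id, the known id is kept, the unknown id becomes None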
logging.info(f">> {n_logs_processed} log processed.")


def aggregate_obj_type(log_date: str, obj_config: DataGouvLog) -> list[str]:
I am confused about why we have task_functions.py and task.py, but task.py also contains sub-task functions 😅
I am confused too haha
I wanted to merge aggregate_obj_type and aggregate_log but forgot. I will fix that quickly.
Also update copy_object() method, only used in the metrics DAG so far.
Including save_list_of_dict_to_csv() function
+ refactor of the code to make it more modular and easier to navigate. + add a split between languages and API segments. BREAKING CHANGE: Migration of a few tables to add new columns.
Adds dataservices to the measures computed in the metrics DAG from the HAProxy logs.
New tables and views are available in the Postgres database to hold this data, following the same model as the existing ones.
More fine-grained measures on the usage of datasets/resources/etc. are added:
- nb_visit_api1
- nb_visit_api2
- nb_visit_apis (sum of api1 and api2)
- nb_visit_fr
- nb_visit_en
- nb_visit_es
- nb_visit_total
nb_visit, which only covers the website, remains identical: nb_visit therefore corresponds to the sum of nb_visit_fr, nb_visit_es and nb_visit_en.
The code has been refactored to make it more modular, and a few tests have been added on critical parts of the code.
Impact
Changes to the metrics on January 22, 2025:
Bug fix
https://www.data.gouv.fr/fr/datasets/obsolete-le-calendrier-scolaire-2019-2020-et-2020-2021-format-ical/ now counts as a visit for the corresponding dataset. The impact on the figures can be significant: for the example dataset, we go from 8 visits to 284 over the day of January 22.
Breaking Changes
/!\ Before merging, run the following query on the Postgres DB: