[DAG metrics] add dataservices metrics #369
Conversation
Great work, much cleaner than before 🧹
dgv/metrics/task_functions.py
"DATAGOUVFR_RGS~" in parsed_line | ||
and '"GET' in parsed_line | ||
and ('302' in parsed_line or '200' in parsed_line) | ||
and ("302" in parsed_line or "200" in parsed_line) |
What happens if one of these appears, for instance, in a date or id? Do we expect it at a specific spot of the parsed_line?
Good question.
Here is an example of a log line:
2024-12-17T00:00:19.439613+01:00 slb-04 haproxy[234222]: X.X.X.X[17/Dec/2024:00:00:19.431] DATAGOUVFR_RGS~ DATAGOUVFR/XX 0/0/1/3/+4 302 +415 - - --NN 123/4/5/0/0 0/0 "GET /fr/datasets/r/1f930c9a-5c15-4aea-a961-baa98f176dcb HTTP/1.1"
This is the original logic. I did not change it, but I think it should be safe: looking at the log structure, it would be very unlikely to find another lone 302 or 200. parsed_line is the split of the log line on spaces.
It also raises a question: should we really include the 302 code?
A 302 will redirect to another page until it reaches a 200 (or an error), right?
So won't we count the traffic twice?
Example:
2024-11-13T23:09:16.940506+01:00 slb-04 haproxy[1234]: X.X.X.X [13/Nov/2024:23:09:16.929] DATAGOUVFR_RGS~ DATAGOUVFR/prod 0/0/1/2/+3 302 +432 - - --NN 313/227/5/3/0 0/0 "GET /datasets/57868ea9a3a7295d371adcfe/ HTTP/1.1"
2024-11-13T23:09:17.517371+01:00 slb-04 haproxy[1234]: X.X.X.X [13/Nov/2024:23:09:17.039] DATAGOUVFR_RGS~ DATAGOUVFR/prod 0/0/2/324/+234 200 +1234 - - --NR 315/229/5/2/0 0/0 "GET /fr/datasets/logements-sociaux-subventionnes-dans-les-communes-ayant-moins-de-25-de-logements-sociaux/ HTTP/1.1"
What do you think?
Regarding the first message: I'm confused, isn't parsed_line the line itself as a string? In which case it's not unlikely to find such substrings within it, right?
About the 302, that's a good question, maybe rather one for the end-users of the metrics. I summon @maudetes 🪄
About the first message:
Amazing catch!
Indeed, I removed the .split() in favour of a more robust regex, to avoid applying the pattern logic to each element of the log. But doing so reduced the robustness of:
"DATAGOUVFR_RGS~" in parsed_line
and '"GET' in parsed_line
and ("302" in parsed_line or "200" in parsed_line)
Edit: I enriched the regex to skip the if:
path = re.search(r" DATAGOUVFR_RGS~ .* (200|302) .* \"GET (/[^\s]+)", parsed_line)
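As a quick sanity check (a standalone sketch, not the DAG code itself), the regex can be applied to the example log line quoted earlier in this thread; group 1 is the status code and group 2 the requested path:

import re

# Example HAProxy log line from the discussion above
log_line = (
    '2024-12-17T00:00:19.439613+01:00 slb-04 haproxy[234222]: '
    'X.X.X.X[17/Dec/2024:00:00:19.431] DATAGOUVFR_RGS~ DATAGOUVFR/XX '
    '0/0/1/3/+4 302 +415 - - --NN 123/4/5/0/0 0/0 '
    '"GET /fr/datasets/r/1f930c9a-5c15-4aea-a961-baa98f176dcb HTTP/1.1"'
)

match = re.search(r" DATAGOUVFR_RGS~ .* (200|302) .* \"GET (/[^\s]+)", log_line)
if match:
    print(match.group(1))  # 302
    print(match.group(2))  # /fr/datasets/r/1f930c9a-5c15-4aea-a961-baa98f176dcb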
Regarding the 302, I don't remember in detail how we've managed it, but I see two cases:
- a 302 leading to an external URL. Example for this resource
$ curl --head https://www.data.gouv.fr/fr/datasets/r/2abb2c99-f953-4ac9-83ce-17963c04dc9f
HTTP/2 302
server: nginx
date: Mon, 06 Jan 2025 15:58:21 GMT
content-type: text/html; charset=utf-8
content-length: 369
location: https://download.data.grandlyon.com/files/rdata/tcl_sytral.tcltarification/Tarification.csv
...
It won't count twice since the redirect points towards download.data.grandlyon.com, and not data.gouv.fr.
- a 302 leading to an internal URL. Example for this resource
$ curl --head https://www.data.gouv.fr/fr/datasets/r/087ec735-74fd-48a7-a82e-0b1cd3ea6fe9
HTTP/2 302
server: nginx
date: Mon, 06 Jan 2025 16:00:53 GMT
content-type: text/html; charset=utf-8
content-length: 387
location: https://static.data.gouv.fr/resources/demandes-de-valeurs-foncieres/20221017-153257/faq-20221017.pdf
In this case, the subsequent request to static.data.gouv.fr will also appear in the logs.
From what I remember, a dedicated logic was supposed to be applied to de-duplicate these logs in particular.
I am not sure how or whether it was implemented though :p We should make sure it works as expected.
Thanks a lot for the additional details @maudetes!
From what I remember, a dedicated logic was supposed to be applied to de-duplicate these logs in particular.
I am not sure how or whether it was implemented though :p We should make sure it works as expected.
Indeed, this has not been done yet.
Thanks to your first example, I understand that including redirects is quite important, so I won't change the current logic yet.
But I will create an issue so that we take the second example into account in the future.
With the current log format this may be hard to implement though, since we don't know where a 302 leads to. We will have to check with the devops team whether we can enrich the logs first, or we will have to use a list of resources redirecting to outbound links.
I'm just pinging @geoffreyaldebert here in case he has some inputs on this when he comes back :)
dgv/metrics/task_functions.py
static_slug_line = (
    f"https://static.data.gouv.fr{url_match}".replace(";", "")
)
static_obj_type = segment
I am not sure I understand: why not static_obj_type = obj_config.type?
Good question! I need to comment this part of the code better.
The static resources are part of the resources object but are treated separately afterward. They are stored in a separate file, found_resources-static.csv, and have their own logic in aggregate_obj_type().
As such, to differentiate them from the "standard" resources, we have to change their type to resources-static instead of the usual resources.
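To illustrate the idea (a hypothetical sketch on my part, classify_hit is not a function from this PR), the tagging could be thought of like this:

def classify_hit(url: str, default_type: str = "resources") -> str:
    # Static hits are tagged "resources-static" so that aggregate_obj_type()
    # can later route them to found_resources-static.csv, separately from the
    # "standard" resources.
    if "static.data.gouv.fr" in url:
        return "resources-static"
    return default_type

print(classify_hit("https://static.data.gouv.fr/resources/demandes-de-valeurs-foncieres/20221017-153257/faq-20221017.pdf"))
# -> resources-static
print(classify_hit("/fr/datasets/r/1f930c9a-5c15-4aea-a961-baa98f176dcb"))
# -> resources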
dgv/metrics/task_functions.py
f"https://static.data.gouv.fr{url_match}".replace(";", "") | ||
) | ||
static_obj_type = segment | ||
static_segment = segment |
Would we like to return here like in the other case?
resources-static entries are a last-resort option, that's why the return for the static variables is only done if no other pattern matches the log line.
It is currently implemented like this in the code and I did not challenge it.
Do you think this is not useful? Like, if we have a match on static.data.gouv.fr, is it very unlikely that any other pattern would match?
I can try to test that assumption on the data I am using (2024-12-17) 🤔
Huge work 👏
utils/utils_test.py
Nice to introduce tests 👏 I am wondering what the best structure is if we want to introduce more: should we have a dedicated tests folder at the root, or rather test files closer to the files that contain the tested features (like you're doing here with dgv/metrics/task_function_test.py)? 🤔
I like the convention where unit tests live close to the tested functions or methods, while E2E and integration tests go in a root tests folder.
It makes it easier to identify the tests of the functions we are working on, and unit tests are also good documentation of how a function works.
What do you think?
utils/filesystem.py
def save_list_of_dict_to_csv(
    records_list: list[dict[str, str]], destination_file: str
Do the keys and values need to be str?
Good point, a value can be anything!
I will leave the keys as str though, since those are the CSV header. So even if something other than a string would technically be OK, it should be done deliberately by converting it to a string first, in my opinion.
csv_writer = csv.DictWriter(csv_file, records_list[0].keys(), delimiter=";")
if not file_exists:
    csv_writer.writeheader()
for row in records_list:
Would we want to make sure all columns match, or do we say we'll use it with full knowledge of the facts?
It currently raises an error if one of the dict keys does not match. I will enrich the docstring to state that all dicts are expected to follow the same format.
But would you prefer it to behave another way?
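For reference, here is a minimal sketch of what such a helper could look like, based on the signature and the DictWriter snippet quoted in this thread (the append mode and the file_exists handling are assumptions on my part, not necessarily the PR's exact implementation):

import csv
import os


def save_list_of_dict_to_csv(
    records_list: list[dict[str, str]], destination_file: str
) -> None:
    # All dicts are expected to share the same keys, which become the CSV header.
    # DictWriter raises a ValueError if a row contains a key not in the header.
    file_exists = os.path.isfile(destination_file)  # assumption: append-friendly helper
    with open(destination_file, "a", newline="") as csv_file:
        csv_writer = csv.DictWriter(csv_file, records_list[0].keys(), delimiter=";")
        if not file_exists:
            csv_writer.writeheader()
        for row in records_list:
            csv_writer.writerow(row)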
df["id"] = df["id"].apply( | ||
lambda x: catalog_dict[x] if x in catalog_dict else None | ||
) |
We could do:
df["id"] = df["id"].apply( | |
lambda x: catalog_dict[x] if x in catalog_dict else None | |
) | |
df["id"] = df["id"].apply( | |
lambda x: catalog_dict[x] if x in catalog_dict.values() else None | |
) |
so that we don't need to store {id: id} in get_catalog_id_mapping.
Related to a previous comment: on top of mapping slugs, we also want to make sure the IDs do exist in the catalog. I will add a small comment on top for clarity!
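To make that concrete, here is a small hypothetical illustration (the slug and the mapping values are made up) of why storing both {slug: id} and {id: id} lets a single lookup resolve slugs and drop IDs that are missing from the catalog:

import pandas as pd

# Hypothetical excerpt of the mapping returned by get_catalog_id_mapping
catalog_dict = {
    "some-dataset-slug": "57868ea9a3a7295d371adcfe",         # slug -> id
    "57868ea9a3a7295d371adcfe": "57868ea9a3a7295d371adcfe",  # id -> id
}

df = pd.DataFrame({"id": ["some-dataset-slug", "57868ea9a3a7295d371adcfe", "unknown-id"]})
df["id"] = df["id"].apply(lambda x: catalog_dict[x] if x in catalog_dict else None)
# -> the slug is resolved to its id, the known id is kept, the unknown id becomes None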
logging.info(f">> {n_logs_processed} log processed.")


def aggregate_obj_type(log_date: str, obj_config: DataGouvLog) -> list[str]:
I am confused about why we have task_functions.py and task.py, but task.py also contains sub-task functions 😅
I am confused too haha
I wanted to merge aggregate_obj_type and aggregate_log but forgot. I will fix that quickly.
Also update copy_object() method, only used in the metrics DAG so far.
Including save_list_of_dict_to_csv() function
+ refactor of the code to make it more modular and easier to navigate. + add a split between languages and API segments. BREAKING CHANGE: Migration of a few tables to add new columns.
Adds dataservices to the measures computed in the metrics DAG from the HAProxy logs.
New tables and views are available in the Postgres database to hold this data, following the same model as the existing ones.
More fine-grained measures on the usage of datasets/resources/etc. are added:
- nb_visit_api1
- nb_visit_api2
- nb_visit_apis (sum of api1 and api2)
- nb_visit_fr
- nb_visit_en
- nb_visit_es
- nb_visit_total
nb_visit, which only covers the website, remains identical: nb_visit therefore corresponds to the sum of nb_visit_fr, nb_visit_es and nb_visit_en.
The code has been refactored to make it more modular, and a few tests have been added on critical parts of the code.
Impact
Changes to the metrics on January 22, 2025:
Bug fix
https://www.data.gouv.fr/fr/datasets/obsolete-le-calendrier-scolaire-2019-2020-et-2020-2021-format-ical/ now counts as a visit for the corresponding dataset. The impact on the figures can be significant: for the example dataset, we go from 8 visits to 284 over the day of January 22.
Breaking Changes
/!\ Before merging, run the following query on the Postgres DB: