Improvements arising from suggestions evt #132

eugenividal · 2025-01-14T21:21:17Z

@e-kotov and @Robinlovelace, in this PR I've:

In the README:
Rephrased the paragraph explaining the different versions of the data based on the improvements on the paper.
In the vignettes:
Renamed the mobility datasets as follows:
+ Version 1, Mobility data:
- Origin-Destination data
- Population by trip count data
+ Version 2, Mobility data:
- Origin-Destination data
- Population by trip count data
- Population by overnight stay data

These were the main tasks from #129. As for the rest, I found some issues:

In the README:
Given that the title is done with one #, add one level (i.e. ##) to the headings 1. I've tried this but there are several qmd files linked to the README, so to make it work we would have to lower one level in all the files, or we can just forget about this very minor issue. It is just aesthetics
plot the desire lines using the flow mapper.
Not sure where this plot comes from
Remove the title of the first figure --- there is already a caption with title for this figure.
Same as above
Beautify Figure 4, checking for typos (e.g. munucipality, houlry), putting the first three boxes at the same level, etc.
Same as above
In the package:

These suggestions regarding function names were also very minor. If you agree, better if you implement them @e-kotov. I am not sure how to rename functions.

Fix inconsistencies in the name of the variables of the different versions: e.g. date and full_date, and residence_province_name and residence_province.
I'd consider renaming some variables: E.g. time slot = hour_slot and distance = interval_distance

Robinlovelace

👍 from me, thoughts @e-kotov ?

e-kotov · 2025-01-14T23:08:39Z

thoughts

@Robinlovelace Wondering why ubuntu devel suddenly fails, though @eugenividal did not touch the code ) Probably has something to do with ubuntu latest moving to v24 in github actions... Have not studied the logs yet.

@eugenividal are you hesitating to touch the code and the docs?

Some of the changes in the vignette are consistent with the package docs and argument options.

'Population by trip count data' - I'm not sure this reads well. In the package docs we refer to it as "number_of_trips" (see help for spod_get, spod_download, spod_convert). So if we change it in vignette, we need to make it consistent with the package options and docs. But the package is already released. Not that many people have downloaded it, but I would be hesitant to suddenly change the arguments and make breaking changes. So at this point I am hesitant to change the vignette in this way, especially if the new heading does not read well (to me at least it does not). One of the problems is this specific data set that has this weird form/grouping of number or people by number of trips as a factor... Could we try alternative headings for this one that would be easy to understand but also consistent with the docs?

eugenividal · 2025-01-15T08:55:08Z

Hi @e-kotov, yeah, not sure what happened with this PR.
The most important change is the definition of the data versions in the README, which is implemented in the files changed. This part is now correct—previously, it was misleading regarding the differences between versions.
As for the name of the current 2.2. 'number of trips' data in the vignettes, I find it a bit confusing. Let me explain this. For instance, in version 1, the first dataset (maestra1) provides the number of trips (and trip kilometers) per origin and destination, we call this origin destination data (which seems ok). In contrast, the second dataset (maestra2) gives the number of people by trip count category (0, 1, 2, 2+), we call this number of trips data. Referring to maestra2 as 'number of trips' can be a little misleading since it actually represents the number of individuals per trip frequency, while maestra1 actually provides the number of trips by origin and destination. Do you see what I mean? That said, it’s probably not incorrect—just a bit ambiguous. If it works for you, we can leave it as is. It may be just my own interpretation. Do you find the title of these headings (2.1. origin-destination data and 2.2. number of trips data) in the vignettes correspond well with the tables they show, @Robinlovelace?
Regarding the rest of suggestions, they’re mostly aesthetic and subjective. I didn’t make any changes yet because I couldn’t find the original code for the figures yesterday. Let me know if you’d like me to try to implement these changes another day if you think the suggestions could be useful.

e-kotov · 2025-01-15T10:32:39Z

@eugenividal I checked the main branch and the tests are failing for the main as well, the reason is some bug that I will attend later, it's not critical.

This 'number of trips' is super confusing and hard to explain, but I think we should try )

I would ask for some patience and allow me some more time (about a week from now, as I'm a bit busy right now) to add to/review your changes as they may require changes in the package docs and I want it to be a cohesive thing before merging, I don't want to push to the main something that will be immediately reflected on the package website and may not represent what the CRAN version of the package does and says in the docs.

eugenividal · 2025-01-15T12:18:41Z

@e-kotov, absolutely, no rush and agree with your comments, better not to change anything unless it is a clear and worth it improvement! The package is already great. These are just minor suggestions that we can implement only if they make it better and do not represent big changes.

I agree, the 'number of trips' is quite confusing! :) But we'll find a way to clarify it. The dataset is really complex, and sometimes complexity can't be easily simplified.

I just did this experiment: I asked ChatGPT to suggest titles for the maestra 1 and 2 tables of version 1, but I think it struggled too. Here are the responses:

maestra 1: Travel Activity Data: Origins, Destinations, and Trip Details by Residence and Age Group
maestra 2: Number of Trips and People by District and Trip Category

I wouldn't say maestra 2 provides the 'Number of Trips and People by District and Trip Category', but only the 'Number People by District and Trip Category'.

e-kotov · 2025-01-28T15:50:22Z

I'd consider renaming some variables: E.g. time slot = hour_slot and distance = interval_distance

I would actually simplify it to just hour.

e.g. date and full_date, and residence_province_name and residence_province.

There are no inconsistencies in the package code and output. What you might have found is inconsistencies in the vignettes because they were created on different dates.

Here's the up to date output for v1 and v2 data:

od1 <- spod_get("od", zones = "distr", dates = "2021-01-01")
od2 <- spod_get("od", zones = "distr", dates = "2022-01-01")


> od1 |> head() |> collect() |> glimpse()
Rows: 6
Columns: 14
$ date                        <date> 2021-01-01, 2021-01-01, 2021-01-01, 2021-01-01, 2021-01-01, 2021-01-01
$ id_origin                   <fct> 01001_AM, 01001_AM, 01001_AM, 01001_AM, 01001_AM, 01001_AM
$ id_destination              <fct> 01001_AM, 01001_AM, 01001_AM, 01001_AM, 01001_AM, 01001_AM
$ activity_origin             <fct> home, home, home, home, home, home
$ activity_destination        <fct> other, other, other, other, other, other
$ residence_province_ine_code <fct> 01, 01, 01, 01, 01, 01
$ residence_province_name     <fct> "Araba/Álava", "Araba/Álava", "Araba/Álava", "Araba/Álava", "Araba/Álava", "Araba/Álava"
$ time_slot                   <int> 0, 1, 2, 2, 3, 7
$ distance                    <fct> 002-005, 002-005, 002-005, 010-050, 002-005, 010-050
$ n_trips                     <dbl> 24.265, 37.347, 8.926, 5.811, 37.347, 8.926
$ trips_total_length_km       <dbl> 89.484, 147.157, 32.908, 75.248, 151.922, 90.750
$ year                        <int> 2021, 2021, 2021, 2021, 2021, 2021
$ month                       <int> 1, 1, 1, 1, 1, 1
$ day                         <int> 1, 1, 1, 1, 1, 1




> od2 |> head() |> collect() |> glimpse()
Rows: 6
Columns: 19
$ date                        <date> 2022-01-01, 2022-01-01, 2022-01-01, 2022-01-01, 2022-01-01, 2022-01-01
$ time_slot                   <int> 0, 0, 0, 0, 0, 0
$ id_origin                   <fct> 01001, 01001, 01001, 01001, 01001, 01001
$ id_destination              <fct> 01009_AM, 01009_AM, 01009_AM, 01009_AM, 01009_AM, 01009_AM
$ distance                    <fct> 2-10, 2-10, 2-10, 2-10, 2-10, 2-10
$ activity_origin             <fct> home, infrequent_activity, frequent_activity, frequent_activity, frequent_activity, work…
$ activity_destination        <fct> frequent_activity, home, home, frequent_activity, frequent_activity, home
$ study_possible_origin       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE
$ study_possible_destination  <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE
$ residence_province_ine_code <fct> 01, 01, 01, 01, 01, 01
$ residence_province_name     <fct> "Araba/Álava", "Araba/Álava", "Araba/Álava", "Araba/Álava", "Araba/Álava", "Araba/Álava"
$ income                      <fct> 10-15, >15, >15, >15, >15, >15
$ age                         <fct> NA, NA, NA, 25-45, 65-100, NA
$ sex                         <fct> NA, NA, NA, male, female, NA
$ n_trips                     <dbl> 2.605, 2.212, 12.146, 2.997, 2.401, 1.041
$ trips_total_length_km       <dbl> 9.457, 5.645, 65.663, 16.093, 16.179, 4.629
$ year                        <int> 2022, 2022, 2022, 2022, 2022, 2022
$ month                       <int> 1, 1, 1, 1, 1, 1
$ day                         <int> 1, 1, 1, 1, 1, 1
>

So what I see is that:

time_slot should be hour
hour should come right after date in both tables
other columns that are the same between tables should also be in the same position relative to other columns

Robinlovelace · 2025-01-28T15:52:00Z

resolving the conflicts now..

…m:rOpenSpain/spanishoddata into improvements-arising-from-suggestions-evt

e-kotov · 2025-01-28T16:08:56Z

Updated as per comments above:

> od1 |> head() |> collect() |> glimpse()
Rows: 6
Columns: 14
$ date                        <date> 2021-01-01, 2021-01-01, 2021-01-01, 2021-01-01, 2021-01-01, 2021-01-01
$ hour                        <int> 0, 1, 2, 2, 3, 7
$ id_origin                   <fct> 01001_AM, 01001_AM, 01001_AM, 01001_AM, 01001_AM, 01001_AM
$ id_destination              <fct> 01001_AM, 01001_AM, 01001_AM, 01001_AM, 01001_AM, 01001_AM
$ distance                    <fct> 002-005, 002-005, 002-005, 010-050, 002-005, 010-050
$ activity_origin             <fct> home, home, home, home, home, home
$ activity_destination        <fct> other, other, other, other, other, other
$ residence_province_ine_code <fct> 01, 01, 01, 01, 01, 01
$ residence_province_name     <fct> "Araba/Álava", "Araba/Álava", "Araba/Álava", "Araba/Álava", "Araba/Álava", "Araba/Álava"
$ n_trips                     <dbl> 24.265, 37.347, 8.926, 5.811, 37.347, 8.926
$ trips_total_length_km       <dbl> 89.484, 147.157, 32.908, 75.248, 151.922, 90.750
$ year                        <int> 2021, 2021, 2021, 2021, 2021, 2021
$ month                       <int> 1, 1, 1, 1, 1, 1
$ day                         <int> 1, 1, 1, 1, 1, 1

> od2 |> head() |> collect() |> glimpse()
Rows: 6
Columns: 19
$ date                        <date> 2022-01-01, 2022-01-01, 2022-01-01, 2022-01-01, 2022-01-01, 2022-01-01
$ hour                        <int> 0, 0, 0, 0, 0, 0
$ id_origin                   <fct> 01001, 01001, 01001, 01001, 01001, 01001
$ id_destination              <fct> 01009_AM, 01009_AM, 01009_AM, 01009_AM, 01009_AM, 01009_AM
$ distance                    <fct> 2-10, 2-10, 2-10, 2-10, 2-10, 2-10
$ activity_origin             <fct> home, infrequent_activity, frequent_activity, frequent_activity, frequent_activity, wo…
$ activity_destination        <fct> frequent_activity, home, home, frequent_activity, frequent_activity, home
$ study_possible_origin       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE
$ study_possible_destination  <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE
$ residence_province_ine_code <fct> 01, 01, 01, 01, 01, 01
$ residence_province_name     <fct> "Araba/Álava", "Araba/Álava", "Araba/Álava", "Araba/Álava", "Araba/Álava", "Araba/Álava"
$ income                      <fct> 10-15, >15, >15, >15, >15, >15
$ age                         <fct> NA, NA, NA, 25-45, 65-100, NA
$ sex                         <fct> NA, NA, NA, male, female, NA
$ n_trips                     <dbl> 2.605, 2.212, 12.146, 2.997, 2.401, 1.041
$ trips_total_length_km       <dbl> 9.457, 5.645, 65.663, 16.093, 16.179, 4.629
$ year                        <int> 2022, 2022, 2022, 2022, 2022, 2022
$ month                       <int> 1, 1, 1, 1, 1, 1
$ day                         <int> 1, 1, 1, 1, 1, 1

e-kotov · 2025-01-28T16:11:06Z

time slot = hour_slot

technically, this may be a breaking change if anyone already has a production code that uses the time_slot column name. So I am a bit hesitant even about the commit i made to change it to hour.

distance = interval_distance

I would not do this because (a) it would also be a breaking change and (b) the type of variable is factor and we have detailed docs on what it does, so I don't see a good reason to make the name longer.

e-kotov · 2025-01-28T16:36:40Z

so to make it work we would have to lower one level in all the files

done

plot the desire lines using the flow mapper.
Not sure where this plot comes from

Comes from the disaggregation vignette. I would suggest this to be solved in a separate issue.

Remove the title of the first figure --- there is already a caption with title for this figure.
Same as above

Honestly, I don't have the code for this, as it was done hastily just to paste something. I might not have kept it. I would create a new issue for this to address this eventually.

Beautify Figure 4, checking for typos (e.g. munucipality, houlry), putting the first three boxes at the same level, etc.

probably also in the disaggregation vignette

@Robinlovelace @eugenividal I would merge it as it is with my latest commits and spin off the remaining minor problems into separate issues. These can be addressed eventually or perhaps even during a hackathon for first time contributors https://ropensci.org/blog/2025/01/24/coworking-hackathons/ (I am taking part as a package author) - for this event in March I was asked to prepare a few easy to solve issues in the package repo.

I'm still a but hesitant to change the column name time_slot... @Robinlovelace what do you think? 300+ people who have already started using the package may be unhappy about this when the next release hits CRAN.

e-kotov

approved with my changes (see commits)

eugenividal · 2025-01-28T20:16:16Z

All changes and decisions taken sound good @e-kotov 👌

There are no inconsistencies in the package code and output. What you might have found is inconsistencies in the vignettes because they were created on different dates.

You are right, I found the inconsistencies in the vignettes, I did not test the package.

eugenividal added 3 commits January 14, 2025 20:55

rephrase paragraph and add level

a96ab55

rename mobility data v1

31b0bb2

rename mobility datasets v2

59a0ddf

eugenividal requested review from e-kotov and Robinlovelace and removed request for e-kotov January 14, 2025 21:21

Robinlovelace approved these changes Jan 14, 2025

View reviewed changes

rename time_slot to hour

320393a

Robinlovelace and others added 4 commits January 28, 2025 15:52

Merge branch 'main' into improvements-arising-from-suggestions-evt

6319d02

move hour and distance in v1 to same positions as in v2

4c1950c

Merge branch 'improvements-arising-from-suggestions-evt' of github.co…

a8fe6d1

…m:rOpenSpain/spanishoddata into improvements-arising-from-suggestions-evt

update hour and distance col positions in v1 codebook

f8ca613

bump heading levels

b2c75bb

e-kotov approved these changes Jan 28, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improvements arising from suggestions evt #132

Improvements arising from suggestions evt #132

eugenividal commented Jan 14, 2025 •

edited by e-kotov

Loading

Robinlovelace left a comment

e-kotov commented Jan 14, 2025

eugenividal commented Jan 15, 2025 •

edited

Loading

e-kotov commented Jan 15, 2025

eugenividal commented Jan 15, 2025 •

edited

Loading

e-kotov commented Jan 28, 2025 •

edited

Loading

Robinlovelace commented Jan 28, 2025

e-kotov commented Jan 28, 2025

e-kotov commented Jan 28, 2025

e-kotov commented Jan 28, 2025

e-kotov left a comment

eugenividal commented Jan 28, 2025

Improvements arising from suggestions evt #132

Are you sure you want to change the base?

Improvements arising from suggestions evt #132

Conversation

eugenividal commented Jan 14, 2025 • edited by e-kotov Loading

Robinlovelace left a comment

Choose a reason for hiding this comment

e-kotov commented Jan 14, 2025

eugenividal commented Jan 15, 2025 • edited Loading

e-kotov commented Jan 15, 2025

eugenividal commented Jan 15, 2025 • edited Loading

e-kotov commented Jan 28, 2025 • edited Loading

Robinlovelace commented Jan 28, 2025

e-kotov commented Jan 28, 2025

e-kotov commented Jan 28, 2025

e-kotov commented Jan 28, 2025

e-kotov left a comment

Choose a reason for hiding this comment

eugenividal commented Jan 28, 2025

eugenividal commented Jan 14, 2025 •

edited by e-kotov

Loading

eugenividal commented Jan 15, 2025 •

edited

Loading

eugenividal commented Jan 15, 2025 •

edited

Loading

e-kotov commented Jan 28, 2025 •

edited

Loading