Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improvements arising from suggestions evt #132

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

eugenividal
Copy link
Collaborator

@eugenividal eugenividal commented Jan 14, 2025

@e-kotov and @Robinlovelace, in this PR I've:

  • In the README:

  • Rephrased the paragraph explaining the different versions of the data based on the improvements on the paper.

  • In the vignettes:

  • Renamed the mobility datasets as follows:
    + Version 1, Mobility data:
    - Origin-Destination data
    - Population by trip count data
    + Version 2, Mobility data:
    - Origin-Destination data
    - Population by trip count data
    - Population by overnight stay data

These were the main tasks from #129. As for the rest, I found some issues:

  • In the README:

  • Given that the title is done with one #, add one level (i.e. ##) to the headings 1. I've tried this but there are several qmd files linked to the README, so to make it work we would have to lower one level in all the files, or we can just forget about this very minor issue. It is just aesthetics

  • plot the desire lines using the flow mapper.
    Not sure where this plot comes from

  • Remove the title of the first figure --- there is already a caption with title for this figure.
    Same as above

  • Beautify Figure 4, checking for typos (e.g. munucipality, houlry), putting the first three boxes at the same level, etc.
    Same as above

  • In the package:

These suggestions regarding function names were also very minor. If you agree, better if you implement them @e-kotov. I am not sure how to rename functions.

  • Fix inconsistencies in the name of the variables of the different versions: e.g. date and full_date, and residence_province_name and residence_province.

  • I'd consider renaming some variables: E.g. time slot = hour_slot and distance = interval_distance

@eugenividal eugenividal requested review from e-kotov and Robinlovelace and removed request for e-kotov January 14, 2025 21:21
Copy link
Collaborator

@Robinlovelace Robinlovelace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 from me, thoughts @e-kotov ?

@e-kotov
Copy link
Member

e-kotov commented Jan 14, 2025

thoughts

@Robinlovelace Wondering why ubuntu devel suddenly fails, though @eugenividal did not touch the code ) Probably has something to do with ubuntu latest moving to v24 in github actions... Have not studied the logs yet.

@eugenividal are you hesitating to touch the code and the docs?

Some of the changes in the vignette are consistent with the package docs and argument options.

'Population by trip count data' - I'm not sure this reads well. In the package docs we refer to it as "number_of_trips" (see help for spod_get, spod_download, spod_convert). So if we change it in vignette, we need to make it consistent with the package options and docs. But the package is already released. Not that many people have downloaded it, but I would be hesitant to suddenly change the arguments and make breaking changes. So at this point I am hesitant to change the vignette in this way, especially if the new heading does not read well (to me at least it does not). One of the problems is this specific data set that has this weird form/grouping of number or people by number of trips as a factor... Could we try alternative headings for this one that would be easy to understand but also consistent with the docs?

@eugenividal
Copy link
Collaborator Author

eugenividal commented Jan 15, 2025

Hi @e-kotov, yeah, not sure what happened with this PR.
The most important change is the definition of the data versions in the README, which is implemented in the files changed. This part is now correct—previously, it was misleading regarding the differences between versions.
As for the name of the current 2.2. 'number of trips' data in the vignettes, I find it a bit confusing. Let me explain this. For instance, in version 1, the first dataset (maestra1) provides the number of trips (and trip kilometers) per origin and destination, we call this origin destination data (which seems ok). In contrast, the second dataset (maestra2) gives the number of people by trip count category (0, 1, 2, 2+), we call this number of trips data. Referring to maestra2 as 'number of trips' can be a little misleading since it actually represents the number of individuals per trip frequency, while maestra1 actually provides the number of trips by origin and destination. Do you see what I mean? That said, it’s probably not incorrect—just a bit ambiguous. If it works for you, we can leave it as is. It may be just my own interpretation. Do you find the title of these headings (2.1. origin-destination data and 2.2. number of trips data) in the vignettes correspond well with the tables they show, @Robinlovelace?
Regarding the rest of suggestions, they’re mostly aesthetic and subjective. I didn’t make any changes yet because I couldn’t find the original code for the figures yesterday. Let me know if you’d like me to try to implement these changes another day if you think the suggestions could be useful.

@e-kotov
Copy link
Member

e-kotov commented Jan 15, 2025

@eugenividal I checked the main branch and the tests are failing for the main as well, the reason is some bug that I will attend later, it's not critical.

This 'number of trips' is super confusing and hard to explain, but I think we should try )

I would ask for some patience and allow me some more time (about a week from now, as I'm a bit busy right now) to add to/review your changes as they may require changes in the package docs and I want it to be a cohesive thing before merging, I don't want to push to the main something that will be immediately reflected on the package website and may not represent what the CRAN version of the package does and says in the docs.

@eugenividal
Copy link
Collaborator Author

eugenividal commented Jan 15, 2025

@e-kotov, absolutely, no rush and agree with your comments, better not to change anything unless it is a clear and worth it improvement! The package is already great. These are just minor suggestions that we can implement only if they make it better and do not represent big changes.

I agree, the 'number of trips' is quite confusing! :) But we'll find a way to clarify it. The dataset is really complex, and sometimes complexity can't be easily simplified.

I just did this experiment: I asked ChatGPT to suggest titles for the maestra 1 and 2 tables of version 1, but I think it struggled too. Here are the responses:

  • maestra 1: Travel Activity Data: Origins, Destinations, and Trip Details by Residence and Age Group
  • maestra 2: Number of Trips and People by District and Trip Category

I wouldn't say maestra 2 provides the 'Number of Trips and People by District and Trip Category', but only the 'Number People by District and Trip Category'.

@e-kotov
Copy link
Member

e-kotov commented Jan 28, 2025

  • I'd consider renaming some variables: E.g. time slot = hour_slot and distance = interval_distance

I would actually simplify it to just hour.

  • e.g. date and full_date, and residence_province_name and residence_province.

There are no inconsistencies in the package code and output. What you might have found is inconsistencies in the vignettes because they were created on different dates.

Here's the up to date output for v1 and v2 data:

od1 <- spod_get("od", zones = "distr", dates = "2021-01-01")
od2 <- spod_get("od", zones = "distr", dates = "2022-01-01")


> od1 |> head() |> collect() |> glimpse()
Rows: 6
Columns: 14
$ date                        <date> 2021-01-01, 2021-01-01, 2021-01-01, 2021-01-01, 2021-01-01, 2021-01-01
$ id_origin                   <fct> 01001_AM, 01001_AM, 01001_AM, 01001_AM, 01001_AM, 01001_AM
$ id_destination              <fct> 01001_AM, 01001_AM, 01001_AM, 01001_AM, 01001_AM, 01001_AM
$ activity_origin             <fct> home, home, home, home, home, home
$ activity_destination        <fct> other, other, other, other, other, other
$ residence_province_ine_code <fct> 01, 01, 01, 01, 01, 01
$ residence_province_name     <fct> "Araba/Álava", "Araba/Álava", "Araba/Álava", "Araba/Álava", "Araba/Álava", "Araba/Álava"
$ time_slot                   <int> 0, 1, 2, 2, 3, 7
$ distance                    <fct> 002-005, 002-005, 002-005, 010-050, 002-005, 010-050
$ n_trips                     <dbl> 24.265, 37.347, 8.926, 5.811, 37.347, 8.926
$ trips_total_length_km       <dbl> 89.484, 147.157, 32.908, 75.248, 151.922, 90.750
$ year                        <int> 2021, 2021, 2021, 2021, 2021, 2021
$ month                       <int> 1, 1, 1, 1, 1, 1
$ day                         <int> 1, 1, 1, 1, 1, 1




> od2 |> head() |> collect() |> glimpse()
Rows: 6
Columns: 19
$ date                        <date> 2022-01-01, 2022-01-01, 2022-01-01, 2022-01-01, 2022-01-01, 2022-01-01
$ time_slot                   <int> 0, 0, 0, 0, 0, 0
$ id_origin                   <fct> 01001, 01001, 01001, 01001, 01001, 01001
$ id_destination              <fct> 01009_AM, 01009_AM, 01009_AM, 01009_AM, 01009_AM, 01009_AM
$ distance                    <fct> 2-10, 2-10, 2-10, 2-10, 2-10, 2-10
$ activity_origin             <fct> home, infrequent_activity, frequent_activity, frequent_activity, frequent_activity, work$ activity_destination        <fct> frequent_activity, home, home, frequent_activity, frequent_activity, home
$ study_possible_origin       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE
$ study_possible_destination  <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE
$ residence_province_ine_code <fct> 01, 01, 01, 01, 01, 01
$ residence_province_name     <fct> "Araba/Álava", "Araba/Álava", "Araba/Álava", "Araba/Álava", "Araba/Álava", "Araba/Álava"
$ income                      <fct> 10-15, >15, >15, >15, >15, >15
$ age                         <fct> NA, NA, NA, 25-45, 65-100, NA
$ sex                         <fct> NA, NA, NA, male, female, NA
$ n_trips                     <dbl> 2.605, 2.212, 12.146, 2.997, 2.401, 1.041
$ trips_total_length_km       <dbl> 9.457, 5.645, 65.663, 16.093, 16.179, 4.629
$ year                        <int> 2022, 2022, 2022, 2022, 2022, 2022
$ month                       <int> 1, 1, 1, 1, 1, 1
$ day                         <int> 1, 1, 1, 1, 1, 1
>

So what I see is that:

  • time_slot should be hour
  • hour should come right after date in both tables
  • other columns that are the same between tables should also be in the same position relative to other columns

@Robinlovelace
Copy link
Collaborator

resolving the conflicts now..

@e-kotov
Copy link
Member

e-kotov commented Jan 28, 2025

Updated as per comments above:

> od1 |> head() |> collect() |> glimpse()
Rows: 6
Columns: 14
$ date                        <date> 2021-01-01, 2021-01-01, 2021-01-01, 2021-01-01, 2021-01-01, 2021-01-01
$ hour                        <int> 0, 1, 2, 2, 3, 7
$ id_origin                   <fct> 01001_AM, 01001_AM, 01001_AM, 01001_AM, 01001_AM, 01001_AM
$ id_destination              <fct> 01001_AM, 01001_AM, 01001_AM, 01001_AM, 01001_AM, 01001_AM
$ distance                    <fct> 002-005, 002-005, 002-005, 010-050, 002-005, 010-050
$ activity_origin             <fct> home, home, home, home, home, home
$ activity_destination        <fct> other, other, other, other, other, other
$ residence_province_ine_code <fct> 01, 01, 01, 01, 01, 01
$ residence_province_name     <fct> "Araba/Álava", "Araba/Álava", "Araba/Álava", "Araba/Álava", "Araba/Álava", "Araba/Álava"
$ n_trips                     <dbl> 24.265, 37.347, 8.926, 5.811, 37.347, 8.926
$ trips_total_length_km       <dbl> 89.484, 147.157, 32.908, 75.248, 151.922, 90.750
$ year                        <int> 2021, 2021, 2021, 2021, 2021, 2021
$ month                       <int> 1, 1, 1, 1, 1, 1
$ day                         <int> 1, 1, 1, 1, 1, 1
> od2 |> head() |> collect() |> glimpse()
Rows: 6
Columns: 19
$ date                        <date> 2022-01-01, 2022-01-01, 2022-01-01, 2022-01-01, 2022-01-01, 2022-01-01
$ hour                        <int> 0, 0, 0, 0, 0, 0
$ id_origin                   <fct> 01001, 01001, 01001, 01001, 01001, 01001
$ id_destination              <fct> 01009_AM, 01009_AM, 01009_AM, 01009_AM, 01009_AM, 01009_AM
$ distance                    <fct> 2-10, 2-10, 2-10, 2-10, 2-10, 2-10
$ activity_origin             <fct> home, infrequent_activity, frequent_activity, frequent_activity, frequent_activity, wo…
$ activity_destination        <fct> frequent_activity, home, home, frequent_activity, frequent_activity, home
$ study_possible_origin       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE
$ study_possible_destination  <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE
$ residence_province_ine_code <fct> 01, 01, 01, 01, 01, 01
$ residence_province_name     <fct> "Araba/Álava", "Araba/Álava", "Araba/Álava", "Araba/Álava", "Araba/Álava", "Araba/Álava"
$ income                      <fct> 10-15, >15, >15, >15, >15, >15
$ age                         <fct> NA, NA, NA, 25-45, 65-100, NA
$ sex                         <fct> NA, NA, NA, male, female, NA
$ n_trips                     <dbl> 2.605, 2.212, 12.146, 2.997, 2.401, 1.041
$ trips_total_length_km       <dbl> 9.457, 5.645, 65.663, 16.093, 16.179, 4.629
$ year                        <int> 2022, 2022, 2022, 2022, 2022, 2022
$ month                       <int> 1, 1, 1, 1, 1, 1
$ day                         <int> 1, 1, 1, 1, 1, 1

@e-kotov
Copy link
Member

e-kotov commented Jan 28, 2025

time slot = hour_slot

technically, this may be a breaking change if anyone already has a production code that uses the time_slot column name. So I am a bit hesitant even about the commit i made to change it to hour.

distance = interval_distance

I would not do this because (a) it would also be a breaking change and (b) the type of variable is factor and we have detailed docs on what it does, so I don't see a good reason to make the name longer.

@e-kotov
Copy link
Member

e-kotov commented Jan 28, 2025

  • so to make it work we would have to lower one level in all the files

done

  • plot the desire lines using the flow mapper.
    Not sure where this plot comes from

Comes from the disaggregation vignette. I would suggest this to be solved in a separate issue.

  • Remove the title of the first figure --- there is already a caption with title for this figure.
    Same as above

Honestly, I don't have the code for this, as it was done hastily just to paste something. I might not have kept it. I would create a new issue for this to address this eventually.

  • Beautify Figure 4, checking for typos (e.g. munucipality, houlry), putting the first three boxes at the same level, etc.

probably also in the disaggregation vignette

@Robinlovelace @eugenividal I would merge it as it is with my latest commits and spin off the remaining minor problems into separate issues. These can be addressed eventually or perhaps even during a hackathon for first time contributors https://ropensci.org/blog/2025/01/24/coworking-hackathons/ (I am taking part as a package author) - for this event in March I was asked to prepare a few easy to solve issues in the package repo.

I'm still a but hesitant to change the column name time_slot... @Robinlovelace what do you think? 300+ people who have already started using the package may be unhappy about this when the next release hits CRAN.

Copy link
Member

@e-kotov e-kotov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approved with my changes (see commits)

@eugenividal
Copy link
Collaborator Author

All changes and decisions taken sound good @e-kotov 👌

There are no inconsistencies in the package code and output. What you might have found is inconsistencies in the vignettes because they were created on different dates.

You are right, I found the inconsistencies in the vignettes, I did not test the package.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants