From a96ab551442dd34432eb2e559a8f9def998086b8 Mon Sep 17 00:00:00 2001 From: eugenividal Date: Tue, 14 Jan 2025 20:55:27 +0100 Subject: [PATCH 1/9] rephrase paragraph and add level --- README.md | 65 ++++++++++++++++++++------------------------- README.qmd | 20 +++++++------- spanishoddata.Rproj | 1 + 3 files changed, 40 insertions(+), 46 deletions(-) diff --git a/README.md b/README.md index 2dff85b..8773804 100644 --- a/README.md +++ b/README.md @@ -31,22 +31,20 @@ alt="CRAN/METACRAN Downloads per month" /> downloading and formatting Spanish open mobility data released by the Ministry of Transport and Sustainable Mobility of Spain (MITMS 2024). -It supports the two versions of the Spanish mobility data that consists -of origin-destination matrices and some additional data sets. [The first -version](https://www.transportes.gob.es/ministerio/proyectos-singulares/estudios-de-movilidad-con-big-data/estudios-de-movilidad-anteriores/covid-19/opendata-movilidad) -covers data from 2020 and 2021, including the period of the COVID-19 -pandemic. [The second -version](https://www.transportes.gob.es/ministerio/proyectos-singulares/estudios-de-movilidad-con-big-data/opendata-movilidad) -contains data from January 2022 onwards and is regularly updated. Both -versions of the data primarily consist of mobile phone positioning data, -and include matrices for overnight stays, individual movements, and -trips of Spanish residents at different geographical levels. See the -[package website](https://rOpenSpain.github.io/spanishoddata/) and -vignettes for -[v1](https://rOpenSpain.github.io/spanishoddata/articles/v1-2020-2021-mitma-data-codebook) -and -[v2](https://rOpenSpain.github.io/spanishoddata/articles/v2-2022-onwards-mitma-data-codebook) -data for more details. +It supports the two versions of the Spanish mobility data. [The first +version (2020 to +2021)](https://www.transportes.gob.es/ministerio/proyectos-singulares/estudios-de-movilidad-con-big-data/estudios-de-movilidad-anteriores/covid-19/opendata-movilidad) +includes data from the COVID-19 pandemic, with tables detailing trip +numbers and distances, broken down by origin, destination, activity, +residence province, time interval, distance interval, and date. It also +provides tables of individual counts by location and trip frequency. +[The second version (2022 +onwards)](https://www.transportes.gob.es/ministerio/proyectos-singulares/estudios-de-movilidad-con-big-data/opendata-movilidad) +improves spatial resolution, adds trips to and from Portugal and France, +and introduces new fields for study-related activities and +sociodemographic factors (income, age, and sex) in the +origin-destination tables, along with additional tables showing +individual counts by overnight stay location, residence, and date. **spanishoddata** is designed to save time by providing the data in analysis-ready formats. Automating the process of downloading, cleaning, @@ -57,12 +55,13 @@ To effectively work with multiple data files, it’s recommended you set up a data directory where the package can search for the data and download only the files that are not already present. -# Examples of available data +## Examples of available data
![](vignettes/media/flows_plot.svg) + Figure 1: Example of the data available through the package: daily flows in Barcelona @@ -77,15 +76,16 @@ To create static maps like that see our vignette ![](https://ropenspain.github.io/spanishoddata/media/spain-folding-flows.gif) + Figure 2: Example of the data available through the package: interactive daily flows in Spain
-
![](https://ropenspain.github.io/spanishoddata/media/barcelona-time.gif) + Figure 3: Example of the data available through the package: interactive daily flows in Barcelona with time filter @@ -103,9 +103,7 @@ install.packages("spanishoddata") ```
-
-

Alternative installation and development

@@ -166,9 +164,7 @@ The function above will also ensure that the directory
is created and that you have sufficient permissions to write to it.

<details>
- - Setting data directory for advanced users @@ -237,11 +233,12 @@ package. + Figure 4: The overview of package functions to get the data
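For orientation, here is a minimal sketch of the two access patterns the figure above summarises: `spod_get()` for a few days of data, and `spod_convert()` plus `spod_connect()` for longer periods. The dates below are placeholders, and we assume the data directory has already been set as described above.

``` r
library(spanishoddata)

# Pattern 1: lazily read a few days of raw CSV data with spod_get()
od_short <- spod_get(
  type = "od",
  zones = "distr",
  dates = c(start = "2022-01-01", end = "2022-01-07")
)

# Pattern 2: for longer periods, convert once with spod_convert(),
# then connect to the converted data with spod_connect()
db_path <- spod_convert(
  type = "od",
  zones = "distr",
  dates = c(start = "2022-01-01", end = "2022-03-31"),
  save_format = "duckdb"
)
od_long <- spod_connect(db_path)
```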
-# Showcase +## Showcase To run the code in this README we will use the following setup: @@ -275,7 +272,7 @@ metadata # ℹ 9,432 more rows # ℹ 1 more variable: local_path -## Zones +### Zones Zones can be downloaded as follows: @@ -289,7 +286,7 @@ plot(sf::st_geometry(distritos_wgs84)) ![](man/figures/README-distritos-1.png) -## OD data +### OD data ``` r od_db <- spod_get( @@ -341,7 +338,7 @@ n_per_hour |> The figure above summarises 925,874,012 trips over the 7 days associated with 135,866,524 records. -## `spanishoddata` advantage over accessing the data yourself +### `spanishoddata` advantage over accessing the data yourself As we demonstrated above, you can perform very quick analysis using just a few lines of code. @@ -372,7 +369,7 @@ We did all of that for you and present you with a few simple functions that get you straight to the data in one line of code, and you are ready to run any analysis on it. -# Desire lines +## Desire lines We’ll use the same input data to pick-out the most important flows in Spain, with a focus on longer trips for visualisation: @@ -466,7 +463,7 @@ ggplot() + ![](man/figures/README-salamanca-plot-1.png) -# Further information +## Further information For more information on the package, see: @@ -493,7 +490,7 @@ For more information on the package, see: - [Quickly getting daily aggregated 2022+ data at municipality level](https://ropenspain.github.io/spanishoddata/articles/quick-get.html) -## Citation +### Citation To cite spanishoddata R package and data in publications use: @@ -506,7 +503,7 @@ MITMS (2024). “Estudio de movilidad de viajeros de ámbito nacional aplicando la tecnología Big Data. Informe metodológico (Study of National Traveler mobility Using Big Data Technology. Methodological Report).” Secretaría de Estado de Transportes y Movilidad Sostenible; -Ministerio de Transportes y Movilidad Sostenible. +Ministerio de Transportes, Movilidad y Agenda Urbana. . BibTeX: @@ -522,23 +519,19 @@ BibTeX: @TechReport{mitma_mobility_2024_v8, title = {Estudio de movilidad de viajeros de ámbito nacional aplicando la tecnología Big Data. Informe metodológico (Study of National Traveler mobility Using Big Data Technology. Methodological Report)}, author = {{MITMS}}, - institution = {Secretaría de Estado de Transportes y Movilidad Sostenible; Ministerio de Transportes y Movilidad Sostenible}, + institution = {Secretaría de Estado de Transportes y Movilidad Sostenible; Ministerio de Transportes, Movilidad y Agenda Urbana}, year = {2024}, url = {https://www.transportes.gob.es/ministerio/proyectos-singulares/estudio-de-movilidad-con-big-data}, urldate = {2024-12-11}, annotation = {https://www.transportes.gob.es/recursos_mfom/paginabasica/recursos/a3_informe_metodologico_estudio_movilidad_mitms_v8.pdf}, } -# References +## References - - - -
The figure above summarises 925,874,012 trips over the 7 days associated with 135,866,524 records. -## `spanishoddata` advantage over accessing the data yourself +### `spanishoddata` advantage over accessing the data yourself As we demonstrated above, you can perform very quick analysis using just a few lines of code. @@ -198,7 +198,7 @@ To highlight the benefits of the package, here is how you would do this manually We did all of that for you and present you with a few simple functions that get you straight to the data in one line of code, and you are ready to run any analysis on it. -# Desire lines +## Desire lines We'll use the same input data to pick-out the most important flows in Spain, with a focus on longer trips for visualisation: @@ -295,7 +295,7 @@ ggplot() + ![](man/figures/README-salamanca-plot-1.png) -# Further information +## Further information For more information on the package, see: @@ -336,7 +336,7 @@ usethis::use_tidy_description() ``` -## Citation +### Citation ```{r} #| eval: true @@ -358,7 +358,7 @@ toBibtex(citation("spanishoddata")) -# References +## References diff --git a/spanishoddata.Rproj b/spanishoddata.Rproj index 6ff5a50..57e7e2e 100644 --- a/spanishoddata.Rproj +++ b/spanishoddata.Rproj @@ -1,4 +1,5 @@ Version: 1.0 +ProjectId: 0eb7deaa-2778-4211-9274-917281de2007 RestoreWorkspace: No SaveWorkspace: No From 31b0bb206af9322afb1922d4b695bae1c2b7993b Mon Sep 17 00:00:00 2001 From: eugenividal Date: Tue, 14 Jan 2025 21:47:36 +0100 Subject: [PATCH 2/9] rename mobility data v1 --- README.md | 20 ++++++++++++------- README.qmd | 3 ++- .../v1-2020-2021-mitma-data-codebook.qmd | 18 ++++++++--------- 3 files changed, 24 insertions(+), 17 deletions(-) diff --git a/README.md b/README.md index 8773804..ee498c2 100644 --- a/README.md +++ b/README.md @@ -33,18 +33,24 @@ Ministry of Transport and Sustainable Mobility of Spain (MITMS 2024). It supports the two versions of the Spanish mobility data. [The first version (2020 to -2021)](https://www.transportes.gob.es/ministerio/proyectos-singulares/estudios-de-movilidad-con-big-data/estudios-de-movilidad-anteriores/covid-19/opendata-movilidad) -includes data from the COVID-19 pandemic, with tables detailing trip -numbers and distances, broken down by origin, destination, activity, -residence province, time interval, distance interval, and date. It also -provides tables of individual counts by location and trip frequency. -[The second version (2022 +2021)](https://www.transportes.gob.es/ministerio/proyectos-singulares/estudios-de-movilidad-con-big-data/estudios-de-movilidad-anteriores/covid-19/opendata-movilidad), +covering the period of the COVID-19 pandemic, contains tables detailing +trip numbers and distances, broken down by origin, destination, +activity, residence province, time interval, distance interval, and +date. It also provides tables of individual counts by location and trip +frequency. [The second version (2022 onwards)](https://www.transportes.gob.es/ministerio/proyectos-singulares/estudios-de-movilidad-con-big-data/opendata-movilidad) improves spatial resolution, adds trips to and from Portugal and France, and introduces new fields for study-related activities and sociodemographic factors (income, age, and sex) in the origin-destination tables, along with additional tables showing -individual counts by overnight stay location, residence, and date. +individual counts by overnight stay location, residence, and date. 
See +the [package website](https://rOpenSpain.github.io/spanishoddata/) and +vignettes for +[v1](https://rOpenSpain.github.io/spanishoddata/articles/v1-2020-2021-mitma-data-codebook) +and +[v2](https://rOpenSpain.github.io/spanishoddata/articles/v2-2022-onwards-mitma-data-codebook) +data for more details. **spanishoddata** is designed to save time by providing the data in analysis-ready formats. Automating the process of downloading, cleaning, diff --git a/README.qmd b/README.qmd index b1bc66e..1855ed0 100644 --- a/README.qmd +++ b/README.qmd @@ -27,7 +27,8 @@ default-image-extension: "" **spanishoddata** is an R package that provides functions for downloading and formatting Spanish open mobility data released by the Ministry of Transport and Sustainable Mobility of Spain [@mitma_mobility_2024_v8]. -It supports the two versions of the Spanish mobility data. [The first version (2020 to 2021)](https://www.transportes.gob.es/ministerio/proyectos-singulares/estudios-de-movilidad-con-big-data/estudios-de-movilidad-anteriores/covid-19/opendata-movilidad) includes data from the COVID-19 pandemic, with tables detailing trip numbers and distances, broken down by origin, destination, activity, residence province, time interval, distance interval, and date. It also provides tables of individual counts by location and trip frequency. [The second version (2022 onwards)](https://www.transportes.gob.es/ministerio/proyectos-singulares/estudios-de-movilidad-con-big-data/opendata-movilidad) improves spatial resolution, adds trips to and from Portugal and France, and introduces new fields for study-related activities and sociodemographic factors (income, age, and sex) in the origin-destination tables, along with additional tables showing individual counts by overnight stay location, residence, and date. +It supports the two versions of the Spanish mobility data. [The first version (2020 to 2021)](https://www.transportes.gob.es/ministerio/proyectos-singulares/estudios-de-movilidad-con-big-data/estudios-de-movilidad-anteriores/covid-19/opendata-movilidad), covering the period of the COVID-19 pandemic, contains tables detailing trip numbers and distances, broken down by origin, destination, activity, residence province, time interval, distance interval, and date. It also provides tables of individual counts by location and trip frequency. [The second version (2022 onwards)](https://www.transportes.gob.es/ministerio/proyectos-singulares/estudios-de-movilidad-con-big-data/opendata-movilidad) improves spatial resolution, adds trips to and from Portugal and France, and introduces new fields for study-related activities and sociodemographic factors (income, age, and sex) in the origin-destination tables, along with additional tables showing individual counts by overnight stay location, residence, and date. +See the [package website](https://rOpenSpain.github.io/spanishoddata/) and vignettes for [v1](https://rOpenSpain.github.io/spanishoddata/articles/v1-2020-2021-mitma-data-codebook) and [v2](https://rOpenSpain.github.io/spanishoddata/articles/v2-2022-onwards-mitma-data-codebook) data for more details. **spanishoddata** is designed to save time by providing the data in analysis-ready formats. Automating the process of downloading, cleaning, and importing the data can also reduce the risk of errors in the laborious process of data preparation. It also reduces computational resources by using computationally efficient packages behind the scenes. 
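As a brief illustration of what "analysis-ready" means in practice (a sketch; the date is an arbitrary example, and the column names follow the data codebooks):

```r
library(spanishoddata)
library(dplyr)

# A single call returns a lazily-evaluated table backed by DuckDB
od <- spod_get(type = "od", zones = "distr", dates = "2021-04-07")

# Standard dplyr verbs work on it; collect() pulls the result into memory
od |>
  group_by(id_origin) |>
  summarise(total_trips = sum(n_trips, na.rm = TRUE)) |>
  collect()
```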
To effectively work with multiple data files, it’s recommended you set
up a data directory where the package can search for the data and
download only the files that are not already present.

diff --git a/vignettes/v1-2020-2021-mitma-data-codebook.qmd b/vignettes/v1-2020-2021-mitma-data-codebook.qmd
index 0a2bf9c..25ed381 100644
--- a/vignettes/v1-2020-2021-mitma-data-codebook.qmd
+++ b/vignettes/v1-2020-2021-mitma-data-codebook.qmd
@@ -65,7 +65,7 @@ Using the instructions below, set the data folder for the package to download th

# 1. Spatial data with zoning boundaries

-The boundary data is provided at two geographic levels: [`Distrtics`](#districts) and [`Municipalities`](#municipalities). It's important to note that these do not always align with the official Spanish census districts and municipalities. To comply with data protection regulations, certain aggregations had to be made to districts and municipalities".
+The boundary data is provided at two geographic levels: [`Districts`](#districts) and [`Municipalities`](#municipalities). It's important to note that these do not always align with the official Spanish census districts and municipalities. To comply with data protection regulations, certain aggregations had to be made to districts and municipalities.

## 1.1 `Districts` {#districts}

@@ -83,7 +83,7 @@ Data structure:

| Variable Name | **Description** |
|------------------------------------|------------------------------------|
-| `id` | District `id` assigned by the data provider. Matches with `id_origin`, `id_destination`, and `id` in district level [origin-destination data](#od-data) and [number of trips data](#nt-data). |
+| `id` | District `id` assigned by the data provider. Matches with `id_origin`, `id_destination`, and `id` in district level [origin-destination data](#od-data) and [number of trips data](#pop-tc). |
| `census_districts` | A string with a semicolon-separated list of census district identifiers as classified by the Spanish Statistical Office (INE) that are spatially bound within polygons with `id` above. |
| `municipalities_mitma` | A string with a semicolon-separated list of municipality identifiers as assigned by the data provider in the municipality zones spatial dataset that correspond to a given district `id`. |
| `municipalities` | A string with a semicolon-separated list of municipality identifiers as classified by the Spanish Statistical Office (INE) that correspond to polygons with `id` above. |
@@ -106,7 +106,7 @@ Data structure:

| Variable Name | **Description** |
|------------------------------------|------------------------------------|
-| `id` | District `id` assigned by the data provider. Matches with `id_origin`, `id_destination`, and `id` in municipality level [origin-destination data](#od-data) and [number of trips data](#nt-data). |
+| `id` | Municipality `id` assigned by the data provider. Matches with `id_origin`, `id_destination`, and `id` in municipality level [origin-destination data](#od-data) and [population by trip count](#pop-tc). |
| `municipalities` | A list of municipality identifiers as classified by the Spanish Statistical Office (INE) that correspond to polygons with `id` above. |
| `districts_mitma` | A list of district identifiers as assigned by the data provider in the districts zones spatial dataset that correspond to a given municipality `id`. 
| | `census_districts` | A list of census district identifiers as classified by the Spanish Statistical Office (INE) that are spatially bound within polygons with `id` above. | @@ -119,9 +119,9 @@ The spatial data you get via `spanishoddata` package is downloaded directly from All mobility data is referenced via `id_origin`, `id_destination`, or other location identifiers (mostly labelled as `id`) with the two sets of zones described above. -## 2.1. Origin-destination data {#od-data} +## 2.1. Origin-Destination data {#od-data} -The origin-destination data contain the number of trips between `districts` or `municipalities` in Spain for every hour of every day between 2020-02-14 and 2021-05-09. Each flow also has attributes such as the trip purpose (composed of the type of activity (`home`/`work_or_study`/`other`) at both the origin and destination), province of residence of individuals making this trip, distance covered while making the trip. See the detailed attributes below in a table. @fig-flows-barcelona shows an example of total flows in the province of Barcelona on Feb 14th, 2020. +The origin-destination data contain the number of trips and distance travelled between `districts` or `municipalities` in Spain for every hour of every day between 2020-02-14 and 2021-05-09. Each flow also has attributes such as the trip purpose (composed of the type of activity (`home`/`work_or_study`/`other`) at both the origin and destination), province of residence of individuals making this trip, distance covered while making the trip. See the detailed attributes below in a table. @fig-flows-barcelona shows an example of total flows in the province of Barcelona on Feb 14th, 2020. ![Origin destination flows in Barcelona on 2020-02-14](media/flows_plot.svg){#fig-flows-barcelona width="70%"} @@ -203,16 +203,16 @@ The same summary operation as provided in the example above can be done with the {{< include ../inst/vignette-include/csv-date-filter-note.qmd >}} -## 2.2. Number of trips data {#nt-data} +## 2.2. Population by trip count data {#pop-tc} -The "number of trips" data shows the number of individuals in each district or municipality who made trips categorised by the number of trips. +The population by trip count data shows the number of individuals in each district or municipality, categorized by the trips they make: 0, 1, 2, or more than 2. | **English Variable Name** | **Original Variable Name** | **Type** | **Description** | |----------------|----------------|----------------|------------------------| | `date` | `fecha` | `Date` | The date of the recorded data, formatted as `YYYY-MM-DD`. | | `id` | `distrito` | `factor` | The identifier of the `district` or `municipality` zone. | -| `n_trips` | `numero_viajes` | `factor` | The number of individuals who made trips, categorized by `0`, `1`, `2`, or `2+` trips. | -| `n_persons` | `personas` | `factor` | The number of persons making the trips from `district` or `municipality` with zone `id`. | +| `n_trips` | `numero_viajes` | `factor` | The number of trips grouped into four categories `0`, `1`, `2`, or `2+`. | +| `n_persons` | `personas` | `factor` | The number of individuals making the trips from `district` or `municipality` with zone `id`. | | `year` | `year` | `integer` | The year of the recorded data, extracted from the date. | | `month` | `month` | `integer` | The month of the recorded data, extracted from the date. | | `day` | `day` | `integer` | The day of the recorded data, extracted from the date. 
| From 59a0ddf971ed8c55d697dcc01493eea34045870e Mon Sep 17 00:00:00 2001 From: eugenividal Date: Tue, 14 Jan 2025 22:01:13 +0100 Subject: [PATCH 3/9] rename mobility datasets v2 --- vignettes/v1-2020-2021-mitma-data-codebook.qmd | 6 +++--- vignettes/v2-2022-onwards-mitma-data-codebook.qmd | 10 +++++----- 2 files changed, 8 insertions(+), 8 deletions(-) diff --git a/vignettes/v1-2020-2021-mitma-data-codebook.qmd b/vignettes/v1-2020-2021-mitma-data-codebook.qmd index 25ed381..2a248dc 100644 --- a/vignettes/v1-2020-2021-mitma-data-codebook.qmd +++ b/vignettes/v1-2020-2021-mitma-data-codebook.qmd @@ -83,7 +83,7 @@ Data structure: | Variable Name | **Description** | |------------------------------------|------------------------------------| -| `id` | District `id` assigned by the data provider. Matches with `id_origin`, `id_destination`, and `id` in district level [origin-destination data](#od-data) and [number of trips data](#pop-tc). | +| `id` | District `id` assigned by the data provider. Matches with `id_origin`, `id_destination`, and `id` in district level [origin-destination data](#od-data) and [number of trips data](#ptc-data). | | `census_districts` | A string with semicolon-separated list of census district semicolon-separated identifiers as classified by the Spanish Statistical Office (INE) that are spatially bound within polygons with `id` above. | | `municipalities_mitma` | A string with semicolon-separated list of municipality identifiers as assigned by the data provider in municipality zones spatial dataset that correspond to a given district `id` . | | `municipalities` | A string with semicolon-separated list of municipality identifiers as classified by the Spanish Statistical Office (INE) that correspond to polygons with `id` above. | @@ -203,7 +203,7 @@ The same summary operation as provided in the example above can be done with the {{< include ../inst/vignette-include/csv-date-filter-note.qmd >}} -## 2.2. Population by trip count data {#pop-tc} +## 2.2. Population by trip count data {#ptc-data} The population by trip count data shows the number of individuals in each district or municipality, categorized by the trips they make: 0, 1, 2, or more than 2. @@ -211,7 +211,7 @@ The population by trip count data shows the number of individuals in each distri |----------------|----------------|----------------|------------------------| | `date` | `fecha` | `Date` | The date of the recorded data, formatted as `YYYY-MM-DD`. | | `id` | `distrito` | `factor` | The identifier of the `district` or `municipality` zone. | -| `n_trips` | `numero_viajes` | `factor` | The number of trips grouped into four categories `0`, `1`, `2`, or `2+`. | +| `n_trips` | `numero_viajes` | `factor` | The number of individuals who made trips, categorized by `0`, `1`, `2`, or `2+` trips. | | `n_persons` | `personas` | `factor` | The number of individuals making the trips from `district` or `municipality` with zone `id`. | | `year` | `year` | `integer` | The year of the recorded data, extracted from the date. | | `month` | `month` | `integer` | The month of the recorded data, extracted from the date. 
|
diff --git a/vignettes/v2-2022-onwards-mitma-data-codebook.qmd b/vignettes/v2-2022-onwards-mitma-data-codebook.qmd
index f0e719d..c4af24a 100644
--- a/vignettes/v2-2022-onwards-mitma-data-codebook.qmd
+++ b/vignettes/v2-2022-onwards-mitma-data-codebook.qmd
@@ -235,15 +235,15 @@ od_mean_trips_by_ses_over_the_4_days
# ℹ Use `print(n = ...)` to see more rows
```

-In this example above, becaus the data is with hourly intervals within each day, we first summed the number of trips for each day by age, sex, and income groups. We then grouped the data again dropping the day variable and calculated the mean number of trips per day by age, sex, and income groups. The full data for all 4 days was probably never loaded into memory all at once. Rather the available memory of the computer was used up to its maximum limit to make that calculation happen, without ever exceeding the available memory limit. If you were doing the same opearation on 100 or even more days, it would work in the same way and would be possible even with limited memory. This is done transparantly to the user with the help of [`DuckDB`](https://duckdb.org/){target="_blank"} (specifically, with [{duckdb} R package](https://r.duckdb.org/index.html){target="_blank"} @duckdb-r).
+In this example above, because the data is at hourly intervals within each day, we first summed the number of trips for each day by age, sex, and income groups. We then grouped the data again, dropping the day variable, and calculated the mean number of trips per day by age, sex, and income groups. The full data for all 4 days was probably never loaded into memory all at once. Rather, the available memory of the computer was used up to its maximum limit to make that calculation happen, without ever exceeding the available memory limit. If you were doing the same operation on 100 or even more days, it would work in the same way and would be possible even with limited memory. This is done transparently to the user with the help of [`DuckDB`](https://duckdb.org/){target="_blank"} (specifically, with the [{duckdb} R package](https://r.duckdb.org/index.html){target="_blank"} @duckdb-r).

-The same summary operation as provided in the example above can be done with the entire dataset for multiple years worth of data on a regular laptop with 8-16 GB memory. It will take a bit of time to complete, but it will be done. To speed things up, please also see the [vignette on converting the data](convert.qmd) into formats that will increase the analsysis performance.
+The same summary operation as provided in the example above can be done with the entire dataset for multiple years' worth of data on a regular laptop with 8-16 GB memory. It will take a bit of time to complete, but it will be done. To speed things up, please also see the [vignette on converting the data](convert.qmd) into formats that will increase the analysis performance.

{{< include ../inst/vignette-include/csv-date-filter-note.qmd >}}

-## 2.2. Number of trips data {#nt-data}
+## 2.2. Population by trip count data {#ptc-data}

-For each location, the "number of trips" data provides the number of individuals who spent the night there, with breakdown by the number of trips made, age, and sex.
+The population by trip count data shows the number of individuals in each district or municipality, categorized by the trips they make (0, 1, 2, or more than 2), age, and sex. 
| **English Variable Name** | **Original Variable Name** | **Type** | **Description** | |-----------------|-----------------|-----------------|----------------------| @@ -272,7 +272,7 @@ Because this data is small, we can actually load it completely into memory: nt_dist_tbl <- nt_dist |> dplyr::collect() ``` -## 2.3. Overnight stays {#os-data} +## 2.3. Population by overnight stay data {#pos-data} This dataset provides the number of people who spend the night in each location, also identifying their place of residence down to the census district level according to the [INE encoding](https://www.ine.es/ss/Satellite?c=Page&p=1259952026632&pagename=ProductosYServicios%2FPYSLayout&cid=1259952026632&L=1){target="_blank"}. From 320393ab8c0c7ec7a7e51de1fa4029c5b060db54 Mon Sep 17 00:00:00 2001 From: Egor Kotov Date: Tue, 28 Jan 2025 16:50:58 +0100 Subject: [PATCH 4/9] rename time_slot to hour --- R/duckdb-helpers.R | 2 +- README.md | 6 +++--- README.qmd | 6 +++--- .../sql-queries/v1-od-distritos-clean-csv-view-en.sql | 2 +- .../sql-queries/v1-od-municipios-clean-csv-view-en.sql | 2 +- .../sql-queries/v2-od-distritos-clean-csv-view-en.sql | 2 +- inst/extdata/sql-queries/v2-od-gau-clean-csv-view-en.sql | 2 +- .../sql-queries/v2-od-municipios-clean-csv-view-en.sql | 2 +- man/spod_duckdb_od.Rd | 2 +- vignettes/convert.qmd | 4 ++-- vignettes/flowmaps-interactive.qmd | 6 +++--- vignettes/flowmaps-static.qmd | 2 +- vignettes/v1-2020-2021-mitma-data-codebook.qmd | 6 +++--- vignettes/v2-2022-onwards-mitma-data-codebook.qmd | 2 +- 14 files changed, 23 insertions(+), 23 deletions(-) diff --git a/R/duckdb-helpers.R b/R/duckdb-helpers.R index bead8f3..33a3080 100644 --- a/R/duckdb-helpers.R +++ b/R/duckdb-helpers.R @@ -21,7 +21,7 @@ #' \item{activity_destination}{\code{factor}. The type of activity at the destination location (e.g., 'home', 'other'). \strong{Note:} Only available for district level data.} #' \item{residence_province_ine_code}{\code{factor}. The province of residence for the group of individual making the trip, encoded according to the INE classification. \strong{Note:} Only available for district level data.} #' \item{residence_province_name}{\code{factor}. The province of residence for the group of individuals making the trip (e.g., 'Cuenca', 'Girona'). \strong{Note:} Only available for district level data.} -#' \item{time_slot}{\code{integer}. The time slot (the hour of the day) during which the trip started, represented as an integer (e.g., 0, 1, 2).} +#' \item{hour}{\code{integer}. The time slot (the hour of the day) during which the trip started, represented as an integer (e.g., 0, 1, 2).} #' \item{distance}{\code{factor}. The distance category of the trip, represented as a code (e.g., '002-005' for 2-5 km).} #' \item{n_trips}{\code{double}. The number of trips taken within the specified time slot and distance.} #' \item{trips_total_length_km}{\code{double}. 
The total length of all trips in kilometers for the specified time slot and distance.} diff --git a/README.md b/README.md index ee498c2..1b4e55a 100644 --- a/README.md +++ b/README.md @@ -310,7 +310,7 @@ class(od_db) colnames(od_db) ``` - [1] "full_date" "time_slot" + [1] "full_date" "hour" [3] "id_origin" "id_destination" [5] "distance" "activity_origin" [7] "activity_destination" "study_possible_origin" @@ -328,10 +328,10 @@ aggregation to find the total number trips per hour over the 7 days: ``` r n_per_hour <- od_db |> - group_by(date, time_slot) |> + group_by(date, hour) |> summarise(n = n(), Trips = sum(n_trips)) |> collect() |> - mutate(Time = lubridate::ymd_h(paste0(date, time_slot, sep = " "))) |> + mutate(Time = lubridate::ymd_h(paste0(date, hour, sep = " "))) |> mutate(Day = lubridate::wday(Time, label = TRUE)) n_per_hour |> ggplot(aes(x = Time, y = Trips)) + diff --git a/README.qmd b/README.qmd index 1855ed0..661e00d 100644 --- a/README.qmd +++ b/README.qmd @@ -142,7 +142,7 @@ colnames(od_db) ``` ``` - [1] "full_date" "time_slot" + [1] "full_date" "hour" [3] "id_origin" "id_destination" [5] "distance" "activity_origin" [7] "activity_destination" "study_possible_origin" @@ -160,10 +160,10 @@ Let's do an aggregation to find the total number trips per hour over the 7 days: ```{r} #| label: trips-per-hour n_per_hour <- od_db |> - group_by(date, time_slot) |> + group_by(date, hour) |> summarise(n = n(), Trips = sum(n_trips)) |> collect() |> - mutate(Time = lubridate::ymd_h(paste0(date, time_slot, sep = " "))) |> + mutate(Time = lubridate::ymd_h(paste0(date, hour, sep = " "))) |> mutate(Day = lubridate::wday(Time, label = TRUE)) n_per_hour |> ggplot(aes(x = Time, y = Trips)) + diff --git a/inst/extdata/sql-queries/v1-od-distritos-clean-csv-view-en.sql b/inst/extdata/sql-queries/v1-od-distritos-clean-csv-view-en.sql index 76980ed..3819139 100644 --- a/inst/extdata/sql-queries/v1-od-distritos-clean-csv-view-en.sql +++ b/inst/extdata/sql-queries/v1-od-distritos-clean-csv-view-en.sql @@ -75,7 +75,7 @@ CREATE VIEW od_csv_clean AS SELECT WHEN '51' THEN 'Ceuta' WHEN '52' THEN 'Melilla' END AS INE_PROV_NAME_ENUM) AS residence_province_name, - periodo AS time_slot, + periodo AS hour, CAST(distancia AS DISTANCE_ENUM) AS distance, viajes AS n_trips, viajes_km AS trips_total_length_km, diff --git a/inst/extdata/sql-queries/v1-od-municipios-clean-csv-view-en.sql b/inst/extdata/sql-queries/v1-od-municipios-clean-csv-view-en.sql index 509d8c7..028080c 100644 --- a/inst/extdata/sql-queries/v1-od-municipios-clean-csv-view-en.sql +++ b/inst/extdata/sql-queries/v1-od-municipios-clean-csv-view-en.sql @@ -84,7 +84,7 @@ SELECT WHEN '51' THEN 'Ceuta' WHEN '52' THEN 'Melilla' END AS INE_PROV_NAME_ENUM) AS residence_province_name, - d.periodo AS time_slot, + d.periodo AS hour, CAST(d.distancia AS DISTANCE_ENUM) AS distance, SUM(d.viajes) AS n_trips, SUM(d.viajes_km) AS trips_total_length_km, diff --git a/inst/extdata/sql-queries/v2-od-distritos-clean-csv-view-en.sql b/inst/extdata/sql-queries/v2-od-distritos-clean-csv-view-en.sql index 9201a84..69b195f 100644 --- a/inst/extdata/sql-queries/v2-od-distritos-clean-csv-view-en.sql +++ b/inst/extdata/sql-queries/v2-od-distritos-clean-csv-view-en.sql @@ -1,6 +1,6 @@ CREATE VIEW od_csv_clean AS SELECT fecha AS date, - periodo AS time_slot, + periodo AS hour, CAST (CASE origen WHEN 'externo' THEN 'external' ELSE origen diff --git a/inst/extdata/sql-queries/v2-od-gau-clean-csv-view-en.sql b/inst/extdata/sql-queries/v2-od-gau-clean-csv-view-en.sql index 
9201a84..69b195f 100644 --- a/inst/extdata/sql-queries/v2-od-gau-clean-csv-view-en.sql +++ b/inst/extdata/sql-queries/v2-od-gau-clean-csv-view-en.sql @@ -1,6 +1,6 @@ CREATE VIEW od_csv_clean AS SELECT fecha AS date, - periodo AS time_slot, + periodo AS hour, CAST (CASE origen WHEN 'externo' THEN 'external' ELSE origen diff --git a/inst/extdata/sql-queries/v2-od-municipios-clean-csv-view-en.sql b/inst/extdata/sql-queries/v2-od-municipios-clean-csv-view-en.sql index 9201a84..69b195f 100644 --- a/inst/extdata/sql-queries/v2-od-municipios-clean-csv-view-en.sql +++ b/inst/extdata/sql-queries/v2-od-municipios-clean-csv-view-en.sql @@ -1,6 +1,6 @@ CREATE VIEW od_csv_clean AS SELECT fecha AS date, - periodo AS time_slot, + periodo AS hour, CAST (CASE origen WHEN 'externo' THEN 'external' ELSE origen diff --git a/man/spod_duckdb_od.Rd b/man/spod_duckdb_od.Rd index 5534680..195bdcf 100644 --- a/man/spod_duckdb_od.Rd +++ b/man/spod_duckdb_od.Rd @@ -38,7 +38,7 @@ The structure of the cleaned-up views \code{od_csv_clean} is as follows: \item{activity_destination}{\code{factor}. The type of activity at the destination location (e.g., 'home', 'other'). \strong{Note:} Only available for district level data.} \item{residence_province_ine_code}{\code{factor}. The province of residence for the group of individual making the trip, encoded according to the INE classification. \strong{Note:} Only available for district level data.} \item{residence_province_name}{\code{factor}. The province of residence for the group of individuals making the trip (e.g., 'Cuenca', 'Girona'). \strong{Note:} Only available for district level data.} -\item{time_slot}{\code{integer}. The time slot (the hour of the day) during which the trip started, represented as an integer (e.g., 0, 1, 2).} +\item{hour}{\code{integer}. The time slot (the hour of the day) during which the trip started, represented as an integer (e.g., 0, 1, 2).} \item{distance}{\code{factor}. The distance category of the trip, represented as a code (e.g., '002-005' for 2-5 km).} \item{n_trips}{\code{double}. The number of trips taken within the specified time slot and distance.} \item{trips_total_length_km}{\code{double}. The total length of all trips in kilometers for the specified time slot and distance.} diff --git a/vignettes/convert.qmd b/vignettes/convert.qmd index ed1566e..6f746a9 100644 --- a/vignettes/convert.qmd +++ b/vignettes/convert.qmd @@ -60,7 +60,7 @@ For reference, here is a simple query we used for speed comparison in @fig-csv-d ```r # data represents either CSV files acquired from `spod_get()`, a `DuckDB` database or a folder of Parquet files connceted with `spod_connect()` data |> - group_by(id_origin, id_destination, time_slot) |> + group_by(id_origin, id_destination, hour) |> summarise(mean_hourly_trips = mean(n_trips, na.rm = TRUE), .groups = "drop") ``` @@ -109,7 +109,7 @@ The output should look like this: ``` # Source: table [?? x 19] # Database: DuckDB v1.0.0 [... 
6.5.0-45-generic:R 4.4.1/:memory:]
-   date time_slot id_origin id_destination distance activity_origin
+   date hour id_origin id_destination distance activity_origin

 1 2024-03-01 19 01009_AM 01001 0.5-2 frequent_activity
 2 2024-03-01 15 01002 01001 10-50 frequent_activity

diff --git a/vignettes/flowmaps-interactive.qmd b/vignettes/flowmaps-interactive.qmd
index d9c5cfb..b0dafab 100644
--- a/vignettes/flowmaps-interactive.qmd
+++ b/vignettes/flowmaps-interactive.qmd
@@ -58,7 +58,7 @@ head(od_20210407)

```
# Source: SQL [6 x 14]
# Database: DuckDB v1.0.0 [root@Darwin 23.6.0:R 4.4.1/:memory:]
- date id_origin id_destination activity_origin activity_destination residence_province_ine_code residence_province_name time_slot distance n_trips trips_total_length_km year month day
+ date id_origin id_destination activity_origin activity_destination residence_province_ine_code residence_province_name hour distance n_trips trips_total_length_km year month day

1 2021-04-07 01001_AM 01001_AM home other 01 Araba/Álava 0 005-010 10.5 68.9 2021 4 7
2 2021-04-07 01001_AM 01001_AM home other 01 Araba/Álava 0 010-050 12.6 127. 2021 4 7

@@ -236,11 +236,11 @@ After following the simple example, let us now add a time filter to the flows. W

## Prepare data for visualization

-Just like before, we aggregate the data and rename some columns. This time we will keep combine the `date` and `time_slot` (which corresponds to the hour of the day) to procude timestamps, so that the flows can be interactively filtered by time of day.
+Just like before, we aggregate the data and rename some columns. This time we will combine the `date` and `hour` (which corresponds to the hour of the day) to produce timestamps, so that the flows can be interactively filtered by time of day.

```{r}
od_20210407_time <- od_20210407 |>
-  mutate(time = as.POSIXct(paste0(date, "T", time_slot, ":00:00"))) |>
+  mutate(time = as.POSIXct(paste0(date, "T", hour, ":00:00"))) |>
  group_by(origin = id_origin, dest = id_destination, time) |>
  summarise(count = sum(n_trips, na.rm = TRUE), .groups = "drop") |>
  collect()
```

diff --git a/vignettes/flowmaps-static.qmd b/vignettes/flowmaps-static.qmd
index 9a33e70..10420bb 100644
--- a/vignettes/flowmaps-static.qmd
+++ b/vignettes/flowmaps-static.qmd
@@ -47,7 +47,7 @@ head(od_20210407)

```
# Source: SQL [6 x 14]
# Database: DuckDB v1.0.0 [root@Darwin 23.6.0:R 4.4.1/:memory:]
- date id_origin id_destination activity_origin activity_destination residence_province_in…¹ residence_province_n…² time_slot distance n_trips trips_total_length_km year month
+ date id_origin id_destination activity_origin activity_destination residence_province_in…¹ residence_province_n…² hour distance n_trips trips_total_length_km year month

1 2021-04-07 01001_AM 01001_AM home other 01 Araba/Álava 0 005-010 10.5 68.9 2021 4
2 2021-04-07 01001_AM 01001_AM home other 01 Araba/Álava 0 010-050 12.6 127. 2021 4

diff --git a/vignettes/v1-2020-2021-mitma-data-codebook.qmd b/vignettes/v1-2020-2021-mitma-data-codebook.qmd
index 2a248dc..505d7e9 100644
--- a/vignettes/v1-2020-2021-mitma-data-codebook.qmd
+++ b/vignettes/v1-2020-2021-mitma-data-codebook.qmd
@@ -136,7 +136,7 @@ Here are the variables you can find in both the `district` and `municipality` le
| `activity_destination` | `actividad_destino` | `factor` | The type of activity at the destination zone, similarly recoded as for `activity_origin` above. 
|
| `residence_province_ine_code` | `residencia` | `factor` | The province code of residence of individuals who were making the trips in `n_trips`, encoded as province codes as classified by the Spanish Statistical Office (INE). |
| `residence_province_name` | Derived from `residencia` | `factor` | The full name of the residence province, derived from the province code above. |
-| `time_slot` | `periodo` | `integer` | The time slot during which the trips occurred. |
+| `hour` | `periodo` | `integer` | The time slot during which the trips occurred. |
| `distance` | `distancia` | `factor` | The distance range of the trip, categorized into specific intervals such as `0005-002` (500 m to 2 km), `002-005` (2-5 km), `005-010` (5-10 km), `010-050` (10-50 km), `050-100` (50-100 km), and `100+` (more than 100 km). |
| `n_trips` | `viajes` | `numeric` | The number of trips for that specific origin-destination pair and time slot. |
| `trips_total_length_km` | `viajes_km` | `numeric` | The total length of trips in kilometers, summing up all trips between the origin and destination zones. |
@@ -171,7 +171,7 @@ The resulting objects `od_dist` and `od_muni` are of class `tbl_duckdb_connectio
```{r}
library(dplyr)
od_mean_hourly_trips_over_the_4_days <- od_dist |>
-  group_by(time_slot) |>
+  group_by(hour) |>
  summarise(
    mean_hourly_trips = mean(n_trips, na.rm = TRUE),
    .groups = "drop") |>
@@ -181,7 +181,7 @@ od_mean_hourly_trips_over_the_4_days
```

```
# A tibble: 24 × 2
-   time_slot mean_hourly_trips
+   hour mean_hourly_trips

1 18 21.4
2 10 19.3

diff --git a/vignettes/v2-2022-onwards-mitma-data-codebook.qmd b/vignettes/v2-2022-onwards-mitma-data-codebook.qmd
index c4af24a..d46946c 100644
--- a/vignettes/v2-2022-onwards-mitma-data-codebook.qmd
+++ b/vignettes/v2-2022-onwards-mitma-data-codebook.qmd
@@ -165,7 +165,7 @@ Here are the variables you can find in the `district`, `municipality` and `large

| **English Variable Name** | **Original Variable Name** | **Type** | **Description** |
|---------|---------|---------|---------------------------------------------|
| `date` | `fecha` | `Date` | The date of the recorded data, formatted as `YYYY-MM-DD`. |
-| `time_slot` | `periodo` | `integer` | The time slot during which the trips occurred. |
+| `hour` | `periodo` | `integer` | The time slot during which the trips occurred. |
| `id_origin` | `origen` | `factor` | The origin zone `id` of `district`, `municipality`, or `large urban area`. |
| `id_destination` | `destino` | `factor` | The destination zone `id` of `district`, `municipality`, or `large urban area`. |
| `distance` | `distancia` | `factor` | The distance range of the trip, categorized into specific intervals such as `0.5-2` (500 m to 2 km), `2-10` (2-10 km), `10-50` (10-50 km), and `>50` (50 or more km). 
| From 4c1950ccd8b26c954b1241b68118ddd30d5ebaf7 Mon Sep 17 00:00:00 2001 From: Egor Kotov Date: Tue, 28 Jan 2025 16:55:10 +0100 Subject: [PATCH 5/9] move hour and distance in v1 to same positions as in v2 --- .../extdata/sql-queries/v1-od-distritos-clean-csv-view-en.sql | 4 ++-- .../extdata/sql-queries/v1-od-distritos-clean-csv-view-es.sql | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/inst/extdata/sql-queries/v1-od-distritos-clean-csv-view-en.sql b/inst/extdata/sql-queries/v1-od-distritos-clean-csv-view-en.sql index 3819139..2c9d95e 100644 --- a/inst/extdata/sql-queries/v1-od-distritos-clean-csv-view-en.sql +++ b/inst/extdata/sql-queries/v1-od-distritos-clean-csv-view-en.sql @@ -1,5 +1,6 @@ CREATE VIEW od_csv_clean AS SELECT fecha AS date, + periodo AS hour, CAST (CASE origen WHEN 'externo' THEN 'external' ELSE origen @@ -10,6 +11,7 @@ CREATE VIEW od_csv_clean AS SELECT ELSE destino END AS ZONES_ENUM) AS id_destination, + CAST(distancia AS DISTANCE_ENUM) AS distance, CAST(CASE actividad_origen WHEN 'casa' THEN 'home' WHEN 'otros' THEN 'other' @@ -75,8 +77,6 @@ CREATE VIEW od_csv_clean AS SELECT WHEN '51' THEN 'Ceuta' WHEN '52' THEN 'Melilla' END AS INE_PROV_NAME_ENUM) AS residence_province_name, - periodo AS hour, - CAST(distancia AS DISTANCE_ENUM) AS distance, viajes AS n_trips, viajes_km AS trips_total_length_km, CAST(year AS INTEGER) AS year, diff --git a/inst/extdata/sql-queries/v1-od-distritos-clean-csv-view-es.sql b/inst/extdata/sql-queries/v1-od-distritos-clean-csv-view-es.sql index cc542d9..cfb3fbf 100644 --- a/inst/extdata/sql-queries/v1-od-distritos-clean-csv-view-es.sql +++ b/inst/extdata/sql-queries/v1-od-distritos-clean-csv-view-es.sql @@ -1,5 +1,6 @@ CREATE VIEW od_csv_clean AS SELECT fecha AS date, + periodo, CAST(origen AS ZONES_ENUM) AS origen, CAST(destino AS ZONES_ENUM) AS destino, CAST(CASE actividad_origen @@ -12,6 +13,7 @@ CREATE VIEW od_csv_clean AS SELECT WHEN 'otros' THEN 'other' WHEN 'trabajo_estudio' THEN 'work_or_study' END AS ACTIV_ENUM) AS actividad_destino, + CAST(distancia AS DISTANCE_ENUM) AS distancia, CAST(residencia AS INE_PROV_CODE_ENUM) AS residencia, CAST (CASE residencia WHEN '01' THEN 'Araba/Álava' @@ -67,8 +69,6 @@ CREATE VIEW od_csv_clean AS SELECT WHEN '51' THEN 'Ceuta' WHEN '52' THEN 'Melilla' END AS INE_PROV_NAME_ENUM) AS residencia_nombre, - periodo, - CAST(distancia AS DISTANCE_ENUM) AS distancia, viajes, viajes_km, CAST(year AS INTEGER) AS ano, From f8ca613ab4f2919c815b0895dcd89a82ea722fe6 Mon Sep 17 00:00:00 2001 From: Egor Kotov Date: Tue, 28 Jan 2025 17:06:00 +0100 Subject: [PATCH 6/9] update hour and distance col positions in v1 codebook --- vignettes/v1-2020-2021-mitma-data-codebook.qmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/vignettes/v1-2020-2021-mitma-data-codebook.qmd b/vignettes/v1-2020-2021-mitma-data-codebook.qmd index b1f7cf3..6be588d 100644 --- a/vignettes/v1-2020-2021-mitma-data-codebook.qmd +++ b/vignettes/v1-2020-2021-mitma-data-codebook.qmd @@ -130,14 +130,14 @@ Here are the variables you can find in both the `district` and `municipality` le | **English Variable Name** | **Original Variable Name** | **Type** | **Description** | |----------------|----------------|----------------|------------------------| | `date` | `fecha` | `Date` | The date of the recorded data, formatted as `YYYY-MM-DD`. | +| `hour` | `periodo` | `integer` | The time slot during which the trips occurred. | | `id_origin` | `origen` | `factor` | The origin zone `id` of `district` or `municipalitity`. 
|
| `id_destination` | `destino` | `factor` | The destination zone `id` of `district` or `municipality`. |
+| `distance` | `distancia` | `factor` | The distance range of the trip, categorized into specific intervals such as `0005-002` (500 m to 2 km), `002-005` (2-5 km), `005-010` (5-10 km), `010-050` (10-50 km), `050-100` (50-100 km), and `100+` (more than 100 km). |
| `activity_origin` | `actividad_origen` | `factor` | The type of activity at the origin zone, recoded from `casa`, `otros`, `trabajo_estudio` to `home`, `other`, `work_or_study` respectively. |
| `activity_destination` | `actividad_destino` | `factor` | The type of activity at the destination zone, similarly recoded as for `activity_origin` above. |
| `residence_province_ine_code` | `residencia` | `factor` | The province code of residence of individuals who were making the trips in `n_trips`, encoded as province codes as classified by the Spanish Statistical Office (INE). |
| `residence_province_name` | Derived from `residencia` | `factor` | The full name of the residence province, derived from the province code above. |
-| `hour` | `periodo` | `integer` | The time slot during which the trips occurred. |
-| `distance` | `distancia` | `factor` | The distance range of the trip, categorized into specific intervals such as `0005-002` (500 m to 2 km), `002-005` (2-5 km), `005-010` (5-10km), `010-050` (10-50 km), `050-100` (50-100 km), and `100+` (more than 100 km). |
| `n_trips` | `viajes` | `numeric` | The number of trips for that specific origin-destination pair and time slot. |
| `trips_total_length_km` | `viajes_km` | `numeric` | The total length of trips in kilometers, summing up all trips between the origin and destination zones. |
| `year` | `year` | `integer` | The year of the recorded data, extracted from the date. |

From b2c75bb8ed7c4e8a1a95522395f0d2c16f616072 Mon Sep 17 00:00:00 2001
From: Egor Kotov <kotov.egor@gmail.com>
Date: Tue, 28 Jan 2025 17:29:56 +0100
Subject: [PATCH 7/9] bump heading levels

---
 inst/vignette-include/overall-approach.qmd |  2 +-
 vignettes/convert.qmd                      | 40 ++++++-------
 vignettes/disaggregation.qmd               |  6 +-
 vignettes/flowmaps-interactive.qmd         | 30 +++++-----
 vignettes/flowmaps-static.qmd              | 56 +++++++++----------
 vignettes/quick-get.qmd                    | 16 +++---
 .../v1-2020-2021-mitma-data-codebook.qmd   | 14 ++---
 .../v2-2022-onwards-mitma-data-codebook.qmd | 21 ++++---
 8 files changed, 95 insertions(+), 90 deletions(-)

diff --git a/inst/vignette-include/overall-approach.qmd b/inst/vignette-include/overall-approach.qmd
index 525d8f2..5c4a17b 100644
--- a/inst/vignette-include/overall-approach.qmd
+++ b/inst/vignette-include/overall-approach.qmd
@@ -4,7 +4,7 @@ execute:
---

-# Overall approach to accessing the data
+## Overall approach to accessing the data

If you only need flows data aggregated by day at municipal level, you can use the `spod_quick_get_od()` function. This will download the data directly from the web API and let you analyse it in-memory. More on this in the [Quickly get daily data](https://ropenspain.github.io/spanishoddata/articles/quick-get.html) vignette.

diff --git a/vignettes/convert.qmd b/vignettes/convert.qmd
index 9c3ddca..a6074eb 100644
--- a/vignettes/convert.qmd
+++ b/vignettes/convert.qmd
@@ -15,14 +15,14 @@ execute:
eval: false
---

-# Introduction {#intro}
+## Introduction {#intro}

**TL;DR (too long, didn't read): For analysing more than 1 week of data, use `spod_convert()` to convert the data into `DuckDB` and `spod_connect()` to connect to it for analysis using `{dplyr}`. 
Skip to the [section about it](#duckdb).**

The main focus of this vignette is to show how to get long periods of origin-destination data for analysis. First, we describe and compare the two ways to get the mobility data using origin-destination data as an example. The package functions and overall approaches are the same for working with other types of data available through the package, such as the number of trips, overnight stays and any other data. Then we show how to get a few days of origin-destination data with `spod_get()`. Finally, we show how to download and convert multiple weeks, months or even years of origin-destination data into analysis-ready formats. See description of datasets in the [Codebook and cookbook for v1 (2020-2021) Spanish mobility data](v1-2020-2021-mitma-data-codebook.html) and in the [Codebook and cookbook for v2 (2022 onwards) Spanish mobility data](v2-2022-onwards-mitma-data-codebook.html).

-# Two ways to get the data
+## Two ways to get the data

There are two main ways to import the datasets:

@@ -32,7 +32,7 @@ There are two main ways to import the datasets:

`spod_get()` returns objects that are only appropriate for small datasets representing a few days of the national origin-destination flows. We recommend converting the data into analysis-ready formats (`DuckDB` or `Parquet`) using `spod_convert()` + `spod_connect()`. This will allow you to work with much longer time periods (months and years) on a consumer laptop (with 8-16 GB of memory). See the section below for more details.

-# Analysing large datasets {#analysing-large-datasets}
+## Analysing large datasets {#analysing-large-datasets}

The mobility datasets available through `{spanishoddata}` are very large. Particularly the origin-destination data, which contains millions of rows. These data sets may not fit into the memory of your computer, especially if you plan to run the analysis over multiple days, weeks, months, or even years.

To work with these datasets, we highly recommend using `DuckDB` and `Parquet`. T

Learning to use `DuckDB` and `Parquet` is easy for anyone who has ever worked with `{dplyr}` functions such as `select()`, `filter()`, `mutate()`, `group_by()`, `summarise()`, etc. However, since there is some learning curve to master these new tools, we provide some helper functions for novices to get started and easily open the datasets from `DuckDB` and `Parquet`. Please read the relevant sections below, where we first show how to convert the data, and then how to use it.

-## How to choose between DuckDB, Parquet, and CSV {#duckdb-vs-parquet-csv}
+### How to choose between DuckDB, Parquet, and CSV {#duckdb-vs-parquet-csv}

The main considerations to make when choosing between `DuckDB` and `Parquet` (that you can get with `spod_convert()` + `spod_connect()`), as well as `CSV.gz` (that you can get with `spod_get()`) are analysis speed, convenience of data analysis, and the specific approach you prefer when getting the data. We discuss all three below.

-### Analysis Speed {#speed-comparison}
+#### Analysis Speed {#speed-comparison}

The data format you choose may dramatically impact the speed of analysis (e.g. filtering by dates, calculating number of trips per hour, per week, per month, per origin-destination pair, and any other data aggregation or manipulation). 
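One way to see the difference on your own machine is to time an identical aggregation against each backend. This is only a sketch: it assumes `od_csv` and `od_duckdb` are connections to the same dates, obtained with `spod_get()` and with `spod_convert()` plus `spod_connect()` respectively.

```r
library(dplyr)

# Run the same aggregation against two backends and compare elapsed times
time_query <- function(data) {
  system.time(
    data |>
      group_by(id_origin, id_destination, hour) |>
      summarise(
        mean_hourly_trips = mean(n_trips, na.rm = TRUE),
        .groups = "drop"
      ) |>
      collect()
  )
}

time_query(od_csv)    # CSV.gz accessed via spod_get()
time_query(od_duckdb) # DuckDB accessed via spod_connect()
```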
@@ -70,11 +70,11 @@ data |> @fig-csv-duckdb-parquet-speed also shows that `DuckDB` format will give you the best performance even on low-end systems with limited memory and number of processor cores, conditional on a fast SSD storage. Also note, that if you do choose to work with long time periods using CSV.gz files via `spod_get()`, you will need to balance the amount of memory and processor cores via the `max_n_cpu` and `max_mem_gb` arguments, otherwise the analysis may fail (see the grey area in the figure), when there are too many parallel processes running at the same time with limited memory. -### Convenience of data analysis +#### Convenience of data analysis Regardless of the data format (`DuckDB`, `Parquet`, or `CSV.gz`), the functions you will need for data manipulation and analysis are the same. This is because the analysis is actually performed by the `DuckDB` [@duckdb-r] engine, which presents the data as if it were a regular `data.frame`/`tibble` object in R (almost). So from that point of view, there is no difference between the data formats. You can manipulate the data using `{dplyr}` functions such as `select()`, `filter()`, `mutate()`, `group_by()`, `summarise()`, etc. In the end of any sequence of commands you will need to add `collect()` to execute the whole chain of data manipulations and load the results into memory in an R `data.frame`/`tibble`. We provide examples in the following sections. Please refer to the recommended external tutorials and our own vignettes in the [Analysing large datasets](#analysing-large-datasets) section. -### Scenarios of getting the data +#### Scenarios of getting the data The choice between converting to `DuckDB` and `Parquet` could also be made based on how you plan to work with the data. Specifically whether you want to just download long periods or even all available data, or if you want to get the data gradually, as you progress through with the analysis. @@ -84,7 +84,7 @@ The choice between converting to `DuckDB` and `Parquet` could also be made based - If you only work with a few individual days, you may not notice the advantages of the `DuckDB` or `Parquet` formats. In this case, you can keep using the `CSV.gz` format for the analysis using the `spod_get()` function. This is also useful for quick tutorials, where you only need one or two days of data for demonstration purposes. -# Setup {#setup} +## Setup {#setup} Make sure you have loaded the package: @@ -94,7 +94,7 @@ library(spanishoddata) {{< include ../inst/vignette-include/setup-data-directory.qmd >}} -# Getting a single day with `spod_get()` {#spod-get} +## Getting a single day with `spod_get()` {#spod-get} As you might have seen in the codebooks for [v1](v1-2020-2021-mitma-data-codebook.html) and [v2](v2-2022-onwards-mitma-data-codebook.html) data, you can get a single day's worth of data as an in-memory object with `spod_get()`: @@ -120,11 +120,11 @@ Note that this is a lazily-evaluated in-memory object (note the `:memory:` in th -# Analysing the data using `DuckDB` database {#duckdb} +## Analysing the data using `DuckDB` database {#duckdb} Please make sure you did all the steps in the [Setup](#setup) section above. -## Convert to `DuckDB` {#convert-to-duckdb} +### Convert to `DuckDB` {#convert-to-duckdb} You can download and convert the data into `DuckDB` database in two steps. 
For example, you select a few dates, and download the data manually (note: we use `dates_2` to refer to the fact that we are using v2 data):

@@ -195,7 +195,7 @@ db_2 <- spod_convert(type = "od", zones = "distr", dates = dates_1, overwrite =

In this case, any missing data that has not yet been downloaded will be automatically downloaded, while 2020-02-17 will not be redownloaded, as we already requested it when creating `db_1`. Then the requested dates will be converted into `DuckDB`, overwriting the file with `db_1`. Once again, we save the path to the output `DuckDB` database file into the `db_2` variable.

-## Load the converted `DuckDB` {#load-converted-duckdb}
+### Load the converted `DuckDB` {#load-converted-duckdb}

You can read the introductory information on how to connect to `DuckDB` files [here](https://duckdb.org/docs/api/r){target="_blank"}, however to simplify things for you we created a helper function. So to connect to the data stored at the paths `db_1` and `db_2` you can do the following:

@@ -213,11 +213,11 @@ spod_disconnect(my_od_data_2)

This is useful to free up memory and is necessary if you would like to run `spod_convert()` again and save the data to the same location. Otherwise, it is also helpful to avoid unnecessary possible warnings in the terminal for garbage collected connections.

-# Analysing the data using `Parquet` {#parquet}
+## Analysing the data using `Parquet` {#parquet}

Please make sure you did all the steps in the [Setup](#setup) section above.

-## Convert to `Parquet` {#convert-to-parquet}
+### Convert to `Parquet` {#convert-to-parquet}

The process is exactly the same as for `DuckDB` above. The only difference is that the data is converted to `parquet` format and stored in `SPANISH_OD_DATA_DIR` under the `v1/clean_data/tabular/parquet/` directory for v1 data (change this with the `save_path` argument), and the subfolders are in hive-style format like `year=2020/month=2/day=14` and inside each of these folders a single `parquet` file will be placed containing the data for that day.

@@ -248,7 +248,7 @@ dates <- c(start = "2023-02-14", end = "2023-02-17")
od_parquet <- spod_convert(type = type, zones = zones, dates = dates, save_format = "parquet", save_path = file.path(tempdir(), "od_parquet"))
```

-## Load the converted `Parquet` {#load-converted-parquet}
+### Load the converted `Parquet` {#load-converted-parquet}

Working with these `parquet` files is exactly the same as with `DuckDB` and `Arrow` files. Just like before, you can use the same helper function `spod_connect()` to connect to the `parquet` files:

@@ -269,7 +269,7 @@ my_od_data_3 |>

For analysis, please refer to the recommended external tutorials and our own vignettes in the [Analysing large datasets](#analysing-large-datasets) section.

-# Download all available data {#all-dates}
+## Download all available data {#all-dates}

To prepare origin-destination data v1 (2020-2021) for analysis over the whole period of data availability, please follow the steps below:

dates_v2 <- spod_get_valid_dates(ver = 2)
```

{{< include ../inst/vignette-include/missing-dates-outages.qmd >}}

-## Download all data
+### Download all data

Here the example is for origin-destination on district level for v1 data. You can change the `type` to "number_of_trips" and the `zones` to "municipalities" for v1 data. For v2 data, just use `dates` starting with 2022-01-01 or the `dates_v2` from above. 
Use all other function arguments for v2 in the same way as shown for v1, but also consult the [v2 data codebook](v2-2022-onwards-mitma-data-codebook.html), as it has many more datasets in addition to "origin-destination" and "number_of_trips".

@@ -297,7 +297,7 @@ spod_download(
)
```

-## Convert all data into analysis ready format
+### Convert all data into analysis-ready format

```{r}
save_format <- "duckdb"
@@ -319,11 +319,11 @@ For this conversion, 4 GB of operating memory should be enough, the speed of the

Finally, `analysis_data_storage` will simply store the path to the converted data: either a path to the `DuckDB` database file, or a path to the folder with `Parquet` files.

-## Conversion speed
+### Conversion speed

For reference, converting the whole v1 origin-destination data to `DuckDB` takes about 20 minutes with 4 GB of memory and 3 processor cores. The final size of the `DuckDB` database is about 18 GB; in `Parquet` format, about 26 GB. The raw CSV files in gzip archives are about 20 GB. The v2 data is much larger, with origin-destination tables for 2022 to mid-2024 taking up 150+ GB in raw CSV.gz format.

-## Connecting to and analysing the converted datasets
+### Connecting to and analysing the converted datasets

You can pass the `analysis_data_storage` path to the `spod_connect()` function, whether it points to `DuckDB` or `Parquet`. The function will determine the data type automatically and give you back a `tbl_duckdb_connection`[^1].

diff --git a/vignettes/disaggregation.qmd b/vignettes/disaggregation.qmd
index 8c028ae..56b6863 100644
--- a/vignettes/disaggregation.qmd
+++ b/vignettes/disaggregation.qmd
@@ -32,13 +32,13 @@ library(sf)
library(tmap)
```

-# Introduction
+## Introduction

This vignette demonstrates origin-destination (OD) data disaggregation using the `{odjitter}` package. The package is an implementation of the method described in the paper "Jittering: A Computationally Efficient Method for Generating Realistic Route Networks from Origin-Destination Data" [@lovelace2022jittering] for adding value to OD data by disaggregating desire lines. This can be especially useful for transport planning purposes in which high levels of geographic resolution are required (see also [`od2net`](https://od2net.org/){target="_blank"} for direct network generation from OD data).

-# Data preparation
+## Data preparation

We'll start by loading a week's worth of origin-destination data for the city of Salamanca, building on the example in the README (note: these chunks are not evaluated):

@@ -75,7 +75,7 @@ od_salamanca_sf <- od::od_to_sf(
)
```

-# Disaggregating desire lines
+## Disaggregating desire lines

For this you'll need some additional dependencies:

diff --git a/vignettes/flowmaps-interactive.qmd b/vignettes/flowmaps-interactive.qmd
index b0dafab..49c7829 100644
--- a/vignettes/flowmaps-interactive.qmd
+++ b/vignettes/flowmaps-interactive.qmd
@@ -19,7 +19,7 @@ execute:

This tutorial shows how to make interactive 'flow maps' with data from `{spanishoddata}` and the `{flowmapblue}` [@flowmapblue_r] data visualisation package. We cover two examples. [First](#simple-example), we only visualise the total flows for a single day. In the [second](#advanced-example), more advanced example, we also use the time component, which allows you to interactively filter flows by time of day. For both examples, make sure you first go through the initial [setup steps](#setup). To make static flow maps, please see the [static flow maps](https://ropenspain.github.io/spanishoddata/articles/flowmaps-static.html) tutorial.
-# Setup {#setup}
+## Setup {#setup}

For the basemap in the final visualisation you will need a free Mapbox access token. You can get one at [account.mapbox.com/access-tokens/](https://account.mapbox.com/access-tokens/){target='_blank'} (you need to have a Mapbox account, which is free). You may skip this step, but in that case your interactive flowmap will have no basemap, and the flows will just be drawn on a solid colour background.

@@ -40,11 +40,11 @@ library(sf)

{{< include ../inst/vignette-include/setup-data-directory.qmd >}}

-# Simple example - plot flows data as it is {#simple-example}
+## Simple example - plot flows data as it is {#simple-example}

-## Get data
+### Get data

-### Flows
+#### Flows

Let us get the flows between `districts` for a typical working day `2021-04-07`:

@@ -68,7 +68,7 @@ head(od_20210407)
6 2021-04-07 01001_AM 01001_AM home other 01 Araba/Álava 6 010-050 10.8 119. 2021 4 7
```

-### Zones
+#### Zones

We also get the district zone polygons to match the flows. We use version 1 of the polygons, because the selected date is in 2021, which corresponds to the v1 data (see the relevant [codebook](v1-2020-2021-mitma-data-codebook.qmd)).

@@ -95,9 +95,9 @@ Projected CRS: ETRS89 / UTM zone 30N (N-E)
6 2305005 2305005 23050 23050 Jaén distrito 05 2305005 (((430022.7 4181101, 429…
```

-## Prepare data for visualization
+### Prepare data for visualization

-### Expected data format
+#### Expected data format

To visualise the flows, `{flowmapblue}` expects two `data.frame`s in the following format (we use the package's built-in data on Switzerland for illustration):

@@ -135,7 +135,7 @@ str(flowmapblue::ch_flows)
```

-### Aggregate data - count total flows
+#### Aggregate data - count total flows

```{r}
od_20210407_total <- od_20210407 |>
@@ -158,7 +158,7 @@ head(od_20210407_total)
6 01001_AM 17033 9.61
```

-## Create locations table with coordinates {#create-locations-table}
+### Create locations table with coordinates {#create-locations-table}

We need the coordinates for each origin and destination. We can use the centroids of the `districts_v1` polygons for that.

@@ -185,7 +185,7 @@ head(districts_v1_centroids)
6 -3.8151096 37.86309 2305005
```

-## Create the plot
+### Create the plot

Remember that for the map to have a basemap, you need to have set up your Mapbox access token in the [setup](#setup) section of this vignette.

@@ -230,11 +230,11 @@ flowmap_anim
![Screenshot demonstrating the animated interactive flowmap](../man/figures/flowmapblue-animated.png){width=80%}

-# Advanced example - time filter {#advanced-example}
+## Advanced example - time filter {#advanced-example}

After following the simple example, let us now add a time filter to the flows. We will use the `flowmapblue` function to plot flows between `districts_v1_centroids` for a typical working day `2021-04-07`.

-## Prepare data for visualization
+### Prepare data for visualization

Just like before, we aggregate the data and rename some columns. This time we will combine the `date` and `hour` (which corresponds to the hour of the day) to produce timestamps, so that the flows can be interactively filtered by time of day.

@@ -260,7 +260,7 @@ head(od_20210407_time)
6 08054 0818403 2021-04-07 02:00:00 7.11
```

-#### Filter the zones
+### Filter the zones

Because we are now using the flows for each hour of the day, there are 24 times more rows in this data than in the simple example. Therefore it will take longer to generate the plot, and the resulting visualisation may be slower.
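To get a feel for the difference, you can compare the row counts of the two aggregated tables created above (a quick check, assuming both were collected into memory as in the preceding chunks):

```{r}
# hourly flows have one row per origin-destination-hour combination,
# so expect roughly 24 times the rows of the daily totals
nrow(od_20210407_total)
nrow(od_20210407_time)
```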
To create a more manageable example, let us filter the data to just Barcelona and its surrounding areas.

@@ -313,7 +313,7 @@ head(zones_barcelona_fua_coords)
6 2.152419 41.41014 0801906
```

-#### Prepare the flows
+### Prepare the flows

Now we can use the zone `id`s from the `zones_barcelona_fua` data to select the flows that correspond to Barcelona and the 10 km radius around it.

@@ -323,7 +323,7 @@ od_20210407_time_barcelona <- od_20210407_time |>
```

-#### Visualise the flows for Barcelona and surrounding areas
+### Visualise the flows for Barcelona and surrounding areas

Now, we can create a new plot with this data.

diff --git a/vignettes/flowmaps-static.qmd b/vignettes/flowmaps-static.qmd
index 10420bb..cd57300 100644
--- a/vignettes/flowmaps-static.qmd
+++ b/vignettes/flowmaps-static.qmd
@@ -19,7 +19,7 @@ execute:

This tutorial shows how to make static 'flow maps' with data from `{spanishoddata}` and the `{flowmapper}` [@flowmapper-r] data visualisation package. We cover two examples. [First](#simple-example), we only use the origin-destination flows and district zones that you can get using the `{spanishoddata}` package. In the [second](#advanced-example), more advanced example, we also use the `{mapSpain}` and `{hexSticker}` packages to re-create the `{spanishoddata}` logo. For both examples, make sure you first go through the initial [setup steps](#setup). To make interactive flow maps, please see the [interactive flow maps](https://ropenspain.github.io/spanishoddata/articles/flowmaps-interactive.html) tutorial.

-# Setup {#setup}
+## Setup {#setup}

```{r}
library(spanishoddata)
@@ -30,11 +30,11 @@ library(sf)

{{< include ../inst/vignette-include/setup-data-directory.qmd >}}

-# Simple example - plot flows data as it is {#simple-example}
+## Simple example - plot flows data as it is {#simple-example}

-## Get data
+### Get data

-### Flows
+#### Flows

Let us get the flows between `districts` for a typical working day `2021-04-07`:

@@ -59,7 +59,7 @@ head(od_20210407)
# ℹ 1 more variable: day
```

-### Zones
+#### Zones

We also get the district zone polygons to match the flows. We use version 1 of the polygons, because the selected date is in 2021, which corresponds to the v1 data (see the relevant [codebook](v1-2020-2021-mitma-data-codebook.html)).

@@ -86,7 +86,7 @@ Projected CRS: ETRS89 / UTM zone 30N (N-E)
6 2305005 2305005 23050 23050 Jaén distrito 05 2305005 (((430022.7 4181101, 429…
```

-## Aggregate data - count total flows
+### Aggregate data - count total flows

```{r}
od_20210407_total <- od_20210407 |>
@@ -96,7 +96,7 @@ od_20210407_total <- od_20210407 |>
arrange(o, d, value)
```

-## Reshape flows for visualization
+### Reshape flows for visualization

The `{flowmapper}` package was developed to visualise origin-destination 'flow' data [@flowmapper-r]. This package expects the data to be in the following format:

@@ -117,7 +117,7 @@ Another `data.frame` with the node `id`s or names and their coordinates. The coo

`y`: The y coordinate of the node;

-### Prepare the flows table
+#### Prepare the flows table

The previous code chunk created `od_20210407_total` with the column names expected by `{flowmapper}`.

@@ -137,7 +137,7 @@ head(od_20210407_total)
6 2408910 4718608 4.75
```

-### Prepare the nodes table with coordinates
+#### Prepare the nodes table with coordinates

We need the coordinates for each origin and destination. We can use the centroids of the `districts_v1` polygons for that.
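The diff above elides the actual centroid computation, but a minimal sketch with `{sf}` might look like the following (assuming `districts_v1` is the zones object loaded earlier; the vignette's real code may differ in details):

```{r}
districts_v1_coords <- districts_v1 |>
  st_centroid() |>               # one point per district polygon
  st_coordinates() |>            # extract the X/Y coordinate matrix
  as.data.frame() |>
  setNames(c("x", "y")) |>       # {flowmapper} expects x/y columns...
  cbind(name = districts_v1$id)  # ...plus a `name` column for the node id
```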
@@ -162,9 +162,9 @@ head(districts_v1_coords)
6 428302.1 4190937 2305005
```

-## Plot the flows
+### Plot the flows

-### Plot the entire country
+#### Plot the entire country

Now that we have a data structure that matches `{flowmapper}`'s expected data format, we can plot a sample of the data (a plot containing all flows would be very 'busy' and would resemble a haystack!). The `k_node` argument in the `add_flowmap` function can be used to reduce this busyness.

@@ -226,11 +226,11 @@ ggsave("./man/figures/flows_plot_all_districts.png", plot = flows_plot_all_distr

![](../man/figures/flows_plot_all_districts.png)

-### Zoom in to the city level
+#### Zoom in to the city level

Let us filter the flows and zones data to just a specific functional urban area to take a closer look at the flows.

-#### Filter the zones
+##### Filter the zones

Let us select all districts that correspond to Barcelona and a 10 km radius around it. Thanks to the `district_names_in_v2` column in the zones data, we can easily select all the districts that correspond to Barcelona, and then apply a spatial join to select additional districts around the polygons that correspond to Barcelona.

@@ -281,7 +281,7 @@ head(zones_barcelona_fua_coords)
6 930702.3 4597116 0801906
```

-#### Prepare the flows
+##### Prepare the flows

Now we can use the zone `id`s from the `zones_barcelona_fua` data to select the flows that correspond to Barcelona and the 10 km radius around it.

@@ -290,7 +290,7 @@ od_20210407_total_barcelona <- od_20210407_total |>
filter(o %in% zones_barcelona_fua$id & d %in% zones_barcelona_fua$id)
```

-#### Visualise the flows for Barcelona and surrounding areas
+##### Visualise the flows for Barcelona and surrounding areas

Now, we can create a new plot with this data. Once again, we need the `k_node` argument to tweak the aggregation of nodes and flows. Feel free to tweak it yourself and see how the results change.

@@ -350,7 +350,7 @@ ggsave("./man/figures/flows_plot_barcelona.png", plot = flows_plot_barcelona, wi

![](../man/figures/flows_plot_barcelona.png)

-# Advanced example - aggregate flows for `{spanishoddata}` logo {#advanced-example}
+## Advanced example - aggregate flows for `{spanishoddata}` logo {#advanced-example}

For the advanced example we will need two additional packages: `{mapSpain}` [@R-mapspain] and `{hexSticker}` [@R-hexSticker].

@@ -367,9 +367,9 @@ library(sf)
```

-## Get data
+### Get data

-### Flows
+#### Flows

Just like in the simple example above, we will need the flows to visualise.

@@ -385,7 +385,7 @@ Also get the spatial data for the zones. We are using the version 2 of zones, be

districts <- spod_get_zones("distr", ver = 2)
```

-### Map of Spain
+#### Map of Spain

Ultimately, we would like to plot the flows on a map of Spain, so we will aggregate the flows for visualisation to avoid visual clutter. We therefore also need a nice map of Spain, which we will get using the `{mapSpain}` [@R-mapspain] package:

@@ -397,9 +397,9 @@ spain_for_join <- esp_get_ccaa(moveCAN = FALSE)
```

We are getting two sets of boundaries. The first one has the Canary Islands moved closer to mainland Spain, for nicer visualisation. The second one keeps the original location of the islands, so that we can spatially join them to the `districts` zones data we got from `{spanishoddata}`.
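That spatial join, which the aggregation steps below rely on to build the `ca_distr` table, might look roughly like this (a sketch; `districts_with_ca` is a hypothetical name, and joining via centroids is one way to avoid districts matching two communities at shared borders):

```{r}
districts_with_ca <- districts |>
  st_centroid() |>                         # represent each district by a point
  st_transform(st_crs(spain_for_join)) |>  # align coordinate reference systems
  st_join(spain_for_join) |>               # attach autonomous community attributes
  st_drop_geometry()
```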
-## Flows aggregation
+### Flows aggregation

-### Aggregate raw origin destination data by original `id`s
+#### Aggregate raw origin destination data by original `id`s

Let us count the total number of trips made between all locations on our selected day of `2022-04-06`:

@@ -433,7 +433,7 @@ flows_by_district
```

-### Match `id`s of `districts` with autonomous communities
+#### Match `id`s of `districts` with autonomous communities

Now we need to do a spatial join between `districts` and `spain_for_join` to find out which districts fall within which autonomous community. We use `spain_for_join` because, if we used `spain_for_vis`, the `districts` in the Canary Islands would not match the boundaries of the islands.

@@ -473,7 +473,7 @@ ca_distr
```

This way we get a table with `districts` `id`s and their corresponding autonomous community names.

-### Count flows between pairs of autonomous communities
+#### Count flows between pairs of autonomous communities

We can now add these `id`s to the total flows by `districts` `id` pairs and calculate total flows between autonomous communities:

@@ -513,7 +513,7 @@ flows_by_ca
# ℹ Use `print(n = ...)` to see more rows
```

-## Reshape flows for visualization
+### Reshape flows for visualization

We are going to use the `{flowmapper}` [@flowmapper-r] package to plot the flows. This package expects the data to be in the following format:

@@ -534,7 +534,7 @@ Another `data.frame` with the node `id`s or names and their coordinates. The coo

`y`: The y coordinate of the node;

-### Prepare the flows table
+#### Prepare the flows table

The data we have right now in `flows_by_ca` already has the correct format expected by `{flowmapper}`.

@@ -554,7 +554,7 @@ head(flows_by_ca)
6 Andalusia Canary Islands 1899.
```

-### Prepare the nodes table with coordinates
+#### Prepare the nodes table with coordinates

We need the coordinates for each origin and destination. We can use the centroids of the `spain_for_vis` polygons for that.

@@ -581,7 +581,7 @@ head(spain_for_vis_coords)
6 -4.0300438 43.19772 Cantabria
```

-## Plot the flows
+### Plot the flows

Now we have a data structure that matches `{flowmapper}`'s expected data format:

@@ -637,7 +637,7 @@ ggsave("./man/figures/logo-before-hex.png", plot = flows_plot, width = 6, height

The image may look a bit bleak, but when we put it on a sticker, it will look great.

-## Make the sticker
+### Make the sticker

We make the sticker using the `{hexSticker}` [@hexSticker-r] package.

diff --git a/vignettes/quick-get.qmd b/vignettes/quick-get.qmd
index 39c17ed..af0bedd 100644
--- a/vignettes/quick-get.qmd
+++ b/vignettes/quick-get.qmd
@@ -15,11 +15,11 @@ execute:
eval: false
---

-# Introduction {#intro}
+## Introduction {#intro}

This vignette demonstrates how to get minimal daily aggregated data on the number of trips between municipalities using the `spod_quick_get_od()` function. With this function, you only get total trips for a single day, and none of the additional variables that are available in the full [v2 (2022 onwards) data set](v2-2022-onwards-mitma-data-codebook.html). The advantage of this function is that it is much faster than downloading the full data from source CSV files using `spod_get()`, as each CSV file for a single day is about 200 MB in size.
Also, this way of getting the data is much less demanding on your computer, as you are only getting a small table from the internet (less than 1 MB), and no data processing (such as the aggregation from more detailed hourly data with extra columns that happens when you use the `spod_get()` function) is required on your computer.

-# Setup {#setup}
+## Setup {#setup}

Make sure you have loaded the package:

@@ -36,9 +36,9 @@ Setting a local data directory in this case is optional, as the data is download

:::

-# Get the data {#get-data}
+## Get the data {#get-data}

-## Get all flows with at least 1000 trips
+### Get all flows with at least 1000 trips

To get the data, use the `spod_quick_get_od()` function. There is no need to specify whether you need municipalities or districts, as only municipality-level data can be accessed with this function. The `min_trips` argument specifies the minimum number of trips to include in the data. If you set `min_trips` to 0, you will get all data for all origin-destination pairs for the specified date.

@@ -89,7 +89,7 @@ od_1000
# ℹ Use `print(n = ...)` to see more rows
```

-## Get only trips of certain length
+### Get only trips of certain length

To get only trips of a certain length, use the `distances` argument.

@@ -140,7 +140,7 @@ od_long
# ℹ Use `print(n = ...)` to see more rows
```

-## Get only trips between certain municipalities
+### Get only trips between certain municipalities

To get only trips between certain municipalities, use the `id_origin` and `id_destination` arguments.

@@ -153,7 +153,7 @@ municipalities <- spod_get_zones("muni", ver = 2)
head(municipalities)
```

-### All trips from Madrid
+#### All trips from Madrid

Let us select all locations with Madrid in the name:

@@ -212,7 +212,7 @@ $ trips_total_length_km 11120, 7268, 75798, 3385, 82, 296, 1036, …
# ℹ Use `print(n = ...)` to see more rows
```

-### All trips from Madrid to Barcelona
+#### All trips from Madrid to Barcelona

Similarly, you can set limits on the destination municipalities:

diff --git a/vignettes/v1-2020-2021-mitma-data-codebook.qmd b/vignettes/v1-2020-2021-mitma-data-codebook.qmd
index 6be588d..d65498b 100644
--- a/vignettes/v1-2020-2021-mitma-data-codebook.qmd
+++ b/vignettes/v1-2020-2021-mitma-data-codebook.qmd
@@ -63,11 +63,11 @@ Using the instructions below, set the data folder for the package to download th

![The overview of package functions to get the data](../man/figures/package-functions-overview.svg){#fig-overall-flow width="78%"}

-# 1. Spatial data with zoning boundaries
+## 1. Spatial data with zoning boundaries

The boundary data is provided at two geographic levels: [`Districts`](#districts) and [`Municipalities`](#municipalities). It's important to note that these do not always align with the official Spanish census districts and municipalities. To comply with data protection regulations, certain aggregations had to be made to districts and municipalities.

-## 1.1 `Districts` {#districts}
+### 1.1 `Districts` {#districts}

`Districts` correspond to official census districts in cities; however, in those with lower population density, they are grouped together. In rural areas, one district is often equal to a municipality, but municipalities with low population are combined into larger units to preserve the privacy of individuals in the dataset. Therefore, there are 2,850 'districts' compared to the 10,494 official census districts on which they are based.
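You can verify this count yourself after loading the zones (a quick sketch):

```{r}
library(spanishoddata)
districts_v1 <- spod_get_zones("distr", ver = 1)
nrow(districts_v1)  # expected: 2850 aggregated 'districts'
```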
@@ -90,7 +90,7 @@ Data structure:

| `district_names_in_v2` | A string with a semicolon-separated list of names of district polygons defined in the [v2 version of this data](v2-2022-onwards-mitma-data-codebook.html) (which covers the year 2022 and onwards) that correspond to the polygon with `id` above. |
| `district_ids_in_v2` | A string with a semicolon-separated list of identifiers of district polygons defined in the [v2 version of this data](v2-2022-onwards-mitma-data-codebook.html) (which covers the year 2022 and onwards) that correspond to the polygon with `id` above. |

-## 1.2 `Municipalities` {#municipalities}
+### 1.2 `Municipalities` {#municipalities}

`Municipalities` are made up of official municipalities of a certain size; however, they have also been aggregated in cases of lower population density. As a result, there are 2,205 municipalities compared to the 8,125 official municipalities on which they are based.

@@ -115,11 +115,11 @@ Data structure:

The spatial data you get via the `spanishoddata` package is downloaded directly from the source; the geometries of the polygons are automatically fixed if any are invalid. The zone identifiers are stored in the `id` column. Apart from that `id` column, the original zones files do not have any metadata. However, as seen above, using the `spanishoddata` package you get many additional columns that provide a semantic connection between the official statistical zones used by the Spanish government and the zones you can get for the v2 data (for 2022 onwards).

-# 2. Mobility data
+## 2. Mobility data

All mobility data is referenced via `id_origin`, `id_destination`, or other location identifiers (mostly labelled as `id`) with the two sets of zones described above.

-## 2.1. Origin-Destination data {#od-data}
+### 2.1. Origin-Destination data {#od-data}

The origin-destination data contain the number of trips and the distance travelled between `districts` or `municipalities` in Spain for every hour of every day between 2020-02-14 and 2021-05-09. Each flow also has attributes such as the trip purpose (composed of the type of activity (`home`/`work_or_study`/`other`) at both the origin and destination), the province of residence of the individuals making the trip, and the distance covered while making the trip. See the detailed attributes in the table below. @fig-flows-barcelona shows an example of total flows in the province of Barcelona on Feb 14th, 2020.

@@ -203,7 +203,7 @@ The same summary operation as provided in the example above can be done with the

{{< include ../inst/vignette-include/csv-date-filter-note.qmd >}}

-## 2.2. Population by trip count data {#ptc-data}
+### 2.2. Population by trip count data {#ptc-data}

The population by trip count data shows the number of individuals in each district or municipality, categorized by the number of trips they make: 0, 1, 2, or more than 2.

@@ -239,6 +239,6 @@ Because this data is small, we can actually load it completely into memory:
nt_dist_tbl <- nt_dist |> dplyr::collect()
```

-# Advanced use
+## Advanced use

For more advanced use, especially for analysing longer periods (months or even years), please see [Download and convert mobility datasets](convert.html).
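As a pointer to what that workflow looks like in practice, here is a compact sketch (the dates are just an illustration — any valid v1 range works):

```{r}
library(spanishoddata)

# convert a short v1 period to DuckDB, analyse it, then disconnect
db_path <- spod_convert(
  type = "od",
  zones = "distr",
  dates = c(start = "2020-03-01", end = "2020-03-07")
)
od <- spod_connect(db_path)
# ... dplyr verbs ending with collect() go here ...
spod_disconnect(od)
```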
diff --git a/vignettes/v2-2022-onwards-mitma-data-codebook.qmd b/vignettes/v2-2022-onwards-mitma-data-codebook.qmd
index d46946c..c1e166c 100644
--- a/vignettes/v2-2022-onwards-mitma-data-codebook.qmd
+++ b/vignettes/v2-2022-onwards-mitma-data-codebook.qmd
@@ -68,11 +68,11 @@ Using the instructions below, set the data folder for the package to download th

![The overview of package functions to get the data](../man/figures/package-functions-overview.svg){#fig-overall-flow width="78%"}

-# 1. Spatial data with zoning boundaries
+## 1. Spatial data with zoning boundaries

The boundary data is provided at three geographic levels: [`Districts`](#districts), [`Municipalities`](#municipalities), and [`Large Urban Areas`](#lua). It's important to note that these do not always align with the official Spanish census districts and municipalities. To comply with data protection regulations, certain aggregations had to be made to districts and municipalities.

-## 1.1 `Districts` {#districts}
+### 1.1 `Districts` {#districts}

`Districts` correspond to official census districts in cities; however, in those with lower population density, they are grouped together. In rural areas, one district is often equal to a municipality, but municipalities with low population are combined into larger units to preserve the privacy of individuals in the dataset. Therefore, there are 3,792 'districts' compared to the 10,494 official census districts on which they are based. There are also [NUTS3 statistical regions](https://ec.europa.eu/eurostat/web/nuts){target="_blank"} covering France (94 units) and Portugal (23 units). Therefore, there is a total of 3,909 zones in the `Districts` dataset.

@@ -99,7 +99,7 @@ Data structure:

[^2]: This is likely the population as of the end of 2021 or the start of 2022. The population for a few districts is missing. Instead of population, residence and overnight stays data may be used as a proxy with caution. Also, newer population figures may be obtained and joined with the provided zones using the reference tables that match the zone ids with official municipal and census district ids from INE.

-## 1.2 `Municipalities` {#municipalities}
+### 1.2 `Municipalities` {#municipalities}

`Municipalities` are made up of official municipalities of a certain size; however, they have also been aggregated in cases of lower population density. As a result, there are 2,618 municipalities compared to the 8,125 official municipalities on which they are based. There are also [NUTS3 statistical regions](https://ec.europa.eu/eurostat/web/nuts){target="_blank"} covering France (94 units) and Portugal (23 units). Therefore, there is a total of 2,735 zones in the `Municipalities` dataset.

@@ -126,7 +126,7 @@ Data structure:

[^3]: This is likely the population as of the end of 2021 or the start of 2022. The population for a few districts is missing. Instead of population, residence and overnight stays data may be used as a proxy with caution. Also, newer population figures may be obtained and joined with the provided zones using the reference tables that match the zone ids with official municipal and census district ids from INE.

-## 1.3 `LUAs (Large Urban Areas)` {#luas}
+### 1.3 `LUAs (Large Urban Areas)` {#luas}

`Large Urban Areas (LUAs)` have essentially the same spatial units as [`Municipalities`](#municipalities), but they are not aggregated. Therefore, there are 2,086 locations in the `LUAs` dataset.
There are also [NUTS3 statistical regions](https://ec.europa.eu/eurostat/web/nuts){target="_blank"} covering France (94 units) and Portugal (23 units). Therefore, there is a total of 2,203 zones in the `LUAs` dataset.

@@ -152,11 +152,11 @@ Data structure:

[^4]: This is likely the population as of the end of 2021 or the start of 2022. The population for a few districts is missing. Instead of population, residence and overnight stays data may be used as a proxy with caution. Also, newer population figures may be obtained and joined with the provided zones using the reference tables that match the zone ids with official municipal and census district ids from INE.

-# 2. Mobility data
+## 2. Mobility data

All mobility data is referenced via `id_origin`, `id_destination`, or other location identifiers (mostly labelled as `id`) with the three sets of zones described above.

-## 2.1. Origin-destination data {#od-data}
+### 2.1. Origin-destination data {#od-data}

The origin-destination data contain the number of trips between `districts`, `municipalities`, or `large urban areas (LUAs)` in Spain for every hour of every day between 2022-02-01 and the latest currently available date (2024-06-30 at the time of writing). Each flow also has attributes such as the trip purpose (composed of the type of activity (`home`/`work_or_study`/`frequent_activity`/`infrequent_activity`) at both the origin and destination, but also the age, sex, and income of each group of individuals traveling between the origin and destination), the province of residence of the individuals making the trip, and the distance covered while making the trip. See the detailed attributes in the table below.

@@ -241,7 +241,7 @@ The same summary operation as provided in the example above can be done with the

{{< include ../inst/vignette-include/csv-date-filter-note.qmd >}}

-## 2.2. Population by trip count data {#ptc-data}
+### 2.2. Population by trip count data {#ptc-data}

The population by trip count data shows the number of individuals in each district or municipality, categorized by the number of trips they make (0, 1, 2, or more than 2), age, and sex.

@@ -272,7 +272,7 @@ Because this data is small, we can actually load it completely into memory:
nt_dist_tbl <- nt_dist |> dplyr::collect()
```

-## 2.3. Population by overnight stay data {#pos-data}
+### 2.3. Population by overnight stay data {#pos-data}

This dataset provides the number of people who spend the night in each location, also identifying their place of residence down to the census district level according to the [INE encoding](https://www.ine.es/ss/Satellite?c=Page&p=1259952026632&pagename=ProductosYServicios%2FPYSLayout&cid=1259952026632&L=1){target="_blank"}.

@@ -304,3 +304,8 @@ Because this data is small, we can actually load it completely into memory:

```{r}
os_dist_tbl <- os_dist |> dplyr::collect()
```
+
+
+## Advanced use
+
+For more advanced use, especially for analysing longer periods (months or even years), please see [Download and convert mobility datasets](convert.html).
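For example, the elided download step for the overnight-stays table described in section 2.3 might look like this (a sketch; the `"os"` shorthand is documented in `spod_download()`, and the dates are arbitrary):

```{r}
os_dist <- spod_get(
  type = "os", # short for "overnight_stays"
  zones = "distr",
  dates = c(start = "2022-03-01", end = "2022-03-02")
)
os_dist_tbl <- os_dist |> dplyr::collect()
```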
From b7ec8b2a2375deed51f685e91d4966a9fe32c309 Mon Sep 17 00:00:00 2001
From: Egor Kotov <kotov.egor@gmail.com>
Date: Fri, 7 Feb 2025 11:47:08 +0100
Subject: [PATCH 8/9] add lifecycle badges to functions, add deprecation
 message about time_slot column

---
 DESCRIPTION | 1 +
 NAMESPACE | 1 +
 R/available-data.R | 4 +++
 R/codebook.R | 3 ++
 R/connect.R | 3 ++
 R/convert.R | 3 ++
 R/data-dir.R | 8 +++++
 R/disconnect.R | 3 ++
 R/download_data.R | 4 +++
 R/get-zones.R | 3 ++
 R/get.R | 20 +++++++++--
 R/internal-utils.R | 7 ++++
 R/quick-get.R | 4 +++
 R/spanishoddata-package.R | 7 ++++
 .../v1-od-distritos-clean-csv-view-en.sql | 3 +-
 .../v2-od-distritos-clean-csv-view-en.sql | 3 +-
 man/figures/lifecycle-deprecated.svg | 21 +++++++++++
 man/figures/lifecycle-experimental.svg | 21 +++++++++++
 man/figures/lifecycle-stable.svg | 29 +++++++++++++++
 man/figures/lifecycle-superseded.svg | 21 +++++++++++
 man/spanishoddata-package.Rd | 36 +++++++++++++++++++
 man/spod_available_data.Rd | 2 ++
 man/spod_codebook.Rd | 2 ++
 man/spod_connect.Rd | 2 ++
 man/spod_convert.Rd | 2 ++
 man/spod_disconnect.Rd | 2 ++
 man/spod_download.Rd | 2 ++
 man/spod_get.Rd | 2 ++
 man/spod_get_data_dir.Rd | 2 ++
 man/spod_get_valid_dates.Rd | 4 ++-
 man/spod_get_zones.Rd | 2 ++
 man/spod_quick_get_od.Rd | 2 ++
 man/spod_set_data_dir.Rd | 2 ++
 33 files changed, 226 insertions(+), 5 deletions(-)
 create mode 100644 R/spanishoddata-package.R
 create mode 100644 man/figures/lifecycle-deprecated.svg
 create mode 100644 man/figures/lifecycle-experimental.svg
 create mode 100644 man/figures/lifecycle-stable.svg
 create mode 100644 man/figures/lifecycle-superseded.svg
 create mode 100644 man/spanishoddata-package.Rd

diff --git a/DESCRIPTION b/DESCRIPTION
index 3e90a18..2b2695e 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -37,6 +37,7 @@ Imports:
 glue,
 here,
 httr2,
+ lifecycle,
 lubridate,
 memuse,
 parallelly,
diff --git a/NAMESPACE b/NAMESPACE
index 4f14a3d..e817008 100644
--- a/NAMESPACE
+++ b/NAMESPACE
@@ -12,6 +12,7 @@ export(spod_get_valid_dates)
 export(spod_get_zones)
 export(spod_quick_get_od)
 export(spod_set_data_dir)
+importFrom(lifecycle,deprecated)
 importFrom(rlang,.data)
 importFrom(stats,median)
 importFrom(utils,URLencode)
diff --git a/R/available-data.R b/R/available-data.R
index c7cf0cd..e1caf00 100644
--- a/R/available-data.R
+++ b/R/available-data.R
@@ -1,5 +1,9 @@
 #' Get available data list
 #'
+#' @description
+#'
+#' `r lifecycle::badge("stable")`
+#'
 #' Get a table with links to available data files for the specified data version. Optionally check (see arguments) if certain files have already been downloaded into the cache directory specified with the SPANISH_OD_DATA_DIR environment variable (set by \link{spod_set_data_dir}) or a custom path specified with the `data_dir` argument.
 #'
 #' @param ver Integer. Can be 1 or 2. The version of the data to use. v1 spans 2020-2021, v2 covers 2022 and onwards.
diff --git a/R/codebook.R b/R/codebook.R
index 25bbc37..d2a1ad5 100644
--- a/R/codebook.R
+++ b/R/codebook.R
@@ -1,6 +1,9 @@
 #' View codebooks for v1 and v2 open mobility data
 #'
 #' @description
+#'
+#' `r lifecycle::badge("stable")`
+#'
 #' Opens the relevant vignette with a codebook for v1 (2020-2021) or v2 (2022 onwards) data, or provides a webpage if the vignette is missing.
#'
#'
diff --git a/R/connect.R b/R/connect.R
index d9d686b..8b27d2b 100644
--- a/R/connect.R
+++ b/R/connect.R
@@ -1,6 +1,9 @@
 #' Connect to data converted to `DuckDB` or hive-style `parquet` files
 #'
 #' @description
+#'
+#' `r lifecycle::badge("stable")`
+#'
 #' This function allows the user to quickly connect to the data converted to DuckDB with the \link{spod_convert} function. It simplifies the connection process. The user is free to use the `DBI` and `DuckDB` packages to connect to the data manually, or to use the `arrow` package to connect to the `parquet` files folder.
 #'
 #' @param data_path A path to the `DuckDB` database file with the '.duckdb' extension, or a path to the folder with `parquet` files. Either one should have been created with the \link{spod_convert} function.
diff --git a/R/convert.R b/R/convert.R
index d9159b3..ef030d4 100644
--- a/R/convert.R
+++ b/R/convert.R
@@ -1,6 +1,9 @@
 #' Convert data from plain text to duckdb or parquet format
 #'
 #' @description
+#'
+#' `r lifecycle::badge("stable")`
+#'
 #' Converts data for faster analysis into either a `DuckDB` file or `parquet` files in a hive-style directory structure. Running analysis on these files is sometimes 100x faster than working with raw CSV files, especially when these are in gzip archives. To connect to converted data, please use 'mydata <- \link{spod_connect}(data_path = path_returned_by_spod_convert)' passing the path to where the data was saved. The connected `mydata` can be analysed using `dplyr` functions such as \link[dplyr]{select}, \link[dplyr]{filter}, \link[dplyr]{mutate}, \link[dplyr]{group_by}, \link[dplyr]{summarise}, etc. At the end of any sequence of commands you will need to add \link[dplyr]{collect} to execute the whole chain of data manipulations and load the results into memory in an R `data.frame`/`tibble`. For more in-depth usage of such data, please refer to DuckDB documentation and examples at [https://duckdb.org/docs/api/r#dbplyr](https://duckdb.org/docs/api/r#dbplyr) . Some more useful examples can be found here [https://arrow-user2022.netlify.app/data-wrangling#combining-arrow-with-duckdb](https://arrow-user2022.netlify.app/data-wrangling#combining-arrow-with-duckdb) . You may also use the `arrow` package to work with parquet files [https://arrow.apache.org/docs/r/](https://arrow.apache.org/docs/r/).
 #'
 #' @param save_format A `character` vector of length 1 with values "duckdb" or "parquet". Defaults to "duckdb". If `NULL`, it is automatically inferred from the `save_path` argument. If only `save_format` is provided, `save_path` will be set to the default location set in `SPANISH_OD_DATA_DIR` environment variable using `Sys.setenv(SPANISH_OD_DATA_DIR = 'path/to/your/cache/dir')` or \link{spod_set_data_dir}`(path = 'path/to/your/cache/dir')`. So for v1 data that path would be `/clean_data/v1/tabular/duckdb/` or `/clean_data/v1/tabular/parquet/`.
diff --git a/R/data-dir.R b/R/data-dir.R
index 94de7ce..e23c277 100644
--- a/R/data-dir.R
+++ b/R/data-dir.R
@@ -1,5 +1,9 @@
 #' Set the data directory
 #'
+#' @description
+#'
+#' `r lifecycle::badge("stable")`
+#'
 #' This function sets the data directory in the environment variable SPANISH_OD_DATA_DIR, so that all other functions in the package can access the data. It also creates the directory if it doesn't exist.
 #'
 #' @param data_dir The data directory to set.
@@ -53,6 +57,10 @@ spod_set_data_dir <- function(

 #' Get the data directory
 #'
+#' @description
+#'
+#' `r lifecycle::badge("stable")`
+#'
 #' This function retrieves the data directory from the environment variable SPANISH_OD_DATA_DIR.
 #' If the environment variable is not set, it returns the temporary directory.
 #' @inheritParams global_quiet_param
diff --git a/R/disconnect.R b/R/disconnect.R
index 95ab438..6742554 100644
--- a/R/disconnect.R
+++ b/R/disconnect.R
@@ -1,6 +1,9 @@
 #' Safely disconnect from data and free memory
 #'
 #' @description
+#'
+#' `r lifecycle::badge("stable")`
+#'
 #' This function ensures that `DuckDB` connections to CSV.gz files (created via `spod_get()`), as well as to `DuckDB` files or folders of `parquet` files (created via `spod_convert()`) are closed properly to prevent conflicting connections. Essentially this is just a wrapper around `DBI::dbDisconnect()` that reaches out into the `.$src$con` object of the `tbl_duckdb_connection` connection object that is returned to the user via `spod_get()` and `spod_connect()`. After disconnecting the database, it also frees up memory by running `gc()`.
 #' @param tbl_con A `tbl_duckdb_connection` connection object that you get from either `spod_get()` or `spod_connect()`.
 #' @param free_mem A `logical`. Whether to free up memory by running `gc()`. Defaults to `TRUE`.
diff --git a/R/download_data.R b/R/download_data.R
index e94790d..140b5b5 100644
--- a/R/download_data.R
+++ b/R/download_data.R
@@ -1,5 +1,9 @@
 #' Download the data files of specified type, zones, and dates
 #'
+#' @description
+#'
+#' `r lifecycle::badge("stable")`
+#'
 #' This function downloads the data files of the specified type, zones, dates and data version.
 #' @param type The type of data to download. Can be `"origin-destination"` (or just `"od"`), or `"number_of_trips"` (or just `"nt"`) for v1 data. For v2 data `"overnight_stays"` (or just `"os"`) is also available. More data types will be supported in the future. See codebooks for v1 and v2 data in vignettes with `spod_codebook(1)` and `spod_codebook(2)` (\link{spod_codebook}).
 #' @param zones The zones for which to download the data. Can be `"districts"` (or `"dist"`, `"distr"`, or the original Spanish `"distritos"`) or `"municipalities"` (or `"muni"`, `"municip"`, or the original Spanish `"municipios"`) for both data versions. Additionally, these can be `"large_urban_areas"` (or `"lua"`, or the original Spanish `"grandes_areas_urbanas"`, or `"gau"`) for v2 data (2022 onwards).
diff --git a/R/get-zones.R b/R/get-zones.R
index 62a488f..77207be 100644
--- a/R/get-zones.R
+++ b/R/get-zones.R
@@ -1,6 +1,9 @@
 #' Get zones
 #'
 #' @description
+#'
+#' `r lifecycle::badge("stable")`
+#'
 #' Get spatial zones for the specified data version. Supports both v1 (2020-2021) and v2 (2022 onwards) data.
 #'
 #' @inheritParams spod_download
diff --git a/R/get.R b/R/get.R
index 93ed132..9c6f9ff 100644
--- a/R/get.R
+++ b/R/get.R
@@ -1,6 +1,11 @@
 #' Get tabular mobility data
 #'
-#' @description This function creates a DuckDB lazy table connection object from the specified type and zones. It checks for missing data and downloads it if necessary. The connnection is made to the raw CSV files in gzip archives, so analysing the data through this connection may be slow if you select more than a few days. You can manipulate this object using `dplyr` functions such as \link[dplyr]{select}, \link[dplyr]{filter}, \link[dplyr]{mutate}, \link[dplyr]{group_by}, \link[dplyr]{summarise}, etc.
In the end of any sequence of commands you will need to add \link[dplyr]{collect} to execute the whole chain of data manipulations and load the results into memory in an R `data.frame`/`tibble`. See codebooks for v1 and v2 data in vignettes with \link{spod_codebook}(1) and \link{spod_codebook}(2).
+#'
+#' @description
+#'
+#' `r lifecycle::badge("stable")`
+#'
+#' This function creates a DuckDB lazy table connection object from the specified type and zones. It checks for missing data and downloads it if necessary. The connection is made to the raw CSV files in gzip archives, so analysing the data through this connection may be slow if you select more than a few days. You can manipulate this object using `dplyr` functions such as \link[dplyr]{select}, \link[dplyr]{filter}, \link[dplyr]{mutate}, \link[dplyr]{group_by}, \link[dplyr]{summarise}, etc. At the end of any sequence of commands you will need to add \link[dplyr]{collect} to execute the whole chain of data manipulations and load the results into memory in an R `data.frame`/`tibble`. See codebooks for v1 and v2 data in vignettes with \link{spod_codebook}(1) and \link{spod_codebook}(2).
 #'
 #' If you want to analyse longer periods of time (especially several months or even the whole data over several years), consider using \link{spod_convert} and then \link{spod_connect}.
 #'
@@ -69,7 +74,7 @@ spod_get <- function(
 checkmate::assert_directory_exists(data_dir, access = "rw")
 checkmate::assert_directory_exists(temp_path, access = "rw")
 checkmate::assert_flag(ignore_missing_dates)
-
+
 # simple null check is enough here, as spod_dates_arugument_to_dates_seq will do additional checks anyway
 if (is.null(dates)) {
 message("`dates` argument is undefined. Please set `dates='cached_v1'` or `dates='cached_v2'` to convert all data that was previously downloaded. Alternatively, specify at least one date between 2020-02-14 and 2021-05-09 (for v1 data) or from 2022-01-01 onwards (for v2). Any missing data will be downloaded before conversion. For more details on the dates argument, see ?spod_get.")
@@ -77,6 +82,17 @@
 # normalise type
 type <- spod_match_data_type(type = type)
+
+  # deprecation message for od time_slot column
+  if (type == "od") {
+    lifecycle::deprecate_warn(
+      when = "0.1.0.9000",
+      what = I("`time_slot`"),
+      with = I("`hour`"),
+      details = "The `time_slot` column in origin-destination data is now called `hour`. `time_slot` will be available in addition to `hour` and contain the same data in the outputs of `spod_get()` and `spod_convert()` until the end of 2025."
+    )
+  }
+
 # normalise zones
 zones <- spod_zone_names_en2es(zones)
diff --git a/R/internal-utils.R b/R/internal-utils.R
index fd25d63..c350d43 100644
--- a/R/internal-utils.R
+++ b/R/internal-utils.R
@@ -226,7 +226,14 @@ spod_expand_dates_from_regex <- function(date_regex) {

 #' Get valid dates for the specified data version
 #'
+#' @description
+#'
+#' `r lifecycle::badge("stable")`
+#'
+#' Get all metadata for the requested data version and identify all dates available for download.
+#'
 #' @inheritParams spod_available_data
+#'
 #' @return A vector of type `Date` with all possible valid dates for the specified data version (v1 for 2020-2021 and v2 for 2022 onwards).
#' @export
#' @examplesIf interactive()
diff --git a/R/quick-get.R b/R/quick-get.R
index 0209254..79e44c0 100644
--- a/R/quick-get.R
+++ b/R/quick-get.R
@@ -1,5 +1,9 @@
 #' Get daily trip counts per origin-destination municipality from 2022 onward
 #'
+#' @description
+#'
+#' `r lifecycle::badge("experimental")`
+#'
 #' This function provides a quick way to get daily aggregated (no hourly data) trip counts per origin-destination municipality from v2 data (2022 onward). Compared to \link[spanishoddata]{spod_get}, which downloads large CSV files, this function downloads the data directly from the GraphQL API. No data aggregation is performed on your computer (unlike in \link[spanishoddata]{spod_get}), so you do not need to worry about memory usage and do not have to use a powerful computer with multiple CPU cores just to get this simple data. Only about 1 MB of data is downloaded for a single day. The limitation of this function is that it can only retrieve data for a single day at a time and only with the total number of trips and total km travelled. So it is not possible to get any of the extra variables available in the full dataset via \link[spanishoddata]{spod_get}.
 #'
 #' @param date A character or Date object specifying the date for which to retrieve the data. If date is a character, the date must be in "YYYY-MM-DD" or "YYYYMMDD" format.
diff --git a/R/spanishoddata-package.R b/R/spanishoddata-package.R
new file mode 100644
index 0000000..425b3c1
--- /dev/null
+++ b/R/spanishoddata-package.R
@@ -0,0 +1,7 @@
+#' @keywords internal
+"_PACKAGE"
+
+## usethis namespace: start
+#' @importFrom lifecycle deprecated
+## usethis namespace: end
+NULL
diff --git a/inst/extdata/sql-queries/v1-od-distritos-clean-csv-view-en.sql b/inst/extdata/sql-queries/v1-od-distritos-clean-csv-view-en.sql
index 2c9d95e..46faf3a 100644
--- a/inst/extdata/sql-queries/v1-od-distritos-clean-csv-view-en.sql
+++ b/inst/extdata/sql-queries/v1-od-distritos-clean-csv-view-en.sql
@@ -81,5 +81,6 @@ CREATE VIEW od_csv_clean AS SELECT
 viajes_km AS trips_total_length_km,
 CAST(year AS INTEGER) AS year,
 CAST(month AS INTEGER) AS month,
- CAST(day AS INTEGER) AS day
+ CAST(day AS INTEGER) AS day,
+ periodo AS time_slot
 FROM od_csv_raw;
diff --git a/inst/extdata/sql-queries/v2-od-distritos-clean-csv-view-en.sql b/inst/extdata/sql-queries/v2-od-distritos-clean-csv-view-en.sql
index 69b195f..c4901af 100644
--- a/inst/extdata/sql-queries/v2-od-distritos-clean-csv-view-en.sql
+++ b/inst/extdata/sql-queries/v2-od-distritos-clean-csv-view-en.sql
@@ -110,5 +110,6 @@ CREATE VIEW od_csv_clean AS SELECT
 viajes_km AS trips_total_length_km,
 CAST(year AS INTEGER) AS year,
 CAST(month AS INTEGER) AS month,
- CAST(day AS INTEGER) AS day
+ CAST(day AS INTEGER) AS day,
+ periodo AS time_slot
 FROM od_csv_raw;
diff --git a/man/figures/lifecycle-deprecated.svg b/man/figures/lifecycle-deprecated.svg
new file mode 100644
index 0000000..b61c57c
--- /dev/null
+++ b/man/figures/lifecycle-deprecated.svg
@@ -0,0 +1,21 @@
+[badge image reading "lifecycle: deprecated" — SVG markup stripped in extraction]
diff --git a/man/figures/lifecycle-experimental.svg b/man/figures/lifecycle-experimental.svg
new file mode 100644
index 0000000..5d88fc2
--- /dev/null
+++ b/man/figures/lifecycle-experimental.svg
@@ -0,0 +1,21 @@
+[badge image reading "lifecycle: experimental" — SVG markup stripped in extraction]
diff --git a/man/figures/lifecycle-stable.svg b/man/figures/lifecycle-stable.svg
new file mode 100644
index 0000000..9bf21e7
--- /dev/null
+++ 
b/man/figures/lifecycle-stable.svg
@@ -0,0 +1,29 @@
+[badge image reading "lifecycle: stable" — SVG markup stripped in extraction]
diff --git a/man/figures/lifecycle-superseded.svg b/man/figures/lifecycle-superseded.svg
new file mode 100644
index 0000000..db8d757
--- /dev/null
+++ b/man/figures/lifecycle-superseded.svg
@@ -0,0 +1,21 @@
+[badge image reading "lifecycle: superseded" — SVG markup stripped in extraction]
diff --git a/man/spanishoddata-package.Rd b/man/spanishoddata-package.Rd
new file mode 100644
index 0000000..75c6fce
--- /dev/null
+++ b/man/spanishoddata-package.Rd
@@ -0,0 +1,36 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/spanishoddata-package.R
+\docType{package}
+\name{spanishoddata-package}
+\alias{spanishoddata}
+\alias{spanishoddata-package}
+\title{spanishoddata: Get Spanish Origin-Destination Data}
+\description{
+\if{html}{\figure{logo.png}{options: style='float: right' alt='logo' width='120'}}
+
+Gain seamless access to origin-destination (OD) data from the Spanish Ministry of Transport, hosted at \url{https://www.transportes.gob.es/ministerio/proyectos-singulares/estudios-de-movilidad-con-big-data/opendata-movilidad}. This package simplifies the management of these large datasets by providing tools to download zone boundaries, handle associated origin-destination data, and process it efficiently with the 'duckdb' database interface. Local caching minimizes repeated downloads, streamlining workflows for researchers and analysts. Extensive documentation is available at \url{https://ropenspain.github.io/spanishoddata/index.html}, offering guides on creating static and dynamic mobility flow visualizations and transforming large datasets into analysis-ready formats.
+}
+\seealso{
+Useful links:
+\itemize{
+ \item \url{https://rOpenSpain.github.io/spanishoddata/}
+ \item \url{https://github.com/rOpenSpain/spanishoddata}
+ \item Report bugs at \url{https://github.com/rOpenSpain/spanishoddata/issues}
+}
+
+}
+\author{
+\strong{Maintainer}: Egor Kotov \email{kotov.egor@gmail.com} (\href{https://orcid.org/0000-0001-6690-5345}{ORCID})
+
+Authors:
+\itemize{
+ \item Robin Lovelace \email{rob00x@gmail.com} (\href{https://orcid.org/0000-0001-5679-6536}{ORCID})
+}
+
+Other contributors:
+\itemize{
+ \item Eugeni Vidal-Tortosa (\href{https://orcid.org/0000-0001-5199-4103}{ORCID}) [contributor]
+}
+
+}
+\keyword{internal}
diff --git a/man/spod_available_data.Rd b/man/spod_available_data.Rd
index cfe6033..5ba5145 100644
--- a/man/spod_available_data.Rd
+++ b/man/spod_available_data.Rd
@@ -33,6 +33,8 @@ A tibble with links, release dates of files in the data, dates of data coverage,
}
}
\description{
+\ifelse{html}{\href{https://lifecycle.r-lib.org/articles/stages.html#stable}{\figure{lifecycle-stable.svg}{options: alt='[Stable]'}}}{\strong{[Stable]}}
+
Get a table with links to available data files for the specified data version. Optionally check (see arguments) if certain files have already been downloaded into the cache directory specified with the SPANISH_OD_DATA_DIR environment variable (set by \link{spod_set_data_dir}) or a custom path specified with the \code{data_dir} argument.
}
\examples{
diff --git a/man/spod_codebook.Rd b/man/spod_codebook.Rd
index 1fb6532..0dd5447 100644
--- a/man/spod_codebook.Rd
+++ b/man/spod_codebook.Rd
@@ -13,6 +13,8 @@ spod_codebook(ver = 1)
Nothing, opens the vignette if it is installed. If the vignette is missing, prints a message with a link to a webpage with the codebook.
}
\description{
+\ifelse{html}{\href{https://lifecycle.r-lib.org/articles/stages.html#stable}{\figure{lifecycle-stable.svg}{options: alt='[Stable]'}}}{\strong{[Stable]}}
+
Opens the relevant vignette with a codebook for v1 (2020-2021) or v2 (2022 onwards) data, or provides a webpage if the vignette is missing.
}
\examples{
diff --git a/man/spod_connect.Rd b/man/spod_connect.Rd
index ab93b55..1b56e8d 100644
--- a/man/spod_connect.Rd
+++ b/man/spod_connect.Rd
@@ -30,6 +30,8 @@ spod_connect(
a \code{DuckDB} table connection object.
}
\description{
+\ifelse{html}{\href{https://lifecycle.r-lib.org/articles/stages.html#stable}{\figure{lifecycle-stable.svg}{options: alt='[Stable]'}}}{\strong{[Stable]}}
+
This function allows the user to quickly connect to the data converted to DuckDB with the \link{spod_convert} function. It simplifies the connection process. The user is free to use the \code{DBI} and \code{DuckDB} packages to connect to the data manually, or to use the \code{arrow} package to connect to the \code{parquet} files folder.
}
\examples{
diff --git a/man/spod_convert.Rd b/man/spod_convert.Rd
index a055659..11bdf97 100644
--- a/man/spod_convert.Rd
+++ b/man/spod_convert.Rd
@@ -69,6 +69,8 @@ You can also set \code{save_path}. If it ends with ".duckdb", will save to \code
Path to saved \code{DuckDB} database file or to a folder with \code{parquet} files in hive-style directory structure.
}
\description{
+\ifelse{html}{\href{https://lifecycle.r-lib.org/articles/stages.html#stable}{\figure{lifecycle-stable.svg}{options: alt='[Stable]'}}}{\strong{[Stable]}}
+
Converts data for faster analysis into either a \code{DuckDB} file or \code{parquet} files in a hive-style directory structure. Running analysis on these files is sometimes 100x faster than working with raw CSV files, especially when these are in gzip archives. To connect to converted data, please use 'mydata <- \link{spod_connect}(data_path = path_returned_by_spod_convert)' passing the path to where the data was saved. The connected \code{mydata} can be analysed using \code{dplyr} functions such as \link[dplyr]{select}, \link[dplyr]{filter}, \link[dplyr]{mutate}, \link[dplyr]{group_by}, \link[dplyr]{summarise}, etc. At the end of any sequence of commands you will need to add \link[dplyr]{collect} to execute the whole chain of data manipulations and load the results into memory in an R \code{data.frame}/\code{tibble}. For more in-depth usage of such data, please refer to DuckDB documentation and examples at \url{https://duckdb.org/docs/api/r#dbplyr} . Some more useful examples can be found here \url{https://arrow-user2022.netlify.app/data-wrangling#combining-arrow-with-duckdb} . You may also use the \code{arrow} package to work with parquet files \url{https://arrow.apache.org/docs/r/}.
}
\examples{
diff --git a/man/spod_disconnect.Rd b/man/spod_disconnect.Rd
index 5434ded..61747d5 100644
--- a/man/spod_disconnect.Rd
+++ b/man/spod_disconnect.Rd
@@ -15,6 +15,8 @@ spod_disconnect(tbl_con, free_mem = TRUE)
No return value, called for side effect of disconnecting from the database and freeing up memory.
}
\description{
+\ifelse{html}{\href{https://lifecycle.r-lib.org/articles/stages.html#stable}{\figure{lifecycle-stable.svg}{options: alt='[Stable]'}}}{\strong{[Stable]}}
+
This function ensures that \code{DuckDB} connections to CSV.gz files (created via \code{spod_get()}), as well as to \code{DuckDB} files or folders of \code{parquet} files (created via \code{spod_convert()}) are closed properly to prevent conflicting connections. Essentially this is just a wrapper around \code{DBI::dbDisconnect()} that reaches out into the \code{.$src$con} object of the \code{tbl_duckdb_connection} connection object that is returned to the user via \code{spod_get()} and \code{spod_connect()}. After disconnecting the database, it also frees up memory by running \code{gc()}.
}
\examples{
diff --git a/man/spod_download.Rd b/man/spod_download.Rd
index 081aca5..fe4262b 100644
--- a/man/spod_download.Rd
+++ b/man/spod_download.Rd
@@ -50,6 +50,8 @@ The possible values can be any of the following:
Nothing. If \code{return_local_file_paths = TRUE}, a \code{character} vector of the paths to the downloaded files.
}
\description{
+\ifelse{html}{\href{https://lifecycle.r-lib.org/articles/stages.html#stable}{\figure{lifecycle-stable.svg}{options: alt='[Stable]'}}}{\strong{[Stable]}}
+
This function downloads the data files of the specified type, zones, dates and data version.
}
\examples{
diff --git a/man/spod_get.Rd b/man/spod_get.Rd
index 8ec6675..bd940b4 100644
--- a/man/spod_get.Rd
+++ b/man/spod_get.Rd
@@ -59,6 +59,8 @@ The possible values can be any of the following:
A DuckDB lazy table connection object of class \code{tbl_duckdb_connection}.
}
\description{
+\ifelse{html}{\href{https://lifecycle.r-lib.org/articles/stages.html#stable}{\figure{lifecycle-stable.svg}{options: alt='[Stable]'}}}{\strong{[Stable]}}
+
This function creates a DuckDB lazy table connection object from the specified type and zones. It checks for missing data and downloads it if necessary. The connection is made to the raw CSV files in gzip archives, so analysing the data through this connection may be slow if you select more than a few days. You can manipulate this object using \code{dplyr} functions such as \link[dplyr]{select}, \link[dplyr]{filter}, \link[dplyr]{mutate}, \link[dplyr]{group_by}, \link[dplyr]{summarise}, etc. At the end of any sequence of commands you will need to add \link[dplyr]{collect} to execute the whole chain of data manipulations and load the results into memory in an R \code{data.frame}/\code{tibble}. See codebooks for v1 and v2 data in vignettes with \link{spod_codebook}(1) and \link{spod_codebook}(2).

If you want to analyse longer periods of time (especially several months or even the whole data over several years), consider using \link{spod_convert} and then \link{spod_connect}.
diff --git a/man/spod_get_data_dir.Rd b/man/spod_get_data_dir.Rd
index 45ad881..40b5cd1 100644
--- a/man/spod_get_data_dir.Rd
+++ b/man/spod_get_data_dir.Rd
@@ -13,6 +13,8 @@ spod_get_data_dir(quiet = FALSE)
A \code{character} vector of length 1 containing the path to the data directory where the package will download and convert the data.
}
\description{
+\ifelse{html}{\href{https://lifecycle.r-lib.org/articles/stages.html#stable}{\figure{lifecycle-stable.svg}{options: alt='[Stable]'}}}{\strong{[Stable]}}
+
This function retrieves the data directory from the environment variable SPANISH_OD_DATA_DIR.
If the environment variable is not set, it returns the temporary directory.
 }
diff --git a/man/spod_get_valid_dates.Rd b/man/spod_get_valid_dates.Rd
index fb0e9f5..f4d2007 100644
--- a/man/spod_get_valid_dates.Rd
+++ b/man/spod_get_valid_dates.Rd
@@ -13,7 +13,9 @@ spod_get_valid_dates(ver = NULL)
 A vector of type \code{Date} with all possible valid dates for the specified data version (v1 for 2020-2021 and v2 for 2022 onwards).
 }
 \description{
-Get valid dates for the specified data version
+\ifelse{html}{\href{https://lifecycle.r-lib.org/articles/stages.html#stable}{\figure{lifecycle-stable.svg}{options: alt='[Stable]'}}}{\strong{[Stable]}}
+
+Get all metadata for the requested data version and identify all dates available for download.
 }
 \examples{
 \dontshow{if (interactive()) (if (getRversion() >= "3.4") withAutoprint else force)(\{ # examplesIf}
diff --git a/man/spod_get_zones.Rd b/man/spod_get_zones.Rd
index ade9217..81674c8 100644
--- a/man/spod_get_zones.Rd
+++ b/man/spod_get_zones.Rd
@@ -50,6 +50,8 @@ The columns for v2 (2022 onwards) data include:
 }
 }
 \description{
+\ifelse{html}{\href{https://lifecycle.r-lib.org/articles/stages.html#stable}{\figure{lifecycle-stable.svg}{options: alt='[Stable]'}}}{\strong{[Stable]}}
+
 Get spatial zones for the specified data version. Supports both v1 (2020-2021) and v2 (2022 onwards) data.
 }
 \examples{
diff --git a/man/spod_quick_get_od.Rd b/man/spod_quick_get_od.Rd
index f7681bb..d8dab4e 100644
--- a/man/spod_quick_get_od.Rd
+++ b/man/spod_quick_get_od.Rd
@@ -34,6 +34,8 @@ A \code{tibble} containing the flows for the specified date, minimum number of j
 }
 }
 \description{
+\ifelse{html}{\href{https://lifecycle.r-lib.org/articles/stages.html#experimental}{\figure{lifecycle-experimental.svg}{options: alt='[Experimental]'}}}{\strong{[Experimental]}}
+
 This function provides a quick way to get daily aggregated (no hourly data) trip counts per origin-destination municipality from v2 data (2022 onwards). Compared to \link[spanishoddata]{spod_get}, which downloads large CSV files, this function downloads the data directly from the GraphQL API. No data aggregation is performed on your computer (unlike in \link[spanishoddata]{spod_get}), so you do not need to worry about memory usage and do not have to use a powerful computer with multiple CPU cores just to get this simple data. Only about 1 MB of data is downloaded for a single day. The limitation of this function is that it can only retrieve data for a single day at a time, and only with the total number of trips and total km travelled, so it is not possible to get any of the extra variables available in the full dataset via \link[spanishoddata]{spod_get}.
 }
 \examples{
diff --git a/man/spod_set_data_dir.Rd b/man/spod_set_data_dir.Rd
index 6a2709e..805a4e6 100644
--- a/man/spod_set_data_dir.Rd
+++ b/man/spod_set_data_dir.Rd
@@ -15,6 +15,8 @@ spod_set_data_dir(data_dir, quiet = FALSE)
 Nothing. If \code{quiet} is \code{FALSE}, prints a message with the path and confirmation that the path exists.
 }
 \description{
+\ifelse{html}{\href{https://lifecycle.r-lib.org/articles/stages.html#stable}{\figure{lifecycle-stable.svg}{options: alt='[Stable]'}}}{\strong{[Stable]}}
+
 This function sets the data directory in the environment variable SPANISH_OD_DATA_DIR, so that all other functions in the package can access the data. It also creates the directory if it doesn't exist.
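[Editor's note: a hedged sketch of how the helpers documented above fit together. `spod_get_valid_dates(ver = ...)` and `spod_set_data_dir()` follow the usage shown above; the argument names of `spod_get_zones()` and `spod_quick_get_od()` (here `zones`, `ver`, `date`, `min_trips`) are guesses based on their descriptions and return values and should be checked against the function reference.]

```r
# Sketch combining the helpers above; spod_quick_get_od() argument names
# (`date`, `min_trips`) are assumptions based on its return description.
library(spanishoddata)

spod_set_data_dir("~/spanish_od_data")

# all dates available for download in v2 (2022 onwards) data
dates_v2 <- spod_get_valid_dates(ver = 2)
range(dates_v2)

# municipality boundaries for mapping the flows (assumed argument values)
municipalities <- spod_get_zones(zones = "municipalities", ver = 2)

# daily aggregated municipality-to-municipality flows for one day (~1 MB)
flows <- spod_quick_get_od(date = "2022-10-06", min_trips = 100)
```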
 }
 \examples{

From ba3b4f54130c310a96e6df5c3bc2dafc97d2b643 Mon Sep 17 00:00:00 2001
From: Egor Kotov
Date: Fri, 7 Feb 2025 11:51:55 +0100
Subject: [PATCH 9/9] add warnings about time_slot column in the vignettes
 with codebooks

---
 vignettes/v1-2020-2021-mitma-data-codebook.qmd    | 2 +-
 vignettes/v2-2022-onwards-mitma-data-codebook.qmd | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/vignettes/v1-2020-2021-mitma-data-codebook.qmd b/vignettes/v1-2020-2021-mitma-data-codebook.qmd
index d65498b..32d4955 100644
--- a/vignettes/v1-2020-2021-mitma-data-codebook.qmd
+++ b/vignettes/v1-2020-2021-mitma-data-codebook.qmd
@@ -130,7 +130,7 @@ Here are the variables you can find in both the `district` and `municipality` le
 | **English Variable Name** | **Original Variable Name** | **Type** | **Description** |
 |----------------|----------------|----------------|------------------------|
 | `date` | `fecha` | `Date` | The date of the recorded data, formatted as `YYYY-MM-DD`. |
-| `hour` | `periodo` | `integer` | The time slot during which the trips occurred. |
+| `hour` | `periodo` | `integer` | The time slot during which the trips occurred. *Note*: this column used to be called `time_slot`; it will remain available in the output of package functions until the end of 2025, but going forward please use `hour` instead. |
 | `id_origin` | `origen` | `factor` | The origin zone `id` of `district` or `municipality`. |
 | `id_destination` | `destino` | `factor` | The destination zone `id` of `district` or `municipality`. |
 | `distance` | `distancia` | `factor` | The distance range of the trip, categorized into specific intervals such as `0005-002` (500 m to 2 km), `002-005` (2-5 km), `005-010` (5-10 km), `010-050` (10-50 km), `050-100` (50-100 km), and `100+` (more than 100 km). |
diff --git a/vignettes/v2-2022-onwards-mitma-data-codebook.qmd b/vignettes/v2-2022-onwards-mitma-data-codebook.qmd
index c1e166c..2d0c5ed 100644
--- a/vignettes/v2-2022-onwards-mitma-data-codebook.qmd
+++ b/vignettes/v2-2022-onwards-mitma-data-codebook.qmd
@@ -165,7 +165,7 @@ Here are the variables you can find in the `district`, `municipality` and `large
 | **English Variable Name** | **Original Variable Name** | **Type** | **Description** |
 |---------|---------|---------|---------------------------------------------|
 | `date` | `fecha` | `Date` | The date of the recorded data, formatted as `YYYY-MM-DD`. |
-| `hour` | `periodo` | `integer` | The time slot during which the trips occurred. |
+| `hour` | `periodo` | `integer` | The time slot during which the trips occurred. *Note*: this column used to be called `time_slot`; it will remain available in the output of package functions until the end of 2025, but going forward please use `hour` instead. |
 | `id_origin` | `origen` | `factor` | The origin zone `id` of `district`, `municipality`, or `large urban area`. |
 | `id_destination` | `destino` | `factor` | The destination zone `id` of `district`, `municipality`, or `large urban area`. |
 | `distance` | `distancia` | `factor` | The distance range of the trip, categorized into specific intervals such as `0.5-2` (500 m to 2 km), `2-10` (2-10 km), `10-50` (10-50 km), and `>50` (50 or more km). |
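[Editor's note: given the `time_slot` to `hour` rename documented in the two codebook tables above, downstream code written against older releases may want a small compatibility shim. A sketch follows, assuming `od` is a collected `data.frame`/`tibble` of OD data.]

```r
# Compatibility shim for the time_slot -> hour rename noted above; `od` is
# assumed to be a collected data.frame/tibble of OD data.
library(dplyr)

if ("time_slot" %in% colnames(od) && !("hour" %in% colnames(od))) {
  od <- rename(od, hour = time_slot)
}

# per the note above, only `hour` is guaranteed after the end of 2025;
# use it going forward, e.g. to keep morning-peak trips
morning_peak <- filter(od, hour %in% 7:9)
```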