Import TRY citations, and some useful DB functions #1848
Conversation
Also, bugfix `insert_table`.
This is to catch the automatic addition of IDs and other timestamps, for instance by Postgres. Also, add insert function unit tests.
Only use columns that match the SQL database.
When the data are huge, it's nice to get more regular status updates.
Use the `here` package to always work relative to `db` package directory. Also, set paths and connections in a `config.R` script.
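For reference, the pattern is roughly as follows; the `config.R` contents and the paths shown here are assumptions, not taken from this PR:

```r
# Sketch: resolve paths relative to the project root rather than the working directory
source(here::here("config.R"))                       # hypothetical: sets DB connection, input paths
try_file <- here::here("data", "try_citations.csv")  # hypothetical data path
```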
Nice code overall! I left a bunch of comments, but most are opinion and could be ignored/argued against.
Do the tests pass on your machine? They fail on mine -- I can send session_info/logs if helpful, but won't spam you unless asked.
base/db/tests/testthat/test.insert.R
library(PEcAn.DB)
library(DBI)
library(testthat)
library(dplyr)
Can some of these be used namespaced instead? Having a package attached at test time but not at run time means the tests won't catch e.g. a forgotten `dplyr::` in the function code.
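For instance, a fully namespaced test might look something like this (a sketch; the setup and expectation are just illustrative, and the return class is assumed from the PR description):

```r
test_that("insert_table returns a lazy table", {
  con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
  on.exit(DBI::dbDisconnect(con), add = TRUE)
  dplyr::copy_to(con, iris[1, ], name = "iris")
  res <- PEcAn.DB::insert_table(iris[2:5, ], "iris", con)
  # no library(dplyr) / library(DBI) attached -- everything is namespaced,
  # so a forgotten `dplyr::` inside the package would surface here
  expect_true(inherits(res, "tbl_lazy"))
})
```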
base/db/tests/testthat/test.insert.R
library(dplyr)
context("SQL insertion helper functions")

if (requireNamespace("RSQLite")) {
I'd consider reformatting any flavor of "if package exists" as a test block that starts with a `skip_if_not_installed`:
test_that("RSQLite-dependent tests work", {
skip_if_not_installed("RSQLite")
...
test_that("Subset of iris...", ...)
test_that("Only subset...", ...)
...
})
Note that this relies on nested `test_that` calls, which I didn't know you could do until just now! But this seems like a cleaner expression of the actual intent: one big set of tests for whether RSQLite "works", skipped as a unit when not available but made up of smaller tests that each have their own setup and expectations.
base/db/tests/testthat/test.insert.R
if (requireNamespace("RSQLite")) {
  library(RSQLite)
  iris_file <- tempfile(fileext = ".sqlite")
  irisdb <- dbConnect(SQLite(), iris_file)
Would `dbConnect(RSQLite::SQLite(), ":memory:")` work and simplify the setup logic here?
Also, does this need a paired `dbDisconnect`? If so, it can be wrapped in `teardown()` for a guaranteed run at the end of the test file.
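Putting the two suggestions together, the setup could be as small as this (sketch; assumes a testthat version that provides `teardown()`):

```r
irisdb <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
teardown(DBI::dbDisconnect(irisdb))  # runs once, after the whole test file finishes
```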
  testthat (>= 1.0.2)
  RSQLite,
  testthat (>= 1.0.2),
  tidyverse,
I'm only 60% sure on this, but my impression is that the tidyverse package is analogous to PEcAn.all -- i.e. it exists for load-time convenience, and for use in packages it's better to list the specific packages used. Looks like tibble, tidyr, purrr?
Point taken, but it is only a `Suggests` rather than an `Imports`/`Depends`, and this way it matches what I'm doing in the scripts. Though, grudgingly, I'll admit it is good coding practice to be as explicit as possible about where things are coming from.
In general, my approach is to stick packages required for scripts but not internal functions in `Suggests`, since I think those aren't installed unless you explicitly ask for them in `devtools::install` (right?).
Also, as an aside, get a load of the `import` package, which brings pythonic `from Z import X as Y` syntax to R. I might switch to using this in a bunch of my interactive scripts.
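For example (a minimal sketch; the aliases are made up):

```r
# roughly "from dplyr import filter as keep_rows, select as take_cols"
import::from(dplyr, keep_rows = filter, take_cols = select)
keep_rows(iris, Species == "setosa")
```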
Fair point on suggests vs imports, but I see calls to all three of tibble, purrr, tidyr in `R/`, so those should probably be in imports anyway 😉
base/db/R/insert_table.R
  )
  values_sub <- values[, use_cols]
  # Load one row to get column types
  sql_row <- dplyr::tbl(con, table) %>% head(1) %>% collect()
collect -> dplyr::collect
base/db/R/insert_table.R
#' irisdb <- src_sqlite(irisfile, create = TRUE)
#' copy_to(irisdb, iris[1,], name = "iris", overwrite = TRUE)
#' insert_table(iris[-1,], "iris", irisdb$con)
#' tbl(irisdb, "iris")
This example fails too, with a similar error to the one above.
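Not prescribing a fix, but one possible rewrite of the example, using an in-memory database (sketch only; mirrors the approach discussed elsewhere in this thread):

```r
#' @examples
#' irisdb <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
#' dplyr::copy_to(irisdb, iris[1, ], name = "iris", overwrite = TRUE)
#' insert_table(iris[-1, ], "iris", irisdb)
#' dplyr::tbl(irisdb, "iris")
#' DBI::dbDisconnect(irisdb)
```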
base/db/R/insert_table.R
insert_table <- function(values, table, con, coerce_col_class = TRUE) {
  values_fixed <- match_dbcols(values, table, con, coerce_col_class)
  insert_query <- build_insert_query(values_fixed, table, con = con)
  print(insert_query)
debugging artifact?
Yes! I was wondering why my citations-inserting code was so verbose.
base/db/R/insert_table.R
  dbplyr::build_sql(
    dbplyr::sql("INSERT INTO"),
    dbplyr::sql(" "),
Just curious, no change necessary: is this a stylistic choice, or is there a reason the spaces can't go into their preceding strings, e.g. `dbplyr::sql("INSERT INTO "), dbplyr::ident(table), ...`?
Just style. But I discovered more recently that it's even better to drop the `dbplyr::sql` calls, since they're implicit in `dbplyr::build_sql`. In general, `build_sql` is a very smart function -- it even successfully parses things like this:

build_sql("SELECT ", ident(c("col1", "col2", "col3")))

...so moving forward, I'm gonna try to rely as much as I can on its internal logic to get things right.
Actually, in my upcoming commit, I ended up replacing `dbplyr::build_sql` with `glue::glue_sql`, which is a much cooler function. It allows syntax like this:

my_table <- "test_table"
my_columns <- c("col1", "col2", "col3")
my_values <- list(list(1, 2, 3), list(4, 5, 6))
glue::glue_sql(
  "INSERT INTO {`my_table`} ({`my_columns`*}) VALUES {my_values*}",
  .con = con  # glue_sql needs a DBI connection for quoting; `con` assumed to be open
)
base/db/R/db_merge_into.R
  }
  sql_tbl <- dplyr::tbl(con, table)
  values_merge <- dplyr::anti_join(values_fixed, sql_tbl, by = by, copy = TRUE)
  if (nrow(values_merge) < 1 || ncol(values_merge) < 1) {
Does this mean the rest of the function is a no-op? If so, would there be any potential savings from returning early here?
Not quite. You don't write to bety, but you do still perform the merge step at the end. This is important because it allows you to use this function to retrieve bety IDs and other bety metadata for your local table.
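Roughly speaking, the last step amounts to something like this (a sketch of the idea, not the exact code in the diff):

```r
# Regardless of whether any rows were inserted, join the database table back
# onto the local values so the caller gets bety IDs, timestamps, etc.
# as a dbplyr lazy table
dplyr::inner_join(dplyr::tbl(con, table), values_fixed, by = by, copy = TRUE)
```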
base/db/tests/testthat/test.insert.R
iris_insert <- iris2[2:10,]
.insert <- insert_table(iris_insert, "iris", irisdb)
iris_insert_test <- tbl(irisdb, "iris") %>% collect()
test_that(
Another style thing: I like to put any setup code that's specific to a given test inside that test's `test_that` block, e.g. here I'd move at least `iris_insert` through `iris_insert_test` inside the `test_that` to make it clearer that they don't (...or at least aren't supposed to...) affect the other tests.
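Concretely, something along these lines (names taken from the surrounding diff; the expectation is just illustrative):

```r
test_that("inserted rows show up in the table", {
  iris_insert <- iris2[2:10, ]                               # setup local to this test
  .insert <- insert_table(iris_insert, "iris", irisdb)
  iris_insert_test <- dplyr::collect(dplyr::tbl(irisdb, "iris"))
  expect_equal(nrow(iris_insert_test), 10)                   # 1 original row + 9 inserted
})
```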
In general, yes, I agree. Here, this is a bit fuzzy because the tests build on each other (i.e. the merge only makes sense if the table has already been previously inserted), but I can certainly slide a few of the lines into the `test_that` blocks.
@infotroph Thanks for the helpful and thorough review! I agree with pretty much all of your comments, so I'll go through them one by one. If you wouldn't mind, I'd like to have a look at the logs for the tests that fail on your machine.

Where is the error in there? I only see the warning about encoding.

Right! Sorry, here it is with the default reporter (and after a
Notable changes:
- Use `glue::glue_sql` instead of `dbplyr::build_sql` for generating the SQL query. The syntax is much cleaner. This also fixes the breaking tests by improving the way column names are quoted.
- Remove the `.here` file.
- In tests and examples, use a `":memory:"` SQLite database instead of a temporary file.
- Use explicit `package::function` syntax in tests, to test for NAMESPACE issues.
Retesting after 433c96e I get no failures, but do get these three warnings that look maybe-easy to fix:
Approving now; let me know if you have more changes and want another review of anything.
Thanks! I think all of those warnings are specific to
Description
Part 1 of the workflow for importing TRY: Processing the TRY citations and adding them to bety. Note that a working version of this requires the database migration in PecanProject/bety#573.
This PR also includes prototypes of several useful functions for importing data into bety (a brief usage sketch follows the Motivation section below):
- `insert_table` attempts to insert a data frame directly into an SQL table. Helper functions automatically line up the columns by name (dropping missing columns) and coerce column types appropriately. It returns a `dbplyr` "lazy" table -- basically, an un-evaluated SQL query describing a join of the local table with bety.
- `db_merge_into` builds on this functionality by joining a local data frame to a bety table, skipping columns that are already present. This allows users to effectively "update" bety tables based on local data frames. It also returns a `dbplyr` lazy table.

Motivation and Context
Importing TRY has been on our to-do list for a while (#709). This is the first step towards that. The database helper functions are designed to be broadly useful for adding data into Bety.
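A hedged usage sketch of the two helpers (the `insert_table` arguments come from the diff above; the `db_merge_into` call and the `by` column are illustrative assumptions):

```r
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
dplyr::copy_to(con, iris[1, ], name = "iris")

# Insert rows directly, matching columns by name and coercing types
insert_table(iris[2:50, ], table = "iris", con = con)

# "Merge": write only what isn't already present, then return a lazy table
# joining the local data frame against the database copy
merged <- db_merge_into(iris, table = "iris", con = con, by = "Species")
dplyr::collect(merged)
```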