Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Import TRY citations, and some useful DB functions #1848

Merged
merged 21 commits into from
Feb 9, 2018

Conversation

ashiklom
Copy link
Member

@ashiklom ashiklom commented Feb 7, 2018

Description

Part 1 of the workflow for importing TRY: Processing the TRY citations and adding them to bety. Note that a working version of this requires the database migration in PecanProject/bety#573.

This PR also includes prototypes of several useful functions for importing data into bety:

  • insert_table attempts to insert a data frame directly into an SQL table. Helper functions automatically line up the columns by name (dropping missing columns) and coerce column types appropriately. It returns a dbplyr "lazy" table -- basically, an un-evaluated SQL query describing a join of the local table with bety.
  • db_merge_into builds on this functionality by joining a local data frame to a bety table, skipping columns that are already present. This allows users to effectively "update" bety tables based on local data frames. It also returns a dbplyr lazy table.

Motivation and Context

Importing TRY has been on our to-do list for a while (#709). This is the first step towards that. The database helper functions are designed to be broadly useful for adding data into Bety.

Review Time Estimate

  • Immediately
  • Within one week
  • When possible

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My change requires a change to the documentation.
  • I have updated the CHANGELOG.md.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

Copy link
Member

@infotroph infotroph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice code overall! I left a bunch of comments, but most are opinion and could be ignored/argued against.

Do the tests pass on your machine? They fail on mine -- I can send session_info/logs if helpful, but won't spam you unless asked.

library(PEcAn.DB)
library(DBI)
library(testthat)
library(dplyr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can some of these be used namespaced instead? Having a package attached at test time but not run time means the tests won't catch e.g. a forgotten dplyr:: in the function code.

library(dplyr)
context("SQL insertion helper functions")

if (requireNamespace("RSQLite")) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd consider reformatting any flavor of "if package exists" as a test block that starts with a skip_if_not_installed:

test_that("RSQLite-dependent tests work", { 
    skip_if_not_installed("RSQLite")
    ...
    test_that("Subset of iris...", ...)
    test_that("Only subset...", ...)
    ...
})

Note that this relies on nested test_that calls, which didn't know you could do until just now! But this seems like a cleaner expression of the actual intent: one big set of tests whether RSQLite "works", skipped as a unit when not available but made up of smaller tests that each have their own setup and expectations.

if (requireNamespace("RSQLite")) {
library(RSQLite)
iris_file <- tempfile(fileext = ".sqlite")
irisdb <- dbConnect(SQLite(), iris_file)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would dbConnect(RSQLite::SQLite(), ":memory:") work and simplify the setup logic here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, does this need a paired dbDisconnect? If so, can wrap it in teardown() for guaranteed run at end of test file.

testthat (>= 1.0.2)
RSQLite,
testthat (>= 1.0.2),
tidyverse,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm only 60% sure on this, but my impression is that the tidyverse package is analogous to PEcAn.all -- i.e. it exists for load-time convenience and for use in packages it's better to list the specific packages used -- Looks like tibble, tidyr, purrr?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Point taken, but it is only a Suggests rather than a Imports/Depends, and this way, it matches with what I'm doing in the scripts. Though, grudgingly, I'll admit it is good coding practice to be as explicit as possible about where things are coming from.

In general, my approach is to stick packages required for scripts but not internal functions in Suggests, since I think those aren't installed unless you explicitly ask for them in devtools::install (right?).

Also, as an aside, get a load of the import package, which brings pythonic from Z import X as Y syntax to R. I might switch to using this in a bunch of my interactive scripts.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point on suggests vs imports, but I see calls to all three of tibble, purrr, tidyr in R/, so those should probably be in imports anyway 😉

)
values_sub <- values[, use_cols]
# Load one row to get column types
sql_row <- dplyr::tbl(con, table) %>% head(1) %>% collect()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

collect -> dplyr::collect

#' irisdb <- src_sqlite(irisfile, create = TRUE)
#' copy_to(irisdb, iris[1,], name = "iris", overwrite = TRUE)
#' insert_table(iris[-1,], "iris", irisdb$con)
#' tbl(irisdb, "iris")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This example fails too, similar error as above

insert_table <- function(values, table, con, coerce_col_class = TRUE) {
values_fixed <- match_dbcols(values, table, con, coerce_col_class)
insert_query <- build_insert_query(values_fixed, table, con = con)
print(insert_query)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

debugging artifact?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! I was wondering why my citations-inserting code was so verbose.


dbplyr::build_sql(
dbplyr::sql("INSERT INTO"),
dbplyr::sql(" "),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, no change necessary: Is this a stylistic choice or is there a reason the spaces can't go into their preceding strings, e.g. dbplyr::sql("INSERT INTO "), dbplyr::ident(table),...?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just style. But, I discovered more recently that it's even better to drop the dbplyr::sql calls, since they're implicit in dbplyr::build_sql. In general, build_sql is a very smart function -- it even successfully parses things like this:

build_sql("SELECT ", ident(c("col1", "col2", "col3")))

...so moving forward, I'm gonna try to rely as much as I can on its internal logic to get things right.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, in my upcoming commit, I ended up replacing dbplyr::build_sql with glue::glue_sql, which is a much cooler function. It allows syntax like this:

my_table <- "test_table"
my_columns <- c("col1", "col2", "col3")
my_values <- list(list(1, 2, 3), list(4, 5, 6))
glue::glue_sql("INSERT INTO {`my_table`} ({`my_columns`*}) VALUES {my_values*}")

}
sql_tbl <- dplyr::tbl(con, table)
values_merge <- dplyr::anti_join(values_fixed, sql_tbl, by = by, copy = TRUE)
if (nrow(values_merge) < 1 || ncol(values_merge) < 1) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean the rest of the function is a no-op? If so, would there be any potential savings from returning early here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite. You don't write to bety, but you do still perform the merge step at the end. This is important because it allows you to use this function to retrieve bety IDs and other bety metadat for your local table.

iris_insert <- iris2[2:10,]
.insert <- insert_table(iris_insert, "iris", irisdb)
iris_insert_test <- tbl(irisdb, "iris") %>% collect()
test_that(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another style thing: I like to put any setup code that's specific to a given test inside of that test's test_that block, e.g. here I'd move at least iris_insert through iris_insert_test inside the test_that to make it clearer that that they don't (... or at least aren't supposed to...) affect the other tests.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, yes, I agree. Here, this is a bit fuzzy because the tests build on each other (i.e. the merge only makes sense if the table has already been previously inserted), but I can certainly slide a few of the lines into the test_that blocks.

@ashiklom
Copy link
Member Author

ashiklom commented Feb 8, 2018

@infotroph Thanks for the helpful and thorough review! I agree with pretty much all of your comments, so I'll go through them one by one.

If you wouldn't mind, I'd like to have a look at the logs for the tests that fail on your machine.

@infotroph
Copy link
Member

$ Rscript -e 'devtools::test(reporter="list"); devtools::session_info()'
Loading PEcAn.DB
Loading required package: testthat
Testing PEcAn.DB
2018-02-08 17:08:33 DEBUG  [insert_table.R#41: PEcAn.logger::logger.debug] : 
   Matched the following cols: Sepal.Length, Sepal.Width, Petal.Length, 
   Petal.Width, Species 
<SQL> INSERT INTO `iris` (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species) VALUES (4.9, 3.0, 1.4, 0.2, 'setosa'), (4.7, 3.2, 1.3, 0.2, 'setosa'), (4.6, 3.1, 1.5, 0.2, 'setosa'), (5.0, 3.6, 1.4, 0.2, 'setosa'), (5.4, 3.9, 1.7, 0.4, 'setosa'), (4.6, 3.4, 1.4, 0.3, 'setosa'), (5.0, 3.4, 1.5, 0.2, 'setosa'), (4.4, 2.9, 1.4, 0.2, 'setosa'), (4.9, 3.1, 1.5, 0.1, 'setosa')
Warning message:
`encoding` is deprecated; all files now assumed to be UTF-8 
Session info ------------------------------------------------------------------
 setting  value                       
 version  R version 3.4.3 (2017-11-30)
 system   x86_64, darwin15.6.0        
 ui       X11                         
 language (EN)                        
 collate  en_US.UTF-8                 
 tz       Europe/Amsterdam            
 date     2018-02-08                  

Packages ----------------------------------------------------------------------
 package      * version date       source        
 assertthat     0.2.0   2017-04-11 CRAN (R 3.4.0)
 base         * 3.4.3   2017-12-07 local         
 bindr          0.1     2016-11-13 CRAN (R 3.4.0)
 bindrcpp     * 0.2     2017-06-17 CRAN (R 3.4.0)
 bit            1.1-12  2014-04-09 CRAN (R 3.4.0)
 bit64          0.9-7   2017-05-08 CRAN (R 3.4.0)
 blob           1.1.0   2017-06-17 CRAN (R 3.4.0)
 commonmark     1.4     2017-09-01 CRAN (R 3.4.1)
 compiler       3.4.3   2017-12-07 local         
 datasets     * 3.4.3   2017-12-07 local         
 DBI          * 0.7     2017-06-18 cran (@0.7)   
 dbplyr         1.2.0   2018-01-03 CRAN (R 3.4.3)
 devtools       1.13.4  2017-11-09 CRAN (R 3.4.2)
 digest         0.6.14  2018-01-14 CRAN (R 3.4.3)
 dplyr        * 0.7.4   2017-09-28 cran (@0.7.4) 
 glue           1.2.0   2017-10-29 CRAN (R 3.4.2)
 graphics     * 3.4.3   2017-12-07 local         
 grDevices    * 3.4.3   2017-12-07 local         
 lubridate      1.7.1   2017-11-03 CRAN (R 3.4.2)
 magrittr       1.5     2014-11-22 CRAN (R 3.4.0)
 memoise        1.1.0   2017-04-21 CRAN (R 3.4.0)
 methods      * 3.4.3   2017-12-07 local         
 ncdf4          1.16    2017-04-01 cran (@1.16)  
 PEcAn.DB     * 1.5.2   <NA>       local         
 PEcAn.logger   1.5.2   2018-02-08 local (@1.5.2)
 PEcAn.remote   1.5.2   2018-01-18 local (@1.5.2)
 PEcAn.utils    1.5.2   2018-02-08 local (@1.5.2)
 pillar         1.1.0   2018-01-14 CRAN (R 3.4.3)
 pkgconfig      2.0.1   2017-03-21 CRAN (R 3.4.0)
 plyr           1.8.4   2016-06-08 CRAN (R 3.4.0)
 purrr          0.2.4   2017-10-18 CRAN (R 3.4.2)
 R6             2.2.2   2017-06-17 CRAN (R 3.4.0)
 Rcpp           0.12.15 2018-01-20 CRAN (R 3.4.3)
 rlang          0.1.6   2017-12-21 CRAN (R 3.4.3)
 roxygen2       6.0.1   2017-02-06 CRAN (R 3.4.0)
 RPostgreSQL    0.6-2   2017-06-24 cran (@0.6-2) 
 RSQLite      * 2.0     2017-06-19 CRAN (R 3.4.1)
 stats        * 3.4.3   2017-12-07 local         
 stringi        1.1.6   2017-11-17 cran (@1.1.6) 
 stringr        1.2.0   2017-02-18 CRAN (R 3.4.0)
 testthat     * 2.0.0   2017-12-13 CRAN (R 3.4.3)
 tibble         1.4.2   2018-01-22 cran (@1.4.2) 
 tools          3.4.3   2017-12-07 local         
 udunits2       0.13    2016-11-17 cran (@0.13)  
 utils        * 3.4.3   2017-12-07 local         
 withr          2.1.1   2017-12-19 CRAN (R 3.4.3)
 xml2           1.1.1   2017-01-24 CRAN (R 3.4.0)

@ashiklom
Copy link
Member Author

ashiklom commented Feb 8, 2018

Where is the error in there? I only see the warning about encoding.

@infotroph
Copy link
Member

Right! Sorry, here it is with the default reporter (and after a make clean; make that didn't change the result)

Rscript -e 'devtools::test("base/db"); devtools::session_info()' 
Loading PEcAn.DB
Loading required package: testthat
Testing PEcAn.DB
✔ | OK F W S | Context
✔ |  2       | Basic Sanity tests for PEcAn functions that query BETYdb [0.3 s]
✔ | 39       | test that expected tables exist [1.9 s]
✔ |  3       | Testing utility functions
⠏ |  0       | SQL insertion helper functions2018-02-08 17:27:57 DEBUG  [insert_table.R#41: PEcAn.logger::logger.debug] : 
   Matched the following cols: Sepal.Length, Sepal.Width, Petal.Length, 
   Petal.Width, Species 
<SQL> INSERT INTO `iris` (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species) VALUES (4.9, 3.0, 1.4, 0.2, 'setosa'), (4.7, 3.2, 1.3, 0.2, 'setosa'), (4.6, 3.1, 1.5, 0.2, 'setosa'), (5.0, 3.6, 1.4, 0.2, 'setosa'), (5.4, 3.9, 1.7, 0.4, 'setosa'), (4.6, 3.4, 1.4, 0.3, 'setosa'), (5.0, 3.4, 1.5, 0.2, 'setosa'), (4.4, 2.9, 1.4, 0.2, 'setosa'), (4.9, 3.1, 1.5, 0.1, 'setosa')
✖ |  0 1     | SQL insertion helper functions [0.3 s]
────────────────────────────────────────────────────────────────────────────────
test.insert.R:20: error: (unknown)
near ".": syntax error
1: insert_table(iris_insert, "iris", irisdb) at /Users/chrisb/github_forks/pecan/base/db/tests/testthat/test.insert.R:20
2: db.query(insert_query, con) at /Users/chrisb/github_forks/pecan/base/db/R/insert_table.R:26
3: DBI::dbGetQuery(con, query) at /Users/chrisb/github_forks/pecan/base/db/R/utils_db.R:46
4: DBI::dbGetQuery(con, query)
5: .local(conn, statement, ...)
6: dbSendQuery(conn, statement, ...)
7: dbSendQuery(conn, statement, ...)
8: .local(conn, statement, ...)
9: new("SQLiteResult", sql = statement, ptr = rsqlite_send_query(conn@ptr, statement), 
       conn = conn)
10: initialize(value, ...)
11: initialize(value, ...)
12: rsqlite_send_query(conn@ptr, statement)
────────────────────────────────────────────────────────────────────────────────
✔ |  3       | test db.query

══ Results ═════════════════════════════════════════════════════════════════════
Duration: 3.1 s

OK:       53
Failed:   1
Warnings: 0
Skipped:  0
Warning message:
`encoding` is deprecated; all files now assumed to be UTF-8 
Session info ------------------------------------------------------------------
 setting  value                       
 version  R version 3.4.3 (2017-11-30)
 system   x86_64, darwin15.6.0        
 ui       X11                         
 language (EN)                        
 collate  en_US.UTF-8                 
 tz       Europe/Amsterdam            
 date     2018-02-08                  

Packages ----------------------------------------------------------------------
 package      * version date       source        
 assertthat     0.2.0   2017-04-11 CRAN (R 3.4.0)
 base         * 3.4.3   2017-12-07 local         
 bindr          0.1     2016-11-13 CRAN (R 3.4.0)
 bindrcpp     * 0.2     2017-06-17 CRAN (R 3.4.0)
 bit            1.1-12  2014-04-09 CRAN (R 3.4.0)
 bit64          0.9-7   2017-05-08 CRAN (R 3.4.0)
 blob           1.1.0   2017-06-17 CRAN (R 3.4.0)
 cli            1.0.0   2017-11-05 CRAN (R 3.4.2)
 commonmark     1.4     2017-09-01 CRAN (R 3.4.1)
 compiler       3.4.3   2017-12-07 local         
 crayon         1.3.4   2017-09-16 CRAN (R 3.4.1)
 datasets     * 3.4.3   2017-12-07 local         
 DBI          * 0.7     2017-06-18 cran (@0.7)   
 dbplyr         1.2.0   2018-01-03 CRAN (R 3.4.3)
 devtools       1.13.4  2017-11-09 CRAN (R 3.4.2)
 digest         0.6.15  2018-01-28 cran (@0.6.15)
 dplyr        * 0.7.4   2017-09-28 cran (@0.7.4) 
 glue           1.2.0   2017-10-29 CRAN (R 3.4.2)
 graphics     * 3.4.3   2017-12-07 local         
 grDevices    * 3.4.3   2017-12-07 local         
 lubridate      1.7.1   2017-11-03 CRAN (R 3.4.2)
 magrittr       1.5     2014-11-22 CRAN (R 3.4.0)
 memoise        1.1.0   2017-04-21 CRAN (R 3.4.0)
 methods      * 3.4.3   2017-12-07 local         
 ncdf4          1.16    2017-04-01 cran (@1.16)  
 PEcAn.DB     * 1.5.2   <NA>       local         
 PEcAn.logger   1.5.2   2018-02-08 local (@1.5.2)
 PEcAn.remote   1.5.2   2018-02-08 local (@1.5.2)
 PEcAn.utils    1.5.2   2018-02-08 local (@1.5.2)
 pillar         1.1.0   2018-01-14 CRAN (R 3.4.3)
 pkgconfig      2.0.1   2017-03-21 CRAN (R 3.4.0)
 plyr           1.8.4   2016-06-08 CRAN (R 3.4.0)
 purrr          0.2.4   2017-10-18 CRAN (R 3.4.2)
 R6             2.2.2   2017-06-17 CRAN (R 3.4.0)
 Rcpp           0.12.15 2018-01-20 CRAN (R 3.4.3)
 rlang          0.1.6   2017-12-21 CRAN (R 3.4.3)
 roxygen2       6.0.1   2017-02-06 CRAN (R 3.4.0)
 RPostgreSQL    0.6-2   2017-06-24 cran (@0.6-2) 
 RSQLite      * 2.0     2017-06-19 CRAN (R 3.4.1)
 stats        * 3.4.3   2017-12-07 local         
 stringi        1.1.6   2017-11-17 cran (@1.1.6) 
 stringr        1.2.0   2017-02-18 CRAN (R 3.4.0)
 testthat     * 2.0.0   2017-12-13 CRAN (R 3.4.3)
 tibble         1.4.2   2018-01-22 cran (@1.4.2) 
 tools          3.4.3   2017-12-07 local         
 udunits2       0.13    2016-11-17 cran (@0.13)  
 utils        * 3.4.3   2017-12-07 local         
 withr          2.1.1   2017-12-19 CRAN (R 3.4.3)
 xml2           1.1.1   2017-01-24 CRAN (R 3.4.0)

Notable changes:
- Use `glue::glue_sql` instead of `dbplyr::build_sql` for generating SQL
query. The syntax is much cleaner. This also fixes the breaking tests by
improving the way column names are quoted.
- Remove the `.here` file
- In tests and examples, use `":memory:"` SQLite database instead of a
temporary file
- Use explicit `package::function` syntax in tests, to test for
NAMESPACE issues
@infotroph
Copy link
Member

Retesting after 433c96e I get no failures, but do get these three warnings that look maybe-easy to fix:

test.insert.R:18: warning: RSQLite-dependent tests work
Don't need to call dbFetch() for statements, only for queries

test.insert.R:18: warning: RSQLite-dependent tests work
RSQLite::dbGetException() is deprecated, please switch to using standard error handling via tryCatch().

test.insert.R:28: warning: RSQLite-dependent tests work
Don't need to call dbFetch() for statements, only for queries

Approving now; let me if have more changes and want another review of anything.

@ashiklom
Copy link
Member Author

ashiklom commented Feb 9, 2018

Thanks! I think all of those warnings are specific to RSQLite. Since they don't actually break anything, and we don't generally make too much use of SQLite in PEcAn, I am hesitant to mess with the db.query, etc. code lest it break something for Postgres.

@mdietze mdietze merged commit 20dd9d7 into PecanProject:develop Feb 9, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants