Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Learn keys from SQLite databases #352

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

gadenbuie
Copy link
Contributor

Hi all, I know there are a few things currently in motion with respect to SQLite databases, so I'm submitting this with the complete expectation that you may want to implement this in another way. Please take whatever you find useful.

The primary change I've made is to add an sqlite_learn_query() that's wired up through db_learn_query() and that returns a table of table and column information in same structure as postgres_learn_query().

I've added a small test to confirm that my changes work, and I know I'm not taking full advantage of your testing infrastructure. I tried to setup a test using dm_nycflights_small() but the foreign keys weren't making the round trip and I wasn't sure if this was a bug or a problem with how I was using copy_dm_to(). I'd be happy to revise the tests with a little guidance.

@krlmlr
Copy link
Collaborator

krlmlr commented May 9, 2020

Thanks, this is amazing! If roundtrip worked, testing would be easier indeed. It feels like queries_set_fk_relations() needs a little tweak. Are primary keys roundtripped already?

I'm happy to support this to have another example for a data source that can be learned from. It means a bit more work further down -- learning should lead to an intermediate format first (#229) and also work for compound keys (#3). If you would like to take a stab at defining dm_meta() that works for SQLite -- I'm all for it! If not, no worries, I'll port later.

@krlmlr
Copy link
Collaborator

krlmlr commented May 9, 2020

It isn't specified anywhere yet: dm_meta() is a new generic that dispatches on the DBI connection and returns a simple dm object with a tables, a columns, a primary_keys and an foreign_keys table, perhaps with helper tables for m:n relationships. The information in that "meta dm" should be sufficient to create the structure for the dm.

Not sure about the name yet -- do we need another dm_meta_from_dm() that returns the same structure for an existing dm object?

@krlmlr
Copy link
Collaborator

krlmlr commented May 9, 2020

Looking at your query, maybe the structure for the "meta dm" can be modeled after SQLite? Also, would it be difficult to rewrite as dbplyr code, perhaps using e.g. tbl(ident_q("pragma_table_info(m.name)")) ?

@gadenbuie
Copy link
Contributor Author

If roundtrip worked, testing would be easier indeed. It feels like queries_set_fk_relations() needs a little tweak. Are primary keys roundtripped already?

Yes, primary keys roundtrip, so that's a good start!

If you would like to take a stab at defining dm_meta() that works for SQLite

Sure, I'll give it a try!

Also, would it be difficult to rewrite as dbplyr code, perhaps using e.g. tbl(ident_q("pragma_table_info(m.name)"))?

I think this is doable, although we'd have to query pragma_table_info(tbl_name) for each table name. Here's a sketch of the dbplyr/purrr code with three components (table name, column info, foreign keys), including a table with compound keys.

SQLite Column Info Sketch
library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)
library(purrr)
library(DBI)
library(RSQLite)

tmp_db <- tempfile(fileext = ".db")
con <- dbConnect(SQLite(), tmp_db)

# Create Tables
dbExecute(con, "CREATE TABLE first (id INTEGER PRIMARY KEY)")
#> [1] 0
dbExecute(con, paste(
  "CREATE TABLE second (",
  "id INTEGER PRIMARY KEY, first_id INTEGER,",
  "FOREIGN KEY (first_id) REFERENCES first (id))"
))
#> [1] 0
dbExecute(con, paste(
  "CREATE TABLE third (",
  "id INTEGER PRIMARY KEY, first_id INTEGER, second_id INTEGER,",
  "FOREIGN KEY (first_id) REFERENCES first (id),",
  "FOREIGN KEY (second_id) REFERENCES second (id))"
))
#> [1] 0
dbExecute(con, paste(
  "CREATE TABLE fourth (",
  "id_1 INTEGER, id_2 INTEGER, third_id INTEGER,",
  "PRIMARY KEY (id_1, id_2),",
  "FOREIGN KEY (third_id) REFERENCES third (id))"
))
#> [1] 0

This next part could by skipped using dbListTables() instead if all we need are the table names.

(tables <-
  list(
    tbl(con, "sqlite_master"),
    tbl(con, "sqlite_temp_master")
  ) %>% 
  map(~ filter(.x, type == "table")) %>% 
  map_dfr(collect))
#> # A tibble: 4 x 5
#>   type  name   tbl_name rootpage sql                                            
#>   <chr> <chr>  <chr>       <int> <chr>                                          
#> 1 table first  first           2 CREATE TABLE first (id INTEGER PRIMARY KEY)    
#> 2 table second second          3 CREATE TABLE second ( id INTEGER PRIMARY KEY, …
#> 3 table third  third           4 CREATE TABLE third ( id INTEGER PRIMARY KEY, f…
#> 4 table fourth fourth          5 CREATE TABLE fourth ( id_1 INTEGER, id_2 INTEG…

table_names <- set_names(tables$name)
(columns <- 
  table_names %>% 
  map(~ tbl(con, ident_q(sprintf("pragma_table_info('%s')", .x)))) %>% 
  map_dfr(collect, .id = "table"))
#> # A tibble: 9 x 7
#>   table    cid name      type    notnull dflt_value    pk
#>   <chr>  <int> <chr>     <chr>     <int> <lgl>      <int>
#> 1 first      0 id        INTEGER       0 NA             1
#> 2 second     0 id        INTEGER       0 NA             1
#> 3 second     1 first_id  INTEGER       0 NA             0
#> 4 third      0 id        INTEGER       0 NA             1
#> 5 third      1 first_id  INTEGER       0 NA             0
#> 6 third      2 second_id INTEGER       0 NA             0
#> 7 fourth     0 id_1      INTEGER       0 NA             1
#> 8 fourth     1 id_2      INTEGER       0 NA             2
#> 9 fourth     2 third_id  INTEGER       0 NA             0
(foreign_keys <-
  table_names %>% 
  map(~ tbl(con, ident_q(sprintf("pragma_foreign_key_list('%s')", .x)))) %>% 
  map_dfr(collect, .id = "table"))
#> # A tibble: 4 x 9
#>   table     id   seq table  from      to    on_update on_delete match
#>   <chr>  <int> <int> <chr>  <chr>     <chr> <chr>     <chr>     <chr>
#> 1 second     0     0 first  first_id  id    NO ACTION NO ACTION NONE 
#> 2 third      0     0 second second_id id    NO ACTION NO ACTION NONE 
#> 3 third      1     0 first  first_id  id    NO ACTION NO ACTION NONE 
#> 4 fourth     0     0 third  third_id  id    NO ACTION NO ACTION NONE

Created on 2020-05-11 by the reprex package (v0.3.0)

@gadenbuie
Copy link
Contributor Author

It feels like queries_set_fk_relations() needs a little tweak.

Unfortunately, in SQLite it's not possible to modify foreign keys using ALTER TABLE. The official process for adding foreign keys to an existing table is a 12-step procedure. 😢

Instead of implementing the 12-step table schema update method, it might be easier to include the foreign keys in the CREATE TABLE statement. Where is the most appropriate place to try to fix this? A db_copy_to.SQLiteConnection or db_create_table.SQLiteConnection method in dbplyr? I looked for another example of DB type where foreign keys had to make it down to db_create_table() but it seems most other SQL variants provide easier ways to update the keys.

@krlmlr
Copy link
Collaborator

krlmlr commented May 13, 2020

Ouch.

We already had to force the NOT NULL clause into the creation statement in SQL Server, because changing this is near impossible on a live table. These cases will occur, it's better to separate table creation from population anyway. The latter will be handled by dm_insert_rows() at some point.

We need to sort the tables topologically so that fact tables come before dimension tables. There's code in #319 that does this.

I looked into dbplyr::db_create_table(), it feels like we're better off rolling our own CREATE TABLE for now, in a private unexported generic. That method would receive information on primary and foreign keys, and insert accordingly.

For SQLite it feels we can collect the relevant tables from sqlite_master and primary and foreign key info locally, and use local dplyr operations to join, or even nested data frames.

@krlmlr
Copy link
Collaborator

krlmlr commented May 30, 2022

@gadenbuie: Would you like to take another look? The idea would be to implement dm_meta() for RSQLite. The internals aren't pretty now, but we can clean this up with S3 generics/methods later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants