---
title: "Database and Larger Than Memory Problems"
format:
  html:
    code-fold: show
---
# Resources
- [Resource to understand databases](https://smithjd.github.io/sql-pet/chapter-setup-adventureworks-db.html#)
- [R for Data Science chapter on databases](https://r4ds.hadley.nz/databases.html)
- [dbplyr function translations](https://dbplyr.tidyverse.org/articles/translation-function.html)
- [How to use duckdb](https://bwlewis.github.io/duckdb_and_r/talk/talk.html)
- [DBI function overview](https://dbi.r-dbi.org/)
- [Posit learning materials](https://solutions.posit.co/connections/db/)
- [dbplot](https://edgararuiz.github.io/dbplot/)
# Why is this important?
- You will get to a point where the data you need lives inside a database rather
than Excel sheets or csv files
- This is a point of potential rejoicing or mourning
- While we get to say goodbye to Excel sheets and csv files, we also need to say hello
to poorly documented databases, overwhelmed database product owners,
overworked data engineers, and finally SQL and all of its variations
- I recommend you take the time to learn SQL -- the basics are very
similar to the dplyr commands you already know (with slight tweaks to
evaluation order and syntax), so the beginner's learning curve isn't steep
:::{.callout-note collapse="true"}
## How much SQL should I learn?
If you're curious how much SQL you should learn, below is a framework I
found helpful, mapping SQL commands to their R equivalents.
```{r}
#| warning: false
#| echo: false
#| eval: true
library(tidyverse)
tribble(
  ~sql, ~dplyr, ~comment
  ,"WHERE", "filter()", "Pass arguments with AND or OR.\nSometimes the first argument is TRUE for purely formatting purposes"
  ,"SELECT", "select()", "* is a shortcut for all columns;\nseparate multiple columns with a comma"
  ,"GROUP BY", "group_by()", "separate multiple columns with a comma"
  ,"SUM", "sum()", "works the same as sum(); na.rm is always TRUE in SQL"
  ,"MIN", "min()", "works the same as min(); na.rm is always TRUE in SQL"
  ,"MAX", "max()", "works the same as max(); na.rm is always TRUE in SQL"
  ,"COUNT(*)", "n()", "works the same as n()"
  ,"DISTINCT", "distinct()", "works similarly"
  ,"SELECT DISTINCT", "select() |> distinct()", "specify the columns you want"
  ,"PARTITION", "group_by() |> mutate()", "specify the columns you want with OVER() and PARTITION BY"
  ,"LEFT JOIN", "left_join()", "uses ON to specify the join conditions"
  ,"LIKE", "str_detect()", "works with WHERE; % is the wildcard"
  ,"TOP", "head()", "works the same as head()"
  ,"DESCRIBE", "glimpse()", "works similarly; displays the columns and their data types"
  ,"SET", "<-", "works similarly to assignment for a single variable"
  ,"BETWEEN", "between()", "VAR BETWEEN X AND Y"
  ,"MONTH", "month()", "works similarly"
  ,"YEAR", "year()", "works similarly"
  ,"DAY", "day()", "works similarly"
  ,"QUARTER", "quarter()", "works similarly"
  ,"DATEDIFF", "difftime()", "works similarly"
  ,"DATE_TRUNC", "floor_date()", "works similarly; accepts minute, hour, day, week, month, quarter, year, etc."
  ,"CREATE OR REPLACE", "tibble()", "conceptually the same, but use this to create the final table"
  ,"IN", "%in%", "works similarly"
  ,"AS", "rename()", "works similarly"
  ,"WITH", "tibble()", "use this to create CTEs -- mini tables you can reference in later steps of the query; improves readability"
  ,"HAVING", "filter()", "similar to WHERE, but must be used with aggregated measures"
) |>
  gt::gt() |>
  gt::cols_label(
    sql = "SQL",
    dplyr = "dplyr",
    comment = "Comment"
  ) |>
  gt::tab_header(
    title = "Summary of SQL commands and their dplyr counterparts"
  )
```
There are some nuances -- in particular the evaluation order and coding conventions -- that typically trip up new SQL users, but in general if you understand the R commands above you will quickly learn their SQL counterparts.
As with anything, you need to practice! Luckily there are multiple SQL resources and practice sites that can help with reinforcement.
:::
- The reason you can survive on a beginner's knowledge of SQL is that
there is a life-saving package called `dbplyr` that translates your
dplyr queries into SQL for you (see the sketch below)
- It has fairly great coverage, but there are limitations, which is why
it will eventually help you to learn some intermediate SQL commands and
overall database frameworks
- This chapter will go over database essentials and provide resources to
learn more
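Here is a minimal sketch of that translation, using an in-memory duckdb database and the `diamonds` data as a stand-in for a real database table:
```{r}
#| eval: false
# build a throwaway duckdb database and load a table into it
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(con, "diamonds", ggplot2::diamonds)
# write ordinary dplyr; dbplyr translates it to SQL behind the scenes
dplyr::tbl(con, "diamonds") |>
  dplyr::group_by(cut) |>
  dplyr::summarise(avg_price = mean(price, na.rm = TRUE)) |>
  dplyr::show_query()   # prints the generated SQL instead of running it
```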
## The Essentials
**What do you need to access data in a database?**
- A database with data inside of it
- Access / permission to the database
- Location, user name, and password for the database (or equivalent protocols as
dictated by your organization's security model)
- A database driver and related utilities
- SQL queries
- Patience
::: {.callout-note collapse="true"}
## "What is the advantage of a database?"
It comes down to scale and size. At some point your organization or process
will generate enough data that it needs a more structured way to store
it so that multiple parties can access it at scale.
When dealing with a new database some key frameworks are:
- Cloud vs. On-Premise
- Security model and access
- "Flavor" of database
*Advantages of a database:*
- Improved data management: A database centralizes data, making it
easier to manage and maintain
- Enhanced data security: A database provides secure storage and
retrieval of sensitive data through user authentication and access
control mechanisms
- Better data organization: A database **theoretically** allows for better organization
and structure of data through the use of tables, indexes, and
relationships
- Improved query performance: Databases are optimized for query
performance, allowing for faster retrieval of data.
- Scalability: Databases can handle large amounts of data and scale as
needed to meet growing storage demands.
*Cloud vs. On-Premise:*
- Cloud databases are hosted on remote servers, while on-premise
databases are hosted on local servers (when you read "servers", just substitute "computers": you are either using your organization's computers (on-prem) or someone else's (cloud))
- Cloud databases offer greater flexibility in terms of scalability and
accessibility, as they can be accessed from any location with an
internet connection.
- On-premise databases provide more control over data security and
privacy, as the data is stored on a local server and not transmitted
over the internet.
- Cloud databases typically require less setup and maintenance than
on-premise databases, as they are managed by the provider.
- Cost: Cloud databases are often subscription-based and can be more
cost-effective than on-premise databases, especially for small to
medium-sized businesses.
*Security Model and Access:*
- Security model: A database's security model determines who has access
to the data and how they can access it. Common security models include
Role-Based Access Control (RBAC), Attribute-Based Access Control (ABAC),
and Identity-Based Access Control (IBAC).
- Access control: A database's access control mechanisms determine who
can view, edit, or delete data. This can be based on user
authentication, role-based access control, or attribute-based access
control.
- Authentication methods: Databases support various authentication
methods such as username and password, single sign-on (SSO), and
two-factor authentication (2FA).
- Authorization methods: Databases support various authorization
methods such as row-level security, column-level security, and
table-level security.
- Auditing and logging: Databases can log all access attempts and
successful accesses to track user activity and detect potential security
breaches.
- Encryption: Databases can encrypt data both in transit and at rest to
protect it from unauthorized access.
- Backup and recovery: Databases provide mechanisms for backing up data
and recovering from failures or security incidents.
- Identity and access management (IAM): IAM systems manage user
identities and access rights within the database, ensuring that only
authorized users can access the data.
- Role-based access control (RBAC): RBAC allows for assigning roles to
users based on their job function or responsibilities, limiting the data
they have access to.
- Attribute-based access control (ABAC): ABAC grants or denies access
to data based on attributes associated with the user or the data itself,
such as location or time of day.
:::
**What does a database need from you?**
- The most frustrating part of working with a database is getting access,
setting up the database utilities, and then making the initial connection
- There are multiple ways to connect to a database, however almost all
require the following:
    - User name
    - Password
    - Database driver
    - Connection string and associated arguments
    - SQL query
- We will review the DSN method for connecting to a database
- This tends to be confusing because much of it is controlled and managed
by your local IT department, so whatever documentation or guide you read
online may not translate one-for-one to your local experience (this
guide included :(
## Set up ODBC Driver
- Download (if required) a database driver for your database -- this is
typically on the database vendor's website
    - Your company may have a centralized package-manager system, in which
case you will need to download and install all required drivers via that
package manager
- Configure your DSN so that you can be authenticated
    - Your database platform should have documentation on how to do this, and
your internal IT team **should** be able to articulate any proxy / security
requirements
    - Here is some example
[documentation](https://docs.snowflake.com/en/developer-guide/odbc/odbc-windows)
    - You are essentially saving the required information (listed above)
to your computer so that you can pass these arguments to the database
    - Pay attention to the name you give the DSN; you will need it
later
**Example parameters are below** (a quick way to verify your setup follows the list)
- User
- Password
- Server
- Database
- Schema
- Warehouse
- Tracing
- Authenticator
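Once the DSN is configured, you can check from R that the driver and DSN are visible. A small sketch using utilities from the odbc package:
```{r}
#| eval: false
# drivers installed on this machine
odbc::odbcListDrivers()
# DSNs you have configured (the name you chose should appear here)
odbc::odbcListDataSources()
```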
## Create Connection String
- If the above is done correctly, you can then use the DBI package in R
to connect to the database of your choice
- Create a connection with DBI::dbConnect()
    - Select the DBMS with the driver argument: use odbc::odbc()
to connect to an external database via your DSN setup, or use a
DBMS package such as duckdb::duckdb() to spin up a local
instance of a database
    - Pass the DSN name if connecting to an external database (the name used to set up the ODBC driver)
    - Alternatively, you can directly supply the arguments
to DBI::dbConnect(), such as hostname, port, username, etc.
```{r}
#| echo: true
#| eval: false
#| code-fold: true
#| error: false
#| warning: false
## this uses duckdb to create a local connection
con <- DBI::dbConnect(drv = duckdb::duckdb())
## this is an alternative example using a made-up DSN name
con <- DBI::dbConnect(
  drv = odbc::odbc()
  ,dsn = "your_DSN_name"
)
```
::: {.callout-note collapse="true"}
## Additional Utilities
- The DBI and odbc packages are extremely useful for database-related
utilities
- While they overlap somewhat, they can be used to view the
schema in your database, list active connections, and disconnect
- Below are some useful utilities:
```{r}
#| echo: false
tribble(
  ~name, ~purpose
  ,"odbc::odbcListDrivers()", "lists your drivers"
  ,"DBI::dbListConnections()", "as you create connections with dbConnect(), this will list active connections"
  ,"DBI::dbCanConnect()", "checks whether a connection can be established"
  ,"DBI::dbListTables()", "lists tables associated with the connection"
)
```
:::
## Are all tables equal?
There are three *types* of tables in most databases. Understanding the different types and how they can be used or referenced will help you.
- Temporary tables / CTEs
    - These are tables that exist only while the query runs
    - Helpful as interim steps, or for breaking code into subqueries to make it more modular
```{sql}
--| eval: false
WITH TEMP AS (
  SELECT
    CUT
    ,AVG(PRICE) AS AVG_PRICE
    ,COUNT(*) AS N
  FROM
    DIAMONDS_DB
  GROUP BY
    CUT
)
SELECT *
FROM
  DIAMONDS_DB AS MAIN
LEFT JOIN
  TEMP ON MAIN.CUT = TEMP.CUT
```
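For comparison, here is a sketch of the same query written with dplyr against a lazy table (assuming `diamonds_db` was created with `dplyr::tbl()`, as shown later in this chapter):
```{r}
#| eval: false
library(dplyr)
# the CTE becomes an intermediate lazy table
temp <- diamonds_db |>
  group_by(cut) |>
  summarise(avg_price = mean(price), n = n())
# the final SELECT ... LEFT JOIN
diamonds_db |>
  left_join(temp, by = "cut")
```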
- Curated tables
    - Often you may have loads of raw tables (e.g. hundreds) that you need to join together, filter, or aggregate before the data is usable
    - This process of turning raw / streaming data into tables that can be consumed for analysis is often called data curation
    - The result is often created as a view, which can be thought of as a stored query that behaves like a table
```{sql}
--| eval: false
CREATE VIEW DiamondSummaryView AS
WITH TEMP AS (
SELECT
CUT,
AVG(PRICE) AS AVG_PRICE,
COUNT(*) AS N
FROM
DIAMONDS_DB
GROUP BY
CUT
)
SELECT *
FROM
DIAMONDS_DB AS MAIN
LEFT JOIN
TEMP ON MAIN.CUT = TEMP.CUT;
```
- Materialized layers
    - A materialized view persists the query results, so referencing it does not re-run the underlying queries (which will save you a lot of time)
```{sql}
--| eval: false
CREATE MATERIALIZED VIEW DiamondSummaryMaterializedView AS
WITH TEMP AS (
SELECT
CUT,
AVG(PRICE) AS AVG_PRICE,
COUNT(*) AS N
FROM
DIAMONDS_DB
GROUP BY
CUT
)
SELECT *
FROM
DIAMONDS_DB AS MAIN
LEFT JOIN
TEMP ON MAIN.CUT = TEMP.CUT;
```
**Database structure**
- Security model
    - Because data can be privileged, your organization almost certainly has a security model that applies row-level security and IDs to ensure that when you access a table you see only what you should see
    - There is far too much to write about it here and, honestly, I'm not the right person to answer it
    - How much of this you need to know mostly depends on your organization's data strategy, staffing levels, and operating model
- After you have created a connection, you need to retrieve
information from the database
## Option 1: Create SQL string
- If you know the database, schema, and table name that you want, you can
write the initial SQL query to connect to the database
- You can write a simple or advanced query inside the `dplyr::sql()` function
```{r}
#| echo: true
#| eval: false
#| error: false
#| warning: false
sql_query <- dplyr::sql("select * from database_name.schema_name.table_name")
```
## Accessing the Database
- Use the connection and SQL query together to create a lazy table with `dplyr::tbl()`
- We call this a lazy table because it won't actually execute the query and
return the results, which is good because your query might return a huge number of rows
```{r}
#| echo: true
#| eval: false
#| error: false
#| warning: false
data_db <- dplyr::tbl(con,sql_query)
```
- From there you can use *dplyr* verbs against the database back end
(notice the distinction between **db**plyr and dplyr)
    - If you use the dbplyr package, you are limited to queries that can
be translated to SQL, which are detailed below
    - [GitHub list of dplyr commands that can be used in dbplyr](https://github.com/tidyverse/dbplyr/blob/main/R/backend-.R)
    - You can also check the database-specific translations [here](https://github.com/tidyverse/dbplyr/blob/main/R/backend-snowflake.R)
- You can see what query it will generate with `dplyr::show_query()`
- Notice the class of the object you get back: it is a database
object -- if you want a dataframe you need to use `dplyr::collect()` (see the sketch below)
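A minimal sketch of that inspect-then-collect workflow, assuming `data_db` is the lazy table created above:
```{r}
#| eval: false
# peek at the SQL dbplyr generates without running it
data_db |>
  dplyr::filter(price > 1000) |>
  dplyr::show_query()
# execute the query and bring the results back as a local tibble
local_df <- data_db |>
  dplyr::filter(price > 1000) |>
  dplyr::collect()
class(local_df)   # now a plain tbl_df, not a database object
```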
## Option 2: Push existing data into a database
- First you need a connection to a database (and write permissions)
- If you already have a dataframe, you can upload it to the
database with DBI::dbWriteTable(con, "tbl_name", df), which writes the table
to the connection under the name you gave
    - DBI::dbWriteTable() can write an R dataframe, or you can use SQL to create a virtual table if you want
    - dbplyr::copy_inline(con_db, df = df) is an alternative method
    - If you have a duckdb connection, you can use duckdb::duckdb_register()
```{r}
# create connection
con_db <- DBI::dbConnect(duckdb::duckdb())
# write data into the database
DBI::dbWriteTable(con_db, name = "diamonds_db", value = ggplot2::diamonds)
# or alternatively register the dataframe as a virtual table
duckdb::duckdb_register(con_db, "diamonds_db_2", df = ggplot2::diamonds)
# validate the data is in the database by referencing the connection
DBI::dbListTables(con_db)
# pull in the data as a lazy database object
diamonds_db <- dplyr::tbl(con_db, "diamonds_db")
```
## What happens if dbplyr doesn't have a function that I need?
- This will happen -- for example, when you want the ceiling date of a date column (e.g. rounding 2024-01-05 up to 2024-01-31)
### Built-in helper functions
- `dbplyr::translate_sql()` shows you the SQL that dbplyr would generate for an R expression, as sketched below
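A small sketch; LAST_DAY is an assumed dialect-specific function used only to show the pass-through behavior:
```{r}
#| eval: false
library(dbplyr)
# known functions are translated to their SQL equivalents
translate_sql(mean(price, na.rm = TRUE), con = simulate_dbi())
# functions dbplyr doesn't recognize are passed through to the database
# verbatim, which is often the easiest escape hatch for a missing function
translate_sql(LAST_DAY(sale_date), con = simulate_dbi())
```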
### Create a SQL query and use it safely
- Use a parameterised query with dbSendQuery() and dbBind()
- Use the sqlInterpolate() function to safely combine a SQL string with data
- Manually escape the inputs using dbQuoteString()
- [Posit guide on running queries safely](https://solutions.posit.co/connections/db/best-practices/run-queries-safely/)
### Parameterised query example
Create the parameterised query with dbSendQuery(), execute it with specific values via dbBind(), fetch the results with dbFetch(), and clean up with dbClearResult():
```{r}
#| eval: false
# ? is a placeholder that will be filled in safely by dbBind()
airport <- DBI::dbSendQuery(con, "SELECT * FROM airports WHERE faa = ?")
# execute the query with a specific value, then fetch the results
DBI::dbBind(airport, list("GPT"))
DBI::dbFetch(airport)
# once you're done with the parameterised query, release the result set
DBI::dbClearResult(airport)
```
### How can I build a package for this?
- If you are writing a package that needs to generate SQL, `dbplyr::build_sql()` helps you construct SQL strings safely
## Putting it all together
- Set up your driver, and get the required database info and related utilities
- Create a connection to your database
- Write an initial query to select the columns that you want or need
- Use dbplyr to translate dplyr queries to SQL
- Return results to your local machine with dplyr::collect()
The sketch below strings these steps together.
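This is a minimal end-to-end sketch using a local duckdb database as a stand-in for a real DSN-based connection (for a real database, swap in `odbc::odbc()` and your DSN name):
```{r}
#| eval: false
# 1. connect (driver + credentials would go here for a remote database)
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(con, "diamonds_db", ggplot2::diamonds)
# 2. initial query creating a lazy table
diamonds_db <- dplyr::tbl(con, "diamonds_db")
# 3. dplyr verbs, translated to SQL by dbplyr
result <- diamonds_db |>
  dplyr::filter(cut == "Good") |>
  dplyr::group_by(color) |>
  dplyr::summarise(avg_price = mean(price, na.rm = TRUE)) |>
  dplyr::collect()   # 4. execute and return a local dataframe
# 5. disconnect when finished
DBI::dbDisconnect(con)
```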
# Larger than memory problems
- Sometimes you don't have a database, but you have data that is larger than
memory
- Luckily, you don't need a database to take advantage of the tricks we have
learned for solving larger-than-memory problems
- Sometimes you may not have a database to connect to and instead have
very large csv files or dataframes
- There are two packages that are exceptionally helpful here:
`duckdb` and `arrow`
- This isn't technically precise, but duckdb lets you build
an in-memory database, whereas arrow compresses your information into a
parquet-style columnar structure
- What makes these so great is that your dplyr commands are translated for
you into duckdb or arrow operations
- Some dplyr functions are available in one package that aren't available in
the other
- However, you can pass an object between duckdb and arrow as much as you want
**What is the key difference?**
- duckdb will return the first 1,000 rows of your query so you can check whether
your query worked, whereas arrow won't let you see the result until you collect
(including when your query returns an error, which can be annoying)
To understand how to use these packages, let us define three scenarios:
1) You have a lot of csv files that you need to upload and analyze
2) As a result of a simulation or some other process you have multiple tables that
are fine separately but, once joined, produce a very large
dataframe that you need to analyze and manipulate
3) You have large interim data that you want to save within a workflow
### Large csv files that you need to work with
- If you have large offline files, you can quickly and easily load them into
duckdb with `duckdb_read_csv()` or the collection of arrow functions
(e.g. `arrow::read_csv_arrow()`)
::: {.callout-note}
Arrow supports json, feather, delimited text, parquet, and csv, among others,
whereas duckdb's reader here only supports csv (unless the data is already an R object)
:::
- This will load the csv files directly into your duckdb in-memory database
- You first need to create a connection, as in the sketch below
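A sketch of both routes, assuming some hypothetical csv files under `data/`:
```{r}
#| eval: false
# duckdb route: load csv files straight into a database table
con <- DBI::dbConnect(duckdb::duckdb())
duckdb::duckdb_read_csv(con, "flights", files = c("data/flights1.csv", "data/flights2.csv"))
# arrow route: scan a folder of csv files lazily, without loading into memory
ds <- arrow::open_dataset("data/", format = "csv")
```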
### Existing dataframes (however they got there) that are individually okay but together are too large
- You first need to get your data into R as dataframes
- Then you need to register each dataframe with your in-memory duckdb database
- From there you can move the dataframe between duckdb and arrow as you like, as sketched below
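A minimal sketch, using `mtcars` as the stand-in dataframe:
```{r}
#| eval: false
# register a local dataframe as a duckdb virtual table (no copy is made),
# then hand the lazy table over to arrow
con <- DBI::dbConnect(duckdb::duckdb())
duckdb::duckdb_register(con, "mtcars_db", mtcars)
mtcars_db <- dplyr::tbl(con, "mtcars_db")
mtcars_arw <- arrow::to_arrow(mtcars_db)
```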
# Special tricks
## You can pass one database object to duckdb
```{r}
#| echo: fenced
#| label: db-connect-example
#| eval: true
## create a connection locally
con_db <- DBI::dbConnect(duckdb::duckdb())
# load data into your in-memory connection
DBI::dbWriteTable(con_db, "diamonds_db", ggplot2::diamonds)
# create a new empty table on the connection
DBI::dbExecute(con_db, "CREATE TABLE duckdb_table (col1 INT, col2 STRING)")
```
Preview what is in your connection:
```{r}
#| eval: false
DBI::dbListTables(con_db)
dbplyr::copy_inline(con_db, df = ggplot2::diamonds)
diamonds_db <- dplyr::tbl(con_db, "diamonds_db")
```
## Example of passing one database object to arrow
```{r}
#| echo: true
#| label: query-example
#| eval: false
#| error: false
#| warning: false
diamonds_db %>%
mutate(good_indicator=if_else(cut=="Good",1,0)) %>%
group_by(color) %>%
summarise(
n=n()
,mean=mean(carat)
,mean_price=mean(price)
,mean_ind=mean(good_indicator)
,mean_adj=mean(carat[good_indicator==1])
) %>%
arrange(desc(color)) %>%
mutate(rolling_price=cumsum(mean_price)) %>%
arrow::to_arrow() %>%
ungroup() %>%
filter(color=="H") %>%
select(color,mean_adj) %>%
collect()
```
# Additional tricks
## Running SQL in R
- If you are using R Markdown or Quarto, you can run the SQL query in a chunk
and have its results saved as a dataframe
```{sql}
--| eval: false
--| connection: con_db
--| output.var: test.tbl
--| echo: fenced
SELECT *
FROM diamonds_db
WHERE cut = 'Good'
LIMIT 100
```
- If you want to run it in R Markdown, you can set the options in the chunk header:
```{sql, eval=FALSE,connection=con_db, output.var = "mydataframe"}
--| echo: fenced
SELECT *
FROM diamonds_db
WHERE cut = 'Good'
LIMIT 100
```
## How to plot a database object
- The `dbplot` package (linked in the resources above) can build plots from database objects by computing the summaries inside the database
- The `dm` package lets you draw the relationships between the tables in a data model
- The `pool` package manages database connections for you
```{r}
#| eval: false
# install.packages("dm")
library(dm)
dm <- dm_nycflights13()
dm
dm %>%
  dm_draw()
```
When a local variable shares its name with a column (common when parameterising queries), the `.data` and `.env` pronouns let you say which one you mean:
```{r}
library(tidyverse)
price <- 2
diamonds |>
  mutate(
    test = .data$price * 2               # the column
    ,test3 = price * 2                   # ambiguous: the column wins
    ,test2 = .env$price * .data$price    # local variable times the column
  ) |>
  relocate(contains("test"))
```
The scratchpad below sketches how duckdb can nest columns into a list-of-structs column with its `list({...})` syntax, and how you can then collect the nested table and fit a model per group:
```{r}
#| eval: false
library(duckdb)
library(tidyverse)
# connect to a persistent on-disk database
drv <- duckdb::duckdb(dbdir = "data/database.duckdb")
con <- dbConnect(drv)
# write a table, then inspect what is in the database
duckdb::dbWriteTable(con, "diamonds.dbi", diamonds, append = TRUE)
DBI::dbListTables(con)
DBI::dbListFields(con, "diamonds.dbi")
# nest x, y, z and carat into a list column, one row per cut
nested_sql <- sql("
  select
    cut
    ,list({'x': x, 'y': y, 'z': z, 'carat': carat}) as list
  from 'diamonds.dbi'
  group by all
")
diamonds_nested_dbi <- tbl(con, nested_sql)
# collect locally, then fit a model per group with purrr
diamonds_nested_tbl <- diamonds_nested_dbi |> collect()
diamonds_mod_tbl <- diamonds_nested_tbl |>
  mutate(
    mod = map(list, \(x) lm(carat ~ ., data = x))
    ,tidy = map(mod, \(x) broom::tidy(x))
    ,glance = map(mod, \(x) broom::glance(x))
  ) |>
  select(-mod)
# write the nested table back, or hand the lazy table to arrow
duckdb::dbWriteTable(con, "diamonds_nested.dbi", diamonds_nested_tbl, append = TRUE)
diamonds_arw <- arrow::to_arrow(diamonds_nested_dbi)
# clean up experimental tables if needed
# duckdb::dbRemoveTable(con, "diamonds_nested.dbi")
```
```{r}
#| eval: false
# duckdb_register creates a temporary view (no data is copied)
duckdb_register(con, "diamonds", diamonds, overwrite = TRUE)
duckdb_register(con, "mtcars", mtcars, overwrite = TRUE)
duckdb_register(con, "cars", cars, overwrite = TRUE)
# dbWriteTable physically writes the tables into the database
duckdb::dbWriteTable(con, "diamonds.dbi", diamonds, append = TRUE)
duckdb::dbWriteTable(con, "mtcars.dbi", mtcars, append = TRUE)
duckdb::dbWriteTable(con, "cars.dbi", cars, append = TRUE)
# inspect the catalog: columns, schemata, registered views, and tables
tbl(con, sql("select * from temp.information_schema.columns"))
tbl(con, sql("select * from temp.information_schema.schemata"))
tbl(con, sql("SELECT * FROM duckdb_views()"))
duckdb::dbListTables(con)
# duckdb supports sampling directly in SQL ...
tbl(con, sql("SELECT * FROM 'cars.dbi' USING SAMPLE 50 PERCENT (bernoulli)"))
tbl(con, sql("SELECT * FROM 'cars.dbi' USING SAMPLE 50 PERCENT (system, 377)"))
# ... or you can sample a lazy table with dplyr
mtcars_dbi <- tbl(con, sql("select * from 'mtcars.dbi'"))
mtcars_dbi |> sample_n(10)
```
## How to create your own database
You need two things -- storage and data. From there you need to decide whether you want your database limited to RAM or one capable of handling larger-than-memory data.
You also choose between a persistent database (e.g. one that lives on even after you close your computer) and a temporary database (one that exists only for the session) with the initial driver argument, duckdb::duckdb().
For a temporary database, simply call duckdb::duckdb() with no arguments, or set the dbdir argument to a temp file.
This works because, without dbdir, it defaults to `DBDIR_MEMORY`, which just means using your RAM.
```{r}
library(duckdb)
# in-memory database: limited to RAM, gone when the session ends
drv1 <- duckdb::duckdb()
# temp-file database: enables larger-than-memory queries but still disappears
temp_name_vec <- tempfile(fileext = ".duckdb")
drv2 <- duckdb::duckdb(dbdir = temp_name_vec)
# persistent database: survives between sessions
db_name_vec <- "data/database.duckdb"
drv3 <- duckdb::duckdb(dbdir = db_name_vec)
```
We can check that the DBI package can use the driver:
```{r}
DBI::dbCanConnect(drv1)
```
From there, we will now build a connection to our database, which is easy to do with DBI::dbConnect()
- This step is easy to remember because you will use it to connect to any database (e.g. your own on your computer, or one on another server)
- Under the hood a lot is going on, but simply put: you identify the type of database you will connect to by referencing the database driver, then you pass your credentials to the driver so that it can make the connection. These credentials are typically information that identifies who you are, where the database is located, and any security details.
- Since you created your own database, we don't need that info, so we can leave the arguments blank (unlike when we connect to a third-party database)
```{r}
con_db <- dbConnect(drv1)
```
Okay -- so from here we have now connected to our database; we can validate that connection with `DBI::dbIsValid()`
```{r}
DBI::dbIsValid(con_db)
```
Great -- it works and is active.
So what do we have in our brand new database?
```{r}
DBI::dbListTables(con_db)
```