Skip to content

Latest commit

 

History

History
91 lines (80 loc) · 3.06 KB

README.md

File metadata and controls

91 lines (80 loc) · 3.06 KB

db_comparisons

Compare sqlite and DuckDB for aggregation queries.

Related:

Steps

  1. Copy from sqlite to duckdb. Follow create_duckdb_from_mag.md. Need access to current database.
  2. Create sqlite copy from the duckdb file created in step 1 (for exact comparison). Create indexes. TODO
  3. Compare queries: directly in Duck, directly in sqlite, and querying sqlite with duck. Compare timing; make sure results are the same. TODO

Queries to compare

Citation count

create temp table temp_citations as 
SELECT  a.PaperReferenceId, 
            b.Year, 
            b.DocType AS ReferencingDocType, 
            COUNT(DISTINCT a.PaperId) AS CitationCount 
FROM PaperReferences a 
INNER JOIN (
    SELECT PaperId, Year, DocType 
    FROM Papers 
    INNER JOIN (
        SELECT PaperId 
        FROM PaperMainFieldsOfStudy
    ) USING (PaperId)
    WHERE 
    DocType IN ('Journal', 'Conference', 'Book', 'BookChapter', 'Thesis') 
) b on a.PaperId = b.PaperId 
INNER JOIN (
    SELECT PaperId, Year, DocType 
    FROM Papers 
    INNER JOIN ( 
        SELECT PaperId 
        FROM PaperMainFieldsOfStudy
    ) USING (PaperId) 
    WHERE 
    DocType IN ('Journal', 'Conference', 'Book', 'BookChapter', 'Thesis') 
    AND 
    Year >= 1950
) c on a.PaperReferenceId = c.PaperId 
GROUP BY a.PaperReferenceId, b.Year, ReferencingDocType

Highest N papers by author-affiliation-year

create temp table author_affil_year as 
SELECT AuthorId, AffiliationId, Year
FROM (
    SELECT *,
            MAX(PaperCount) OVER(PARTITION BY AuthorId, Year) AS MaxPaperCount
    FROM (
        SELECT a.AuthorId, 
                a.AffiliationId, 
                c.Year, 
                count(PaperId) AS PaperCount
        FROM PaperAuthorAffiliations a
        INNER JOIN (
            SELECT AuthorId 
            FROM author_sample 
        ) b USING (AuthorId)
        INNER JOIN (
            SELECT PaperId, Year 
            FROM Papers
            WHERE DocType IN ('Journal', 'Conference')
        ) c USING (PaperId)
        GROUP BY a.AuthorId, a.AffiliationId, c.Year 
    ) 
)   
WHERE PaperCount = MaxPaperCount ;  

Thoughts and lessons learned

  • if sqlite does not define types, duckdb cannot convert them.
    • raises Error: Conversion Error: Unimplemented type for cast (BLOB -> INTEGER)
    • So, it's important to have a type for each column on each table when combining sqlite and duckdb.
  • "static input data" != static database.
    • what seems to matter is whether input data is static; I did not understand that correctly until today
  • sqlite still has advantages that may be overlooked by these comparisons
    • for instance, the unique constraint on indexes is quick check that data integrity is not violated; a similar mechanism should always be used even when using duckdb
  • I think Duck parallelizes across CPUs. So the gains may differ by by laptop vs cluster/server
    • are the gains proportional to the CPUs available?
    • compare both on a cluster and on a laptop