diff --git a/notebooks/load-json-files-s3/meta.toml b/notebooks/load-json-files-s3/meta.toml index 7f3a525c..c82e5c0d 100644 --- a/notebooks/load-json-files-s3/meta.toml +++ b/notebooks/load-json-files-s3/meta.toml @@ -1,9 +1,11 @@ [meta] title="Load JSON files with Pipeline from S3" -description="This notebook will help you load JSON files from a public open AWS S3 bucket. You will see two modes: -*) where you map the JSON elements to columns in a relational table -*) where you just ingest all documents ito a JSON column. In that mode we also show how you can use persisted computed column for extracting JSON fields -" +description="""\ + This notebook will help you load JSON files from a public + AWS S3 bucket. You will see two modes: + *) where you map the JSON elements to columns in a relational table + *) where you just ingest all documents into a JSON column. In that mode we also show how you can use a persisted computed column for extracting JSON fields +""" icon="chart-network" tags=["pipeline", "json", "s3"] -destinations=["spaces"] \ No newline at end of file +destinations=["spaces"] diff --git a/notebooks/load-json-files-s3/notebook.ipynb b/notebooks/load-json-files-s3/notebook.ipynb index a8067ff3..9f0ad7ff 100644 --- a/notebooks/load-json-files-s3/notebook.ipynb +++ b/notebooks/load-json-files-s3/notebook.ipynb @@ -1 +1,445 @@ -{"cells":[{"cell_type":"markdown","id":"deb8dbf4-2368-41b4-9f09-b14c96ccb344","metadata":{"language":"sql"},"source":"
\n
\n \n
\n
\n
SingleStore Notebooks
\n

Learn How to ingest JSON files in S3 into SingleStoreDB

\n
\n
"},{"cell_type":"markdown","id":"50093846-9ea3-441d-89f0-fbe0576f78bf","metadata":{},"source":"This notebook helps you navigate through different scenarios data ingestion of JSON files from an AWS S3 location:\n* Ingest JSON files in AWS S3 using wildcards with pre-defined schema\n* Ingest JSON files in AWS S3 using wildcards into a JSON column"},{"cell_type":"markdown","id":"b2ed410a-87b8-452a-b906-431fb0e949b3","metadata":{},"source":"## Create a Pipeline from JSON files in AWS S3 using wildcards"},{"cell_type":"markdown","id":"9996b479-586d-4af3-b0ee-b61eead39ebc","metadata":{},"source":"In this example, we want to create a pipeline from two JSON files called **actors1.json** and **actors2.json** stored in an AWS S3 bucket called singlestoredb and a folder called **actors**. This bucket is located in **us-east-1**."},{"cell_type":"markdown","id":"9a4caf68-0610-41a6-bfd1-59612b8e959a","metadata":{},"source":"Each file has the following shape with nested objects and arrays:\n```json\n{\n \"Actors\": [\n {\n \"name\": \"Tom Cruise\",\n \"age\": 56,\n \"Born At\": \"Syracuse, NY\",\n \"Birthdate\": \"July 3, 1962\",\n \"photo\": \"https://jsonformatter.org/img/tom-cruise.jpg\",\n \"wife\": null,\n \"weight\": 67.5,\n \"hasChildren\": true,\n \"hasGreyHair\": false,\n \"children\": [\n \"Suri\",\n \"Isabella Jane\",\n \"Connor\"\n ]\n },\n {\n \"name\": \"Robert Downey Jr.\",\n \"age\": 53,\n \"Born At\": \"New York City, NY\",\n \"Birthdate\": \"April 4, 1965\",\n \"photo\": \"https://jsonformatter.org/img/Robert-Downey-Jr.jpg\",\n \"wife\": \"Susan Downey\",\n \"weight\": 77.1,\n \"hasChildren\": true,\n \"hasGreyHair\": false,\n \"children\": [\n \"Indio Falconer\",\n \"Avri Roel\",\n \"Exton Elias\"\n ]\n }\n ]\n}\n```"},{"cell_type":"markdown","id":"98a8e14f-808e-43ff-b670-b6656091b81a","metadata":{},"source":"### Create a Table"},{"cell_type":"markdown","id":"a70e168d-de32-4988-90c4-651089ac25a0","metadata":{},"source":"We first create a table called **actors** 
in the database **demo_database**"},{"cell_type":"code","execution_count":11,"id":"b703aab8-7449-43db-af04-9d65520239a5","metadata":{"execution":{"iopub.execute_input":"2023-09-29T16:16:21.090281Z","iopub.status.busy":"2023-09-29T16:16:21.089700Z","iopub.status.idle":"2023-09-29T16:16:23.607632Z","shell.execute_reply":"2023-09-29T16:16:23.607074Z","shell.execute_reply.started":"2023-09-29T16:16:21.090258Z"},"language":"sql","tags":[],"trusted":true},"outputs":[{"data":{"text/plain":""},"execution_count":11,"metadata":{},"output_type":"execute_result"}],"source":"%%sql\nCreate database if not exists demo_database;\nUse demo_database;\nCREATE TABLE if not exists demo_database.actors (\nname text CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,\nage int NOT NULL,\nborn_at text CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,\nBirthdate text CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,\nphoto text CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,\nwife text CHARACTER SET utf8 COLLATE utf8_general_ci,\nweight float NOT NULL,\nhaschildren boolean,\nhasGreyHair boolean,\nchildren JSON COLLATE utf8_bin NOT NULL,\nSHARD KEY ()\n);"},{"cell_type":"markdown","id":"e4c15a63-eb17-432d-b0b5-d7485bcf028d","metadata":{},"source":"### Create a pipeline"},{"cell_type":"markdown","id":"5e09146a-74cb-4e0d-bd0a-3502c2d15a00","metadata":{},"source":"We then create a pipeline called **actors** in the database **demo_database**. Since those files are small, batch_interval is not as important and the maximum partitions per batch is only 1. For faster performance, we recommend increasing the maximum partitions per batch. 
\nNote, that since the bucket is publcly accessible, you do not need to provide access key and secret."},{"cell_type":"code","execution_count":13,"id":"92df7943-e68d-4509-b7f5-4a93697f6578","metadata":{"execution":{"iopub.execute_input":"2023-09-29T16:16:32.661374Z","iopub.status.busy":"2023-09-29T16:16:32.661156Z","iopub.status.idle":"2023-09-29T16:16:32.799913Z","shell.execute_reply":"2023-09-29T16:16:32.799326Z","shell.execute_reply.started":"2023-09-29T16:16:32.661355Z"},"language":"sql","trusted":true},"outputs":[{"data":{"text/plain":""},"execution_count":13,"metadata":{},"output_type":"execute_result"}],"source":"%%sql\nCREATE PIPELINE if not exists demo_database.actors\nAS LOAD DATA S3 'studiotutorials/sample_dataset/json_files/wildcard_demo/*.json'\nCONFIG '{ \\\"region\\\": \\\"us-east-1\\\" }'\n/*\nCREDENTIALS '{\"aws_access_key_id\": \"\", \n \"aws_secret_access_key\": \"\"}'\n*/\nBATCH_INTERVAL 2500\nMAX_PARTITIONS_PER_BATCH 1\nDISABLE OUT_OF_ORDER OPTIMIZATION\nDISABLE OFFSETS METADATA GC\nSKIP DUPLICATE KEY ERRORS\nINTO TABLE `actors`\nFORMAT JSON\n(\n actors.name <- name,\n actors.age <- age,\n actors.born_at <- `Born At`,\n actors.Birthdate <- Birthdate,\n actors.photo <- photo,\n actors.wife <- wife,\n actors.weight <- weight,\n actors.haschildren <- hasChildren,\n actors.hasGreyHair <- hasGreyHair,\n actors.children <- children\n);"},{"cell_type":"markdown","id":"5410c1b9-573f-4326-ba4c-b7af71e069ad","metadata":{},"source":"### Start and monitor the 
pipeline"},{"cell_type":"code","execution_count":15,"id":"eeddd12e-e28c-4000-859b-6d1291c4a137","metadata":{"execution":{"iopub.execute_input":"2023-09-29T16:16:46.866565Z","iopub.status.busy":"2023-09-29T16:16:46.866099Z","iopub.status.idle":"2023-09-29T16:16:46.884560Z","shell.execute_reply":"2023-09-29T16:16:46.884032Z","shell.execute_reply.started":"2023-09-29T16:16:46.866544Z"},"language":"sql","trusted":true},"outputs":[{"data":{"text/plain":""},"execution_count":15,"metadata":{},"output_type":"execute_result"}],"source":"%%sql\nStart pipeline demo_database.actors;"},{"cell_type":"markdown","id":"a555997d-38dc-4b69-821b-390e52bb4d00","metadata":{},"source":"If there is no error or warning, you should see no error message."},{"cell_type":"code","execution_count":17,"id":"f48de155-af85-4c40-ad56-955573a434f8","metadata":{"execution":{"iopub.execute_input":"2023-09-29T16:16:53.215976Z","iopub.status.busy":"2023-09-29T16:16:53.215653Z","iopub.status.idle":"2023-09-29T16:16:53.330859Z","shell.execute_reply":"2023-09-29T16:16:53.330370Z","shell.execute_reply.started":"2023-09-29T16:16:53.215958Z"},"language":"sql","trusted":true},"outputs":[{"data":{"text/html":"\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
DATABASE_NAMEPIPELINE_NAMEERROR_UNIX_TIMESTAMPERROR_TYPEERROR_CODEERROR_MESSAGEERROR_KINDSTD_ERRORLOAD_DATA_LINELOAD_DATA_LINE_NUMBERBATCH_IDERROR_IDBATCH_SOURCE_PARTITION_IDBATCH_EARLIEST_OFFSETBATCH_LATEST_OFFSETHOSTPORTPARTITION
","text/plain":"+---------------+---------------+----------------------+------------+------------+---------------+------------+-----------+----------------+-----------------------+----------+----------+---------------------------+-----------------------+---------------------+------+------+-----------+\n| DATABASE_NAME | PIPELINE_NAME | ERROR_UNIX_TIMESTAMP | ERROR_TYPE | ERROR_CODE | ERROR_MESSAGE | ERROR_KIND | STD_ERROR | LOAD_DATA_LINE | LOAD_DATA_LINE_NUMBER | BATCH_ID | ERROR_ID | BATCH_SOURCE_PARTITION_ID | BATCH_EARLIEST_OFFSET | BATCH_LATEST_OFFSET | HOST | PORT | PARTITION |\n+---------------+---------------+----------------------+------------+------------+---------------+------------+-----------+----------------+-----------------------+----------+----------+---------------------------+-----------------------+---------------------+------+------+-----------+\n+---------------+---------------+----------------------+------------+------------+---------------+------------+-----------+----------------+-----------------------+----------+----------+---------------------------+-----------------------+---------------------+------+------+-----------+"},"execution_count":17,"metadata":{},"output_type":"execute_result"}],"source":"%%sql\nselect * from information_schema.pipelines_errors\nwhere pipeline_name = 'actors' ;"},{"cell_type":"markdown","id":"c18ac453-63de-424a-b9bf-ae6846817ea6","metadata":{},"source":"### Query the table"},{"cell_type":"code","execution_count":18,"id":"09a739cb-4925-4699-ab61-71016a04bfb6","metadata":{"execution":{"iopub.execute_input":"2023-09-29T16:16:58.182000Z","iopub.status.busy":"2023-09-29T16:16:58.181746Z","iopub.status.idle":"2023-09-29T16:16:58.247687Z","shell.execute_reply":"2023-09-29T16:16:58.247115Z","shell.execute_reply.started":"2023-09-29T16:16:58.181983Z"},"language":"sql","trusted":true},"outputs":[{"data":{"text/html":"\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n 
\n \n \n \n \n \n
nameageborn_atBirthdatephotowifeweighthaschildrenhasGreyHairchildren
Robert Downey Jr.53New York City, NYApril 4, 1965https://jsonformatter.org/img/Robert-Downey-Jr.jpgSusan Downey77.110['Indio Falconer', 'Avri Roel', 'Exton Elias']
Tom Cruise56Syracuse, NYJuly 3, 1962https://jsonformatter.org/img/tom-cruise.jpgNone67.510['Suri', 'Isabella Jane', 'Connor']
","text/plain":"+-------------------+-----+-------------------+---------------+-------------------------------------------------------------------------------------------------------------------+--------------+--------+-------------+-------------+------------------------------------------------+\n| name | age | born_at | Birthdate | photo | wife | weight | haschildren | hasGreyHair | children |\n+-------------------+-----+-------------------+---------------+-------------------------------------------------------------------------------------------------------------------+--------------+--------+-------------+-------------+------------------------------------------------+\n| Robert Downey Jr. | 53 | New York City, NY | April 4, 1965 | https://jsonformatter.org/img/Robert-Downey-Jr.jpg | Susan Downey | 77.1 | 1 | 0 | ['Indio Falconer', 'Avri Roel', 'Exton Elias'] |\n| Tom Cruise | 56 | Syracuse, NY | July 3, 1962 | https://jsonformatter.org/img/tom-cruise.jpg | None | 67.5 | 1 | 0 | ['Suri', 'Isabella Jane', 'Connor'] |\n+-------------------+-----+-------------------+---------------+-------------------------------------------------------------------------------------------------------------------+--------------+--------+-------------+-------------+------------------------------------------------+"},"execution_count":18,"metadata":{},"output_type":"execute_result"}],"source":"%%sql\nselect * from demo_database.actors;"},{"cell_type":"markdown","id":"c4815572-10d8-4c31-a246-05ad6e7e6e99","metadata":{},"source":"### Cleanup 
ressources"},{"cell_type":"code","execution_count":20,"id":"6a6dfc1d-c758-4287-a797-6cc3e4fff934","metadata":{"execution":{"iopub.execute_input":"2023-09-29T16:17:03.196969Z","iopub.status.busy":"2023-09-29T16:17:03.196725Z","iopub.status.idle":"2023-09-29T16:17:03.232536Z","shell.execute_reply":"2023-09-29T16:17:03.231955Z","shell.execute_reply.started":"2023-09-29T16:17:03.196954Z"},"language":"sql","trusted":true},"outputs":[{"data":{"text/plain":""},"execution_count":20,"metadata":{},"output_type":"execute_result"}],"source":"%%sql\nDrop pipeline if exists demo_database.actors;\nDrop table if exists demo_database.actors;"},{"cell_type":"markdown","id":"09fbffac-9a0a-45fd-ad07-ede4e11b3691","metadata":{},"source":"## Ingest JSON files in AWS S3 using wildcards into a JSON column"},{"cell_type":"markdown","id":"d3e8ff65-1b2d-47c5-8754-28fa4c254edd","metadata":{},"source":"As the schema of your files might change, you might want to keep flexibility in ingesting the data into one JSON column that we name **json_data**. the table we create is named **actors_json**."},{"cell_type":"markdown","id":"d761f324-0d28-4713-a866-3f96673d8317","metadata":{},"source":"### Create Table"},{"cell_type":"code","execution_count":14,"id":"bcb14814-7b79-4df2-ab47-7def7ae03ce3","metadata":{"execution":{"iopub.execute_input":"2023-10-05T05:17:30.337597Z","iopub.status.busy":"2023-10-05T05:17:30.337247Z","iopub.status.idle":"2023-10-05T05:17:33.420875Z","shell.execute_reply":"2023-10-05T05:17:33.420319Z","shell.execute_reply.started":"2023-10-05T05:17:30.337580Z"},"language":"sql","trusted":true},"outputs":[{"data":{"text/html":"\n \n \n \n \n \n \n
","text/plain":"++\n||\n++\n++"},"execution_count":14,"metadata":{},"output_type":"execute_result"}],"source":"%%sql\nCreate database if not exists demo_database;\nUse demo_database;\nCREATE TABLE if not exists demo_database.actors_json (\njson_data JSON NOT NULL ,\nSHARD KEY ()\n);"},{"cell_type":"markdown","id":"429fce4b-c529-4acf-af7e-5d802f79eda6","metadata":{},"source":"### Create a pipeline"},{"cell_type":"code","execution_count":21,"id":"a1d60130-095e-45da-b55d-b427a0af3d26","metadata":{"execution":{"iopub.execute_input":"2023-10-05T05:18:03.678851Z","iopub.status.busy":"2023-10-05T05:18:03.678450Z","iopub.status.idle":"2023-10-05T05:18:03.839061Z","shell.execute_reply":"2023-10-05T05:18:03.838442Z","shell.execute_reply.started":"2023-10-05T05:18:03.678826Z"},"language":"sql","tags":[],"trusted":true},"outputs":[{"data":{"text/html":"\n \n \n \n \n \n \n
","text/plain":"++\n||\n++\n++"},"execution_count":21,"metadata":{},"output_type":"execute_result"}],"source":"%%sql\nCREATE PIPELINE if not exists demo_database.actors_json\nAS LOAD DATA S3 'studiotutorials/sample_dataset/json_files/wildcard_demo/*.json'\nCONFIG '{ \\\"region\\\": \\\"us-east-1\\\" }'\n/*\nCREDENTIALS '{\"aws_access_key_id\": \"\", \n \"aws_secret_access_key\": \"\"}'\n*/\nBATCH_INTERVAL 2500\nMAX_PARTITIONS_PER_BATCH 1\nDISABLE OUT_OF_ORDER OPTIMIZATION\nDISABLE OFFSETS METADATA GC\nSKIP DUPLICATE KEY ERRORS\nINTO TABLE `actors_json`\nFORMAT JSON\n(json_data <- %);"},{"cell_type":"markdown","id":"bd296bf5-db20-4028-a1d7-b5c9da0a6cb2","metadata":{},"source":"### Start and monitor pipeline"},{"cell_type":"code","execution_count":22,"id":"b374598a-f9cb-43c4-a2a4-ebcd298108c4","metadata":{"execution":{"iopub.execute_input":"2023-10-05T05:18:05.818171Z","iopub.status.busy":"2023-10-05T05:18:05.817819Z","iopub.status.idle":"2023-10-05T05:18:05.837111Z","shell.execute_reply":"2023-10-05T05:18:05.836663Z","shell.execute_reply.started":"2023-10-05T05:18:05.818153Z"},"language":"sql","trusted":true},"outputs":[{"data":{"text/html":"\n \n \n \n \n \n \n
","text/plain":"++\n||\n++\n++"},"execution_count":22,"metadata":{},"output_type":"execute_result"}],"source":"%%sql\nStart pipeline demo_database.actors_json;"},{"cell_type":"code","execution_count":23,"id":"ca06781b-61fa-4fea-97de-cd0dbacd86e8","metadata":{"execution":{"iopub.execute_input":"2023-10-05T05:18:07.571689Z","iopub.status.busy":"2023-10-05T05:18:07.571335Z","iopub.status.idle":"2023-10-05T05:18:07.690209Z","shell.execute_reply":"2023-10-05T05:18:07.688498Z","shell.execute_reply.started":"2023-10-05T05:18:07.571671Z"},"language":"sql","trusted":true},"outputs":[{"data":{"text/html":"\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
DATABASE_NAMEPIPELINE_NAMEERROR_UNIX_TIMESTAMPERROR_TYPEERROR_CODEERROR_MESSAGEERROR_KINDSTD_ERRORLOAD_DATA_LINELOAD_DATA_LINE_NUMBERBATCH_IDERROR_IDBATCH_SOURCE_PARTITION_IDBATCH_EARLIEST_OFFSETBATCH_LATEST_OFFSETHOSTPORTPARTITION
","text/plain":"+---------------+---------------+----------------------+------------+------------+---------------+------------+-----------+----------------+-----------------------+----------+----------+---------------------------+-----------------------+---------------------+------+------+-----------+\n| DATABASE_NAME | PIPELINE_NAME | ERROR_UNIX_TIMESTAMP | ERROR_TYPE | ERROR_CODE | ERROR_MESSAGE | ERROR_KIND | STD_ERROR | LOAD_DATA_LINE | LOAD_DATA_LINE_NUMBER | BATCH_ID | ERROR_ID | BATCH_SOURCE_PARTITION_ID | BATCH_EARLIEST_OFFSET | BATCH_LATEST_OFFSET | HOST | PORT | PARTITION |\n+---------------+---------------+----------------------+------------+------------+---------------+------------+-----------+----------------+-----------------------+----------+----------+---------------------------+-----------------------+---------------------+------+------+-----------+\n+---------------+---------------+----------------------+------------+------------+---------------+------------+-----------+----------------+-----------------------+----------+----------+---------------------------+-----------------------+---------------------+------+------+-----------+"},"execution_count":23,"metadata":{},"output_type":"execute_result"}],"source":"%%sql\n# Monitor and see if there is any error or warning\nselect * from information_schema.pipelines_errors\nwhere pipeline_name = 'actors_json' ;"},{"cell_type":"markdown","id":"7419ccdd-0f85-414e-bd05-fbe8d9656305","metadata":{},"source":"### Query the table"},{"cell_type":"code","execution_count":25,"id":"e34c5b49-0e97-4b07-9026-38bb6c370f73","metadata":{"execution":{"iopub.execute_input":"2023-10-05T05:18:10.653949Z","iopub.status.busy":"2023-10-05T05:18:10.653693Z","iopub.status.idle":"2023-10-05T05:18:11.007955Z","shell.execute_reply":"2023-10-05T05:18:11.007444Z","shell.execute_reply.started":"2023-10-05T05:18:10.653929Z"},"language":"sql","trusted":true},"outputs":[{"data":{"text/html":"\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n 
\n \n \n \n
json_data
{'Birthdate': 'July 3, 1962', 'Born At': 'Syracuse, NY', 'age': 56, 'children': ['Suri', 'Isabella Jane', 'Connor'], 'hasChildren': True, 'hasGreyHair': False, 'name': 'Tom Cruise', 'photo': 'https://jsonformatter.org/img/tom-cruise.jpg', 'weight': 67.5, 'wife': None}
{'Birthdate': 'April 4, 1965', 'Born At': 'New York City, NY', 'age': 53, 'children': ['Indio Falconer', 'Avri Roel', 'Exton Elias'], 'hasChildren': True, 'hasGreyHair': False, 'name': 'Robert Downey Jr.', 'photo': 'https://jsonformatter.org/img/Robert-Downey-Jr.jpg', 'weight': 77.1, 'wife': 'Susan Downey'}
{'Birthdate': 'April 4, 1965', 'Born At': 'New York City, NY', 'age': 53, 'children': ['Indio Falconer', 'Avri Roel', 'Exton Elias'], 'hasChildren': True, 'hasGreyHair': False, 'name': 'Robert Downey Jr.', 'photo': 'https://jsonformatter.org/img/Robert-Downey-Jr.jpg', 'weight': 77.1, 'wife': 'Susan Downey'}
{'Birthdate': 'July 3, 1962', 'Born At': 'Syracuse, NY', 'age': 56, 'children': ['Suri', 'Isabella Jane', 'Connor'], 'hasChildren': True, 'hasGreyHair': False, 'name': 'Tom Cruise', 'photo': 'https://jsonformatter.org/img/tom-cruise.jpg', 'weight': 67.5, 'wife': None}
","text/plain":"+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n| json_data |\n+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n| {'Birthdate': 'July 3, 1962', 'Born At': 'Syracuse, NY', 'age': 56, 'children': ['Suri', 'Isabella Jane', 'Connor'], 'hasChildren': True, 'hasGreyHair': False, 'name': 'Tom Cruise', 'photo': 'https://jsonformatter.org/img/tom-cruise.jpg', 'weight': 67.5, 'wife': None} |\n| {'Birthdate': 'April 4, 1965', 'Born At': 'New York City, NY', 'age': 53, 'children': ['Indio Falconer', 'Avri Roel', 'Exton Elias'], 'hasChildren': True, 'hasGreyHair': False, 'name': 'Robert Downey Jr.', 'photo': 'https://jsonformatter.org/img/Robert-Downey-Jr.jpg', 'weight': 77.1, 'wife': 'Susan Downey'} |\n| {'Birthdate': 'April 4, 1965', 'Born At': 'New York City, NY', 'age': 53, 'children': ['Indio Falconer', 'Avri Roel', 'Exton Elias'], 'hasChildren': True, 'hasGreyHair': False, 'name': 'Robert Downey Jr.', 'photo': 'https://jsonformatter.org/img/Robert-Downey-Jr.jpg', 'weight': 77.1, 'wife': 'Susan Downey'} |\n| {'Birthdate': 'July 3, 1962', 'Born At': 'Syracuse, NY', 'age': 56, 'children': ['Suri', 'Isabella Jane', 'Connor'], 'hasChildren': True, 'hasGreyHair': False, 'name': 'Tom Cruise', 'photo': 'https://jsonformatter.org/img/tom-cruise.jpg', 'weight': 67.5, 'wife': None} 
|\n+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+"},"execution_count":25,"metadata":{},"output_type":"execute_result"}],"source":"%%sql\nselect * from demo_database.actors_json"},{"cell_type":"markdown","id":"c4c155e5-a4a5-4b01-a8a7-e7e626e5fac8","metadata":{},"source":"### Cleanup ressources"},{"cell_type":"code","execution_count":27,"id":"6f0bd356-8a11-4cd9-b774-569d8f5e2520","metadata":{"execution":{"iopub.execute_input":"2023-10-05T05:18:29.840296Z","iopub.status.busy":"2023-10-05T05:18:29.840026Z","iopub.status.idle":"2023-10-05T05:18:29.881582Z","shell.execute_reply":"2023-10-05T05:18:29.880978Z","shell.execute_reply.started":"2023-10-05T05:18:29.840279Z"},"language":"sql","tags":[],"trusted":true},"outputs":[{"data":{"text/html":"\n \n \n \n \n \n \n
","text/plain":"++\n||\n++\n++"},"execution_count":27,"metadata":{},"output_type":"execute_result"}],"source":"%%sql\n\nDrop pipeline if exists demo_database.actors_json;\nDrop table if exists demo_database.actors_json;"},{"cell_type":"markdown","id":"c572193e-7f5b-4637-af5d-2f33f5ba5d86","metadata":{"language":"sql"},"source":"
\n
\n"}],"metadata":{"jupyterlab":{"notebooks":{"version_major":6,"version_minor":4}},"kernelspec":{"display_name":"Python 3 (ipykernel)","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.11.4"},"singlestore_cell_default_language":"sql","singlestore_connection":{"connectionID":"a98658e6-5e5d-4be9-be2e-9fa993172504","defaultDatabase":"demo_database"},"singlestore_row_limit":300},"nbformat":4,"nbformat_minor":5} +{ + "cells": [ + { + "cell_type": "markdown", + "id": "deb8dbf4-2368-41b4-9f09-b14c96ccb344", + "metadata": {}, + "source": [ + "
\n", + "
\n", + " \n", + "
\n", + "
\n", + "
SingleStore Notebooks
\n", + "

Load JSON files with Pipeline from S3

\n", + "
\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "50093846-9ea3-441d-89f0-fbe0576f78bf", + "metadata": {}, + "source": [ + "This notebook helps you navigate through different scenarios data ingestion of JSON files from an AWS S3 location:\n", + "* Ingest JSON files in AWS S3 using wildcards with pre-defined schema\n", + "* Ingest JSON files in AWS S3 using wildcards into a JSON column" + ] + }, + { + "cell_type": "markdown", + "id": "b2ed410a-87b8-452a-b906-431fb0e949b3", + "metadata": {}, + "source": [ + "## Create a Pipeline from JSON files in AWS S3 using wildcards" + ] + }, + { + "cell_type": "markdown", + "id": "9996b479-586d-4af3-b0ee-b61eead39ebc", + "metadata": {}, + "source": [ + "In this example, we want to create a pipeline from two JSON files called **actors1.json** and **actors2.json** stored in an AWS S3 bucket called singlestoredb and a folder called **actors**. This bucket is located in **us-east-1**." + ] + }, + { + "cell_type": "markdown", + "id": "9a4caf68-0610-41a6-bfd1-59612b8e959a", + "metadata": {}, + "source": [ + "Each file has the following shape with nested objects and arrays:\n", + "```json\n", + "{\n", + " \"Actors\": [\n", + " {\n", + " \"name\": \"Tom Cruise\",\n", + " \"age\": 56,\n", + " \"Born At\": \"Syracuse, NY\",\n", + " \"Birthdate\": \"July 3, 1962\",\n", + " \"photo\": \"https://jsonformatter.org/img/tom-cruise.jpg\",\n", + " \"wife\": null,\n", + " \"weight\": 67.5,\n", + " \"hasChildren\": true,\n", + " \"hasGreyHair\": false,\n", + " \"children\": [\n", + " \"Suri\",\n", + " \"Isabella Jane\",\n", + " \"Connor\"\n", + " ]\n", + " },\n", + " {\n", + " \"name\": \"Robert Downey Jr.\",\n", + " \"age\": 53,\n", + " \"Born At\": \"New York City, NY\",\n", + " \"Birthdate\": \"April 4, 1965\",\n", + " \"photo\": \"https://jsonformatter.org/img/Robert-Downey-Jr.jpg\",\n", + " \"wife\": \"Susan Downey\",\n", + " \"weight\": 77.1,\n", + " \"hasChildren\": true,\n", + " \"hasGreyHair\": false,\n", + " \"children\": [\n", + " 
\"Indio Falconer\",\n", + " \"Avri Roel\",\n", + " \"Exton Elias\"\n", + " ]\n", + " }\n", + " ]\n", + "}\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "98a8e14f-808e-43ff-b670-b6656091b81a", + "metadata": {}, + "source": [ + "### Create a Table" + ] + }, + { + "cell_type": "markdown", + "id": "a70e168d-de32-4988-90c4-651089ac25a0", + "metadata": {}, + "source": [ + "We first create a table called **actors** in the database **demo_database**" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "b703aab8-7449-43db-af04-9d65520239a5", + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "Create database if not exists demo_database;\n", + "Use demo_database;\n", + "CREATE TABLE if not exists demo_database.actors (\n", + "name text CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,\n", + "age int NOT NULL,\n", + "born_at text CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,\n", + "Birthdate text CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,\n", + "photo text CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,\n", + "wife text CHARACTER SET utf8 COLLATE utf8_general_ci,\n", + "weight float NOT NULL,\n", + "haschildren boolean,\n", + "hasGreyHair boolean,\n", + "children JSON COLLATE utf8_bin NOT NULL,\n", + "SHARD KEY ()\n", + ");" + ] + }, + { + "cell_type": "markdown", + "id": "e4c15a63-eb17-432d-b0b5-d7485bcf028d", + "metadata": {}, + "source": [ + "### Create a pipeline" + ] + }, + { + "cell_type": "markdown", + "id": "5e09146a-74cb-4e0d-bd0a-3502c2d15a00", + "metadata": {}, + "source": [ + "We then create a pipeline called **actors** in the database **demo_database**. Since those files are small, batch_interval is not as important and the maximum partitions per batch is only 1. For faster performance, we recommend increasing the maximum partitions per batch.\n", + "Note, that since the bucket is publcly accessible, you do not need to provide access key and secret." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "92df7943-e68d-4509-b7f5-4a93697f6578", + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "CREATE PIPELINE if not exists demo_database.actors\n", + "AS LOAD DATA S3 'studiotutorials/sample_dataset/json_files/wildcard_demo/*.json'\n", + "CONFIG '{ \\\"region\\\": \\\"us-east-1\\\" }'\n", + "/*\n", + "CREDENTIALS '{\"aws_access_key_id\": \"\",\n", + "              \"aws_secret_access_key\": \"\"}'\n", + "*/\n", + "BATCH_INTERVAL 2500\n", + "MAX_PARTITIONS_PER_BATCH 1\n", + "DISABLE OUT_OF_ORDER OPTIMIZATION\n", + "DISABLE OFFSETS METADATA GC\n", + "SKIP DUPLICATE KEY ERRORS\n", + "INTO TABLE `actors`\n", + "FORMAT JSON\n", + "(\n", + "    actors.name <- name,\n", + "    actors.age <- age,\n", + "    actors.born_at <- `Born At`,\n", + "    actors.Birthdate <- Birthdate,\n", + "    actors.photo <- photo,\n", + "    actors.wife <- wife,\n", + "    actors.weight <- weight,\n", + "    actors.haschildren <- hasChildren,\n", + "    actors.hasGreyHair <- hasGreyHair,\n", + "    actors.children <- children\n", + ");" + ] + }, + { + "cell_type": "markdown", + "id": "5410c1b9-573f-4326-ba4c-b7af71e069ad", + "metadata": {}, + "source": [ + "### Start and monitor the pipeline" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "eeddd12e-e28c-4000-859b-6d1291c4a137", + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "Start pipeline demo_database.actors;" + ] + }, + { + "cell_type": "markdown", + "id": "a555997d-38dc-4b69-821b-390e52bb4d00", + "metadata": {}, + "source": [ + "The next query checks the pipeline for errors and warnings; if the ingestion succeeded, it returns an empty result."
+ ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "f48de155-af85-4c40-ad56-955573a434f8", + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "select * from information_schema.pipelines_errors\n", + "where pipeline_name = 'actors';" + ] + }, + { + "cell_type": "markdown", + "id": "c18ac453-63de-424a-b9bf-ae6846817ea6", + "metadata": {}, + "source": [ + "### Query the table" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "09a739cb-4925-4699-ab61-71016a04bfb6", + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "select * from demo_database.actors;" + ] + }, + { + "cell_type": "markdown", + "id": "c4815572-10d8-4c31-a246-05ad6e7e6e99", + "metadata": {}, + "source": [ + "### Clean up resources" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "6a6dfc1d-c758-4287-a797-6cc3e4fff934", + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "Drop pipeline if exists demo_database.actors;\n", + "Drop table if exists demo_database.actors;" + ] + }, + { + "cell_type": "markdown", + "id": "09fbffac-9a0a-45fd-ad07-ede4e11b3691", + "metadata": {}, + "source": [ + "## Ingest JSON files in AWS S3 using wildcards into a JSON column" + ] + }, + { + "cell_type": "markdown", + "id": "d3e8ff65-1b2d-47c5-8754-28fa4c254edd", + "metadata": {}, + "source": [ + "As the schema of your files might change, you might want to keep the flexibility of ingesting the data into a single JSON column, which we name **json_data**. The table we create is named **actors_json**."
+ ] + }, + { + "cell_type": "markdown", + "id": "d761f324-0d28-4713-a866-3f96673d8317", + "metadata": {}, + "source": [ + "### Create a Table" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "bcb14814-7b79-4df2-ab47-7def7ae03ce3", + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "Create database if not exists demo_database;\n", + "Use demo_database;\n", + "CREATE TABLE if not exists demo_database.actors_json (\n", + "json_data JSON NOT NULL,\n", + "SHARD KEY ()\n", + ");" + ] + }, + { + "cell_type": "markdown", + "id": "429fce4b-c529-4acf-af7e-5d802f79eda6", + "metadata": {}, + "source": [ + "### Create a pipeline" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "a1d60130-095e-45da-b55d-b427a0af3d26", + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "CREATE PIPELINE if not exists demo_database.actors_json\n", + "AS LOAD DATA S3 'studiotutorials/sample_dataset/json_files/wildcard_demo/*.json'\n", + "CONFIG '{ \\\"region\\\": \\\"us-east-1\\\" }'\n", + "/*\n", + "CREDENTIALS '{\"aws_access_key_id\": \"\",\n", + "              \"aws_secret_access_key\": \"\"}'\n", + "*/\n", + "BATCH_INTERVAL 2500\n", + "MAX_PARTITIONS_PER_BATCH 1\n", + "DISABLE OUT_OF_ORDER OPTIMIZATION\n", + "DISABLE OFFSETS METADATA GC\n", + "SKIP DUPLICATE KEY ERRORS\n", + "INTO TABLE `actors_json`\n", + "FORMAT JSON\n", + "(json_data <- %);" + ] + }, + { + "cell_type": "markdown", + "id": "bd296bf5-db20-4028-a1d7-b5c9da0a6cb2", + "metadata": {}, + "source": [ + "### Start and monitor the pipeline" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "b374598a-f9cb-43c4-a2a4-ebcd298108c4", + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "Start pipeline demo_database.actors_json;" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "ca06781b-61fa-4fea-97de-cd0dbacd86e8", + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "# Monitor and see if there is any error or warning\n", + "select * 
FROM information_schema.pipelines_errors\n", + "WHERE pipeline_name = 'actors_json';" + ] + }, + { + "cell_type": "markdown", + "id": "7419ccdd-0f85-414e-bd05-fbe8d9656305", + "metadata": {}, + "source": [ + "### Query the table" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "e34c5b49-0e97-4b07-9026-38bb6c370f73", + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "SELECT * FROM demo_database.actors_json;" + ] + }, + { + "cell_type": "markdown", + "id": "c4c155e5-a4a5-4b01-a8a7-e7e626e5fac8", + "metadata": {}, + "source": [ + "### Cleanup resources" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "6f0bd356-8a11-4cd9-b774-569d8f5e2520", + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "\n", + "DROP PIPELINE IF EXISTS demo_database.actors_json;\n", + "DROP TABLE IF EXISTS demo_database.actors_json;" + ] + }, + { + "cell_type": "markdown", + "id": "c572193e-7f5b-4637-af5d-2f33f5ba5d86", + "metadata": {}, + "source": [ + "
\n", + "
" + ] + } + ], + "metadata": { + "jupyterlab": { + "notebooks": { + "version_major": 6, + "version_minor": 4 + } + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.4" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/notebooks/optimize-performance-with-tpch-100/meta.toml b/notebooks/optimize-performance-with-tpch-100/meta.toml index 7562ad1c..330d242e 100644 --- a/notebooks/optimize-performance-with-tpch-100/meta.toml +++ b/notebooks/optimize-performance-with-tpch-100/meta.toml @@ -1,12 +1,17 @@ [meta] title="Learn how to Optimize Performance with TPCH 100" -description="This notebook will help you understand how you can take advantage of SingleStoreDB distributed capability using TPCH-100. We recommend using a S2 or S4 workspace to see the difference in performance. -If you come from single node database, this is an important step to follow to scale your performance linearly as your data grows. +description="""\ + This notebook will help you understand how you can take advantage of + SingleStoreDB distributed capability using TPCH-100. We recommend using + a S2 or S4 workspace to see the difference in performance. -You will see two areas: -*) Ingesting data using pipeline at a super fast speed (You will use a real-time embedded dashboard) -*) Compare query performance with an unoptimized database and an optimized database -" + If you come from single node database, this is an important step to follow + to scale your performance linearly as your data grows. 
+ + You will see two areas: + *) Ingesting data using pipelines at super-fast speed (you will use a real-time embedded dashboard) + *) Comparing query performance between an unoptimized and an optimized database +""" icon="database" tags=["performance", "benchmark", "tpch", "benchmark", "shardkey", "ingest"] -destinations=["spaces"] \ No newline at end of file +destinations=["spaces"] diff --git a/notebooks/optimize-performance-with-tpch-100/notebook.ipynb b/notebooks/optimize-performance-with-tpch-100/notebook.ipynb index 84137758..cc9a0713 100644 --- a/notebooks/optimize-performance-with-tpch-100/notebook.ipynb +++ b/notebooks/optimize-performance-with-tpch-100/notebook.ipynb @@ -1 +1,1113 @@ -{"cells":[{"attachments":null,"cell_type":"markdown","id":"8e67bcbe-6ace-4ca9-b28c-927b4b5a85b2","metadata":{"language":"sql"},"source":"
\n
\n \n
\n
\n
SingleStore Notebooks
\n

Learn How to Optimize Table Data Structures with TPCH Benchmark

\n
\n
"},{"attachments":null,"cell_type":"markdown","id":"5d93af8b-eb1d-4207-a060-1a45c46d8b41","metadata":{"language":"sql"},"source":"### Context\n\nThis notebook will help you with four core key principles for getting performance out of SingleStoreDB using TPCH Benchmark. SingleStoreDB is a distributed database, so you should think of using shard keys, database partitions, primary keys and indexes for getting the best performance out of it. \n\n##### About database partitions\nThe generalized recommendation for most clusters is to have 4 CPU cores per database partition on each leaf. This means if you had a cluster with 16 cores on each of 4 leaves (64 CPU cores total across all leaf hosts), you would want to have 4 partitions on each leaf (16 partitions throughout the cluster). If you are using a S00 workspace, you will have 2 partitions per database. ***Note*** *that increasing partitions will have additional memory and caching overheads, which can be expensive if you have thousands of tables*\n\n##### About shard keys\nData is distributed across the SingleStoreDB Cloud workspace into a number of partitions on the leaf nodes. The shard key is a collection of the columns in a table that are used to control how the rows of that table are distributed. To determine the partition responsible for a given row, SingleStoreDB Cloud computes a hash from all the columns in the shard key to the partition ID. Therefore, rows with the same shard key will reside on the same partition.\n\n##### About hash indexes\nThey are highly efficient for exact-match lookups (point-reads). Because hash indexes store rows in a sparse array of buckets indexed through a hash function on the relevant columns, queries can quickly retrieve data by examining only the corresponding bucket rather than searching the entire dataset. 
This enables significant reduction in lookup time and hence, increased performance for specific query types.\n\n**For that tutorial, we recommend using a workspace of size S4 to ingest data faster and also see the difference and gain you can get from a distributed architecture.**"},{"attachments":null,"cell_type":"markdown","id":"67f041ef-5605-43ef-8ca0-5db3194b4cad","metadata":{"language":"sql"},"source":"
\n \n
\n

Note

\n

For that tutorial, we recommend using workspace of size S4 to ingest data faster and also see the difference and gain you can get from a distributed architecture.

\n
\n
"},{"attachments":null,"cell_type":"markdown","id":"6052728d-7828-4fb2-bb53-b960a7ad43af","metadata":{"language":"sql"},"source":"### Let's first create the unoptimized database"},{"cell_type":"code","execution_count":null,"id":"7301a602-48cf-4f3b-9cbc-2e7184d97ae0","metadata":{"language":"sql"},"outputs":[],"source":"%%sql\ncreate database if not exists s2_tpch_unoptimized\n\n# To create a database with custom partitions use the following syntax: CREATE DATABASE YourDatabaseName PARTITIONS=X;\n# You cannot change after creation the number of partitions"},{"attachments":null,"cell_type":"markdown","id":"94c8bb6f-658d-4434-9074-4847f1c7d721","metadata":{"language":"sql"},"source":"If using a S00, the database will have 2 partitions, if using S1, it will have 8 partitions"},{"cell_type":"code","execution_count":null,"id":"b36585b9-4d52-4301-ac68-60fa49425751","metadata":{"language":"sql"},"outputs":[],"source":"%%sql\nSELECT num_partitions FROM information_schema.DISTRIBUTED_DATABASES WHERE database_name = 's2_tpch_unoptimized';"},{"attachments":null,"cell_type":"markdown","id":"b576e31c-6a67-4126-86ab-480fd96805d3","metadata":{"language":"sql"},"source":"##### Let's create all the tables in that database with no index, shard key or primary key"},{"attachments":null,"cell_type":"markdown","id":"4587c575-9b5a-4535-bebe-70779064e9dc","metadata":{"execution":{"iopub.execute_input":"2023-10-02T05:17:36.998748Z","iopub.status.busy":"2023-10-02T05:17:36.998493Z","iopub.status.idle":"2023-10-02T05:17:37.009703Z","shell.execute_reply":"2023-10-02T05:17:37.009283Z","shell.execute_reply.started":"2023-10-02T05:17:36.998731Z"},"language":"sql"},"source":"
\n \n
\n

Action Required

\n

Make sure to select the s2_tpch_unoptimized database from the drop-down menu at the top of this notebook.\n It updates the connection_url to connect to that database.

\n
\n
"},{"cell_type":"code","execution_count":null,"id":"afde1362-2d38-4732-94ed-6d4ed05a6806","metadata":{"language":"sql"},"outputs":[],"source":"%%sql\nCREATE TABLE IF NOT EXISTS `customer` (\n `c_custkey` int(11) NOT NULL,\n `c_name` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n `c_address` varchar(40) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n `c_nationkey` int(11) NOT NULL,\n `c_phone` varchar(15) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n `c_acctbal` decimal(15,2) NOT NULL,\n `c_mktsegment` varchar(10) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n `c_comment` varchar(117) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL\n);\n\nCREATE TABLE IF NOT EXISTS `lineitem` (\n `l_orderkey` bigint(11) NOT NULL,\n `l_partkey` int(11) NOT NULL,\n `l_suppkey` int(11) NOT NULL,\n `l_linenumber` int(11) NOT NULL,\n `l_quantity` decimal(15,2) NOT NULL,\n `l_extendedprice` decimal(15,2) NOT NULL,\n `l_discount` decimal(15,2) NOT NULL,\n `l_tax` decimal(15,2) NOT NULL,\n `l_returnflag` char(1) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n `l_linestatus` char(1) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n `l_shipdate` date NOT NULL,\n `l_commitdate` date NOT NULL,\n `l_receiptdate` date NOT NULL,\n `l_shipinstruct` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n `l_shipmode` varchar(10) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n `l_comment` varchar(44) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL\n);\n\nCREATE TABLE IF NOT EXISTS `nation` (\n `n_nationkey` int(11) NOT NULL,\n `n_name` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n `n_regionkey` int(11) NOT NULL,\n `n_comment` varchar(152) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL\n);\n\nCREATE TABLE IF NOT EXISTS `orders` (\n `o_orderkey` bigint(11) NOT NULL,\n `o_custkey` int(11) NOT NULL,\n `o_orderstatus` char(1) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n `o_totalprice` decimal(15,2) NOT NULL,\n `o_orderdate` date NOT NULL,\n `o_orderpriority` varchar(15) 
CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n `o_clerk` varchar(15) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n `o_shippriority` int(11) NOT NULL,\n `o_comment` varchar(79) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL\n);\n\nCREATE TABLE IF NOT EXISTS `part` (\n `p_partkey` int(11) NOT NULL,\n `p_name` varchar(55) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n `p_mfgr` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n `p_brand` varchar(10) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n `p_type` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n `p_size` int(11) NOT NULL,\n `p_container` varchar(10) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n `p_retailprice` decimal(15,2) NOT NULL,\n `p_comment` varchar(23) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL\n);\n\nCREATE TABLE IF NOT EXISTS `partsupp` (\n `ps_partkey` int(11) NOT NULL,\n `ps_suppkey` int(11) NOT NULL,\n `ps_availqty` int(11) NOT NULL,\n `ps_supplycost` decimal(15,2) NOT NULL,\n `ps_comment` varchar(199) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL\n);\n\nCREATE TABLE IF NOT EXISTS `region` (\n `r_regionkey` int(11) NOT NULL,\n `r_name` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n `r_comment` varchar(152) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL\n);\n\nCREATE TABLE IF NOT EXISTS `supplier` (\n `s_suppkey` int(11) NOT NULL,\n `s_name` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n `s_address` varchar(40) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n `s_nationkey` int(11) NOT NULL,\n `s_phone` varchar(15) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n `s_acctbal` decimal(15,2) NOT NULL,\n `s_comment` varchar(101) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL\n);"},{"attachments":null,"cell_type":"markdown","id":"09711e8c-fb01-4e10-862b-ee5350be6076","metadata":{"language":"sql"},"source":"### Now let's create the pipelines and run them to ingest 
data"},{"cell_type":"code","execution_count":null,"id":"4e8ca124-ac4b-49de-b0ee-d9441d43bedd","metadata":{"language":"sql"},"outputs":[],"source":"%%sql\nCREATE PIPELINE IF NOT EXISTS `customer_pipeline`\nAS LOAD DATA S3 'memsql-tpch-dataset/sf_100/customer'\nCONFIG '{\\\"region\\\":\\\"us-east-1\\\", \\\"disable_gunzip\\\": false}'\nBATCH_INTERVAL 2500\nDISABLE OUT_OF_ORDER OPTIMIZATION\nDISABLE OFFSETS METADATA GC\nSKIP DUPLICATE KEY ERRORS\nINTO TABLE `customer`\nFIELDS TERMINATED BY '|' ENCLOSED BY '' ESCAPED BY '\\\\'\nLINES TERMINATED BY '|\\n' STARTING BY '';"},{"cell_type":"code","execution_count":null,"id":"ce739af4-6839-4751-8a81-019fb26cad72","metadata":{"language":"sql"},"outputs":[],"source":"%%sql\nCREATE PIPELINE IF NOT EXISTS `lineitem_pipeline`\nAS LOAD DATA S3 'memsql-tpch-dataset/sf_100/lineitem/lineitem.'\nCONFIG '{\\\"region\\\":\\\"us-east-1\\\", \\\"disable_gunzip\\\": false}'\nBATCH_INTERVAL 2500\nDISABLE OUT_OF_ORDER OPTIMIZATION\nDISABLE OFFSETS METADATA GC\nSKIP DUPLICATE KEY ERRORS\nINTO TABLE `lineitem`\nFIELDS TERMINATED BY '|' ENCLOSED BY '' ESCAPED BY '\\\\'\nLINES TERMINATED BY '|\\n' STARTING BY '';"},{"cell_type":"code","execution_count":null,"id":"bfdd5bbc-702a-4f77-b771-cd38189040e0","metadata":{"language":"sql"},"outputs":[],"source":"%%sql\nCREATE PIPELINE IF NOT EXISTS `nation_pipeline`\nAS LOAD DATA S3 'memsql-tpch-dataset/sf_100/nation'\nCONFIG '{\\\"region\\\":\\\"us-east-1\\\", \\\"disable_gunzip\\\": false}'\nBATCH_INTERVAL 2500\nDISABLE OUT_OF_ORDER OPTIMIZATION\nDISABLE OFFSETS METADATA GC\nSKIP DUPLICATE KEY ERRORS\nINTO TABLE `nation`\nFIELDS TERMINATED BY '|' ENCLOSED BY '' ESCAPED BY '\\\\'\nLINES TERMINATED BY '|\\n' STARTING BY '';"},{"cell_type":"code","execution_count":null,"id":"1b040083-f864-4e64-9bd2-00b1ff7d1e2b","metadata":{"language":"sql"},"outputs":[],"source":"%%sql\nCREATE PIPELINE IF NOT EXISTS `orders_pipeline`\nAS LOAD DATA S3 'memsql-tpch-dataset/sf_100/orders'\nCONFIG 
'{\\\"region\\\":\\\"us-east-1\\\", \\\"disable_gunzip\\\": false}'\nBATCH_INTERVAL 2500\nDISABLE OUT_OF_ORDER OPTIMIZATION\nDISABLE OFFSETS METADATA GC\nSKIP DUPLICATE KEY ERRORS\nINTO TABLE `orders`\nFIELDS TERMINATED BY '|' ENCLOSED BY '' ESCAPED BY '\\\\'\nLINES TERMINATED BY '|\\n' STARTING BY '';"},{"cell_type":"code","execution_count":null,"id":"b5e06dfe-f679-4fe2-bd47-802a0b127270","metadata":{"language":"sql"},"outputs":[],"source":"%%sql\nCREATE PIPELINE IF NOT EXISTS `partsupp_pipeline`\nAS LOAD DATA S3 'memsql-tpch-dataset/sf_100/partsupp'\nCONFIG '{\\\"region\\\":\\\"us-east-1\\\", \\\"disable_gunzip\\\": false}'\nBATCH_INTERVAL 2500\nDISABLE OUT_OF_ORDER OPTIMIZATION\nDISABLE OFFSETS METADATA GC\nSKIP DUPLICATE KEY ERRORS\nINTO TABLE `partsupp`\nFIELDS TERMINATED BY '|' ENCLOSED BY '' ESCAPED BY '\\\\'\nLINES TERMINATED BY '|\\n' STARTING BY '';"},{"cell_type":"code","execution_count":null,"id":"8b2726ea-9d0c-4809-bf35-b851821cf336","metadata":{"language":"sql"},"outputs":[],"source":"%%sql\nCREATE PIPELINE IF NOT EXISTS `part_pipeline`\nAS LOAD DATA S3 'memsql-tpch-dataset/sf_100/part'\nCONFIG '{\\\"region\\\":\\\"us-east-1\\\", \\\"disable_gunzip\\\": false}'\nBATCH_INTERVAL 2500\nDISABLE OUT_OF_ORDER OPTIMIZATION\nDISABLE OFFSETS METADATA GC\nSKIP DUPLICATE KEY ERRORS\nINTO TABLE `part`\nFIELDS TERMINATED BY '|' ENCLOSED BY '' ESCAPED BY '\\\\'\nLINES TERMINATED BY '|\\n' STARTING BY '';"},{"cell_type":"code","execution_count":null,"id":"06114a70-d7f9-4c1f-b6a6-4554b33bb5c6","metadata":{"language":"sql"},"outputs":[],"source":"%%sql\nCREATE PIPELINE IF NOT EXISTS `region_pipeline`\nAS LOAD DATA S3 'memsql-tpch-dataset/sf_100/region'\nCONFIG '{\\\"region\\\":\\\"us-east-1\\\", \\\"disable_gunzip\\\": false}'\nBATCH_INTERVAL 2500\nDISABLE OUT_OF_ORDER OPTIMIZATION\nDISABLE OFFSETS METADATA GC\nSKIP DUPLICATE KEY ERRORS\nINTO TABLE `region`\nFIELDS TERMINATED BY '|' ENCLOSED BY '' ESCAPED BY '\\\\'\nLINES TERMINATED BY '|\\n' STARTING BY 
'';"},{"cell_type":"code","execution_count":null,"id":"bfc2e19d-f32c-4b9e-8fa7-1e68711f834e","metadata":{"language":"sql"},"outputs":[],"source":"%%sql\nCREATE PIPELINE IF NOT EXISTS `supplier_pipeline`\nAS LOAD DATA S3 'memsql-tpch-dataset/sf_100/supplier'\nCONFIG '{\\\"region\\\":\\\"us-east-1\\\", \\\"disable_gunzip\\\": false}'\nBATCH_INTERVAL 2500\nDISABLE OUT_OF_ORDER OPTIMIZATION\nDISABLE OFFSETS METADATA GC\nSKIP DUPLICATE KEY ERRORS\nINTO TABLE `supplier`\nFIELDS TERMINATED BY '|' ENCLOSED BY '' ESCAPED BY '\\\\'\nLINES TERMINATED BY '|\\n' STARTING BY '';"},{"cell_type":"code","execution_count":null,"id":"e19cb045-bebc-4aa6-92d7-ae06778d8af8","metadata":{"language":"sql"},"outputs":[],"source":"%%sql\nSTART PIPELINE customer_pipeline;\nSTART PIPELINE lineitem_pipeline;\nSTART PIPELINE nation_pipeline;\nSTART PIPELINE orders_pipeline;\nSTART PIPELINE partsupp_pipeline;\nSTART PIPELINE part_pipeline;\nSTART PIPELINE region_pipeline;\nSTART PIPELINE supplier_pipeline;"},{"attachments":null,"cell_type":"markdown","id":"3eacdd09-9b27-4995-a3df-9514cd733a57","metadata":{"language":"sql"},"source":"#### [Optional Step] Check data ingestion in real-time with Perspective"},{"cell_type":"code","execution_count":null,"id":"e5c3fa1a-af9b-4fdb-98bc-9c73a1fa7f33","metadata":{"language":"python"},"outputs":[],"source":"pip install perspective-python --quiet"},{"cell_type":"code","execution_count":null,"id":"b61f205f-5d1d-4af2-8369-e31057c76f66","metadata":{"language":"python"},"outputs":[],"source":"import perspective\nimport threading\nimport random\nimport time\nfrom datetime import datetime, date\nfrom perspective import Table, PerspectiveWidget\nimport warnings\nwarnings.filterwarnings('ignore')"},{"cell_type":"code","execution_count":null,"id":"ac376696-0b7e-4182-bc04-ddef400b7fca","metadata":{"language":"python"},"outputs":[],"source":"def loop():\n while mode != 'stop':\n while mode == 'run':\n table.update(data_source())\n 
time.sleep(1)"},{"cell_type":"code","execution_count":null,"id":"5e2ce253-576d-49c9-a92b-f165ffcc4ae7","metadata":{"language":"python"},"outputs":[],"source":"def data_source():\n result = %sql select sum(rows_streamed) as rows_streamed from information_schema.pipelines_batches_summary where database_name = 's2_tpch_unoptimized';\n result2 = list(result.dicts()) \n return result2\n\nSCHEMA = {\n \"rows_streamed\": int\n}"},{"cell_type":"code","execution_count":null,"id":"d388a00d-23ec-45d2-a94c-5e747da707c0","metadata":{"language":"python"},"outputs":[],"source":"mode = 'run'\ntable = perspective.Table(SCHEMA, limit=100)\nthreading.Thread(target=loop).start()"},{"cell_type":"code","execution_count":null,"id":"9e88e523-63b1-48e3-bde5-a2840490199c","metadata":{"language":"python"},"outputs":[],"source":"perspective.PerspectiveWidget(table,title = \"Track Row Ingestion\",plugin=\"Y Line\",columns=[\"count_rows\"])"},{"cell_type":"code","execution_count":null,"id":"debbed63-11dc-43dc-9f12-a0e80a5b7703","metadata":{"language":"python"},"outputs":[],"source":"mode = 'stop'"},{"attachments":null,"cell_type":"markdown","id":"442281db-7fd2-4bba-9104-84f3b0537a9a","metadata":{"language":"sql"},"source":"### Now, let's see the performance of a few queries"},{"cell_type":"code","execution_count":null,"id":"97e9050b-3b8e-40e3-b871-2b0bb73eb5ae","metadata":{"language":"sql"},"outputs":[],"source":"%%sql\n# TPC-H Query 1: Pricing Summary Report \nselect \n l_returnflag,\n l_linestatus,\n sum(l_quantity) as sum_qty,\n sum(l_extendedprice) as sum_base_price,\n sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,\n sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,\n avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price,\n avg(l_discount) as avg_disc,\n count(*) as count_order \nfrom s2_tpch_unoptimized.lineitem\nwhere l_shipdate <= date('1998-12-01') - interval '90' day \ngroup by l_returnflag, l_linestatus \norder by l_returnflag, 
l_linestatus;"},{"cell_type":"code","execution_count":null,"id":"71669c07-d06e-41f4-bd63-5046add22afb","metadata":{"language":"sql"},"outputs":[],"source":"%%sql\n# TPC-H Query 4: Order Priority Checking \nselect\n o_orderpriority,\n count(*) as order_count\nfrom\n s2_tpch_unoptimized.orders\nwhere\n o_orderdate >= date('1993-07-01')\n and o_orderdate < date('1993-10-01')\n and exists (\n select *\n from s2_tpch_unoptimized.lineitem\n where l_orderkey = o_orderkey and l_commitdate < l_receiptdate\n )\ngroup by o_orderpriority\norder by o_orderpriority;"},{"cell_type":"code","execution_count":null,"id":"654f872b-4a35-4897-86d2-c51b548919b8","metadata":{"language":"sql"},"outputs":[],"source":"%%sql\n-- TPC-H Query 21: Suppliers Who Kept Orders Waiting \n\nselect\n s_name,\n count(*) as numwait\nfrom\n s2_tpch_unoptimized.supplier,\n s2_tpch_unoptimized.lineitem l1,\n s2_tpch_unoptimized.orders,\n s2_tpch_unoptimized.nation\nwhere\n s_suppkey = l1.l_suppkey\n and o_orderkey = l1.l_orderkey\n and o_orderstatus = 'F'\n and l1.l_receiptdate > l1.l_commitdate\n and exists (\n select\n *\n from\n s2_tpch_unoptimized.lineitem l2\n where\n l2.l_orderkey = l1.l_orderkey\n and l2.l_suppkey <> l1.l_suppkey\n )\n and not exists (\n select\n *\n from\n s2_tpch_unoptimized.lineitem l3\n where\n l3.l_orderkey = l1.l_orderkey\n and l3.l_suppkey <> l1.l_suppkey\n and l3.l_receiptdate > l3.l_commitdate\n )\n and s_nationkey = n_nationkey\n and n_name = 'EGYPT'\ngroup by\n s_name\norder by\n numwait desc,\n s_name\nlimit 100;"},{"attachments":null,"cell_type":"markdown","id":"0ad2d768-cb53-4e0b-8353-6178c8f9508c","metadata":{"language":"sql"},"source":"### Now, let's first focus on optimizing the performance"},{"cell_type":"code","execution_count":null,"id":"d1844926-ae73-4fec-b5ba-802c8173e846","metadata":{"language":"sql"},"outputs":[],"source":"%%sql\ncreate database if not exists 
s2_tpch_optimized"},{"attachments":null,"cell_type":"markdown","id":"a163696a-0507-4b05-9146-6cbfa1ba1e29","metadata":{"language":"sql"},"source":"
\n \n
\n

Action Required

\n

Make sure to select the s2_tpch_optimized database from the drop-down menu at the top of this notebook.\n It updates the connection_url to connect to that database.

\n
\n
"},{"attachments":null,"cell_type":"markdown","id":"bafd1114-f88c-409c-8281-b32ac27f1222","metadata":{"language":"sql"},"source":"##### Now, let's create each table with optimized data structure:\n* We create a unique key through primary key. For example **lineitem** table needs both the orderkey and linenumber to identify rows by uniqueness\n* We create a shard key which will distribute data in an efficient way to perform fast join and filtering. For **lineitem** table since we perform joins and calculation based on the orderkey we create a shardkey with orderkey"},{"cell_type":"code","execution_count":null,"id":"ff6aea3d-a965-475d-becd-3910965f5c8f","metadata":{"language":"sql"},"outputs":[],"source":"%%sql\nCREATE TABLE IF NOT EXISTS `lineitem` (\n `l_orderkey` bigint(11) NOT NULL,\n `l_partkey` int(11) NOT NULL,\n `l_suppkey` int(11) NOT NULL,\n `l_linenumber` int(11) NOT NULL,\n `l_quantity` decimal(15,2) NOT NULL,\n `l_extendedprice` decimal(15,2) NOT NULL,\n `l_discount` decimal(15,2) NOT NULL,\n `l_tax` decimal(15,2) NOT NULL,\n `l_returnflag` char(1) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n `l_linestatus` char(1) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n `l_shipdate` date NOT NULL,\n `l_commitdate` date NOT NULL,\n `l_receiptdate` date NOT NULL,\n `l_shipinstruct` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n `l_shipmode` varchar(10) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n `l_comment` varchar(44) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n UNIQUE KEY `pk` (`l_orderkey`,`l_linenumber`) USING HASH,\n SHARD KEY `__SHARDKEY` (`l_orderkey`),\n KEY `l_orderkey` (`l_orderkey`) USING CLUSTERED COLUMNSTORE\n);\n\nCREATE TABLE IF NOT EXISTS `customer` (\n `c_custkey` int(11) NOT NULL,\n `c_name` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n `c_address` varchar(40) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n `c_nationkey` int(11) NOT NULL,\n `c_phone` varchar(15) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n 
`c_acctbal` decimal(15,2) NOT NULL,\n `c_mktsegment` varchar(10) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n `c_comment` varchar(117) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n UNIQUE KEY `pk` (`c_custkey`) USING HASH,\n SHARD KEY `__SHARDKEY` (`c_custkey`),\n KEY `c_custkey` (`c_custkey`) USING CLUSTERED COLUMNSTORE\n);\n\nCREATE TABLE IF NOT EXISTS `nation` (\n `n_nationkey` int(11) NOT NULL,\n `n_name` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n `n_regionkey` int(11) NOT NULL,\n `n_comment` varchar(152) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n UNIQUE KEY `pk` (`n_nationkey`) USING HASH,\n SHARD KEY `__SHARDKEY` (`n_nationkey`),\n KEY `n_nationkey` (`n_nationkey`) USING CLUSTERED COLUMNSTORE\n);\n\nCREATE TABLE IF NOT EXISTS `orders` (\n `o_orderkey` bigint(11) NOT NULL,\n `o_custkey` int(11) NOT NULL,\n `o_orderstatus` char(1) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n `o_totalprice` decimal(15,2) NOT NULL,\n `o_orderdate` date NOT NULL,\n `o_orderpriority` varchar(15) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n `o_clerk` varchar(15) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n `o_shippriority` int(11) NOT NULL,\n `o_comment` varchar(79) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n UNIQUE KEY `pk` (`o_orderkey`) USING HASH,\n SHARD KEY `__SHARDKEY` (`o_orderkey`),\n KEY `o_orderkey` (`o_orderkey`) USING CLUSTERED COLUMNSTORE\n);\n\nCREATE TABLE IF NOT EXISTS `part` (\n `p_partkey` int(11) NOT NULL,\n `p_name` varchar(55) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n `p_mfgr` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n `p_brand` varchar(10) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n `p_type` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n `p_size` int(11) NOT NULL,\n `p_container` varchar(10) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n `p_retailprice` decimal(15,2) NOT NULL,\n `p_comment` varchar(23) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n UNIQUE KEY `pk` (`p_partkey`) 
USING HASH,\n SHARD KEY `__SHARDKEY` (`p_partkey`),\n KEY `p_partkey` (`p_partkey`) USING CLUSTERED COLUMNSTORE\n);\n\nCREATE TABLE IF NOT EXISTS `partsupp` (\n `ps_partkey` int(11) NOT NULL,\n `ps_suppkey` int(11) NOT NULL,\n `ps_availqty` int(11) NOT NULL,\n `ps_supplycost` decimal(15,2) NOT NULL,\n `ps_comment` varchar(199) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n UNIQUE KEY `pk` (`ps_partkey`,`ps_suppkey`) USING HASH,\n SHARD KEY `__SHARDKEY` (`ps_partkey`),\n KEY `ps_partkey` (`ps_partkey`,`ps_suppkey`) USING CLUSTERED COLUMNSTORE\n);\n\nCREATE TABLE IF NOT EXISTS `region` (\n `r_regionkey` int(11) NOT NULL,\n `r_name` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n `r_comment` varchar(152) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n UNIQUE KEY `pk` (`r_regionkey`) USING HASH,\n SHARD KEY `__SHARDKEY` (`r_regionkey`),\n KEY `r_regionkey` (`r_regionkey`) USING CLUSTERED COLUMNSTORE\n);\n\nCREATE TABLE IF NOT EXISTS `supplier` (\n `s_suppkey` int(11) NOT NULL,\n `s_name` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n `s_address` varchar(40) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n `s_nationkey` int(11) NOT NULL,\n `s_phone` varchar(15) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n `s_acctbal` decimal(15,2) NOT NULL,\n `s_comment` varchar(101) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n UNIQUE KEY `pk` (`s_suppkey`) USING HASH,\n SHARD KEY `__SHARDKEY` (`s_suppkey`),\n KEY `s_suppkey` (`s_suppkey`) USING CLUSTERED COLUMNSTORE\n);"},{"cell_type":"code","execution_count":null,"id":"6ed484a7-8ed8-479d-9ebe-5716749369bc","metadata":{"language":"sql"},"outputs":[],"source":"%%sql\nINSERT INTO s2_tpch_optimized.nation SELECT * FROM s2_tpch_unoptimized.nation;\nINSERT INTO s2_tpch_optimized.lineitem SELECT * FROM s2_tpch_unoptimized.lineitem;\nINSERT INTO s2_tpch_optimized.customer SELECT * FROM s2_tpch_unoptimized.customer;\nINSERT INTO s2_tpch_optimized.orders SELECT * FROM s2_tpch_unoptimized.orders;\nINSERT INTO 
s2_tpch_optimized.part SELECT * FROM s2_tpch_unoptimized.part;\nINSERT INTO s2_tpch_optimized.partsupp SELECT * FROM s2_tpch_unoptimized.partsupp;\nINSERT INTO s2_tpch_optimized.region SELECT * FROM s2_tpch_unoptimized.region;\nINSERT INTO s2_tpch_optimized.supplier SELECT * FROM s2_tpch_unoptimized.supplier;"},{"cell_type":"code","execution_count":null,"id":"9c79d8a5-c626-4f14-85e2-1f00afbceb8f","metadata":{"language":"sql"},"outputs":[],"source":"%%sql\n# TPC-H Query 1: Pricing Summary Report \nselect \n l_returnflag,\n l_linestatus,\n sum(l_quantity) as sum_qty,\n sum(l_extendedprice) as sum_base_price,\n sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,\n sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,\n avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price,\n avg(l_discount) as avg_disc,\n count(*) as count_order \nfrom lineitem\nwhere l_shipdate <= date('1998-12-01') - interval '90' day \ngroup by l_returnflag, l_linestatus \norder by l_returnflag, l_linestatus;"},{"cell_type":"code","execution_count":null,"id":"e9b21e9b-1869-4a1e-8998-a209c3dc6ffd","metadata":{"language":"sql"},"outputs":[],"source":"%%sql\n# TPC-H Query 4: Order Priority Checking \nselect\n o_orderpriority,\n count(*) as order_count\nfrom\n s2_tpch_optimized.orders\nwhere\n o_orderdate >= date('1993-07-01')\n and o_orderdate < date('1993-10-01')\n and exists (\n select *\n from s2_tpch_optimized.lineitem\n where l_orderkey = o_orderkey and l_commitdate < l_receiptdate\n )\ngroup by o_orderpriority\norder by o_orderpriority;"},{"cell_type":"code","execution_count":null,"id":"c8b63c2a-b1fb-4cd1-90bb-62e8b56922f5","metadata":{"language":"sql"},"outputs":[],"source":"%%sql\n-- TPC-H Query 21: Suppliers Who Kept Orders Waiting \n\nselect\n s_name,\n count(*) as numwait\nfrom\n s2_tpch_optimized.supplier,\n s2_tpch_optimized.lineitem l1,\n s2_tpch_optimized.orders,\n s2_tpch_optimized.nation\nwhere\n s_suppkey = l1.l_suppkey\n and o_orderkey = l1.l_orderkey\n 
and o_orderstatus = 'F'\n and l1.l_receiptdate > l1.l_commitdate\n and exists (\n select\n *\n from\n s2_tpch_optimized.lineitem l2\n where\n l2.l_orderkey = l1.l_orderkey\n and l2.l_suppkey <> l1.l_suppkey\n )\n and not exists (\n select\n *\n from\n s2_tpch_optimized.lineitem l3\n where\n l3.l_orderkey = l1.l_orderkey\n and l3.l_suppkey <> l1.l_suppkey\n and l3.l_receiptdate > l3.l_commitdate\n )\n and s_nationkey = n_nationkey\n and n_name = 'EGYPT'\ngroup by\n s_name\norder by\n numwait desc,\n s_name\nlimit 100;"},{"attachments":null,"cell_type":"markdown","id":"54a6902d-bc3b-455d-8f47-3450d5928de8","metadata":{"language":"sql"},"source":"### Finally, let's do a side by side comparison between the optimized and unoptimized database"},{"cell_type":"code","execution_count":null,"id":"2f23e742-d9fe-467a-a299-a09a20d2af1e","metadata":{"language":"python"},"outputs":[],"source":"from singlestoredb import create_engine\nimport sqlalchemy as sa\nconnection_url_unoptimized = \"singlestoredb://\"+connection_user+\":\"+connection_password+\"@\"+connection_host+\":\"+connection_port+\"/s2_tpch_unoptimized?ssl_cipher=HIGH\"\ndb_connection_unoptimized = create_engine(connection_url_unoptimized).connect()\nconnection_url_optimized = \"singlestoredb://\"+connection_user+\":\"+connection_password+\"@\"+connection_host+\":\"+connection_port+\"/s2_tpch_optimized?ssl_cipher=HIGH\"\ndb_connection_optimized = create_engine(connection_url_optimized).connect()"},{"attachments":null,"cell_type":"markdown","id":"7f84ed59-9d2c-4e3e-92eb-63603645d953","metadata":{"language":"sql"},"source":"Here are a few queries that you can test side by side against. 
Overall you will notice an average of 4x improvement in performance"},{"cell_type":"code","execution_count":null,"id":"f3d86d62-ffe2-40ad-b59a-7c77cae98ca0","metadata":{"language":"python"},"outputs":[],"source":"sql_query4 = sa.text('''\nselect\n o_orderpriority,\n count(*) as order_count\nfrom\n s2_tpch_unoptimized.orders\nwhere\n o_orderdate >= date('1993-07-01')\n and o_orderdate < date('1993-10-01')\n and exists (\n select *\n from s2_tpch_unoptimized.lineitem\n where l_orderkey = o_orderkey and l_commitdate < l_receiptdate\n )\ngroup by o_orderpriority\norder by o_orderpriority;\n''')"},{"cell_type":"code","execution_count":null,"id":"7d947658-6dec-4efa-97f6-863977b2003b","metadata":{"language":"python"},"outputs":[],"source":"sql_query21 = sa.text('''\n select\n s_name,\n count(*) as numwait\nfrom\n supplier,\n lineitem l1,\n orders,\n nation\nwhere\n s_suppkey = l1.l_suppkey\n and o_orderkey = l1.l_orderkey\n and o_orderstatus = 'F'\n and l1.l_receiptdate > l1.l_commitdate\n and exists (\n select\n *\n from\n lineitem l2\n where\n l2.l_orderkey = l1.l_orderkey\n and l2.l_suppkey <> l1.l_suppkey\n )\n and not exists (\n select\n *\n from\n lineitem l3\n where\n l3.l_orderkey = l1.l_orderkey\n and l3.l_suppkey <> l1.l_suppkey\n and l3.l_receiptdate > l3.l_commitdate\n )\n and s_nationkey = n_nationkey\n and n_name = 'EGYPT'\ngroup by\n s_name\norder by\n numwait desc,\n s_name\nlimit 100;\n''')"},{"cell_type":"code","execution_count":null,"id":"6e4da6ac-52ff-4506-8826-3a46bf350656","metadata":{"language":"python"},"outputs":[],"source":"result = db_connection_optimized.execute(sql_query21)"},{"cell_type":"code","execution_count":null,"id":"b699bc42-e9bd-4b8f-81c0-48311b7fd14e","metadata":{"language":"python"},"outputs":[],"source":"import time\nimport pandas as pd\nimport plotly.graph_objs as go\nnum_iterations = 10\nopt_times = []\n\nfor i in range(num_iterations):\n opt_start_time = time.time()\n opt_result = db_connection_optimized.execute(sql_query21)\n 
opt_stop_time = time.time()\n opt_times.append(opt_stop_time - opt_start_time)\n\nunopt_times = []\nfor i in range(num_iterations):\n unopt_start_time = time.time()\n unopt_result = db_connection_unoptimized.execute(sql_query21)\n unopt_stop_time = time.time()\n unopt_times.append(unopt_stop_time - unopt_start_time)\n\nx_axis = list(range(1, num_iterations + 1))\ndata = {\n 'iteration': x_axis,\n 'opt_times': opt_times,\n 'unopt_times': unopt_times,\n}\ndf = pd.DataFrame.from_dict(data)\n\nfig = go.Figure()\n\n# Adding optimized times to the plot\nfig.add_trace(go.Scatter(x=df['iteration'], y=df['opt_times'], mode='lines+markers', name='Optimized Database'))\n\n# Adding unoptimized times to the plot\nfig.add_trace(go.Scatter(x=df['iteration'], y=df['unopt_times'], mode='lines+markers', name='Unoptimized Database'))\n\n# Update y-axis and x-axis properties\nfig.update_layout(\n title=\"Execution Time Comparison\",\n xaxis_title=\"Iteration\",\n yaxis_title=\"Time in Seconds\",\n xaxis=dict(tickmode='array', tickvals=list(range(1, num_iterations + 1)))\n)\n\n# Show the plot\nfig.show()"},{"attachments":null,"cell_type":"markdown","id":"4708ffe5-ea88-48a8-a4ac-9eaa5f801f79","metadata":{"language":"sql"},"source":"
\n
\n"}],"metadata":{"jupyterlab":{"notebooks":{"version_major":6,"version_minor":4}},"kernelspec":{"display_name":"Python 3 (ipykernel)","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimeType":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.11.4"},"singlestore_cell_default_language":"sql","singlestore_connection":{"connectionID":"a98658e6-5e5d-4be9-be2e-9fa993172504","defaultDatabase":"s2_tpch_optimized"}},"nbformat":4,"nbformat_minor":5} \ No newline at end of file +{ + "cells": [ + { + "cell_type": "markdown", + "id": "8e67bcbe-6ace-4ca9-b28c-927b4b5a85b2", + "metadata": {}, + "source": [ + "
\n", + "
\n", + " \n", + "
\n", + "
\n", + "
SingleStore Notebooks
\n", + "

Learn how to Optimize Performance with TPCH 100

\n", + "
\n", + "
" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "5d93af8b-eb1d-4207-a060-1a45c46d8b41", + "metadata": {}, + "source": [ + "### Context\n", + "\n", + "This notebook will help you with four core key principles for getting performance out of SingleStoreDB using TPCH Benchmark. SingleStoreDB is a distributed database, so you should think of using shard keys, database partitions, primary keys and indexes for getting the best performance out of it.\n", + "\n", + "##### About database partitions\n", + "The generalized recommendation for most clusters is to have 4 CPU cores per database partition on each leaf. This means if you had a cluster with 16 cores on each of 4 leaves (64 CPU cores total across all leaf hosts), you would want to have 4 partitions on each leaf (16 partitions throughout the cluster). If you are using a S00 workspace, you will have 2 partitions per database. ***Note*** *that increasing partitions will have additional memory and caching overheads, which can be expensive if you have thousands of tables*\n", + "\n", + "##### About shard keys\n", + "Data is distributed across the SingleStoreDB Cloud workspace into a number of partitions on the leaf nodes. The shard key is a collection of the columns in a table that are used to control how the rows of that table are distributed. To determine the partition responsible for a given row, SingleStoreDB Cloud computes a hash from all the columns in the shard key to the partition ID. Therefore, rows with the same shard key will reside on the same partition.\n", + "\n", + "##### About hash indexes\n", + "They are highly efficient for exact-match lookups (point-reads). Because hash indexes store rows in a sparse array of buckets indexed through a hash function on the relevant columns, queries can quickly retrieve data by examining only the corresponding bucket rather than searching the entire dataset. 
This enables a significant reduction in lookup time and hence increased performance for such query types.\n",
+    "\n",
+    "**For this tutorial, we recommend using a workspace of size S4 to ingest data faster and also see the difference and gain you can get from a distributed architecture.**"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "67f041ef-5605-43ef-8ca0-5db3194b4cad",
+   "metadata": {},
+   "source": [
+    "
\n", + " \n", + "
\n", + "

Note

\n", + "

For that tutorial, we recommend using workspace of size S4 to ingest data faster and also see the difference and gain you can get from a distributed architecture.

\n", + "
\n", + "
" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "6052728d-7828-4fb2-bb53-b960a7ad43af", + "metadata": {}, + "source": [ + "### Let's first create the unoptimized database" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "7301a602-48cf-4f3b-9cbc-2e7184d97ae0", + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "create database if not exists s2_tpch_unoptimized\n", + "\n", + "# To create a database with custom partitions use the following syntax: CREATE DATABASE YourDatabaseName PARTITIONS=X;\n", + "# You cannot change after creation the number of partitions" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "94c8bb6f-658d-4434-9074-4847f1c7d721", + "metadata": {}, + "source": [ + "If using a S00, the database will have 2 partitions, if using S1, it will have 8 partitions" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "b36585b9-4d52-4301-ac68-60fa49425751", + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "SELECT num_partitions FROM information_schema.DISTRIBUTED_DATABASES WHERE database_name = 's2_tpch_unoptimized';" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "b576e31c-6a67-4126-86ab-480fd96805d3", + "metadata": {}, + "source": [ + "##### Let's create all the tables in that database with no index, shard key or primary key" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "4587c575-9b5a-4535-bebe-70779064e9dc", + "metadata": {}, + "source": [ + "
\n", + " \n", + "
\n", + "

Action Required

\n", + "

Make sure to select the s2_tpch_unoptimized database from the drop-down menu at the top of this notebook.\n", + " It updates the connection_url to connect to that database.

\n", + "
\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "afde1362-2d38-4732-94ed-6d4ed05a6806", + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "CREATE TABLE IF NOT EXISTS `customer` (\n", + " `c_custkey` int(11) NOT NULL,\n", + " `c_name` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n", + " `c_address` varchar(40) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n", + " `c_nationkey` int(11) NOT NULL,\n", + " `c_phone` varchar(15) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n", + " `c_acctbal` decimal(15,2) NOT NULL,\n", + " `c_mktsegment` varchar(10) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n", + " `c_comment` varchar(117) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL\n", + ");\n", + "\n", + "CREATE TABLE IF NOT EXISTS `lineitem` (\n", + " `l_orderkey` bigint(11) NOT NULL,\n", + " `l_partkey` int(11) NOT NULL,\n", + " `l_suppkey` int(11) NOT NULL,\n", + " `l_linenumber` int(11) NOT NULL,\n", + " `l_quantity` decimal(15,2) NOT NULL,\n", + " `l_extendedprice` decimal(15,2) NOT NULL,\n", + " `l_discount` decimal(15,2) NOT NULL,\n", + " `l_tax` decimal(15,2) NOT NULL,\n", + " `l_returnflag` char(1) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n", + " `l_linestatus` char(1) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n", + " `l_shipdate` date NOT NULL,\n", + " `l_commitdate` date NOT NULL,\n", + " `l_receiptdate` date NOT NULL,\n", + " `l_shipinstruct` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n", + " `l_shipmode` varchar(10) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n", + " `l_comment` varchar(44) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL\n", + ");\n", + "\n", + "CREATE TABLE IF NOT EXISTS `nation` (\n", + " `n_nationkey` int(11) NOT NULL,\n", + " `n_name` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n", + " `n_regionkey` int(11) NOT NULL,\n", + " `n_comment` varchar(152) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL\n", + ");\n", + "\n", + "CREATE TABLE IF NOT EXISTS `orders` (\n", + 
" `o_orderkey` bigint(11) NOT NULL,\n", + " `o_custkey` int(11) NOT NULL,\n", + " `o_orderstatus` char(1) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n", + " `o_totalprice` decimal(15,2) NOT NULL,\n", + " `o_orderdate` date NOT NULL,\n", + " `o_orderpriority` varchar(15) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n", + " `o_clerk` varchar(15) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n", + " `o_shippriority` int(11) NOT NULL,\n", + " `o_comment` varchar(79) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL\n", + ");\n", + "\n", + "CREATE TABLE IF NOT EXISTS `part` (\n", + " `p_partkey` int(11) NOT NULL,\n", + " `p_name` varchar(55) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n", + " `p_mfgr` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n", + " `p_brand` varchar(10) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n", + " `p_type` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n", + " `p_size` int(11) NOT NULL,\n", + " `p_container` varchar(10) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n", + " `p_retailprice` decimal(15,2) NOT NULL,\n", + " `p_comment` varchar(23) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL\n", + ");\n", + "\n", + "CREATE TABLE IF NOT EXISTS `partsupp` (\n", + " `ps_partkey` int(11) NOT NULL,\n", + " `ps_suppkey` int(11) NOT NULL,\n", + " `ps_availqty` int(11) NOT NULL,\n", + " `ps_supplycost` decimal(15,2) NOT NULL,\n", + " `ps_comment` varchar(199) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL\n", + ");\n", + "\n", + "CREATE TABLE IF NOT EXISTS `region` (\n", + " `r_regionkey` int(11) NOT NULL,\n", + " `r_name` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n", + " `r_comment` varchar(152) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL\n", + ");\n", + "\n", + "CREATE TABLE IF NOT EXISTS `supplier` (\n", + " `s_suppkey` int(11) NOT NULL,\n", + " `s_name` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n", + " `s_address` varchar(40) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n", + " `s_nationkey` 
int(11) NOT NULL,\n", + " `s_phone` varchar(15) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n", + " `s_acctbal` decimal(15,2) NOT NULL,\n", + " `s_comment` varchar(101) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL\n", + ");" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "09711e8c-fb01-4e10-862b-ee5350be6076", + "metadata": {}, + "source": [ + "### Now let's create the pipelines and run them to ingest data" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "4e8ca124-ac4b-49de-b0ee-d9441d43bedd", + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "CREATE PIPELINE IF NOT EXISTS `customer_pipeline`\n", + "AS LOAD DATA S3 'memsql-tpch-dataset/sf_100/customer'\n", + "CONFIG '{\\\"region\\\":\\\"us-east-1\\\", \\\"disable_gunzip\\\": false}'\n", + "BATCH_INTERVAL 2500\n", + "DISABLE OUT_OF_ORDER OPTIMIZATION\n", + "DISABLE OFFSETS METADATA GC\n", + "SKIP DUPLICATE KEY ERRORS\n", + "INTO TABLE `customer`\n", + "FIELDS TERMINATED BY '|' ENCLOSED BY '' ESCAPED BY '\\\\'\n", + "LINES TERMINATED BY '|\\n' STARTING BY '';" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "ce739af4-6839-4751-8a81-019fb26cad72", + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "CREATE PIPELINE IF NOT EXISTS `lineitem_pipeline`\n", + "AS LOAD DATA S3 'memsql-tpch-dataset/sf_100/lineitem/lineitem.'\n", + "CONFIG '{\\\"region\\\":\\\"us-east-1\\\", \\\"disable_gunzip\\\": false}'\n", + "BATCH_INTERVAL 2500\n", + "DISABLE OUT_OF_ORDER OPTIMIZATION\n", + "DISABLE OFFSETS METADATA GC\n", + "SKIP DUPLICATE KEY ERRORS\n", + "INTO TABLE `lineitem`\n", + "FIELDS TERMINATED BY '|' ENCLOSED BY '' ESCAPED BY '\\\\'\n", + "LINES TERMINATED BY '|\\n' STARTING BY '';" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "bfdd5bbc-702a-4f77-b771-cd38189040e0", + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "CREATE PIPELINE IF NOT EXISTS `nation_pipeline`\n", + "AS LOAD DATA S3 
'memsql-tpch-dataset/sf_100/nation'\n", + "CONFIG '{\\\"region\\\":\\\"us-east-1\\\", \\\"disable_gunzip\\\": false}'\n", + "BATCH_INTERVAL 2500\n", + "DISABLE OUT_OF_ORDER OPTIMIZATION\n", + "DISABLE OFFSETS METADATA GC\n", + "SKIP DUPLICATE KEY ERRORS\n", + "INTO TABLE `nation`\n", + "FIELDS TERMINATED BY '|' ENCLOSED BY '' ESCAPED BY '\\\\'\n", + "LINES TERMINATED BY '|\\n' STARTING BY '';" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "1b040083-f864-4e64-9bd2-00b1ff7d1e2b", + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "CREATE PIPELINE IF NOT EXISTS `orders_pipeline`\n", + "AS LOAD DATA S3 'memsql-tpch-dataset/sf_100/orders'\n", + "CONFIG '{\\\"region\\\":\\\"us-east-1\\\", \\\"disable_gunzip\\\": false}'\n", + "BATCH_INTERVAL 2500\n", + "DISABLE OUT_OF_ORDER OPTIMIZATION\n", + "DISABLE OFFSETS METADATA GC\n", + "SKIP DUPLICATE KEY ERRORS\n", + "INTO TABLE `orders`\n", + "FIELDS TERMINATED BY '|' ENCLOSED BY '' ESCAPED BY '\\\\'\n", + "LINES TERMINATED BY '|\\n' STARTING BY '';" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "b5e06dfe-f679-4fe2-bd47-802a0b127270", + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "CREATE PIPELINE IF NOT EXISTS `partsupp_pipeline`\n", + "AS LOAD DATA S3 'memsql-tpch-dataset/sf_100/partsupp'\n", + "CONFIG '{\\\"region\\\":\\\"us-east-1\\\", \\\"disable_gunzip\\\": false}'\n", + "BATCH_INTERVAL 2500\n", + "DISABLE OUT_OF_ORDER OPTIMIZATION\n", + "DISABLE OFFSETS METADATA GC\n", + "SKIP DUPLICATE KEY ERRORS\n", + "INTO TABLE `partsupp`\n", + "FIELDS TERMINATED BY '|' ENCLOSED BY '' ESCAPED BY '\\\\'\n", + "LINES TERMINATED BY '|\\n' STARTING BY '';" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "8b2726ea-9d0c-4809-bf35-b851821cf336", + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "CREATE PIPELINE IF NOT EXISTS `part_pipeline`\n", + "AS LOAD DATA S3 'memsql-tpch-dataset/sf_100/part'\n", + "CONFIG 
'{\\\"region\\\":\\\"us-east-1\\\", \\\"disable_gunzip\\\": false}'\n", + "BATCH_INTERVAL 2500\n", + "DISABLE OUT_OF_ORDER OPTIMIZATION\n", + "DISABLE OFFSETS METADATA GC\n", + "SKIP DUPLICATE KEY ERRORS\n", + "INTO TABLE `part`\n", + "FIELDS TERMINATED BY '|' ENCLOSED BY '' ESCAPED BY '\\\\'\n", + "LINES TERMINATED BY '|\\n' STARTING BY '';" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "06114a70-d7f9-4c1f-b6a6-4554b33bb5c6", + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "CREATE PIPELINE IF NOT EXISTS `region_pipeline`\n", + "AS LOAD DATA S3 'memsql-tpch-dataset/sf_100/region'\n", + "CONFIG '{\\\"region\\\":\\\"us-east-1\\\", \\\"disable_gunzip\\\": false}'\n", + "BATCH_INTERVAL 2500\n", + "DISABLE OUT_OF_ORDER OPTIMIZATION\n", + "DISABLE OFFSETS METADATA GC\n", + "SKIP DUPLICATE KEY ERRORS\n", + "INTO TABLE `region`\n", + "FIELDS TERMINATED BY '|' ENCLOSED BY '' ESCAPED BY '\\\\'\n", + "LINES TERMINATED BY '|\\n' STARTING BY '';" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "bfc2e19d-f32c-4b9e-8fa7-1e68711f834e", + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "CREATE PIPELINE IF NOT EXISTS `supplier_pipeline`\n", + "AS LOAD DATA S3 'memsql-tpch-dataset/sf_100/supplier'\n", + "CONFIG '{\\\"region\\\":\\\"us-east-1\\\", \\\"disable_gunzip\\\": false}'\n", + "BATCH_INTERVAL 2500\n", + "DISABLE OUT_OF_ORDER OPTIMIZATION\n", + "DISABLE OFFSETS METADATA GC\n", + "SKIP DUPLICATE KEY ERRORS\n", + "INTO TABLE `supplier`\n", + "FIELDS TERMINATED BY '|' ENCLOSED BY '' ESCAPED BY '\\\\'\n", + "LINES TERMINATED BY '|\\n' STARTING BY '';" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "e19cb045-bebc-4aa6-92d7-ae06778d8af8", + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "START PIPELINE customer_pipeline;\n", + "START PIPELINE lineitem_pipeline;\n", + "START PIPELINE nation_pipeline;\n", + "START PIPELINE orders_pipeline;\n", + "START PIPELINE 
partsupp_pipeline;\n", + "START PIPELINE part_pipeline;\n", + "START PIPELINE region_pipeline;\n", + "START PIPELINE supplier_pipeline;" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "3eacdd09-9b27-4995-a3df-9514cd733a57", + "metadata": {}, + "source": [ + "#### [Optional Step] Check data ingestion in real-time with Perspective" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "e5c3fa1a-af9b-4fdb-98bc-9c73a1fa7f33", + "metadata": {}, + "outputs": [], + "source": [ + "pip install perspective-python --quiet" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "b61f205f-5d1d-4af2-8369-e31057c76f66", + "metadata": {}, + "outputs": [], + "source": [ + "import perspective\n", + "import threading\n", + "import random\n", + "import time\n", + "from datetime import datetime, date\n", + "from perspective import Table, PerspectiveWidget\n", + "import warnings\n", + "warnings.filterwarnings('ignore')" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "ac376696-0b7e-4182-bc04-ddef400b7fca", + "metadata": {}, + "outputs": [], + "source": [ + "def loop():\n", + " while mode != 'stop':\n", + " while mode == 'run':\n", + " table.update(data_source())\n", + " time.sleep(1)" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "5e2ce253-576d-49c9-a92b-f165ffcc4ae7", + "metadata": {}, + "outputs": [], + "source": [ + "def data_source():\n", + " result = %sql select sum(rows_streamed) as rows_streamed from information_schema.pipelines_batches_summary where database_name = 's2_tpch_unoptimized';\n", + " result2 = list(result.dicts())\n", + " return result2\n", + "\n", + "SCHEMA = {\n", + " \"rows_streamed\": int\n", + "}" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "d388a00d-23ec-45d2-a94c-5e747da707c0", + "metadata": {}, + "outputs": [], + "source": [ + "mode = 'run'\n", + "table = perspective.Table(SCHEMA, limit=100)\n", + "threading.Thread(target=loop).start()" + ] 
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 18,
+   "id": "9e88e523-63b1-48e3-bde5-a2840490199c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "perspective.PerspectiveWidget(table, title=\"Track Row Ingestion\", plugin=\"Y Line\", columns=[\"rows_streamed\"])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 19,
+   "id": "debbed63-11dc-43dc-9f12-a0e80a5b7703",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "mode = 'stop'"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "442281db-7fd2-4bba-9104-84f3b0537a9a",
+   "metadata": {},
+   "source": [
+    "### Now, let's see the performance of a few queries"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 20,
+   "id": "97e9050b-3b8e-40e3-b871-2b0bb73eb5ae",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%sql\n",
+    "# TPC-H Query 1: Pricing Summary Report\n",
+    "select\n",
+    "    l_returnflag,\n",
+    "    l_linestatus,\n",
+    "    sum(l_quantity) as sum_qty,\n",
+    "    sum(l_extendedprice) as sum_base_price,\n",
+    "    sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,\n",
+    "    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,\n",
+    "    avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price,\n",
+    "    avg(l_discount) as avg_disc,\n",
+    "    count(*) as count_order\n",
+    "from s2_tpch_unoptimized.lineitem\n",
+    "where l_shipdate <= date('1998-12-01') - interval '90' day\n",
+    "group by l_returnflag, l_linestatus\n",
+    "order by l_returnflag, l_linestatus;"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 21,
+   "id": "71669c07-d06e-41f4-bd63-5046add22afb",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%sql\n",
+    "# TPC-H Query 4: Order Priority Checking\n",
+    "select\n",
+    "    o_orderpriority,\n",
+    "    count(*) as order_count\n",
+    "from\n",
+    "    s2_tpch_unoptimized.orders\n",
+    "where\n",
+    "    o_orderdate >= date('1993-07-01')\n",
+    "    and o_orderdate < date('1993-10-01')\n",
+    "    and exists (\n",
+    "        select *\n",
+    "        from s2_tpch_unoptimized.lineitem\n",
+    "        
where l_orderkey = o_orderkey and l_commitdate < l_receiptdate\n", + " )\n", + "group by o_orderpriority\n", + "order by o_orderpriority;" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "654f872b-4a35-4897-86d2-c51b548919b8", + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "-- TPC-H Query 21: Suppliers Who Kept Orders Waiting\n", + "\n", + "select\n", + " s_name,\n", + " count(*) as numwait\n", + "from\n", + " s2_tpch_unoptimized.supplier,\n", + " s2_tpch_unoptimized.lineitem l1,\n", + " s2_tpch_unoptimized.orders,\n", + " s2_tpch_unoptimized.nation\n", + "where\n", + " s_suppkey = l1.l_suppkey\n", + " and o_orderkey = l1.l_orderkey\n", + " and o_orderstatus = 'F'\n", + " and l1.l_receiptdate > l1.l_commitdate\n", + " and exists (\n", + " select\n", + " *\n", + " from\n", + " s2_tpch_unoptimized.lineitem l2\n", + " where\n", + " l2.l_orderkey = l1.l_orderkey\n", + " and l2.l_suppkey <> l1.l_suppkey\n", + " )\n", + " and not exists (\n", + " select\n", + " *\n", + " from\n", + " s2_tpch_unoptimized.lineitem l3\n", + " where\n", + " l3.l_orderkey = l1.l_orderkey\n", + " and l3.l_suppkey <> l1.l_suppkey\n", + " and l3.l_receiptdate > l3.l_commitdate\n", + " )\n", + " and s_nationkey = n_nationkey\n", + " and n_name = 'EGYPT'\n", + "group by\n", + " s_name\n", + "order by\n", + " numwait desc,\n", + " s_name\n", + "limit 100;" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "0ad2d768-cb53-4e0b-8353-6178c8f9508c", + "metadata": {}, + "source": [ + "### Now, let's first focus on optimizing the performance" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "id": "d1844926-ae73-4fec-b5ba-802c8173e846", + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "create database if not exists s2_tpch_optimized" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "a163696a-0507-4b05-9146-6cbfa1ba1e29", + "metadata": {}, + "source": [ + "
\n", + " \n", + "
\n", + "

Action Required

\n", + "

Make sure to select the s2_tpch_optimized database from the drop-down menu at the top of this notebook.\n", + " It updates the connection_url to connect to that database.

\n", + "
\n", + "
" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "bafd1114-f88c-409c-8281-b32ac27f1222", + "metadata": {}, + "source": [ + "##### Now, let's create each table with optimized data structure:\n", + "* We create a unique key through primary key. For example **lineitem** table needs both the orderkey and linenumber to identify rows by uniqueness\n", + "* We create a shard key which will distribute data in an efficient way to perform fast join and filtering. For **lineitem** table since we perform joins and calculation based on the orderkey we create a shardkey with orderkey" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "ff6aea3d-a965-475d-becd-3910965f5c8f", + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "CREATE TABLE IF NOT EXISTS `lineitem` (\n", + " `l_orderkey` bigint(11) NOT NULL,\n", + " `l_partkey` int(11) NOT NULL,\n", + " `l_suppkey` int(11) NOT NULL,\n", + " `l_linenumber` int(11) NOT NULL,\n", + " `l_quantity` decimal(15,2) NOT NULL,\n", + " `l_extendedprice` decimal(15,2) NOT NULL,\n", + " `l_discount` decimal(15,2) NOT NULL,\n", + " `l_tax` decimal(15,2) NOT NULL,\n", + " `l_returnflag` char(1) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n", + " `l_linestatus` char(1) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n", + " `l_shipdate` date NOT NULL,\n", + " `l_commitdate` date NOT NULL,\n", + " `l_receiptdate` date NOT NULL,\n", + " `l_shipinstruct` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n", + " `l_shipmode` varchar(10) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n", + " `l_comment` varchar(44) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n", + " UNIQUE KEY `pk` (`l_orderkey`,`l_linenumber`) USING HASH,\n", + " SHARD KEY `__SHARDKEY` (`l_orderkey`),\n", + " KEY `l_orderkey` (`l_orderkey`) USING CLUSTERED COLUMNSTORE\n", + ");\n", + "\n", + "CREATE TABLE IF NOT EXISTS `customer` (\n", + " `c_custkey` int(11) NOT NULL,\n", + " `c_name` varchar(25) CHARACTER SET utf8 COLLATE 
utf8_bin NOT NULL,\n", + " `c_address` varchar(40) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n", + " `c_nationkey` int(11) NOT NULL,\n", + " `c_phone` varchar(15) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n", + " `c_acctbal` decimal(15,2) NOT NULL,\n", + " `c_mktsegment` varchar(10) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n", + " `c_comment` varchar(117) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n", + " UNIQUE KEY `pk` (`c_custkey`) USING HASH,\n", + " SHARD KEY `__SHARDKEY` (`c_custkey`),\n", + " KEY `c_custkey` (`c_custkey`) USING CLUSTERED COLUMNSTORE\n", + ");\n", + "\n", + "CREATE TABLE IF NOT EXISTS `nation` (\n", + " `n_nationkey` int(11) NOT NULL,\n", + " `n_name` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n", + " `n_regionkey` int(11) NOT NULL,\n", + " `n_comment` varchar(152) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n", + " UNIQUE KEY `pk` (`n_nationkey`) USING HASH,\n", + " SHARD KEY `__SHARDKEY` (`n_nationkey`),\n", + " KEY `n_nationkey` (`n_nationkey`) USING CLUSTERED COLUMNSTORE\n", + ");\n", + "\n", + "CREATE TABLE IF NOT EXISTS `orders` (\n", + " `o_orderkey` bigint(11) NOT NULL,\n", + " `o_custkey` int(11) NOT NULL,\n", + " `o_orderstatus` char(1) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n", + " `o_totalprice` decimal(15,2) NOT NULL,\n", + " `o_orderdate` date NOT NULL,\n", + " `o_orderpriority` varchar(15) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n", + " `o_clerk` varchar(15) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n", + " `o_shippriority` int(11) NOT NULL,\n", + " `o_comment` varchar(79) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n", + " UNIQUE KEY `pk` (`o_orderkey`) USING HASH,\n", + " SHARD KEY `__SHARDKEY` (`o_orderkey`),\n", + " KEY `o_orderkey` (`o_orderkey`) USING CLUSTERED COLUMNSTORE\n", + ");\n", + "\n", + "CREATE TABLE IF NOT EXISTS `part` (\n", + " `p_partkey` int(11) NOT NULL,\n", + " `p_name` varchar(55) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n", + " `p_mfgr` 
varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n", + " `p_brand` varchar(10) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n", + " `p_type` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n", + " `p_size` int(11) NOT NULL,\n", + " `p_container` varchar(10) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n", + " `p_retailprice` decimal(15,2) NOT NULL,\n", + " `p_comment` varchar(23) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n", + " UNIQUE KEY `pk` (`p_partkey`) USING HASH,\n", + " SHARD KEY `__SHARDKEY` (`p_partkey`),\n", + " KEY `p_partkey` (`p_partkey`) USING CLUSTERED COLUMNSTORE\n", + ");\n", + "\n", + "CREATE TABLE IF NOT EXISTS `partsupp` (\n", + " `ps_partkey` int(11) NOT NULL,\n", + " `ps_suppkey` int(11) NOT NULL,\n", + " `ps_availqty` int(11) NOT NULL,\n", + " `ps_supplycost` decimal(15,2) NOT NULL,\n", + " `ps_comment` varchar(199) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n", + " UNIQUE KEY `pk` (`ps_partkey`,`ps_suppkey`) USING HASH,\n", + " SHARD KEY `__SHARDKEY` (`ps_partkey`),\n", + " KEY `ps_partkey` (`ps_partkey`,`ps_suppkey`) USING CLUSTERED COLUMNSTORE\n", + ");\n", + "\n", + "CREATE TABLE IF NOT EXISTS `region` (\n", + " `r_regionkey` int(11) NOT NULL,\n", + " `r_name` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n", + " `r_comment` varchar(152) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n", + " UNIQUE KEY `pk` (`r_regionkey`) USING HASH,\n", + " SHARD KEY `__SHARDKEY` (`r_regionkey`),\n", + " KEY `r_regionkey` (`r_regionkey`) USING CLUSTERED COLUMNSTORE\n", + ");\n", + "\n", + "CREATE TABLE IF NOT EXISTS `supplier` (\n", + " `s_suppkey` int(11) NOT NULL,\n", + " `s_name` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n", + " `s_address` varchar(40) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n", + " `s_nationkey` int(11) NOT NULL,\n", + " `s_phone` varchar(15) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n", + " `s_acctbal` decimal(15,2) NOT NULL,\n", + " `s_comment` varchar(101) 
CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,\n", + " UNIQUE KEY `pk` (`s_suppkey`) USING HASH,\n", + " SHARD KEY `__SHARDKEY` (`s_suppkey`),\n", + " KEY `s_suppkey` (`s_suppkey`) USING CLUSTERED COLUMNSTORE\n", + ");" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "id": "6ed484a7-8ed8-479d-9ebe-5716749369bc", + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "INSERT INTO s2_tpch_optimized.nation SELECT * FROM s2_tpch_unoptimized.nation;\n", + "INSERT INTO s2_tpch_optimized.lineitem SELECT * FROM s2_tpch_unoptimized.lineitem;\n", + "INSERT INTO s2_tpch_optimized.customer SELECT * FROM s2_tpch_unoptimized.customer;\n", + "INSERT INTO s2_tpch_optimized.orders SELECT * FROM s2_tpch_unoptimized.orders;\n", + "INSERT INTO s2_tpch_optimized.part SELECT * FROM s2_tpch_unoptimized.part;\n", + "INSERT INTO s2_tpch_optimized.partsupp SELECT * FROM s2_tpch_unoptimized.partsupp;\n", + "INSERT INTO s2_tpch_optimized.region SELECT * FROM s2_tpch_unoptimized.region;\n", + "INSERT INTO s2_tpch_optimized.supplier SELECT * FROM s2_tpch_unoptimized.supplier;" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "id": "9c79d8a5-c626-4f14-85e2-1f00afbceb8f", + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "# TPC-H Query 1: Pricing Summary Report\n", + "select\n", + " l_returnflag,\n", + " l_linestatus,\n", + " sum(l_quantity) as sum_qty,\n", + " sum(l_extendedprice) as sum_base_price,\n", + " sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,\n", + " sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,\n", + " avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price,\n", + " avg(l_discount) as avg_disc,\n", + " count(*) as count_order\n", + "from lineitem\n", + "where l_shipdate <= date('1998-12-01') - interval '90' day\n", + "group by l_returnflag, l_linestatus\n", + "order by l_returnflag, l_linestatus;" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "id": 
"e9b21e9b-1869-4a1e-8998-a209c3dc6ffd", + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "# TPC-H Query 4: Order Priority Checking\n", + "select\n", + " o_orderpriority,\n", + " count(*) as order_count\n", + "from\n", + " s2_tpch_optimized.orders\n", + "where\n", + " o_orderdate >= date('1993-07-01')\n", + " and o_orderdate < date('1993-10-01')\n", + " and exists (\n", + " select *\n", + " from s2_tpch_optimized.lineitem\n", + " where l_orderkey = o_orderkey and l_commitdate < l_receiptdate\n", + " )\n", + "group by o_orderpriority\n", + "order by o_orderpriority;" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "id": "c8b63c2a-b1fb-4cd1-90bb-62e8b56922f5", + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "-- TPC-H Query 21: Suppliers Who Kept Orders Waiting\n", + "\n", + "select\n", + " s_name,\n", + " count(*) as numwait\n", + "from\n", + " s2_tpch_optimized.supplier,\n", + " s2_tpch_optimized.lineitem l1,\n", + " s2_tpch_optimized.orders,\n", + " s2_tpch_optimized.nation\n", + "where\n", + " s_suppkey = l1.l_suppkey\n", + " and o_orderkey = l1.l_orderkey\n", + " and o_orderstatus = 'F'\n", + " and l1.l_receiptdate > l1.l_commitdate\n", + " and exists (\n", + " select\n", + " *\n", + " from\n", + " s2_tpch_optimized.lineitem l2\n", + " where\n", + " l2.l_orderkey = l1.l_orderkey\n", + " and l2.l_suppkey <> l1.l_suppkey\n", + " )\n", + " and not exists (\n", + " select\n", + " *\n", + " from\n", + " s2_tpch_optimized.lineitem l3\n", + " where\n", + " l3.l_orderkey = l1.l_orderkey\n", + " and l3.l_suppkey <> l1.l_suppkey\n", + " and l3.l_receiptdate > l3.l_commitdate\n", + " )\n", + " and s_nationkey = n_nationkey\n", + " and n_name = 'EGYPT'\n", + "group by\n", + " s_name\n", + "order by\n", + " numwait desc,\n", + " s_name\n", + "limit 100;" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "54a6902d-bc3b-455d-8f47-3450d5928de8", + "metadata": {}, + "source": [ + "### Finally, let's do a side by side 
comparison between the optimized and unoptimized database"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 29,
+ "id": "2f23e742-d9fe-467a-a299-a09a20d2af1e",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from singlestoredb import create_engine\n",
+ "import sqlalchemy as sa\n",
+ "connection_url_unoptimized = \"singlestoredb://\"+connection_user+\":\"+connection_password+\"@\"+connection_host+\":\"+connection_port+\"/s2_tpch_unoptimized?ssl_cipher=HIGH\"\n",
+ "db_connection_unoptimized = create_engine(connection_url_unoptimized).connect()\n",
+ "connection_url_optimized = \"singlestoredb://\"+connection_user+\":\"+connection_password+\"@\"+connection_host+\":\"+connection_port+\"/s2_tpch_optimized?ssl_cipher=HIGH\"\n",
+ "db_connection_optimized = create_engine(connection_url_optimized).connect()"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "7f84ed59-9d2c-4e3e-92eb-63603645d953",
+ "metadata": {},
+ "source": [
+ "Here are a few queries that you can run side by side against each connection. 
Overall, you will notice an average 4x improvement in performance."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 30,
+ "id": "f3d86d62-ffe2-40ad-b59a-7c77cae98ca0",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql_query4 = sa.text('''\n",
+ "select\n",
+ "    o_orderpriority,\n",
+ "    count(*) as order_count\n",
+ "from\n",
+ "    orders\n",
+ "where\n",
+ "    o_orderdate >= date('1993-07-01')\n",
+ "    and o_orderdate < date('1993-10-01')\n",
+ "    and exists (\n",
+ "        select *\n",
+ "        from lineitem\n",
+ "        where l_orderkey = o_orderkey and l_commitdate < l_receiptdate\n",
+ "    )\n",
+ "group by o_orderpriority\n",
+ "order by o_orderpriority;\n",
+ "''')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 31,
+ "id": "7d947658-6dec-4efa-97f6-863977b2003b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql_query21 = sa.text('''\n",
+ "select\n",
+ "    s_name,\n",
+ "    count(*) as numwait\n",
+ "from\n",
+ "    supplier,\n",
+ "    lineitem l1,\n",
+ "    orders,\n",
+ "    nation\n",
+ "where\n",
+ "    s_suppkey = l1.l_suppkey\n",
+ "    and o_orderkey = l1.l_orderkey\n",
+ "    and o_orderstatus = 'F'\n",
+ "    and l1.l_receiptdate > l1.l_commitdate\n",
+ "    and exists (\n",
+ "        select\n",
+ "            *\n",
+ "        from\n",
+ "            lineitem l2\n",
+ "        where\n",
+ "            l2.l_orderkey = l1.l_orderkey\n",
+ "            and l2.l_suppkey <> l1.l_suppkey\n",
+ "    )\n",
+ "    and not exists (\n",
+ "        select\n",
+ "            *\n",
+ "        from\n",
+ "            lineitem l3\n",
+ "        where\n",
+ "            l3.l_orderkey = l1.l_orderkey\n",
+ "            and l3.l_suppkey <> l1.l_suppkey\n",
+ "            and l3.l_receiptdate > l3.l_commitdate\n",
+ "    )\n",
+ "    and s_nationkey = n_nationkey\n",
+ "    and n_name = 'EGYPT'\n",
+ "group by\n",
+ "    s_name\n",
+ "order by\n",
+ "    numwait desc,\n",
+ "    s_name\n",
+ "limit 100;\n",
+ "''')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 32,
+ "id": "6e4da6ac-52ff-4506-8826-3a46bf350656",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Warm up the optimized connection before timing\n",
+ "result = 
db_connection_optimized.execute(sql_query21)" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "id": "b699bc42-e9bd-4b8f-81c0-48311b7fd14e", + "metadata": {}, + "outputs": [], + "source": [ + "import time\n", + "import pandas as pd\n", + "import plotly.graph_objs as go\n", + "num_iterations = 10\n", + "opt_times = []\n", + "\n", + "for i in range(num_iterations):\n", + " opt_start_time = time.time()\n", + " opt_result = db_connection_optimized.execute(sql_query21)\n", + " opt_stop_time = time.time()\n", + " opt_times.append(opt_stop_time - opt_start_time)\n", + "\n", + "unopt_times = []\n", + "for i in range(num_iterations):\n", + " unopt_start_time = time.time()\n", + " unopt_result = db_connection_unoptimized.execute(sql_query21)\n", + " unopt_stop_time = time.time()\n", + " unopt_times.append(unopt_stop_time - unopt_start_time)\n", + "\n", + "x_axis = list(range(1, num_iterations + 1))\n", + "data = {\n", + " 'iteration': x_axis,\n", + " 'opt_times': opt_times,\n", + " 'unopt_times': unopt_times,\n", + "}\n", + "df = pd.DataFrame.from_dict(data)\n", + "\n", + "fig = go.Figure()\n", + "\n", + "# Adding optimized times to the plot\n", + "fig.add_trace(go.Scatter(x=df['iteration'], y=df['opt_times'], mode='lines+markers', name='Optimized Database'))\n", + "\n", + "# Adding unoptimized times to the plot\n", + "fig.add_trace(go.Scatter(x=df['iteration'], y=df['unopt_times'], mode='lines+markers', name='Unoptimized Database'))\n", + "\n", + "# Update y-axis and x-axis properties\n", + "fig.update_layout(\n", + " title=\"Execution Time Comparison\",\n", + " xaxis_title=\"Iteration\",\n", + " yaxis_title=\"Time in Seconds\",\n", + " xaxis=dict(tickmode='array', tickvals=list(range(1, num_iterations + 1)))\n", + ")\n", + "\n", + "# Show the plot\n", + "fig.show()" + ] + }, + { + "cell_type": "markdown", + "id": "4708ffe5-ea88-48a8-a4ac-9eaa5f801f79", + "metadata": {}, + "source": [ + "
\n", + "
" + ] + } + ], + "metadata": { + "jupyterlab": { + "notebooks": { + "version_major": 6, + "version_minor": 4 + } + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimeType": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.4" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/resources/nb-check.py b/resources/nb-check.py index ab4cb007..7a1414b0 100755 --- a/resources/nb-check.py +++ b/resources/nb-check.py @@ -167,8 +167,9 @@ def new_markdown_cell(cell_id: str, content: list[str]) -> dict[str, Any]: footer_cell = cells.pop(-1) footer_id = footer_cell.get('id', footer_id) - # Convert source lists to a string for cell in cells: + + # Convert source lists to a string source = cell.get('source', []) if isinstance(source, list): source = ''.join(source) @@ -178,6 +179,10 @@ def new_markdown_cell(cell_id: str, content: list[str]) -> dict[str, Any]: source = [] cell['source'] = source + # Remove "attachments": null (not sure how they get in there) + if 'attachments' in cell and cell['attachments'] is None: + cell['attachments'] = {} + # Prepare parameter substitutions for header try: icon_name = toml_info['meta']['icon'] diff --git a/resources/toml-check.py b/resources/toml-check.py index ff6da45d..5927c2cd 100755 --- a/resources/toml-check.py +++ b/resources/toml-check.py @@ -35,8 +35,8 @@ def error(msg): if [x.lower() for x in tags] != tags: error(f'Tags must be in all lower-case ({tags}) in {f}') - if [re.sub(r'[^a-z]', r'', x) for x in tags] != tags: - error(f'Tags can only contain letters ({tags}) in {f}') + if [re.sub(r'[^a-z0-9]', r'', x) for x in tags] != tags: + error(f'Tags can only contain letters and numbers ({tags}) in {f}') # Currently only "spaces" is allowed in destinations destinations = 
meta.get('destinations', [])
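
The tag rules that the toml-check.py hunk above enforces can be sketched as a standalone helper. This is a minimal sketch for illustration only; `check_tags` is a hypothetical name, not a function in the actual script, and it simply mirrors the script's two checks (all lower-case, and, after this change, letters and digits only, so tags like "s3" pass):

```python
import re

def check_tags(tags):
    """Collect tag-validation errors, mirroring the two checks in
    resources/toml-check.py: tags must be all lower-case, and may
    contain only letters and digits."""
    errors = []
    # Check 1: lower-casing every tag must be a no-op
    if [x.lower() for x in tags] != tags:
        errors.append(f'Tags must be in all lower-case ({tags})')
    # Check 2: stripping disallowed characters must be a no-op
    if [re.sub(r'[^a-z0-9]', r'', x) for x in tags] != tags:
        errors.append(f'Tags can only contain letters and numbers ({tags})')
    return errors

print(check_tags(['pipeline', 'json', 's3']))  # valid under the new rule: []
print(check_tags(['S3', 'tpc-h']))             # upper-case + hyphen: 2 errors
```

Note that before this change, the `[^a-z]` pattern would have stripped the "3" from "s3" and rejected the tag, even though it appears in this very notebook's meta.toml.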