Repository for MSDS610 Communications for Analytics final project.
This project demonstrates the use of different types of joins to analyze employee and project data. It involves three primary SQL operations: inner join, left join, and full outer join.
To run these SQL queries, you need access to a relational database system (e.g., PostgreSQL, MySQL) and a database containing tables named `employee` and `project`. You can set up a database and run these queries using SQL client tools or by pasting them into a SQL script.
- `employee`: Contains information about employees, including `employee_id`, `employee_name`, `department`, and `position`.
- `project`: Contains information about projects, including `project_id`, `project_name`, `pic_id` (person in charge), `supervisor_id`, and `status`.
The code includes three SQL queries: inner join, left join, and outer join. These queries combine data from the `employee` and `project` tables based on common keys.
- Inner Join: Combines employee and project data using an inner join. It selects project information along with the employee responsible (`person_in_charge`) and their department and position. The result includes only matched rows.

  ```sql
  inner_join AS (
      SELECT p.project_name,
             e.employee_name AS person_in_charge,
             e.department,
             e.position,
             p.status,
             p.supervisor_id
      FROM employee AS e
      JOIN project AS p ON e.employee_id = p.pic_id
  )
  ```

- Left Join: Combines employee and project data using a left join. It selects employee information and the names of the projects each employee is in charge of (matched on `pic_id`). Rows with no matches in the `project` table will have NULL values for project-related columns.

  ```sql
  left_join AS (
      SELECT e.employee_name,
             p.project_name,
             e.department,
             e.position,
             p.status
      FROM employee AS e
      LEFT JOIN project AS p ON e.employee_id = p.pic_id
  )
  ```

- Outer Join: Combines employee and project data using a full outer join. It selects all available data from both tables. Rows with no matches in either table will have NULL values in the corresponding columns.

  ```sql
  outer_join AS (
      SELECT e.employee_name,
             p.project_name,
             e.department,
             e.position,
             p.status
      FROM employee AS e
      FULL OUTER JOIN project AS p ON e.employee_id = p.pic_id
  )
  ```
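The results of each join are saved in the `inner_join`, `left_join`, and `outer_join` CTEs. You can view the results by selecting from these CTEs. Because a CTE exists only for the duration of one statement, the `SELECT` must follow the `WITH` clause that defines the CTEs in the same statement. The output includes relevant employee and project information based on the type of join performed.

Example: outer join

```sql
SELECT * FROM outer_join;
```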
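As a rough illustration of how the CTEs fit into a single statement, the sketch below runs the inner and left joins against an in-memory SQLite database using Python's standard `sqlite3` module. The sample rows are invented for illustration, the selected columns are trimmed, and SQLite stands in for PostgreSQL. `outer_join` is left out here because SQLite versions before 3.39 do not support `FULL OUTER JOIN`.

```python
import sqlite3

# In-memory database with made-up sample rows (not the project's real data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (employee_id INTEGER PRIMARY KEY, employee_name TEXT,
                       department TEXT, position TEXT);
CREATE TABLE project (project_id INTEGER PRIMARY KEY, project_name TEXT,
                      pic_id INTEGER, supervisor_id INTEGER, status TEXT);
INSERT INTO employee VALUES
    (1, 'Ana', 'Analytics', 'Analyst'),
    (2, 'Ben', 'Engineering', 'Engineer'),
    (3, 'Cho', 'Analytics', 'Manager');
INSERT INTO project VALUES
    (10, 'Churn model', 1, 3, 'active'),
    (11, 'Data pipeline', 2, 3, 'done'),
    (12, 'Dashboard', NULL, 3, 'planned');
""")

# Both CTEs live in one WITH clause; the final SELECT picks which one to read.
CTES = """
WITH inner_join AS (
    SELECT p.project_name, e.employee_name AS person_in_charge
    FROM employee AS e
    JOIN project AS p ON e.employee_id = p.pic_id
),
left_join AS (
    SELECT e.employee_name, p.project_name
    FROM employee AS e
    LEFT JOIN project AS p ON e.employee_id = p.pic_id
)
"""

inner_rows = conn.execute(
    CTES + "SELECT * FROM inner_join ORDER BY project_name"
).fetchall()
left_rows = conn.execute(
    CTES + "SELECT * FROM left_join ORDER BY employee_name"
).fetchall()

print(inner_rows)  # only matched employee/project pairs
print(left_rows)   # every employee; project_name is None when unmatched
```

With this sample data, the employee who leads no project appears only in the left join result, with `None` in place of a project name.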
- Inner Join: Shows only the employees who are in charge of a project, together with those projects.
- Left Join: Also includes employees who are not currently in charge of any project.
- Outer Join: Shows which employees are available and which projects have not yet been taken.
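The availability reading of the outer join can be sketched as follows. This is a minimal illustration rather than the project's code: the sample rows are invented, the tables are trimmed to the columns the join needs, and SQLite (via Python's `sqlite3`) stands in for PostgreSQL. The full outer join is emulated with a `LEFT JOIN` plus a `UNION ALL` of the unmatched projects, so the example also runs on SQLite versions without `FULL OUTER JOIN`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (employee_id INTEGER PRIMARY KEY, employee_name TEXT);
CREATE TABLE project (project_id INTEGER PRIMARY KEY, project_name TEXT,
                      pic_id INTEGER);
INSERT INTO employee VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cho');
INSERT INTO project VALUES (10, 'Churn model', 1), (11, 'Data pipeline', 2),
                           (12, 'Dashboard', NULL);
""")

# Emulated full outer join: all employees (with any project they lead),
# plus the projects whose pic_id matches no employee.
OUTER_JOIN = """
SELECT e.employee_name, p.project_name
FROM employee AS e
LEFT JOIN project AS p ON e.employee_id = p.pic_id
UNION ALL
SELECT NULL, p.project_name
FROM project AS p
LEFT JOIN employee AS e ON e.employee_id = p.pic_id
WHERE e.employee_id IS NULL
"""
rows = conn.execute(OUTER_JOIN).fetchall()

# A NULL (None) on one side marks a row that had no match on the other side.
available_employees = sorted(name for name, project in rows if project is None)
untaken_projects = sorted(project for name, project in rows if name is None)

print(available_employees)  # ['Cho']
print(untaken_projects)     # ['Dashboard']
```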
This part compares the runtime of inefficient and efficient queries for different joins. The code is in `joins.ipynb`.
To run these SQL queries, you need access to a relational database system (e.g., PostgreSQL, MySQL). In this part, we will use `psycopg` to access PostgreSQL.
We first import the modules we need.

```python
import psycopg
import time
import pandas as pd
```
Then we connect to the database with `psycopg`.

```python
user = 'postgres'
host = 'localhost'
dbname = 'msds691_HW'

# Open a connection and a cursor; both are closed when the blocks exit.
with psycopg.connect(f"user='{user}' host='{host}' dbname='{dbname}'") as conn:
    with conn.cursor() as curs:
        # use the query you want here, e.g. curs.execute(query)
        pass
```
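One way to use the cursor inside that block is a small helper that runs a query and returns the rows as dictionaries keyed by column name. The helper name `fetch_dicts` is hypothetical. It relies only on the `execute`, `description`, and `fetchall` members that Python DB-API cursors provide, and the demonstration uses an in-memory `sqlite3` database with an invented row as a stand-in, so it runs without a PostgreSQL server.

```python
import sqlite3


def fetch_dicts(curs, query):
    """Run `query` on a DB-API cursor and return rows as dicts (hypothetical helper)."""
    curs.execute(query)
    # The first field of each description entry is the column name.
    columns = [col[0] for col in curs.description]
    return [dict(zip(columns, row)) for row in curs.fetchall()]


# Stand-in demonstration with sqlite3 and a made-up row.
demo_conn = sqlite3.connect(":memory:")
demo_curs = demo_conn.cursor()
demo_curs.execute(
    "CREATE TABLE report_type (report_type_code TEXT, report_type_description TEXT)"
)
demo_curs.execute("INSERT INTO report_type VALUES ('II', 'Initial')")

result = fetch_dicts(demo_curs, "SELECT * FROM report_type")
print(result)
```

The returned list of dictionaries can be passed directly to `pd.DataFrame(...)` if a DataFrame is more convenient.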
To measure runtime, we use the `%%time` cell magic at the top of each notebook cell.
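`%%time` is only available in IPython and Jupyter. Outside a notebook, a comparable measurement can be sketched with the standard `time` module, using `time.perf_counter()` for wall time and `time.process_time()` for CPU time. The `timed` helper below is a hypothetical name, and the workload is a placeholder rather than one of the project's queries.

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, wall_seconds, cpu_seconds). Hypothetical helper."""
    wall_start = time.perf_counter()   # wall-clock timer
    cpu_start = time.process_time()    # CPU time of this process (user + sys)
    result = func(*args, **kwargs)
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return result, wall_seconds, cpu_seconds


# Placeholder workload; in practice this would be a function that runs a query.
total, wall, cpu = timed(sum, range(1_000_000))
print(f"CPU time: {cpu * 1000:.1f} ms, wall time: {wall * 1000:.1f} ms")
```

Time spent waiting on the database server counts toward wall time but not toward the CPU time of the Python process, which is why the two figures can differ noticeably.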
- `incident_type`: Contains information about incident types, including `incident_code`, `incident_category`, `incident_subcategory`, and `incident_description`.
- `incident`: Contains information about incidents, including `incident_id`, `incident_datetime`, `report_datetime`, `longtitude`, `latitude`, `report_type_code`, and `incident_code`.
- `location`: Contains information about locations, including `longtitude`, `latitude`, `supervisor_district`, `police_district`, and `analysis_neighborhood`.
- `report_type`: Contains information about report types, including `report_type_code` and `report_type_description`.
Here are the results of the runtime comparison between the efficient and inefficient queries.
- Basic Join
  - Inefficient (without joins): CPU times: user 508 ms, sys: 121 ms, total: 629 ms; Wall time: 1.4 s
  - Efficient: CPU times: user 473 ms, sys: 107 ms, total: 580 ms; Wall time: 771 ms
- Inner Join
  - Inefficient (without joins): CPU times: user 556 ms, sys: 107 ms, total: 664 ms; Wall time: 1.52 s
  - Efficient: CPU times: user 544 ms, sys: 103 ms, total: 647 ms; Wall time: 1.05 s
- Outer Join
  - Inefficient (without joins): CPU times: user 546 ms, sys: 100 ms, total: 646 ms; Wall time: 1.41 s
  - Efficient: CPU times: user 547 ms, sys: 104 ms, total: 651 ms; Wall time: 796 ms
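In all three cases, the query that uses joins finishes faster in wall time (771 ms vs. 1.4 s, 1.05 s vs. 1.52 s, and 796 ms vs. 1.41 s), while CPU time is nearly the same in each pair.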
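The notebook's actual queries are not reproduced here. As one plausible illustration of what a query "without joins" might look like, the sketch below assumes it uses a correlated subquery to look up each incident's category, and compares it with an explicit join. The rows are invented, only two columns per table are kept, and SQLite (via `sqlite3`) stands in for PostgreSQL, so timings from this sketch would not match the figures above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE incident_type (incident_code INTEGER PRIMARY KEY,
                            incident_category TEXT);
CREATE TABLE incident (incident_id INTEGER PRIMARY KEY, incident_code INTEGER);
INSERT INTO incident_type VALUES (100, 'Larceny'), (200, 'Assault');
INSERT INTO incident VALUES (1, 100), (2, 200), (3, 100);
""")

# Assumed "without joins" form: a correlated subquery looks up each category.
WITHOUT_JOIN = """
SELECT i.incident_id,
       (SELECT t.incident_category
        FROM incident_type AS t
        WHERE t.incident_code = i.incident_code) AS incident_category
FROM incident AS i
ORDER BY i.incident_id
"""

# Join form: the query planner decides how to match the two tables.
WITH_JOIN = """
SELECT i.incident_id, t.incident_category
FROM incident AS i
JOIN incident_type AS t ON t.incident_code = i.incident_code
ORDER BY i.incident_id
"""

rows_without_join = conn.execute(WITHOUT_JOIN).fetchall()
rows_with_join = conn.execute(WITH_JOIN).fetchall()
print(rows_without_join == rows_with_join)  # both forms return the same rows here
```

Checking that both forms return the same rows before timing them helps ensure the comparison measures the query shape rather than a difference in results.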