-
Notifications
You must be signed in to change notification settings - Fork 45
Scripts
All of the scripts used to run calculations and aggregations on the data are found in the scripts branch of the main repository.
## Raw Data Your working environment will need to have a DATA_DIR environment variable that points to a directory on your machine where the raw data is located. This is also the directory where the final calculated data will be output. The data directory that you specify needs to have a specific folder for each dataset you would like to use, along with files that follow a specific naming convention. Our scripts currently support the following datasets:
Source | Description | Folder | Filename |
---|---|---|---|
RAIS | Employment Data supplied by MTE | rais/ | Raisxxxx.csv1 |
SECEX | Foreign Trade Data supplied by MDIC | secex/ | MDIC_xxxx.csv1 |
1 Where "xxxx" is the year (eg. 2009, 2010)
The files you provide can be either compressed (zip, gzip, or bzip2) or uncompressed.
## Running the Scripts From your environment shell, you can run any specific step you would like with one easy line. Here, we'll run the first step for RAIS, for the year 2002:
python -m scripts.rais.step_1_aggregate -y 2002
You can substitute "rais" for "secex" run the same step for SECEX. There is also a bash script to run all of the steps for a given year. Here, we will run it for the 2008 SECEX data:
bash secex_full_year.sh 2008
Again, change "secex" to the dataset you would like to use, and "2008" to the year you would like to analyze.
NOTE: All scripts MUST be run in order by step number and year (growth calculations use previous year data).
## Importing to MySQL When you have run all of the required steps for a given year, you can then either manually import the appropriate .tsv file that was generated into the database, or you can run the following bash script:
bash scripts/rais/import_sql.sh 2009 ybp
This script will import the final calculated RAIS data for 2009 into the "rais_ybp" table.
## RAIS Calculations
### Step 1 - Aggregate
Filename: step_1_aggregate.py
The first step is to take the raw data file, clean it, and aggregate it to the following levels:
Type | Column | Aggregates |
---|---|---|
Location | bra_id | States, Mesoregions, Planning Regions (for Minas Gerais), and Municipalities |
Industry | isic_id | 1-digit, 3-digit, and 5-digit resolutions |
Occupations | cbo_id | 1-digit, 2-digit, and 4-digit resolutions |
This outputs our first usable file, which everything else is based on: ybio.tsv
### Step 2 - Disaggregate
Filename: step_2_disaggregate.py
Input: ybio.tsv
Taking the newly created ybio.tsv file from step 1, we disaggregate the data to all of the different tables.
By disaggregating the data to different tables, we can dramatically improve query time by only accessing the table that contains data pertinent to our query. Also, certain calculations are only applicable to certain index pairings.
Output: ybo.tsv, ybi.tsv, yio.tsv, yo.tsv, yi.tsv, yb.tsv
### Step 3 - Recommended Employees
Filename: step_3_required.py
Inputs: ybio.tsv, ybi.tsv
We now calculate the recommended number of employees for each location-indsutry-occupation grouping.
Output: ybio_required.tsv
### Step 4 - Importance
Filename: step_4_importance.py
Inputs: ybio.tsv, yio.tsv
Next, we calculate how unique an occupation is in each industry that it is present in. Notice that we can use the old ybio.tsv file (not the newly generated ybio_required.tsv) because we do not need the Recommended Employee number for this calculation.
Output: yio_importance.tsv
### Step 5 - Diversity
Filename: step_5_diversity.py
Inputs: ybi.tsv, ybo.tsv, yio.tsv
For each index, we can calculate the Diversity and Effective Diversity of itself in regards to the other 2 indexes.
Outputs: yb_diversity.tsv, yi_diversity.tsv, yo_diversity.tsv
### Step 6 - RCA, Distance, and Opportunity Gain
Filename: step_6_rca_dist_opp.py
Input: ybi.tsv
Now, we calculate RCA, Distance, and Opportunity Gain for each location-industry pairing.
Output: ybi_rcas_dist_opp.tsv
### Step 7 - Growth
Filename: step_7_growth.py
Inputs: ybio_requred.tsv, ybo.tsv, ybi_rcas_dist_opp.tsv, yio_importance.tsv, yo_diversity.tsv, yi_diversity.tsv, yb_diversity.tsv
Finally, we calculate both the Wage and Employee growth for every table. This adds 8 columns to each table, as each value has growth values and percents for 1 year and 5 year.
Remember, this step requires the previous years data to run the calculations, if that data is not present (or it's the first year), the script will return NULL values.
Outputs: ybio_requred_growth.tsv, ybo_growth.tsv, ybi_rcas_dist_opp_growth.tsv, yio_importance_growth.tsv, yo_diversity_growth.tsv, yi_diversity_growth.tsv, yb_diversity_growth.tsv
## SECEX Calculations
### Step 1 - Aggregate
Filename: step_1_aggregate.py
The first step is to take the raw data file, clean it, and aggregate it to the following levels:
Type | Column | Aggregates |
---|---|---|
Location | bra_id | States, Mesoregions, Planning Regions (for Minas Gerais), and Municipalities |
Product | hs_id | 2-digit, 4-digit, and 6-digit resolutions |
Trade Partner | wld_id | 2-digit and 5-digit resolutions |
This outputs our first usable file, which everything else is based on: ybpw.tsv
### Step 2 - Disaggregate
Filename: step_2_disaggregate.py
Input: ybpw.tsv
Taking the newly created ybpw.tsv file from step 1, we disaggregate the data to all of the different tables.
By disaggregating the data to different tables, we can dramatically improve query time by only accessing the table that contains data pertinent to our query. Also, certain calculations are only applicable to certain index pairings.
Outputs: ybw.tsv, ybp.tsv, ypw.tsv, yw.tsv, yp.tsv, yb.tsv
### Step 3 - PCI and International ECI
Filename: step_3_pci_wld_eci.py
Inputs: yp.tsv, yw.tsv, comtrade_ecis.tsv, comtrade_pcis.tsv
In this step, we take the pre-calculated ECI and PCI values for Trade Partners and Products (respectively) and put them in our tables.
These values were calculated by the team from The Observatory of Economic Complexity.
Outputs: yp_pcis.tsv, yw_ecis.tsv
### Step 4 - Domestic ECI
Filename: step_4_eci.py
Inputs: yp_pcis.tsv, yb.tsv, ybp.tsv
The ECI for a location is calculated as a sum of the shares of a location's exports' PCIs. We only include the products exported by a location with RCA >= 1. The following is a step-by-step example of how the ECI for a fictional location is calculated.
exports = [1, 8, 2]
exports_rca = [0.21, 33.1211, 1.09]
So now excluding the products which are not exported with RCA >= 1 we are left with the following export list:
exports = [8, 2]
or represented as a share:
exports = [0.8, 0.2]
Now using the PCI values for these products by multiplying them by the share of that product in that location's basket:
pcis = [-0.034, 2.333]
exports_as_pci_share = [-0.0272, 0.4666]
Summing these product's PCI shares we get the locations ECI
eci = sum([-0.0272, 0.4666])
eci == 0.4395
Output: yb_ecis.tsv
### Step 5 - Diversity
Filename: step_5_diversity.py
Inputs: ybp.tsv, ybw.tsv, ypw.tsv, yb_ecis.tsv, yp_pcis.tsv, yw_ecis.tsv
For each index, we can calculate the Diversity and Effective Diversity of itself in regards to the other 2 indexes.
Outputs: yb_ecis_diversity.tsv, yp_pcis_diversity.tsv, yw_ecis_diversity.tsv
### Step 6 - International RCA for Brazil
Filename: step_6_yp_rca.py
Inputs: yp_pcis_diversity.tsv, comtrade_ypw.tsv
Here, we bring in the International RCA of Brazil for each product from the COMTRADE data.
Output: yp_pcis_diversity_rca.tsv
### Step 7 - RCA, Distance, and Opportunity Gain
Filename: step_7_ybp_rdo.py
Input: ybp.tsv, yp_pcis_diversity_rca.tsv
Now, we calculate RCA, Distance, and Opportunity Gain for each location-industry pairing (both domestic and international).
Output: ybp_rcas_dist_opp.tsv
### Step 8 - Growth
Filename: step_8_growth.py
Inputs: ybpw.tsv, ybw.tsv, ybp_rcas_dist_opp.tsv, ypw.tsv, yw_ecis_diversity.tsv, yp_pcis_diversity.tsv, yb_ecis_diversity.tsv
Finally, we calculate the Trade Value growth for every table. This adds 4 columns to each table, growth values and percents for 1 year and 5 year.
Remember, this step requires the previous years data to run the calculations, if that data is not present (or it's the first year), the script will return NULL values.
Outputs: ybpw_growth.tsv, ybw_growth.tsv, ybp_rcas_dist_opp_growth.tsv, ypw_growth.tsv, yw_ecis_diversity_growth.tsv, yp_pcis_diversity_growth.tsv, yb_ecis_diversity_growth.tsv