Skip to content

Scripts

Alex Simoes edited this page Feb 19, 2015 · 3 revisions

All of the scripts used to run calculations and aggregations on the data are found in the scripts branch of the main repository.

## Raw Data Your working environment will need to have a DATA_DIR environment variable that points to a directory on your machine where the raw data is located. This is also the directory where the final calculated data will be output. The data directory that you specify needs to have a specific folder for each dataset you would like to use, along with files that follow a specific naming convention. Our scripts currently support the following datasets:

Source Description Folder Filename
RAIS Employment Data supplied by MTE rais/ Raisxxxx.csv1
SECEX Foreign Trade Data supplied by MDIC secex/ MDIC_xxxx.csv1

1 Where "xxxx" is the year (eg. 2009, 2010)

The files you provide can be either compressed (zip, gzip, or bzip2) or uncompressed.

## Running the Scripts From your environment shell, you can run any specific step you would like with one easy line. Here, we'll run the first step for RAIS, for the year 2002:

python -m scripts.rais.step_1_aggregate -y 2002

You can substitute "rais" for "secex" run the same step for SECEX. There is also a bash script to run all of the steps for a given year. Here, we will run it for the 2008 SECEX data:

bash secex_full_year.sh 2008

Again, change "secex" to the dataset you would like to use, and "2008" to the year you would like to analyze.

NOTE: All scripts MUST be run in order by step number and year (growth calculations use previous year data).

## Importing to MySQL When you have run all of the required steps for a given year, you can then either manually import the appropriate .tsv file that was generated into the database, or you can run the following bash script:

bash scripts/rais/import_sql.sh 2009 ybp

This script will import the final calculated RAIS data for 2009 into the "rais_ybp" table.

## RAIS Calculations

### Step 1 - Aggregate

Filename: step_1_aggregate.py

The first step is to take the raw data file, clean it, and aggregate it to the following levels:

Type Column Aggregates
Location bra_id States, Mesoregions, Planning Regions (for Minas Gerais), and Municipalities
Industry isic_id 1-digit, 3-digit, and 5-digit resolutions
Occupations cbo_id 1-digit, 2-digit, and 4-digit resolutions

This outputs our first usable file, which everything else is based on: ybio.tsv

### Step 2 - Disaggregate

Filename: step_2_disaggregate.py

Input: ybio.tsv

Taking the newly created ybio.tsv file from step 1, we disaggregate the data to all of the different tables.

By disaggregating the data to different tables, we can dramatically improve query time by only accessing the table that contains data pertinent to our query. Also, certain calculations are only applicable to certain index pairings.

Output: ybo.tsv, ybi.tsv, yio.tsv, yo.tsv, yi.tsv, yb.tsv

### Step 3 - Recommended Employees

Filename: step_3_required.py

Inputs: ybio.tsv, ybi.tsv

We now calculate the recommended number of employees for each location-indsutry-occupation grouping.

Output: ybio_required.tsv

### Step 4 - Importance

Filename: step_4_importance.py

Inputs: ybio.tsv, yio.tsv

Next, we calculate how unique an occupation is in each industry that it is present in. Notice that we can use the old ybio.tsv file (not the newly generated ybio_required.tsv) because we do not need the Recommended Employee number for this calculation.

Output: yio_importance.tsv

### Step 5 - Diversity

Filename: step_5_diversity.py

Inputs: ybi.tsv, ybo.tsv, yio.tsv

For each index, we can calculate the Diversity and Effective Diversity of itself in regards to the other 2 indexes.

Outputs: yb_diversity.tsv, yi_diversity.tsv, yo_diversity.tsv

### Step 6 - RCA, Distance, and Opportunity Gain

Filename: step_6_rca_dist_opp.py

Input: ybi.tsv

Now, we calculate RCA, Distance, and Opportunity Gain for each location-industry pairing.

Output: ybi_rcas_dist_opp.tsv

### Step 7 - Growth

Filename: step_7_growth.py

Inputs: ybio_requred.tsv, ybo.tsv, ybi_rcas_dist_opp.tsv, yio_importance.tsv, yo_diversity.tsv, yi_diversity.tsv, yb_diversity.tsv

Finally, we calculate both the Wage and Employee growth for every table. This adds 8 columns to each table, as each value has growth values and percents for 1 year and 5 year.

Remember, this step requires the previous years data to run the calculations, if that data is not present (or it's the first year), the script will return NULL values.

Outputs: ybio_requred_growth.tsv, ybo_growth.tsv, ybi_rcas_dist_opp_growth.tsv, yio_importance_growth.tsv, yo_diversity_growth.tsv, yi_diversity_growth.tsv, yb_diversity_growth.tsv

## SECEX Calculations

### Step 1 - Aggregate

Filename: step_1_aggregate.py

The first step is to take the raw data file, clean it, and aggregate it to the following levels:

Type Column Aggregates
Location bra_id States, Mesoregions, Planning Regions (for Minas Gerais), and Municipalities
Product hs_id 2-digit, 4-digit, and 6-digit resolutions
Trade Partner wld_id 2-digit and 5-digit resolutions

This outputs our first usable file, which everything else is based on: ybpw.tsv

### Step 2 - Disaggregate

Filename: step_2_disaggregate.py

Input: ybpw.tsv

Taking the newly created ybpw.tsv file from step 1, we disaggregate the data to all of the different tables.

By disaggregating the data to different tables, we can dramatically improve query time by only accessing the table that contains data pertinent to our query. Also, certain calculations are only applicable to certain index pairings.

Outputs: ybw.tsv, ybp.tsv, ypw.tsv, yw.tsv, yp.tsv, yb.tsv

### Step 3 - PCI and International ECI

Filename: step_3_pci_wld_eci.py

Inputs: yp.tsv, yw.tsv, comtrade_ecis.tsv, comtrade_pcis.tsv

In this step, we take the pre-calculated ECI and PCI values for Trade Partners and Products (respectively) and put them in our tables.

These values were calculated by the team from The Observatory of Economic Complexity.

Outputs: yp_pcis.tsv, yw_ecis.tsv

### Step 4 - Domestic ECI

Filename: step_4_eci.py

Inputs: yp_pcis.tsv, yb.tsv, ybp.tsv

The ECI for a location is calculated as a sum of the shares of a location's exports' PCIs. We only include the products exported by a location with RCA >= 1. The following is a step-by-step example of how the ECI for a fictional location is calculated.

exports = [1, 8, 2]
exports_rca = [0.21, 33.1211, 1.09]

So now excluding the products which are not exported with RCA >= 1 we are left with the following export list:

exports = [8, 2]

or represented as a share:

exports = [0.8, 0.2]

Now using the PCI values for these products by multiplying them by the share of that product in that location's basket:

pcis = [-0.034, 2.333]
exports_as_pci_share = [-0.0272, 0.4666]

Summing these product's PCI shares we get the locations ECI

eci = sum([-0.0272, 0.4666])
eci == 0.4395

Output: yb_ecis.tsv

### Step 5 - Diversity

Filename: step_5_diversity.py

Inputs: ybp.tsv, ybw.tsv, ypw.tsv, yb_ecis.tsv, yp_pcis.tsv, yw_ecis.tsv

For each index, we can calculate the Diversity and Effective Diversity of itself in regards to the other 2 indexes.

Outputs: yb_ecis_diversity.tsv, yp_pcis_diversity.tsv, yw_ecis_diversity.tsv

### Step 6 - International RCA for Brazil

Filename: step_6_yp_rca.py

Inputs: yp_pcis_diversity.tsv, comtrade_ypw.tsv

Here, we bring in the International RCA of Brazil for each product from the COMTRADE data.

Output: yp_pcis_diversity_rca.tsv

### Step 7 - RCA, Distance, and Opportunity Gain

Filename: step_7_ybp_rdo.py

Input: ybp.tsv, yp_pcis_diversity_rca.tsv

Now, we calculate RCA, Distance, and Opportunity Gain for each location-industry pairing (both domestic and international).

Output: ybp_rcas_dist_opp.tsv

### Step 8 - Growth

Filename: step_8_growth.py

Inputs: ybpw.tsv, ybw.tsv, ybp_rcas_dist_opp.tsv, ypw.tsv, yw_ecis_diversity.tsv, yp_pcis_diversity.tsv, yb_ecis_diversity.tsv

Finally, we calculate the Trade Value growth for every table. This adds 4 columns to each table, growth values and percents for 1 year and 5 year.

Remember, this step requires the previous years data to run the calculations, if that data is not present (or it's the first year), the script will return NULL values.

Outputs: ybpw_growth.tsv, ybw_growth.tsv, ybp_rcas_dist_opp_growth.tsv, ypw_growth.tsv, yw_ecis_diversity_growth.tsv, yp_pcis_diversity_growth.tsv, yb_ecis_diversity_growth.tsv

Clone this wiki locally