Markowitz' Critical Line Algorithm (CLA) explores the Pareto frontier of the bi-objective optimization problem
with many problems in modern portfolio theory (MPT) building from here [1]. When developing algorithms and software to solve these, it's important to have realistic test problems to check against.
However, many papers which discuss MPT either do not include their test data or reference datasets which are no longer available. To save others the time, here's a collation of relevant datasets which I compiled over the course of my PhD on CLA.
Dataset Name | Assets | Description | License | Sources |
---|---|---|---|---|
Xu2 |
2 | Simple two asset problem | MIT | [2] |
Xu3 |
3 | Simple three asset problem | MIT | [2] |
Stanford3 |
3 | Three assets with non-integer bounds | Unknown | [3] |
Full5 |
5 | Small, but more-fully explores CLA | MIT | Original |
StockIndex |
6 | Monthly prices of stock indices | GPL v3+ | [5] |
StockIndexAdj |
6 | Adjusted monthly prices of stock indices | GPL v3+ | [5] |
StockIndexAdjD |
6 | Adjusted daily prices of stock indices | GPL v3+ | [5] |
ESCBFX |
7 | Currency rates against the EUR | GPL v3+ | [5] |
Markowitz |
10 | Markowitz' original ten asset problem | Unknown | [1] |
MultiAsset |
10 | Stock and bond indices against gold | GPL v3+ | [5] |
DowJones |
28 | DJIA asset prices | CC-BY 4.0 | [4] |
INDTRACK1 |
31 | HSI asset prices | MIT | [7] [8] |
EuroStoxx50 |
48 | EURO STOXX 50 asset prices | Unknown | [6] |
FF49Industries |
49 | Fama and French 49 Industry prices | CC-BY 4.0 | [4] [9] |
NASDAQ100 |
82 | NASDAQ 100 asset prices | CC-BY 4.0 | [4] |
FTSE100 |
83 | FTSE 100 asset prices | CC-BY 4.0 | [4] |
INDTRACK2 |
85 | DAX 100 asset prices | MIT | [7] [8] |
INDTRACK3 |
89 | FTSE 100 asset prices | MIT | [7] [8] |
INDTRACK4 |
98 | S&P 100 asset prices | MIT | [7] [8] |
INDTRACK5 |
225 | Nikkei 225 asset prices | MIT | [7] [8] |
MIBTEL |
226 | Borsa Italiana asset prices | Unknown | [6] |
SP500 |
442 | S&P 500 asset prices | CC-BY 4.0 | [4] |
INDTRACK6 |
457 | S&P 500 asset prices | MIT | [8] |
NASDAQComp |
1203 | NASDAQ Composite asset prices | CC-BY 4.0 | [4] |
These datasets have been standardised to use a consistent, plaintext file format so that they can more easily be used across multiple programming languages. The number of decimal places included in each dataset varies based on its original source.
Where available, datasets include the original asset price time series in a timeseries.csv
file. Each row is a different time step, with the first column containing the date of the price (in YYYY-MM-DD format) if known or otherwise T{i}
, where {i}
is the step number. Each column is a different asset, with the first row containing the asset's name if known or otherwise S{j}
, where {j}
is the asset number. For example, the time series for EuroStoxx50 begins
EuroStoxx50, AABA.AS, ACA.PA, AGN.AS, AI.PA
2003-03-03, 10.4, 11.42, 5.54, 23.3
2003-03-10, 11.26, 11.93, 6.58, 24.3
2003-03-17, 12.16, 12.04, 6.7, 26.33
2003-03-24, 11.34, 12.21, 6.17, 25.53
Note that those datasets related to index tracking research begin with a column for the price of the index itself, followed by columns for a number of constituent assets. For example, the time series for INDTRACK1
which tracks HSI begins
INDTRACK1, Index, S1, S2, S3
T1, 8749.31759356, 9.33675195, 14.37788798, 7.36644878
T2, 8713.53262793, 9.86926631, 14.18263271, 7.40194974
T3, 9001.60515134, 9.95801871, 14.59089372, 7.77470979
T4, 8903.30299873, 9.15924716, 15.74467486, 7.63270596
Every dataset includes a return.csv
file encoding expected returns for each asset. Where one was included with the original dataset, this has been retained. Otherwise, the mean is computed over 50 most recent time steps. The file has one asset per row, with the first column being the mean return and the second column being the standard deviation of the return. For example, the mean returns file for INDTRACK1 begins
0.001309,0.043208
0.004177,0.040258
0.001487,0.041342
0.004515,0.044896
0.010865,0.069105
Every dataset includes a risk.csv
file encoding the covariances between each pair of assets. Where one was included with the original dataset, this has been retained. Otherwise, the covariance matrix is computed over 50 most recent time steps.
For some problem, the covariance matrix is stored directly as a CSV. For example, Stanford3
has the risk file
1.0,0.4,0.15
0.4,1.0,0.35
0.15,0.35,1.0
which corresponds to the matrix
For other problems, particularly where the covariance matrix has many zero entries, the covariance matrix is stored in sparse matrix coordinate format. For each non-zero matrix entry, the first column gives its row index (from 1), the second column its column index, and the third its matrix value. For example, the risk file for Xu3
is
1,1,5
2,2,3
3,3,2
which corresponds to the matrix
Some datasets (though not all) include a problem.csv
file encoding all the main problem variables. The structure of the datafile is
S1, S2, ... , Sn
mu_1, mu_2, ... , mu_n
lb_1, lb_2, ... , lb_n
ub_1, ub_2, ... , ub_n
C_11, C_12, ... , C_1n
C_21, C_22, ... , C_2n
.. , .. , ... , ..
C_n1, C_n2, ... , C_nn
where mu
is the vector of respected returns, lb
is the vector of lower weight bounds, ub
is the vector of upper weight bounds, and C_ij
is the (i, j)th entry of the covariance matrix. The first row gives identifying labels for each asset.
Some datasets (though not all) include a frontier.csv
file encoding the unconstrained efficient frontiers. Each row is a calculated point on the unconstrained frontier, with the first column being the mean return and the second being the variance of return. This file contains no headers or labels. For example, the efficient frontier for INDTRACK1
begins
0.0108650000,0.0047755010
0.0108609579,0.0047677406
0.0108569167,0.0047599943
0.0108528746,0.0047522585
0.0108488325,0.0047445351
Be aware that this work is derived from datasets which were released under a range of different licenses. If a license above is given as Unknown, that means though the data was released publicly the author(s) did not specify the terms of its release; if used as part of further work it is at your own risk. In that and all cases, usage should be accompanied with a citation to the relevant source.
If you notice any errors in the included files, can provide an update to an existing dataset, or find an additional dataset that would be worth including, feel free to get in touch or open an issue on GitHub.
If you feel the need to cite this repository, see the "Cite this repository" button in the GitHub sidebar. More appropriately though, below are the original sources of the included datasets:
-
Harry M. Markowitz (1987), Mean-Variance Analysis in Portfolio Choice and Capital Markets, B. Blackwell. ISBN: 0-631-15381-0.
-
Xu Shuhan (March 2020), Exploring Efficient Frontier in Portfolio Analysis, dissertation completed in part towards a degree at the University of Edinburgh, supervised by Julian Hall.
-
William F. Sharpe (May 1999), The Critical Line Method within Macro-Investment Analysis, an incomplete book published online at Stanford University.
-
Renato Bruni, Francesco Cesarone, Andrea Scozzari, and Fabio Tardella (September 2016), Real-world datasets for portfolio selection and solutions of some stochastic dominance portfolio models in Data in Brief, Volume 8, Pages 858-862. ISSN: 2352-3409. DOI: 10.1016/j.dib.2016.06.031. Data was constructed using a combination of information from Thomson Reuters Datastream and the Fama & French Data Library, then checked to correct missing or inaccurate values.
-
Bernhard Pfaff (December 2012), FRAPO: Financial Risk Modelling and Portfolio Optimisation with R in CRAN (Comprehensive R Archive Network), version 0.4. DOI: 10.32614/CRAN.package.FRAPO. The FRAPO package [5] also includes dataset called
FTSE100
,NASDAQ
, andSP500
. These are all drawn from [6] and are (presumably) outdated versions of what are included in [4], so have been omitted here. -
Francesco Cesarone, Andrea Scozzari, and Fabio Tardella (July 2015), Linear vs. quadratic portfolio selection models with hard real-world constraints in Computer Management Science, Volume 12, Pages 345-370. DOI: 10.1007/s10287-014-0210-1. This paper, via reference to another, mentions additional datasets from S&P 500 with 476 assets and one from NASDAQ which includes 2196 assets. However, these have been lost due to link rot.
-
T.-J. Chang, Nigel Meade, John E. Beasley, and Yazid M. Sharaiha (2000), Heuristics for cardinality constrained portfolio optimisation in Computers & Operations Research, Volume 27, Pages 1271-1302. DOI: 10.1016/S0305-0548(99)00074-X.
-
Nilgun A. Canakgoz and John E. Beasley (July 2009), Mixed-integer programming approaches for index tracking and enhanced indexation in European Journal of Operational Research, Volume 196, Pages 384-399. DOI: 10.1016/j.ejor.2008.03.015
-
Eugene F. Fama and Kenneth R. French (December 2023), Production of U.S. Rm-Rf, SMB, and HML in the Fama-French Data Library in Chicago Booth Research Paper, No. 23-22, Fama-Miller Working Paper. DOI: 10.2139/ssrn.4629613. Through Fama and French a more up to date version of this dataset could be obtained, though lacking the adjustments made by Bruni et al. [4].