Niklas Berliner edited this page Nov 14, 2015 · 9 revisions

Welcome to the delveData wiki!

This project is my playground to explore ways to combine and use some interesting datasets, to improve my problem-solving skills, to learn more about machine learning techniques, and, hopefully, to gain some insight into how different aspects of our lives influence the way we live together.

Data used in this project

For this project, multiple data sources are combined to give insights into the relationship between migration and refugee streams, education, economic investment, and ecological factors.

Currently the following data sources are integrated:

  • The World Bank maintains an open data portal offering many datasets in various categories. I grouped the indicators into Development, Ecology, Economy (general), Economy (social impact), Economy (employment), Education, Emission, Energy, Government expenditure, Health, International relations, Land use, and Population. The full list of the utilized indicators and their categorisation can be found here.

  • Migration and refugee data was obtained from the UN Refugee Agency (UNHCR) as well as from the Organisation for Economic Co-operation and Development (OECD). The two datasets should complement each other and make the information more reliable. Note, however, that the OECD data is not available for the full time range.

  • Climate data was obtained from the US National Oceanic and Atmospheric Administration (see here). If available, information about temperature, precipitation, and snowfall is extracted. The daily measurements from the original dataset are converted into a severity measure ranging from 1 to 12 based on the following criteria. For each weather station and each month, the average value is compared against the average of all past years. If the average of the current year deviates by more than 1.54 standard deviations from the average of the past years, the event is classified as severe. The severity index of the respective year is increased by one if the average severity classification across the weather stations during that month is greater than or equal to 0.5. For additional information, please refer to this notebook.

  • In addition to the hard facts coming from the other databases, I am interested in a more "subjective" measure. For this I am using an estimate of the number of newspaper articles published in the New York Times that contain the country name as a keyword.

  • The most recent addition to the data comes from the amazing GDELT Project. For each country, a normalised Goldstein Index is extracted. The Goldstein Index is "a numeric score from -10 to +10, capturing the theoretical potential impact that type of event will have on the stability of a country" (see the Data Format Documentation). This should complement and extend the information extracted from the New York Times article count.
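
The severity classification described for the climate data can be sketched roughly as follows. This is only an illustration, not the code from the notebook. The function names (`is_severe`, `severity_index`) and the input layout (a dictionary of monthly averages keyed by station, year, and month) are hypothetical. The sketch also assumes that a deviation in either direction counts, that the comparison is made against the same calendar month in past years, and that a station with fewer than two past years is never flagged as severe.

```python
from statistics import mean, stdev

THRESHOLD = 1.54  # number of standard deviations, as stated in the text


def is_severe(current, past, threshold=THRESHOLD):
    """Flag a monthly average that deviates strongly from past years.

    `current` is this year's monthly average for one station,
    `past` is the list of that station's averages in earlier years.
    """
    if len(past) < 2:
        return False  # assumption: too little history to judge
    spread = stdev(past)
    if spread == 0:
        return False  # assumption: no variation, nothing to compare against
    # Assumption: deviations above and below the mean both count.
    return abs(current - mean(past)) > threshold * spread


def severity_index(monthly_means, year):
    """Count the months of `year` in which at least half the stations are severe.

    `monthly_means` maps (station, year, month) to a monthly average
    (hypothetical layout; the daily values are assumed to be averaged already).
    """
    index = 0
    for month in range(1, 13):
        flags = []
        for (station, y, m), value in monthly_means.items():
            if y != year or m != month:
                continue
            # Averages of the same station and month in all earlier years.
            past = [
                v
                for (s, py, pm), v in monthly_means.items()
                if s == station and pm == month and py < year
            ]
            flags.append(is_severe(value, past))
        # Increase the index if the share of severe stations is >= 0.5.
        if flags and sum(flags) / len(flags) >= 0.5:
            index += 1
    return index
```

Calling `severity_index(monthly_means, 2005)` would return the number of months in 2005 for which at least half of the reporting stations were classified as severe. The same function would be applied separately to temperature, precipitation, and snowfall.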
