Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add square kilometer parameter #821

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Conversation

mtekman
Copy link

@mtekman mtekman commented Nov 11, 2022

Related to #820 where I was asked to vet my SPARQL query against the wikidata profi's.
I got a lot of help from tagishsimon who helped me along the entire way.

This PR does not increase the SPARQL query fetch time in a significant way, and appends Area converted to Square Kilometers to two decimal places to the output CSV files.

From my tests, it produces the correct output CSV and SHP files via the fetch/write wiki python scripts.
However, I've not been able to see any of these changes into the geojson data.

This does not increase the SPARQL query fetch time in a significant way.
@mtekman
Copy link
Author

mtekman commented Nov 11, 2022

I see now that the area_sqkm field is added to the "new_names.csv" file, but not to the shape file.

It looks like the spec for the output shape file is based on the spec from the input shape file, but I don't see how to add a specific area_sqkm field. I've been playing with the generate_unique_stable_ids.py script, but all it does is create an automatically new field name "NE_ID1".

I understand that I need to use this output shape file as the basis for the write_wikidata.py script, but is there a way to specify that the area_sqkm field in the "new_names.csv" file should go to this new "NE_ID1" field?

@ImreSamu
Copy link
Collaborator

ImreSamu commented Nov 13, 2022

It looks like the spec for the output shape file is based on the spec from the input shape file, but I don't see how to add a specific area_sqkm field.

my minimal solution:

1.)
I have created a sample code for adding

  • area_sqkm
  • population

see: https://gist.github.com/ImreSamu/fe3abea82f388135671824d704b239e3

  • input: ./10m_cultural/ne_10m_admin_0_countries.shp
  • output: ./10m_cultural/ne2_10m_admin_0_countries.shp

2.)

you have to add 2 extra lines to tools/wikidata/write_wikidata.py

...
with open(args.input_csv, newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        qid = row['wd_id']
        wddic[qid]['area_sqkm'] = row['area_sqkm']           # <--- add this line
        wddic[qid]['population'] = row['population']              # <--- and this line
        for d in row:
          ...
  1. test
python3 addcolumns.py
./tools/wikidata/update.sh  fetch  uppercase   10m_cultural  ne2_10m_admin_0_countries
./tools/wikidata/update.sh  write  uppercase   10m_cultural  ne2_10m_admin_0_countries 

and I see the new values ...

  1. thinking 🤔
  • This is just a minimal working example of solving your issue
  • I am still thinking about what is the best implementation as a general and "future proof" ...
  • The ne2_10m_admin_0_countries already has a POP_EST and POP_YEAR columns .. for the population value
  • There are other values GDP_MD,GDP_YEAR - we can update from wikidata ..

disclaimer:
The final decision to accept this PR is not mine, it is the repo owner's, I just help to get the PR in the best shape.

@mtekman
Copy link
Author

mtekman commented Nov 13, 2022

Oh perfect, I was wondering where to add these fields!
So I've given you push access to my repo, please do any edits you see fit!

As for using WikiData as a stable resource, I noticed that the SPARQL query would feed in something like 255 country entities, but would get back something like 270 results back. This was due to an entity having multiple population values over different years, and rough disagreements in what "land" was.

After much back and forth between tagishsimon and I, the best query we came up with was:

SELECT DISTINCT ?i ?iLabel ?population ?area_sqkm ?popstatement ?date 
WITH { SELECT ?i (MAX(?date_) as ?date) WHERE
  {
   VALUES ?i {wd:Q884 wd:Q31057 wd:Q35555 wd:Q220982 wd:Q31063 wd:Q475038}
              ## Problematic entities with multiple population values
      OPTIONAL {                  ## Get population dates for maxing purposes
      ?i p:P1082 ?popstatement. 
      ?popstatement a wikibase:BestRank .
#      ?popstatement ps:P1082 ?population .
      ?popstatement pq:P585 ?date_ .
      }
   } GROUP BY ?i } as %i
WHERE
{
  INCLUDE %i
  OPTIONAL {                  ## Population, if it exists
      ?i p:P1082 ?popstatement. 
      ?popstatement a wikibase:BestRank .
      ?popstatement ps:P1082 ?population .
      ?popstatement pq:P585 ?date .
  }
  OPTIONAL {                  ## Area, if it exists
    ?i p:P2046 ?areastatement .
    ?areastatement a wikibase:BestRank .     
    MINUS { ?areastatement pq:P518 wd:Q187223. }     ## Filter out lagoons
    MINUS { ?areastatement pq:P518 wd:Q9259. }       ## Filter out UNESCO sites
    MINUS { ?areastatement pq:P518 wd:Q64364418. }   ## Filter out buffer zones
    MINUS { ?areastatement pq:P1012 wd:Q389717. }    ## Filter out contested Islands
    ?areastatement psn:P2046/wikibase:quantityAmount ?area_norm .
    BIND( ROUND(?area_norm/1000000) as ?area_sqkm) . 
  }
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
} ORDER BY ASC(?iLabel)

Run it!

This just takes the last date of the population and uses that. You can see the entire discussion here


Some questions --

  • how were the initial POP_RANK GDP etc populated? What were the sources?
  • If we go down the SPARQL route for populating these values, should we always assume a 1:1 input between input and ouput values (e.g. For 1 country entity we get 1 population value, 1 GDP value, 1 land value, etc)?
    • Or should we record multiple values over successive years and embed this somehow in the shapefile?

@mtekman
Copy link
Author

mtekman commented Nov 16, 2022

Okay, here is an entire R script that:

  1. takes one of the shapefile objects, queries it wikidataids,
  2. Finds duplicate entries
  3. (after manual curation) collapses duplicate entities and their geometries,
  4. Filters for desired fields
  5. Simplifies geometries to save space
#!/usr/bin/env R
args <- commandArgs(trailingOnly = TRUE)

if ((length(args) == 1) && args[[1]] == "--defaults"){
    nev_shapefile="~/repos/_other/natural-earth-vector/10m_cultural/ne_10m_admin_0_map_subunits.shp"
    nev_layername="ne_10m_admin_0_map_subunits"
    geojsonfile="wikidata_shape_file.geojson"
    digits_out = 4 
    keep_thresh = 0.04      ## Any smaller than 0.04 and we lose Andorra

    message("Using default inputs:", paste0("\n - shape : ", nev_shapefile,
                                            "\n - layer : ", nev_layername,
                                            "\n - geoout: ", geojsonfile,
                                            "\n - digits: ", digits_out,
                                            "\n - keep  : ", keep_thresh))
    
} else {
    if (length(args) != 5){
        stop("
Converts a SHP shapefile from the Natural Earth Vector project, deduplicates entities with shared wikidata ids (manually for now), pulls in population and area data, and produces a GeoJSON object with it.

\tUsage: preformat_shape.R <NEV.shp> <layer name> <geojson out> <decimal places out> <keep_threshold>
                         or
\t       preformat_shape.R --defaults

")
    }

    nev_shapefile <- args[[1]]
    nev_layername <- args[[2]]
    geojsonfile <- args[[3]]
    digits_out <- as.integer(args[[4]])
    keep_thresh <- as.double(args[[5]])

}

library(tidyverse)
library(sf) ## library(rgdal) -- rgdal is retired, sf is the future
library(geojsonsf)
library(WikidataR)
library(rmapshaper) ## simplify objects

nev_shape <- read_sf(nev_shapefile, layer=nev_layername)

debugCountries <- function(shape_dat, identifier=NULL, isWID=FALSE, xlim=NULL, ylim=NULL){
    ## Plot and print metrics for an identifier (default NAME) otherwise if isWID
    ## is true then, it's a WIKIDATAID.
    if (is.null(identifier)){
        tmp <- shape_dat
    } else {
        if (isWID){ tmp <- shape_dat %>% filter(WIKIDATAID == identifier) }
        else { tmp <- shape_dat %>% filter(NAME == identifier) }
    }
    plot(ggplot(tmp, aes(fill=SU_A3)) +
         geom_sf() +
         coord_sf(xlim = xlim, ylim = ylim, expand = FALSE) +
         theme_minimal() + theme(legend.position="none"))
    print(tmp %>% select(NAME, WIKIDATAID) %>% unique())
}

findDuplicateWikidatas <- function(){
    ## First proof to see if there are duplicate WIKIDATAIDs with on differing rows
    dupe_ids <- (nev_shape %>% group_by(WIKIDATAID) %>% nest() %>%
             mutate(n = map(data, function(x) length(x$NAME))) %>% unnest(n) %>%
             filter(n > 1))$WIKIDATAID
    ## 
    nev_shape %>% filter(WIKIDATAID %in% dupe_ids) %>%
        select(c(WIKIDATAID, NAME, SOVEREIGNT, POP_EST, POP_YEAR, SU_A3))
}

removeAndMergeDupes <- function(delete_ids, merge_list){
    ids_to_delete <- delete_ids
    ids_to_merge <- merge_list

    ## First a shallow copy
    filtered_data <- nev_shape

    ## then, we merge
    for (pair in ids_to_merge){
        rowid1 <- which(filtered_data$SU_A3 == pair[1])
        rowid2 <- which(filtered_data$SU_A3 == pair[2])
        if ((length(rowid1) != 1) || (length(rowid2) != 1)){
            stop(paste("More than one record for either:", pair[1], "or", pair[2]))
        }
        ## We merge specific data fields, POP_EST and SU_A3
        filtered_data[rowid1,]$POP_EST <- filtered_data[rowid1,]$POP_EST + filtered_data[rowid2,]$POP_EST
        filtered_data[rowid1,]$SU_A3 <- paste0(filtered_data[rowid1,]$SU_A3, "+",
                                               filtered_data[rowid2,]$SU_A3)
        ## and geometry:
        ## print(to_merge %>% select(NAME, SU_A3))
        merged_geometry <- st_combine(filtered_data[c(rowid1,rowid2),]$geometry)
        print(merged_geometry)
        filtered_data[rowid1,]$geometry <- merged_geometry

        ## And now add the second ID to the delete list
        ids_to_delete <- c(ids_to_delete, pair[2])
        message("Merged: ", pair[1], " and ", pair[2])
    }

    ## Now we delete
    message("Deleting ids: ", paste0(ids_to_delete, collapse=" "))
    filtered_data <- filtered_data %>% filter(!(SU_A3 %in% ids_to_delete))

    return(filtered_data)
}
findDuplicateWikidatas()
##    WIKIDATAID NAME                   SOVEREIGNT           POP_EST POP_YEAR SU_A3
##  1 Q331990    Korean DMZ (south)     South Korea           2.18e2        0 KNX  <- border, delete
##  2 Q331990    Korean DMZ (north)     North Korea           0          2019 KNZ  <- border, delete
##  3 Q159       Russia                 Russia                3.11e7     2019 RUA  <- Large, merge
##  4 Q159       Russia                 Russia                1.10e8     2019 RUE  <- Large, merge
##  5 Q884       South Korea            South Korea           5.02e7     2017 KOX  <- keep
##  6 Q884       South Korea            South Korea           1.55e6     2017 KXI  <- tiny islands off KOX, delete
##  7 Q1042      Seychelles             Seychelles            9.76e4     2019 SYC  <- keep
##  8 Q1042      Fr. S. Antarctic Lands France                2   e1     2017 FSA  <- tiny islands, delete
##  9 Q43296     Spratly Is.            Spratly Islands       1   e2     2017 PGA  <- tiny islands, delete
## 10 Q43296     Wake Atoll             United States of Am…  6.8 e1     2017 WQI  <- tiny islands, delete

## Test the above with the debugCountries function
##debugCountries("Russia")
##debugCountries("Q43296", T)
##debugCountries(subset_shapedata, xlim=c(0,6), ylim=c(40,46))  ## Andorra

ids_to_delete <- c("KNX", "KNZ", "KXI", "FSA", "PGA", "WOI")
ids_to_merge <- list(c("RUA", "RUE"))  ## Second merged into first, second deleted
subset_shapedata <- removeAndMergeDupes(ids_to_delete, ids_to_merge)

message("From: ", paste0(dim(nev_shape), collapse=" "))
message("  to: ", paste0(dim(subset_shapedata), collapse=" "))

                                        # Getting Pop and Area Data
## Extract the ID's to format a query
wiki_ids <- subset_shapedata$WIKIDATAID

## Generate the query template:
query_template <- '
SELECT DISTINCT ?i ?iLabel ?population ?area_sqkm ?popstatement ?date
WITH { SELECT ?i (MAX(?date_) as ?date) WHERE
  {
   VALUES ?i { ENTITIES_LIST }
      ## Get population dates for maxing purposes
      OPTIONAL {
      ?i p:P1082 ?popstatement.
      ?popstatement a wikibase:BestRank .
      ?popstatement pq:P585 ?date_ .
      }
   } GROUP BY ?i } as %i
WHERE
{
  INCLUDE %i
  OPTIONAL {                  ## Population, if it exists
      ?i p:P1082 ?popstatement.
      ?popstatement a wikibase:BestRank .
      ?popstatement ps:P1082 ?population .
      ?popstatement pq:P585 ?date .
  }
  OPTIONAL {                  ## Area, if it exists
    ?i p:P2046 ?areastatement .
    ?areastatement a wikibase:BestRank .
    MINUS { ?areastatement pq:P518 wd:Q187223. }     ## Filter out lagoons
    MINUS { ?areastatement pq:P518 wd:Q9259. }       ## Filter out UNESCO sites
    MINUS { ?areastatement pq:P518 wd:Q64364418. }   ## Filter out buffer zones
    MINUS { ?areastatement pq:P1012 wd:Q389717. }    ## Filter out contested Islands
    ?areastatement psn:P2046/wikibase:quantityAmount ?area_norm .
    BIND( ROUND(?area_norm/1000000) as ?area_sqkm) .
  }
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}'

## Feed in the replacements
query_string <- gsub("ENTITIES_LIST",
                     paste0("wd:", paste0(wiki_ids, collapse=" wd:")),
                     query_template)
wiki_data <- query_wikidata(query_string)

## Test to see if you got 1:1 inputs to outputs
message("Lengths match? ", length(wiki_ids) == nrow(wiki_data))
##> True

                                        # Now we unite the data with the shapedata
nev_merge <- merge(
    subset_shapedata,
    wiki_data,
    by.x="WIKIDATAID", by.y="i") %>%
    relocate(WIKIDATAID, .after = last_col()) %>%
    ## Capitalize new columns
    mutate(NAME2 = iLabel, POPULATION = population, AREA_SQKM = area_sqkm, SCALE=scalerank) %>%
    ## Specific columns
    select(c(WIKIDATAID, NAME, POPULATION, AREA_SQKM, CONTINENT, REGION_UN, SU_A3,
             MIN_ZOOM, MIN_LABEL, MAX_LABEL, LABEL_X, LABEL_Y, NE_ID))

## Finally write out a geojson object to the specified simplicity
nev_merge %>% sf_geojson(digits=digits_out) %>% write_file(geojsonfile)
if (keep_thresh > 0){
    message("Simplifying")
    file.copy(geojsonfile, paste0("original_", geojsonfile))
    tmp <- ms_simplify(nev_merge, keep = keep_thresh, keep_shapes = FALSE)
    tmp %>% sf_geojson(digits=digits_out) %>% write_file(geojsonfile)
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants