-
Notifications
You must be signed in to change notification settings - Fork 372
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add square kilometer parameter #821
base: master
Are you sure you want to change the base?
Conversation
This does not increase the SPARQL query fetch time in a significant way.
I see now that the It looks like the spec for the output shape file is based on the spec from the input shape file, but I don't see how to add a specific area_sqkm field. I've been playing with the I understand that I need to use this output shape file as the basis for the |
my minimal solution: 1.)
see: https://gist.github.com/ImreSamu/fe3abea82f388135671824d704b239e3
2.) you have to add 2 extra lines to ...
with open(args.input_csv, newline='') as csvfile:
reader = csv.DictReader(csvfile)
for row in reader:
qid = row['wd_id']
wddic[qid]['area_sqkm'] = row['area_sqkm'] # <--- add this line
wddic[qid]['population'] = row['population'] # <--- and this line
for d in row:
...
python3 addcolumns.py
./tools/wikidata/update.sh fetch uppercase 10m_cultural ne2_10m_admin_0_countries
./tools/wikidata/update.sh write uppercase 10m_cultural ne2_10m_admin_0_countries and I see the new values ...
disclaimer: |
Oh perfect, I was wondering where to add these fields! As for using WikiData as a stable resource, I noticed that the SPARQL query would feed in something like 255 country entities, but would get back something like 270 results back. This was due to an entity having multiple population values over different years, and rough disagreements in what "land" was. After much back and forth between tagishsimon and I, the best query we came up with was:
This just takes the last date of the population and uses that. You can see the entire discussion here Some questions --
|
Okay, here is an entire R script that:
#!/usr/bin/env R
args <- commandArgs(trailingOnly = TRUE)
if ((length(args) == 1) && args[[1]] == "--defaults"){
nev_shapefile="~/repos/_other/natural-earth-vector/10m_cultural/ne_10m_admin_0_map_subunits.shp"
nev_layername="ne_10m_admin_0_map_subunits"
geojsonfile="wikidata_shape_file.geojson"
digits_out = 4
keep_thresh = 0.04 ## Any smaller than 0.04 and we lose Andorra
message("Using default inputs:", paste0("\n - shape : ", nev_shapefile,
"\n - layer : ", nev_layername,
"\n - geoout: ", geojsonfile,
"\n - digits: ", digits_out,
"\n - keep : ", keep_thresh))
} else {
if (length(args) != 5){
stop("
Converts a SHP shapefile from the Natural Earth Vector project, deduplicates entities with shared wikidata ids (manually for now), pulls in population and area data, and produces a GeoJSON object with it.
\tUsage: preformat_shape.R <NEV.shp> <layer name> <geojson out> <decimal places out> <keep_threshold>
or
\t preformat_shape.R --defaults
")
}
nev_shapefile <- args[[1]]
nev_layername <- args[[2]]
geojsonfile <- args[[3]]
digits_out <- as.integer(args[[4]])
keep_thresh <- as.double(args[[5]])
}
library(tidyverse)
library(sf) ## library(rgdal) -- rgdal is retired, sf is the future
library(geojsonsf)
library(WikidataR)
library(rmapshaper) ## simplify objects
nev_shape <- read_sf(nev_shapefile, layer=nev_layername)
debugCountries <- function(shape_dat, identifier=NULL, isWID=FALSE, xlim=NULL, ylim=NULL){
## Plot and print metrics for an identifier (default NAME) otherwise if isWID
## is true then, it's a WIKIDATAID.
if (is.null(identifier)){
tmp <- shape_dat
} else {
if (isWID){ tmp <- shape_dat %>% filter(WIKIDATAID == identifier) }
else { tmp <- shape_dat %>% filter(NAME == identifier) }
}
plot(ggplot(tmp, aes(fill=SU_A3)) +
geom_sf() +
coord_sf(xlim = xlim, ylim = ylim, expand = FALSE) +
theme_minimal() + theme(legend.position="none"))
print(tmp %>% select(NAME, WIKIDATAID) %>% unique())
}
findDuplicateWikidatas <- function(){
## First proof to see if there are duplicate WIKIDATAIDs with on differing rows
dupe_ids <- (nev_shape %>% group_by(WIKIDATAID) %>% nest() %>%
mutate(n = map(data, function(x) length(x$NAME))) %>% unnest(n) %>%
filter(n > 1))$WIKIDATAID
##
nev_shape %>% filter(WIKIDATAID %in% dupe_ids) %>%
select(c(WIKIDATAID, NAME, SOVEREIGNT, POP_EST, POP_YEAR, SU_A3))
}
removeAndMergeDupes <- function(delete_ids, merge_list){
ids_to_delete <- delete_ids
ids_to_merge <- merge_list
## First a shallow copy
filtered_data <- nev_shape
## then, we merge
for (pair in ids_to_merge){
rowid1 <- which(filtered_data$SU_A3 == pair[1])
rowid2 <- which(filtered_data$SU_A3 == pair[2])
if ((length(rowid1) != 1) || (length(rowid2) != 1)){
stop(paste("More than one record for either:", pair[1], "or", pair[2]))
}
## We merge specific data fields, POP_EST and SU_A3
filtered_data[rowid1,]$POP_EST <- filtered_data[rowid1,]$POP_EST + filtered_data[rowid2,]$POP_EST
filtered_data[rowid1,]$SU_A3 <- paste0(filtered_data[rowid1,]$SU_A3, "+",
filtered_data[rowid2,]$SU_A3)
## and geometry:
## print(to_merge %>% select(NAME, SU_A3))
merged_geometry <- st_combine(filtered_data[c(rowid1,rowid2),]$geometry)
print(merged_geometry)
filtered_data[rowid1,]$geometry <- merged_geometry
## And now add the second ID to the delete list
ids_to_delete <- c(ids_to_delete, pair[2])
message("Merged: ", pair[1], " and ", pair[2])
}
## Now we delete
message("Deleting ids: ", paste0(ids_to_delete, collapse=" "))
filtered_data <- filtered_data %>% filter(!(SU_A3 %in% ids_to_delete))
return(filtered_data)
}
findDuplicateWikidatas()
## WIKIDATAID NAME SOVEREIGNT POP_EST POP_YEAR SU_A3
## 1 Q331990 Korean DMZ (south) South Korea 2.18e2 0 KNX <- border, delete
## 2 Q331990 Korean DMZ (north) North Korea 0 2019 KNZ <- border, delete
## 3 Q159 Russia Russia 3.11e7 2019 RUA <- Large, merge
## 4 Q159 Russia Russia 1.10e8 2019 RUE <- Large, merge
## 5 Q884 South Korea South Korea 5.02e7 2017 KOX <- keep
## 6 Q884 South Korea South Korea 1.55e6 2017 KXI <- tiny islands off KOX, delete
## 7 Q1042 Seychelles Seychelles 9.76e4 2019 SYC <- keep
## 8 Q1042 Fr. S. Antarctic Lands France 2 e1 2017 FSA <- tiny islands, delete
## 9 Q43296 Spratly Is. Spratly Islands 1 e2 2017 PGA <- tiny islands, delete
## 10 Q43296 Wake Atoll United States of Am… 6.8 e1 2017 WQI <- tiny islands, delete
## Test the above with the debugCountries function
##debugCountries("Russia")
##debugCountries("Q43296", T)
##debugCountries(subset_shapedata, xlim=c(0,6), ylim=c(40,46)) ## Andorra
ids_to_delete <- c("KNX", "KNZ", "KXI", "FSA", "PGA", "WOI")
ids_to_merge <- list(c("RUA", "RUE")) ## Second merged into first, second deleted
subset_shapedata <- removeAndMergeDupes(ids_to_delete, ids_to_merge)
message("From: ", paste0(dim(nev_shape), collapse=" "))
message(" to: ", paste0(dim(subset_shapedata), collapse=" "))
# Getting Pop and Area Data
## Extract the ID's to format a query
wiki_ids <- subset_shapedata$WIKIDATAID
## Generate the query template:
query_template <- '
SELECT DISTINCT ?i ?iLabel ?population ?area_sqkm ?popstatement ?date
WITH { SELECT ?i (MAX(?date_) as ?date) WHERE
{
VALUES ?i { ENTITIES_LIST }
## Get population dates for maxing purposes
OPTIONAL {
?i p:P1082 ?popstatement.
?popstatement a wikibase:BestRank .
?popstatement pq:P585 ?date_ .
}
} GROUP BY ?i } as %i
WHERE
{
INCLUDE %i
OPTIONAL { ## Population, if it exists
?i p:P1082 ?popstatement.
?popstatement a wikibase:BestRank .
?popstatement ps:P1082 ?population .
?popstatement pq:P585 ?date .
}
OPTIONAL { ## Area, if it exists
?i p:P2046 ?areastatement .
?areastatement a wikibase:BestRank .
MINUS { ?areastatement pq:P518 wd:Q187223. } ## Filter out lagoons
MINUS { ?areastatement pq:P518 wd:Q9259. } ## Filter out UNESCO sites
MINUS { ?areastatement pq:P518 wd:Q64364418. } ## Filter out buffer zones
MINUS { ?areastatement pq:P1012 wd:Q389717. } ## Filter out contested Islands
?areastatement psn:P2046/wikibase:quantityAmount ?area_norm .
BIND( ROUND(?area_norm/1000000) as ?area_sqkm) .
}
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}'
## Feed in the replacements
query_string <- gsub("ENTITIES_LIST",
paste0("wd:", paste0(wiki_ids, collapse=" wd:")),
query_template)
wiki_data <- query_wikidata(query_string)
## Test to see if you got 1:1 inputs to outputs
message("Lengths match? ", length(wiki_ids) == nrow(wiki_data))
##> True
# Now we unite the data with the shapedata
nev_merge <- merge(
subset_shapedata,
wiki_data,
by.x="WIKIDATAID", by.y="i") %>%
relocate(WIKIDATAID, .after = last_col()) %>%
## Capitalize new columns
mutate(NAME2 = iLabel, POPULATION = population, AREA_SQKM = area_sqkm, SCALE=scalerank) %>%
## Specific columns
select(c(WIKIDATAID, NAME, POPULATION, AREA_SQKM, CONTINENT, REGION_UN, SU_A3,
MIN_ZOOM, MIN_LABEL, MAX_LABEL, LABEL_X, LABEL_Y, NE_ID))
## Finally write out a geojson object to the specified simplicity
nev_merge %>% sf_geojson(digits=digits_out) %>% write_file(geojsonfile)
if (keep_thresh > 0){
message("Simplifying")
file.copy(geojsonfile, paste0("original_", geojsonfile))
tmp <- ms_simplify(nev_merge, keep = keep_thresh, keep_shapes = FALSE)
tmp %>% sf_geojson(digits=digits_out) %>% write_file(geojsonfile)
}
|
Related to #820 where I was asked to vet my SPARQL query against the wikidata profi's.
I got a lot of help from tagishsimon who helped me along the entire way.
This PR does not increase the SPARQL query fetch time in a significant way, and appends Area converted to Square Kilometers to two decimal places to the output CSV files.
From my tests, it produces the correct output CSV and SHP files via the fetch/write wiki python scripts.
However, I've not been able to see any of these changes into the geojson data.