Adding support of Gulf of Alaska Data Portal #132

Closed · brunj7 opened this issue Apr 27, 2021 · 66 comments
Labels: enhancement (New feature or request)

@brunj7 (Collaborator) commented Apr 27, 2021

The LTER NGA site is using a specific data repository: https://gulf-of-alaska.portal.aoos.org/#

We need to add support for it to metajam. This data repository is part of DataONE, so we can still rely on the DataONE API to access the content. The biggest change is that this repository does not use the EML metadata standard.

We will thus need to:

  • add detection of the metadata type
  • write the parsing of this metadata standard into a table format, matching the fields that are used when parsing EML

Data for test: https://doi.org/10.24431/rw1k45w

This should be done on the dev branch.
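One lightweight way to do the detection step (a sketch, not existing metajam code) would be to read the formatId from the DataONE system metadata before deciding how to parse:

# assumes `cn` is a DataONE node (e.g. dataone::CNode("PROD")) and
# `meta_id` is the PID of the metadata object
sysmeta <- dataone::getSystemMetadata(cn, meta_id)
sysmeta@formatId
# EML documents report e.g. "eml://ecoinformatics.org/eml-2.1.1",
# while this repository reports "http://www.isotc211.org/2005/gmd"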

@brunj7 added the enhancement label Apr 27, 2021
@kristenpeach (Collaborator)

Progress

I am going to do a little exploring in RStudio on my computer, but I will work on the dev branch once I have a sense of where the changes need to be made.

The first error I run into is when I try to use the download_d1_data() function (which I expected). I am just going to take some notes here to keep track of my thought process.

I am used to just copying the URL of the download button of the dataset. But for this one the metadata is in a separate file from the csv file, so I think I need to download multiple files. In any case, it looks like changes need to be made to the read_d1_files.R file; that is where it makes sense to insert an if statement on the metadata type. Similarly, we need to make a function analogous to "tabularize_eml()" for this other type of metadata.

I see that someone has added a 'To do' line to the 'download_d1_data.R' file:

download_d1_data <- function(data_url, path, dir_name = NULL) {
  # TODO: add meta_doi to explicitly specify doi

This seems like it would solve part of the problem, because we could supply the doi/url of the metadata doc separately from the data.
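Something like this is what I imagine the TODO means (a sketch; meta_doi is the hypothetical argument named in the TODO, not a current metajam parameter):

download_d1_data <- function(data_url, path, dir_name = NULL, meta_doi = NULL) {
  if (!is.null(meta_doi)) {
    # caller told us exactly which metadata document to use
    meta_id <- utils::URLdecode(meta_doi)
  } else {
    # fall back to the existing lookup of the metadata that documents data_url
    # ...
  }
  # ... rest of the existing function ...
}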

I am revisiting the metadata.R doc from arcticdatautils because I know that there are some creative functions in there for XML/GMD files.

@kristenpeach (Collaborator)

Progress

I'm having a pretty hard time knowing where to start with this. I'm getting the feeling I am making it harder than it needs to be. The metadata file is a gmd file (format type = http://www.isotc211.org/2005/gmd), so I assume I need to change multiple files in the metajam package to allow files with that format type, but also maybe all of the other gmd format types as well?
[Screenshot: list of ISO/gmd metadata format types]

Will add any breakthroughs later in the day

@mbjones (Member) commented Apr 30, 2021

@kristenpeach Those other format types represent different variants of the ISO 19115 family that are in use specifically for NOAA and Pangaea. You should be able to work against vanilla gmd metadata and have it work for all three, but note that the other two have additional changes that make them not validate under the original schema. But they are 99% the same.

One challenge you will likely have with gmd is that it doesn't generally represent entity- and attribute-level metadata in the same way as EML, and it's not really built to support multiple data entities (e.g., tables, raster images) in a single ISO document. We've been discussing this wrt how we do metadata completeness checking in MetaDIG, and there are no easy answers. A lot of ISO documents describe a Dataset as a whole without providing the details needed to parse the data files. Happy to discuss on slack if you'd like.

@kristenpeach (Collaborator)

Progress

Explored the use of the geometa package (https://github.com/eblondel/geometa/wiki) to convert non-eml to eml within the metajam package. If this works well we can just insert an if statement early on so that metajam can convert any non-eml to eml and then proceed normally. Here is an excerpt from the geometa package Wiki on its ability to convert metadata:

"4.3 Convert metadata objects from/to other metadata languages (mapping)
geometa offers the capacity to convert objects from/to other metadata languages. The object is to provide a generic interoperable mechanism to convert R metadata objects from one metadata standard to another.

At now the focus was given on the mapping between ISO/OGC metadata (modeled in R by geometa) covering core business metadata elements with two widely used metadata formats which are:

NetCDF-CF Conventions - Climate and Forecast conventions - (modeled in R with ncdf4)
EML (Ecological Metadata Language) (modeled in R with EML and emld)"

@kristenpeach (Collaborator)

Progress

Julien helped me figure out where to start. I am going to work on the download_d1_data function rather than the download_d1_data_pkg function. The gist of the issue is that if the XML file in the package is not in eml, the download_d1_data() function (paired with the read_d1_files function, as seen in the download-single vignette of metajam) produces a list of length 2 that includes summary_metadata and data. In comparison, when the download_d1_data and read_d1_files functions are applied to data associated with an XML file written in eml, they produce a list of length 3 that includes attribute_metadata, summary_metadata, and data. So we need to make the metajam functions produce attribute metadata for gmd XML files.

This line of the download_d1_data.R file feels like a good break point to determine whether an XML doc is eml or gmd. We need to determine the class of the meta_obj before passing it to as_emld:

eml <- tryCatch({emld::as_emld(meta_obj, from = "xml")},  # identify XML file and use it to make an EML object
                error = function(e) {NULL})

When I use download_d1_data_pkg() to retrieve the file 'nga_SKQ201810S_seabird_survey_data_L0.csv' from the sample package, it can find the metadata and says that it is eml. That is because the as_emld function coerces any input into an emld object. As we predicted, it just fails to produce attribute-level metadata, but it does parse data and summary metadata fine.


# Directory to save the data set
path_folder2 <- "DataOne_test2"

# URL to download the dataset from DataONE
data_url <- "https://cn.dataone.org/cn/v2/resolve/81b1aecf-329a-48d4-b706-a39c1607e067"

# Create the local directory to download the datasets
dir.create(path_folder2, showWarnings = FALSE)

# Download the dataset and associated metadata
data_folder <- metajam::download_d1_data(data_url, path_folder2)

example_data <- metajam::read_d1_files(data_folder)

example_data$summary_metadata$name

example_data$data$Species


So before I use the as_emld function I need to insert a split where I determine the class of the metadata object. I should be able to use the document type declared at the top of the XML document (e.g. via the xml2 package) to determine the class.
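Something like this split is what I have in mind (a sketch; it assumes checking the root element name is enough, since gmd docs open with MD_Metadata and EML docs with eml):

doc  <- xml2::read_xml(rawToChar(meta_obj))  # meta_obj is the raw vector from dataone::getObject()
root <- xml2::xml_name(doc)                  # local root name: "eml" for EML, "MD_Metadata" for gmd/ISO
if (root == "eml") {
  eml <- emld::as_emld(meta_obj, from = "xml")
} else {
  # branch into the ISO/gmd handling
}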

I am getting a little hung up on testing geometa because, to use the convert_metadata() function of geometa, you need to input the appropriate format id of the object and then the format id of what you want it to be. There is a helper function in geometa called getMappingFormats() to help you pick the right format id, but it returns NULL, and there is no R script for it in the geometa package on Github, so I can't poke around to find what it is supposed to return. I could certainly make an educated guess, but it looks like this function is still in development and not ready for use (https://rdrr.io/github/eblondel/geometa/man/convert_metadata.html).

@mbjones (Member) commented May 5, 2021

@kristenpeach the code for getMappingFormats() in the geometa package is at https://github.com/eblondel/geometa/blob/master/R/geometa_mapping.R#L130

@kristenpeach (Collaborator)

Thank you! @mbjones

@kristenpeach (Collaborator)

Progress

I tried various configurations of inserting geometa into the existing download_d1_data.R code at that diversion point I discussed in the comment above. Before I fuss with it any further, here is where I am:

If I replace this:

# eml <- tryCatch({emld::as_emld(meta_obj, from = "xml")},  # identify XML file and use it to make an EML object
#                 error = function(e) {NULL})

with this:
out_eml <- geometa::convert_metadata(meta_obj, from = "geometa|iso-19115-1", to = "eml",
                                     mappings = geometa::getMappings(), verbose = FALSE)

eml <- emld::as_emld(out_eml)

And then run the whole download_d1_data.R with a data_url for a dataset I know has that geometa|iso-19115-1 XML file type, it feels like I should be getting attribute data in the output. Based on the wiki I am not sure if I need to generate the data.frame with the metadata mapping rules for this conversion, or if this is one of the conversions that already has mapping rules built into the function. This makes me think I should expect 100% coverage? Going to revisit in an hour with a fresh brain.
[Screenshot: table of metadata formats supported by geometa, including geometa|iso-19115-1 and geometa|iso-19115-2]

@kristenpeach (Collaborator)

Progress

I realized you need to set pretty = FALSE to get the getMappingFormats() function to work. But geometa::getMappingFormats(pretty = FALSE) will show the available metadata formats that geometa can convert. Based on that function there are two flavors of ISO XML supported by geometa and they are 'geometa|iso-19115-1' (row 2 of the table above) and 'geometa|iso-19115-2' (row 3 of the table above).

For the example data package I have been using (https://search.dataone.org/view/10.24431/rw1k45w), the file named "Metadata: Marine bird survey observation and density data from Northern Gulf of Alaska LTER cruises, 2018" has a listed file type = http://www.isotc211.org/2005/gmd. The first line of the XML doc is "gmd:MD_Metadata xmlns:gmd="http://www.isotc211.org/2005/gmd". Does the "2005" element indicate that it corresponds to the XML format id in row 1 of the table above? Because that format does not appear in the acceptable formats listed by geometa::getMappingFormats() (perhaps because this function is still under development?). Alternatively, if the namespace is GMD, shouldn't it be the format type of row 2 of that table? Is it correct to assume that the "Supported" column of that table indicates the number of elements that can be converted? I think that's right. It looks like geometa is capable of mapping a lot of the attribute-level data we would want for GMD files. To convince yourself of this you can run geometa::getMappings() or go to https://github.com/eblondel/geometa/blob/master/inst/extdata/coverage/geometa_coverage_inventory.csv.

Description of convert_metadata function from geometa package:

#' @description \code{convert_metadata} is a tentative generic metadata converter to
#' convert from one source object, represented in a source metadata object model in R
#' (eg eml) to a target metadata object, represented in another target metadata object
#' model (eg \pkg{geometa} \code{\link{ISOMetadata}}). This function relies on a list of
#' mapping rules defined to operate from the source metadata object to the target metadata
#' object. This list of mapping rules is provided in a tabular format. A version is embedded
#' in \pkg{geometa} and can be returned with \code{\link{getMappings}}

It feels like the problem lies in the mappings parameter of the convert_metadata function. The table called by geometa::getMappings() does not have column names that correspond to the format ids that you can list as the other parameters. So it seems like maybe it's not connecting the 'from = "geometa|iso-19115-1"' parameter in the function to the 'geometa' column in the getMappings() table? Or I am totally wrong and it's just not running because of some coding error on my part.
Another idea. The line before the lines I am changing in download_d1_data is this: meta_obj <- dataone::getObject(d1c@mn, meta_id). I know that this line runs correctly for both the eml file I am testing and the xml file I am testing. The getObject function returns a meta_obj in raw format, so I think convert_metadata() is expecting a parsed XML object rather than raw bytes. I will try to experiment with that.

out_eml <- geometa::convert_metadata(meta_obj, from = "geometa|iso-19115-1", to = "eml", mappings = geometa::getMappings(), verbose = FALSE)

The code for the convert_metadata function begins at Line 720: https://github.com/eblondel/geometa/blob/master/R/geometa_mapping.R#L130

@kristenpeach (Collaborator)

Progress

I tried to give the convert_metadata function xml files in different formats to see if it would work. Realized I don't understand what most of the functions in the XML package actually do so spent some time trying to understand them. Had to shift to other projects mid-day to avoid throwing my computer out the window. I will return to it re-energized from the weekend on Monday!

https://www.youtube.com/watch?v=1cM_ZNZ9hhE

http://www.cse.chalmers.se/~chrdimi/downloads/web/getting_web_data_r4_parsing_xml_html.pdf

https://www.rdocumentation.org/packages/xml2/versions/1.3.2

@kristenpeach (Collaborator)

Progress

I tested the convert_metadata() function with an eml metadata object and converted it to ISO. It worked fine. But that does confirm what I thought about the function failing because my input for the metadata parameter was a meta_obj in raw format. In the example below, 'polaris17_permafrost' is a package of data, summary metadata, and attribute metadata pulled from the Arctic Data Center (eml). So the pivot point in the code (where we ask it to determine whether the XML file is in eml or ISO) needs to be before the creation of the meta_obj, not after.

test_meta_obj_eml <- polaris17_permafrost$attribute_metadata

out_eml <- geometa::convert_metadata(test_meta_obj_eml, from = "eml", to = "geometa|iso-19115-1", mappings = geometa::getMappings(), verbose = FALSE)

test2_meta_obj_eml <- polaris17_permafrost$summary_metadata

out_eml2 <- geometa::convert_metadata(test2_meta_obj_eml, from = "eml", to = "geometa|iso-19115-1", mappings = geometa::getMappings(), verbose = FALSE)

The XML file I pass to convert_metadata cannot be in raw format, which is the default of dataone::getObject. I converted it to an ISO XML object and now convert_metadata() runs, but the output object has mostly empty fields. That should be disappointing, but I am happy I got somewhere. Will continue in this direction the rest of the day.

meta_df <- rawToChar(dataone::getObject(d1c@mn, meta_id2))
meta_iso_xml <- XML::xmlTreeParse(meta_df)
metadata_nodes2 <- dataone::resolve(cn, meta_id2)

out_eml <- geometa::convert_metadata(meta_iso_xml, from = "geometa|iso-19115-1", to = "eml",
                                     mappings = geometa::getMappings(), verbose = FALSE)

eml <- emld::as_emld(out_eml)

@kristenpeach (Collaborator)

I think the trick will be using xmlToDataFrame() to make an object that is just the metadata node of the ISO meta_obj and inputting that to convert_metadata()

@kristenpeach (Collaborator)

More Progress

I was able to get a 'flat' data frame version of ISO XML to use as an input for convert_metadata() which seems to be the format it wants.

xml.dataframe <- flatxml::fxml_importXMLFlat("https://cn.dataone.org/cn/v2/resolve/2012b3a7-f6b0-4e46-b2fa-63bf4ae6ba25")

out_eml <- geometa::convert_metadata(xml.dataframe, from = "geometa|iso-19115-1", to = "eml", mappings = geometa::getMappings(), verbose = FALSE)

eml <- emld::as_emld(out_eml)

The convert_metadata function runs and produces all of the same elements it did when I tried it out in the reverse direction (eml to ISO) but it was basically just a big empty nested list. When I run the geometa_mapping.R doc it produces several versions of this warning message: "in method for ‘coerce’ with signature ‘"ISOMetadata","emld"’: no definition for class “ISOMetadata”". I still feel like I have a better sense of where the problem is than I did this morning though.

@kristenpeach (Collaborator)

@mbjones Have you gotten the geometa::convert_metadata() function to successfully convert ISO to eml? Or is that why you were talking with the maintainer eblondel? Whenever I try it, it fails to identify/map attribute-level metadata. I assume this is because of the differences in how attribute-level information is stored in gmd vs. eml. You warned me that I should only expect a partial translation using convert_metadata(); I just want to make sure this is what you meant.

@mbjones (Member) commented May 11, 2021

Hi @kristenpeach I have not tried it. I suspected that the conversion was incomplete via a quick scan of the documentation and code. My earlier conversations with the maintainer were about our contributing to the conversion, which we weren't able to do at the time. You are the first person I know of who has tried this extensively. You might find others who have used it through either 1) the #eml channel on the NCEAS Slack, or 2) the #im channel on the LTER Slack.

@kristenpeach (Collaborator)

Progress

To recap: we determined that lines 86-92 of the download_d1_data.R file are generally where the function starts to fail when the input is a data_url for data in this repo (https://search.dataone.org/view/10.24431/rw1k45w) or any repo that uses non-eml metadata. For the example below I used the data_url for the file named 'nga_TGX201809_seabird_processed_densities_L1.csv'. This code shows the method that produces the most complete eml object(s) so far (from non-eml metadata). None of them are great, but I think working from the meta_iso_xml object may be the most straightforward.

https://github.com/kristenpeach/metajam/blob/master/reprex_iso_xml_to_eml_GMD.R

I keep thinking we should be able to use arcticdatautils::pid_to_eml_entity() (https://github.com/NCEAS/arcticdatautils/blob/main/R/eml.R) because, the way it's written, it should work for any DataOne object. I have only used it for the ADC so I have always set the member node to 'adc'. So maybe if I set the DataOne member node (https://www.dataone.org/network/#list-of-member-repositories) to EITHER 'LTER Network Member Repository' or the 'Alaska Ocean Observing System', where this example data is originally housed (https://gulf-of-alaska.portal.aoos.org/), I might be able to use this function or something similar to it? Feels like there may at least be some good clues in the eml.R doc of arcticdatautils.

Asked if anyone has used the convert_metadata() function in the eml NCEAS slack channel

@kristenpeach (Collaborator)

Progress

After chatting with Jeanette and Bryce on slack I think we may need to reassess our goals for making metajam work with non-eml. We talked about me adding an issue to geometa to make sure I was not using convert_metadata() incorrectly or passing it parameters of the wrong class. But it looks like an issue for our problem (or a very similar problem) already exists (eblondel/geometa#169); it was posted in June 2020.

One option would be to make a totally new function that is analogous to download_d1_data.R but specific to ISO XML (or even specific to the flavor of ISO used by the repository of this one LTER site). Bryce's efforts with dataspice (https://github.com/ropenscilabs/dataspice#convert-to-eml) are probably a good place to start. We may have to use create_spice() to prompt ISO metajam users to do some manual entry for the attributes. That feels a little clunky, but if the primary immediate goal is to get it working for this one LTER site maybe it would be OK.

@brunj7 (Collaborator, Author) commented May 13, 2021

@kristenpeach Thank you for all the investigation on this and the reprex! This is all good progress.

I agree that we should probably focus on mapping what we can for the summary metadata table metajam produces and set the rest to NA. Some of the info comes from the D1 API (see below) so that should be OK.
So I think getting the title, an abstract, and a contact name from meta_iso_xml would already be great.

Field                             Provenance
Metadata_ID                       D1
Metadata_URL                      D1
Metadata_EML_Version              Metadata
File_Description                  Metadata
File_Label                        Metadata
Dataset_URL                       D1
Dataset_Title                     Metadata
Dataset_StartDate                 Metadata
Dataset_EndDate                   Metadata
Dataset_Location                  Metadata
Dataset_WestBoundingCoordinate    Metadata
Dataset_EastBoundingCoordinate    Metadata
Dataset_NorthBoundingCoordinate   Metadata
Dataset_SouthBoundingCoordinate   Metadata
Dataset_Taxonomy                  Metadata
Dataset_Abstract                  Metadata
Dataset_Methods                   Metadata
Dataset_People                    Metadata
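A minimal xml2/XPath sketch for pulling those three fields out of a gmd document (my sketch, assuming the standard gmd/gco namespaces; exact element paths can vary between ISO profiles):

ns  <- c(gmd = "http://www.isotc211.org/2005/gmd",
         gco = "http://www.isotc211.org/2005/gco")
doc <- xml2::read_xml(meta_raw)  # meta_raw = metadata document as a character string

get_txt <- function(xpath) {
  xml2::xml_text(xml2::xml_find_first(doc, xpath, ns))  # NA if the node is absent
}

title    <- get_txt(".//gmd:citation//gmd:title/gco:CharacterString")
abstract <- get_txt(".//gmd:abstract/gco:CharacterString")
contact  <- get_txt(".//gmd:CI_ResponsibleParty/gmd:individualName/gco:CharacterString")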

Then downloading the data should be fine since we are getting the filename from the sys metadata. As an example:

## Set Nodes ------------
data_id <- "a0b7cf1a-bdbf-407e-be30-4c4ebd7d2dfc"
data_nodes <- dataone::resolve(dataone::CNode("PROD"), data_id)
d1c <- dataone::D1Client("PROD", data_nodes$data$nodeIdentifier[[1]])
cn <- dataone::CNode()

data_sys <- suppressMessages(dataone::getSystemMetadata(d1c@cn, data_id))
data_name <- data_sys@fileName
out <- dataone::downloadObject(d1c, data_id, path = "~/Desktop")

So the existing code should work

@kristenpeach (Collaborator)

Thank you @brunj7 !

@kristenpeach (Collaborator) commented May 17, 2021

Progress

Fields for summary metadata that are not produced by as_emld: File_Description, Dataset_StartDate, Dataset_EndDate, Dataset_Location, Dataset_WestBoundingCoordinate, Dataset_EastBoundingCoordinate, Dataset_SouthBoundingCoordinate, Dataset_NorthBoundingCoordinate, and Dataset_Methods. The great news is as_emld does a great job of getting the most important summary metadata without any help. And each of these missing features HAS a corresponding field in ISO; they are just not exact matches. The Dataset_Methods section has multiple possible inputs, so I picked the one that made the most sense to me.

If you run this function (https://github.com/kristenpeach/metajam/blob/master/R/download_ISO_data.R) and then run this code you should see a more complete metadata output:

path_folder <- "DataOne_ISO_test"

data_url <- "https://cn.dataone.org/cn/v2/resolve/a0b7cf1a-bdbf-407e-be30-4c4ebd7d2dfc"

dir.create(path_folder, showWarnings = FALSE)

data_folder <- download_ISO_data(data_url, path_folder)

example_data <- metajam::read_d1_files(data_folder)

*You will need the utils functions and the check_version function from metajam, so those would need to be installed too.

@kristenpeach (Collaborator)

Progress

I worked on a few other projects today so not a ton of progress, but some. It seems like ISO is really customizable, so it's possible this won't work for other data packages. I went looking for another data set from the Alaska Ocean Observing System member node that had non-eml metadata so I could test out my mini function, and it looks like a lot of the other packages use EML, which is good. I found a package that used ISO metadata (https://search.dataone.org/view/10.24431%2Frw1k57t) and the function worked 99% as expected on this package. It found the Dataset_WestBoundingCoordinate, Dataset_SouthBoundingCoordinate, and Dataset_NorthBoundingCoordinate but not the Dataset_EastBoundingCoordinate? Will poke around to figure out if that is a function problem or a metadata problem.

path_folder <- "DataOne_ISO_test2_research_workspace"
data_url <- "https://cn.dataone.org/cn/v2/resolve/16c5847d-a2e4-435f-b174-cb81f9d35568"
dir.create(path_folder, showWarnings = FALSE)
data_folder <- download_ISO_data(data_url, path_folder)
example_data <- metajam::read_d1_files(data_folder)

After inspecting the metadata it does look like the east bounding coordinate is actually missing (rather than the function just failing to find it). So the mini-function worked exactly as I hoped it would. After looking at the raw metadata I am less sure that I picked the right entry for the Dataset_Methods field. Or rather, there are really multiple fields that should be concatenated together to populate that cell. To see what I mean, run the code above and then create the parsed XML to inspect: meta_iso_xml <- XML::xmlTreeParse(meta_raw). Tomorrow I will try a few other non-eml datasets and maybe test out a method for concatenating methods descriptions to provide a more complete overview in the summary_metadata output.

@kristenpeach (Collaborator)

Progress

I was trying to use my download_ISO_data.R function on a few other ISO data packages and kept getting this error at the getObject() point in the function:

meta_obj <- dataone::getObject(d1c@mn, meta_id)

Error in .local(x, ...) : get() error: Error in curl::curl_fetch_memory(url, handle = handle):
  server certificate verification failed. CAfile: none CRLfile: none

I tried it on my local R as well as on Aurora R and got the same error. From my Googling it seems like this may be something I can fix, but it may also be a dataone package level thing? I will update on other progress during our meeting tomorrow.

@brunj7 (Collaborator, Author) commented May 19, 2021

@gothub Have you encountered this problem before with the R dataone package?

@mbjones (Member) commented May 20, 2021

Just a guess here, but it's most likely an expired SSL certificate on the member repository. Given that you are retrieving a metadata object, and the DataONE CN keeps a copy of all metadata objects, you could fail over to the CN node by trying dataone::getObject(d1c@cn, meta_id) to see if that gets the object better. The other option is to use resolve to get the list of locations where that object is replicated, and try retrieving it from each until one succeeds.
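The failover could be as simple as (a sketch, reusing the d1c client from the snippets above):

meta_obj <- tryCatch(
  dataone::getObject(d1c@mn, meta_id),                     # member node first
  error = function(e) dataone::getObject(d1c@cn, meta_id)  # fall back to the CN copy
)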

@kristenpeach (Collaborator) commented May 20, 2021

Progress

Thank you for the help @mbjones ! When I run metadata_nodes <- dataone::resolve(cn, meta_id) I can see where it's replicated but I have not tried accessing it from those nodes yet.

Made a plan with Julien for next steps. I changed Metadata_EML_Version to Metadata_ISO_Version and populated that field using the meta_ISO_xml object. I realized it wasn't finding that correctly from the meta_obj because as_emld basically overrides any other metadata language: it squishes everything into EML format and then lists the metadata format as EML. I used the data_name and data_extension to fill the File_Label and File_Description. I think those might not be what I should be using to populate those fields, but I will have to run my eml example again to see what I should actually be trying to put there. But across all of these fields I am feeling good about how much information we can give the user.

[Screenshot: summary_metadata output for the ISO test dataset]

I'm a little surprised that it did not successfully find the Dataset_Location for this example, so I want to make sure I am using the best/broadest ISO xml location for this feature. Also, you can see that it accidentally snatches up some extra text for certain fields (the value for Dataset_People begins with "template"). I don't think this is a big deal, but I will keep an eye on it when I try other ISO datasets to make sure it's not scooping up too much extra stuff.

To Do

  • Check out how accurate the taxonomic_coverage field is on other data sets / make sure that field is also included in eml metadata
  • Add a tryCatch to the new function
  • Try to get an estimate of how many datasets in DataOne have ISO metadata (revisit the doc Matt sent)
  • Figure out if (generally speaking) a single member node is restricted to one metadata language (we already know there are some exceptions)
  • Compare File_Label and File_Description to the eml example
  • Make a new table (like Julien's above) but with an updated provenance column

@mbjones (Member) commented May 21, 2021

@kristenpeach the breakdown of metadata formats by repository on DataONE is a simple facet query:

https://cn.dataone.org/cn/v2/query/solr/?q=formatType:METADATA&fl=identifier,formatId&facet=true&facet.pivot=datasource,formatId&rows=0&wt=json

In those current results, check out ARCTIC and KNB for some nodes that support multiple types.
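If it helps, the same facet counts can be pulled straight into R; a sketch using jsonlite on the query URL above:

q <- paste0("https://cn.dataone.org/cn/v2/query/solr/",
            "?q=formatType:METADATA&rows=0&wt=json",
            "&facet=true&facet.pivot=datasource,formatId")
res    <- jsonlite::fromJSON(q)
pivots <- res$facet_counts$facet_pivot$`datasource,formatId`  # counts per repository, broken down by formatId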

@kristenpeach (Collaborator)

Progress

Thank you Matt!

I ended up removing the field for taxonomic coverage. I think that particular keyword field happened to have taxonomic information for the dataset I was looking at, but that was not going to be true for other datasets. Feels like it's better to have it blank than incorrect. I don't think File_Label is actually supposed to be the data file's format type, but I need to run a few more eml examples (it was blank on the first one I tried) to figure out exactly what it IS supposed to be. I made a provenance table like we talked about. This isn't a final version but kind of what I was thinking:
[Screenshot: draft provenance table for the summary metadata fields]

@kristenpeach (Collaborator)

Progress

Changed the Metadata_EML_Version/Metadata_ISO_Version feature to just Metadata_Version so that if anyone wanted to merge summary metadata tables produced by ISO and EML datasets they could do so (great idea Julien).

Julien also suggested that instead of keeping my work as its own ISO-specific function, we should slice the existing download_d1_data.R function in half and add an if statement: if the metadata is in ISO run function X, and if it is in EML run function Y. We agreed that those new language-specific functions wouldn't be exported to external users and would remain internal. The place it makes sense to me to do that is after this line:

meta_raw <- rawToChar(meta_obj)

The meta_raw object produced by eml metadata includes this string: "eml://ecoinformatics.org/eml". The meta_raw object produced by iso metadata includes this string: "http://www.isotc211.org/". I know that there are other types of ISO that may not play nicely with this "iso specific" function I wrote, but I will try to find a few of the different iso formats and run them to see if it totally fails. So something like this will go into the existing download_d1_data.R function, and then I will break off the ISO and eml specific tasks into their own functions:

if (grepl("eml://ecoinformatics.org/eml-", meta_raw) == FALSE) {
  warning("Metadata is in ISO format")
  new_dir <- download_ISO_data(meta_raw)  # add ISO function here
} else if (grepl("eml://ecoinformatics.org/eml-", meta_raw) == TRUE) {
  warning("Metadata is in EML format")
  new_dir <- download_EML_data(meta_raw)  # add EML function here
}

@kristenpeach (Collaborator) commented Jun 23, 2021

Progress

First attempt at a new vignette highlighting different provenance options for the data url depending on member node. Also the first steps toward two new use cases (one eml, one ISO) to showcase differences in output. I know we talked about putting that in the Wiki instead, so I'll put part of it here:

"## Summary
This vignette aims to showcase a use case using the 2 main functions of metajam - download_d1_data and read_d1_files to download one dataset from the DataOne data repository.

"## Note on data url provenance when using download_d1_data.R

There are two parameters required to run the download_d1_data.R function in metajam. One is the data url for the dataset you'd like to download. You can retrieve this by navigating to the data package of interest, right-clicking on the download data button, and selecting Copy Link Address.

For several DataOne member nodes (Arctic Data Center, Environmental Data Initiative, and the Knowledge Network for Biocomplexity), metajam users can retrieve the data url from either the 'home' site of the member node or from the DataOne instance of that same data package. For example, if you wanted to download this dataset:

Kelsey J. Solomon, Rebecca J. Bixby, and Catherine M. Pringle. 2021. Diatom Community Data from Coweeta LTER, 2005-2019. Environmental Data Initiative. https://doi.org/10.6073/pasta/25e97f1eb9a8ed2aba8e12388f8dc3dc.

You have two options for where to obtain the data url.

  1. You could navigate to this page on the Environmental Data Initiative site (https://doi.org/10.6073/pasta/25e97f1eb9a8ed2aba8e12388f8dc3dc ) and right-click on the CWT_Hemlock_Diatom_Data.csv link to retrieve this data url: https://portal.edirepository.org/nis/dataviewer?packageid=edi.858.1&entityid=15ad768241d2eeed9f0ba159c2ab8fd5

  2. You could find this data package on the DataOne site (https://search.dataone.org/view/https%3A%2F%2Fpasta.lternet.edu%2Fpackage%2Fmetadata%2Feml%2Fedi%2F858%2F1) and right-click the Download button next to CWT_Hemlock_Diatom_Data.csv to retrieve this data url: https://cn.dataone.org/cn/v2/resolve/https%3A%2F%2Fpasta.lternet.edu%2Fpackage%2Fdata%2Feml%2Fedi%2F858%2F1%2F15ad768241d2eeed9f0ba159c2ab8fd5

Both will work with metajam! You will get the same output either way.

We have not tested metajam's compatibility with the home sites of all DataOne member nodes. If you are using metajam to download data from a member node other than ADC, EDI, or KNB we highly recommend retrieving the data url from the DataOne instance of the package (example 2 above)."

Made a vignette with use cases for iso and eml datasets. I made it clear that there would be different outputs between the two but the vignette is pretty long at this point so a user would have to be pretty interested to scroll down and find that information.

@kristenpeach removed their assignment Jun 24, 2021
@kristenpeach (Collaborator)

I did not mean to remove my assignment?? Idk why it says that

@kristenpeach self-assigned this Jun 24, 2021
@kristenpeach (Collaborator)

Progress

The download_EML_data.R function that is called within the new download_d1_data function is not working right, so I've been working on debugging that. It feels like I will probably have it working by the end of the day tomorrow, but now that I've put that out there into the universe something will probably go terribly wrong.

@kristenpeach (Collaborator) commented Jun 25, 2021

Progress

So the data_url for the eml example dataset I am using appears to be the problem. It is a weirdly short data url: "https://cn.dataone.org/cn/v2/resolve/df35b.296.15"

The function fails at the metadata object creation stage (meta_obj <- dataone::getObject(mn, meta_id)) even with a valid mn.

This does not happen when I use other datasets from the same member node so maybe just a broken url for that particular dataset?

One problem I found and fixed is that there is a difference in the header documentation of the XML doc between eml versions. So while this works fine for some eml versions, it fails to detect other versions and then thinks they are ISO:

if (grepl("eml://ecoinformatics.org/eml-", meta_raw) == FALSE) {
  warning("Metadata is in ISO format")
  new_dir <- download_ISO_data(meta_raw, meta_obj, meta_id, data_id, metadata_nodes, mn, path = path)
} else if (grepl("eml://ecoinformatics.org/eml-", meta_raw) == TRUE) {
  warning("Metadata is in EML format")
  new_dir <- download_EML_data(meta_obj, meta_id, data_id, metadata_nodes, mn, path = path)
}

So I just simplified the string grepl was looking for to "ecoinformatics.org". Now I'll just have to also test a few more ISO cases to make sure that does not miraculously appear in the raw metadata of an iso xml.

I fixed a few bugs and now it works fine, except it is printing some extra messages I don't understand, so I'm trying to sort that out. Here are the printed messages from running an eml data url through the new package:

https://pasta.lternet.edu/package/data/eml/edi/853/1/1e02df107f9a4d5045bff3e4440ee202
is the latest version for identifier
https://pasta.lternet.edu/package/data/eml/edi/853/1/1e02df107f9a4d5045bff3e4440ee202

Downloading metadata https://pasta.lternet.edu/package/metadata/eml/edi/853/1 ...
Download metadata complete
Metadata is in EML format
New names:

  • exclusive -> exclusive...6
  • exclusive -> exclusive...8
    New names:
  • exclusive -> exclusive...6
  • exclusive -> exclusive...8
    New names:
  • exclusive -> exclusive...6
  • exclusive -> exclusive...8
    New names:
  • exclusive -> exclusive...6
  • exclusive -> exclusive...8
    summarise() ungrouping output (override with .groups argument)

Downloading data https://pasta.lternet.edu/package/data/eml/edi/853/1/1e02df107f9a4d5045bff3e4440ee202 ...
Download complete

The "New names:" bit is the unexpected part of that print out. The data and metadata all download as expected though.

@kristenpeach (Collaborator)

Progress

Tried to download data from a few more datasets with my new functions. Most went well but some did not.

The Gulf of Alaska data portal has some datasets with weird pids. I noted one in an update above but I will note it again here: https://search.dataone.org/view/df35b.298.15

This other dataset (https://search.dataone.org/view/urn%3Auuid%3A3249ada0-afe3-4dd6-875e-0f7928a4c171) had normal looking pids but I got an interesting error associated with my member node loop when I tried to download data from it

"Error in .local(x, ...) : get() error: Hazelcast Instance is not active!"

When I set the data url and run SMALL_download_d1_data.R line by line, the error happens in the member node loop (as I expected), in lines 88-101.

I thought this was a member node issue but I think it's a memory issue:

https://community.atlassian.com/t5/Bitbucket-questions/what-causes-Hazelcast-instance-to-become-inactive/qaq-p/80060

https://stackoverflow.com/questions/23293072/suddenly-im-getting-hazelcast-instance-is-not-active

I was having so many problems with the EML package on the server that I have been using my local R, and it seems like it may be a memory issue? So I tried again working in RStudio on Aurora and got the same error about downloading eml (see updates above). So I cleared up a bunch of memory on my laptop and tried again on my local R and got the same Hazelcast Instance error... some pages online are saying I should just wait a few minutes and try again, but I have tried a few times, even after terminating and restarting my RStudio session and clearing my cache. I have been trying to understand the help pages I linked above to solve the problem, but I am out of my depth here.

When I run my download_d1_data function on datasets that have previously worked fine it still works fine so I think the problem may be specific to the nodeid/member node I was trying to use as input for getObject. I did notice it was a new member node I had not seen before ("urn:node:mnUCSB1") and that the data_nodes list and the metadata_nodes list do not match, which I am sure is a problem. The more I look at the member node loop the less sure I am that it is doing what I think it's doing.

If you load the tidyverse and have the metajam::check_version function, metajam::utils function, metajam::tabularize_eml function, download_EML_data.R function and SMALL_download_d1_data.R function you can run this code and see the problem:

library(tidyverse)
path_folder <- "Data_test_Gulf_of_alaska"

# URL to download the dataset from DataONE
data_url <- "https://cn.dataone.org/cn/v2/resolve/urn%3Auuid%3Aae595730-172a-43d0-91f8-3173663d7dce"
dir.create(path_folder, showWarnings = FALSE)

# Download the dataset and associated metadata
data_folder <- SMALL_download_d1_data(data_url = data_url, path = path_folder)

Compare that to this which runs fine:

library(tidyverse)
path_folder <- "Data_test_eml"

# URL to download the dataset from DataONE
data_url <- "https://cn.dataone.org/cn/v2/resolve/https%3A%2F%2Fpasta.lternet.edu%2Fpackage%2Fdata%2Feml%2Fedi%2F853%2F1%2F1e02df107f9a4d5045bff3e4440ee202"
dir.create(path_folder, showWarnings = FALSE)

# Download the dataset and associated metadata
data_folder <- SMALL_download_d1_data(data_url = data_url, path = path_folder)

I think the root of the issue is the member node loop. I think if we really want people to be using the data_url from DataOne, EDI, KNB or ADC we should maybe force that within the function. Right below the first line of the code chunk below is the place we could do that. If we pull all possible instances of the data but then select only the member nodes we know work well ("urn:node:KNB", "urn:node:ARCTIC", "urn:node:EDI") from that list... then we may be able to skip the loop?

Lines 54-57 (https://github.com/kristenpeach/metajam/blob/master/R/SMALL_download_d1_data.R)

data_nodes <- dataone::resolve(dataone::CNode("PROD"), data_id)
d1c <- dataone::D1Client("PROD", data_nodes$data$nodeIdentifier[[1]])
all_mns <- c(data_nodes$data$nodeIdentifier)

@mbjones (Member) commented Jun 29, 2021

@kristenpeach Hazelcast is a software component that we use in DataONE and on some of our repositories, including the Gulf of Alaska Data Portal, the KNB, and the Arctic Data Center, among others. Hazelcast errors like you see above are indicators of a big problem on the repository and are not likely to be specific to one dataset. Let's get on slack with some of the devs and see what's up there.

@taojing2002

I increased the max memory allocation for tomcat from 2G to 4G. Then restarted tomcat.

@kristenpeach (Collaborator)

@mbjones Oh interesting! Thank you for jumping in, it looks like I was not going to solve that on my own. Thank you @taojing2002 !

@kristenpeach (Collaborator) commented Jul 9, 2021

Progress

I figured out (at least one of the reasons) why the function was working for some data packages with ISO metadata and not others. It looks like ISO metadata is not parsed exactly the same each time, so the 'place' where I found metadata version info for one data package ("doc.children.MD_Metadata.children.metadataStandardName.children.CharacterString.children.text.value") is not the same 'place' it is listed in others.

meta_iso_xml <- XML::xmlTreeParse(meta_raw)

metadata2 <- meta_iso_xml %>% unlist() %>% tibble::enframe()

ISO_type <- metadata2 %>% filter(name == "doc.children.MD_Metadata.children.metadataStandardName.children.CharacterString.children.text.value")

metadata <- metadata %>% mutate(value = ifelse(name == "@type", ISO_type$value, value))

Even when I ask it to look for something less specific like:

ISO_type <- metadata2 %>% filter(name %in% "metadataStandardName")

It often fails to find that.

In the 'main' function SMALL_download_d1_data.R I already have those lines of code that decide if the metadata is in ISO or eml. So we could just say anything that is passed to the download_ISO_data.R function has an xml.version of 'ISO' and anything that passes to the download_EML_data.R function has an xml.version of 'eml'.

But this was a good exercise because I am realizing that some of the other fields are not finding the info they are looking for either, because of slight differences in the iso xml 'location'. I think I can improve this slightly by making the filters less specific (%in% instead of ==), but I can at least write a warning message saying something about how some summary metadata may be absent if the metadata is this type of iso.
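For example, a partial-match version of that filter (my sketch; grepl instead of an exact name match) would tolerate the varying prefixes in the flattened node names:

ISO_type <- metadata2 %>%
  dplyr::filter(grepl("metadataStandardName", name, fixed = TRUE)) %>%
  dplyr::slice(1)  # if several nodes match, keep the first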

When I test the function, sometimes it fails because it cannot retrieve a metadata_obj from the member node selected from the list. I think it would be helpful to write up a full issue on this to go along with an informative error message, so that the user can try manually setting their mn to one of the alternatives in all_mns and try again. Mining some clues for how to proceed from some arcticdatautils functions. I'm wondering if we should write another function that lives outside of SMALL_download_d1_data.R and download_ISO_data.R (like utils.R) that checks whether an mn is valid. That way we could insert it into SMALL_download_d1_data.R with a message directing them how to try a different mn.

When I revert to the old way of finding the right mn (d1c@mn) it works fine. I'm wondering if we should just go back to that way and then add a thorough error message if it fails, directing the user to an issue on the Github with instructions for how to manually set the mn. I'm sure there is a more sophisticated way to try each mn programmatically though, so I'll look into that more first.

@kristenpeach (Collaborator)

Progress

Possible message draft for member node glitches: Data packages are often replicated on multiple member nodes. Some examples of member nodes include Atmospheric Radiation Measurement Data Center ("urn:node:ARM"), IEDA: Interdisciplinary Earth Data Alliance ("urn:node:IEDA_EARTHCHEM"), Nevada Research Data Center ("urn:node:NRDC"), and Knowledge Network for Biocomplexity ("urn:node:KNB"). Sometimes a dataset becomes unavailable on one of the several member nodes that host a copy of it. This can make download_d1_data.R fail because it tries the first member node listed as a possible member node when downloading the data. If you attempt to use the download_d1_data.R function to retrieve data from a DataOne repository and the function halts and you get one of the following errors:

Insert example of error(s) associated with this problem here

Then maybe adding some line like: If you encounter this problem please open a new issue on the metajam Github page and provide a minimal reproducible example of how you attempted to use download_d1_data.R. Please be sure to include the data url that you used so we can track down the member node that is no longer operational (for that data set) and remove it.

Below I have tried to figure out a way to make an option for the user to manually set the mn to a different node but it would basically turn into them manually running each line of the function. I think it may be a safer bet to just include the warning message instead.
'You may need to try to retrieve the data from an alternative member node by setting the member node manually.'

Then I need to turn this code chunk into the most minimal version of itself.

data_url <- utils::URLdecode(data_url)
data_versions <- check_version(data_url, formatType = "data")

if (nrow(data_versions) == 1) {
  data_id <- data_versions$identifier
} else if (nrow(data_versions) > 1) {
  # get most recent version
  data_versions$dateUploaded <- lubridate::ymd_hms(data_versions$dateUploaded)
  data_id <- data_versions$identifier[data_versions$dateUploaded == max(data_versions$dateUploaded)]
} else {
  stop("The DataONE ID could not be found for ", data_url)
}

# Set Nodes ------------

data_nodes <- dataone::resolve(dataone::CNode("PROD"), data_id)
all_mns <- c(data_nodes$data$nodeIdentifier)
cn <- dataone::CNode()
meta_id <- dataone::query(
  cn,
  list(q = sprintf('documents:"%s" AND formatType:"METADATA" AND -obsoletedBy:*', data_id),
       fl = "identifier")) %>%
  unlist()

# Generate the list of all member nodes that 'host' this data package by using the meta_id:

metadata_nodes <- dataone::resolve(cn, meta_id)
mn <- dataone::getMNode(cn, "urn:node:RW")

Spun my wheels a little bit because there were so many different errors; I need a cohesive list of the circumstances in which the current version of the function fails. But I got back on track.

Worked on making the download_ISO_data.R function work with a wider variety of iso xml. The field names are so long but if I can have it look for at least some of the most common field names I think that will be better. I'm talking about this section of the code for download_ISO_data.R:

metadata <- metadata %>%
  dplyr::mutate(name = dplyr::case_when(
    grepl("@type", name) ~ "xml.version",
    grepl("title", name) ~ "title",
    grepl("individualName", name) ~ "people",
    grepl("abstract", name) ~ "abstract",
    grepl("identificationInfo.MD_DataIdentification.descriptiveKeywords.MD_Keywords.keyword.CharacterString", name) ~ "keyword",
    grepl("doc.children.MD_Metadata.children.metadataStandardName.children.CharacterString.children.text.value", name) ~ "Metadata_ISO_Version",
    grepl("geographicDescription", name) ~ "geographicCoverage.geographicDescription",
    grepl("identificationInfo.MD_DataIdentification.extent.EX_Extent.geographicElement.EX_GeographicBoundingBox.westBoundLongitude.Decimal", name) ~ "geographicCoverage.westBoundingCoordinate",
    grepl("identificationInfo.MD_DataIdentification.extent.EX_Extent.geographicElement.EX_GeographicBoundingBox.eastBoundLongitude.Decimal", name) ~ "geographicCoverage.eastBoundingCoordinate",
    grepl("identificationInfo.MD_DataIdentification.extent.EX_Extent.geographicElement.EX_GeographicBoundingBox.northBoundLatitude.Decimal", name) ~ "geographicCoverage.northBoundingCoordinate",
    grepl("identificationInfo.MD_DataIdentification.extent.EX_Extent.geographicElement.EX_GeographicBoundingBox.southBoundLatitude.Decimal", name) ~ "geographicCoverage.southBoundingCoordinate",
    grepl("identificationInfo.MD_DataIdentification.extent.EX_Extent.temporalElement.EX_TemporalExtent.extent.TimePeriod.beginPosition", name) ~ "temporalCoverage.beginDate",
    grepl("identificationInfo.MD_DataIdentification.extent.EX_Extent.temporalElement.EX_TemporalExtent.extent.TimePeriod.endPosition", name) ~ "temporalCoverage.endDate",
    grepl("dataQualityInfo.DQ_DataQuality.report.DQ_ConceptualConsistency.evaluationMethodDescription.CharacterString", name) ~ "methods",
    grepl("objectName", name) ~ "objectName",
    grepl("online.url", name) ~ "url",
    grepl("dataQualityInfo.DQ_DataQuality.lineage.LI_Lineage.statement.CharacterString", name) ~ "methods")) %>%
  dplyr::filter(!is.na(name)) %>%
  dplyr::mutate(value = stringr::str_trim(value)) %>%
  dplyr::distinct() %>%
  dplyr::group_by(name) %>%
  dplyr::summarize(value = paste(value, collapse = "; "), .groups = "drop") %>%
  dplyr::mutate(value = gsub("\n", "", value))

The data frame version of the ISO XML has these super long, specific field names that correspond to the fields we need (like xml.version); some are retrieved from the meta_iso_xml object and some from the 'eml' object, which is the ISO metadata coerced into EML by as_emld.

Now I can't replicate the error I was getting associated with this problem and the function is working fine again?! Going to start with a clean slate tomorrow and try to use the function on several data packages from several member nodes and see if I can get a list

@mbjones (Member) commented Jul 13, 2021

@kristenpeach it is fairly easy to instruct dataone to try each of the replica copies on DataONE until one is found that does not fail. I'm pretty sure the dataone package is supposed to try all replicas before it fails, as shown in DataONEorg/rdataone#266 and DataONEorg/rdataone#228

If you can produce a reprex for situations where we are not trying all of the replicas, this problem can likely be fixed pretty easily.
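In the meantime, a sketch of trying each replica in turn at the metajam level (assuming cn, meta_id, and metadata_nodes from dataone::resolve() as in the earlier snippets):

meta_obj <- NULL
for (node_id in metadata_nodes$data$nodeIdentifier) {
  meta_obj <- tryCatch({
    node <- dataone::getMNode(cn, node_id)   # look up the node by its identifier
    dataone::getObject(node, meta_id)        # try to fetch the metadata from it
  }, error = function(e) NULL)
  if (!is.null(meta_obj)) break              # stop at the first replica that responds
}
if (is.null(meta_obj)) stop("Could not retrieve ", meta_id, " from any listed node")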

@kristenpeach (Collaborator)

Thank you @mbjones !!!

@kristenpeach (Collaborator)

Progress

Tried the new functions on a variety of member nodes and using data urls from data packages with both eml and iso xml files.

ISO

Member node: Research Workspace

Lisa Eisner and Michael Lomas. Phytoplankton identifications in the northern Bering and Chukchi seas, quantified with FlowCAM image analysis, Arctic Integrated Ecosystem Research Program, August-September 2017. Research Workspace. 10.24431/rw1k5ac, version: 10.24431_rw1k5ac_20210709T212354Z.

Jens Nielsen, Louise Copeman, Michael Lomas, and Lisa Eisner. Fatty acid seston samples collected from CTD samples in N. Bering and Chukchi Seas during Arctic Integrated Ecosystem Research Program, from research vessel Sikuliaq June 2017. Research Workspace. 10.24431/rw1k59z, version: 10.24431_rw1k59z_20210708T234958Z.

Member node: Arctic Data Center (via DataOne)

William Daniels, Yongsong Huang, James Russell, Anne Giblin, Jeffrey Welker, et al. 2021. Soil Water, plant xylem water, and leaf wax hydrogen isotope survey from Toolik Lake Area 2013-2014. Arctic Data Center. doi:10.18739/A2S17ST50.

Caitlin Livsey, Reinhard Kozdon, Dorothea Bauch, Geert-Jan Brummer, Lukas Jonkers, et al. 2021. In situ Magnesium/Calcium (Mg/Ca) and oxygen isotope (d18O) measurements in Neogloboquadrina pachyderma shells collected in 1999 by a MultiNet tow from different depth intervals in the Fram Strait. Arctic Data Center. doi:10.18739/A2WS8HN0X.

Member node: KNB (via DataOne)

Darcy Doran-Myers. 2021. Data: Density estimates for Canada lynx vary among estimation methods. Knowledge Network for Biocomplexity. urn:uuid:e9dc43c2-210f-40dc-86fb-a6ece2f5fd03.

This one does not work; it fails within the download_EML_data.R function at lines 38-40:

entity_data <- entity_objs %>%
purrr::keep(~any(grepl(data_id, purrr::map_chr(.x$physical$distribution$online$url, utils::URLdecode))))

The message the user gets is "Input does not appear to be an attributeList.", but that is because the entity_data object is empty: the lines above do not produce anything. When I inspect this dataset on the web interface (https://search.dataone.org/view/urn%3Auuid%3Ae9dc43c2-210f-40dc-86fb-a6ece2f5fd03) it looks like there should be an attribute list for this dataset. Happily, this problem does not seem to have anything to do with member nodes.

@kristenpeach (Collaborator) commented Jul 14, 2021

Progress

It looks like this is a problem for all or many data packages on KNB. By that I mean that when I try the SMALL_download_d1_data.R function on any data url from a KNB dataset, it fails with the same error about the data table lacking an attribute list. Because KNB uses eml, I can test the original download_d1_data.R function to see if the problem is old or new. When I use the original download_d1_data.R function on the same data urls, the function runs through, but it fails to produce an attribute-level metadata table: it 'fails' at the same place my function does but just keeps running to produce the summary metadata. This feels like a good problem to work on.

At first glance, when I compare the ADC dataset that worked great (https://search.dataone.org/view/doi%3A10.18739%2FA2S17ST50) and the KNB one that is failing (https://search.dataone.org/view/doi%3A10.5063%2FR78CMB), there are a couple differences on the web interface alone. The ADC attribute info has annotations, which is lovely, but I would not assume they are required for metajam to work. The KNB csv file I was trying to download is stored as an otherEntity instead of a dataTable. But I am not sure that is what would be causing problems either. The problem is I don't understand what the lines that are failing actually do (see note about lines 38-40 above). Spending some time trying to understand purrr better so I can figure it out. I will go through the function line by line with a data url that IS working to see what those lines are supposed to do, which should help me figure out why it's failing for KNB.

AHA. When I compare the successful data package to the unsuccessful one they diverge here:

entity_data <- entity_objs %>%
  purrr::keep(~any(grepl(data_id,
                         purrr::map_chr(.x$physical$distribution$online$url, utils::URLdecode))))

Because in the unsuccessful (KNB) package the dataset of interest within the entity_objs list lacks a 'physical' slot. So purrr can't find the url and does not keep the object. Fun!

@mbjones (Member) commented Jul 14, 2021

stored as an OtherEntity instead of a datatable

@kristenpeach this is a likely issue, and I was about to suggest it when reading your comment. Metajam needs to look in all of the allowed locations in EML for attribute info, and not assume that all providers will use just dataTable. I suspect it's a simple fix by adding an additional path to be searched for the CSV entity info before you do that search for the attributes. I'll bet entity_objs does not contain the entities that are described with otherEntity, spatialVector, spatialRaster, etc.

@kristenpeach (Collaborator)

Thank you @mbjones ! I think the function does look for other entities here:

entities <- c("dataTable", "spatialRaster", "spatialVector", "storedProcedure", "view", "otherEntity")
entities <- entities[entities %in% names(eml$dataset)]

entity_objs <- purrr::map(entities, ~EML::eml_get(eml, .x)) %>%  # restructure so that all entities are at the same level
  purrr::map_if(~!is.null(.x$entityName), list) %>%
  unlist(recursive = FALSE)

But because the user is trying to download one data file (not all of the files in the package), those purrr lines I noted above look for the specific file associated with the data_url the user provided, so that metajam keeps and downloads only the data file of interest (and drops all others). It was throwing everything out, though, because the otherEntity file does not have a physical (so .x$physical$distribution$online$url was empty). Kind of tricky to explain. But here is the entity_objs list for the KNB otherEntity:
[Screenshot: entity_objs entry for the KNB otherEntity, with no physical slot]

And here is the entity_obj of the ADC dataset (which works fine with metajam):
[Screenshot: entity_objs entry for the ADC dataset, which includes a physical slot]

So if I am understanding the problem correctly (big if) I think I can just tell R to only keep the items in the entity_objs list where the data_id and the "id" match (instead of looking for a match in the url). Then it should be able to identify the correct file even if it is a datatable or otherentity. Feel free to let me know if I am way off the mark.

@mbjones (Member) commented Jul 15, 2021

That sounds reasonable, although I haven't had a chance to look at the details. @jeanetteclark and @laijasmine have worked with these structures a lot and might have good suggestions and maybe some code....

@laijasmine

I'm kind of coming in with little to no context here, so happy to jump on a call if that would be more helpful. If I understand correctly, based on what you said:

I think I can just tell R to only keep the items in the entity_objs list where the data_id and the "id" match (instead of looking for a match in the url).

Yes, you can get the data pid (data_id?) using the id slot in the entities. The one thing to note is that, to avoid issues with the : character, all of the urn:uuid: is replaced with dashes (-).

@jeanetteclark commented Jul 15, 2021

There are a few different ways to match the data file with the metadata in a dataset, none of which are 100% guaranteed to work (it depends entirely on how well the metadata were constructed). I would try to match the pid in the following ways:

  1. Data distribution URL
  2. @id
  3. entityName (match to system metadata fileName)

I believe this is how metacatUI operates, though I'm not sure if the order is the same or not.

@kristenpeach (Collaborator) commented Jul 15, 2021

@mbjones @laijasmine @jeanetteclark Thank you everyone! I think I have plenty of options to try. I appreciate the help! Just a heads up that Julien and I have a system where I basically report my progress on this issue page every day. I'm not sure if Github emails everyone tagged in the issue every time I update it but that could get really annoying for you so sorry in advance!

@kristenpeach (Collaborator)

Progress

Spent wayy too much time trying to figure out how to find and replace a character string in a nested list. I wanted to replace "-" with ":" in the 'id' slot of each entry of the list (each entity) so that I could match it with the data_id. Then realized I could just swap them in the data id instead...

temp_data_id <- gsub("\\:", "\\-", data_id)
entity_data <- entity_objs %>% purrr::keep(~any(grepl(temp_data_id, .x$id)))

Seems to work fine though! Thanks everyone! It's worked on a couple datasets I've tried, but I will keep trying more until I find the next hiccup.

This feels like something I should understand by now but I don't really get how those KNB data sets don't have a physical. I know that at ADC we sometimes had to 'set' the physical (sysmeta_to_eml_physical). So basically pull information already "known" in the sysmeta like file size into an 'eml physical object'. If someone submits data to the ADC and they do everything right on the web interface, and the data team does not have to fix anything, isn't the physical set automatically? Just trying to wrap my head around this. But when I dig into the schema to the otherEntity level (https://eml.ecoinformatics.org/schema/) it seems like there are a lot of fields that 'could' be there in addition to the basic entityName, entityDescription, attributeList. So is physical considered an 'optional' element of eml and KNB just happens not to use it?

@laijasmine

Yeah the physical needs to be set manually by someone on the team when we process a dataset and it is not something set automatically. We also need to update the physical if the file is replaced (the info like the file name and size might be slightly different). So since no one is actively reviewing the datasets that come through the KNB, the physical isn't included in the metadata.

@kristenpeach (Collaborator)

Progress

Cleaned out some unused code from all the new functions. Worked on writing testthat tests for the download_EML_data.R and download_ISO_data.R functions that are called within download_d1_data.R. Not sure how many test cases are appropriate, but I will test some data urls from different member nodes just because that has caused problems before.

Discussed with Julien and going to do a pull request.

@eblondel commented Apr 6, 2022

Dear all, discovering this issue with attempts to use the geometa converters between metadata objects :-) In case you need to exchange on that, feel free to contact me or post an issue on the geometa repository. The geometa converters were the result of some R&D activity under a project funded by the R Consortium some years ago to consolidate geometa standards coverage and explore new bridges between metadata standards. It has been a while since I looked into these converters, but I would be happy to get into it again. Cheers

@brunj7 (Collaborator, Author) commented Mar 18, 2024

See PR #134

@njlyon0 closed this as completed Apr 16, 2024