HYC-1280: Create intermediate ingest object (#738)
* Create object to store xml and parse jats xml

Create intermediate object for parsing xml and translating to Hyrax objects

- Move check for extra files in package
- Do not map UNC affiliation until we can do so more reliably, put in "other affiliation" until then

* Add and configure javascript driver for capybara, upgrade

* Use ffaker for tests, do setup step once for feature, for faster run
maxkadel authored Jan 12, 2022
1 parent 0e73b5b commit aa1ccb0
Showing 21 changed files with 2,382 additions and 53 deletions.
3 changes: 1 addition & 2 deletions .rubocop.yml
@@ -3,10 +3,9 @@ inherit_from: .rubocop_todo.yml
AllCops:
TargetRubyVersion: 2.6.7
NewCops: disable

Metrics/BlockLength:
Exclude:
- 'db/schema.rb'
- 'vendor/**/*'

# TODO: Enable this cop - temporarily disabled here because it is not being added to the auto-generated to-do list
Lint/RedundantCopDisableDirective:
5 changes: 4 additions & 1 deletion Gemfile
@@ -81,10 +81,13 @@ group :development do
  end

  group :test do
- gem 'capybara', '~> 2.17.0'
+ gem 'capybara', '~> 3.36'
  gem 'factory_bot_rails', '~> 6.1.0'
+ gem 'ffaker'
  gem 'rspec-mocks'
+ gem "selenium-webdriver"
  gem 'shoulda-matchers', '~> 5.0.0'
  gem 'simplecov', '~> 0.17.0'
+ gem "webdrivers"
  gem 'webmock', '~> 3.14.0'
  end
28 changes: 22 additions & 6 deletions Gemfile.lock
@@ -179,18 +179,21 @@ GEM
  simple_form
  byebug (9.1.0)
  cancancan (1.17.0)
- capybara (2.17.0)
+ capybara (3.36.0)
  addressable
+ matrix
  mini_mime (>= 0.1.3)
- nokogiri (>= 1.3.3)
- rack (>= 1.0.0)
- rack-test (>= 0.5.4)
- xpath (>= 2.0, < 4.0)
+ nokogiri (~> 1.8)
+ rack (>= 1.6.0)
+ rack-test (>= 0.6.3)
+ regexp_parser (>= 1.5, < 3.0)
+ xpath (~> 3.2)
  carrierwave (1.3.2)
  activemodel (>= 4.0.0)
  activesupport (>= 4.0.0)
  mime-types (>= 1.16)
  ssrf_filter (~> 1.0)
+ childprocess (4.1.0)
  chronic_duration (0.10.6)
  numerizer (~> 0.1.1)
  clamav-client (3.2.0)
@@ -313,6 +316,7 @@ GEM
  faraday (>= 0.7.4, < 1.0)
  fcrepo_wrapper (0.8.0)
  ruby-progressbar
+ ffaker (2.20.0)
  ffi (1.15.4)
  flipflop (2.6.0)
  activesupport (>= 4.0)
@@ -588,6 +592,7 @@ GEM
  carrierwave (>= 0.5.8)
  rails (>= 5.0.0)
  marcel (1.0.1)
+ matrix (0.4.2)
  memoist (0.16.2)
  method_source (1.0.0)
  mime-types (3.3.1)
@@ -846,6 +851,10 @@ GEM
  ffi (~> 1.9)
  scanf (1.0.0)
  select2-rails (3.5.11)
+ selenium-webdriver (4.1.0)
+ childprocess (>= 0.5, < 5.0)
+ rexml (~> 3.2, >= 3.2.5)
+ rubyzip (>= 1.2.2)
  shex (0.6.2)
  ebnf (~> 2.1)
  json-ld (~> 3.1)
@@ -946,6 +955,10 @@ GEM
  activemodel (>= 5.0)
  bindex (>= 0.4.0)
  railties (>= 5.0)
+ webdrivers (5.0.0)
+ nokogiri (~> 1.6)
+ rubyzip (>= 1.3.0)
+ selenium-webdriver (~> 4.0)
  webmock (3.14.0)
  addressable (>= 2.8.0)
  crack (>= 0.3.2)
@@ -969,7 +982,7 @@ DEPENDENCIES
  bootstrap-sass (~> 3.4.1)
  bulkrax (~> 1.0.0)
  byebug (~> 9.1.0)
- capybara (~> 2.17.0)
+ capybara (~> 3.36)
  clamav-client
  coffee-rails (~> 4.2.2)
  devise (~> 4.8.0)
@@ -978,6 +991,7 @@ DEPENDENCIES
  execjs (= 2.8.1)
  factory_bot_rails (~> 6.1.0)
  fcrepo_wrapper (~> 0.8.0)
+ ffaker
  httparty (~> 0.20.0)
  hydra-editor (= 5.0.1)
  hydra-role-management (~> 1.0)
@@ -1008,6 +1022,7 @@ DEPENDENCIES
  rubocop-rails
  rubocop-rspec
  sass-rails (~> 5.0.6)
+ selenium-webdriver
  shoulda-matchers (~> 5.0.0)
  sidekiq (~> 5.2.9)
  sidekiq-limit_fetch (~> 3.4.0)
@@ -1018,6 +1033,7 @@ DEPENDENCIES
  turbolinks (~> 5.0.1)
  uglifier (~> 4.2.0)
  web-console (~> 3.7.0)
+ webdrivers
  webmock (~> 3.14.0)
  willow_sword!

7 changes: 7 additions & 0 deletions README.md
@@ -55,6 +55,13 @@ bundle exec rake sage:ingest[/hyrax/spec/fixtures/sage/sage_config.yml]
work.to_solr.deep_symbolize_keys!
```
* Copy the output of this to the sample_solr_documents file. Add a unique `:timestamp` value to the hash (e.g. `:timestamp => "2021-11-23T16:05:33.033Z"`) so that the `spec/requests/oai_pmh_endpoint_spec.rb` tests continue to pass.

#### Debugging Capybara feature and javascript tests
* Save a screenshot
* Put `page.save_screenshot('screenshot.png')` on the line before the failing test (you can use a different name for the file if that's helpful)
* The screenshot will be saved to `tmp/capybara`.
* See https://github.com/teamcapybara/capybara#debugging for more info
##### Code Linter - Rubocop
* Helpful Rubocop documentation - https://docs.rubocop.org/rubocop/usage/basic_usage.html
224 changes: 224 additions & 0 deletions app/models/jats_ingest_work.rb
@@ -0,0 +1,224 @@
# For information on the JATS metadata standard, see https://jats.nlm.nih.gov/
# Currently used for Sage ingest
class JatsIngestWork
include ActiveModel
attr_reader :xml_path

def initialize(xml_path:)
@xml_path = xml_path
end

def jats_xml
@jats_xml ||= File.read(xml_path)
end

def document
@document ||= Nokogiri::XML(jats_xml)
end

def article_metadata
@article_metadata ||= document.xpath('.//article-meta')
end

def creators_metadata
@creators_metadata ||= document.xpath('.//contrib-group')
end

def journal_metadata
@journal_metadata ||= document.xpath('.//journal-meta')
end

def permissions
@permissions ||= article_metadata.xpath('.//permissions')
end

def abstract
article_metadata.xpath('.//abstract').map(&:inner_text)
end

def copyright_date
permissions.at('copyright-year').inner_text
end

def creators
@creators ||= begin
creators_metadata.xpath('.//contrib').map.with_index do |contributor, index|
[index, contributor_to_hash(contributor, index)]
end.to_h
end
end

# TODO: Map affiliation to UNC controlled vocabulary
def contributor_to_hash(contributor, index)
affiliation_ids = affiliation_ids(contributor)
first_affiliation = affiliation_map[affiliation_ids.first]
{
'name' => "#{surname(contributor)}, #{given_names(contributor)}",
'orcid' => orcid(contributor),
'affiliation' => '',
# 'affiliation' => some_method, # Do not store affiliation until we can map it to the controlled vocabulary
'other_affiliation' => first_affiliation,
'index' => (index+1).to_s
}
end

def affiliation_map
@affiliation_map ||= begin
document.xpath('//aff').map do |affil|
[affil.attributes["id"].value, affiliation_to_s(affil)]
end.to_h
end
end

def affiliation_ids(elem)
references = elem.xpath('xref')
references.map do |ref|
reference_type = ref['ref-type']
next unless reference_type=="aff"

ref["rid"]
end.compact
end

def affiliation_to_s(affil_elem)
affil_elem.children.map do |child|
# Don't include newlines or the order label
next if child.inner_text == "\n" || child.name == "label"

# Only include the institution name proper from the institution-wrap, don't include the institution-id
if child.xpath(".//institution").present?
child.xpath(".//institution").inner_text
else
child.inner_text
end
end.join
end

def date_of_publication
if publication_day && publication_month && publication_year
"#{publication_year}-#{publication_month}-#{publication_day}"
elsif publication_month && publication_year
"#{publication_year}-#{publication_month}"
else
publication_year
end
end

def funder
article_metadata.xpath('.//funding-source/institution-wrap/institution').map(&:inner_text)
end

# The Sage-assigned DOI
def identifier
article_metadata.xpath('.//article-id[@pub-id-type="doi"]').map(&:inner_text)
end

def issn
journal_metadata.xpath(".//issn").map(&:inner_text)
end

def journal_issue
article_metadata.at('issue')&.inner_text
end

def journal_title
journal_metadata.xpath(".//journal-title-group/journal-title").inner_text
end

def journal_volume
article_metadata.at('volume')&.inner_text
end

def keyword
article_metadata.at('kwd-group').xpath("//kwd").map do |elem|
if elem.at('italic')
elem.at('italic').inner_text
else
elem.inner_text
end
end
end

def license
permissions.xpath(".//license/@xlink:href").map do |elem|
CdrLicenseService.authority.find(elem&.inner_text)[:id]
end
end

def license_label
license.map do |lic|
CdrLicenseService.label(lic)
end
end

def page_end
article_metadata.at('lpage')&.inner_text
end

def page_start
article_metadata.at('fpage')&.inner_text
end

def publisher
journal_metadata.xpath('.//publisher/publisher-name').map(&:inner_text)
end

def rights_holder
permissions.xpath('.//copyright-holder').map(&:inner_text)
end

def title
article_metadata.xpath('.//title-group/article-title').map(&:inner_text)
end

private

def publication_year
year = publication_date_node_set.at('year')&.inner_text&.to_i
format('%04d', year) if year
end

def publication_month
month = publication_date_node_set.at('month')&.inner_text&.to_i
format('%02d', month) if month
end

def publication_day
day = publication_date_node_set.at('day')&.inner_text&.to_i
format('%02d', day) if day
end

def publication_date_node_set
if physical_publication_date.present?
physical_publication_date
elsif electronic_and_physical_publication_date.present?
electronic_and_physical_publication_date
elsif electronic_publication_date.present?
electronic_publication_date
end
end

def electronic_publication_date
article_metadata.xpath('.//pub-date[@pub-type="epub"]')
end

def electronic_and_physical_publication_date
article_metadata.xpath('.//pub-date[@pub-type="epub-ppub"]')
end

def physical_publication_date
article_metadata.xpath('.//pub-date[@pub-type="ppub"]')
end

def surname(contributor)
contributor.xpath('name/surname').inner_text
end

def given_names(contributor)
contributor.xpath('name/given-names').inner_text
end

def orcid(contributor)
contributor.xpath('contrib-id').inner_text
end
end
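
The `date_of_publication` method above prefers a full year-month-day string, falls back to year-month, and finally to the year alone, with each part zero-padded by the private `publication_*` methods. A minimal standalone sketch of that fallback follows. The `build_publication_date` helper and its keyword arguments are illustrative assumptions, not part of this commit, and it takes plain strings rather than Nokogiri nodes:

```ruby
# Hypothetical helper (not part of the commit): mirrors the fallback order of
# JatsIngestWork#date_of_publication using plain strings in place of XML nodes.
def build_publication_date(year:, month: nil, day: nil)
  # Zero-pad each part the same way the private publication_* methods do.
  padded_year = format('%04d', year.to_i) if year
  padded_month = format('%02d', month.to_i) if month
  padded_day = format('%02d', day.to_i) if day

  if padded_day && padded_month && padded_year
    "#{padded_year}-#{padded_month}-#{padded_day}"
  elsif padded_month && padded_year
    "#{padded_year}-#{padded_month}"
  else
    padded_year
  end
end

puts build_publication_date(year: '2021', month: '3', day: '7')
puts build_publication_date(year: '2021', month: '11')
puts build_publication_date(year: '2021')
```

A missing part is represented here by `nil`, which is also what the safe-navigation chain (`&.inner_text&.to_i`) in the committed methods yields when an element is absent.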
@@ -20,7 +20,7 @@ def users_to_notify
  repo_admins.each do |u|
  users << u
  end
- users.uniq
+ users.compact.uniq
  end
  end
  end
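
The one-line change in `users_to_notify` adds `compact` before `uniq`. The commit does not state the reason, so the following is a sketch under the assumption that the list can contain `nil` entries (for example, when a user lookup returns nothing). Plain strings stand in for user records here:

```ruby
# Hypothetical stand-in for the list built in users_to_notify: plain strings
# replace user records, and nil represents a lookup that found no user.
users = ['admin_a', nil, 'depositor', 'admin_a', nil]

without_compact = users.uniq       # duplicates collapse, but one nil remains
with_compact = users.compact.uniq  # nil entries are dropped before deduplication

p without_compact
p with_compact
```

With `uniq` alone, a single `nil` stays in the result, and any later code that calls a method on each user would then fail on that entry.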