Merge pull request #188 from lyrasis/documenting-iterative-cleanup
Add final job to IterativeCleanup; improve documentation
kspurgin authored Sep 12, 2023
2 parents 7daa939 + 7a222ac commit 6fd77b0
Showing 7 changed files with 431 additions and 55 deletions.
350 changes: 305 additions & 45 deletions doc/iterative_cleanup.md

Large diffs are not rendered by default.

12 changes: 10 additions & 2 deletions doc/iterative_cleanup_flowchart.mmd
@@ -8,8 +8,8 @@ graph TD;

CleanedUniq["`**CleanedUniq**
Deduplicate on :clean_fingerprint
Delete mod.cleaned_uniq_collate_fields
Collate mod.cleaned_uniq_collate_fields`"]
Delete collate_fields
Collate collate_fields`"]

Worksheet["`**Worksheet**
If worksheet already provided:
@@ -36,6 +36,12 @@ graph TD;
Explode collated mod.orig_values_identifier
Deduplicate on full row match`"]

Final["`**Final**
Lets you:
- Set custom lookup key for merge back into migration
- Apply custom transforms on cleaned data that won't interfere with cleanup iterations`"
]

base_job-->BaseJobCleaned;

Corrections-.
@@ -66,3 +72,5 @@ graph TD;
Provided-->KnownWorksheetValues;

KnownWorksheetValues-->Worksheet;

BaseJobCleaned-->Final;
Binary file modified doc/iterative_cleanup_flowchart.pdf
Binary file not shown.
Binary file modified doc/iterative_cleanup_flowchart.png
77 changes: 71 additions & 6 deletions lib/kiba/extend/mixins/iterative_cleanup.rb
@@ -55,7 +55,7 @@ module Mixins
#
# `extend Kiba::Extend::Mixins::IterativeCleanup`
#
# ### Optional settings/methods in extending module
# ### Methods that can be optionally overridden in extending module
#
# Default values for the following methods are defined in this mixin
# module. If you want to override the values, define these methods
@@ -69,6 +69,7 @@ module Mixins
# - {collate_fields}
# - {collation_delim}
# - {clean_fingerprint_flag_ignore_fields}
# - {final_lookup_on_field}
#
# ## What extending this module does
#
@@ -143,8 +144,11 @@ def orig_values_identifier
:fingerprint
end

# Tags assigned to all jobs generated by IterativeCleanup for this
# module. DEFAULT VALUE: `[]` (empty array)
# Tags assigned to all jobs generated by IterativeCleanup for
# this module. Tags allow retrieval and running of jobs via
# `thor jobs:tagged`, `thor jobs:tagged_or`, and `thor
# jobs:tagged_and` commands. DEFAULT VALUE: `[]` (empty
# array)
#
# @note Optional: override in extending module after extending
#
@@ -177,9 +181,37 @@ def worksheet_field_order

# Fields from base_job_cleaned that will be deleted in
# cleaned_uniq, and then merged back into the deduplicated
# data from base_job_cleaned. I.e., fields whose values will
# be collated into multivalued fields on the deduplicated
# values. DEFAULT VALUE: `[]`
# data of that job from base_job_cleaned. I.e., fields whose
# values will be collated into multivalued fields on the
# deduplicated values. DEFAULT VALUE: `[]`
#
# Note that `:fingerprint` (or your overridden orig_values_identifier)
# is added to these values by the {all_collate_fields} method. That
# field should always be collated, or you will not be able to match
# final cleaned values back to original migration data.
#
# An example of when you might want to add additional collate
# fields: For authority term cleanup, especially if we are
# breaking up subject headings into individual subdivisions,
# I like to provide the full subject heading from which the
# term was derived, for context. For example, `:subdivision`
# = "History", `:fullheading` = "Ghana -- History". If you
# also have a row with `:subdivision` = "Histories",
# `:fullheading` = "Ghana -- Histories", and the client
# corrects "Histories" to "History" in that row, if you
# include `:fullheading` in `collate_fields`, a subsequently
# generated worksheet row with `:subdivision` = "History"
# will have `:fullheading` = "Ghana -- History\\\\Ghana --
# Histories".
#
# It can also be useful for clients with large cleanup
# projects to provide the number of occurrences for each
# value in the project. Retain this information through
# multiple cleanup iterations by collating the occurrences
# field and adding an inline transform to split and sum the
# values in a custom `cleaned_uniq_post_xforms` method. See
# [Tms::PlacesCleanupInitial](https://github.com/lyrasis/kiba-tms/blob/main/lib/kiba/tms/places_cleanup_initial.rb)
# for an example.
#
# @note Optional: override in extending module after extending
#
@@ -233,6 +265,18 @@ def clean_fingerprint_flag_ignore_fields
nil
end

# Will be used to set the `lookup_on` field in job registry
# hash for `cleanup_base_name__final`, for merging
# cleaned-up data back into the rest of your migration.
# DEFAULT VALUE: value of orig_values_identifier
#
# @note Optional: override in extending module after extending
#
# @return [Symbol]
def final_lookup_on_field
orig_values_identifier
end

# DO NOT OVERRIDE REMAINING METHODS

# @return [Array<Symbol>] supplied registry entry job keys
@@ -309,6 +353,10 @@ def corrections_job_key
"#{cleanup_base_name}__corrections".to_sym
end

def final_job_key
"#{cleanup_base_name}__final".to_sym
end

# Appends "s" to module's `orig_values_identifier`. Used to
# manage joining, collating, and splitting/exploding on this
# value, while clarifying that any collated field in output
@@ -417,6 +465,8 @@ def build_namespace
register mod.send(:job_name, mod.send(:corrections_job_key)),
mod.send(:corrections_job_hash, mod)
end
register mod.send(:job_name, mod.send(:final_job_key)),
mod.send(:final_job_hash, mod)
end
end
private :build_namespace
@@ -497,6 +547,21 @@ def corrections_job_hash(mod)
}
end
private :corrections_job_hash

def final_job_hash(mod)
  {
    path: File.join(Kiba::Extend::Mixins::IterativeCleanup.datadir(mod),
      "working", "#{mod.cleanup_base_name}_final.csv"),
    creator: {
      callee:
        Kiba::Extend::Mixins::IterativeCleanup::Jobs::Final,
      args: {mod: mod}
    },
    tags: mod.job_tags,
    lookup_on: mod.final_lookup_on_field
  }
end
private :final_job_hash
end
end
end
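
The overridable defaults in this file can be illustrated with a small standalone sketch. It does not load kiba-extend: `CleanupDefaultsSketch` is a hypothetical stand-in that copies only the default method bodies visible in this diff, and `PlacesCleanupSketch`, `NamesCleanupSketch`, and the `:norm_name` field are invented for illustration. It shows, under those assumptions, how redefining a method in the extending module changes the value that would end up as `lookup_on` for the `__final` job.

```ruby
# Stand-in for the mixin's defaults. This is NOT the real
# Kiba::Extend::Mixins::IterativeCleanup; it mirrors only the
# method bodies shown in the diff above.
module CleanupDefaultsSketch
  def orig_values_identifier
    :fingerprint
  end

  def job_tags
    []
  end

  # Defaults to whatever orig_values_identifier returns
  def final_lookup_on_field
    orig_values_identifier
  end

  def final_job_key
    "#{cleanup_base_name}__final".to_sym
  end
end

# Hypothetical extending module that keeps every default
module PlacesCleanupSketch
  extend CleanupDefaultsSketch

  module_function

  def cleanup_base_name
    :places_cleanup
  end
end

# Hypothetical extending module that overrides tags and the
# lookup field (the :norm_name field is an assumption)
module NamesCleanupSketch
  extend CleanupDefaultsSketch

  module_function

  def cleanup_base_name
    :names_cleanup
  end

  def job_tags
    %i[names cleanup]
  end

  def final_lookup_on_field
    :norm_name
  end
end

p PlacesCleanupSketch.final_lookup_on_field # :fingerprint
p NamesCleanupSketch.final_lookup_on_field  # :norm_name
p NamesCleanupSketch.final_job_key          # :names_cleanup__final
```

Methods defined directly in the extending module take precedence over those supplied by `extend`, which is why the override needs no extra wiring in this sketch.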
4 changes: 2 additions & 2 deletions lib/kiba/extend/mixins/iterative_cleanup/jobs/cleaned_uniq.rb
@@ -59,11 +59,11 @@ def cleaned_xforms(mod)
Kiba.job_segment do
job = bind.receiver

transform Delete::Fields,
fields: mod.all_collate_fields
transform Deduplicate::Table,
field: :clean_fingerprint,
delete_field: false
transform Delete::Fields,
fields: mod.all_collate_fields
transform Merge::MultiRowLookup,
lookup: send(mod.base_job_cleaned_job_key),
keycolumn: :clean_fingerprint,
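
The deduplicate, delete, and collate sequence in `cleaned_xforms` can be approximated in plain Ruby to show what ends up in a collated field. This is a rough sketch, not the behavior of `Deduplicate::Table` or `Merge::MultiRowLookup`: the helper names `dedupe_and_collate` and `sum_collated` are hypothetical, and the `"|"` delimiter is an assumption rather than the mixin's `collation_delim` default. The sample rows reuse the `:subdivision`/`:fullheading` example from the `collate_fields` documentation.

```ruby
# Keep one row per :clean_fingerprint, drop the collate fields from it,
# then merge back every value of those fields from the matching rows.
# +delim+ is an assumed delimiter, not the mixin's default.
def dedupe_and_collate(rows, fields:, delim: "|")
  rows.group_by { |row| row[:clean_fingerprint] }.map do |_fp, group|
    kept = group.first.reject { |key, _val| fields.include?(key) }
    fields.each do |field|
      kept[field] = group.map { |row| row[field] }.compact.join(delim)
    end
    kept
  end
end

# Split a collated occurrence count and sum its parts, as an inline
# transform in a custom cleaned_uniq_post_xforms might do.
def sum_collated(value, delim: "|")
  value.to_s.split(delim).sum { |part| part.to_i }
end

rows = [
  {subdivision: "History", fullheading: "Ghana -- History",
   fingerprint: "fp1", clean_fingerprint: "cf1"},
  {subdivision: "History", fullheading: "Ghana -- Histories",
   fingerprint: "fp2", clean_fingerprint: "cf1"}
]

result = dedupe_and_collate(rows, fields: %i[fullheading fingerprint])
p result.first[:fullheading] # "Ghana -- History|Ghana -- Histories"
p result.first[:fingerprint] # "fp1|fp2"
p sum_collated("3|12")       # 15
```

Collating the fingerprint alongside the context field is what lets each cleaned row be traced back to all of the original rows it replaces.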
43 changes: 43 additions & 0 deletions lib/kiba/extend/mixins/iterative_cleanup/jobs/final.rb
@@ -0,0 +1,43 @@
# frozen_string_literal: true

module Kiba
  module Extend
    module Mixins
      module IterativeCleanup
        module Jobs
          module Final
            module_function

            def job(mod:)
              Kiba::Extend::Jobs::Job.new(
                files: {
                  source: mod.base_job_cleaned_job_key,
                  destination: mod.final_job_key
                },
                transformer: get_xforms(mod)
              )
            end

            def get_xforms(mod)
              base = []
              if mod.respond_to?(:final_pre_xforms)
                base << mod.final_pre_xforms
              end
              base << xforms(mod)
              if mod.respond_to?(:final_post_xforms)
                base << mod.final_post_xforms
              end
              base
            end

            def xforms(mod)
              Kiba.job_segment do
                # passthrough - pre and post mean nothing here
              end
            end
          end
        end
      end
    end
  end
end
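
The way `get_xforms` picks up optional hooks can be shown without Kiba. In this minimal sketch, plain symbols stand in for job segments, and `assemble_xforms`, `NoHooksSketch`, and `PostHookSketch` are hypothetical names. Only the `respond_to?` pattern is taken from the file above.

```ruby
# Mirrors the respond_to? checks in Jobs::Final.get_xforms. The +core+
# argument stands in for the passthrough segment returned by xforms(mod).
def assemble_xforms(mod, core)
  segments = []
  segments << mod.final_pre_xforms if mod.respond_to?(:final_pre_xforms)
  segments << core
  segments << mod.final_post_xforms if mod.respond_to?(:final_post_xforms)
  segments
end

# Hypothetical extending module that defines no hooks
module NoHooksSketch
end

# Hypothetical extending module that defines only a post hook
module PostHookSketch
  module_function

  def final_post_xforms
    :custom_post_segment
  end
end

p assemble_xforms(NoHooksSketch, :passthrough)  # [:passthrough]
p assemble_xforms(PostHookSketch, :passthrough) # [:passthrough, :custom_post_segment]
```

Because the built-in segment is a passthrough, defining either hook has the same practical effect: the custom transforms run on the cleaned data just before it is written to the final job's output.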
