Books: clean aleph numbers #15

Closed
7 tasks done
ludmilamarian opened this issue Oct 17, 2018 · 28 comments

@ludmilamarian
Contributor

ludmilamarian commented Oct 17, 2018

Approximately 25,000 records need to be corrected.
One way to tell whether a value is an Aleph number is that it starts with 000:
https://cds.cern.ch/search?ln=en&sc=1&p=962__b%3A%22000*%22+or+785%3A%22000*%22+or+770%3A%22000*%22+or+780%3A%22000*%22+or+787%3A%22000*%22+or+772%3A%22000*%22&action_search=Search&op1=a&m1=a&p1=&f1=&c=Articles+%26+Preprints&c=Books+%26+Proceedings&c=Presentations+%26+Talks&c=Periodicals+%26+Progress+Reports&c=Multimedia+%26+Outreach

(The search may not be 100% accurate.)

The fields that need to be checked are:

  • 962b
  • 770w
  • 772w
  • 780w
  • 785w
  • 787w

Additionally:

  • 035$$9CERCER

The matching needs to be done against the 970__a values that contain ‘CER’: each Aleph number found in the fields above must be replaced with the corresponding CDS record number.
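A minimal sketch of that replacement, assuming the 970__a values have already been exported into a plain dictionary keyed by CDS record ID. The helper names `build_aleph_index` and `replace_aleph` are hypothetical and not part of any CDS or Invenio API; the sample values are the 970__a numbers quoted for records 611669 and 108000 in this thread.

```python
def build_aleph_index(records_970a):
    """Map Aleph number -> CDS recid from {recid: '002371072CER'} pairs."""
    index = {}
    for recid, value in records_970a.items():
        # Only 970__a values carrying the 'CER' suffix are Aleph numbers.
        if value.endswith("CER"):
            index[value[: -len("CER")]] = recid
    return index


def replace_aleph(value, index):
    """Return the CDS recid for an Aleph number, or the value unchanged."""
    recid = index.get(value)
    return str(recid) if recid is not None else value


index = build_aleph_index({611669: "002371072CER", 108000: "000012716CER"})
print(replace_aleph("000012716", index))  # Aleph number present in the index
print(replace_aleph("1163043", index))    # unknown value, left untouched
```

Leaving unknown values untouched keeps the step safe to rerun: anything that cannot be resolved stays visible for manual review.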

Here is an example:
https://cds.cern.ch/record/1163043?ln=en

As far as I know, the field 035 also contains Aleph numbers when $$9CERCER:
Ex: https://cds.cern.ch/record/1220684?ln=en -> to be checked if it is still in use for something

( requires #21 )

@ludmilamarian
Contributor Author

@agentilb http://cds.cern.ch/record/611669/export/hm?ln=en contains 035$$9CERCER, but the number does not start with 000. Is this correct? Is this a valid Aleph number, and what should we do in such cases?

@agentilb
Collaborator

agentilb commented Feb 7, 2019

Hi @ludmilamarian, the 970 contains the Aleph number: 000611669 970__ $$a002371072CER. Here it seems to match the 035$$a, so we can ignore this field.

@ludmilamarian
Contributor Author

ludmilamarian commented Feb 8, 2019

Hi @agentilb, unfortunately there is a bit of a grey zone. There are currently 13 records out of the ones we plan to migrate (71 records in total in CDS: see ticket #21 ) that have a 035 CERCER number that does not match the 970. Can you take a look and let me know how I should proceed with them?

Record: http://cds.cern.ch/record/688975        035$$9CERCER$$a2335634  970$$a002415211CER
Record: http://cds.cern.ch/record/108000        035$$9CERCER$$a2197315  970$$a000012716CER
Record: http://cds.cern.ch/record/644681        035$$9CERCER$$a0001170  970$$a002400422CER
Record: http://cds.cern.ch/record/960228        035$$9CERCER$$a0017737  970$$a002626513CER
Record: http://cds.cern.ch/record/395229        035$$9CERCER$$a0321541  970$$a002807641CER
Record: http://cds.cern.ch/record/682026        035$$9CERCER$$a0246139  970$$a002408512CER
Record: http://cds.cern.ch/record/887495        035$$9CERCER$$a0004029  970$$a002560162CER
Record: http://cds.cern.ch/record/1354483       035$$9CERCER$$a0321541  970$$a002807641CER
Record: http://cds.cern.ch/record/684955        035$$9CERCER$$a0013374  970$$a002411355CER
Record: http://cds.cern.ch/record/407879        035$$9CERCER$$a0334702  NO 970
Record: http://cds.cern.ch/record/277434        035$$9CERCER$$a0268334  970$$a000197258CER
Record: http://cds.cern.ch/record/277439        035$$9CERCER$$a0268012  970$$a000197263CER
Record: http://cds.cern.ch/record/278090        035$$9CERCER$$a0267712  970$$a000197947CER
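One way to flag such mismatches is sketched below. It rests on an assumption that is not confirmed in this thread: a 035 value "matches" when it equals the 970__a Aleph number once the CER suffix and leading zeros are removed. The helper names are hypothetical, and the 035 value in the second call is made up to show a matching pair.

```python
def normalize(value):
    """Strip a trailing 'CER' and leading zeros so IDs compare by digits."""
    if value.endswith("CER"):
        value = value[:-3]
    return value.lstrip("0")


def mismatched(field_035a, field_970a):
    """True when the 970 is missing or its number differs from the 035."""
    if field_970a is None:
        return True
    return normalize(field_035a) != normalize(field_970a)


print(mismatched("2335634", "002415211CER"))  # record 688975 from the list
print(mismatched("2371072", "002371072CER"))  # hypothetical matching pair
print(mismatched("0334702", None))            # record 407879, no 970
```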

@ludmilamarian
Contributor Author

ludmilamarian commented Feb 8, 2019

787__w
-> currently 107 records containing values in 787__w:
cds_records_with_787w.txt
-> from the list it looks like a mix of CDS IDs and Aleph IDs; after checking the list, the rule that the Aleph number starts with 00 seems correct, so only these will be updated.

Based on the above, 24 records have been identified as needing an update, and 1 needs a manual update @agentilb

  • 1283985: ['00SYST.NUMBER']
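The "starts with 00" rule could be applied with a small classifier like the one below. This is a sketch only: the nine-digit length check is an assumption drawn from the Aleph numbers quoted in this thread, and `classify_link_value` is a hypothetical helper, not the script that was actually run.

```python
import re

# Assumed shape of an Aleph number: nine digits, the first two being 00.
ALEPH_RE = re.compile(r"00\d{7}")


def classify_link_value(value):
    """Return 'aleph', 'cds' or 'manual' for a 7XX__w / 962__b value."""
    value = value.strip()
    if ALEPH_RE.fullmatch(value):
        return "aleph"
    if value.startswith("00"):
        # Looks like an Aleph number but is malformed, e.g. '00SYST.NUMBER'.
        return "manual"
    if value.isdigit():
        return "cds"
    return "manual"


for sample in ["000144284", "1163043", "00SYST.NUMBER", "0002282392"]:
    print(sample, classify_link_value(sample))
```

Routing malformed values to a separate "manual" bucket keeps placeholders and over-long numbers out of the automatic update.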

@agentilb
Collaborator

agentilb commented Feb 8, 2019

@ludmilamarian regarding those 13 records and the 58 mentioned in #21: I have checked a bit, and in some cases the records are merged or cloned records, which explains the discrepancy between 035 and 970.
But I guess we should find out whether the values in 035 correspond to any 962/775/780/785/787 and check those manually to make sure there is no mismatch; otherwise, those fields can be ignored.
WDYT?

@ludmilamarian
Contributor Author

ludmilamarian commented Feb 11, 2019

770__w
-> 9 records
cds_records_with_770w.txt

2 records need to be fixed manually: @agentilb

  • 344633 ['002518702'] this Aleph ID matches a deleted record: RecID: 832330
  • 229425 ['000223435 : 000144270']

@agentilb
Collaborator

@ludmilamarian
For 229425, this is now corrected.
For 344633, this is an old record that shouldn't be migrated. Records with 980:PERIDUMP are similar to hidden records: they are normally not searchable in CDS and need to be ignored during the migration.

@ludmilamarian
Contributor Author

772__w
-> 7 records: cds_records_with_772w.txt

@ludmilamarian
Contributor Author

ludmilamarian commented Feb 13, 2019

785__w
-> 415 records: cds_records_with_785w.txt
-> 339 records containing Aleph IDs to update

To fix manually @agentilb

  • 1283985 ['Continued by/Continued in part by/Split into/Absorbed by/Merged with', 'TITLE', '00SYST.NUMBER']
  • 1286045 ['Continued by/Continued in part by/Split into/Absorbed by/Merged with', 'TITLE', '00SYST.NUMBER']
  • 1336786 ['Continued by', 'Journal of the royal statistical society. series B (statistical methodology)', 'SYSNO']
  • 1338587 ['split into', 'Sankhyā: The Indian Journal of Statistics, Series A and: Sankhyā: The Indian Journal of Statistics, Series B', 'SYSNO1 : SYSNO2']
  • 1339672 ['Formed by the merger of', 'Sankhyā: The Indian Journal of Statistics, Series B', '002956689', 'Continued by', 'Sankhyā: The Indian Journal of Statistics', 'SYSNO']
  • 1339675 ['Formed by the merger of', 'Sankhyā: The Indian Journal of Statistics, Series A', '002956689l', 'Continued by', 'Sankhyā: The Indian Journal of Statistics', 'SYSNO']
  • 1339683 ['Split into', 'The Annals of Probability and: The Annals of Statistics', 'SYSNO1 : SYSNO2']
  • No exact match for 229529, with aleph ID: 00265544
  • No exact match for 624916, with aleph ID: 000143780
  • No exact match for 229762, with aleph ID: 000243127
  • No exact match for 229781, with aleph ID: 000143844
  • No exact match for 525593, with aleph ID: 0002176782
  • No exact match for 1336790, with aleph ID: 002954847
  • No exact match for 558787, with aleph ID: 002316410

@ludmilamarian
Contributor Author

ludmilamarian commented Feb 13, 2019

780__w
-> 433 records cds_records_with_780w.txt
-> 360 records updated

To fix manually @agentilb

  • No exact match for 342897, with aleph ID: 000144284
  • No exact match for 427516, with aleph ID: 0002282392
  • No exact match for 937657, with aleph ID: 000302119
  • No exact match for 550674, with aleph ID: 002925371
  • No exact match for 1283985, with aleph ID: 00SYST.NUMBER
  • No exact match for 1286045, with aleph ID: 00SYST.NUMBER

@ludmilamarian
Contributor Author

ludmilamarian commented Feb 13, 2019

962__b
-> ~194'000 records with a 962__b value, thus we need to identify a mechanism to select the ones that refer to Aleph. @agentilb I need your help on this

  • It looks like 962__l sometimes keeps the provenance. The current possible values are 'CER01', 'CERCER', 'ADMBUL', 'CER', 'MMD', 'PHOPHO'. Can this information be used to know what needs to be updated? Can we assume that CER01, CERCER and CER are the same thing? Is ADMBUL something that we can use? Looking at a few examples, I could not find the corresponding ID in CDS.
  • If the 962 does not contain an l subfield, but the 962__b starts with 00, can I assume these are aleph numbers without checking anything else? (currently 26'234 records); it looks like in many cases there is a 962__n that holds a conference code. I can use this information to make sure the replacement of the aleph number with the CDS Recid is correct. WDYT?

@agentilb
Collaborator

Hi @ludmilamarian

In principle, only 962__b starting with 00 should be converted (at least for the articles, books, proceedings and standards). In most cases, indeed, there is a $$n with a conference code. In this case, it is actually a good idea to check against the 111__g of the corresponding record, if it is not too complicated.
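The check against 111__g might look like the sketch below. It assumes both values are available as plain strings and that differences in case or surrounding whitespace should be ignored, which is a guess rather than a documented CDS rule. The helper name and the sample conference codes are hypothetical.

```python
def conference_code_matches(link_962n, target_111g):
    """Compare a 962__n code with the target record's 111__g value."""
    if not link_962n or not target_111g:
        # A missing code on either side cannot confirm the link.
        return False
    return link_962n.strip().lower() == target_111g.strip().lower()


print(conference_code_matches("geneva20010115", "Geneva20010115 "))
print(conference_code_matches("geneva20010115", "paris20010115"))
print(conference_code_matches("geneva20010115", None))
```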

Then for the records that have 962__lCER or CERCER, it concerns mostly multimedia and archive records. In this case, I have the feeling that even if there is no 00 in 962__b the number is an Aleph number. The number of cases here is small, so I can check those myself.

For the 962__l where the value is 'ADMBUL', 'MMD' or 'PHOPHO', it concerns mostly pictures. We have to check whether there are Aleph numbers in there.

@ludmilamarian
Contributor Author

@agentilb indeed the 962__l is tricky. For example: https://cds.cern.ch/record/1299149/export/hm?ln=en has 962__lCER but the 962__b is a record ID, not an aleph number: https://cds.cern.ch/record/827684. I am sure I saw cases yesterday with 962__lCER but the 962__b was not a record ID. In this case it will be difficult to treat them with a script. But I will let you investigate further, maybe you can discover a rule that we can apply. Meanwhile, I will focus on 962__b that starts with 00

@ludmilamarian
Contributor Author

@agentilb records with 962__l: recs_with_962l.txt

@agentilb
Collaborator

Hi @ludmilamarian

I have cleaned all the records with 962__l:'CER'. I.e. checking if the number was Aleph or CDS id, correcting when necessary, and deleting the 962__l.
There is still one task pending with multi-record editor but by tomorrow, this will be cleaned for all Library and Archive records

962__l:PHOPHO-> it seems to correspond to records (mostly Bulletin articles) linked with photos.
In this case, the corresponding record has a 035__ with $$9PHOPHO
Ex: https://cds.cern.ch/record/46124 linked to https://cds.cern.ch/record/43022?ln=en

962__l:MMD it seems to correspond to records (mostly Bulletin articles) linked with photos.
In this case, the corresponding record has 970__a:'MMD' (and/or 035__9:'MMD')
Ex: https://cds.cern.ch/record/749053 linked to https://cds.cern.ch/record/615876

962__l:ADMBUL it seems to correspond to photo records linked with Bulletin issues (all are from the years 2000-2001).
In this case, the corresponding record has 035:'ADMBUL'
Ex: https://cds.cern.ch/record/41801 linked to https://cds.cern.ch/record/44476?ln=en

Those will need to be modified at some point, but they are not part of the current migration.

@ludmilamarian
Contributor Author

@agentilb Perfect, thank you! I created a new ticket #24 to have in mind for the photo migration.

@ludmilamarian
Contributor Author

@agentilb 962__b is a bit more tricky but we're getting there.
Out of 25'746 records to be fixed, 18'467 have a clear solution (searching for both aleph number and conference id gives only one result).
7'580 cases are not that straightforward, as the search for both Aleph number and conference ID gives either 0 or more than 1 result.
For these 7'580 I will now search only for the Aleph ID and, if no results are found, search for the conference ID. Do you think this approach is OK? Would you like a more detailed log for these 7'580?
I will bibupload the 18'467 records that seem ok, and then work on figuring out the rest.
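The fallback order described in this comment can be sketched as follows. The in-memory `records` dictionary stands in for a CDS search and is purely illustrative, as are the function names and the sample IDs. Treating "more than one hit" as a manual case is an assumption.

```python
def search(records, aleph_id=None, conference=None):
    """Return recids whose fields match every criterion that is given."""
    hits = []
    for recid, fields in records.items():
        if aleph_id is not None and fields.get("aleph") != aleph_id:
            continue
        if conference is not None and fields.get("conference") != conference:
            continue
        hits.append(recid)
    return hits


def resolve_link(records, aleph_id, conference):
    """Strict search first, then Aleph ID only, then conference ID only."""
    hits = search(records, aleph_id=aleph_id, conference=conference)
    if len(hits) == 1:
        return hits[0], "aleph+conference"
    hits = search(records, aleph_id=aleph_id)
    if len(hits) == 1:
        return hits[0], "aleph"
    if not hits:
        hits = search(records, conference=conference)
        if len(hits) == 1:
            return hits[0], "conference"
    # Zero or several candidates: leave the link for manual review.
    return None, "manual"


# Hypothetical records, not real CDS data.
records = {
    101: {"aleph": "000111111", "conference": "conf01"},
    102: {"aleph": "000222222", "conference": "conf02"},
    103: {"aleph": None, "conference": "conf03"},
}
print(resolve_link(records, "000111111", "conf01"))
print(resolve_link(records, "000222222", "typo02"))
print(resolve_link(records, "000999999", "conf03"))
```

Returning the rule that produced each match alongside the recid makes it easy to log the weaker matches separately.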

@agentilb
Collaborator

Hi @ludmilamarian
18,5k records that have a clear solution is already a good thing! :-)
for the 7580 remaining records, there are maybe the cases of book chapters, that do not link to conference proceedings but to books, in this case, there is "book" in 962__n. Here, we can only rely on the Aleph number, and for sanity-check, verify that the corresponding record is a BOOK.
I believe this should work for a large part of those records.
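A sketch of that book-chapter case, assuming each candidate record exposes its Aleph number and a list of collection tags (for instance taken from 980__a). Both field names and the sample data are hypothetical.

```python
def resolve_book_link(records, aleph_id):
    """Resolve a 962__b by Aleph ID only, accepting the target if it is a BOOK."""
    hits = [
        recid
        for recid, fields in records.items()
        if fields.get("aleph") == aleph_id
        and "BOOK" in fields.get("collections", ())
    ]
    # Accept only an unambiguous match; anything else needs manual review.
    return hits[0] if len(hits) == 1 else None


# Hypothetical records, not real CDS data.
records = {
    201: {"aleph": "000333333", "collections": ["BOOK"]},
    202: {"aleph": "000444444", "collections": ["ARTICLE"]},
}
print(resolve_book_link(records, "000333333"))
print(resolve_book_link(records, "000444444"))
```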

@ludmilamarian
Contributor Author

@agentilb indeed, I discovered this case with book instead of conference id (I forgot to mention this!) and I only look for the aleph id in that case. I know that restricting to the book collection is going to help in this case - I have not yet done that

@ludmilamarian
Contributor Author

ludmilamarian commented Feb 22, 2019

@agentilb looking at the book cases, I see quite a lot of searches failing (no aleph ID) - 1'617 cases.
Investigating one case, I discovered that in the past some book updates fully replaced the MARCXML on CDS, erasing the 970/035. For example: https://cds.cern.ch/record/edit/compare_revisions?recid=214115&rev1=20130912172023&rev2=20130321051727
I was trying to find a match for https://cds.cern.ch/record/857495. Do you have any pointers on how to address this issue? There is the possibility of going back in the history for all the book records, but I would really keep this as the last option, as it would be very costly in terms of time and computation.

@agentilb
Collaborator

agentilb commented Feb 22, 2019

Hi @ludmilamarian, I fear this concerns one specific collection of books that was curated by one student in 2013. Unfortunately, I don't think we can easily retrieve the 970/035, but this should concern fewer than 200 records. If I give you a selection of some 143 recids, is it easy to check the historical version before this revision, 2013-09-12 17:20:23 (I hope it was done in one go...)?
https://cds.cern.ch/search?ln=en&cc=Books&sc=1&p=%22Landolt-Börnstein%22+and+964%3A%27*%27+and+916%3A201337&action_search=Search&op1=a&m1=a&p1=&f1=&c=Books&c=Book+Proposals&wl=0
The corresponding articles should be here I believe:
https://cds.cern.ch/search?ln=en&sc=1&p=LANDOLTBORNSTEIN1&action_search=Search&op1=a&m1=a&p1=&f1=&c=Articles+%26+Preprints&c=Books+%26+Proceedings&c=Presentations+%26+Talks&c=Periodicals+%26+Progress+Reports&c=Multimedia+%26+Outreach
And if you have a look at the 962, there are only 27 unique Aleph numbers.
So my assumption is that we need to match those 27 Aleph numbers with the 143 books; I hope this solves most of the cases you mention. Do you think this is feasible without being too costly?
Or do you want me to look at the logs of those errors first, to see if the problem is wider than this specific collection?

@ludmilamarian
Contributor Author

ludmilamarian commented Feb 26, 2019

Hi @agentilb I fixed most of the linking, we are missing 1'895 links (so 1895 Aleph IDs still in 962__b) that have not been straightforward to fix.
https://cds.cern.ch/search?ln=en&sc=1&p=962__b%3A%2F%5E00%5Ba-z%5D*%2F+-962__l%3A%2F%5Ba-z%5D*%2F&action_search=Search&op1=a&m1=a&p1=&f1=&c=Articles+%26+Preprints&c=Books+%26+Proceedings&c=Presentations+%26+Talks&c=Periodicals+%26+Progress+Reports&c=Multimedia+%26+Outreach

Out of these:

  • 279 links have a conference code -> in total 74 unique conference codes. Unfortunately, the exact search either does not find these codes or finds two records for a conference code, so there are either typos or partial codes. Please take a look at the list below. Their associated Aleph IDs are not findable either. Some of them can be found linked to records that have been deleted, for example: https://cds.cern.ch/record/edit/?ln=en&#state=edit&recid=649525 At this point, I don't think there is anything I can do on my side, so I will leave the list with you to see if you can find a solution.
    aleph_linking_conference_codes.txt

  • 1'615 links have 962__n:book -> in total 49 unique Aleph IDs; please see the file below (format: aleph_id number_of_records_affected). If you can take a look at a few and see whether most of them are due to the student work or whether there might be something else, that would be very helpful. For a small number of records (a few hundred) I can go back in the history to see if there is an Aleph ID.
    aleph_linking_book_ids.txt
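A listing in the `aleph_id number_of_records_affected` format could be produced along these lines. This is only a sketch: the input is assumed to be a list of (recid, aleph_id) pairs for the unresolved links, and the sample pairs are invented.

```python
from collections import Counter


def summarize_unresolved(links):
    """Count affected records per Aleph ID from (recid, aleph_id) pairs."""
    counts = Counter(aleph_id for _recid, aleph_id in links)
    # One 'aleph_id number_of_records_affected' line per ID, most frequent first.
    return [f"{aleph_id} {n}" for aleph_id, n in counts.most_common()]


# Hypothetical unresolved links, not real CDS data.
links = [(1, "000111111"), (2, "000222222"), (3, "000111111")]
for line in summarize_unresolved(links):
    print(line)
```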

@agentilb
Collaborator

agentilb commented Mar 1, 2019

Hi @ludmilamarian, thanks! I'll have a look at those lists and let you know if something else can be done.

@ludmilamarian
Contributor Author

@agentilb let me know how you wish to proceed, I would like to try to finish the task this week :-)

@agentilb
Collaborator

Hi @ludmilamarian, I started to study those lists manually with an intern, and I think we will be able to handle them on our side. This should be done by the end of the week. So no further action is required from you, I guess :-)

@ludmilamarian
Contributor Author

That is fantastic news @agentilb ! This means we are close to having this project finished! I will leave the ticket open until you confirm that everything is fixed. Also, there are just a few cases left for 780__w: some checkboxes from one of the previous comments.

@agentilb
Collaborator

@ludmilamarian the 780 are now done.

@agentilb
Collaborator

All Aleph numbers have now been cleaned in the 962__b.
