Books: clean 035 field #167

Closed
ludmilamarian opened this issue Oct 17, 2018 · 22 comments

Comments

@ludmilamarian
Contributor

ludmilamarian commented Oct 17, 2018

parent of #169
blocked by CERNDocumentServer/cds-migrator-kit#15

Currently, these are the possible values in 035__9:

set(['CNUM-INSPIRE', 'IEEECONF', 'inspire', 'INSPIRE-CNUM', 'SCEM', 'arXiv', 'LHCLHC', 'SPIRES', 'DOE', '926408', 'ADMADM', 'Iinspire-CNUM', 'Inspire-CNUM', 'CERCER', 'INDICO.CERN.CH', 'SLACCONF', 'INIS', 'INPIRE-CNUM', 'DLC', 'HAL', 'DESY', 'FIZ', 'WAI01', 'http://inspirehep.net/oai2d', 'Isnpire-CNUM', 'INSPRIE-CNUM', 'SAFARI', 'Isnpire', 'SLAC', 'INSPIRE', 'AgendaMaker', 'KEK', 'Inspire', 'EBL', '290555', 'INSPEC', 'CERN annual report', 'CERN', '273873', 'inspire-CNUM']) but it looks like only the inspire-cnum/inspirecnum are being treated in a slightly different way.

There are several things here that I think @agentilb could help with:
i) the normalization of these values; some of them are obviously typos (Isnpire).
ii) it looks like inspire appears in various forms; do they all relate to the same inspire_cnum? If yes, they should be migrated to the same field in the new data model.
iii) the cleaning of these values, as there are also numbers there, which might be a mistake and may need to go in another subfield.
iv) only CERCER should potentially disappear (but only after doing CERNDocumentServer/cds-migrator-kit#15), or are there others that could be ignored?
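
A minimal sketch of what points i) and iii) could look like in Python. The typo map is hypothetical: it is built only from the misspellings visible in the list above, and mapping 'cnum-inspire' to 'inspire-cnum' is an assumption. The function name `normalize_source` is also made up for illustration and is not part of cds-dojson.

```python
# Hypothetical typo map, built from the misspellings listed in this issue.
TYPO_FIXES = {
    "isnpire": "inspire",
    "iinspire-cnum": "inspire-cnum",
    "inpire-cnum": "inspire-cnum",
    "insprie-cnum": "inspire-cnum",
    "isnpire-cnum": "inspire-cnum",
    "cnum-inspire": "inspire-cnum",  # assumption: same identifier, words swapped
}


def normalize_source(value):
    """Return a lowercased, typo-corrected 035__9 value.

    Returns None for purely numeric values, which probably belong in
    another subfield and need manual review.
    """
    cleaned = value.strip().lower()
    if cleaned.isdigit():
        return None
    return TYPO_FIXES.get(cleaned, cleaned)
```

Lowercasing first means 'INSPIRE', 'Inspire' and 'inspire' all collapse to one value before the typo map is consulted.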

@agentilb
Contributor

agentilb commented Oct 19, 2018

I have cleaned the obviously wrong occurrences (typo, wrong value, numbers...)

Indeed only CERCER should disappear (once the linking with old Aleph numbers is solved).
I wonder what ADMADM is; there are several thousand records with it. I am not sure what it refers to. It seems to be something similar to CERCER but for other types of records, but one should investigate before deleting. @ludmilamarian do you have a way to see if those are used in some way in the system?

Also, there is 'http://inspirehep.net/oai2d'; I am not sure this information should be stored here. But I leave it to you to decide; it seems to me this is more "technical information" about the harvest of the record.

For Inspire, we can link in 2 ways:

  • Via INSPIRE-CNUM -> which is the Inspire Conference Code, in this case you find 035__9:INSPIRE-CNUM (whatever the case is)
  • Via Inspire recid -> which links to the Inspire corresponding records, in this case, you find 035__9:Inspire (whatever the case is)

Hope this helps,

Anne

@ludmilamarian
Contributor Author

ludmilamarian commented Oct 19, 2018

Great, thank you @agentilb !

I have recomputed the list.

['IEEECONF', 'inspire', 'INSPIRE-CNUM', 'SCEM', 'arXiv', 'LHCLHC', 'SPIRES', 'DOE', 'ADMADM', 'Iinspire-CNUM', 'Inspire-CNUM', 'CERCER', 'INDICO.CERN.CH', 'SLACCONF', 'INIS', 'DLC', 'HAL', 'DESY', 'FIZ', 'WAI01', 'http://inspirehep.net/oai2d', 'SAFARI', 'SLAC', 'Inspire', 'AgendaMaker', 'KEK', 'INSPIRE', 'EBL', 'INSPEC', 'CERN annual report', 'inspire-CNUM']

There was another typo for Iinspire-CNUM I have just fixed.

These are the things left to decide or to fix in order to correctly migrate 035:

  • @agentilb Should CERN annual report be suppressed? It does not look like an identifier from another system
    -> I think they were used in the past to generate list of CERN publications. I don't think it is still used today. But I'll confirm that later.

  • @agentilb We have SLAC, SPIRES, SLACCONF. Shouldn't they be replaced by Inspire? I think these ids are obsolete now, what do you think?
    -> Indeed, but first we need to check that all records have a corresponding Inspire and/or Inspire-cnum. By the time of the migration, that should be done, so we can indeed ignore those.

  • @agentilb LHCLHC is just 1 record, do you think we can remove the field? https://cds.cern.ch/record/472405/export/hm?ln=en (actually this same record has also WAI01 which is also only used once, so maybe we could remove both)
    -> There is one record in the Books but more than 300 in the other collections. It looks like it is something similar to CERCER and ADMADM, from when the record was not in CDS (it concerns old records). I believe this can simply be ignored in the migration.

  • @agentilb INSPEC is used only 5 times: https://cds.cern.ch/search?cc=Books+%26+Proceedings&p=035__9%3A%22INSPEC%22 do you think it is something that we could maybe remove?
    -> Ok removed for the books and proceedings. I'll check if we can remove for the other collections as well.

  • @ludmilamarian I propose to skip http://inspirehep.net/oai2d during the migration, as this is information that we add in another field as well, plus the inspire id is also stored separately.

  • @ludmilamarian ADMADM is probably a previous database used for CERN committee documents. All 5 records in Books and Proceedings that have this tag are actually committee documents. As far as I can see, they are part of the Design Reports collection, one of the collections that we said we are not migrating yet, as its records do not circulate, so there is no action to take. But I will make sure to remove the records from the list of records to migrate. [US] Books: migrate collections cds-migrator-kit#2

  • @ludmilamarian For AgendaMaker I will check with our colleagues as the links can not be resolved. Possibly replace them with correct IDs

  • @agentilb Checking the records containing IDs from the AgendaMaker https://cds.cern.ch/search?ln=en&cc=Books+%26+Proceedings&sc=1&p=035__9%3A%22AgendaMaker%22 it looks like they might have been wrongly catalogued as proceedings. Most of them are the recordings of the conference. There are only 26 so maybe you can take a look and let me know what you think.

  • @ludmilamarian I already created a task to clean the CERCER (aka Aleph numbers) Books: clean aleph numbers cds-migrator-kit#15

@ludmilamarian
Contributor Author

@agentilb regarding slac/spires I have just run a quick script and there are ~20 records that have SLACCONF but no INSPIRE-CNUM. All the rest of the records that mention spires/slac also contain inspire numbers. I have created a new ticket for this: #169
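
The script itself is not shown in this thread. A check of this kind could be sketched as follows, assuming each record is available as a list of (035__9, 035__a) pairs keyed by recid. That layout, the function name and the sample values are all hypothetical.

```python
def missing_inspire_cnum(records):
    """Return the recids that carry a SLACCONF id but no INSPIRE-CNUM id.

    `records` maps a recid to a list of (source, value) pairs taken from
    035__9 and 035__a; this layout is an assumption made for the example.
    """
    missing = []
    for recid, identifiers in records.items():
        # Compare case-insensitively, since the source tag appears in many casings.
        sources = {source.strip().lower() for source, _ in identifiers}
        if "slacconf" in sources and "inspire-cnum" not in sources:
            missing.append(recid)
    return sorted(missing)


# Made-up sample data, only to show the expected input shape.
sample = {
    1: [("SLACCONF", "C00-00-00"), ("Inspire-CNUM", "C00-00-00")],
    2: [("SLACCONF", "C11-11-11")],
    3: [("Inspire", "12345")],
}
to_review = missing_inspire_cnum(sample)
```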

@ludmilamarian
Contributor Author

ludmilamarian commented Oct 25, 2018

I have recomputed the list of values (based on all records that we need to migrate), and based on the comments above these are the things that need to be done:

List of tags to keep:
['inspire', 'ebl', 'lhclhc', 'dlc', 'hal', 'fiz', 'wai01', 'indico.cern.ch', 'scem', 'inis', 'udccer', 'ieeeconf', 'kek', 'desy', 'safari', 'inspire-cnum', 'arxiv', 'doe' ]

List of tags to fix:

  • 'isnpire'
  • '926408'
  • 'agendamaker'
  • 'cercer'
  • 'insprie-cnum'
  • '290555'
  • 'isnpire-cnum'
  • 'cnum-inspire'

To ignore:
'slac'
'slacconf'
'http://inspirehep.net/oai2d'
'spires'
'cern annual report'
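
The three groups above can be turned into a simple lookup. This is only a sketch using the values from this comment; it is not the cds-dojson implementation, and the names `KEEP`, `IGNORE` and `classify` are made up for illustration. Anything that is neither kept nor ignored is reported as needing a fix.

```python
# Values copied from the "keep" and "ignore" lists in this comment.
KEEP = frozenset([
    "inspire", "ebl", "lhclhc", "dlc", "hal", "fiz", "wai01",
    "indico.cern.ch", "scem", "inis", "udccer", "ieeeconf", "kek",
    "desy", "safari", "inspire-cnum", "arxiv", "doe",
])
IGNORE = frozenset([
    "slac", "slacconf", "http://inspirehep.net/oai2d", "spires",
    "cern annual report",
])


def classify(value):
    """Return 'keep', 'ignore' or 'fix' for a raw 035__9 value."""
    cleaned = value.strip().lower()
    if cleaned in KEEP:
        return "keep"
    if cleaned in IGNORE:
        return "ignore"
    # Typos, bare numbers and unresolved tags (agendamaker, cercer) end up here.
    return "fix"
```

Treating "fix" as the fallback means a new, unexpected value is flagged for review rather than silently migrated or dropped.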

@agentilb
Contributor

Strange: all the typos were corrected on 19/10, and the corrections were simply removed during the following night.
See:
https://cds.cern.ch/record/edit/compare_revisions?recid=113410&rev1=20181019112212&rev2=20180611195506
and
https://cds.cern.ch/record/edit/compare_revisions?recid=113410&rev1=20181019112212&rev2=20181019220255

Do you understand why?

@ludmilamarian
Contributor Author

It looks like it was a huge multiedit (it updated ~6000 records). It was submitted during the day, at 11:14, but was executed in the evening due to the large number of records. This multiedit touched only the 035. Do you remember if you or one of your colleagues might have run it? Unfortunately, multi-edit is not "smart" enough to detect whether the records it touches have been updated between the time of the edit and the time of its execution.

@agentilb
Contributor

I did several multi-edits that day to clean records. I don't remember having done one that touched so many records, but it was probably me. What I never realised is that multiedit modifications touch all records of the search results, even if the modification doesn't apply to them. I guess that's the cause of this.

@agentilb
Contributor

'cern annual report' can also be ignored.

@ludmilamarian
Contributor Author

Regarding the agendamaker and records that potentially are lectures, I will summarize here my findings:

Our colleagues from Indico gave us the corresponding new IDs, however, for past videos I think we can not replace them yet, because some recordings might be stored in folders named based on the old ID. This will be sorted out when we migrate lectures.

I think we need to see what we should migrate as part of books, and then do changes only on these records.

Once you decide which of these records should be kept, I can replace the previous agendamaker IDs with Indico IDs. Also, I see now that some records are mixing video recordings and proceedings. I think if we plan to migrate some of these records we should create separate records for the proceeding and for the video. The records can link with each other, but one record can not be both a video and a proceeding. Let me know what you think.

@agentilb
Contributor

For: 26 records pointing to old events in indico (agendamaker): https://cds.cern.ch/search?ln=en&cc=Books+%26+Proceedings&sc=1&p=035__9%3A%22AgendaMaker%22

Can you let me know the new Indico ids of those events, so I can check what content is on Indico, and decide what are real proceedings or not? Indeed, I guess most of them will need to be filtered out...

@ludmilamarian
Contributor Author

http://cds.cern.ch/record/517300 -> http://indico.cern.ch/event/408840
http://cds.cern.ch/record/726659 -> http://indico.cern.ch/event/041075
http://cds.cern.ch/record/733606 -> http://indico.cern.ch/event/041428
http://cds.cern.ch/record/741349 -> http://indico.cern.ch/event/415024
http://cds.cern.ch/record/801691 -> http://indico.cern.ch/event/419824
http://cds.cern.ch/record/821452 -> http://indico.cern.ch/event/042407
http://cds.cern.ch/record/822407 -> http://indico.cern.ch/event/419871
http://cds.cern.ch/record/931361 -> http://indico.cern.ch/event/411250
http://cds.cern.ch/record/933801 -> http://indico.cern.ch/event/427826
http://cds.cern.ch/record/933801 -> http://indico.cern.ch/event/427833
http://cds.cern.ch/record/933801 -> http://indico.cern.ch/event/427867
http://cds.cern.ch/record/933801 -> http://indico.cern.ch/event/427929
http://cds.cern.ch/record/933801 -> http://indico.cern.ch/event/427948
http://cds.cern.ch/record/933801 -> http://indico.cern.ch/event/427973
http://cds.cern.ch/record/933801 -> http://indico.cern.ch/event/427975
http://cds.cern.ch/record/935121 -> http://indico.cern.ch/event/432271
http://cds.cern.ch/record/935131 -> http://indico.cern.ch/event/411417
http://cds.cern.ch/record/973361 -> http://indico.cern.ch/event/417197
http://cds.cern.ch/record/973362 -> http://indico.cern.ch/event/417832
http://cds.cern.ch/record/973364 -> http://indico.cern.ch/event/417559
http://cds.cern.ch/record/973372 -> http://indico.cern.ch/event/417574
http://cds.cern.ch/record/974801 -> http://indico.cern.ch/event/419450
http://cds.cern.ch/record/974803 -> http://indico.cern.ch/event/419817
http://cds.cern.ch/record/974806 -> http://indico.cern.ch/event/420145
http://cds.cern.ch/record/975187 -> http://indico.cern.ch/event/423040
http://cds.cern.ch/record/975421 -> http://indico.cern.ch/event/424522
http://cds.cern.ch/record/975439 -> http://indico.cern.ch/event/424234
http://cds.cern.ch/record/975440 -> http://indico.cern.ch/event/424677
http://cds.cern.ch/record/976076 -> http://indico.cern.ch/event/427250
http://cds.cern.ch/record/976076 -> http://indico.cern.ch/event/427251
http://cds.cern.ch/record/977394 -> http://indico.cern.ch/event/427321
http://cds.cern.ch/record/987830 -> http://indico.cern.ch/event/409154
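
Several records in this list map to more than one Indico event (933801 and 976076, for example). A small sketch for grouping the lines by record, assuming the "record URL -> event URL" format used above; the function name `group_events` is hypothetical. IDs are kept as strings so that leading zeros such as in 041075 are preserved.

```python
from collections import defaultdict


def group_events(lines):
    """Group 'record URL -> event URL' lines into {recid: [event ids]}."""
    mapping = defaultdict(list)
    for line in lines:
        if "->" not in line:
            continue  # skip blank or unrelated lines
        record_url, event_url = (part.strip() for part in line.split("->", 1))
        recid = record_url.rsplit("/", 1)[-1]
        event_id = event_url.rsplit("/", 1)[-1]
        mapping[recid].append(event_id)
    return dict(mapping)


# Three lines copied from the list above.
grouped = group_events([
    "http://cds.cern.ch/record/726659 -> http://indico.cern.ch/event/041075",
    "http://cds.cern.ch/record/976076 -> http://indico.cern.ch/event/427250",
    "http://cds.cern.ch/record/976076 -> http://indico.cern.ch/event/427251",
])
```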

@agentilb
Contributor

agentilb commented Oct 30, 2018

From the list above, we keep for sure:
http://cds.cern.ch/record/517300 ->  http://indico.cern.ch/event/408840
http://cds.cern.ch/record/741349 -> http://indico.cern.ch/event/415024
http://cds.cern.ch/record/821452 ->  http://indico.cern.ch/event/042407
http://cds.cern.ch/record/822407 ->  http://indico.cern.ch/event/419871

For:
http://cds.cern.ch/record/726659 ->  http://indico.cern.ch/event/041075
http://cds.cern.ch/record/733606 ->  http://indico.cern.ch/event/041428
-> they are not anymore in the Proceedings collection.

The others are events where there are only transparencies and/or videos, and in many cases, the Indico page has restricted access. They shouldn't be in the proceedings collection, I believe.

ludmilamarian added a commit to ludmilamarian/cds-dojson that referenced this issue Oct 30, 2018
* Adds new lists for external identifiers in order to determine
  which values should be allowed and which to be ignored.
  (closes CERNDocumentServer#167)

Signed-off-by: Ludmila Marian <[email protected]>
@ludmilamarian
Contributor Author

@agentilb what do you think we should do with these records? Should we update the 4 that should remain in the proceedings and move the rest to the lectures collection, or do you have another suggestion?

ludmilamarian added a commit that referenced this issue Nov 1, 2018
* Adds new lists for external identifiers in order to determine
  which values should be allowed and which to be ignored.
  (closes #167)

Signed-off-by: Ludmila Marian <[email protected]>
@agentilb
Contributor

Hi, I have moved all records that were not proceedings in Conference for the time being.
For the ones that stay in the proceedings, can I remove the AgendaMaker id, if we have the new Indico link?

There is only one record left in the proceedings with 340:'Streaming video'. It links to the Indico where the videos are.
I would be in favour of removing the 340:'Streaming video'. Is it ok for you?
https://cds.cern.ch/record/537931

@ludmilamarian
Contributor Author

For the 5 remaining records: https://cds.cern.ch/search?ln=en&cc=Books+%26+Proceedings&sc=1&p=035__9%3A%22AgendaMaker%22 :

  • I see that these records (I only checked 2) are for proceedings, but there are other records in CDS for the videos. Did you reach the same conclusion? In this case, should we remove the agenda maker from 035 and only leave the link to the Indico event as an external resource? I was initially thinking we need to replace the agendamaker ID with the Indico ID, but now, seeing that there are other records in CDS for the videos, it is probably better if they kept the 035 towards Indico, rather than the proceedings. Was this your suggestion as well? Potentially, we could also link the video records with the proceedings records; that would be nice for people looking for the videos, so they do not have to pass via Indico for the link.
  • One of these 5 videos has several links to several events. Are you planning to remove them? I'm not sure the data model supports this information in 518__
  • Looking at the holdings some of them have DVDs. I was under the impression that we agreed to delete these holdings but now I am not sure about it. In the case we keep them, definitely we should link to the CDS record for the video, as it is the online copy of the DVD probably.

Let me know if you want me to help with any of the above actions.

@agentilb
Contributor

I've indeed removed the AgendaMaker id and added the link from the video to the proceedings for the 2 records: https://cds.cern.ch/record/821452 and https://cds.cern.ch/record/933801

For: https://cds.cern.ch/record/741349?ln=en if I'm not mistaken, there is no video on CDS and the link seems broken from Indico. Do you think something is retrievable from AgendaMaker? Should we keep it?
For: https://cds.cern.ch/record/517300?ln=en
and https://cds.cern.ch/record/822407 there is no video at all on Indico and we do not have the DVD, do you think the videos are somewhere? If no, I would delete AgendaMaker and the Video information.

For the DVD holdings, I don't think we decided to delete them, we would need to do some cleaning beforehand, and this will be time consuming. Is it a problem to migrate them?

@agentilb
Contributor

For this one: https://cds.cern.ch/record/537931?ln=en I've also linked the CDS videos to the record.

@ludmilamarian
Contributor Author

Hi @agentilb From my side, I think it is ok to close this ticket. I have updated 2 or 3 records from agendaMaker to new Indico IDs. I think the DVDs issue is handled in another ticket. Let me know if there is still something to be addressed here.

@agentilb
Contributor

agentilb commented Feb 1, 2019

Hi @ludmilamarian I noticed one more thing for the 035 field.
for the arxiv OAI information stored in 035 such as in:
https://cds.cern.ch/record/edit/?ln=en#state=edit&recid=2240869
do you confirm this can be ignored during the migration or should it be stored somewhere?

Otherwise, I think we can close the ticket.

@ludmilamarian
Contributor Author

@agentilb do you want to ignore completely any 035 that has 035$$9arXiv ? Or we should keep $$9 and $$a and ignore the rest. We currently keep $$9 and $$a, but there is nothing defined for the rest as far as I can see.

@agentilb
Contributor

agentilb commented Feb 4, 2019

@ludmilamarian If you do not need it on your side, I think we can completely ignore 035 when $$9 is arxiv, since the arxiv number is normally stored in 037.
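
As a rough illustration of that proposal, the filter below drops every 035 field whose $$9 is arxiv, whatever the casing. It assumes each 035 field is a dict keyed by subfield code, which is a simplification made for this example and not necessarily how the migration rules represent fields. The function name and the sample values are hypothetical.

```python
def drop_arxiv_035(fields):
    """Return the 035 fields whose $$9 is not 'arxiv' (case-insensitive).

    Each field is assumed to be a dict keyed by subfield code, e.g.
    {"9": "arXiv", "a": "..."}; this layout is hypothetical.
    """
    return [
        field for field in fields
        if field.get("9", "").strip().lower() != "arxiv"
    ]


# Made-up sample fields, only to show the expected input shape.
kept = drop_arxiv_035([
    {"9": "arXiv", "a": "oai:arXiv.org:0000.00000"},
    {"9": "Inspire", "a": "12345"},
])
```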

@ludmilamarian
Contributor Author

Makes sense @agentilb ! I have created a new ticket for this #196 just because this one is getting bigger and bigger. If this was the only thing left, then I will close this one.

jrcastro2 pushed a commit that referenced this issue Jun 21, 2024
* Adds new lists for external identifiers in order to determine
  which values should be allowed and which to be ignored.
  (closes #167)

Signed-off-by: Ludmila Marian <[email protected]>