10 TB -> 10 GB: Fix metadata w/ mk-zim-cat-item.py & mk-zim-cat.py [C'est pas sorcier, CrashCourse, TED-Ed, GCFAprendeLibre] #14

Merged
merged 2 commits into from
Jan 22, 2023

Conversation

holta
Contributor

@holta holta commented Jan 19, 2023

@tim-moody please review before merging, to confirm this is sufficiently correct?

@holta holta requested a review from tim-moody January 19, 2023 01:18
@holta
Contributor Author

holta commented Jan 19, 2023

FYI this PR is nothing more than an attempt at cleaning up the metadata for these three ~10GB ZIM files:

Long story short, this PR is nothing more than the automated results arising from running:

cd /opt/iiab-share/iiab-content/catalogs
./mk-zim-cat-item.py cest-pas-sorcier_fr_top_2021-01.zim --source https://s3.us-east-2.wasabisys.com/iiab-zims/
./mk-zim-cat-item.py crashcourse_en_top_2021-01.zim --source https://s3.us-east-2.wasabisys.com/iiab-zims/
./mk-zim-cat-item.py teded_en_top_2021-01.zim --source https://s3.us-east-2.wasabisys.com/iiab-zims/
mv cest-pas-sorcier_fr_top_2021-01.json crashcourse_en_top_2021-01.json teded_en_top_2021-01.json zim-cat-fragments
./mk-zim-cat.py
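
As a rough sketch, the same sequence can be generated for any list of ZIM files. The script names, the `--source` flag and the `zim-cat-fragments` directory come from the commands above; the `build_commands` helper is hypothetical and only prints the commands (a dry run), it does not execute anything.

```python
import shlex

# Values copied from the commands above.
SOURCE = "https://s3.us-east-2.wasabisys.com/iiab-zims/"
ZIMS = [
    "cest-pas-sorcier_fr_top_2021-01.zim",
    "crashcourse_en_top_2021-01.zim",
    "teded_en_top_2021-01.zim",
]


def build_commands(zims, source=SOURCE, fragments_dir="zim-cat-fragments"):
    """Hypothetical helper: return the command lines as argument lists."""
    # One mk-zim-cat-item.py call per ZIM file.
    commands = [["./mk-zim-cat-item.py", zim, "--source", source] for zim in zims]
    # Each call is assumed to leave a .json fragment named after the ZIM file.
    fragments = [zim[: -len(".zim")] + ".json" for zim in zims]
    commands.append(["mv", *fragments, fragments_dir])
    # Finally rebuild the combined catalog.
    commands.append(["./mk-zim-cat.py"])
    return commands


if __name__ == "__main__":
    for command in build_commands(ZIMS):
        print(shlex.join(command))
```

Run it from `/opt/iiab-share/iiab-content/catalogs` and paste the output into a shell once it looks right.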

Concluding Questions:

  • Is the above automated process good enough?
  • Mangled and ugly metadata (below) unfortunately appears in each ZIM file's .json fragment, but it seems likely to originate from the ZIM files themselves, so probably we're stuck with it for now?
"tags": "y;o;u;t;u;b;e;_;v;i;d;e;o;s;:;y;e;s;_ftindex:no;_pictures:yes;_videos:yes;_details:yes",

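One possible explanation for the mangled tag above (an assumption, not confirmed against the ZIM creation tooling): joining a plain string with `;` instead of a list of tags produces exactly this one-character-per-tag pattern. The sketch below reproduces it and shows a heuristic repair; `collapse_single_char_run` is a hypothetical helper, not part of `mk-zim-cat-item.py`.

```python
# The tags value as it appears in the .json fragments.
MANGLED = (
    "y;o;u;t;u;b;e;_;v;i;d;e;o;s;:;y;e;s;"
    "_ftindex:no;_pictures:yes;_videos:yes;_details:yes"
)


def collapse_single_char_run(tag_string):
    """Heuristic: glue a leading run of one-character tags back together."""
    parts = tag_string.split(";")
    run = 0
    while run < len(parts) and len(parts[run]) == 1:
        run += 1
    if run < 2:
        # Nothing that looks like a character-split tag; leave it alone.
        return tag_string
    return ";".join(["".join(parts[:run])] + parts[run:])


if __name__ == "__main__":
    # str.join iterates over characters when handed a string, not a list.
    print(";".join("youtube_videos:yes"))
    print(collapse_single_char_run(MANGLED))
```

If that guess is right, the intended first tag was `youtube_videos:yes`, and a hand fix of the fragment would be safe.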
@holta holta requested a review from georgejhunt January 19, 2023 01:37
@holta
Contributor Author

holta commented Jan 19, 2023

ASIDE: How important is the following issue?

@holta
Contributor Author

holta commented Jan 19, 2023

Recap screenshots showing the original problem: (Admin Console ends up showing ~10 TB erroneously, when it should show ~10 GB for these 3 ZIM files)

image

image
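
A rough illustration of how a ~1000x error like this can arise, assuming (unverified) that the catalog's size field is expected in KiB and a byte count was stored instead. `format_kib` is a hypothetical helper, not Admin Console code.

```python
def format_kib(size_kib):
    """Render a size given in KiB with the largest fitting binary unit."""
    units = ["KiB", "MiB", "GiB", "TiB"]
    value = float(size_kib)
    for unit in units:
        if value < 1024 or unit == units[-1]:
            return f"{value:.1f} {unit}"
        value /= 1024


if __name__ == "__main__":
    size_bytes = 10 * 1024**3  # a 10 GiB ZIM file
    print(format_kib(size_bytes // 1024))  # size stored in KiB, as assumed
    print(format_kib(size_bytes))          # bytes stored where KiB was expected
```

Under that assumption the second call is off by a factor of 1024, which matches the 10 GB versus 10 TB symptom.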

@holta
Contributor Author

holta commented Jan 19, 2023

Another way of viewing the proposed (harmless???) SIDE EFFECTS of this PR:

image

@holta
Contributor Author

holta commented Jan 19, 2023

Note that metadata associated with these 3 ZIM files is already problematic even prior to this PR — as seen in this kiwix-serve screenshot — if you look closely:

IMG-20230118-WA0004

So I'd assume this PR does not make the situation any worse? Hopefully @tim-moody can confirm?

@tim-moody
Contributor

I don't see anything wrong with the revised version, except the date, which is probably technically correct as in when the zim was created, but not in sync with the content.

The addition of the funny youtube tag and the first videos tag is not clear to me. Might have been passed as an argument at zim creation and kiwix mangled it. Hard for me to think that kiwix knows this is youtube material and added it.

The important thing is that size is now correct and publisher is now correct, both used by the catalog display. Also, we need the pictures and details tags that are set by kiwix and were missing.

Kiwix reports the path relative to the zims directory, so the ../library is common across all zims and isn't really used.

Is the above automated process good enough?

Well, it works for me, and I was the only one who used it. The problems identified in this ticket were produced by not using it.

But it is hardly ready for mass consumption, so if there are an increasing number of zim creators who can't use it, someone could improve it.

@holta
Contributor Author

holta commented Jan 19, 2023

@georgejhunt can you recommend merging this PR experimentally — or another course of action if further metadata fixes are required?

@holta holta changed the title from "10 TB -> 10 GB: Fix metadata w/ mk-zim-cat-item.py & mk-zim-cat.py" to "10 TB -> 10 GB: Fix metadata w/ mk-zim-cat-item.py & mk-zim-cat.py [C'est pas sorcier, CrashCourse, TED-Ed]" Jan 19, 2023
@holta
Contributor Author

holta commented Jan 19, 2023

ASIDE: @deldesir asks if we can learn from @georgejhunt about any crucial-or-critical metadata hand-fixes that might be advisable in general?

Specifically, do the 3 Basic Electricity ZIM files below illustrate any metadata tips & tricks we should learn from here?

@holta
Contributor Author

holta commented Jan 19, 2023

ASIDE: @tim-moody you're probably aware but just FYI @deldesir did a quick survey of metadata across all 9 of our IIAB Catalog ZIM files:

@tim-moody
Contributor

"Educatonal" is misspelled

I probably mistyped it. Can be fixed by hand.

Not sure that we use favicon since we don't use kiwix as a front end

@tim-moody
Contributor

looks right

@holta
Contributor Author

holta commented Jan 19, 2023

Not sure that we use favicon since we don't use kiwix as a front end

Isn't this logo/favicon automatically extracted when new menudefs are auto-created, to showcase new ZIM files on IIAB's main page? (Or so I thought, maybe I've got this all wrong, apologies if so!)

@tim-moody
Contributor

tim-moody commented Jan 19, 2023

Isn't this logo/favicon automatically extracted when new menudefs are auto-created

not sure if we are still able to do this with the new catalog. in any event I always assumed that if we create a zim we would also create its menu def

@tim-moody
Contributor

anyway, you're right someone should check if we are doing this with the new catalog and implement it if not. These logos look to be 48 x 48 px, which is a little small, but better than nothing. probably @deldesir could figure it out.

@holta holta changed the title from "10 TB -> 10 GB: Fix metadata w/ mk-zim-cat-item.py & mk-zim-cat.py [C'est pas sorcier, CrashCourse, TED-Ed]" to "10 TB -> 10 GB: Fix metadata w/ mk-zim-cat-item.py & mk-zim-cat.py [C'est pas sorcier, CrashCourse, TED-Ed, GCFAprendeLibre]" Jan 20, 2023
@holta
Contributor Author

holta commented Jan 20, 2023

@tim-moody after this PR (or similar) is merged, can users see the new ZIM catalog by logging back in to Admin Console, and/or by clicking "Reindex Kiwix Content", "Refresh Kiwix Catalog" or similar ?

image

(Or is a fresh install of IIAB & Admin Console likely necessary?)

@tim-moody
Contributor

just refresh catalog

@holta
Contributor Author

holta commented Jan 21, 2023

I tried contacting @georgejhunt directly for his recommendations here. No luck reaching him so far, since Wednesday. He should have good ideas. In any case, I suggest we move forward with this PR allowing for community testing — and amend it later wherever George and users can offer further improvements.

@tim-moody
Contributor

I agree

@holta holta merged commit 5733f5a into iiab-share:main Jan 22, 2023
@holta
Contributor Author

holta commented Jan 22, 2023

Looks great.
Hopefully @deldesir and others can confirm.
Fresh installs of all 4 ZIM files can't hurt to verify!

@tim-moody
Contributor

Fresh installs of all 4 ZIM files can't hurt to verify!

would verify that source url not broken

@holta
Contributor Author

holta commented Jan 22, 2023

Fresh installs of all 4 ZIM files can't hurt to verify!

would verify that source url not broken

First fresh install doesn't seem to work: @tim-moody is it unrelated that jobs are marked "SCHEDULED" but never begin?

image

I rebooted and that does not help. Waiting 10min also did not help. FYI this is Debian 12 (currently in pre-release freeze, and generally reliable in other regards). Feel free to login to 10.8.0.38 if that helps understand why it's stuck?

@tim-moody
Contributor

can't login

@holta
Contributor Author

holta commented Jan 22, 2023

http://10.8.0.46/admin (Ubuntu 22.04) appears to have the very same problem.

Can you log into either? (Or if not, any idea why?)

@holta
Contributor Author

holta commented Jan 22, 2023

can log into admin but not ssh as my key not supported

Your ssh key should be supported on both VM's (mine works). Any idea what's happening?

@tim-moody
Contributor

tim-moody commented Jan 22, 2023

mine doesn't on either

@holta
Contributor Author

holta commented Jan 22, 2023

mine doesn't on either

Hopefully when you're logged in as iiab-admin you can diagnose (why your ssh key is suddenly not working?)

@holta
Contributor Author

holta commented Jan 22, 2023

Just FYI iiab-cmdsrv.service appears to be running fine on both VMs:
(below is from Debian 12 VM 10.8.0.38; Ubuntu 22.04 VM 10.8.0.46 is equivalent)

root@box:~# systemctl status iiab-cmdsrv
● iiab-cmdsrv.service - Provides the IIAB Command Server
     Loaded: loaded (/etc/systemd/system/iiab-cmdsrv.service; enabled; vendor preset: enabled)
     Active: active (running) since Sun 2023-01-22 11:49:52 EST; 35min ago
    Process: 687 ExecStart=/opt/admin/cmdsrv/iiab-cmdsrv3.py --daemon (code=exited, status=0/SUCCESS)
   Main PID: 897 (iiab-cmdsrv3.py)
      Tasks: 9 (limit: 2316)
     Memory: 86.7M
        CPU: 8.761s
     CGroup: /system.slice/iiab-cmdsrv.service
             └─897 /usr/bin/python3 /opt/admin/cmdsrv/iiab-cmdsrv3.py --daemon

Jan 22 11:55:37 box IIAB-CMDSRV[897]: IIAB-CMDSRV : Received CMD Message GET-ADM-CONF.
Jan 22 11:55:37 box IIAB-CMDSRV[897]: IIAB-CMDSRV : Received CMD Message GET-VARS.
Jan 22 11:55:37 box IIAB-CMDSRV[897]: IIAB-CMDSRV : Received CMD Message GET-ANS.
Jan 22 11:55:37 box IIAB-CMDSRV[897]: IIAB-CMDSRV : Received CMD Message GET-IIAB-INI.
Jan 22 11:55:37 box IIAB-CMDSRV[897]: IIAB-CMDSRV : Received CMD Message GET-ZIM-STAT.
Jan 22 11:55:38 box IIAB-CMDSRV[897]: IIAB-CMDSRV : Received CMD Message GET-OER2GO-STAT.
Jan 22 11:55:38 box IIAB-CMDSRV[897]: IIAB-CMDSRV : Received CMD Message GET-OSM-VECT-STAT.
Jan 22 11:55:38 box IIAB-CMDSRV[897]: IIAB-CMDSRV : Received CMD Message GET-SPACE-AVAIL.
Jan 22 11:55:38 box IIAB-CMDSRV[897]: IIAB-CMDSRV : Received CMD Message GET-EXTDEV-INFO.
Jan 22 11:55:48 box IIAB-CMDSRV[897]: IIAB-CMDSRV : Received CMD Message GET-JOB-STAT {"last_rowid":1}.

@tim-moody
Contributor

adm cons doesn't think kiwix is installed, so is waiting for it

if you install from a preset kiwix will be installed, but otherwise you have to check it in configure and ico

@holta
Contributor Author

holta commented Jan 22, 2023

  1. On 10.8.0.38 installing Kiwix and then restarting iiab-cmdsrv got things moving, thanks for solving that!

  2. And then it moved TED-Ed from /library/working/zims as soon as the 78MB nih_rarediseases_en_all_maxi_2020-12.zim was done downloading, i.e. when the download of teded_en_top_2021-01.zim was only 3% complete!

image

Any idea why this is happening?

FYI /library/zims/content/teded_en_top_2021-01.zim continues to expand (presumably as a result of the wget command keeping the file handle, despite the destination having moved!) Is this A-Ok ??
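
For context, this matches ordinary POSIX behavior: a rename within one filesystem changes only the directory entry, so a process that already has the file open keeps writing to the same file under its new name. A minimal sketch (file names are placeholders; it assumes both paths are on the same filesystem, as a cross-filesystem `mv` copies the data instead and would behave differently):

```python
import os
import tempfile


def write_across_rename(workdir):
    """Keep writing through an open handle while the file is renamed."""
    src = os.path.join(workdir, "working.zim")   # stand-in for the download path
    dst = os.path.join(workdir, "content.zim")   # stand-in for the final path
    with open(src, "wb") as handle:
        handle.write(b"first-chunk;")
        handle.flush()
        os.rename(src, dst)                      # the premature "move"
        handle.write(b"second-chunk")            # the writer carries on regardless
    with open(dst, "rb") as moved:
        return os.path.exists(src), moved.read()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        still_at_src, data = write_across_rename(workdir)
        print(still_at_src, data)
```

So the growing file is probably harmless in itself; the risk is that anything indexing the ZIM right after the move sees an incomplete file.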

@tim-moody
Contributor

what I said was with regard to .46. I see that .38 now has kiwix installed and the zim downloads have started and in one case succeeded.

@tim-moody
Contributor

The only thing I can think of is that the restart logic doesn't handle job dependencies properly.

@tim-moody
Contributor

tim-moody commented Jan 22, 2023

Also though jobs 3 and 4 say succeeded, the zim was not added, and the job output even says that. On checking I see that rare diseases did get added, but not to menu

@holta
Contributor Author

holta commented Jan 22, 2023

Also though jobs 3 and 4 say succeeded, the zim was not added, and the job output even says that. On checking I see that rare diseases did get added, but not to menu

I noticed that.

Weird that neither ZIM file appears on the IIAB home page so far.

(Certainly I can manually force these later, i.e. after the 10GB TED-Ed is downloaded, I can run iiab-make-kiwix-lib or click Install Content > Reindex Kiwix Content -- and then run iiab-update-menus)

Hopefully these kinds of things are a very rare occurrence arising from systemctl restart iiab-cmdsrv ?

@tim-moody
Contributor

Hopefully these kinds of things are a very rare occurrence arising from systemctl restart iiab-cmdsrv ?

That's what I'm hoping. Just looking at the restart code to see if anything jumps out. Is it possible that wget continued even when you restart cmdsrv? i.e. there was no reboot

@holta
Contributor Author

holta commented Jan 22, 2023

Is it possible that wget continued even when you restart cmdsrv?

The only thing I did was systemctl restart iiab-cmdsrv to try to force both wget's to begin.

I did not reboot. (Should I have rebooted instead?)

@tim-moody
Contributor

tim-moody commented Jan 22, 2023

restart assumes a cold start after a shutdown or crash

@tim-moody
Contributor

teded download is marked STARTED in the db, seems like we had a job that was still running at the OS level, but not at the cmdsrv level. don't know what that would do.

@holta
Contributor Author

holta commented Jan 22, 2023

Weird that neither ZIM file appears on the IIAB home page so far.

Just FYI neither ZIM file appeared on the IIAB main page http://10.8.0.38 in the end.

And just 1 of 2 appears at http://10.8.0.38/kiwix/ in the end.

image

(I can manually force the fixing of all the small issues above, as mentioned earlier, so I'm only mentioning this as an FYI.)

@holta
Contributor Author

holta commented Jan 22, 2023

teded download is marked STARTED in the db, seems like we had a job that was still running at the OS level, but not at the cmdsrv level. don't know what that would do.

Time to repeat the experiment on 10.8.0.46 == Ubuntu 22.04 just for kicks?

(Do you want to do that...or should I?)

@tim-moody
Contributor

reboot .46 first as I have cmdsrv running manually

@holta
Contributor Author

holta commented Jan 22, 2023

  • I rebooted .46
  • I ran ./runrole kiwix to install kiwix-tools successfully.

I don't know how to check the DB for individual jobs' status (do you want to do this, then restart the VM to see how it goes?)

@tim-moody
Contributor

there might be a logical flaw in cmdsrv 3634ff if the job is the highest number, has no dependencies, but is itself dependent.

@holta
Contributor Author

holta commented Jan 22, 2023

there might be a logical flaw in cmdsrv 3634ff

Is 3634ff a particular commit? Am I looking in the wrong repo if so (am not seeing it!)

@tim-moody
Contributor

sorry is a line number and I don't know how you do the fancy L# url

@tim-moody
Contributor

@tim-moody
Contributor

I ran ./runrole kiwix to install kiwix-tools successfully.

since this was after reboot, cmdsrv may not know about it unless you restart it and refresh admin

@holta
Contributor Author

holta commented Jan 22, 2023

I ran ./runrole kiwix to install kiwix-tools successfully.

since this was after reboot, cmdsrv may not know about it unless you restart it and refresh admin

I rebooted .46 (Ubuntu 22.04).

The behavior is identical to .38 (Debian 12).

(In each case, zim_install_move.sh teded_en_top_2021-01.zim is run prematurely, triggered by a different ZIM file download's completing.)
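
To make the expected behavior concrete, here is a minimal sketch of a dependency check in which a move job becomes runnable only when its own download has succeeded, never because some other download finished. The job table layout, status names other than SCHEDULED/STARTED, and `ready_jobs` are assumptions for illustration; this is not cmdsrv's actual code or schema.

```python
def ready_jobs(jobs):
    """Return ids of SCHEDULED jobs whose own dependency (if any) has SUCCEEDED.

    jobs maps job_id -> {"cmd": str, "status": str, "depends_on": int or None}.
    """
    ready = []
    for job_id, job in sorted(jobs.items()):
        if job["status"] != "SCHEDULED":
            continue
        dependency = job["depends_on"]
        # Only this job's own dependency matters, not the latest finished job.
        if dependency is None or jobs[dependency]["status"] == "SUCCEEDED":
            ready.append(job_id)
    return ready


if __name__ == "__main__":
    jobs = {
        1: {"cmd": "wget nih_rarediseases_en_all_maxi_2020-12.zim",
            "status": "SUCCEEDED", "depends_on": None},
        2: {"cmd": "zim_install_move.sh nih_rarediseases_en_all_maxi_2020-12.zim",
            "status": "SCHEDULED", "depends_on": 1},
        3: {"cmd": "wget teded_en_top_2021-01.zim",
            "status": "STARTED", "depends_on": None},
        4: {"cmd": "zim_install_move.sh teded_en_top_2021-01.zim",
            "status": "SCHEDULED", "depends_on": 3},
    }
    print(ready_jobs(jobs))
```

With the TED-Ed download still STARTED, only the rare-diseases move is released; the TED-Ed move waits for job 3.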

@tim-moody
Contributor

This is how it should work if kiwix is installed using Admin Console even after the zim downloads are scheduled.

image

And they were properly added to the home menu.

image

@deldesir

FYI, I removed all ZIM files via Admin Console (Install content > Manage content > Remove selected content) and redownloaded them using Admin Console again. I monitored the command jobs and all succeeded. I can access all the ZIM files via Kiwix. No service restart was needed.

@tim-moody
Contributor

@deldesir perfect. thanks for confirming.

@holta
Contributor Author

holta commented Feb 10, 2023

2. And then it moved TED-Ed from /library/working/zims as soon as the 78MB nih_rarediseases_en_all_maxi_2020-12.zim was done downloading, i.e. when the download of teded_en_top_2021-01.zim was only 3% complete!

Above issue is confirmed fixed! Thanks to @tim-moody's:
