<!DOCTYPE html>
<html>
<head>
<title>Digital Summer School 2024: WED04</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<link rel="stylesheet" href="../style.css"> </head>
<body>
<textarea id="source">
class: center, middle, titlepage
### WED04: *Extracting content metadata.*
---
class: contentpage
### **Content Metadata**
We have looked a bit at extracting technical metadata from media; now we are going to look at something else: extracting metadata about the content of the media.
This is an area being revolutionised by machine learning and AI, in particular by Large Language Models (like ChatGPT).
---
class: contentpage
### **Content Metadata**
Prior to around 2022, there were many projects in this space, using a variety of different techniques, often stretching back decades.
Speech to text.
OCR.
Image analysis.
Experimental/Other.
---
class: contentpage
### **Content Metadata**
Prior to around 2022, there were many projects in this space, using a variety of different techniques, often stretching back decades.
~~Speech to text.~~ -> Whisper
~~OCR.~~ -> LLM
~~Image analysis.~~ -> LLM
~~Experimental/Other.~~ -> LLM
Advances in Machine Learning are really taking over this space, and it is constantly evolving, with new models being released often.
---
class: contentpage
### **Whisper**
Let us start by looking at Whisper from OpenAI.
Whisper is now so ubiquitous that it has almost completely replaced other speech-to-text systems.
It returns speech-to-text "cooked": with punctuation added and all disfluencies (a new word for me) removed.
As with many neural network models, it comes in a variety of "sizes" depending on complexity, which generally correlate with the quality of the results.
---
class: contentpage
### **Whisper**
A minimum viable command for whisper is
```sh
whisper media/film_scan/audio.wav
```
which will use default values for model, language and the output format.
---
class: contentpage
### **Whisper**
The first flag we might want to pass to whisper is the language, otherwise it will take a sample of the first 30 seconds and detect the language automatically.
As an aside, I have noticed that any time Whisper cannot detect a language it seems to think it is "Welsh".
```sh
whisper --language en media/film_scan/audio.wav
```
---
class: contentpage
### **Whisper**
The --output_dir flag lets us specify an output directory.
```sh
whisper --language en --output_dir whisper_test/ media/film_scan/audio.wav
```
---
class: contentpage
### **Whisper**
We can also select an output format.
SRT and VTT are subtitle formats with timestamps, which you should be able to drop straight into VLC or into DCPs you are creating.
JSON and TSV are useful as structured data, using formats which are common for storing machine-readable data.
But we are going to use the simplest - a TXT file.
```sh
whisper --language en --output_dir whisper_test/ --output_format txt media/film_scan/audio.wav
```
---
class: contentpage
### **Whisper**
As mentioned before, we can also set the size of the model which is used: base, small, medium or large.
It is worth doing some tests to see what effect a bigger model has: it will use more disk space and take longer to process, so check whether the improvement in results is significant enough to justify this.
Matt Miller at the Library of Congress has done something similar.
https://thisismattmiller.com/post/lomax-whisper/
```sh
whisper --language en --output_dir whisper_test/ --output_format txt --model small media/film_scan/audio.wav
```
It is interesting to note that a number of AV archives are involved in producing their own Whisper models, trained on certain languages or dialects.
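---
class: contentpage
### **Whisper**
As a side note, if you would rather script this than call it from the shell, the same package also exposes a Python API. A minimal sketch, assuming openai-whisper is installed via pip and reusing the audio file from the commands above:
```python
# A minimal sketch using the openai-whisper Python API (pip install openai-whisper).
import whisper

# Equivalent to the --model small flag above.
model = whisper.load_model('small')

# Equivalent to --language en; returns a dict with the full text plus timestamped segments.
result = model.transcribe('media/film_scan/audio.wav', language='en')

print(result['text'])
```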
---
class: contentpage
### **LLMs**
Now to shift focus to tools using similar technology to extract information, generally called LLMs or Large Language Models.
The most famous of these is ChatGPT from OpenAI, but there are now equivalent models from Meta (Llama), Microsoft (Phi) and Google (Gemma).
We might begin by looking at different things we can do with these models.
---
class: contentpage
### **LLMs**
We are going to dive right in with a multi-modal model called LLaVA, which allows us to do image analysis. Note that we are now about to discover how powerful (or not) our computers are.
Ollama is a cross-platform command line application which lets us run machine learning models locally.
https://ollama.com/
There is a wide selection of models on offer, and as with whisper they range in size, with the expectation that the bigger the model, the better the results (apparently not always true)!
https://ollama.com/library/gemma2
---
class: contentpage
### **LLMs**
Once we have Ollama installed, we can easily call our required model using the following command.
```sh
ollama run llava:7b
```
Bigger models may have system requirements beyond what our computer can handle, in which case we will get an error message to inform us of this.
---
class: contentpage
### **LLMs**
In the past, if you wanted to do OCR, object detection or other visual analysis, each would have required a dedicated process. LLMs are a bit different: they are capable of doing it all, we just need to ask the right questions.
---
class: contentpage
### **LLMs**
General description example
```sh
Can you describe image '/home/summerschool/Desktop/ml_test_files/0087679.jpg'
```
---
class: contentpage
### **LLMs**
*OCR*
```sh
What text is in this image '/home/summerschool/Desktop/ml_test_files/0086556.jpg'
```
Please note that accuracy can depend on model size, so this is the one section of the course where having a powerful computer will give you **better** results, not just faster.
I am currently running the smallest model as my laptop is nothing special, so I would be keen to compare with people running something better.
For me, on the smallest model, it gets it wrong almost every time, with fascinating results.
---
class: contentpage
### **LLMs**
*OCR*
For anyone who remembers using Tesseract, it is mind-bending that we can ask for specific interpreted text.
```sh
What is the name of the boat in this image '/home/summerschool/Desktop/ml_test_files/0087024.jpg'
```
A problem I noticed while testing this was that LLaVA can sometimes be confused when new images are supplied consecutively in the same session, so quitting in between may not be a terrible idea.
---
class: contentpage
### **LLMs**
*Object Detection*
Frame with flags.
```sh
What objects are visible in '/home/summerschool/Desktop/ml_test_files/0087200.jpg'
```
Frame with boat as Python array.
```sh
Give me a list of the objects visible in '/home/summerschool/Desktop/ml_test_files/0087024.jpg'. Return the items as a Python list of strings, eg ['hat', 'house', 'car']
```
---
class: contentpage
### **LLMs**
*Image Characteristics*
```sh
Is the following image black and white or colour: '/home/summerschool/Desktop/ml_test_files/0086556.jpg'
```
```sh
What is the aspect ratio of the image area of the following film frame (eg 1.66, 1.85): '/home/summerschool/Desktop/ml_test_files/0086556.jpg'
```
This is especially interesting as the image is letterboxed. There are much easier ways to do "letterbox detection" (e.g. with FFmpeg), but those can be susceptible to black levels.
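---
class: contentpage
### **LLMs**
For reference, a rough sketch of that FFmpeg approach: run the cropdetect filter from Python and collect the crop values it logs. The input path here is hypothetical, and as noted, the detection is sensitive to black levels.
```python
# A rough sketch: run FFmpeg's cropdetect filter over a sample of frames and
# collect the crop geometries it suggests (logged to stderr). Hypothetical input path.
import re
import subprocess

cmd = [
    'ffmpeg', '-hide_banner',
    '-i', '/home/summerschool/Desktop/ml_test_files/film_scan.mov',
    '-vf', 'cropdetect', '-frames:v', '200', '-f', 'null', '-',
]
stderr = subprocess.run(cmd, capture_output=True, text=True).stderr

# cropdetect lines end with something like "crop=1440:1080:240:0" (width:height:x:y).
print(set(re.findall(r'crop=\d+:\d+:\d+:\d+', stderr)))
```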
---
class: contentpage
### **LLMs**
Now the only piece missing is automating this - copying and pasting lines of text into the terminal is not going to scale if we want to inspect hundreds, thousands or millions of images.
Basic Python example.
```python
import ollama

# Send a single image and prompt to the locally running llava:7b model.
request = ollama.chat(
    model='llava:7b',
    messages=[
        {
            'role': 'user',
            'content': 'What is the aspect ratio of the image area of the following film frame (eg 1.66, 1.85)',
            'images': ['/home/summerschool/Desktop/ml_test_files/0086556.jpg'],
        }
    ]
)

# Print the text of the model's reply.
print(request['message']['content'])
```
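---
class: contentpage
### **LLMs**
To scale this up, the same request can simply be wrapped in a loop over a directory of frame grabs. A minimal sketch, assuming the same llava:7b model and a hypothetical folder of JPEGs:
```python
# A minimal sketch: ask the same question of every JPEG in a directory.
# The directory path and prompt are just examples.
from pathlib import Path

import ollama

frames = sorted(Path('/home/summerschool/Desktop/ml_test_files').glob('*.jpg'))

for frame in frames:
    response = ollama.chat(
        model='llava:7b',
        messages=[
            {
                'role': 'user',
                'content': 'Is this film frame black and white or colour?',
                'images': [str(frame)],
            }
        ]
    )
    print(frame.name, '->', response['message']['content'])
```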
---
class: center, middle, titlepage
### WED04: *External Connections.*
---
class: contentpage
### **Agenda**
We are going to talk about "external connections".
Up until now, all our processes have been happening locally, but what if we want to connect the results up with the outside world?
This could be thought of as knowledge enrichment, benefiting from knowledge available elsewhere, but also as a step towards ensuring explicit shared definitions: when I say "DPX", does it mean the same thing as when you say "DPX"?
---
class: contentpage
### **Agenda**
1. DROID
1.1 PRONOM
---
class: contentpage
### **1. DROID**
DROID is a very interesting tool from the UK National Archives which profiles files, meaning it determines the type of a file (e.g. DPX version 2.0) beyond just its extension (.dpx).
It is also very useful if you are given a large collection of diverse digital files, and you want to automatically figure out what you have been given.
https://digital-preservation.github.io/droid/
---
class: contentpage
### **1. DROID**
To run DROID from the command line we are going to want to download the "bin" from the DROID GitHub releases page.
https://github.com/digital-preservation/droid/releases
Once we have unpacked the bin.zip we can run the following to show everything is working.
```sh
java -jar droid-command-line-6.8.0.jar -v
```
And we can now use it on a directory, here our scans. This is not especially interesting as the film scans are quite homogeneous, but try pointing it at your Downloads directory and see what reports you get.
```sh
java -jar droid-command-line-6.8.0.jar -a /home/paulduchesne/Desktop/summer-school-2024/media -o /home/paulduchesne/Desktop/droid_test.csv
```
---
class: contentpage
### **1. DROID**
Just to unpack what is happening here, DROID is processing each file and checking that the "magic numbers" of the file (which we saw earlier) are correct and match the extension. This catches any instance where someone has taken an "image.jpg" and renamed it as "image.dpx", a feature we can test ourselves (a rough sketch follows on the next slide).
What is especially interesting though, and the focus of the next section, is the PUID column, which contains the PRONOM identifier.
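---
class: contentpage
### **1. DROID**
As promised, a rough illustration of that magic-number check in Python. The file path is hypothetical; a DPX file starts with the ASCII bytes "SDPX" (or "XPDS" for little-endian files), while a JPEG starts with the bytes FF D8 FF.
```python
# A rough sketch of the "magic number" check DROID automates: read the first
# bytes of a file and compare them against known signatures. Hypothetical path.
suspect = '/home/summerschool/Desktop/ml_test_files/renamed_image.dpx'

with open(suspect, 'rb') as f:
    magic = f.read(4)

if magic in (b'SDPX', b'XPDS'):
    print('Looks like a genuine DPX file.')
elif magic.startswith(b'\xff\xd8\xff'):
    print('This is actually a JPEG wearing a .dpx extension.')
else:
    print(f'Unrecognised signature: {magic!r}')
```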
---
class: contentpage
### **1.1 PRONOM**
These PUIDs connect to the PRONOM technical registry, also managed by the UK National Archives.
https://www.nationalarchives.gov.uk/PRONOM/Default.aspx
This becomes useful, because it turns out there are different types of file which use the "dpx" extension.
By moving from referring to a file as a "dpx file" to "fmt/193", we establish exactly what type of file we are talking about and prevent confusion.
---
class: contentpage
### **1.1 PRONOM**
PRONOM also provides great context around a format: where it comes from, and some of its technical qualities.
https://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=669
---
class: contentpage
### **1.1 PRONOM**
It also links to Wikidata, which is the machine-readable sister project to Wikipedia (which we will look at in more depth shortly).
https://www.wikidata.org/wiki/Q1676669
Wikidata links to the "preservation plan" from NARA (the National Archives and Records Administration in the US).
https://www.archives.gov/files/lod/dpframework/id/NF00492.ttl
This in turn gives us a "risk level", "appropriate tools" and a "preservation plan status".
---
class: contentpage
### **1.1 PRONOM**
For me this is exciting stuff. It is still early days for a lot of these systems, but the vision is that we are moving towards a space where we can ask questions like "how much of my digital collection is at risk?" using shared risk analysis.
These are important questions for a digital archive.
We could also write a script which runs DROID, extracts the PUIDs, consults Wikidata, and returns NARA risk levels as a single process.
Remind me if we end up with spare time to try this!
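---
class: contentpage
### **1.1 PRONOM**
As a starting point for that script, a rough sketch of the Wikidata step: read the PUIDs out of the DROID CSV from earlier and look each one up on the Wikidata SPARQL endpoint. It assumes P2748 is the Wikidata property for PRONOM identifiers (worth double-checking on wikidata.org); the NARA risk-level lookup would then hang off the matched items.
```python
# A rough sketch: collect PUIDs from the DROID CSV and look up the matching
# Wikidata items via SPARQL. P2748 is assumed to be the PRONOM identifier
# property - verify on wikidata.org before relying on it.
import csv

import requests

puids = set()
with open('/home/paulduchesne/Desktop/droid_test.csv', newline='') as f:
    for row in csv.DictReader(f):
        if row.get('PUID'):
            puids.add(row['PUID'])

for puid in sorted(puids):
    query = f'''
        SELECT ?format ?formatLabel WHERE {{
          ?format wdt:P2748 "{puid}" .
          SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
        }}'''
    response = requests.get(
        'https://query.wikidata.org/sparql',
        params={'query': query, 'format': 'json'},
        headers={'User-Agent': 'summer-school-demo'})
    for hit in response.json()['results']['bindings']:
        print(puid, hit['format']['value'], hit['formatLabel']['value'])
```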
---
class: contentpage
### **1.1 PRONOM**
Also, if this subject is specifically interesting to you, I would recommend having a read of the DPC guide to "Wikidata for Digital Preservationists"
https://www.dpconline.org/docs/technology-watch-reports/2551-thorntonwikidatadpc-revsionthornton/file
---
class: center, middle, titlepage
### End of Day Three.
</textarea>
<script src="https://remarkjs.com/downloads/remark-latest.min.js" type="text/javascript"></script>
<script type="text/javascript">var slideshow = remark.create({ratio: "16:9"});</script>
</body>
</html>