<!DOCTYPE html>
<html>
<head>
<title>Digital Summer School 2024: WED04</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<link rel="stylesheet" href="../style.css"> </head>
<body>
<textarea id="source">
class: center, middle, titlepage
### WED04: *Extracting content metadata.*
---
class: contentpage
### **Content Metadata**
We have looked a bit at extracting technical metadata from media; now we are going to look at something else: extracting metadata about the content of the media.
This is an area being revolutionised by machine learning and AI, in particular by Large Language Models (like ChatGPT).
---
class: contentpage
### **Content Metadata**
Prior to around 2022, there were many projects in this space, using a variety of different techniques, often stretching back decades.
Speech to text.
OCR.
Image analysis.
Experimental/Other.
---
class: contentpage
### **Content Metadata**
Prior to around 2022, there were many projects in this space, using a variety of different techniques, often stretching back decades.
~~Speech to text.~~ -> Whisper
~~OCR.~~ -> LLM
~~Image analysis.~~ -> LLM
~~Experimental/Other.~~ -> LLM
Advances in Machine Learning are really taking over this space, and it is constantly evolving, with new models being released often.
---
class: contentpage
### **Whisper**
Let us start by looking at Whisper from OpenAI.
Whisper is now so ubiquitous that it has almost completely replaced other speech-to-text systems.
It returns speech-to-text "cooked": with punctuation added and all disfluencies (a new word for me) removed.
As with many neural network models, it comes in a variety of "sizes" depending on complexity, which generally correlate with the quality of the results.
---
class: contentpage
### **Whisper**
A minimum viable command for whisper is
```sh
whisper media/film_scan/audio.wav
```
which will use default values for model, language and the output format.
---
class: contentpage
### **Whisper**
The first flag we might want to pass to whisper is the language, otherwise it will take a sample of the first 30 seconds and detect the language automatically.
As an aside, I have noticed that any time Whisper cannot detect a language it seems to think it is "Welsh".
```sh
whisper --language en media/film_scan/audio.wav
```
---
class: contentpage
### **Whisper**
The --output_dir flag lets us specify an output directory.
```sh
whisper --language en --output_dir whisper_test/ media/film_scan/audio.wav
```
---
class: contentpage
### **Whisper**
We can also select an output format.
SRT and VTT are subtitle formats with timestamps, which you should be able to drop straight into VLC or into DCPs you are creating.
JSON and TSV are useful as structured data, using formats which are common for storing machine-readable data.
But we are going to use the simplest - a TXT file.
```sh
whisper --language en --output_dir whisper_test/ --output_format txt media/film_scan/audio.wav
```
---
class: contentpage
### **Whisper**
As mentioned before, we can also set the size of the model which is used: base, small, medium or large.
It is worth doing some tests to see what effect a bigger model has: it will use more disk space and take longer to process, so check whether the improvement in results is significant enough to justify this.
Matt Miller at the Library of Congress has done something similar.
https://thisismattmiller.com/post/lomax-whisper/
```sh
whisper --language en --output_dir whisper_test/ --output_format txt --model small media/film_scan/audio.wav
```
It is interesting to note that a number of AV archives are involved in producing their own Whisper models, trained on certain languages or dialects.
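---
class: contentpage
### **Whisper**
As a side note, if you would rather script this than call it from the shell, the same package also exposes a Python API. A minimal sketch, assuming openai-whisper is installed via pip and reusing the audio file from the commands above:
```python
# A minimal sketch using the openai-whisper Python API (pip install openai-whisper).
import whisper

# Equivalent to the --model small flag above.
model = whisper.load_model('small')

# Equivalent to --language en; returns a dict with the full text plus timestamped segments.
result = model.transcribe('media/film_scan/audio.wav', language='en')

print(result['text'])
```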
---
class: contentpage
### **LLMs**
Now to shift focus to tools using similar technology to extract information, generally called LLMs or Large Language Models.
The most famous of these is ChatGPT from OpenAI, but there are now equivalent models from Meta (Llama), Microsoft (Phi) and Google (Gemma).
We might begin by looking at different things we can do with these models.
---
class: contentpage
### **LLMs**
We are going to dive right in with a multi-modal model called LLaVA, which allows us to do image analysis. Note that we are now about to discover how powerful (or not) our computers are.
Ollama is a cross-platform command line application which lets us run machine learning models locally.
https://ollama.com/
There is a wide selection of models on offer, and as with whisper they range in size, with the expectation that the bigger the model, the better the results (apparently not always true)!
https://ollama.com/library/gemma2
---
class: contentpage
### **LLMs**
Once we have Ollama installed, we can easily call our required model using the following command.
```sh
ollama run llava:7b
```
Bigger models may have system requirements beyond what our computer can handle, in which case we will get an error message to inform us of this.
---
class: contentpage
### **LLMs**
In the past, if you wanted to do OCR, object detection or other visual analysis, each would have required a dedicated process. LLMs are a bit different: they are capable of doing it all, we just need to ask the right questions.
---
class: contentpage
### **LLMs**
General description example
```sh
Can you describe image '/home/summerschool/Desktop/ml_test_files/0087679.jpg'
```
---
class: contentpage
### **LLMs**
*OCR*
```sh
What text is in this image '/home/summerschool/Desktop/ml_test_files/0086556.jpg'
```
Please note that accuracy can depend on model size, so this is the one section of the course where having a powerful computer will give you **better** results, not just faster.
I am currently running the smallest model as my laptop is nothing special, so I would be keen to compare with people running something better.
For me, on the smallest model, it gets it wrong almost every time, with fascinating results.
---
class: contentpage
### **LLMs**
*OCR*
For anyone who remembers using Tesseract, it is mind-bending that we can ask for specific interpreted text.
```sh
What is the name of the boat in this image '/home/summerschool/Desktop/ml_test_files/0087024.jpg'
```
A problem I noticed while testing this was that LLaVA can sometimes be confused when new images are supplied consecutively in the same session, so quitting in between may not be a terrible idea.
---
class: contentpage
### **LLMs**
*Object Detection*
Frame with flags.
```sh
What objects are visible in '/home/summerschool/Desktop/ml_test_files/0087200.jpg'
```
Frame with boat as Python array.
```sh
Give me a list of the objects visible in '/home/summerschool/Desktop/ml_test_files/0087024.jpg'. Return the items as a Python list of strings, eg ['hat', 'house', 'car']
```
---
class: contentpage
### **LLMs**
*Image Characteristics*
```sh
Is the following image black and white or colour: '/home/summerschool/Desktop/ml_test_files/0086556.jpg'
```
```sh
What is the aspect ratio of the image area of the following film frame (eg 1.66, 1.85): '/home/summerschool/Desktop/ml_test_files/0086556.jpg'
```
This is especially interesting as the image is letterboxed. There are much easier ways to do "letterbox detection" (e.g. with FFmpeg), but those can be susceptible to black levels.
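---
class: contentpage
### **LLMs**
For reference, a rough sketch of that FFmpeg approach: run the cropdetect filter from Python and collect the crop values it logs. The input path here is hypothetical, and as noted, the detection is sensitive to black levels.
```python
# A rough sketch: run FFmpeg's cropdetect filter over a sample of frames and
# collect the crop geometries it suggests (logged to stderr). Hypothetical input path.
import re
import subprocess

cmd = [
    'ffmpeg', '-hide_banner',
    '-i', '/home/summerschool/Desktop/ml_test_files/film_scan.mov',
    '-vf', 'cropdetect', '-frames:v', '200', '-f', 'null', '-',
]
stderr = subprocess.run(cmd, capture_output=True, text=True).stderr

# cropdetect lines end with something like "crop=1440:1080:240:0" (width:height:x:y).
print(set(re.findall(r'crop=\d+:\d+:\d+:\d+', stderr)))
```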
---
class: contentpage
### **LLMs**
Now the only piece missing is automating this - copying and pasting lines of text into the terminal is not going to scale if we want to inspect hundreds, thousands or millions of images.
Basic Python example.
```python
import ollama

# Send a single image and prompt to the locally running llava:7b model.
request = ollama.chat(
    model='llava:7b',
    messages=[
        {
            'role': 'user',
            'content': 'What is the aspect ratio of the image area of the following film frame (eg 1.66, 1.85)',
            'images': ['/home/summerschool/Desktop/ml_test_files/0086556.jpg'],
        }
    ]
)

# Print the text of the model's reply.
print(request['message']['content'])
```
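---
class: contentpage
### **LLMs**
To scale this up, the same request can simply be wrapped in a loop over a directory of frame grabs. A minimal sketch, assuming the same llava:7b model and a hypothetical folder of JPEGs:
```python
# A minimal sketch: ask the same question of every JPEG in a directory.
# The directory path and prompt are just examples.
from pathlib import Path

import ollama

frames = sorted(Path('/home/summerschool/Desktop/ml_test_files').glob('*.jpg'))

for frame in frames:
    response = ollama.chat(
        model='llava:7b',
        messages=[
            {
                'role': 'user',
                'content': 'Is this film frame black and white or colour?',
                'images': [str(frame)],
            }
        ]
    )
    print(frame.name, '->', response['message']['content'])
```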
---
class: center, middle, titlepage
### WED04: *External Connections.*
---
class: contentpage
### **Agenda**
We are going to talk about "external connections".
Up until now, all our processes have been happening locally, but what if we want to connect the results up with the outside world?
This could be thought of as knowledge enrichment, benefiting from knowledge available elsewhere, but also as a step towards ensuring explicit shared definitions: when I say "DPX", does it mean the same thing as when you say "DPX"?
---
class: contentpage
### **Agenda**
1. DROID
1.1 PRONOM
---
class: contentpage
### **1. DROID**
DROID is a very interesting tool from the UK National Archives which profiles files, meaning it determines the type of a file (e.g. DPX version 2.0) beyond just its extension (.dpx).
It is also very useful if you are given a large collection of diverse digital files, and you want to automatically figure out what you have been given.
https://digital-preservation.github.io/droid/
---
class: contentpage
### **1. DROID**
To run DROID from the command line we are going to want to download the "bin" from the DROID GitHub releases page.
https://github.com/digital-preservation/droid/releases
Once we have unpacked the bin.zip we can run the following to show everything is working.
```sh
java -jar droid-command-line-6.8.0.jar -v
```
And we can now use it on a directory, here our scans. This is not especially interesting as the film scans are quite homogeneous, but try pointing it at your Downloads directory and see what reports you get.
```sh
java -jar droid-command-line-6.8.0.jar -a /home/paulduchesne/Desktop/summer-school-2024/media -o /home/paulduchesne/Desktop/droid_test.csv
```
---
class: contentpage
### **1. DROID**
Just to unpack what is happening here, DROID is processing each file and checking that the "magic numbers" of the file (which we saw earlier) are correct and match the extension. This catches any instance where someone has taken an "image.jpg" and renamed it as "image.dpx", a feature we can test ourselves (a rough sketch follows on the next slide).
What is especially interesting though, and the focus of the next section, is the PUID column, which contains the PRONOM identifier.
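---
class: contentpage
### **1. DROID**
As promised, a rough illustration of that magic-number check in Python. The file path is hypothetical; a DPX file starts with the ASCII bytes "SDPX" (or "XPDS" for little-endian files), while a JPEG starts with the bytes FF D8 FF.
```python
# A rough sketch of the "magic number" check DROID automates: read the first
# bytes of a file and compare them against known signatures. Hypothetical path.
suspect = '/home/summerschool/Desktop/ml_test_files/renamed_image.dpx'

with open(suspect, 'rb') as f:
    magic = f.read(4)

if magic in (b'SDPX', b'XPDS'):
    print('Looks like a genuine DPX file.')
elif magic.startswith(b'\xff\xd8\xff'):
    print('This is actually a JPEG wearing a .dpx extension.')
else:
    print(f'Unrecognised signature: {magic!r}')
```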
---
class: contentpage
### **1.1 PRONOM**
These PUIDs connect to the PRONOM technical registry, also managed by the UK National Archives.
https://www.nationalarchives.gov.uk/PRONOM/Default.aspx
This becomes useful, because it turns out there are different types of file which use the "dpx" extension.
By moving from referring to a file as a "dpx file" to "fmt/193", we establish exactly what type of file we are talking about and prevent confusion.
---
class: contentpage
### **1.1 PRONOM**
PRONOM also provides great context around a format: where it comes from, and some of its technical qualities.
https://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=669
---
class: contentpage
### **1.1 PRONOM**
It also links to Wikidata, which is the machine-readable sister project to Wikipedia (which we will look at in more depth shortly).
https://www.wikidata.org/wiki/Q1676669
Wikidata links to the "preservation plan" from NARA (the National Archives and Records Administration in the US).
https://www.archives.gov/files/lod/dpframework/id/NF00492.ttl
This in turn gives us a "risk level", "appropriate tools" and a "preservation plan status".
---
class: contentpage
### **1.1 PRONOM**
For me this is exciting stuff. It is still early days for a lot of these systems, but the vision is that we are moving towards a space where we can ask questions like "how much of my digital collection is at risk?" using shared risk analysis.
These are important questions for a digital archive.
We could also write a script which runs DROID, extracts the PUIDs, consults Wikidata, and returns NARA risk levels as a single process.
Remind me if we end up with spare time to try this!
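---
class: contentpage
### **1.1 PRONOM**
As a starting point for that script, a rough sketch of the Wikidata step: read the PUIDs out of the DROID CSV from earlier and look each one up on the Wikidata SPARQL endpoint. It assumes P2748 is the Wikidata property for PRONOM identifiers (worth double-checking on wikidata.org); the NARA risk-level lookup would then hang off the matched items.
```python
# A rough sketch: collect PUIDs from the DROID CSV and look up the matching
# Wikidata items via SPARQL. P2748 is assumed to be the PRONOM identifier
# property - verify on wikidata.org before relying on it.
import csv

import requests

puids = set()
with open('/home/paulduchesne/Desktop/droid_test.csv', newline='') as f:
    for row in csv.DictReader(f):
        if row.get('PUID'):
            puids.add(row['PUID'])

for puid in sorted(puids):
    query = f'''
        SELECT ?format ?formatLabel WHERE {{
          ?format wdt:P2748 "{puid}" .
          SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
        }}'''
    response = requests.get(
        'https://query.wikidata.org/sparql',
        params={'query': query, 'format': 'json'},
        headers={'User-Agent': 'summer-school-demo'})
    for hit in response.json()['results']['bindings']:
        print(puid, hit['format']['value'], hit['formatLabel']['value'])
```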
---
class: contentpage
### **1.1 PRONOM**
Also, if this subject is specifically interesting to you, I would recommend having a read of the DPC guide to "Wikidata for Digital Preservationists"
https://www.dpconline.org/docs/technology-watch-reports/2551-thorntonwikidatadpc-revsionthornton/file
---
class: center, middle, titlepage
### End of Day Three.
</textarea>
<script src="https://remarkjs.com/downloads/remark-latest.min.js" type="text/javascript"></script>
<script type="text/javascript">var slideshow = remark.create({ratio: "16:9"});</script>
</body>
</html>