Allow embedding CFF fonts #1282

Closed
wants to merge 2 commits into from

camertron
Member

camertron commented Jan 10, 2023

Summary

The original work I did to support OpenType fonts in ttfunk assumed embedding them in PDF files would work the same way as it does for TrueType fonts. Sadly this is not the case. I'm sorry to say I haven't been very active around here since OpenType support was released, largely because I haven't been getting notifications when new issues are filed. A few weeks ago I was alerted to weird OpenType font issues on Twitter and started digging in. This PR is a result of that investigation.

What changed?

The PDF spec requires OpenType fonts with a CFF table to have a Subtype of Type1C and a reference entry with a Subtype of Type1. It also requires the font descriptor to include the font program under the FontFile3 key rather than the FontFile2 key used for non-CFF fonts.

@pointlessone
Member

For reference, this is reported as prawnpdf/ttfunk#98

BaseFont: basename.to_sym,
FontDescriptor: descriptor,
FirstChar: 32,
LastChar: 255,
Widths: @document.ref!(widths),
ToUnicode: cmap
)

if font.cff.exists?
Member

Would it be useful/more readable to have this extracted into a separate method and have an OTF subclass that overrides it?

Member Author

That's a great idea 👍

Member Author

camertron commented Jan 12, 2023

On second thought, I think this is correct as-is. It appears Prawn looks at the file extension to determine which Font subclass to use. That's not wrong necessarily, but nothing prevents fonts with a .ttf file extension from having a CFF table. I've also seen this in practice. Checking for the existence of the table is, IMHO, less error-prone.

On that note, I've also (weirdly) come across fonts with both glyf and CFF tables. No idea how that's supposed to work.

Member

You're right that file extension is used. It's an easy heuristic. Do you think we should change that to make it more sophisticated and closer to "technically correct"?

Member Author

I do. OTF and TTF aren't really different formats, they just contain different tables. Prawn would only have to read the directory structure at the beginning of the font to know if it contains a CFF table. Is that something you'd like to see in this PR?
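For illustration, a minimal sketch of the check described here: read the sfnt table directory at the start of the file and look for a `CFF ` tag. This is neither Prawn's nor TTFunk's code; `sfnt_table_tags`, `cff_outlines?`, and `fake_directory` are hypothetical names, the sample "fonts" are hand-built directories rather than real files, and font collections (`ttcf` files) are not handled.

```ruby
require 'stringio'

# Returns the table tags listed in an sfnt (TTF/OTF) table directory.
# Layout: a 12-byte header (4-byte version, uint16 numTables, three more
# uint16 fields) followed by numTables 16-byte records that start with a tag.
def sfnt_table_tags(io)
  header = io.read(12)
  raise ArgumentError, 'truncated font header' if header.nil? || header.bytesize < 12

  _version, num_tables = header.unpack('a4n')
  Array.new(num_tables) do
    record = io.read(16)
    raise ArgumentError, 'truncated table directory' if record.nil? || record.bytesize < 16

    record[0, 4] # tag; the checksum, offset, and length fields are ignored here
  end
end

# True when the directory lists a CFF table (note the trailing space in the tag).
def cff_outlines?(io)
  sfnt_table_tags(io).include?('CFF ')
end

# Hypothetical test helper: builds a table directory with zeroed records.
def fake_directory(version, tags)
  header = [version, tags.size, 0, 0, 0].pack('a4nnnn')
  records = tags.map { |tag| [tag, 0, 0, 0].pack('a4NNN') }.join
  StringIO.new(header + records)
end

otf_like = fake_directory('OTTO', ['CFF ', 'cmap', 'head'])
ttf_like = fake_directory("\x00\x01\x00\x00", %w[cmap glyf head])

puts cff_outlines?(otf_like) # => true
puts cff_outlines?(ttf_like) # => false
```

Because only the directory is read, the result does not depend on the file extension, which is the point made above about `.ttf` files that contain a CFF table.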

Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

> I've also (weirdly) come across fonts with both glyf and CFF tables. No idea how that's supposed to work.

Maybe it's a hybrid font? You can treat it as TTF or OTF and still get the same result? We should look into a few examples to verify if that's the case at least for the majority of cases. If that's so we should probably treat those as OTF fonts.

> Is that something you'd like to see in this PR?

Yes, please.

And maybe add a changelog entry?

@camertron
Member Author

camertron commented Jan 10, 2023

Wow, thanks for taking a look at this so quickly @pointlessone! I also have a ttfunk PR that I'm working on submitting that prawn will need to depend on to get everything working. Sadly it turns out I made a few mistakes in my implementation of OpenType support 😓 This PR shouldn't be merged until the corresponding ttfunk PR is reviewed, merged, and released. I will link to it here as soon as it's ready.

@pointlessone
Member

@camertron Random question. By any chance would it be easier to not support subsetting?

@camertron
Member Author

camertron commented Jan 10, 2023

@pointlessone yes, I think it would be much easier, although I haven't tested embedding an entire font file. Most fonts are small and could be embedded without a large increase in file size. However Noto and other fonts that contain very large character sets could add megabytes to file size, so subsetting them is more important.

Another consideration is that, while most font authors allow their fonts to be subsetted and embedded in (for example) PDF files, some font licenses restrict you from distributing the entire font program, which is effectively what we'd be doing if we didn't subset in prawn.

@pointlessone
Member

So, to give a bit of context for the question… I was looking into support for Middle Eastern and Indic languages. It's much more complex than the naive text layout that we currently have. For proper support of those languages we'd have to take into account at the very least the GSUB and GPOS tables. They're probably not used by the PDF renderer, but TTFunk subsetting is not at all ready for that complexity. Likewise for ligatures, bitmap fonts, variable fonts; the list goes on. So I think instead we're going to deprecate TTFunk subsetting (current iteration) and embed full fonts. That way we'd at least be sure we're not corrupting fonts. Font serialization is not easy. Subsetting is even more complex. Instead let's focus on making TTFunk properly read fonts so that we can properly lay out text.

@camertron
Member Author

camertron commented Jan 10, 2023

@pointlessone hmm that's really interesting and would definitely prevent a lot of complexity. Are you concerned with the potential copyright/licensing issues I mentioned in my last comment?

For what it's worth, I actually have a branch for subsetting the GPOS/GSUB tables 😅 It would need to be dusted off but a significant amount of the work is already done. The thing is... it's extremely complex.

@pointlessone
Member

pointlessone commented Jan 10, 2023

> Another consideration is that, while most font authors allow their fonts to be subsetted and embedded in (for example) PDF files, some font licenses restrict you from distributing the entire font program, which is effectively what we'd be doing if we didn't subset in prawn.

BTW, the OS/2 table has an fsType field with an explicit flag prohibiting subsetting. We don't honour it either. So… 🤷‍♂️

I mean, ideally we should honour all flags, licenses, and user wishes. But I'd rather we embed full fonts and not have bugs than have bugs and subset fonts. If we have to choose and full embedding is easier then let's do it instead.
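As a rough sketch of what honouring that field could look like, the snippet below decodes an fsType value using the embedding-permission bits defined for the OS/2 table in the OpenType specification. It takes the raw integer as input; how that value is read from a font is left out, and `FS_TYPE_FLAGS`, `fs_type_flags`, and `subsetting_allowed?` are hypothetical names, not TTFunk API.

```ruby
# Embedding-permission bits of the OS/2 fsType field (OpenType specification).
FS_TYPE_FLAGS = {
  0x0002 => :restricted_license_embedding,
  0x0004 => :preview_and_print_embedding,
  0x0008 => :editable_embedding,
  0x0100 => :no_subsetting,
  0x0200 => :bitmap_embedding_only
}.freeze

# Lists the flags set in a raw fsType value (0 means installable embedding).
def fs_type_flags(fs_type)
  FS_TYPE_FLAGS.select { |mask, _name| (fs_type & mask) != 0 }.values
end

# True unless the "no subsetting" bit (0x0100) is set.
def subsetting_allowed?(fs_type)
  (fs_type & 0x0100).zero?
end

p fs_type_flags(0x0104)       # => [:preview_and_print_embedding, :no_subsetting]
p subsetting_allowed?(0x0104) # => false
p subsetting_allowed?(0x0000) # => true
```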

@pointlessone
Member

> Are you concerned with the potential copyright/licensing issues I mentioned in my last comment?

A little. Maybe. I don't have any data on how many fonts actually prohibit full embedding. I also don't know what kind of subsetting they require. Maybe we can throw away a random table that is not used by PDF. Like, kern is probably not used by any PDF renderer and is present in most fonts. Or have a dummy cmap table, as PDF seems to insist on using its own CMap anyway.

@pointlessone
Member

> For what it's worth, I actually have a branch for subsetting the GPOS/GSUB tables 😅 It would need to be dusted off but a significant amount of the work is already done. The thing is... it's extremely complex.

I feel you. Sunk cost in full swing, right? 😅

But you'd probably breathe a sigh of relief knowing you don't have to support all that complexity. I would. 🙂

@gettalong
Member

Just my 2c from what I remember: Subsetting an OTF font should not be that different from subsetting a TrueType font in the context of embedding it in a PDF file. The reason for this is that the PDF viewer shouldn't need to do any layout calculations at all. So GSUB/GPOS et al should generally not be needed.

I think that this could easily be tested by using a GPOS/GSUB aware application like LibreOffice Writer, write a complex text there with ligatures et al and inspect the resulting PDF file for what has been done, e.g. inspect the content stream to see how horizontal/vertical offsets were done and the embedded OTF font.

@camertron
Member Author

This is definitely a conversation worth having, since I believe all of us agree that subsetting is extremely complex and easy to get wrong. The question we have to answer is whether the benefits of subsetting outweigh the cost of maintaining the subsetting code. As @pointlessone pointed out, I'm not exactly impartial since I contributed the original OTF parsing/subsetting code 😅 , and I'd be lying if I said it doesn't hurt a little to consider discarding a good deal of it. In the end, I'm interested in doing what's right for the project and the maintainers. To that end, there are a few key points I'd like to make.

Against subsetting

  1. Adobe originally adopted font subsetting to reduce PDF file size. The first version of Acrobat was released in 1993, back when storage, both primary and secondary, was much more expensive and limited. Interestingly, 1993 is the same year the Compact Font Format specification, or CFF, was released. These days, primary and secondary storage are much cheaper, so embedding an entire 16 MB font isn't the issue it once was. In other words, both storing the font program on disk and loading the entire thing into memory are much more feasible than they used to be.
  2. These days it's not uncommon to embed entire font programs in PDFs. Most software that is capable of emitting PDF files allows the user to embed a subset or the entire font on a per-document basis. Embedding the entire font can be useful if, for example, someone else needs to edit the document.

For subsetting

  1. I did a bit more research today around the legality of embedding entire fonts in PDFs. Generally speaking, fonts and typefaces can be copyrighted in most western countries. In other words, the font program as well as the glyphs themselves can be copyrighted. Most fonts also come with EULAs that govern how you're allowed to use them. At least in the US, I think a case can be made that subsetting falls under fair use laws which state 10% of a copyrighted work may be reproduced without infringement. To be clear, I don't think it makes sense for ttfunk to limit subsetting to only 10% of glyphs. I'm saying a moral (and maybe even legal) argument can be made for good faith subsetting of copyrighted fonts.
  2. Perhaps the biggest reason I see to keep subsetting functionality in ttfunk is because of use-cases that fall outside the PDF arena. I originally contributed OTF font support because I wanted to automatically subset Noto Sans for the company I worked for at the time. We had identified Noto as one of the major contributors to our low PageSpeed score. While clients' browsers were downloading the font, the page would appear broken, especially in Japanese. The site used Rails i18n .yml files, so we knew exactly which characters we needed. After adding OTF support to ttfunk I wrote a gem that automatically built up a list of all the Japanese characters we needed and emitted a subset (using the asset pipeline) on every deploy. It worked really well. In fact, I pushed an open-source version of the project here.
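To make the i18n workflow in the last point concrete, here is a minimal sketch of collecting the set of characters used in a locale file. It is not the gem mentioned above; `collect_characters` and the inline sample locale are made up for illustration, and a real setup would load the application's `.yml` files and pass the result to a subsetter.

```ruby
require 'set'
require 'yaml'

# Recursively gathers every character that appears in the string values of a
# parsed locale structure. Hash keys are ignored; only translations count.
def collect_characters(node, chars = Set.new)
  case node
  when Hash then node.each_value { |value| collect_characters(value, chars) }
  when Array then node.each { |value| collect_characters(value, chars) }
  when String then node.each_char { |char| chars << char }
  end
  chars
end

# Hypothetical sample locale standing in for a Rails i18n file.
locale = YAML.safe_load(<<~LOCALE)
  ja:
    greeting: こんにちは
    buttons:
      save: 保存
      cancel: キャンセル
LOCALE

needed = collect_characters(locale)
puts needed.size # => 12
```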

@camertron
Member Author

@gettalong yes I think you're right about GSUB/GPOS. That's probably why ignoring them hasn't been a problem for prawn. For other use-cases like the one I mentioned in my last comment, it could be.

@pointlessone
Member

I'd add the most obvious point to the "against subsetting" section: it's much simpler. We're left with a much simpler problem to solve and that's good.

A few comments on "for subsetting":

    1. EULA is external to fonts from TTFunk's point of view. It's the user's responsibility to make sure they can use fonts in their specific case. TTFunk can't know whether the EULA/license allows or restricts subsetting. TTFunk can't know whether the font is used for commercial, non-profit, personal, or other purposes. So I don't think we should pay much attention to this particular point. The most we can do is be upfront about how fonts are used in Prawn/TTFunk and delegate the actual decision to the users.
    2. IANAL but I'm not convinced subsetting is a proper fair-use. It is a derivation but it's not transformative. We're taking exact copies of glyphs. The amount criterion is hard to meet, too. The basic Latin alphabet probably takes up 10% of a big Unicode font (by number of glyphs), but what if we're using Chinese instead? What if the font is just Latin glyphs? Also, do we count by glyphs, by encoded size, or something else? In any case, this still goes back to a decision on the user's part whether fair use is even in play.
    3. PDF doesn't use all features of a font so subsets might result in crippled fonts without making them useless for PDF. This makes it impossible to extract the font and use it for anything else (other than PDF with the same or smaller glyph set). This is probably the strongest argument for subsetting. It's the closest we can get to copy-protection/DRM that arguably can fulfil a non-distribution clause in a license.
    1. I understand the desire for non-Prawn uses. However, our current implementation is hardly useful. For example, current TTFunk subsetting is limited to 255 characters per subset. It's probably useful for European languages but makes it cumbersome for bigger alphabets or when you want a combination of scripts. On top of that TTFunk ignores a whole lot of tables in its subsetting. What I'm trying to say is that it's very lacking.
    2. There are other tools that can do proper (or at least much better) subsetting. fonttools is kinda good. FontForge is probably not too hard to script to get much better subsets. IIRC, there was a tool from Google Fonts team out there. I'm sure there are others. So this is not a unique feature of TTFunk. Changing workflows sucks but maybe it's for the better.
    3. I'm not for the immediate removal of subsetting from TTFunk. I'm for deprecation of the current implementation. You still can use it if you want. I'm proposing not using it for Prawn because it makes things more complicated and buggy than the alternative.
    4. In the abstract I think subsetting is a useful feature. But it's complex. It's also not a single feature. For instance, for the PDF use case we can throw out whole tables and it'll be fine. But for the web this probably won't fly. Then you can make a subset for a specific text and it'd be a bit smaller than a subset for a generic alphabet, especially for GSUB-heavy languages such as Arabic or Bengali. What I'm trying to say is that even the base case (PDF subsets) is very complex and it can get almost arbitrarily more complex. It probably should be its own project and shouldn't block Prawn.

@camertron
Member Author

camertron commented Jan 12, 2023

> I'd add the most obvious point to the "against subsetting" section: it's much simpler. We're left with a much simpler problem to solve and that's good.

Definitely not arguing with you on this point. 100% correct. I didn't include it because we'd already discussed it.

> IANAL but I'm not convinced subsetting is a proper fair-use.

I'm not suggesting it is or is not fair use, I'm simply suggesting it's more morally defensible to include a subset than to include the entire font program anyone could strip out of the PDF and use on their own. Sort of like how the GNU license lets you distribute compiled versions of their source code, even for commercial purposes, but not the source code itself.

> PDF doesn't use all features of a font so subsets might result in crippled fonts without making them useless for PDF. This makes it impossible to extract the font and use it for anything else

That's certainly true for some fonts, but a good deal of the fonts out there can be subset without breaking them in non-PDF contexts.

> It's the closest we can get to copy-protection/DRM that arguably can fulfil a non-distribution clause in a license.

I don't think we want to be in the business of enforcing copy protection or DRM. As you said, "EULA is external to fonts from TTFunk's point of view." I agree with that. It's just that embedding the entire font program feels immoral to me for some reason.

> I understand the desire for non-Prawn uses. However, our current implementation is hardly useful.

I disagree. It's useful enough that it worked for subsetting Noto at my previous company, and has been used in prawn for a long time as well.

> For example, current TTFunk subsetting is limited to 255 characters per subset.

ttfunk actually supports full unicode subsets internally but doesn't offer an API to create them and instead creates MacRoman subsets for some reason. I assume this is for backwards compatibility. MacRoman is one of the baked-in encodings in the PDF spec, probably because in 1993 Unicode was only ~2 years old. I actually looked into changing ttfunk to create Unicode subsets instead. It should be pretty easy to get working (🤞), but will also require a small change to prawn.

CFF also supports the full Unicode codespace. Although CFF charsets and encodings are limited to 255 characters, the font dictionaries feature was intended to support an almost unlimited number of characters in a single font. Noto uses this feature, for example.

> On top of that TTFunk ignores a whole lot of tables in its subsetting. What I'm trying to say is that it's very lacking.

Yes, that's true, but I don't think it needs to support every font table to be useful. For example, GPOS and GSUB are important for certain scripts and kinds of fonts, but entirely unimportant for others. It seems to work just fine for the fonts I tend to use it for. If at some point I or another developer need it to consider other tables, then we can always add that support later.

> There are other tools that can do proper (or at least much better) subsetting. fonttools is kinda good.

Yeah, I've used fonttools quite a bit to validate my work on ttfunk. The code is pretty easy to read. It's not particularly easy to use it from Ruby, however.

> I'm not for the immediate removal of subsetting from TTFunk. I'm for deprecation of the current implementation. You still can use it if you want. I'm proposing not using it for Prawn because it makes things more complicated and buggy than the alternative.

Alright, I'm on board with that. As I've said numerous times now, not subsetting does make everything simpler, particularly from prawn's point of view. I still worry about the morality of it, but I'm not the decision maker.

> What I'm trying to say is that even the base case (PDF subsets) is very complex and it can get almost arbitrarily more complex. It probably should be its own project and shouldn't block Prawn.

Yeah, that makes a lot of sense.

camertron marked this pull request as ready for review January 12, 2023 06:23

@pointlessone
Member

> It's just that embedding the entire font program feels immoral to me for some reason.

That's interesting. I'd love to know more. Do you feel the same way about embedded images, for example? Especially since it's virtually impossible to embed an image in a way that would make it impossible to extract.


Since this is a non-technical part of the discussion feel free to either move it to a private channel (e.g. team discussions or email) or not answer at all.

@gettalong
Member

> ttfunk actually supports full unicode subsets internally but doesn't offer an API to create them and instead creates MacRoman subsets for some reason.

Using a subset that is not restricted to 255 characters would mean changes to Prawn because simple PDF fonts like Prawn uses are single-byte fonts and can therefore only support 255 characters. If you want to use a single subset font with more than 255 characters in PDF, you would need to use PDF's CID font support (this is actually what I'm doing with HexaPDF).
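A small sketch of the consequence for single-byte fonts: once a text uses more distinct characters than one subset can hold, the characters have to be spread across several subset fonts, each with its own one-byte codes. This is an illustration only, not Prawn's or TTFunk's bookkeeping; `assign_subsets` is a hypothetical name, the capacity of 255 is the figure quoted in this thread, and codes simply start at 0 here.

```ruby
# Maps each distinct character to [subset number, code within that subset].
# Every subset holds at most `capacity` characters, so a single-byte code
# is always enough to address a character inside its subset.
def assign_subsets(text, capacity: 255)
  mapping = {}
  text.each_char.uniq.each_with_index do |char, index|
    mapping[char] = index.divmod(capacity)
  end
  mapping
end

# 300 distinct CJK characters do not fit into one single-byte subset.
sample = (0x4E00...(0x4E00 + 300)).map { |codepoint| codepoint.chr('UTF-8') }.join
mapping = assign_subsets(sample)

p mapping.values.map(&:first).uniq # => [0, 1]
p mapping[sample[255]]             # => [1, 0]
```

A CID-keyed font, as mentioned above, avoids this split because its character codes are not limited to one byte.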

@camertron
Member Author

camertron commented Jan 14, 2023

> That's interesting. I'd love to know more. Do you feel the same way about embedded images, for example?

No... maybe it's because fonts are essentially pieces of software, and software is what I do. I suppose ultimately it doesn't outweigh the cost of maintaining subsetting code. Also, since a bunch of PDF tools like Distiller, Illustrator, etc, often embed entire fonts, it's probably ok.

@camertron
Member Author

> Using a subset that is not restricted to 255 characters would mean changes to Prawn

Oh that's true, I forgot how (needlessly) complicated the PDF spec is for CID fonts.

@pointlessone
Member

@camertron I tried it and it doesn't seem to work. I believe Type1C is not the right value for the font type.

From the spec:

> Type1C: Type 1–equivalent font program represented in the Compact Font Format (CFF), as described in Adobe Technical Note #5176, The Compact Font Format Specification.

I believe it means a naked CFF font, not an OpenType font with a CFF table. I'm almost certain of this because the same table describes a different font subtype:

> OpenType: […] This entry can appear in the font descriptor for the following types of font dictionaries:
> […]
>
>   • A Type1 font dictionary or CIDFontType0 CIDFont dictionary, if the embedded font program contains a “CFF” table without CIDFont operators.

I used this in #1322 and it seems to work.
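Read together with the PR description, the spec excerpts suggest a selection rule along these lines. This is only a sketch of that reading, not the code from #1322: `embedded_font_entry` is a hypothetical helper, it takes a list of table tags as input, and CFF fonts that use CIDFont operators are not covered.

```ruby
# Picks the font-descriptor key, the font-file stream Subtype, and the font
# dictionary Subtype for a complete OpenType/TrueType file being embedded.
def embedded_font_entry(table_tags)
  if table_tags.include?('CFF ')
    # OpenType wrapper around CFF outlines (assumed: no CIDFont operators).
    { descriptor_key: :FontFile3, stream_subtype: :OpenType, font_subtype: :Type1 }
  else
    # TrueType outlines keep the FontFile2 key; that stream has no Subtype.
    { descriptor_key: :FontFile2, stream_subtype: nil, font_subtype: :TrueType }
  end
end

p embedded_font_entry(['CFF ', 'cmap', 'head'])
p embedded_font_entry(%w[cmap glyf head])
```

Per the first excerpt, `Type1C` would instead apply to a stand-alone CFF font program, which is not the case being handled here.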

Thank you for setting me on the right path. That said, I think it's better to close this PR as #1322 among other things fixes this particular issue, too. If you have some time, could you please take a look at that other PR?

@pointlessone
Member

@camertron I took another look at this and the spec and I'm convinced these values are about stand-alone CFF fonts. Since TTFunk doesn't support stand-alone CFF fonts I'll close this PR. Thank you for setting me on the right track.
