
Ligatures and Search




Message-ID:<8h8tj4d33h8mkepb4jmgm0l2qrmcv4gvfj@4ax.com>
Subject:

Ligatures and Search


Date:Tue, 9 Dec 2008 17:58:28 +0100


I'm getting more and more PDFs that contain ligatures: two letters
combined to form a single glyph. It's a typesetting technique.

Anyway, it royally screws up searches. Suppose I search for "official"
and "ff" is a ligature: not two f's, but a single glyph. Acrobat
search won't find it. 

Is there any way around this? I have no way of knowing ahead of time
that the PDF includes ligatures, and even if I did know, not all
occurrences of the word use the ligatures. Ligatures are not visually
distinct, so I can't tell by looking if a word contains one. So, a
search for "official" might return a bunch of hits, while ignoring
others that I may never know exist.

I don't create the PDFs, so I have no control over whether or not they
contain ligatures.




Message-ID:<ulJ%k.4848$uS1.1487@newsfe19.iad>
Subject:

Re: Ligatures and Search


Date:Wed, 10 Dec 2008 07:29:46 +0100


Richard Evans wrote:


> Anyway, it royally screws up searches. 

Folks need to stop using ligatures for precisely that reason.

Who do they think they're impressing?




Message-ID:<pcomyf46vqt.fsf@math.ntnu.no>
Subject:

Re: Ligatures and Search


Date:Wed, 10 Dec 2008 13:49:14 +0100


+ PDFrank <pdfrank@some.com>:

> Richard Evans wrote:
>
>
>> Anyway, it royally screws up searches. 
>
> Folks need to stop using ligatures for precisely that reason.
>
> Who do they think they're impressing?

Ligatures are not there to impress. They are there to enhance
readability. Anyway, the PDF standard provides ways to use ligatures
while declaring an ActualText property to be used in searches and text
conversions. See, e.g., section 10.8.3 (Replacement Text) in the PDF
Reference version 1.7. One should lobby PDF creators to use this
feature, not to avoid ligatures.

-- 
* Harald Hanche-Olsen     <URL:http://www.math.ntnu.no/~hanche/>
- It is undesirable to believe a proposition
  when there is no ground whatsoever for supposing it is true.
  -- Bertrand Russell




Message-ID:<ghoirk$nrd$1@canard.ulcc.ac.uk>
Subject:

Re: Ligatures and Search


Date:Wed, 10 Dec 2008 15:14:09 +0100


Harald Hanche-Olsen wrote:
> .... Anyway, the PDF standard provides ways to use ligatures
> while declaring an ActualText property to be used in searches and text
> conversions. See, e.g., section 10.8.3 (Replacement Text) in the PDF
> Reference version 1.7.

That's interesting, and new to me (and I suspect to many others - I've
never seen this referenced before when similar issues came up.) Was that
new in PDF 1.7 or present in earlier versions? It does seem to be the
appropriate solution for applications to use if it is as you say.

-- 
Kevin Ashley                               This is not a signature
Head of Digital Archives
ULCC      http://www.ulcc.ac.uk/




Message-ID:<pco7i686jjg.fsf@math.ntnu.no>
Subject:

Re: Ligatures and Search


Date:Wed, 10 Dec 2008 18:12:51 +0100


+ Kevin Ashley <K.Ashley@ulcc.ac.uk>:

> Harald Hanche-Olsen wrote:
>> .... Anyway, the PDF standard provides ways to use ligatures
>> while declaring an ActualText property to be used in searches and text
>> conversions. See, e.g., section 10.8.3 (Replacement Text) in the PDF
>> Reference version 1.7.
>
> That's interesting, and new to me (and I suspect to many others - I've
> never seen this referenced before when similar issues came up.) Was that
> new in PDF 1.7 or present in earlier versions?

It's been around since PDF 1.5, according to the 1.7 manual. I am too
lazy to go back to previous manuals to check. I expect this is what is
used when you use the OCR function in Acrobat to create a searchable
version of a scanned document, but I haven't checked that either.

As an example, here is a code snippet from the manual, showing how the
German word "Drucker" might become split across a line break as "Druk-
ker" but still searchable as "Drucker":

(Dru) Tj
  /Span
  <</ActualText (c) >>
  BDC
    (k-) Tj
  EMC
(ker) '
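
The same mechanism should work for a ligature: show the single ligature glyph, and declare the letters it stands for as its ActualText. Here is a rough Python sketch that only builds the content-stream fragment as a string. The helper names are made up for illustration, and the glyph code AE for the fi ligature is an assumption that holds only if the font uses Adobe's StandardEncoding; a real producer would have to look the code up in the font actually in use.

```python
def pdf_literal(text):
    """Return text as a PDF literal string, escaping backslashes and parentheses."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace("(", "\\(")
        .replace(")", "\\)")
    )
    return "(" + escaped + ")"


def span_with_actual_text(glyph_hex, actual):
    """Wrap one shown glyph (given as a hex string) in a /Span carrying ActualText.

    glyph_hex is the code of the ligature glyph in the current font's
    encoding (an assumption supplied by the caller); actual is the plain
    text that searches and text extraction should see instead.
    """
    return f"/Span <</ActualText {pdf_literal(actual)}>> BDC <{glyph_hex}> Tj EMC"


# Hypothetical usage: the fi ligature, assuming it sits at code AE in the font.
fragment = span_with_actual_text("AE", "fi")
print(fragment)
```

This does nothing for PDFs that already exist without ActualText; it only shows what a producer would need to emit.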

-- 
* Harald Hanche-Olsen     <URL:http://www.math.ntnu.no/~hanche/>
- It is undesirable to believe a proposition
  when there is no ground whatsoever for supposing it is true.
  -- Bertrand Russell




Message-ID:<XeSdnXuEDocFCaLUnZ2dnUVZ8vGdnZ2d@posted.plusnet>
Subject:

Re: Ligatures and Search


Date:Wed, 10 Dec 2008 11:01:41 +0100


Richard Evans wrote:
> Ligatures are not visually
> distinct,

Yes they are; if they weren't they wouldn't be needed.

   BugBear




Message-ID:<3hqvj4thi5bqkm7vtpgtrhjlr6fnj1gh3d@4ax.com>
Subject:

Re: Ligatures and Search


Date:Wed, 10 Dec 2008 17:11:42 +0100


bugbear <bugbear@trim_papermule.co.uk_trim> wrote:

>Richard Evans wrote:
>> Ligatures are not visually
>> distinct,
>
>Yes they are; if they weren't they wouldn't be needed.

Not to the naked eye. 






Message-ID:<pco3agw6ji0.fsf@math.ntnu.no>
Subject:

Re: Ligatures and Search


Date:Wed, 10 Dec 2008 18:13:43 +0100


+ Richard Evans <infodex@mindspring.com>:

> bugbear <bugbear@trim_papermule.co.uk_trim> wrote:
>
>>Richard Evans wrote:
>>> Ligatures are not visually
>>> distinct,
>>
>>Yes they are; if they weren't they wouldn't be needed.
>
> Not to the naked eye. 

Methinks you need new glasses.

-- 
* Harald Hanche-Olsen     <URL:http://www.math.ntnu.no/~hanche/>
- It is undesirable to believe a proposition
  when there is no ground whatsoever for supposing it is true.
  -- Bertrand Russell




Message-ID:<p5OdnTUUaNKjZqLUnZ2dnUVZ8sTinZ2d@posted.plusnet>
Subject:

Re: Ligatures and Search


Date:Wed, 10 Dec 2008 18:19:41 +0100


Harald Hanche-Olsen wrote:
> + Richard Evans <infodex@mindspring.com>:
> 
>> bugbear <bugbear@trim_papermule.co.uk_trim> wrote:
>>
>>> Richard Evans wrote:
>>>> Ligatures are not visually
>>>> distinct,
>>> Yes they are; if they weren't they wouldn't be needed.
>> Not to the naked eye. 
> 
> Methinks you need new glasses.

It's possible that finely designed ligatures,
designed for high quality typesetting, only affect a pixel (or less)
when rendered at a small size on a computer display.

   BugBear




Message-ID:<pcoy6yo53k6.fsf@math.ntnu.no>
Subject:

Re: Ligatures and Search


Date:Wed, 10 Dec 2008 18:43:21 +0100


+ bugbear <bugbear@trim_papermule.co.uk_trim>:

> It's possible that finely designed ligatures,
> designed for high quality typesetting, only affect a pixel (or less)
> when rendered at a small size on a computer display.

You have a point there. When blown up or printed, it's quite a different
matter though. Look at http://www.math.ntnu.no/~hanche/tmp/ffilig.png
(only 8K, won't break your bandwidth budget) and compare the two
versions of the word "efficient", first with the ffi ligature and then
without it.

-- 
* Harald Hanche-Olsen     <URL:http://www.math.ntnu.no/~hanche/>
- It is undesirable to believe a proposition
  when there is no ground whatsoever for supposing it is true.
  -- Bertrand Russell




Message-ID:<ghog34$mv9$1@canard.ulcc.ac.uk>
Subject:

Re: Ligatures and Search


Date:Wed, 10 Dec 2008 14:26:57 +0100


Richard Evans wrote:
> I'm getting more and more PDFs that contain ligatures: two letters
> combined to form a single glyph. It's a typesetting technique.
> 
> Anyway, it royally screws up searches. Suppose I search for "official"
> and "ff" is a ligature: not two f's, but a single glyph. Acrobat
> search won't find it. 
> 
> Is there any way around this? I have no way of knowing ahead of time
> that the PDF includes ligatures, and even if I did know, not all
> occurrences of the word use the ligatures. Ligatures are not visually
> distinct, so I can't tell by looking if a word contains one. So, a
> search for "official" might return a bunch of hits, while ignoring
> others that I may never know exist.
> 

Variants of this question have come up before in this group and elsewhere,
and surprisingly it seems that there is little that one can do about it,
but I'll try to describe the little, as it may be useful to you.

If you are searching frequently for a small number of terms which may contain
known ligatures, then you could try two forms of the search, one using
separate characters and one containing the Unicode character for the ligature.
(I've not tried this in Acrobat, and it may not be easy to type, but it
ought to work.)
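
A rough sketch of that idea in Python, which lists the forms of a word worth searching for. The function name `search_variants` is made up for illustration, the table covers only the f-ligatures that have their own Unicode code points (U+FB00 to U+FB04), and whether a given viewer's search box accepts these characters is not something this code can tell you.

```python
# Latin f-ligatures from Unicode's Alphabetic Presentation Forms block.
# Longer sequences come first so "ffi" is tried before "ff" and "fi".
LIGATURES = {
    "ffi": "\uFB03",
    "ffl": "\uFB04",
    "ff": "\uFB00",
    "fi": "\uFB01",
    "fl": "\uFB02",
}


def search_variants(word):
    """Return the plain word plus every form with a ligature substituted."""
    variants = {word}
    for letters, ligature in LIGATURES.items():
        # Work on a snapshot so new variants are not revisited in this pass.
        for variant in list(variants):
            if letters in variant:
                variants.add(variant.replace(letters, ligature))
    return sorted(variants)


for form in search_variants("official"):
    # Show the code points so the ligature forms can be told apart.
    print(form, [hex(ord(ch)) for ch in form])
```

For "official" this gives the plain spelling plus one form each using the ff, fi and ffi ligatures, so a thorough search means trying every one of them.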

Otherwise, you may find it easier to extract the text into something else
and carry out the search there. (Whether that's useful depends to some extent
on why you are searching the PDF in the first place.) Many tools won't help
you here: the usual question regarding ligatures arises when someone extracts
text from a PDF and then notices that some sets of characters have just
gone missing during the extraction process. But according to a reply
Aandi Inston gave on the Adobe forum in June this year, Acrobat is clever
enough to know if the target for text extraction cannot handle Unicode
ligature characters, and it will then map the ligature into the correct
character pair. Which suggests that extracting to something dumb like Windows
Notepad may do what you want.
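
If the extracted text still contains the Unicode ligature characters, they can be folded back into ordinary letters before searching. A minimal Python sketch, assuming the text is already in a string; `fold_ligatures` is a name invented here, and it deliberately touches only the Latin ligatures at U+FB00 to U+FB06 rather than normalising the whole text.

```python
import unicodedata


def fold_ligatures(text):
    """Replace Latin ligature characters (U+FB00..U+FB06) with their letters."""
    pieces = []
    for ch in text:
        if "\uFB00" <= ch <= "\uFB06":
            # Compatibility decomposition splits the ligature into its letters.
            pieces.append(unicodedata.normalize("NFKD", ch))
        else:
            pieces.append(ch)
    return "".join(pieces)


# Hypothetical extracted text containing the ffi and ff ligatures.
extracted = "The o\uFB03cial report praised the e\uFB00ort."
folded = fold_ligatures(extracted)
print("official" in extracted, "official" in folded)
```

Restricting the fold to that one range leaves other characters, such as superscripts or full-width forms, exactly as they were extracted.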

But, as Aandi also noted there, Acrobat can only do this with well-known
ligatures that have a Unicode code point. The person he was replying
to had a document containing 'ti' ligatures, which Acrobat could not
recognise because they aren't known to Unicode.

Some other text extraction tools apparently deal with this by falling
back on OCR techniques to recognise characters or ligatures that aren't
mapped correctly, but that seems somewhat extreme.

The root of the problem is that PDF is all about glyphs and visual representation
rather than text and document structure.


As to bugbear's retort to your statement about whether or not ligatures
are visually distinct: yes, clearly they must *be* different in some way,
else why have them? Yet they aren't going to be particularly noticeable
to the eye unless you are looking for them, since one point of
ligatures in typography is to make the text easier on the eye, isn't it?




Message-ID:<oIGdnTC1zJ14V6LUnZ2dnUVZ8jWdnZ2d@posted.plusnet>
Subject:

Re: Ligatures and Search


Date:Wed, 10 Dec 2008 14:53:24 +0100


Kevin Ashley wrote:
> The root of the problem is that PDF is all about glyphs and visual 
> representation
> rather than text and document structure.

Indeed; both PS and PDF can have glyphs that look like ... anything.

Music notation - map symbols, or monograms... or very extreme
triple character ligatures.

It is this versatility that makes (general) ligature
handling hard, although some common special
cases have been managed.

   BugBear




Message-ID:<8jqvj4p990hp71bcpo50uljbsgi1o2k5g3@4ax.com>
Subject:

Re: Ligatures and Search


Date:Wed, 10 Dec 2008 17:17:40 +0100


Kevin Ashley <K.Ashley@ulcc.ac.uk> wrote:

>
>
>As to bugbear's retort to your statement about whether or not ligatures
>are visually distinct: yes, clearly they must *be* different in some way,
>else why have them?

Ask someone who is into typography. The only reasonable explanation
I've found is in the book "Bookmaking": They are used to join two
characters to form a distinct new character. That is not how I'm
seeing them used.

> Yet they aren't going to be particularly noticeable
>to the eye unless you are looking for them, since one point of
>ligatures in typography is to make the text easier on the eye, isn't it?

You just answered your own question: They are not particularly
noticeable. The ones I have encountered have not been noticeable at all.
I could send you a page containing ligatures and I guarantee you will
not pick them out.






 
