How do i extract data from a pdf document using code
Message-ID:<df5aa14b-8cd8-453f-a2b5-6b498bb50a57@d32g2000yqe.googlegroups.com>
Subject:How do i extract data from a pdf document using code
Date:Thu, 12 Feb 2009 12:23:35 +0100
Hi
I'm using C# in Visual Studio .NET and want to extract a line of text
from a PDF document in order to process it for another application.
The part of the document I require is static (i.e., it will always be in
the same 8 characters of the document).
Can anyone tell me how to do this, or am I in the wrong group?
Thanks in advance
Craig
Message-ID:<IpWdneqcqpaosAnUnZ2dnUVZ_vednZ2d@posted.palinacquisition>
Subject:Re: How do i extract data from a pdf document using code
Date:Thu, 12 Feb 2009 15:08:00 +0100
You might as well be speaking Swahili to me on this one, but Tracker Software
has a pretty advanced PDF reader program in PDF-XChange Viewer Pro. You might
give it a look and see if there is a solution for you there.
http://www.docu-track.com/home/prod_user/PDF-XChange_Tools/PDF-XChange_Viewer_PRO/
Good luck on your search.
--
Don - PDF-XChange Pro®/PDF-XChange Viewer Pro®
Vancouver, USA
Message-ID:<6vjimcFk2toiU1@mid.individual.net>
Subject:Re: How do i extract data from a pdf document using code
Date:Thu, 12 Feb 2009 22:29:40 +0100
Craig wrote:
> Hi
>
> I'm using C# in Visual Studio .NET and want to extract a line of text
> from a PDF document in order to process it for another application.
> The part of the document I require is static (i.e., it will always be in
> the same 8 characters of the document).
>
> Can anyone tell me how to do this, or am I in the wrong group?
I think you buy the API from Adobe.
I'm not sure I understand what "always be in the same 8 characters of
the document" means (unless it's a *very* short piece of text), but if
it really is part of the document text, and it lies in static text, and
you only want the text, not the binary formatting around it, use
pdftotext and an RE processor like grep or sed (both available for
Windows; their functions may be built into C#...I don't know).
For example, I saved your post as PDF
(http://silmaril.ie/software/ctp.pdf) so I could grab your email address
with:
C:\tmp\> pdftotext ctp.pdf - | grep @ | sed "s+^.*<\([^>]*\)>.*$+\1+"
cpbuck@gmail.com
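The same pipeline can be sketched as a script. This is only an illustration, written in Python rather than C#, and it assumes pdftotext is installed and on the PATH. The function names are made up for the example. The pattern is the one from the sed command above, so the greedy leading `.*` picks the last `<...>` on a line.

```python
import re
import subprocess

# Same pattern as the sed expression: capture what sits between "<" and ">".
ANGLE_FIELD = re.compile(r"^.*<([^>]*)>.*$")


def extract_address(text):
    """Return the <...> value from the first line containing "@", or None.

    Unlike the sed command, lines that contain "@" but no <...> are
    skipped instead of being passed through unchanged.
    """
    for line in text.splitlines():
        if "@" not in line:  # the "grep @" step
            continue
        match = ANGLE_FIELD.match(line)
        if match:
            return match.group(1)  # the "\1" in the sed replacement
    return None


def pdf_to_text(pdf_path):
    """Run pdftotext on pdf_path; the "-" argument sends the text to stdout.

    Assumption: a pdftotext executable can be found on the PATH.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `extract_address(pdf_to_text("ctp.pdf"))` would then correspond to the command line above. In C#, the same two steps would presumably be a launched pdftotext process followed by a regular-expression match.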
On the other hand, if what you want is some control sequence which is
part of the internal structure of the PDF, rather than visible text, I
suspect you either need Adobe's or someone's API, or a seriously deep
knowledge of what goes on inside PDFs :-)
///Peter
Message-ID:<aec35033-fac4-4988-b80a-8ce23a1f3caa@o36g2000yqh.googlegroups.com>
Subject:Re: How do i extract data from a pdf document using code
Date:Fri, 13 Feb 2009 11:20:00 +0100
Hi Peter
Thanks for the full reply. The 8 characters I am after are actually a
contract number displayed in the document. The document is made up of
multiple pages; I want to split it into individual documents (which I
can do) and save each document as the contract number. I'll give your
suggestion a go as it sounds as if it could work.
Craig
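A minimal sketch of the naming step described above, in Python and under stated assumptions: the text comes from pdftotext, which marks page breaks with a form feed character, and the contract number sits at a fixed line and column on each page. The default position (line 0, column 0) and the function names are hypothetical and would need adjusting to the real layout.

```python
import re


def split_pages(text):
    """Split pdftotext output into per-page strings at form feed characters."""
    pages = text.split("\f")
    # A form feed after the last page leaves an empty trailing entry.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return pages


def contract_number(page_text, line_index=0, column=0, length=8):
    """Return the fixed-position field, or None if the page is too short.

    line_index and column are placeholders for the real position.
    """
    lines = page_text.splitlines()
    if line_index >= len(lines):
        return None
    field = lines[line_index][column:column + length]
    return field if len(field) == length else None


def output_name(number):
    """Build a file name, replacing characters that are unsafe in file names."""
    safe = re.sub(r"[^A-Za-z0-9_-]", "_", number)
    return safe + ".pdf"
```

Each page that yields a number could then be saved as `output_name(number)`. Pages where `contract_number` returns None are better set aside for a manual check than named by guesswork.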
Message-ID:<ufdll.40195$eY.9202@newsfe15.ams2>
Subject:Re: How do i extract data from a pdf document using code
Date:Fri, 13 Feb 2009 12:58:17 +0100
Craig wrote:
> Hi Peter
>
> Thanks for the full reply. The 8 characters I am after are actually a
> contract number displayed in the document. The document is made up of
> multiple pages; I want to split it into individual documents (which I
> can do) and save each document as the contract number. I'll give your
> suggestion a go as it sounds as if it could work.
>
> Craig
"... save each document as the contract number." Hmm. That sounds
like a bit of a dodgy hack to me, if you mean that the contract number
will become the file name, or part of the file name.
Could I suggest that you (either 'also' or 'instead') put the contract
number into the document metadata?
You can do this with quite a number of tools (including Acrobat
Professional and open-source ones) or you can use PDF JavaScript.
You can also use PDF JavaScript to do the low-level text
extraction and regular expression matching that Peter has already
outlined.