
How do I extract data from a PDF document using code




Message-ID: <df5aa14b-8cd8-453f-a2b5-6b498bb50a57@d32g2000yqe.googlegroups.com>
Subject: How do I extract data from a PDF document using code
Date: Thu, 12 Feb 2009 12:23:35 +0100


Hi

I'm using C# in Visual Studio .NET and want to extract a line of text
from a PDF document in order to process it for another application.
The part of the document I require is static (i.e., it will always be in
the same 8 characters of the document).

Can anyone tell me how to do this, or am I in the wrong group?

Thanks in advance

Craig




Message-ID: <IpWdneqcqpaosAnUnZ2dnUVZ_vednZ2d@posted.palinacquisition>
Subject: Re: How do I extract data from a PDF document using code
Date: Thu, 12 Feb 2009 15:08:00 +0100


You might as well be speaking Swahili to me on this one, but Tracker Software
has a pretty advanced PDF reader program in PDF-XChange Viewer Pro. You might
give it a look and see if there is a solution for you there.

http://www.docu-track.com/home/prod_user/PDF-XChange_Tools/PDF-XChange_Viewer_PRO/

Good luck on your search.


-- 
Don - PDF-XChange Pro®/PDF-XChange Viewer Pro®
Vancouver, USA






Message-ID: <6vjimcFk2toiU1@mid.individual.net>
Subject: Re: How do I extract data from a PDF document using code
Date: Thu, 12 Feb 2009 22:29:40 +0100


Craig wrote:
> Hi
>
> I'm using C# in Visual Studio .NET and want to extract a line of text
> from a PDF document in order to process it for another application.
> The part of the document I require is static (i.e., it will always be in
> the same 8 characters of the document).
>
> Can anyone tell me how to do this, or am I in the wrong group?

I think you buy the API from Adobe.

I'm not sure I understand what "always be in the same 8 characters of
the document" means (unless it's a *very* short piece of text), but if
it really is part of the document text, and it lies in static text, and
you only want the text, not the binary formatting around it, use
pdftotext and a regular-expression (RE) processor like grep or sed (both
available for Windows; their functions may be built into C#...I don't
know).

For example, I saved your post as PDF 
(http://silmaril.ie/software/ctp.pdf) so I could grab your email address 
with:

C:\tmp\> pdftotext ctp.pdf - | grep @ | sed "s+^.*<\([^>]*\)>.*$+\1+"
cpbuck@gmail.com
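
The grep and sed stages of that pipeline can be tried on their own, without a PDF. What follows is only a sketch for a POSIX-style shell: the function name bracketed_address and the sample address name@example.com are made up for illustration, and it assumes the address sits inside angle brackets on a line that contains "@".

```shell
#!/bin/sh
# Illustrative sketch; bracketed_address is a hypothetical helper name.
# Reads text on stdin, keeps only lines containing "@", and prints the text
# found inside the last <...> pair on the first such line.
bracketed_address() {
  grep '@' | sed 's+^.*<\([^>]*\)>.*$+\1+' | head -n 1
}

# Try it on plain text instead of pdftotext output.
printf '%s\n' 'Hi' '"Someone" <name@example.com> wrote in message' | bracketed_address
```

For a real file, the same function would simply sit at the end of the pdftotext call, as in: pdftotext ctp.pdf - | bracketed_address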

On the other hand, if what you want is some control sequence which is 
part of the internal structure of the PDF, rather than visible text, I 
suspect you either need Adobe's or someone's API, or a seriously deep 
knowledge of what goes on inside PDFs :-)

///Peter




Message-ID: <aec35033-fac4-4988-b80a-8ce23a1f3caa@o36g2000yqh.googlegroups.com>
Subject: Re: How do I extract data from a PDF document using code
Date: Fri, 13 Feb 2009 11:20:00 +0100


Hi Peter

Thanks for the full reply. The 8 characters I am after are actually a
contract number displayed in the document. The document is made up of
multiple pages; I want to split it into individual documents (which I
can do) and save each document as the contract number. I'll give your
suggestion a go, as it sounds as if it could work.

Craig
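
Applied to the contract-number case, Peter's pipeline might look like the sketch below. It rests on assumptions the thread does not confirm: the contract number is taken to be exactly 8 digits standing alone as a word (the real format is never stated, so CONTRACT_RE is a placeholder), the PDFs are already split one per contract, and pdftotext is on the PATH. The function names are hypothetical, and grep's -o and -w options are common extensions rather than strict POSIX.

```shell
#!/bin/sh
# Sketch only. ASSUMPTION: a contract number is exactly 8 digits standing
# alone as a word. Adjust CONTRACT_RE to the real format.
CONTRACT_RE='[0-9]{8}'

# Print the first contract number found in the text on stdin (nothing if none).
first_contract_number() {
  grep -o -w -E "$CONTRACT_RE" | head -n 1
}

# Rename one already-split PDF after the contract number found in its text.
# Leaves the file alone if no number is found or the target name is taken.
rename_by_contract() {
  pdf=$1
  number=$(pdftotext "$pdf" - | first_contract_number)
  if [ -z "$number" ]; then
    echo "no contract number found in $pdf" >&2
    return 1
  fi
  target="$(dirname "$pdf")/$number.pdf"
  if [ -e "$target" ]; then
    echo "$target already exists, skipping $pdf" >&2
    return 1
  fi
  mv -- "$pdf" "$target"
}
```

A loop such as: for f in pages/*.pdf; do rename_by_contract "$f"; done would then process a whole directory (pages/ is a made-up path). Refusing to overwrite an existing target is a deliberate choice here, since two pages carrying the same number would otherwise silently replace each other.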






Message-ID: <ufdll.40195$eY.9202@newsfe15.ams2>
Subject: Re: How do I extract data from a PDF document using code
Date: Fri, 13 Feb 2009 12:58:17 +0100



"... save each document as the contract number. "  Hmm. That sounds
like a bit of a dodgy hack to me, if you mean that the contract number
will become the filename, or part of the filename.
Could I suggest that you (either 'also' or 'instead') put the contract
number into the document metadata?

You can do this with quite a number of tools (including Acrobat
Professional and open-source ones), or you can use PDF JavaScript.
You can also use PDF JavaScript to do the low-level text
extraction and regular expression matching that Peter has already
outlined.
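
As one possible illustration of the metadata route, the sketch below uses pdftk, which is not named in the thread and is chosen here only as an example of an open-source tool. The assumptions are that pdftk is installed, that a custom Info key called ContractNumber (a made-up name) is acceptable, and that the record layout matches what your pdftk prints from its dump_data operation; it is worth checking that output before relying on this.

```shell
#!/bin/sh
# Sketch only. ASSUMPTIONS: pdftk is installed; "ContractNumber" is a
# hypothetical custom Info key; the record layout mirrors the output of
# `pdftk file.pdf dump_data` in recent pdftk releases.

# Print a pdftk info record for the given contract number.
contract_info_record() {
  printf 'InfoBegin\nInfoKey: ContractNumber\nInfoValue: %s\n' "$1"
}

# Write a copy of a PDF with the contract number stored in its Info dictionary.
# usage: tag_with_contract in.pdf 12345678 out.pdf
tag_with_contract() {
  info_file=$(mktemp) || return 1
  contract_info_record "$2" > "$info_file"
  pdftk "$1" update_info "$info_file" output "$3"
  status=$?
  rm -f "$info_file"
  return "$status"
}
```

Writing to a new output file rather than modifying the input in place keeps the original PDF untouched if anything goes wrong.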




 
