Parsing pdf files

Dear Arun,

There is a command-line tool, pdftotext, which accepts an encoding
specification and also has a "-layout" option, which tries to keep the
original physical layout of the text, including line breaks.
The list of possible encodings

pdftotext -listenc

does not include iscii-1988, so probably, you'll be out of luck
if the original document is not in Unicode (maybe you can use iconv
on the result of pdftotext).
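
From Ruby, the call could be wrapped roughly as follows. This is only a sketch: the helper names pdftotext_command and pdf_to_text are made up for illustration, pdftotext is assumed to be on the PATH, and the output is assumed to be wanted as UTF-8.

```ruby
require "open3"

# Hypothetical helper: builds the argument list for pdftotext.
# -layout and -enc are pdftotext options; the defaults here are assumptions.
def pdftotext_command(pdf_path, txt_path, encoding: "UTF-8", layout: true)
  cmd = ["pdftotext"]
  cmd << "-layout" if layout
  cmd.concat(["-enc", encoding])
  cmd.concat([pdf_path, txt_path])
end

# Hypothetical helper: runs pdftotext and returns the extracted text.
def pdf_to_text(pdf_path, txt_path)
  _out, err, status = Open3.capture3(*pdftotext_command(pdf_path, txt_path))
  raise "pdftotext failed: #{err}" unless status.success?
  File.read(txt_path, encoding: "UTF-8")
end
```

Something like pdf_to_text("paper.pdf", "paper.txt") would then return the text with the layout kept as well as pdftotext manages it.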

I found a utf-8 encoded web page in Hindi, printed it to a pdf file, used
pdftotext on it, and opened it in the SciTE editor, specifying the
encoding as UTF-8. Most of the symbols are recognized correctly,
but some are not ...(vowels? combinations of letters?)
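
One way to narrow down which symbols go wrong is to look at the code points in the extracted text. Devanagari vowel signs are separate combining characters that follow their consonant, so they are a plausible place for trouble. A small sketch (the helper name describe_codepoints is made up here, and the text is assumed to be valid UTF-8):

```ruby
# Hypothetical diagnostic helper: lists each code point of a string and
# marks the combining ones (Unicode category M, e.g. Devanagari vowel signs).
def describe_codepoints(str)
  str.each_char.map do |ch|
    kind = ch.match?(/\p{M}/) ? "combining mark" : "base"
    format("U+%04X %s", ch.ord, kind)
  end
end

# DEVANAGARI LETTER KA followed by DEVANAGARI VOWEL SIGN I
puts describe_codepoints("\u0915\u093F")
```

Comparing this listing for the original web page text and for the pdftotext output should show whether the marks are missing, reordered, or replaced.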

I'm sending the screenshot as an attachment to your email address.

Best regards,

Axel


Hello Axel,
Thank you. But I would like to point out that it's not very accurate in
maintaining the layout; I already tried it out. If you copy the text of a
pdf file from evince to gedit, you get better layout accuracy. What escapes
me is how to do it programmatically :)

cheers & regards,
Arun

On Sun, Aug 23, 2009 at 3:49 PM, Axel Etzold <AEtzold@gmx.de> wrote:


--

श्री जानकीरघुनाथो विजयते ||

-------- Original Message --------

Date: Sun, 23 Aug 2009 19:46:23 +0900
From: Arun Kumar <arun.einstein@gmail.com>
To: ruby-talk@ruby-lang.org
Subject: Re: Parsing pdf files


Dear Arun,

Could you say more about which layout features you need?

Best regards,

Axel


Hello Axel,

Suppose the data in the pdf is in two columns (as is the case in research
papers) or contains some tables. The copied version should have the same
amount of space between words and columns. I'll attach an example of
two-column text here for your reference. For the program I'm writing, the
layout is essential.

If you are not able to see two-column output in your text editor (since
it's probably more than 80 characters per line), please reduce the font size
of your text editor (or use a large monitor ;) ).

Observe how there's space between the two columns of the output. This was
done by copying from evince to gedit or emacs.
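
Whether a piece of extracted text really kept such a gap can be checked mechanically. The following is a rough sketch, not anything pdftotext or evince provides: the helper name widest_gutter is made up, and it assumes a mono-spaced rendering in which every character occupies exactly one cell (which may not hold for all scripts).

```ruby
# Hypothetical helper: returns [start, length] of the widest run of character
# positions that are blank in every non-empty line, or nil if there is none.
def widest_gutter(lines)
  lines = lines.map(&:chomp).reject { |l| l.strip.empty? }
  return nil if lines.empty?

  width = lines.map(&:length).max
  # A position counts as blank if every line has a space there or is shorter.
  blank = (0...width).map { |i| lines.all? { |l| l[i].nil? || l[i] == " " } }

  best = nil
  start = nil
  blank.each_with_index do |is_blank, i|
    if is_blank
      start ||= i
      len = i - start + 1
      best = [start, len] if best.nil? || len > best[1]
    else
      start = nil
    end
  end
  best
end
```

For a clean two-column page, the returned start position would be where the left column ends, and it could serve as the cut position for every line.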


ie.txt (8.83 KB)

On Sun, Aug 23, 2009 at 4:50 PM, Axel Etzold <AEtzold@gmx.de> wrote:



-------- Original Message --------

Date: Sun, 23 Aug 2009 22:05:23 +0900
From: Arun Kumar <arun.einstein@gmail.com>
To: ruby-talk@ruby-lang.org
Subject: Re: Parsing pdf files


Dear Arun,

I suppose this is because pdftotext (like gedit) converts tabs to spaces.
Also, the good impression you get in gedit depends a lot on using a
mono-spaced font :)
Admittedly, a very quick hack

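text = IO.read("ie.txt")
text.gsub!(/ {8}/, "\t")  # turn every run of 8 spaces back into a tab
text.gsub!(/ {2,}/, "")   # drop leftover padding (this also eats gaps of 2-7 spaces)
File.open("temp_out.txt", "w") { |f| f.puts text }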

doesn't give very nice results, so some additional fiddling is necessary.
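
A gentler variant of the quick hack above is to leave the text untouched and split each line on the wide gaps instead. This is only a sketch: the helper name split_cells is made up, and the threshold of two spaces is an assumption that will misfire when a column gap happens to be a single space.

```ruby
# Hypothetical helper: splits one line of layout-preserved text into cells,
# treating any run of two or more spaces as a column separator.
def split_cells(line)
  line.chomp.strip.split(/ {2,}/)
end

puts split_cells("alpha beta      gamma delta\n").join("\t")
```

Single spaces between words stay inside a cell, so only the wide gaps become tabs.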

I once wrote some code to separate two-columned text. You can combine the two columns with tabs.

Best regards,

Axel

-------------------------
def column_arrange(txt_file)
  text = IO.readlines(txt_file)
  # one or more spaces followed by the first character of the next word
  reg = / +[^ ]/

  # collect, over all lines, the index at which a word starts after whitespace
  ref = []
  text.each { |line|
    # there might be several longer sequences of whitespace in a line
    line.scan(reg) { ref << Regexp.last_match.end(0) - 1 }
  }
  # the median of these indices is taken as the typical column boundary
  cut_most_columns_here = ref.sort[ref.length / 2] || 0

  col1 = []
  col2 = []
  text.each { |line|
    whites_ind = []
    line.scan(reg) { whites_ind << Regexp.last_match.end(0) - 1 }

    if whites_ind.empty?
      # no whitespace in this line: fall back to the typical boundary
      cut_here = cut_most_columns_here
    else
      # cut at the word start closest to the typical boundary
      cut_here = whites_ind.min_by { |x| (x - cut_most_columns_here).abs }
    end
    # to_s guards against lines that are shorter than the cut position
    col1 << line[0...cut_here].to_s.chomp
    col2 << line[cut_here..-1].to_s.chomp
  }
  return col1, col2
end
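
Combining the two columns with tabs, as mentioned above, could look roughly like this. It is a sketch: the helper name join_columns is made up, and it only assumes two arrays of strings of the kind column_arrange returns.

```ruby
# Hypothetical helper: pairs up the two column arrays line by line and joins
# each pair with a tab, dropping the padding around each cell.
def join_columns(col1, col2)
  col1.zip(col2).map { |left, right| "#{left.to_s.rstrip}\t#{right.to_s.strip}" }
end

puts join_columns(["left one   ", "left two"], ["right one", "right two"])
```

With the arrays from column_arrange("ie.txt") as arguments, the result can be written to a file with puts in the same way.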
