-------- Original Message --------
Date: Sun, 23 Aug 2009 22:05:23 +0900
From: Arun Kumar <arun.einstein@gmail.com>
To: ruby-talk@ruby-lang.org
Subject: Re: Parsing pdf files
Hello Alex,
Suppose the data in the PDF is two-columned (as is the case in research
papers) or has some tables. The copied version should have the same
amount of space between words and columns. I'll attach an example of
two-columned text here for your reference. For the program I'm writing,
the layout is most essential.
If you are not able to see two-column output in your text editor (since
it's probably more than 80 characters per line), please reduce the font
size of your text editor (or use a large monitor).
Observe how there's space between the two columns of the output. This was
done by copying from evince to gedit or emacs.
Dear Arun,
I suppose this is due to the fact that pdftotext (but also gedit) converts
tabs to spaces.
Also, the good impression you get in gedit depends a lot on using a
monospaced font.
Admittedly, a very quick hack
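text=IO.read("ie.txt")
# turn every eight consecutive spaces into a tab
text.gsub!(/ {8}/,"\t")
# drop any remaining run of two or more spaces
text.gsub!(/ {2,}/,"")
# the block form closes the file automatically
File.open("temp_out.txt","w"){|f| f.puts text}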
doesn't give very nice results, so some additional fiddling is necessary.
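One direction for that fiddling, only as a rough sketch: replace each run of two or more spaces with a single tab, so that a column gap survives as one separator instead of being deleted. The name tabify and the default threshold of two spaces are made up for this illustration and would need tuning on real pdftotext output.

```ruby
# Hypothetical helper (not part of the hack above): turns each run of
# `min_run` or more spaces into exactly one tab. Single spaces between
# words are left alone.
def tabify(line, min_run = 2)
  line.gsub(/ {#{min_run},}/, "\t")
end

# The wide gap between the two columns becomes a single tab.
puts tabify("left column text      right column text")
```

A larger min_run is safer when the source text is justified and contains double spaces inside a column.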
I once wrote some code to separate two-columned text. You can combine the two columns with tabs.
Best regards,
Axel
-------------------------
def column_arrange(txt_file)
  text=IO.readlines(txt_file)
  # a gap: two or more spaces followed by a visible character
  # (the minimum of two spaces is a guess; tune it for your data)
  reg=/ {2,}[^ \n]/
  ref=[]
  text.each{|line|
    # there might be several longer sequences of whitespace in a line
    line.to_enum(:scan,reg).each{
      # index of the first character after the gap
      ref<<Regexp.last_match.end(0)-1
    }
  }
  # the median gap position is where most lines switch columns
  cut_most_columns_here=ref.sort[ref.length/2]
  col1=[]
  col2=[]
  text.each{|line|
    whites_ind=line.to_enum(:scan,reg).collect{ Regexp.last_match.end(0)-1 }
    if whites_ind.empty?
      cut_here=cut_most_columns_here
    else
      # cut at the gap closest to the common column boundary
      cut_here=whites_ind.min_by{|x| (x-cut_most_columns_here).abs}
    end
    # to_s guards against lines shorter than the cut position
    col1<<line[0...cut_here].to_s.chomp
    col2<<line[cut_here..-1].to_s.chomp
  }
  return col1,col2
end
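
Combining the two columns with tabs, as mentioned above, could look roughly like this. It is only a sketch: join_columns is a hypothetical helper name, and it assumes the two arrays have the shape that column_arrange returns (one string per input line in each array).

```ruby
# Hypothetical helper: joins two column arrays line by line with a tab.
# Padding spaces around the cut are stripped so that the tab is the
# only separator between the two columns.
def join_columns(col1, col2)
  col1.zip(col2).map { |left, right| "#{left.rstrip}\t#{right.to_s.strip}" }
end

# Example with hand-made columns; in practice they would come from
# something like: col1, col2 = column_arrange("ie.txt")
left  = ["first line   ", "second line"]
right = ["  third line", "fourth line"]
puts join_columns(left, right)
```

With a tab as the single separator, each output line can later be split back into its two columns with split("\t").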