Get rid of extra, blank lines via html parsing?

David_Ainley · 4 August 2010 04:29

So I am trying to get some information from a snippet of HTML
(http://pastebin.com/iTXyxQ0j), and I'm using doc.inner_text to get the
important parts, but when I do so I get an odd amount of spacing
(http://pastebin.com/6HWDs5dm). Is there a way to get rid of
all that extra spacing so I can just print the output and it looks
clean? Possibly something like

pino
0.2.11-ubuntu0~lucid
troorl
(2010-07-04)

pino
0.2.10-ubuntu0~karmic
troorl
(2010-05-27)

that? Or can I get each piece of text and add it to an array? If I do
that while it's got all that odd spacing, is that spacing a piece of the
variable? Or is it just the text?

Thanks guys!

--
Posted via http://www.ruby-forum.com/.

Jesus_Gabriel_y_Gala · 4 August 2010 06:28

You can remove 2 or more consecutive "\n" like this:

irb(main):001:0> s =<<EOS
irb(main):002:0" test
irb(main):003:0"
irb(main):004:0" test2
irb(main):005:0" sdfsdf
irb(main):006:0" werwer
irb(main):007:0"
irb(main):008:0"
irb(main):009:0"
irb(main):010:0"
irb(main):011:0" sdfsdfsd
irb(main):012:0" sdfer234
irb(main):013:0" EOS
=> "test\n\ntest2\nsdfsdf\nwerwer\n\n\n\n\nsdfsdfsd\nsdfer234\n"
irb(main):019:0> s.gsub /\n\n+/, "\n"
=> "test\ntest2\nsdfsdf\nwerwer\nsdfsdfsd\nsdfer234\n"

or

irb(main):020:0> s.gsub /\n{2,}/, "\n"
=> "test\ntest2\nsdfsdf\nwerwer\nsdfsdfsd\nsdfer234\n"
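Two details are worth keeping in mind with these patterns. gsub returns a new string and leaves the receiver unchanged, so the result has to be assigned or printed. Also, both patterns only match newlines that sit directly next to each other, so a "blank" line that still holds spaces or tabs is not collapsed. A minimal sketch, assuming the text contains such whitespace-only lines (the sample string below is made up for illustration and is not the pastebin content):

```ruby
# Made-up sample text; note the line that holds only spaces.
text = "pino\n\n   \n0.2.11-ubuntu0~lucid\n\n\ntroorl\n"

# gsub returns a new string and leaves `text` untouched, so keep the result.
only_newlines = text.gsub(/\n{2,}/, "\n")

# Also treat lines that contain only spaces or tabs as blank.
cleaned = text.gsub(/\n\s*\n/, "\n")

puts cleaned
```

The second pattern still leaves indentation on non-blank lines in place; stripping each line separately takes care of that.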

Hope this helps,

Jesus.

GianFranco_Bozzetti · 4 August 2010 08:35

Use the String methods strip!, gsub! and squeeze!, as in
the following snippet:

# no-white.rb - remove empty lines and sequences of blanks
# from a text file
# the block form of File.open closes the file automatically
File.open('6HWDs5dm.txt') do |fh|
    until fh.eof?
        line = fh.readline.chomp
        # remove leading and trailing blanks
        line.strip!
        # skip empty lines
        next if line.empty?
        # convert tab chars to blanks
        line.gsub!(/\t/, ' ')
        # substitute a single blank for a sequence of blanks
        line.squeeze!(' ')
        # add code to process line if needed
        puts line
    end
end

HTH gfb
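
On the array part of the question: the extra whitespace is part of the string value itself, so it has to be removed before or while the pieces are collected. A minimal sketch of the same cleanup applied to a String instead of a file, gathering the non-blank lines into an array (the sample string is a made-up stand-in for what doc.inner_text returns):

```ruby
# Made-up stand-in for the string returned by doc.inner_text.
text = "\n\n   pino  \n\t\n 0.2.11-ubuntu0~lucid\n\n\n  troorl \n   (2010-07-04)\n\n"

# One array element per non-blank line, with surrounding whitespace removed.
fields = text.lines.map(&:strip).reject(&:empty?)

puts fields
```

Each element can then be printed or processed on its own, for example with fields.each_slice(4) if every record has four lines.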

David_Ainley · 4 August 2010 14:18

Hey guys, thanks for the responses. Jesus, the gsubs don't do anything
:/, the output still looks the same.

And Gianfranco, every time I try to use readline, it gives me an error
"private method `readline' called for #<String:0xb71c3fd8>
(NoMethodError)"
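
That error message suggests readline was called on a String (presumably the result of doc.inner_text) rather than on an open File. Every object carries Kernel#readline as a private method, which is why Ruby reports a private method call instead of an undefined one. A sketch of one way around it, assuming the text is already in a String: wrap it in a StringIO so the same eof/readline loop works (the sample string is made up for illustration):

```ruby
require 'stringio'

# Made-up stand-in for the string returned by doc.inner_text.
text = "\n  pino \n\n\t0.2.11-ubuntu0~lucid\n\n"

# StringIO gives a String the IO interface (eof?, readline) the loop expects.
io = StringIO.new(text)
kept = []
until io.eof?
  line = io.readline.strip
  next if line.empty?
  kept << line.squeeze(' ')
end

puts kept
```

Iterating with text.each_line would work as well and avoids the extra object.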


Jesus_Gabriel_y_Gala · 4 August 2010 19:28

Can you show your code?

Jesus.

