Finding a sentence (more than one word & punctuation (, . ;)) in a string?

given this string

" <td valign=\"top\">message</td> <td valign=\"top\">the message to echo.</td> <td valign=\"top\" align=\"center\">Yes, unless data is included in a character section within this element.</td> </tr> "

how can I get this result

["message", "the message to echo.", "Yes, unless data is included in a character section within this element."]

?

I've tried scan + regexp, but the best I've got so far is

[["message"]]

with this

r.scan(/\">(\w+\s*)<\/td>/)

Thanks
Kev

Kev Jackson wrote:

given this string

" <td valign=\"top\">message</td> <td valign=\"top\">the message
to echo.</td> <td valign=\"top\" align=\"center\">Yes, unless data
is included in a character section within this element.</td> </tr> "

how can I get this result

["message", "the message to echo.", "Yes, unless data is included in a
character section within this element."]

?

I've tried scan + regexp, but the best I've got so far is

[["message"]]

with this

r.scan(/\">(\w+\s*)<\/td>/)

Thanks
Kev

If you really want sentences, this will work:

s.scan /\w+(?:[\s,]+\w+)*[.;?!]/

=> ["the message\nto echo.", "Yes, unless data is\nincluded in a character
section within this element."]

s.scan /\w+(?:,?\s+\w+)*[.;?!]/

=> ["the message\nto echo.", "Yes, unless data is\nincluded in a character
section within this element."]
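To try the first pattern against the wrapped string, here is a minimal sketch. SAMPLE and sentences are names made up for this illustration, and the whitespace clean-up at the end is an addition, not part of the pattern above. Note that the lone word "message" is not returned, because it has no closing punctuation.

```ruby
# The quoted string, with the same line wraps as in the message above.
SAMPLE = " <td valign=\"top\">message</td> <td valign=\"top\">the message\n" \
         "to echo.</td> <td valign=\"top\" align=\"center\">Yes, unless data is\n" \
         "included in a character section within this element.</td> </tr> "

# Hypothetical helper: find runs of words that end in sentence punctuation,
# then collapse the embedded newlines into single spaces.
def sentences(text)
  text.scan(/\w+(?:[\s,]+\w+)*[.;?!]/).map { |s| s.gsub(/\s+/, " ") }
end

p sentences(SAMPLE)
# => ["the message to echo.", "Yes, unless data is included in a character section within this element."]
```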

Kind regards

    robert

Hi all,

Erik Veenstra wrote:
....

s.split(/\s*<[^<>]*>\s*/).reject{|x| x.empty?}

gegroet,
Erik V. - http://www.erikveen.dds.nl/

As a newbie I thought I'd have a go at this.
What I was trying to do was take Erik's code above, get the text between
tags into an array and then print it out as:
[message, the message to echo, Yes, unless data is included...]

I can do it by the look of things, but if there are any suggestions on how to improve this I'd appreciate it. I.e. is the {} the most efficient way to fill the array? Is there a better way to print it out?

# --------------------------------
foo = " <td valign=\"top\">message</td> <td valign=\"top\">the message to echo.</td> <td valign=\"top\" align=\"center\">Yes, unless data is included in a character section within this element.</td> </tr> "

# I want to fill an array so I can display in the format
# [message, the message to echo, Yes, unless...]
a = Array.new

# I think I understand this.
# /\s*<[^<>]*>\s*/ = find all tags
# \s* find 0 or more spaces
# <[^<>]*> find anything between and including <>
# \s* as above
# and reject them (.reject)
# what's left (text between tags) use as x in the block |x|

# x seemed to include empty strings so only add x to the array if not ""
foo.split(/\s*<[^<>]*>\s*/).reject{|x| a.insert(-1,x) if x != ""}

# Trying to find the best way to print this???
# nothing like what I want
# puts "--- print a ---"
print a

# extra space after last item
# puts "\n\n--- print \"[\" a.each{|x| print x + \", \" print \"]\" ---"
print "[ "
   a.each{|x| print x + ", "}
print "]"

# close but must know array size
# puts "\n\n print \"[\" + a[0] + \", \" + a[1] + \", \" + a[2] + \"]\""
print "[" + a[0] + ", " + a[1] + ", " + a[2] + "]\n"

# probably the most 'right' output wise
puts "\n\n--- for i in 0...a.length-1 ---"
print "[ "
for i in 0...a.length-1
   print a[i] + ", "
end
print a[a.length-1]
print "]"
# --------------------------------
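On the printing question, one possible shortcut is Array#join, which puts the separator only between elements, so the last item needs no special case and the array size does not have to be known. bracketed is a made-up helper name for this sketch.

```ruby
# Hypothetical helper: format an array as [item, item, item] without quotes.
def bracketed(items)
  "[" + items.join(", ") + "]"
end

a = ["message", "the message to echo.", "Yes, unless..."]
puts bracketed(a)   # prints [message, the message to echo., Yes, unless...]
```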

thanks,

Mark

There have been several simple approaches proposed in this thread that may work for what you want. If you need something more robust, you could have a look at existing Perl modules that solve this problem, such as Lingua::EN::Sentence.

-- fxn

On Jan 11, 2006, at 8:08, Kev Jackson wrote:

given this string

" <td valign=\"top\">message</td> <td valign=\"top\">the message to echo.</td> <td valign=\"top\" align=\"center\">Yes, unless data is included in a character section within this element.</td> </tr> "

how can I get this result

["message", "the message to echo.", "Yes, unless data is included in a character section within this element."]

if this is an HTML table extraction thing, rubyful soup is the easiest
way to do it
http://www.crummy.com/software/RubyfulSoup/documentation.html

there's also the htmltokenizer.getText() method, (which i just now
discovered by googling) which allows you to extract from before 1 tag
at a time
http://htmltokenizer.rubyforge.org/doc/

Mark Woodward wrote:
....

# x seemed to include empty strings so only add x to the array if not ""
foo.split(/\s*<[^<>]*>\s*/).reject{|x| a.insert(-1,x) if x != ""}

Hmm, here's the first improvement? Seems I can use a << x to append to an array:

# x seemed to include ""??? so only add x to the array if not ""
foo.split(/\s*<[^<>]*>\s*/).reject{|x| a << x if x != ""}

--
Mark

I'm not sure what you're trying to do here, but I think split returns an array already, operated on by reject in this case, which returns the new array. So with Erik's code:

  a = s.split(/\s*<[^<>]*>\s*/).reject{|x| x.empty?}
  p a
  # => ["message", "the message to echo.", ... etc ... ]
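As a rough, self-contained sketch of why the reject step is needed: split leaves an empty string wherever the pattern matches at the very start of the string or where two tags sit next to each other. ROW and cell_texts are names invented for this illustration.

```ruby
# The original (unwrapped) string from the question.
ROW = " <td valign=\"top\">message</td> <td valign=\"top\">the message to echo.</td> " \
      "<td valign=\"top\" align=\"center\">Yes, unless data is included in a character " \
      "section within this element.</td> </tr> "

# Hypothetical helper wrapping the one-liner: split on tags, drop empty pieces.
def cell_texts(html)
  html.split(/\s*<[^<>]*>\s*/).reject { |x| x.empty? }
end

p ROW.split(/\s*<[^<>]*>\s*/)   # includes "" entries between adjacent tags
p cell_texts(ROW)               # only the text between the tags
```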

I guess an alternative similar to your approach above might be:

  b = foo.split(/\s*<[^<>]*>\s*/).inject([]) { |ary,x| if x.empty? then ary else ary << x end }
  p b
  # => ["message", "the message to echo.", ... etc ... ]
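A small sketch of the inject variant, written with an explicit empty array as the starting accumulator. Without a starting value, inject uses the first element of the collection as the accumulator, which here would be a string rather than an array. non_empty is a hypothetical helper name.

```ruby
# Hypothetical helper: collect the non-empty strings into a new array.
def non_empty(parts)
  parts.inject([]) { |ary, x| x.empty? ? ary : ary << x }
end

parts = ["", "message", "", "the message to echo."]
p non_empty(parts)   # => ["message", "the message to echo."]
```

For a plain filter like this, reject (or select) says the same thing more directly; inject earns its keep when the accumulator is something other than a filtered copy.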

Note the 'p' method, which prints out using 'inspect'. Alternatively, you could have done:

  puts b.inspect
  print "#{b.inspect}\n"

and so on. Another nitpick about your example is that, in most Ruby I've seen, people tend to prefer unless over negating the condition of an if. So where you have:

  if x != ""

I'd tend to use:

  unless x == ""

or (more likely):

  unless x.empty?
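Going back to the printing forms mentioned above, this sketch checks that p, puts with inspect, and print with an interpolated inspect all write the same line. captured is a helper invented for the illustration; it temporarily swaps $stdout for a StringIO.

```ruby
require "stringio"

# Hypothetical helper: run the block and return whatever it wrote to $stdout.
def captured
  old = $stdout
  $stdout = StringIO.new
  yield
  $stdout.string
ensure
  $stdout = old
end

b = ["message", "the message to echo."]
outputs = [
  captured { p b },
  captured { puts b.inspect },
  captured { print "#{b.inspect}\n" }
]
puts outputs.uniq.length   # prints 1: all three forms produce identical output
puts outputs.first         # prints ["message", "the message to echo."]
```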

Cheers,

On Wed, 11 Jan 2006 10:52:08 -0000, Mark Woodward <see@signature.com> wrote:

--
Ross Bamford - rosco@roscopeco.remove.co.uk

Gene Tani wrote:

if this is an HTML table extraction thing, rubyful soup is the easiest
way to do it
http://www.crummy.com/software/RubyfulSoup/documentation.html

there's also the htmltokenizer.getText() method, (which i just now
discovered by googling) which allows you to extract from before 1 tag
at a time
http://htmltokenizer.rubyforge.org/doc/

That is indeed what the problem domain is (did the <td> give it away?).

Basically I have a whole lot of html files and I need to re-write them as xml (sort of docbook-ish, but not quite). I'm using builder (fantastic bit of kit by the way), but my original files sometimes contain things like

    "<td valign=\"top\">append</td>
    <td valign=\"top\">Append to an existing file (or
      <a href=\"http://java.sun.com/j2se/1.4.2/docs/api/java/io/FileWriter.html#FileWriter(java.lang.String, boolean)\" target=\"_blank\">
      open a new file / overwrite an existing file</a>)?
    </td>
    <td valign=\"top\" align=\"center\">No - default is false.</td>"

And anything I try basically means that I end up with either nothing extracted or the whole table extracted! My thoughts were to try a simple conversion and then fix things manually afterwards (i.e. get 95% of the conversion done through a script and then apply some elbow grease to finish off the parts that take too much time to work out).
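For cells that contain nested tags, one rough approach in the "95% by script" spirit is to grab each td's contents first and strip any inner tags afterwards. This is only a sketch: CELLS uses a shortened placeholder URL rather than the real link, td_texts is a made-up name, and a regexp like this would not cope with nested tables or a literal > inside an attribute value.

```ruby
# Sample cells modelled on the fragment above; the href is a placeholder.
CELLS = <<~HTML
  <td valign="top">append</td>
  <td valign="top">Append to an existing file (or
    <a href="http://example.invalid/FileWriter.html" target="_blank">
    open a new file / overwrite an existing file</a>)?
  </td>
  <td valign="top" align="center">No - default is false.</td>
HTML

# Hypothetical helper: capture everything inside each <td>...</td>
# (the /m flag lets . match newlines), then drop inner tags and
# squeeze runs of whitespace down to single spaces.
def td_texts(html)
  html.scan(/<td\b[^>]*>(.*?)<\/td>/m).map do |(inner)|
    inner.gsub(/<[^>]*>/, "").gsub(/\s+/, " ").strip
  end
end

p td_texts(CELLS)
# => ["append", "Append to an existing file (or open a new file / overwrite an existing file)?", "No - default is false."]
```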

I'm now off to read about this tokenizer ^^^ and see if it does what I want - obviously I'd love to have an automated solution (there are 1000+ html docs I need to convert).

I must admit to beginning to loathe HTML's lack of structural information - if this were a docbook file I'd have very few problems converting it (I could choose from many options), but HTML is so limited in its ability to express what meaning some section has [sigh]

Thanks to all for the suggested regexps - I never intended it to become a mini Ruby Quiz :)
Kev

A quick scan says that you've got legit xml there, why not use REXML?
It's included in the ruby standard libs. Or any of the above html/xml
parsing libraries with xpath to pluck your values out.

REXML Docs:
http://ruby-doc.org/stdlib/

REXML Homepage:
http://www.germane-software.com/software/rexml
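A minimal sketch of the REXML route, assuming the input is well-formed. The fragment quoted in the thread has a closing </tr> with no opening tag, so XML_ROW below adds one. td_text_via_rexml is a made-up name. On Ruby 3.0 and later REXML ships as a bundled gem rather than a default library, so it may need to be installed separately in some setups.

```ruby
require "rexml/document"

# Well-formed version of the row (the opening <tr> is added for this sketch).
XML_ROW = <<~XML
  <tr>
    <td valign="top">message</td>
    <td valign="top">the message to echo.</td>
    <td valign="top" align="center">Yes, unless data is included in a character section within this element.</td>
  </tr>
XML

# Hypothetical helper: parse the XML and return the text of every td element.
# Element#text only returns the first text node, so cells with nested
# elements would need extra handling.
def td_text_via_rexml(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, "//td").map { |td| td.text.to_s.strip }
end

p td_text_via_rexml(XML_ROW)
# => ["message", "the message to echo.", "Yes, unless data is included in a character section within this element."]
```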

  .adam

Hi Ross,

Ross Bamford wrote:

I'm not sure what you're trying to do here,

makes 2 of us ;)

but I think split returns an array already, operated on by reject in this case, which returns the new array. So with Erik's code:

    a = s.split(/\s*<[^<>]*>\s*/).reject{|x| x.empty?}
    p a # => ["message", "the message to echo.", ... etc ... ]

Exactly what I was trying to do. I thought it had to be an array but couldn't figure out how to print it like ["","",""] like the OP wanted.
p a - now that's embarrassing! 2 letters and it works. Compare that to my gibberish :-(. We all have to start somewhere I guess!

I guess an alternative similar to your approach above might be:

    b = foo.split(/\s*<[^<>]*>\s*/).inject([]) { |ary,x| if x.empty? then ary else ary << x end }
    p b
    # => ["message", "the message to echo.", ... etc ... ]

Note the 'p' method, which prints out using 'inspect'. Alternatively, you could have done:

    puts b.inspect
    print "#{b.inspect}\n"

steady on! ;)

and so on. Another nitpick about your example is that, in most Ruby I've seen, people tend to prefer unless over negating the condition of an if. So where you have:

    if x != ""

I'd tend to use:

    unless x == ""

or (more likely):

    unless x.empty?

Nitpick away! I appreciate it. It's been a good little exercise re p, puts, print and chaining methods etc. I've been reading the pickaxe book, but reading's not good enough. I need to write some code. If I can make a fool of myself here but learn something at the same time then that's great!

Cheers,

thanks,

--
Mark

Exactly what I was trying to do. I thought it had to be an array but couldn't figure out how to print it like ["","",""] like the OP wanted.
p a - now that's embarrassing! 2 letters and it works. Compare that to my gibberish :-(. We all have to start somewhere I guess!

Absolutely. My early Ruby was probably some of the least Rubyish Ruby around :) Check out the 'show_array' nonsense here at http://roscopeco.co.uk/code/noob/basic-syn2.rb - ouch. (I later refactored it a bit to http://roscopeco.co.uk/code/noob/arrays.html).

Nitpick away! I appreciate it. It's been a good little exercise re p, puts, print and chaining methods etc. I've been reading the pickaxe book, but reading's not good enough. I need to write some code. If I can make a fool of myself here but learn something at the same time then that's great!

Heh, I definitely know what you mean there - I have to do stuff to learn too. That said, though, I just got my paper pickaxe (finally, this morning!) and it's much better having something solid to refer to without having to switch to the browser and all that, so I can at least check I'm making sense :)

Cheers,

On Wed, 11 Jan 2006 11:40:58 -0000, Mark Woodward <see@signature.com> wrote:

--
Ross Bamford - rosco@roscopeco.remove.co.uk

Ross Bamford wrote:

Heh, I definitely know what you mean there - I have to do stuff to learn too. That said, though, I just got my paper pickaxe (finally, this morning!) and it's much better having something solid to refer to without having to switch to the browser and all that, so I can at least check I'm making sense :)

Yeah, I've been using the PDF version of Pickaxe (version 2) but will order the felled trees version I think. Also 'The Ruby Way' version 2 when it is published. Whatever it takes ;)

thanks again,

--
Mark