String scan question

Srinsriram · 6 April 2007 16:55

This is probably elementary, but I just haven't found a reliable
way to do this (one that always works).

If a string has content in tags, such as <TagName> content goes here
</TagName>, what's the best way to put the content into an array? The
content can contain whitespace characters (line breaks, tabs, etc.) that
should be preserved in the array element. These tags are simple (no
attributes).

I assume that the scan method is relevant, but I am having trouble
constructing a regex that works reliably.

Peter_Szinek3 · 6 April 2007 16:59

srinsriram@gmail.com wrote:

> This is probably elementary, but I just haven't found a reliable
> way to do this (one that always works).
>
> If a string has content in tags, such as <TagName> content goes here
> </TagName>, what's the best way to put the content into an array? The
> content can contain whitespace characters (line breaks, tabs, etc.) that
> should be preserved in the array element. These tags are simple (no
> attributes).
>
> I assume that the scan method is relevant, but I am having trouble
> constructing a regex that works reliably.

Try:

html_file.scan(/<TagName>(.+?)</).flatten

This will put the text contents of all <TagName> tags into an array.
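
As a rough illustration (the sample string is made up, and html_file is assumed to be a plain Ruby String holding the markup on a single line), the pattern behaves like this:

```ruby
# Hypothetical sample input; html_file is assumed to be a String.
html_file = "<TagName>first</TagName> and <TagName> second one </TagName>"

# scan returns one sub-array per match (one entry per capture group);
# flatten turns [["first"], [" second one "]] into a flat array.
contents = html_file.scan(/<TagName>(.+?)</).flatten

p contents  # => ["first", " second one "]
```

Because the capture stops at the first "<" and "." does not match a newline by default, this sketch only suits content with no nested tags and no line breaks.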

Cheers,
Peter

--
http://www.rubyrailways.com :: Ruby and Web2.0 blog
http://scrubyt.org :: Ruby web scraping framework
http://rubykitchensink.ca/ :: The indexed archive of all things Ruby

Srinsriram · 6 April 2007 17:10

Here is a simple test case:

s = "<VariableValue>\nXXX\n</VariableValue>\n\n<VariableValue>
\n<Choice> Administrative Support - Supervisors<Choice>\n</
VariableValue>\n\n"

For the VariableValue tag, there should be 2 array elements: \nXXX\n
and \n<Choice> Administrative Support - Supervisors<Choice>\n

s.scan(/<VariableValue>(.+?)</).flatten returns an empty array.

Peter_Szinek3 · 6 April 2007 17:27

srinsriram@gmail.com wrote:

> Here is a simple test case:
>
> s = "<VariableValue>\nXXX\n</VariableValue>\n\n<VariableValue>
> \n<Choice> Administrative Support - Supervisors<Choice>\n</
> VariableValue>\n\n"
>
> For the VariableValue tag, there should be 2 array elements: \nXXX\n
> and \n<Choice> Administrative Support - Supervisors<Choice>\n
>
> s.scan(/<VariableValue>(.+?)</).flatten returns an empty array.

Ah OK, I did not know there could be other tags inside. This works better:

s.scan(/<VariableValue>(.+?)<\/VariableValue>/m).flatten

(Note the 'm' flag, Ruby's multiline mode, which lets . match newlines as well.)

However, to really match your example, I needed this:

s.scan(/<VariableValue>(.+?)<\/\n?VariableValue>/m).flatten

Are you sure there is a line break between / and the tag name?
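
A minimal sketch of the difference the 'm' flag makes, assuming the test case is a single Ruby string literal whose closing tags contain no line break (the names without_m and with_m are only for illustration):

```ruby
# Test string from the post above, assumed to be unwrapped.
s = "<VariableValue>\nXXX\n</VariableValue>\n\n<VariableValue>\n" \
    "<Choice> Administrative Support - Supervisors<Choice>\n</VariableValue>\n\n"

# Without /m, "." does not match "\n", so (.+?) cannot get past the
# newline that directly follows each opening tag.
without_m = s.scan(/<VariableValue>(.+?)<\/VariableValue>/).flatten

# With /m, "." also matches "\n"; the lazy quantifier stops at the
# first closing </VariableValue> tag.
with_m = s.scan(/<VariableValue>(.+?)<\/VariableValue>/m).flatten

p without_m  # => []
p with_m     # => ["\nXXX\n", "\n<Choice> Administrative Support - Supervisors<Choice>\n"]
```

Anchoring on the full closing tag, rather than on the next "<", is what lets the unmatched <Choice> tags stay inside the captured text.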

Cheers,
Peter

Alex_Young · 6 April 2007 17:31

srinsriram@gmail.com wrote:

> Here is a simple test case:
>
> s = "<VariableValue>\nXXX\n</VariableValue>\n\n<VariableValue>
> \n<Choice> Administrative Support - Supervisors<Choice>\n</
> VariableValue>\n\n"

If it's actually XML, just use REXML. Anything else is asking for trouble, really.
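
For input that really is well-formed XML, a minimal REXML sketch might look like the following. The sample document and its <Root> wrapper are invented for illustration; on recent Ruby versions REXML ships as a bundled gem, so it may need to be installed or listed in the Gemfile.

```ruby
require "rexml/document"

# Hypothetical, well-formed sample; the <Root> wrapper is an assumption.
xml = "<Root><VariableValue>XXX</VariableValue>" \
      "<VariableValue>YYY</VariableValue></Root>"

doc = REXML::Document.new(xml)

# XPath.match returns every element matching the path; Element#text
# returns the content of the element's first text node.
values = REXML::XPath.match(doc, "//VariableValue").map { |el| el.text }

p values  # => ["XXX", "YYY"]
```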

--
Alex

Srinsriram · 6 April 2007 17:35

No, there isn't any (that is due to wrapping here).
I didn't know about the multiline option (I seem to have missed that in
the docs). That worked.

Thanks very much.

Alex_Young · 6 April 2007 17:43

Alex Young wrote:

> srinsriram@gmail.com wrote:
>> Here is a simple test case:
>>
>> s = "<VariableValue>\nXXX\n</VariableValue>\n\n<VariableValue>
>> \n<Choice> Administrative Support - Supervisors<Choice>\n</
>> VariableValue>\n\n"
>
> If it's actually XML, just use REXML. Anything else is asking for
> trouble, really.

Sorry, I didn't notice that your <Choice> tags aren't matched. Is that intentional? If so, ignore my suggestion - REXML clearly won't work.
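
To illustrate why, a small sketch (the input is a shortened, hypothetical version of the test case) shows REXML rejecting markup whose <Choice> tag is never closed:

```ruby
require "rexml/document"

# Shortened, hypothetical input: the second <Choice> opens a new element
# instead of closing the first, so the markup is not well-formed XML.
broken = "<VariableValue><Choice> Administrative Support <Choice></VariableValue>"

begin
  REXML::Document.new(broken)
  parsed = true
rescue REXML::ParseException => e
  # REXML parses eagerly, so the mismatch surfaces when the document is built.
  parsed = false
  puts "REXML refused the input (#{e.class})"
end
```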

--
Alex

Srinsriram · 6 April 2007 18:00

The content can be quite nonstandard and have mismatched tags etc.
(like real-life HTML), so this is not XML.
I will use REXML when the input is XML. Thanks for your suggestion.
This group is very useful for newbies.

Peter_Szinek3 · 6 April 2007 18:33

BTW, if the content is funky, you could still try Hpricot. It handles such crap surprisingly nicely, and unless you need to match something more complicated than the text between an opening and a closing tag, it will really make your life easier.

Cheers,
Peter
