Regex Help

(Ezra Zygmuntowicz) #1

Hey Guys-
     I have a regex problem that I am not sure how to tackle. I am parsing some classified ads in order to format them for display online. I have most of the parsing done, but I need help with the final step. The file has one ad per line, and a line looks like this:

<ftditm><begad:11559303>Selah Country Home 1.5 acres. 3 bdrm, 2 bath, irrigation, horse barn. $122,000. 509-697-6519<endad>

     I have already parsed everything to get it to this state. What I need to do next is count 50 chars after the <begad:11559303> tag and insert </ftditm> there. The tricky part is that if the 50 chars end in the middle of a word, I need to match the rest of that word as well. So I need a way to match at least 50 chars, plus the rest of the current word if the 50th char lands in the middle of a word.
     For this particular ad, 50 chars makes it to here:
<ftditm><begad:11559303>Selah Country Home 1.5 acres. 3 bdrm, 2 bath, irri #<= 50 chars ends here# gation, horse barn. $122,000. 509-697-6519<endad>
     So it ends in the middle of the word irrigation and I need it to consume the whole word.
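
To make the cutoff concrete, here is a minimal sketch that slices the ad text at 50 characters. The helper name `ad_text` is hypothetical, and the sketch assumes every line carries a `<begad:NNN>` tag with a numeric id and a closing `<endad>`, as in the sample above.

```ruby
# Hypothetical helper: return the ad text between <begad:...> and <endad>.
def ad_text(line)
  line[/<begad:\d+>(.*)<endad>/, 1]
end

line = "<ftditm><begad:11559303>Selah Country Home 1.5 acres. 3 bdrm, 2 bath, " \
       "irrigation, horse barn. $122,000. 509-697-6519<endad>"

text = ad_text(line)
puts text[0, 50]  # Selah Country Home 1.5 acres. 3 bdrm, 2 bath, irri
puts text[50, 6]  # gation
```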

Any help is much appreciated-
-Ezra Zygmuntowicz
Yakima Herald-Republic
WebMaster
509-577-7732
ezra@yakima-herald.com

(David A. Black) #2

Hi --

On Tue, 23 Aug 2005, Ezra Zygmuntowicz wrote:

> But the tricky part is that I need to place the </ftditm> 50 characters into the line, but if the 50 chars end in the middle of a word then I need to match the rest of the word as well. So I need a way to match at least 50 chars plus the rest of the current word if the 50th char lands in the middle of a word.

Here's one idea:

   str.sub(/(<begad:[^>]+>.{1,50}.*?\b)/, "\\1<\/ftditm>")
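
As a quick check, the substitution can be run against the sample line from the original post. This is only a sketch: the method name `insert_close_tag` is hypothetical, and the replacement is written as the single-quoted `'\1</ftditm>'`, which should be equivalent to the double-quoted form above.

```ruby
# Wraps the one-line substitution so it can be applied to each ad line.
def insert_close_tag(str)
  # Up to 50 chars after the <begad:...> tag, then lazily on to the
  # next word boundary.
  str.sub(/(<begad:[^>]+>.{1,50}.*?\b)/, '\1</ftditm>')
end

line = "<ftditm><begad:11559303>Selah Country Home 1.5 acres. 3 bdrm, 2 bath, " \
       "irrigation, horse barn. $122,000. 509-697-6519<endad>"

puts insert_close_tag(line)
# <ftditm><begad:11559303>Selah Country Home 1.5 acres. 3 bdrm, 2 bath, irrigation</ftditm>, horse barn. $122,000. 509-697-6519<endad>
```

The greedy `.{1,50}` stops inside "irrigation", so the lazy `.*?` has to advance to the end of that word before `\b` can match.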

David
--
David A. Black
dblack@wobblini.net

(John Halderman) #3

Seems to me that you're trying to do too much with one regular expression. I
would just grab the content between your tags and then trim that down to 50
characters and reassemble it afterwards.
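
A minimal sketch of that split, trim, and reassemble approach follows. The names `close_ftditm` and `LIMIT` are hypothetical, and the sketch assumes a modern Ruby (where `str[i]` returns a one-character string) and ad lines shaped like the sample in the original post.

```ruby
LIMIT = 50  # minimum number of ad characters before the closing tag

# Hypothetical helper: split the line into tags and ad text, find the cut
# point, and reassemble with </ftditm> inserted.
def close_ftditm(line, limit = LIMIT)
  m = line.match(/\A(.*?<begad:\d+>)(.*)(<endad>\s*)\z/)
  return line unless m  # not an ad line: leave it untouched

  head, text, tail = m.captures
  cut = [limit, text.length].min

  # If the cut lands inside a word, extend it to the end of that word.
  if cut > 0 && text[cut - 1] =~ /\w/
    cut += text[cut..-1][/\A\w*/].length
  end

  head + text[0, cut] + "</ftditm>" + text[cut..-1] + tail
end

puts close_ftditm("<ftditm><begad:1>Short ad<endad>")
# <ftditm><begad:1>Short ad</ftditm><endad>
```

Doing the arithmetic outside the regex makes the short-ad case explicit: the cut point can never move past the end of the ad text.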

-j

On 8/22/05, David A. Black <dblack@wobblini.net> wrote:

> Here's one idea:
>
> str.sub(/(<begad:[^>]+>.{1,50}.*?\b)/, "\\1<\/ftditm>")

(David A. Black) #4

Hi --

On Tue, 23 Aug 2005, John Halderman wrote:

> Seems to me that you're trying to do too much with one regular expression. I
> would just grab the content between your tags and then trim that down to 50
> characters and reassemble it afterwards.

I'm not sure what you mean by "too much". I think the substitution I
suggested does what Ezra said he needed. Is there an error in it?

David
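
For the sample line, the substitution behaves as described. One case that may be worth checking is an ad shorter than 50 characters: the greedy `.{1,50}` can run into the `<endad>` tag before `\b` finds a boundary. The sketch below compares the thread's pattern with a hypothetical variant, `close_within_ad`, that never reads past `<` and only extends the match when the last character taken is a word character. It assumes the ad text itself contains no `<`, and relies on lookbehind support (Ruby 1.9 or later).

```ruby
# The pattern from the thread, wrapped in a method for comparison.
def close_with_boundary(line)
  line.sub(/(<begad:[^>]+>.{1,50}.*?\b)/, '\1</ftditm>')
end

# Hypothetical variant: stop at "<" and finish the current word only if
# the 50th character falls inside one.
def close_within_ad(line)
  line.sub(/(<begad:[^>]+>[^<]{1,50}(?:(?<=\w)\w+)?)/, '\1</ftditm>')
end

long_ad  = "<ftditm><begad:11559303>Selah Country Home 1.5 acres. 3 bdrm, 2 bath, " \
           "irrigation, horse barn. $122,000. 509-697-6519<endad>"
short_ad = "<ftditm><begad:1>Short ad<endad>"

puts close_with_boundary(long_ad) == close_within_ad(long_ad)  # true
puts close_with_boundary(short_ad)  # <ftditm><begad:1>Short ad<endad</ftditm>>
puts close_within_ad(short_ad)      # <ftditm><begad:1>Short ad</ftditm><endad>
```

Whether this matters depends on the data: if every ad is known to be longer than 50 characters, the two patterns should give the same result on lines like the sample.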

--
David A. Black
dblack@wobblini.net

(Ezra Zygmuntowicz) #5

David-
     Thanks, the regex you posted works great. I had considered just trimming the text inside the tags and then extending it to the end of the current word, but I figured there would be a regex that would do it all at once.

Thanks Dave-
Ezra

On Aug 22, 2005, at 2:53 PM, David A. Black wrote:

> I'm not sure what you mean by "too much". I think the substitution I
> suggested does what Ezra said he needed. Is there an error in it?

-Ezra Zygmuntowicz
Yakima Herald-Republic
WebMaster
509-577-7732
ezra@yakima-herald.com