Hey Guys-
I have a regex problem that I am not sure how to tackle. I am parsing some classified ads in order to format them for display online. I have most of the parsing done, but I need help with the final step. The file has one ad per line, and a line looks like this:
<ftditm><begad:11559303>Selah Country Home 1.5 acres. 3 bdrm, 2 bath, irrigation, horse barn. $122,000. 509-697-6519<endad>
Now I have already parsed everything to get it to this state, but what I need to do next is count 50 chars after the <begad:11559303> tag and insert </ftditm> there.
The tricky part is that if the 50th char lands in the middle of a word, the </ftditm> has to go after the rest of that word. So I need a way to match at least 50 chars plus the rest of the current word.
For this particular ad, 50 chars makes it to here:
<ftditm><begad:11559303>Selah Country Home 1.5 acres. 3 bdrm, 2 bath, irri #<= 50 chars ends here# gation, horse barn. $122,000. 509-697-6519<endad>
So it ends in the middle of the word irrigation, and I need it to consume the whole word.
Any help is much appreciated-
-Ezra Zygmuntowicz
Yakima Herald-Republic
WebMaster
509-577-7732
ezra@yakima-herald.com
Seems to me that you're trying to do too much with one regular expression. I
would just grab the content between your tags and then trim that down to 50
characters and reassemble it afterwards.
-j
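A minimal sketch of that non-regex-heavy approach, for comparison. It assumes the ad text always sits between a <begad:...> tag and <endad> on one line, as in the sample; the method name split_headline and the limit parameter are made up for illustration and are not from the thread.

```ruby
# Hypothetical helper: take the text between <begad:...> and <endad>,
# cut it at `limit` characters, finish the current word if the cut
# landed inside one, and put the line back together with </ftditm>.
def split_headline(line, limit = 50)
  line.sub(/(<begad:[^>]+>)(.*?)(?=<endad>)/) do
    open_tag, body = $1, $2
    head = body[0, limit]            # first `limit` chars (fewer if the ad is short)
    rest = body[head.length..-1]     # whatever follows the cut
    # Only extend when the cut ended on a word character.
    extra = (head =~ /\w\z/) ? rest[/\A\w+/].to_s : ""
    "#{open_tag}#{head}#{extra}</ftditm>#{rest[extra.length..-1]}"
  end
end

ad = "<ftditm><begad:11559303>Selah Country Home 1.5 acres. 3 bdrm, 2 bath, " \
     "irrigation, horse barn. $122,000. 509-697-6519<endad>"
puts split_headline(ad)
```

For the sample ad this puts </ftditm> right after "irrigation". Because the text is sliced before anything is inserted, an ad shorter than 50 chars just gets </ftditm> in front of <endad>.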
On 8/22/05, David A. Black <dblack@wobblini.net> wrote:
Hi --
Here's one idea:
str.sub(/(<begad:[^>]+>.{1,50}.*?\b)/, "\\1<\/ftditm>")
David
--
David A. Black
dblack@wobblini.net
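To see what that substitution does, here is a small runnable check against the sample ad. The wrapper name close_ftditm and the short test ad are made up for illustration; the pattern and replacement are the ones posted above. `.{1,50}` greedily takes up to 50 chars after the tag, and `.*?\b` then extends only as far as the next word boundary.

```ruby
# Hypothetical wrapper around the posted substitution.
def close_ftditm(line)
  line.sub(/(<begad:[^>]+>.{1,50}.*?\b)/, "\\1</ftditm>")
end

ad = "<ftditm><begad:11559303>Selah Country Home 1.5 acres. 3 bdrm, 2 bath, " \
     "irrigation, horse barn. $122,000. 509-697-6519<endad>"
puts close_ftditm(ad)

# Made-up ad with fewer than 50 chars of text, to show the edge case.
short = "<ftditm><begad:1>Free cat<endad>"
puts close_ftditm(short)
```

The first line comes out with </ftditm> directly after "irrigation", which is the result asked for. One thing that may be worth checking on real data: when the ad text is shorter than 50 chars, `.{1,50}` runs on into <endad> and the tag ends up inside it, so short ads might need to be handled separately.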
Hi --
On Tue, 23 Aug 2005, John Halderman wrote:
> Seems to me that you're trying to do too much with one regular expression. I
> would just grab the content between your tags and then trim that down to 50
> characters and reassemble it afterwards.
I'm not sure what you mean by "too much". I think the substitution I
suggested does what Ezra said he needed. Is there an error in it?
David
David-
Thanks, the regex you posted works great. I had considered just trimming the text inside the tags to 50 chars and then extending it to the end of the word, but I figured there would be a regex that would do it all at once.
Thanks Dave-
Ezra