Me again. I have chosen one of the crappiest websites on God's
green Earth to scrape here....
For no good reason I have buttloads of '^M's all over the file
when I check it in vi. What would I feed gsub to strip them out?
Otherwise I'm looking at a very very long regex to yank out the
fields I need....
--
They're only trying to make me LOOK paranoid!
Rasputin :: Jack of All Trades - Master of Nuns
"Dick Davies" <rasputnik@hellooperator.net> wrote in message
news:20040609094700.GA21180@lb.tenfour...
Me again. I have chosen one of the crappiest websites on Gods
green Earth to scrape here....
For no good reason I have buttloads of '^M's all over the file
when I check it in vi. What would I feed gsub to strip them out?
Otherwise I'm looking at a very very long regex to yank out the
fields I need....
while ( line = gets )
  line.chomp!
  p line # no more \r\n at end
end
Me again. I have chosen one of the crappiest websites on Gods
green Earth to scrape here....
For no good reason I have buttloads of '^M's all over the file
when I check it in vi. What would I feed gsub to strip them out?
Otherwise I'm looking at a very very long regex to yank out the
fields I need....
I recommend the tool suite
dos2unix
unix2dos
mac2unix
unix2mac
on Unix systems. They are even included in the MSYS toolkit. The ^M's are
carriage returns ('\r'); DOS line endings are '\r\n', Unix uses '\n' alone.
Just convert your file to Unix line endings using
dos2unix
The code-page conversion tool recode also supports this, and so do most
editors. gvim and vim have the setting
set ff=[one of 'dos', 'unix' or 'mac'].
So typing ':set ff=unix' and then writing the file will do the conversion.
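If the text never touches disk, or the tools above aren't installed, roughly the same conversion can be sketched in Ruby. The helper name to_unix_line_endings is made up for illustration, and it assumes the whole text fits in memory as one string.

```ruby
# Hypothetical helper: turn DOS ("\r\n") and classic Mac ("\r") line
# endings into Unix ("\n") ones. Note that it treats every lone "\r"
# as a line ending, which may not be what you want for stray ones.
def to_unix_line_endings(text)
  text.gsub(/\r\n?/, "\n")
end

sample = "one\r\ntwo\rthree\n"
p to_unix_line_endings(sample)
```

The regex matches "\r\n" as one unit first, so a DOS line ending becomes a single "\n" rather than two.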
"Dick Davies" <rasputnik@hellooperator.net> wrote in message
news:20040609094700.GA21180@lb.tenfour...
>
> Me again. I have chosen one of the crappiest websites on Gods
> green Earth to scrape here....
>
> For no good reason I have buttloads of '^M's all over the file
> when I check it in vi. What would I feed gsub to strip them out?
> Otherwise I'm looking at a very very long regex to yank out the
> fields I need....
while ( line = gets )
  line.chomp!
  p line # no more \r\n at end
end
They're not at end of line though, they're just scattered through the
lines... and vi/dos2unix isn't an option, I'm doing this on a string
on its way from the webserver to REXML...
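To make the difference concrete, here is a small sketch (the sample string is invented) of what chomp does and doesn't touch when the carriage returns sit in the middle of a line:

```ruby
line = "name:\rDick\r\n"

# chomp strips only the trailing line terminator ("\r\n" here).
chomped = line.chomp
p chomped            # the "\r" in the middle is still there

# delete removes every carriage return, wherever it sits.
cleaned = line.delete("\r").chomp
p cleaned
```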
--
It is better never to have been born. But who among us has such luck?
One in a million, perhaps.
Rasputin :: Jack of All Trades - Master of Nuns
Fixed my own problem again. This is getting to be a habit.
shitty_html.gsub!(/\r/, '')
Thanks for suggestions!
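One thing worth knowing about the bang form, sketched here with invented sample strings: gsub! returns nil when nothing matched, so it's safer not to chain on its result.

```ruby
html = +"<td>Dick\r Davies</td>\r\n"   # unary + makes sure the string is mutable

result = html.gsub!(/\r/, '')          # modifies html in place
p html
p result.equal?(html)                  # gsub! returns the receiver when it changed something

already_clean = +"<td>ok</td>\n"
p already_clean.gsub!(/\r/, '')        # nil: nothing matched

# The non-bang forms always return a string.
p already_clean.gsub(/\r/, '')
p already_clean.delete("\r")
```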
* Dick Davies <rasputnik@hellooperator.net> [0658 11:58]:
* Robert Klemme <bob.news@gmx.net> [0624 11:24]:
>
> "Dick Davies" <rasputnik@hellooperator.net> wrote in message
> news:20040609094700.GA21180@lb.tenfour...
> > For no good reason I have buttloads of '^M's all over the file
> > when I check it in vi. What would I feed gsub to strip them out?
> while ( line = gets )
>   line.chomp!
>   p line # no more \r\n at end
> end
--
Ideas don't stay in some minds very long because they don't like
solitary confinement.
Rasputin :: Jack of All Trades - Master of Nuns
Dick Davies <rasputnik@hellooperator.net> wrote in message news:<20040609105715.GA24830@lb.tenfour>...
They're not at end of line though, they're just scattered through the
lines... and vi/dos2unix isn't an option, I'm doing this on a string
on its way from the webserver to REXML...
Use tr.
I generally strip DOS idiocy on the command line with:
tr -d '\r' < in > out
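You can do the same thing in Ruby (note the double quotes: a single-quoted
'\r' is a backslash plus the letter r, not a carriage return):
string.tr!( "\r", '' )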
Or, you can do it in vim:
:%s/^M//g
To get the ^M down there, use visual mode to select one of the ^Ms and
yank it; then do :%s/ and type ^R" to paste it. Then finish the
replace with //g, and voila. It might even work if you do:
:%s/^V^M//g
(^V in vim lets you insert control characters). Sorry if I'm telling
you stuff about vim that you already know.
If you're slurping the files with Ruby, the easiest thing is to use
String#tr. If you're slurping the files with wget or curl, 'tr' is
probably the easiest.
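A quick sketch of the String#tr approach, with an invented sample string. The quoting matters: only a double-quoted "\r" is an actual carriage return in Ruby.

```ruby
s = "car\rriage re\rturn\r\n"

p '\r'.length        # 2: a backslash and the letter r
p "\r".length        # 1: an actual carriage return

p s.tr("\r", '')     # tr with an empty replacement deletes the characters
p s.delete("\r")     # same result, and says what it means
```

String#delete is arguably the clearer spelling when nothing is being translated, but both do the job.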
My point being that a custom hacked tool for a momentary need is seldom
as clever as the tools that are specially designed for your problem.
Often the need for a small patch to bring things together is a sign of
a lack of design on a larger scale. Just gsub-ing the problem away is the
surest way of having to gsub again tomorrow.
Now of course we all like to write code; it's just that more code is not
always the answer. Madness is a harsh word for that on a short time
scale, but in the long run, I think it actually applies.
I generally strip DOS idiocy on the command line with:
tr -d '\r' < in > out
Why would a thing grown from history be called idiocy in any context?
It just is that way. If we change that, we would have to agree on one
line ending for everyone. Doesn't Mac have the same issues but the other
way round?
> I generally strip DOS idiocy on the command line with:
>
> tr -d '\r' < in > out
Why would a thing grown from history be called idiocy in any context?
It just is that way. If we change that, we would have to agree on one
line ending for everyone. Doesn't Mac have the same issues but the other
way round?
As far as I'm concerned you can have any line terminator you like in
the privacy of your own filesystem, but when you transfer a file to
me, I'd like it in a useful format.
The server in question is the one that returns no output without a
valid user-agent, remember, and these ^Ms are in the middle of lines.
We're not talking DOS end-of-lines here.
--
If you are a fatalist, what can you do about it?
-- Ann Edwards-Duff
Rasputin :: Jack of All Trades - Master of Nuns
> I generally strip DOS idiocy on the command line with:
>
> tr -d '\r' < in > out
Why would a thing grown from history be called idiocy in any context?
It just is that way. If we change that, we would have to agree on one
line ending for everyone. Doesn't Mac have the same issues but the other
way round?
In the days of the "classic" Mac OS (9.x and earlier) that was true; \r was the default line ending. But that changed during the migration to OS X, so (with the possible exception of old, badly ported Carbon apps) it's not an issue anymore.
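For what it's worth, telling the three conventions apart in a string is easy enough. This is only a sketch: the helper name and the symbols it returns are invented, and a file with mixed endings is simply reported by whichever rule matches first.

```ruby
# Hypothetical helper: guess which line-ending convention a text uses.
def line_ending_style(text)
  if text.include?("\r\n")
    :dos
  elsif text.include?("\r")
    :classic_mac
  elsif text.include?("\n")
    :unix
  else
    :none
  end
end

p line_ending_style("a\r\nb\r\n")
p line_ending_style("a\rb\r")
p line_ending_style("a\nb\n")
```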
> I generally strip DOS idiocy on the command line with:
>
> tr -d '\r' < in > out
Why would a thing grown from history be called idiocy in any context ?
It just is that way.
If MS-DOS had come out in the 1950s it might be forgivable.
In 1981 elegant solutions to DOS's stupidities had existed
for at least a decade... And yet DOS debuts with:
- lack of shell globbing (STILL doesn't)
- shell can't escape its own metacharacters (STILL can't)
- ^Z character terminating text files even though this
  was a historical artifact from an OS that needed them
  because its filesize was measured in blocks... POINTLESS
  when your OS knows the exact file size in bytes, which MS-DOS
  always has
- ignorance of "a file is a bag of bytes, and everything is
  a file" metaphor
- TWO end of line characters for text files when one would do
  (in 1981 we had CRT screens not teletypes) <grin>
- special illegal filenames "nul" "com1" "lpt1" etc.
- no piping of program output/input streams (they finally
  grafted this on haphazardly)
- can't unlink an open file and continue to read from it
  (this is why ruby -i doesn't work for in-place edits in DOS)
It's "idiocy" because better solutions already existed for
a decade... Yes I've been using MS-DOS since 1981...
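On the ruby -i point, one way around it is to read the whole file, close it, and only then write it back. A rough sketch follows; the helper name strip_cr_in_place and the file name are invented, and it assumes the file fits in memory. Binary mode is used so that Ruby does no newline translation of its own.

```ruby
require 'tmpdir'

# Hypothetical helper: rewrite a file without its carriage returns.
def strip_cr_in_place(path)
  data = File.binread(path)              # file is closed again once this returns
  File.binwrite(path, data.delete("\r"))
end

# Try it on a throwaway file.
Dir.mktmpdir do |dir|
  path = File.join(dir, 'page.html')
  File.binwrite(path, "<p>one\r</p>\r\n<p>two</p>\r\n")
  strip_cr_in_place(path)
  p File.binread(path)
end
```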
The history of MS-DOS is the very embodiment and manifestation
of the cute turn of phrase,
Those who do not understand Unix are condemned to reinvent it, poorly.
-- Henry Spencer
..Sadness.
If we change that, we would have to agree on one
line ending for everyone. Doesn't Mac have the same issues but the other
way round?
At least the old macs just used a single character for
newline, even if it was CR instead of LF...
In the days of the "classic" Mac OS (9.x and earlier) that was true; \r
was the default line ending. But that changed during the migration to OS
X, so (with the possible exception of old, badly ported Carbon apps)
it's not an issue anymore.
//samuel
So that would mean Dick Davies's 'wrong' input HTML came from a pre-OS X
editor? See what I meant by 'historical context'?