Adventures in html decoding

From the "If you want it done right, do it yourself... maybe"
department.

Today I was looking at a webpage that used html encoding
(i.e., "&#97;" in place of "a") to obfuscate much of its contents.
This displeased me for several reasons. (Not the least of
which was a standing order from General Principles.)
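
For illustration, here is a minimal Ruby sketch of that kind of obfuscation and of undoing it. The helper names obfuscate and deobfuscate are made up for this example and do not come from the page in question; only decimal character references are handled.

```ruby
# Replace every character with its decimal character reference,
# e.g. "a" becomes "&#97;". (Hypothetical helper, for illustration.)
def obfuscate(text)
  text.each_char.map { |c| format('&#%d;', c.ord) }.join
end

# Turn decimal character references back into characters.
def deobfuscate(text)
  text.gsub(/&#(\d+);/) { $1.to_i.chr(Encoding::UTF_8) }
end

puts obfuscate('a')                   # prints &#97;
puts deobfuscate(obfuscate('Ruby'))   # prints Ruby
```

A browser renders the obfuscated form exactly like the plain text, which is why the encoding only gets in the way of people reading the source.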

So I looked around online for a web-based tool that I could
paste the text into and get back a more useful form. But
everything I found either didn't work, or just didn't convert
ordinary letters.

So I said to heck with it, I can write something to do this
myself.

I didn't use CGI for two reasons. 1) I remember the last time I
tried experimenting with CGI, and had to severely hack the
library to get it to let me use html generation methods in a
non-server environment. 2) The description of unescapeHTML
sounded as though it would only unescape the special characters
that have to be escaped.

So, I ended up with this:

===
outfile = File.new(ARGV[1], File::CREAT|File::WRONLY|File::EXCL)
IO.readlines(ARGV[0]).each{ |line|
  begin
    outfile.puts line.gsub(/&#(\d+);/) { |x|
      if $1.to_i < 256
        $1.to_i.chr
      else
        x
      end
    }
  rescue
    outfile.puts line
    puts line
  end
}
outfile.close

And it worked.

Then I thought of looking at the source of unescapeHTML, and
found that the description or my interpretation of it was wrong.
Not only would it handle all the escaped ascii characters, it was
a class method, so I didn't need to deal with the environment
issues.
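
A quick way to check this, assuming only the cgi library that ships with Ruby: CGI.unescapeHTML can be called directly on the class, with no CGI object and no server environment. The sample string is invented for this sketch.

```ruby
require 'cgi'

# Numeric references for ordinary letters plus the usual named entities.
encoded = '&#82;&#117;&#98;&#121; &lt;b&gt;rocks&lt;/b&gt; &amp; rolls'

# Called on the class itself; no CGI.new and no web server involved.
puts CGI.unescapeHTML(encoded)   # prints Ruby <b>rocks</b> & rolls
```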

Which led to...

===
require 'cgi'
outfile = File.new(ARGV[1], File::CREAT|File::WRONLY|File::EXCL)
IO.readlines(ARGV[0]).each{ |line|
  outfile.puts CGI::unescapeHTML(line)
}
outfile.close

Which is much simpler; just some file handling stuff around
the unescapeHTML function. Maybe later I'll try something with
rubywebdialogs that'll let me paste into a web browser window
and get back results the way I'd like to be able to do...

The moral of this story is, html obfuscation sucks.

(What? That's *not* the moral? Oh well...)

-Morgan

--
No virus found in this outgoing message.
Checked by AVG Anti-Virus.
Version: 7.0.344 / Virus Database: 267.10.21/96 - Release Date: 09/10/2005

This may be off topic, but I always wonder why all the flags to File.
Could what you are doing be written as:

  File.open(ARGV[1], "w") { |outfile|
    File.foreach(ARGV[0]) { |line|
      outfile.puts CGI::unescapeHTML(line)
    }
  }

or am I missing something big here?

--
Jim Freeze

the moral is, there is always a simpler way :)

require 'cgi'
open(ARGV[1], 'w') do |f|
   f.write(CGI::unescapeHTML(IO.read(ARGV[0])))
end

cheers

Simon

Jim Freeze wrote:

> This may be off topic, but I always wonder why all the flags to File.
> Could what you are doing be written as:
>
>   File.open(ARGV[1], "w") { |outfile|
>     File.foreach(ARGV[0]) { |line|
>       outfile.puts CGI::unescapeHTML(line)
>     }
>   }
>
> or am I missing something big here?

Well, in this case, I don't believe it's possible to
get the effect of File::EXCL (which basically amounts to
"don't overwrite an existing file") with a string as the open
mode. There are some other combinations of parameters
that are also difficult (impossible) to achieve that way.
(I don't remember exactly what it was, but I think it had to
do with a file that was being opened for reading and writing.
All the strings I tried either wouldn't let me access parts of
an existing file, or otherwise failed to perform as I required.)

-Morgan
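
To make the effect concrete, here is a small sketch of what the flag combination from the original script does when the target already exists. write_new_file is a hypothetical helper name; the File:: flags and Errno::EEXIST are standard Ruby.

```ruby
require 'tmpdir'

# Open +path+ for writing only if it does not exist yet.
# Returns true when the file was created, false when it was already there.
def write_new_file(path, text)
  File.open(path, File::CREAT | File::WRONLY | File::EXCL) do |f|
    f.write(text)
  end
  true
rescue Errno::EEXIST
  false
end

Dir.mktmpdir do |dir|
  target = File.join(dir, 'out.txt')
  p write_new_file(target, "first\n")    # the file did not exist yet
  p write_new_file(target, "second\n")   # refused, existing file untouched
  p File.read(target)
end
```

With a plain "w" mode string the second call would silently truncate and overwrite the file instead.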


Simon Kröger wrote:

the moral is, there is always a simpler way :)

require 'cgi'
open(ARGV[1], 'w') do |f|
   f.write(CGI::unescapeHTML(IO.read(ARGV[0])))
end

cheers

Simon

Simpler still:

require 'cgi'
open(ARGV.pop, 'w') { |f|
  f.write(CGI.unescapeHTML(ARGF.read))
}

O_EXCL is broken on nfs:

   O_EXCL When used with O_CREAT, if the file already exists it is an error
   and the open will fail. In this context, a symbolic link exists, regardless of
   where it points to. O_EXCL is broken on NFS file systems, programs which
   rely on it for performing locking tasks will contain a race condition. The
   solution for performing atomic file locking using a lockfile is to
   create a unique file on the same fs (e.g., incorporating hostname and pid),
   use link(2) to make a link to the lockfile. If link() returns 0, the lock
   is successful. Otherwise, use stat(2) on the unique file to check if its
   link count has increased to 2, in which case the lock is also successful.

fyi.

-a
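
The recipe in that man page excerpt can be sketched in Ruby roughly as follows. This is an illustration only, not something from the thread: acquire_lock and its tag parameter are hypothetical, it assumes a POSIX filesystem with hard links, and it has not been exercised on NFS.

```ruby
require 'socket'
require 'tmpdir'

# Try to take +lockfile+ using the link(2) recipe quoted above.
# +tag+ is a hypothetical extra name component so that one process can
# create several distinct "unique" files in this demo.
def acquire_lock(lockfile, tag = '')
  unique = "#{lockfile}.#{Socket.gethostname}.#{Process.pid}#{tag}"
  File.write(unique, '')               # unique file on the same filesystem
  begin
    File.link(unique, lockfile)        # link succeeded: we hold the lock
    true
  rescue SystemCallError
    File.stat(unique).nlink == 2       # a link count of 2 also means success
  ensure
    File.unlink(unique)                # the lockfile keeps its own link
  end
end

Dir.mktmpdir do |dir|
  lock = File.join(dir, 'demo.lock')
  p acquire_lock(lock, '.a')   # first taker
  p acquire_lock(lock, '.b')   # lockfile already exists
  File.unlink(lock)            # release the lock
end
```

Releasing the lock is just removing the lockfile, as the last line of the demo does.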


--

email :: ara [dot] t [dot] howard [at] noaa [dot] gov
phone :: 303.497.6469
Your life dwells among the causes of death
Like a lamp standing in a strong breeze. --Nagarjuna

===============================================================================

"William James" wrote:

> Simpler still:
>
> require 'cgi'
> open(ARGV.pop, 'w') { |f|
>   f.write(CGI.unescapeHTML(ARGF.read))
> }

I think you might have reached the point where
simpler is more complex... I'm not sure I'd know
what that code was supposed to do if it wasn't
something I wrote being reduced.

*never even -seen- ARGF before*

-Morgan


"Ara.T.Howard" wrote:

> O_EXCL is broken on nfs:
>
>   O_EXCL When used with O_CREAT, if the file already exists it is an error
>   and the open will fail. In this context, a symbolic link exists, regardless of
>   where it points to. O_EXCL is broken on NFS file systems, programs which
>   rely on it for performing locking tasks will contain a race condition. The
>   solution for performing atomic file locking using a lockfile is to
>   create a unique file on the same fs (e.g., incorporating hostname and pid),
>   use link(2) to make a link to the lockfile. If link() returns 0, the lock
>   is successful. Otherwise, use stat(2) on the unique file to check if its
>   link count has increased to 2, in which case the lock is also successful.

... And I barely understood a word of that. `.`

Does that mean it won't properly perform the "don't clobber an existing file"
purpose I'm using it for?

-Morgan


ARGF is a reference to $stdin.

···

On 9/12/05, Morgan <taria@the-arc.net> wrote:

*never even -seen- ARGF before*

--
Jim Freeze

it means that O_EXCL fails silently on some kinds of filesystems, including
nfs. this is not likely to affect you and is beyond the control of ruby (it's
the c library/fs fault) but, if it does affect you, it means that two
instances of the code, when run at the same time, would __both__ be writing to
the file at the same time - neither would have an exclusive lock on the file
as it would not be created atomically. basically you can ignore this if you
are working on local disk - but if you are on some sort of shared setup like nfs
or a windows equivalent, be wary.

cheers.

-a



===============================================================================

Jim Freeze wrote:

> > *never even -seen- ARGF before*
>
> ARGF is a reference to $stdin.
>
> --
> Jim Freeze

An object providing access to virtual concatenation of files
passed as command-line arguments or standard input if there
are no command-line arguments. -- Ruby in a Nutshell

ARGF is no more esoteric than ARGV, and it's quite handy.
Let's say you want to process every line of every file
on the command-line:

ruby -e 'ARGF.each_line{|x| p x}' file1 file2 file3
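
The "virtual concatenation" can also be tried from inside a script. The sketch below is an illustration, not code from the thread: concatenated_lines is a hypothetical helper, and it builds a private stream with ARGF.class.new so that the real ARGV and ARGF of the running process are left alone.

```ruby
require 'tmpdir'

# Read several files as one continuous stream of lines, the way ARGF
# does for the files named on the command line.
def concatenated_lines(*paths)
  ARGF.class.new(*paths).readlines
end

Dir.mktmpdir do |dir|
  first  = File.join(dir, 'file1')
  second = File.join(dir, 'file2')
  File.write(first,  "one\ntwo\n")
  File.write(second, "three\n")

  # Lines from file1 come first, then those from file2.
  concatenated_lines(first, second).each { |x| p x }
end
```

In a real command-line script the global ARGF does the same thing with whatever file names are left in ARGV, which is why removing the output file name with ARGV.pop before calling ARGF.read works.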

Damn, that *is* handy! I love this list.

···

On Sep 13, 2005, at 1:16 AM, William James wrote:

ARGF is no more esoteric than ARGV, and it's quite handy.
Let's say you want to process every line of every file
on the command-line:

ruby -e 'ARGF.each_line{|x| p x}' file1 file2 file3

William James wrote:

ARGF is no more esoteric than ARGV, [...]

I disagree. ARGV is familiar to anyone who's ever written C, C++, Objective-C, Java, Perl, AWK, Python, Scheme, ...

ARGF is not. I'd never heard of it until this thread.

Compare the number of references to ARGV and ARGF in the pickaxe book too: ARGF is only mentioned three times in the entire book according to the index. One of those is in a grey "you can skip this" section talking about Perlisms, the second is under a big "ARGC" heading where it's mentioned in passing, and the real discussion isn't until page 336.

mathew

···

--
<URL:http://www.pobox.com/~meta/>
          WE HAVE TACOS

mathew wrote:

> William James wrote:
> > ARGF is no more esoteric than ARGV, [...]
>
> I disagree. ARGV is familiar to anyone who's ever written C, C++,
> Objective-C, Java, Perl, AWK, Python, Scheme, ...
> ARGF is not.

These are not familiar to everyone who's ever written in C or Awk:

  class, map, join, __END__, DATA, <<HERE, grep, flatten

But that doesn't prove they are esoteric to those who use Ruby.

> I'd never heard of it until this thread.

Major premise:
I know everything about Ruby except that which is esoteric.

Minor premise:
I don't know about ARGF.

Conclusion:
ARGF is esoteric.

> Compare the number of references to ARGV and ARGF in the pickaxe book
> too: ARGF is only mentioned three times in the entire book according to
> the index.

Pickaxe (1st edition), page 16:

  The "Ruby way" to write this would be to use an iterator:

    ARGF.each { |line| print line if line =~ /Ruby/ }

on page 219 under the heading "Standard Objects" these are listed:
  ARGF, ARGV, ENV, false, nil, self, true

page 217 explains ARGF's synonym, $<.

"Teach Yourself Ruby in 21 Days" explains ARGF in Day 8 on
page 173 and uses it in the final two solutions to a problem.
The penultimate one is

  has_a_long_word = /\w{5,}/
  ARGF.each{|line| print line unless has_a_long_word =~ line}

Matz himself in "Ruby in a Nutshell" explains it on page 38
and lists it as one of 14 predefined global constants.

> One of those is in a grey "you can skip this" section talking
> about Perlisms,

For that the authors should be afflicted with the Spell of
Forlorn Encystment.

···

------------
------------

Usage tip: the name of the file currently being read is available as
$FILENAME or as shown in this example:

ruby -e 'ARGF.each{|x| print ARGF.filename + ", " + x }' file1 file2

Hi --

> William James wrote:
>
> > ARGF is no more esoteric than ARGV, [...]
>
> I disagree. ARGV is familiar to anyone who's ever written C, C++, Objective-C, Java, Perl, AWK, Python, Scheme, ...
>
> ARGF is not. I'd never heard of it until this thread.

You make it sound like learning something from a ruby-talk thread is
bad :)

> Compare the number of references to ARGV and ARGF in the pickaxe book too: ARGF is only mentioned three times in the entire book according to the index. One of those is in a grey "you can skip this" section talking about Perlisms, the second is under a big "ARGC" heading where it's mentioned in passing, and the real discussion isn't until page 336.

That doesn't mean it's esoteric. It just means it's discussed on page
336. Something has to be :)

David

···

On Wed, 14 Sep 2005, mathew wrote:

--
David A. Black
dblack@wobblini.net

Nicely put :)

martin

···

David A. Black <dblack@wobblini.net> wrote:

That doesn't mean it's esoteric. It just means it's discussed on page
336. Something has to be :)