Suggestion: A [QUIZ] in the subject of emails about the problem helps everyone
on Ruby Talk follow the discussion. Please reply to the original quiz message,
if you can.
sometimes i type in all or mostly lowercase. a friend of mine says it's hard to
read essays with no capital letters. so the problem is to write a method which
takes a string (which could include many paragraphs), and capitalizes words that
should be capitalized. at minimum it should do the starts of sentences.
solutions could range from simple (a few regexes) to complex (lots of special
cases are possible, like abbreviations that use a period). an addition would be
using a dictionary to find proper nouns and capitalize those. it could also ask
the user about cases the program can't figure out. or log them.
i can provide an example solution (regex based) and a list of reasons it doesn't
work very well, if you want.
sample input:
- this email itself works nicely
- this one is hard. sometimes i might want to write about gsub vs. gsub! without
the "." or "!" causing any capitalization (or the punctuation in quotes).
one problem is maybe dealing with sentences that contain periods is too hard. i
don't know.
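For reference, the "few regexes" end of that range might look like the sketch below. It is only an illustration, not the example solution offered above, and the method name `capitalize_sentences` is made up here. It capitalizes the first letter of the text and any lowercase letter that follows ".", "!" or "?" plus whitespace, so it gets the "gsub vs. gsub!" sample wrong in exactly the way the quiz predicts.

```ruby
# Hypothetical helper for illustration; not taken from any submitted solution.
def capitalize_sentences(text)
  # Group 1: start of the text (plus leading whitespace), or sentence-ending
  # punctuation followed by whitespace.  Group 2: the letter to capitalize.
  text.gsub(/(\A\s*|[.!?]\s+)([a-z])/) { "#{$1}#{$2.upcase}" }
end

puts capitalize_sentences("this one is hard. i don't know.")
# prints "This one is hard. I don't know."

puts capitalize_sentences("write about gsub vs. gsub! without the dot")
# prints "Write about gsub vs. Gsub! Without the dot"
```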
It would be nice if you could assume two spaces after an end of sentence with punctuation. Generally I think that's correct grammar, although my grammar stinks so I could easily be wrong. If you have to get into parsing incorrect grammar it becomes much more difficult.
perhaps u could also correct rly annoying abbreviations used by ppl
for whom typing a few extra letters is 2 hard! thx!111
(Ugh - did I just type that?!)
Paul.
···
On 04/08/06, Ruby Quiz <james@grayproductions.net> wrote:
sometimes i type in all or mostly lowercase. a friend of mine says it's hard to
read essays with no capital letters. so the problem is to write a method which
takes a string (which could include many paragraphs), and capitalizes words that
should be capitalized. at minimum it should do the starts of sentences.
My solution will at the minimum capitalize the starts of sentences
just judging by periods.
If supplied with an example source using the -s option, it will try to
find words that should always be capitalized (I, Ruby, proper nouns in
general), words that imply that the next word should be capitalized
(Lake, General) and words in which punctuation does not imply an end
of a sentence (abbreviations) although this is only helpful if there
is some capitalization in the text.
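The code itself is not reproduced in this thread, so the following is only a guess at how the "words that should always be capitalized" step could work. The method name `learn_capitalized_words` and the heuristic are assumptions: a word is kept if the sample shows it capitalized in the middle of a sentence and never shows it in lowercase.

```ruby
# Hypothetical sketch; this is not the submitted solution.
def learn_capitalized_words(sample)
  seen_lower = {}   # words seen entirely in lowercase
  seen_upper = {}   # lowercase form => capitalized form seen mid-sentence
  prev = nil
  sample.scan(/\S+/) do |token|
    # Strip leading and trailing punctuation from the token.
    word = token.gsub(/\A[^A-Za-z]+|[^A-Za-z]+\z/, "")
    unless word.empty?
      if word == word.downcase
        seen_lower[word] = true
      elsif word =~ /\A[A-Z]/ && prev && prev !~ /[.!?]["')]*\z/
        # Capitalized, and the previous token did not end a sentence.
        seen_upper[word.downcase] = word
      end
    end
    prev = token
  end
  seen_upper.reject { |lower, _| seen_lower[lower] }
end

sample = "I like Ruby. We think Ruby is nice, and I agree. The lake is calm."
p learn_capitalized_words(sample)
```

As the description says, this only helps if the sample has some capitalization in it; an all-lowercase source teaches it nothing.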
The only reason that two spaces were used after a period during the
'typewriter' age was because original typewriters had monospaced
fonts -- the extra space was needed for the eye to pick up on the beginning
of a new sentence. That need is negated w/proportional space type, hence
[it is] the typographic standard.
James Edward Gray II
···
On Aug 4, 2006, at 8:20 AM, Mike Harris wrote:
It would be nice if you could assume two spaces after a end of sentence with puncuation. Generally I think that's correct grammar, although my grammar stinks so I could easily be wrong. If you have to get into parsing incorrect grammar it becomes much more difficult.
It's not correct grammar, just a typographical convention; one which is sort of semi-obsolete and regularly gives rise to great debate in typographical circles over its perceived rightness, wrongness, and pragmatic value.
That isn't to say you shouldn't use it, since it'll be very accurate in the general case, but redefining the problem to say "anything that doesn't use two spaces is wrong" is a bit of a dodge.
----
Matthew Smillie <M.B.Smillie@sms.ed.ac.uk>
Institute for Communicating and Collaborative Systems
University of Edinburgh
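To make the trade-off concrete, here is a rough sketch of the two-space assumption discussed above. The method name `capitalize_after_double_space` is invented for this example. A sentence start is taken to be the beginning of the text, or sentence punctuation followed by at least two spaces or a line break.

```ruby
# Hypothetical helper; assumes the author reliably types two spaces
# (or a newline) after every sentence.
def capitalize_after_double_space(text)
  text.gsub(/(\A|[.!?]["')]?(?: {2,}|\n+))([a-z])/) { "#{$1}#{$2.upcase}" }
end

puts capitalize_after_double_space("see gsub vs. gsub! here.  next sentence.")
# prints "See gsub vs. gsub! here.  Next sentence."
```

The "vs." and "gsub!" cases come out right because they are followed by a single space, but text typed with one space between sentences is left uncapitalized, which is the dodge mentioned above.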
My day job is developing natural language processing apps, and we've had to implement a similar case-correcting tool. What we found is that a simple regex-based approach is correct about 90% of the time. When we used machine learning to do the same thing, the results went up to about 95%. Compare this to human performance (i.e. have two or more people manually correct a text, then compare how often their corrections were in agreement), which was, IIRC, about 97%.
It would be nice if you could assume two spaces after a end of sentence with puncuation. Generally I think that's correct grammar, although my grammar stinks so I could easily be wrong. If you have to get into parsing incorrect grammar it becomes much more difficult.
The two-spaces-after-period rule is not a grammatical one; it's a typographic convention that grew out of typewriter (i.e. monospaced) fonts.
It would be nice if you could assume two spaces after a end of sentence
with puncuation. Generally I think that's correct grammar, although my
grammar stinks so I could easily be wrong. If you have to get into
parsing incorrect grammar it becomes much more difficult.
There's an old typewriter convention to use two spaces, but I'd be
surprised if you can find a single printed English book that uses two
spaces after a sentence.
On 04/08/06, Ruby Quiz <james@grayproductions.net> wrote:
sometimes i type in all or mostly lowercase. a friend of mine says it's hard to
read essays with no capital letters. so the problem is to write a method which
takes a string (which could include many paragraphs), and capitalizes words that
should be capitalized. at minimum it should do the starts of sentences.
perhaps u could also correct rly annoying abbreviations used by ppl
for whom typing a few extra letters is 2 hard! thx!111
Joking aside, this kind of tool would have been most welcome when I taught freshman-level programming a few years back. We're showing our age here.
I'm still reading through the code but just a minor tip:
lines like:
if EOSPunc.index word[-1].chr then true else false end
can be replaced with
EOSPunc.index word[-1].chr
the result is either an index (which counts as true) or nil (which counts as false), so the if statement is just giving back the same thing.
if you really want to have true or false (and not 3 or "hi" or nil, even though those will work fine if you treat the variable as a boolean) one way to do it is !!var. using not twice gets you true or false. there's probably something more readable though.
-- Elliot Temple
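A small sketch of the difference, for anyone following along. The contents of `EOSPunc` are assumed here (the thread only says it is an array of punctuation), and the two method names are made up. `Array#include?` is one candidate for the "something more readable".

```ruby
# Assumed contents; the real array is not shown in this thread.
EOSPunc = [".", "!", "?"]

# Returns the index of the word's last character in EOSPunc, or nil.
# Assumes word is a non-empty string.
def eos_index(word)
  EOSPunc.index(word[-1].chr)
end

# Returns a strict true or false.
def eos?(word)
  EOSPunc.include?(word[-1].chr)
end

p eos_index("done!")     # prints 1
p eos_index("done")      # prints nil
p !!eos_index("done.")   # prints true (index 0 is still truthy in Ruby)
p eos?("done")           # prints false
```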
···
On Aug 9, 2006, at 7:20 PM, Mitchell Koch wrote:
My solution will at the minimum capitalize the starts of sentences
just judging by periods.
If supplied with an example source using the -s option, it will try to
find words that should always be capitalized (I, Ruby, proper nouns in
general), words that imply that the next word should be capitalized
(Lake, General) and words in which punctuation does not imply an end
of a sentence (abbreviations) although this is only helpful if there
is some capitalization in the text.
This is quite clever Mitchell. Thanks so much for sharing it with us!
Sadly, I wrote the summary earlier today when I had a few free moments. Don't take it personally that it doesn't mention this code.
James Edward Gray II
···
On Aug 9, 2006, at 9:20 PM, Mitchell Koch wrote:
If supplied with an example source using the -s option, it will try to
find words that should always be capitalized (I, Ruby, proper nouns in
general), words that imply that the next word should be capitalized
(Lake, General) and words in which punctuation does not imply an end
of a sentence (abbreviations) although this is only helpful if there
is some capitalization in the text.
It would be nice if you could assume two spaces after a end of sentence with puncuation. Generally I think that's correct grammar, although my grammar stinks so I could easily be wrong. If you have to get into parsing incorrect grammar it becomes much more difficult.
Actually, that's an old typographical convention that we can't seem to shake. Here's a report that talks a little about the issue: http://webword.com/reports/period.html
The only reason that two spaces were used after a period during the
'typewriter' age was because original typewriters had monospaced
fonts -- the extra space was needed for the eye to pick up on the beginning
of a new sentence. That need is negated w/proportional space type, hence
[it is] the typographic standard.
Very interesting. It's also very interesting to me that I spend most of my time reading and writing in monospaced fonts and I think two spaces looks worse in monospace, so I only ever use one. When typing in proportional fonts I sometimes still do a double-space, but mostly I've given up caring what others think and just do what I want (one space), similar to the situation with punctuation inside or outside of quotation marks. I blame latex for my nonchalant attitude, however no matter how much I use latex I will never fall for the horrendously wrong `` '' convention.
> It would be nice if you could assume two spaces after a end of
> sentence with puncuation. Generally I think that's correct
> grammar, although my grammar stinks so I could easily be wrong. If
> you have to get into parsing incorrect grammar it becomes much more
> difficult.
Actually, that's an old typographical convention that we can't seem
to shake.
What sort of perversion would make anyone want to shake
an old convention that is useful?
Here's a report that talks a little about the issue.
The only reason that two spaces were used after a period during the
'typewriter' age was because original typewriters had monospaced
fonts -- the extra space was needed for the eye to pick up on the
beginning
of a new sentence. That need is negated w/proportional space type,
hence
[it is] the typographic standard.
Most people view the posts here in a monospaced font.
If they didn't, source code would look too chaotic.
TeX and LaTeX, for example, quite properly put extra space
after the end of a sentence. Since what we type here will
usually be displayed monospaced, a sensible person who is
trying to make his message as readable as possible will put
two spaces between sentences.
> It would be nice if you could assume two spaces after a end of
> sentence with puncuation. Generally I think that's correct
> grammar, although my grammar stinks so I could easily be wrong. If
> you have to get into parsing incorrect grammar it becomes much more
> difficult.
It's not correct grammar, just a typographical convention; one which
is sort of semi-obsolete and regularly gives rise to great debate in
typographical circles over its perceived rightness, wrongness, and
pragmatic value.
That isn't to say you shouldn't use it, since it'll be very accurate
in the general case, but redefining the problem to say "anything that
doesn't use two spaces is wrong" is a bit of a dodge.
I have to add that I never ever read anything about this kind of rule ! And I am
100% sure that this rule does not exist for french typography. I suspect that
every country will have different spacing schemes according to the punctuation,
and if you intend to correct english written by foreigner (and a lot of it is)
or, even better, if you want your program to work with any latin-written
language, you'd better not rely on anything like that ! (I know that I make
loads of english typography errors because I naturally follow the french
rules... unless I make special effort)
···
On Aug 4, 2006, at 14:20, Mike Harris wrote:
if EOSPunc.index word[-1].chr then true else false end
can be replaced with
EOSPunc.index word[-1].chr
Ah, yeah that's a shorter way to do it. For some reason I had it
stuck in my head that I just wanted to express truth value and tried
to avoid passing on extraneous information (like in this case the
index in the punctuation array of the entry with the punctuation
attached to the last word).
I didn't spend too much time refactoring; at first I was dreaming up
abstracting parts of both the source reading and the proper casing
into a token parser kind of thing, but then it was more like, okay it
works and it's the Wednesday before the quiz summary goes up, so let's
send it off.
* James Edward Gray II <james@grayproductions.net> [060809 22:02]:
This is quite clever Mitchell. Thanks so much for sharing it with us!
Sadly, I wrote the summary earlier today when I had a few free
moments. Don't take it personally that it doesn't mention this
code.
No worries. I shouldn't have put it off to the last minute anyway.
It's interesting to me that so few of us submitted code for this
quiz. It's a problem that has no clear solutions, partly because
capitalization isn't just about grammatical rules, but does
communicate unique things as hinted in Elliot's initial examples.
For example if the word "gray" appears in a lowercase message, it
could mean the color in which it should actually be lowercase, or it
could be a surname, in which it should be capitalized. A computer
reader has no way to know, a human reader should be able to tell, but
really only the original author knows for sure.
It's like image interpolation. If I start out with a small photo,
expand it, and try to infer the extra pixels, a good algorithm will
give you something that looks okay, but it will not be as good as if
you started out by taking it at the larger size in the first place.
Incidentally, that's a good reason to not type in lowercase.
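The "gray" case above is the kind of thing the quiz suggested logging rather than guessing. A minimal sketch of that idea follows. Both word lists and the method name `capitalize_known_words` are invented for illustration; a real dictionary would be far larger.

```ruby
# Hypothetical word lists, chosen only for this example.
PROPER_NOUNS = { "ruby" => "Ruby", "edinburgh" => "Edinburgh" }
AMBIGUOUS    = { "gray" => "Gray" }

# Capitalizes words known to be proper nouns and records ambiguous
# words in log instead of changing them.  Returns [new_text, log].
def capitalize_known_words(text, log = [])
  result = text.gsub(/\b[a-z]+\b/) do |word|
    if PROPER_NOUNS.key?(word)
      PROPER_NOUNS[word]
    else
      log << word if AMBIGUOUS.key?(word)
      word
    end
  end
  [result, log]
end

text, log = capitalize_known_words("james gray likes ruby and gray skies")
puts text      # prints "james gray likes Ruby and gray skies"
p log          # prints ["gray", "gray"]
```

The log is where a human, ideally the original author, would have to make the call.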
In short, the "rivers" of whitespace, caused by using two spaces,
invariably annoy graphic designers and typographers.
James Edward Gray II
···
On Aug 4, 2006, at 9:45 AM, Hans Fugal wrote:
James Edward Gray II wrote:
On Aug 4, 2006, at 8:20 AM, Mike Harris wrote:
It would be nice if you could assume two spaces after a end of sentence with puncuation. Generally I think that's correct grammar, although my grammar stinks so I could easily be wrong. If you have to get into parsing incorrect grammar it becomes much more difficult.
Actually, that's an old typographical convention that we can't seem to shake. Here's a report that talks a little about the issue: http://webword.com/reports/period.html
Here's an explanation from that report:
The only reason that two spaces were used after a period during the
'typewriter' age was because original typewriters had monospaced
fonts -- the extra space was needed for the eye to pick up on the beginning
of a new sentence. That need is negated w/proportional space type, hence
[it is] the typographic standard.
Very interesting. It's also very interesting to me that I spend most of my time reading and writing in monospaced fonts and I think two spaces looks worse in monospace, so I only ever use one.
TeX and LaTeX, for example, quite properly put extra space
after the end of a sentence. Since what we type here will
usually be displayed monospaced, a sensible person who is
trying to make his message as readable as possible will put
two spaces between sentences.
Two spaces are needed even when the posts are seen in
a proportional font; without them, there is no extra space
between sentences.