Splitting a text file into sentences

Looking for ideas on how to split a text file into sentences. I see the
problem of basing the split on [.!?] -- they're also used in ways other
than to end a sentence. If I have to do manual pre-processing of the
text file, what editing might I do? Has anyone had to deal with this
problem and how did you make life easier for you?
Thanks for the help.
basi

Doing really, really good sentence boundary detection is an ongoing problem in natural language processing. I'm not aware of any Ruby-based NLP packages, but if you want better accuracy than just using [.!?:] there are several free NLP packages around (NLTK in Python and Stanford's Java NLP package spring to mind) that might help you. Googling "sentence tokenization" may also yield some help.

If that sounds like overkill, then you can get accuracy "good enough for government work" by making a list of regular expressions to catch exceptions to the punctuation rule. These will necessarily vary a little depending on your source text, but typical examples are catching titles like "Mr.", "Mrs.", "Dr.", and all-caps abbreviations like "U.S.A." or "M.D." (something like this: /([A-Z]\.([A-Z]\.)+)/)

good luck,
matthew smillie.
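
A minimal Ruby sketch of the exception-list idea above. The constant names and the sample text are assumptions made for illustration, and the list of titles is only a starting point that would need extending for a real corpus.

```ruby
# Hypothetical exception patterns; extend them for your own source text.
TITLE     = /\b(?:Mr|Mrs|Dr)\./       # titles such as "Mr." or "Dr."
INITIALS  = /\b[A-Z]\.(?:[A-Z]\.)+/   # all-caps abbreviations such as "U.S.A."
EXCEPTION = Regexp.union(TITLE, INITIALS)

text  = "Dr. Jones moved to the U.S.A. in 1999. She has an M.D. from Duke."
found = text.scan(EXCEPTION)          # periods here are not sentence ends
# found => ["Dr.", "U.S.A.", "M.D."]
```

Anything the patterns match can then be skipped when looking for sentence-ending punctuation.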

···

On Nov 29, 2005, at 23:49, basi wrote:

Looking for ideas on how to split a text file into sentences. I see the
problem of basing the split on [.!?] -- they're also used in ways other
than to end a sentence. If I have to do manual pre-processing of the
text file, what editing might I do? Has anyone had to deal with this
problem and how did you make life easier for you?
Thanks for the help.
basi

----
Matthew Smillie <M.B.Smillie@sms.ed.ac.uk>
Institute for Communicating and Collaborative Systems
University of Edinburgh

I dimly recall something on this list about 9 months ago or so.

Nick

--
Nicholas Van Weerdenburg

basi wrote:

Looking for ideas on how to split a text file into sentences. I see the
problem of basing the split on [.!?] -- they're also used in ways other
than to end a sentence. If I have to do manual pre-processing of the
text file, what editing might I do? Has anyone had to deal with this
problem and how did you make life easier for you?

It's a common convention to separate sentences by double spaces. I started following this convention because Emacs expected it, and now I use it always.

basi_lio wrote:

Looking for ideas on how to split a text file into sentences. I see the
problem of basing the split on [.!?] -- they're also used in ways other
than to end a sentence. If I have to do manual pre-processing of the
text file, what editing might I do? Has anyone had to deal with this
problem and how did you make life easier for you?
Thanks for the help.
basi

If you make a regexp: [.!?]\s+[A-Z] you will already capture most sentence
boundaries. Most abbreviations normally aren't followed by a space and a
capital letter.

One exception to this rule that I can think of is Mr. Name, Mrs. Name. But
as you can see these have an <uppercase> followed by only one or two
lowercase letters. Most sentences would have at least five non-uppercase
letters in front of the <.>, so a short capitalized word before the period
is suspect ->
[A-Z]\w\w?\w?\w?\.
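
The first rule above can be sketched in Ruby as follows. The method name and sample text are assumptions for illustration; the lookbehind and lookahead are one way to keep the punctuation and the capital letter in the output rather than consuming them.

```ruby
# Split wherever sentence punctuation is followed by whitespace and a
# capital letter. Lookarounds keep both characters in the sentences.
BOUNDARY = /(?<=[.!?])\s+(?=[A-Z])/

def split_sentences(text)
  text.split(BOUNDARY)
end

sample = "It costs 3.50 today. Is that e.g. too much? Ask Mr. Smith."
split_sentences(sample)
# => ["It costs 3.50 today.", "Is that e.g. too much?", "Ask Mr.", "Smith."]
```

The decimal point and "e.g." are left alone, but "Mr. Smith" is still split, which is exactly the exception described above.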

···

--
Posted via http://www.ruby-forum.com/

Depending on the text you might be able to search for a period (or other
punctuation) followed by two spaces. It's not robust, but if you know that
convention will be followed by the authors, then it can work.

_Kevin
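
If the text is known to follow the two-space convention, the split could look roughly like this in Ruby. The sample text is made up, and the approach fails on any text that uses a single space between sentences.

```ruby
# Split only where sentence punctuation is followed by two or more spaces.
# Single spaces after "Mr." or "p.m." are therefore left alone.
text = "Mr. Smith arrived at 5 p.m. on Monday.  He left early!  Why?"
sentences = text.split(/(?<=[.!?]) {2,}/)
# sentences => ["Mr. Smith arrived at 5 p.m. on Monday.", "He left early!", "Why?"]
```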

http://www.pressure.to/ruby/ is the reference I found in an old email thread
on this list.

Nick

Hi,
I have looked at NLTK in Python (and had been hoping a Rubyist would
rewrite it in Ruby). I will go back to NLTK and see if it has a
sentence-splitting algorithm of some sort. And thanks for the tip on Stanford's
Java NLP package. Yes, those abbreviations are pesky, and I may have to
resort to an exceptions list containing the most common ones.
Thanks much,
basi

Hi,
I will google. Thanks!
basi

Yes, I learned this convention when I took a keyboarding (i.e., typing)
lesson in high school. Some time ago, a style manual for word processing
appeared, and one piece of its advice was to use only one space to separate
sentences. The reason given is that in a justified format, those two
spaces can become four spaces, or even more. Anyway, a lot of text now
has one or two spaces between sentences, so this wouldn't be a
reliable indicator of a sentence boundary.
Cheers!
basi

As I noted above, this is an improper convention outside of the
typewriter realm. If you are using anything other than a fixed-pitch
font for display or print, you should *never* use two spaces.

-austin

···

On 11/29/05, Jeffrey Schwab <jeff@schwabcenter.com> wrote:

basi wrote:
> Looking for ideas on how to split a text file into sentences. I see the
> problem of basing the split on [.!?] -- they're also used in ways other
> than to end a sentence. If I have to do manual pre-processing of the
> text file, what editing might I do? Has anyone had to deal with this
> problem and how did you make life easier for you?
It's a common convention to separate sentences by double spaces. I
started following this convention because Emacs expected it, and now I
use it always.

--
Austin Ziegler * halostatue@gmail.com
               * Alternate: austin@halostatue.ca

Hi,
This looks promising. I'm downloading as I write.
Thanks!
basi

I too learned the two-spaces-after-a-period convention years ago and
also recently learned that with modern fonts and word processors it is
not necessary. It was tricky to retrain myself, but I did, and have
been using just one space ever since.

So like you say, that isn't a reliable way to discern sentences.

I would recommend following the advice of first filtering out false
positives (possibly even replacing them with temporary markers, Mr.
becomes $MISTER$ or similar), then splitting on punctuation. If you
then test on various sample texts you should be able to find more
false positives that you might have missed.

Ryan
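
A rough Ruby sketch of the mask, split, and restore approach described above. The marker names other than $MISTER$, the abbreviation list, and the method name are all hypothetical, and the code assumes the markers never occur in the input text.

```ruby
# Hypothetical marker table; the marker names are illustrative only.
MARKERS = {
  "Mr."  => "$MISTER$",
  "Mrs." => "$MISSUS$",
  "Dr."  => "$DOCTOR$"
}.freeze

# Matches any listed abbreviation that starts at a word boundary.
ABBREVIATION = /\b(?:#{MARKERS.keys.map { |k| Regexp.escape(k) }.join("|")})/

def split_sentences(text)
  masked = text.gsub(ABBREVIATION) { |abbr| MARKERS[abbr] }  # hide false positives
  pieces = masked.split(/(?<=[.!?])\s+/)                     # split on punctuation
  # Put the original abbreviations back into each sentence.
  pieces.map { |s| s.gsub(/\$[A-Z]+\$/) { |m| MARKERS.key(m) || m } }
end

sample = "Mr. Smith met Dr. Jones. They talked! Did Mrs. Smith join?"
split_sentences(sample)
# => ["Mr. Smith met Dr. Jones.", "They talked!", "Did Mrs. Smith join?"]
```

Running this over varied sample texts and inspecting the short fragments is one way to find abbreviations that are missing from the table.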

That, in fact, is a very *bad* metric to follow, as the proper spacing
after sentence punctuation is a single space. The only reason that two
spaces were used in the past is that the space used between sentence endings
in typeset work is a little wider than that used between words (an
em-space vs. an en-space).

-austin

···

On 11/29/05, Kevin Olbrich <kevin.olbrich@duke.edu> wrote:

Depending on the text you might be able to search for a period (or other
punctuation) followed by two spaces. It's not robust, but if you know that
convention will be followed by the authors, then it can work.

Look at Text::Format for some indication on how abbreviations could be handled.

-austin

Austin Ziegler <halostatue@gmail.com> writes:

basi wrote:
> Looking for ideas on how to split a text file into sentences. I see the
> problem of basing the split on [.!?] -- they're also used in ways other
> than to end a sentence. If I have to do manual pre-processing of the
> text file, what editing might I do? Has anyone had to deal with this
> problem and how did you make life easier for you?
It's a common convention to separate sentences by double spaces. I
started following this convention because Emacs expected it, and now I
use it always.

As I noted above, this is an improper convention outside of the
typewriter realm. If you are using anything other than a fixed-pitch
font for display or print, you should *never* use two spaces.

Alternatively, use text processing systems that do the "right thing";
i.e. transform two spaces into one (e.g. TeX, HTML-based products).
There is no good reason a text processor should show two spaces after
each other in print.

--
Christian Neukirchen <chneukirchen@gmail.com> http://chneukirchen.org

Hi,
This just might be easier than what I have in mind. I will try this
first.
Thanks!
basi

Ryan Leavengood wrote:

Yes, I learned this convention when I took a keyboarding (i.e.,
typing) lesson in high school. Some time ago, a style manual for
word processing appeared, and one piece of its advice was to use only
one space to separate sentences. The reason given is that in a
justified format, those two spaces can become four spaces, or even
more. Anyway, a lot of text now has one or two spaces between
sentences, so this wouldn't be a reliable indicator of a sentence
boundary.

I too learned the two-spaces-after-a-period convention years ago and
also recently learned that with modern fonts and word processors it
is not necessary. It was tricky to retrain myself, but I did, and
have been using just one space ever since.

So like you say, that isn't a reliable way to discern sentences.

I would recommend following the advice of first filtering out false positives (possibly even replacing them with temporary markers, Mr. becomes $MISTER$ or similar), then splitting on punctuation. If you then test on various sample texts you should be able to find more false positives that you might have missed.

Which will not help you at all with foreign languages. And don't forget to put i.e., e.g. or etc. in the list.
This is an ongoing problem (think about the auto-correction 'feature' of capitalizing the first letter of every sentence in OpenOffice or Word - something I always turn off because it is so insistent when it's wrong).
Cheers,
V.-

--
http://www.braveworld.net/riva


Austin Ziegler wrote:

···

On 11/29/05, Kevin Olbrich <kevin.olbrich@duke.edu> wrote:

Depending on the text you might be able to search for a period (or other
punctuation) followed by two spaces. It's not robust, but if you know that
convention will be followed by the authors, then it can work.

That, in fact, is a very *bad* metric to follow, as the proper spacing
after sentence punctuation is a single space. The only reason that two
spaces were used in the past is that the space used between sentence endings
in typeset work is a little wider than that used between words (an
em-space vs. an en-space).

Not true at all. I was always taught to use double spaces after sentences in grade-school homework assignments done on plain word processors or typewriters.

All was well with this strategy, until I hit a sentence similar to:

The abbreviation for Mister is Mr.
The head office is in New York, N.Y.

In other words, abbreviations that end a sentence. These sentences
don't end with a double dot, so if we replace Mr. with $MISTER$, the
sentence has no end marker.

Hmmm.
basi
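
One partial workaround for the problem above, sketched in Ruby under stated assumptions: mask an abbreviation's periods only when more text follows it on the same line, so an abbreviation at the end of a line keeps its period as the sentence end. The marker character, abbreviation list, and sample text are hypothetical. An abbreviation that ends a sentence in the middle of a line is still masked, so this is a heuristic and not a complete fix.

```ruby
MARK = "\u0001" # placeholder assumed never to occur in the input

# Titles or dotted capitals ("N.Y.", "U.S.A."), but only when followed by
# spaces or tabs and then more text on the same line.
ABBREVIATION = /\b(?:Mr|Mrs|Dr|[A-Z](?:\.[A-Z])+)\.(?=[ \t]+\S)/

def split_sentences(text)
  masked = text.gsub(ABBREVIATION) { |abbr| abbr.tr(".", MARK) } # hide inner periods
  masked.split(/(?<=[.!?])\s+/).map { |s| s.tr(MARK, ".") }      # split, then restore
end

sample = "The abbreviation for Mister is Mr.\n" \
         "Ask Mr. Smith in New York, N.Y.\n" \
         "He moved from the U.S.A. to Canada."
split_sentences(sample)
# => ["The abbreviation for Mister is Mr.",
#     "Ask Mr. Smith in New York, N.Y.",
#     "He moved from the U.S.A. to Canada."]
```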