[Bug?] Stack overflow in regexp matcher


(David Heinemeier Hansson) #1

I’ve been using the wonderful new RedCloth release from why the lucky
stiff for a while, but feeding it large inputs causes a meltdown in
Ruby’s regexp’er. I’ve narrowed the problem down to:

irb(main):001:0> IO.readlines("large_paragraph.txt").join("\n") =~
/(.+)\n(?![#\s|])/
RegexpError: Stack overflow in regexp matcher: /(.+)\n(?![#
\s|])/
from (irb):1

large_paragraph.txt is a file consisting of a single line with 444 words
followed by a line break (3642 characters in total). Removing a single
character from this file makes the problem go away; adding it back
brings the error back. So that seems to be the tipping point.
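
A rough way to probe for the tipping point without a data file is sketched below. It is only a sketch: `probe` and `PATTERN` are names made up for this illustration, the input is a run of one repeated letter rather than the word list, and whether the error appears at all (and at what length) depends on the Ruby version and build. Rubies with a different regexp engine may never raise.

```ruby
# The pattern from the report, unchanged.
PATTERN = /(.+)\n(?![#\s|])/

# Hypothetical helper: build one line of `length` characters followed by a
# single line break, then try the match and report what happened.
def probe(length)
  text = ("a" * length) + "\n"
  PATTERN.match(text) ? :matched : :no_match
rescue RegexpError => e
  # On an affected build this is where "Stack overflow in regexp matcher"
  # would show up.
  e.message
end

# 3641 characters plus the line break gives the 3642 total from the report.
[3640, 3641, 3642].each do |n|
  puts "#{n} chars + line break: #{probe(n)}"
end
```

Growing or shrinking the lengths in the list is a quick way to see whether a given build has a threshold at all.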

I’ve verified this problem on Ruby 1.8.1/OS X 10.3 and Ruby
1.8.1/FreeBSD 4.8.

Does anyone have a clue what’s going on? [ruby-talk:6306] from November
2000 talks about the same problem with Ruby 1.6.1, but no resolution is
given.



David Heinemeier Hansson,
http://www.basecamphq.com/ – Web-based Project Management
http://www.loudthinking.com/ – Broadcasting Brain

large_paragraph.txt (be wary of the single line break at the end):
cambodia audiotape catbird board congenial bengali hawk tacoma suckling
illegal demountable simon constrictor adoption illusionary cecropia
equivocal synod datum adler iceberg britches ebb parabola pillage
earthy horology worm noun moll dietician moss hausdorff strontium
dogtooth english squawbush mormon crate box centrifuge chambers
sacrosanct bromide eucalyptus bellatrix diatomaceous sunk storekeep
christine revile jot brady vice alan pray airfare incalculable lineal
ejaculate bolivar tuskegee felonious nary chloride dominic cough butane
jug weekend sumter polymeric stuffy residential epstein currant scrape
chalkboard niobe inhabitation lesson rampage eardrum actinometer
ancestor cesare sum anamorphic defer dehumidify beirut bucharest
congestion advisory obstruent risk lofty malaise plat etruscan
electress desecrate chinaman hemispheric toffee electro donahue
accession artillery mclaughlin americanism fried clerk bryozoa crewcut
bilge nv banks invest perjury frozen peafowl cockpit xylophone demark
liquidate beresford monsieur rhine spacecraft palette epicure
benevolent sarasota bogging canvasback panama amide sidestep elastomer
cytosine stasis alveolus balance airtight betony crag musician march
scrabble commissary pierce collide salt horology qualm crowberry
rivulet antimony baldpate avow juncture gravy aztec downright parakeet
edna snafu afferent hubbell brunswick adipic scorpio prestidigitate
emile menlo splintery drip septa cloy tv scanty heady demijohn accent
hate ceres izvestia mathewson i’d haughty dodo whomever tuscan coverlet
fiske haulage peculiar codebreak athena block herpes screechy
cheesecake sulfurous clemson motet bujumbura amatory wound cozy tawny
casebook omelet plug elizabeth corrigenda obsidian vito inclusion
maxima dabble clumsy king complicate canst courtney scops finnegan
gneiss nubia skater waspish circumvent frigga snowball trigonometry
enigma flatulent eightfold airplane megalomaniac neil forsythe crackle
boris dread soldier cadet clinton brine shoofly codon esoteric pamphlet
attrition agatha bedim bodybuilder coronet formic buy frock western
buxton quartile compulsory donkey middlemen trimer arclength hondo
bombast buckskin nrc baby blue superior scrumptious poland fag cube
phenyl reilly cloakroom adopt antiperspirant excrete gelatin
infelicitous accelerate bole backlash bacteria tuple caliph shirt
walcott covariant burgher squadron nobelium prim acculturate activation
irrawaddy retardation wrongdo die skyhook circumferential dossier
rockaway dissension penchant structural pure repulsion dauphine
elucidate huddle gorky hexadecimal backscatter nitride fantastic
adolescent expurgate chancel demit clotheshorse thence jonathan gemma
missive volition shouldn’t marksman guggenheim walter schoolgirlish
segmentation ideology griffith friedman asylum boorish knowledge mare
parole lasso token erasable chalkboard orifice lent finn inveterate
crossword american scuttle disparage anyhow spanish brockle cancer
cranelike hippodrome indecent cinerama tycoon formate mcneil drafty
tappet greensboro lift celandine codify obsolete lockwood
parallelepiped cyanamid admissible elbow thuggee bandy assemblage
sylvia procure antelope hattiesburg confiscatory shibboleth ambition
craze impure limb diva custer tartary receptive jerome richmond bestir
flop ordinate variant phagocyte penthouse sylvia procure antelope
hattiesburg confiscatory shibboleth ambition craze impure limb diva
custer tartary receptive jerome richmond bestir flop ordinate variant
phagocyte penthouse destructor schizophrenia ephemerides csas steward
declination camille nebulous uranyl seashore sera poppy


(Simon Strandgaard) #2

I have never seen Ruby’s regexp engine output the regexp string.
However, yesterday I tried Oniguruma, and it does output the regexp string.

Have you compiled Oniguruma?


On Thu, 12 Feb 2004 22:16:55 +0900, David Heinemeier Hansson wrote:

> I’ve been using the wonderful new RedCloth release from why the lucky
> stiff for a while, but feeding it large inputs causes a meltdown in
> Ruby’s regexp’er. I’ve narrowed the problem down to:
>
> irb(main):001:0> IO.readlines("large_paragraph.txt").join("\n") =~
> /(.+)\n(?![#\s|])/
> RegexpError: Stack overflow in regexp matcher: /(.+)\n(?![#
> \s|])/
> from (irb):1


Simon Strandgaard


(David Heinemeier Hansson) #3

> I have never seen Ruby’s regexp engine output the regexp-string.
> However yesterday I tried Oniguruma, and it outputs the regexp-string.
>
> Have you compiled Oniguruma ?

No. This is a clean 1.8.1 install on both machines.



David Heinemeier Hansson,
http://www.basecamphq.com/ – Web-based Project Management
http://www.loudthinking.com/ – Broadcasting Brain


(Simon Strandgaard) #4

I wasn’t able to reproduce the problem.

Perhaps it’s a CPU architecture problem (endianness).
What CPU is in your FreeBSD box?

It could also be weird alignment of C structures.
What compiler do you use?


On Thu, 12 Feb 2004 23:11:13 +0900, David Heinemeier Hansson wrote:

>> I have never seen Ruby’s regexp engine output the regexp-string.
>> However yesterday I tried Oniguruma, and it outputs the regexp-string.
>>
>> Have you compiled Oniguruma ?
>
> No. This is a clean 1.8.1 install on both machines.


Simon Strandgaard

BTW: my setup is

uname -a
FreeBSD server.neoneye.home 5.1-RELEASE FreeBSD 5.1-RELEASE #0: Thu Jun 5 02:55:42 GMT 2003 root@wv1u.btc.adaptec.com:/usr/obj/usr/src/sys/GENERIC i386
gcc -v
Using built-in specs.
Configured with: FreeBSD/i386 system compiler
Thread model: posix
gcc version 3.2.2 [FreeBSD] 20030205 (release)


(ts) #5

> I wasn't able to reproduce the problem.

In the example given in [ruby-talk:92687], remove *all* newlines except the
final one (i.e. the file "large_paragraph.txt" must contain just *one* line).

Guy Decoux
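
To follow the advice above, the word list from the first post has to be collapsed back into a single line, since the mail wrapped it. A minimal sketch, assuming the list was copied with its line wraps intact; `single_line` is a hypothetical helper, and the short `wrapped` string merely stands in for the full list:

```ruby
# Hypothetical helper: collapse wrapped text into one line that ends in
# exactly one line break, which is the shape the test case needs.
def single_line(wrapped)
  wrapped.split.join(" ") + "\n"
end

# Stand-in for the wrapped word list pasted from the mail.
wrapped = "cambodia audiotape catbird\nboard congenial bengali\n"
line = single_line(wrapped)

# Only the trailing line break should remain.
puts "line breaks left: #{line.count("\n")}"
```

Writing the result out with `File.open("large_paragraph.txt", "w") { |f| f.write(line) }` then gives a file with exactly one line for the irb session in the first post.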