Text Parsing Help

Jester_Mania · 2 December 2010 17:27

Greetings,

I am new to Ruby and programming and am trying to parse a text file, but
have run into some difficulties.

Basically, the text file contains lines in the following format (where
\n is not really a newline but the text "\n"):

TextString [SYMBOL]\nDefinition

I need to replace the text \n with a tab, as I am attempting to separate
all the "tokens" by tabs. The issue here is that \n happens to be a
newline character. I tried searching the forums and tried the following
code:

lineItem = line.gsub("\\\\n", "\t")

but it doesn't seem to work and the \n is not being replaced. How can I
convert the \n text into a tab?

Any help is greatly appreciated!

--
Posted via http://www.ruby-forum.com/.

Jeremy_Bopp · 2 December 2010 17:34

In Ruby, the literal "\n" is a string consisting of only a newline
character. If you want the string to literally be backslash n (\n),
then you would use "\\n". The backslash is a special character within
string literals, so if you want it to appear literally in your string,
you have to escape it with another backslash. Your example in the gsub
call above is actually creating a search string of backslash backslash n
(\\n) because you have 4 backslashes preceding the n, but that text does
not appear in your input.
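
A small sketch of the difference that can be pasted into irb. The variable
names are only for illustration, and the sample line is modelled on the
format described in the question:

```ruby
# Each literal is annotated with the characters it actually contains.
newline       = "\n"      # 1 character: a real newline
backslash_n   = "\\n"     # 2 characters: backslash, n
two_slashes_n = "\\\\n"   # 3 characters: backslash, backslash, n

# Single quotes keep the backslash and the n as two separate characters,
# which is what a line read from the file would contain.
line = 'TextString [SYMBOL]\nDefinition'

unchanged = line.gsub("\\\\n", "\t")  # this pattern never occurs in the line
replaced  = line.gsub("\\n", "\t")    # this one matches backslash + n

p [newline.length, backslash_n.length, two_slashes_n.length]
p unchanged == line
p replaced
```

Printing the lengths is a quick way to check how many characters a given
literal really holds.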

-Jeremy

Peter_Vandenabeele1 · 2 December 2010 17:40

This may be useful:

$ irb

s = 'car\nplane\ntrain \n boat' # because of ' the \n is not interpreted as newline

=> "car\\nplane\\ntrain \\n boat"

s.gsub(/\\n/, "\t") # here the \\n is really '\n' , but "\t" is really <TAB>

=> "car\tplane\ttrain \t boat"

# the result has <TAB>s now

Peter

Jester_Mania · 2 December 2010 18:01

Thanks for the help! I have a question though regarding Peter's reply:

s = 'car\nplane\ntrain \n boat' # because of ' the \n is not interpreted as newline

Currently, my code is:

IO.readlines("input.txt").each do |line|
  lineItem = line.gsub(/\\n/, "\t")
end

How would I use the ' ' with the line variable?

Jester_Mania · 2 December 2010 19:10

Yes, but I tried the code and it is still not working. I used a puts
statement to output the results to see whether the "\n" text was truly
being replaced by a tab.

#!/usr/bin/ruby -w

IO.readlines("input.txt").each do |line|
  lineItem = line.gsub(/\\n/, "\t")
  puts lineItem.split("\t")
end

However, the output still contained the \n text.

Jester_Mania · 4 December 2010 02:29

Peter/Josh,

Thanks once again for the helpful posts. I am learning quite a bit
which is good. However, I just tried to replicate Peter's example and
when I attempted to use the .inspect method, the output was not what I
expected:

INPUT FILE <input.txt>

--------------------------
car\nplane\ntrain \n boat

second line, first token \n second token
--------------------------

OUTPUT <windows cmd console>
--------------------------
["\377\376c\000a\000r\000\\\000n\000p\000l\000a\000n\000e\000\\\000n\000t\000r\000a\000i\000n\000 \000\\\000n\000 \000b\000o\000a\000t\000\r\000\n"]
["\000\r\000\n"]
["\000s\000e\000c\000o\000n\000d\000 \000l\000i\000n\000e\000,\000 \000f\000i\000r\000s\000t\000 \000t\000o\000k\000e\000n\000 \000\\\000n\000 \000s\000e\000c\000o\000n\000d\000 \000t\000o\000k\000e\000n\000"]
--------------------------

Do you know why there are so many numbers, like \377 and \000?

Jester_Mania · 4 December 2010 04:28

Ah hah! I figured it out, the txt file had the wrong encoding. I
encoded it with UTF-8 in Notepad++ and everything works as expected. I
thank everyone for writing these meaningful replies.
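
For anyone who hits the same output: the \377\376 at the start of the dump
is the UTF-16 little-endian byte order mark, and the \000 bytes are the
second byte of each two-byte character. Re-saving the file as UTF-8 is one
fix. Another option, on Ruby 1.9 or later where strings carry an encoding,
is to convert while reading. This is only a sketch: the file name
utf16_sample.txt is hypothetical, and the sample file is generated by the
script itself so that it runs on its own.

```ruby
# Build a small UTF-16LE sample with a byte order mark and Windows line
# endings, standing in for the original input.txt.
path = "utf16_sample.txt"  # hypothetical file name
File.binwrite(path, "\uFEFFcar\\nplane\r\nsecond \\n line\r\n".encode("UTF-16LE"))

raw  = File.binread(path)                              # raw bytes, no conversion
text = raw.force_encoding("UTF-16LE").encode("UTF-8")  # label the bytes, then transcode
text = text.sub(/\A\uFEFF/, "")                        # drop the byte order mark

# chomp removes the trailing "\r\n"; gsub then swaps backslash + n for a tab.
rows = text.lines.map { |line| line.chomp.gsub("\\n", "\t") }
p rows

File.delete(path)  # clean up the sample file
```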

Jesus_Gabriel_y_Gala · 2 December 2010 18:03

You don't need the single quotes, because what you read from the file
is already the character '\' followed by the character 'n'. Peter needed
them because he was typing Ruby string literals.
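
A quick way to convince yourself is the sketch below. It uses a temporary
file as a stand-in for input.txt, so the file contents here are an
assumption and not your real data:

```ruby
require "tempfile"

# Write one line containing a literal backslash + n, ended by a real newline.
file = Tempfile.new("input")
file.write('car\nplane' + "\n")
file.close

# No quoting is needed for data read from a file: the string already
# holds the backslash and the n as two separate characters.
line = File.readlines(file.path).first
p line

converted = line.chomp.gsub(/\\n/, "\t")
p converted

file.unlink  # remove the temporary file
```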

Jesus.

Peter_Vandenabeele1 · 2 December 2010 22:37

I hope my example below can explain what happens

$ ruby -v
ruby 1.8.7 (2010-01-10 patchlevel 249) [i486-linux]

I used this input.txt file for testing

<start of file>
car\nplane\ntrain \n boat

second line, first token \n second token
<end of file>

irb(main):013:0> IO.readlines("input.txt").each do |line|
irb(main):014:1* lineItem = line.gsub(/\\n/, "\t")
irb(main):015:1> puts lineItem.split("\t").inspect
irb(main):016:1> end
["car", "plane", "train ", " boat\n"] # the first line is parsed and split correctly into this array
["\n"] # the second line only has a newline
["second line, first token ", " second token\n"] # correct too
=> ["car\\nplane\\ntrain \\n boat\n", "\n", "second line, first token \\n second token\n"]

# this last line is the result of IO.readlines("input.txt"), because the
# "each" method eventually returns self after having iterated over all elements

irb(main):017:0> IO.readlines("input.txt").each do |line|
irb(main):018:1* lineItem = line.gsub(/\\n/, "\t")
irb(main):019:1> puts lineItem.split("\t")
irb(main):020:1> end
car
plane
train
boat

second line, first token
second token
=> ["car\\nplane\\ntrain \\n boat\n", "\n", "second line, first token \\n second token\n"]

So, one trick is to use .inspect and .class in many cases to better
understand what kind of object you are looking at and what its content
really is.
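
A minimal sketch of that trick, using a made-up string rather than the
real file contents:

```ruby
s = "car\tplane\n"
tokens = s.chomp.split("\t")

puts s          # the tab and the newline are printed as whitespace
puts s.inspect  # shows the escapes instead: "car\tplane\n"
puts tokens     # puts prints each array element on its own line
p tokens        # p calls inspect, so the array structure stays visible
p s.class       # String
p tokens.class  # Array
```

Because puts prints every element of an array on a separate line, a
correctly split array can look as if it were still broken up by newlines.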

Also, you could use chomp to get rid of the newline at the end of the
last entry in your array of tokens.
So, a shorter piece of code that may be useful is:

irb(main):025:0> IO.readlines("input.txt").map do |line|
irb(main):026:1* line.chomp.gsub(/\\n/, "\t")
irb(main):027:1> end
=> ["car\tplane\ttrain \t boat", "", "second line, first token \t second token"]

Now there are the <TAB> delimiters that you wanted between the tokens
in the resulting output.

HTH,

Peter

Josh_Cheek · 2 December 2010 22:52

"\n" is a newline
"\\n" is a backslash, letter n
'\n' is the same as "\\n" but you can ignore that if it is confusing,
because it only counts when you enter it as a literal.

You say you want to see whether "\n" is being replaced by a tab, but you are
replacing /\\n/ (btw, you could use a string here). You say the output has
\n in the text. By that, I assume you mean it has a newline, but are
misinterpreting it as the "\\n" which you replaced. If this is accurate, you
should decide whether you wish to replace "\n" or "\\n". As Peter said,
using inspect (i.e. puts line.inspect) is a good way to see your String data.

Also, if you don't already have tabs that you also wish to split on, then
you don't need the gsub step, you can just split on the "\\n". Here are a
couple of examples to hopefully make it a little easier to see.
"a\nb\\nc".split("\\n") # => ["a\nb", "c"]
"a\nb\\nc".split("\n") # => ["a", "b\\nc"]
"a\nb\\nc\td".gsub("\\n","\t").split("\t") # => ["a\nb", "c", "d"]
"a\nb\\nc\td".gsub("\n","\t").split("\t") # => ["a", "b\\nc", "d"]
