Parse Word/HTML Docs for database inserts

Margaret_Smith · 15 July 2009 23:23

I am new to Ruby and have perused the forum, but I am asking this
question because I couldn't find an answer in other posts.

The documents have no structure except for a unique number that appears
first in the document. The rest of the data I am looking for is
preceded by keywords that can help me identify a country code, the
hour something was started or finished, and maybe a subject here and
there. The HTML docs are just snippets from news pages on the Internet,
pictures and all; from those I need the title and dates extracted.

I also need to extract the MIME type, file name and last_update_date
of each document. Can I do this with Ruby? I know Ruby has several gems
that can help, but which one would be best for something like this?

Most of the postings I have read deal with semi-structured data, such as
data that is preceded by a column name, but these files are
completely unstructured.

Also I don't want to be entering filenames one by one. I have about 6000
documents to parse. Is there a way to handle something like that with a
script?

Any direction would be greatly appreciated. I have never written Ruby
code, so I am looking for a good tutorial on parsing or an example app
that handles something like this.

--
Posted via http://www.ruby-forum.com/.

Dylan · 15 July 2009 23:35

I'm not able to help with the parsing, but if you want to check all
files in a folder you can use this:

$all_files = []
Dir.chdir dir do
  $all_files += Dir["*"]
end

where dir is the directory the files are in. That will get you an
array with all the filenames. Then you can just iterate through them:
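
The iteration step might look like the sketch below. It is only an illustration: document_paths is a hypothetical helper name, and it skips subdirectories so that only regular files are returned.

```ruby
# Hypothetical helper (not from the original post): returns the full path of
# every regular file directly inside +dir+, sorted by name.
def document_paths(dir)
  Dir.glob(File.join(dir, "*")).select { |path| File.file?(path) }.sort
end

# Example use: replace the puts with whatever parsing each document needs.
#
#   document_paths("docs").each do |path|
#     puts "would parse #{path}"
#   end
```

Returning full paths rather than bare names means the script does not have to chdir into the directory before opening each file.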

Raveendran_P · 16 July 2009 06:26

Margaret Smith wrote:

I am new to Ruby and have perused the forum, but I am asking this
question because I couldn't find an answer in other posts.

Hi Smith,

It's very tough to answer your question. I like the Hpricot gem very
much, but I am not saying it is the best; it depends on what suits you.
Please also try:

http://rfeedparser.rubyforge.org/

Thanks,
P.Raveendran
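
For the keyword-preceded fields described in the question, a plain regular-expression pass may be enough before reaching for a gem. The sketch below is only a starting point under stated assumptions: the keywords "Country:", "Started:" and "Finished:" are hypothetical stand-ins for whatever the real documents use, the text of a Word document is assumed to have been extracted to a string already, and the regex for the HTML title is far less robust than a real HTML parser.

```ruby
# Hypothetical keywords; substitute the ones that actually appear in the
# documents. Each pattern captures the value that follows its keyword.
FIELD_PATTERNS = {
  country_code: /Country:\s*([A-Z]{2})\b/,
  started:      /Started:\s*(\d{1,2}:\d{2})/,
  finished:     /Finished:\s*(\d{1,2}:\d{2})/
}.freeze

# Returns a hash of the fields found in +text+; missing fields are nil.
def extract_fields(text)
  fields = {}
  # The unique number is assumed to be the first run of digits in the document.
  fields[:doc_id] = text[/\A\s*(\d+)/, 1]
  FIELD_PATTERNS.each do |name, pattern|
    fields[name] = text[pattern, 1]
  end
  fields
end

# Pulls the contents of the first <title> element out of an HTML snippet,
# or returns nil when there is none.
def html_title(html)
  title = html[/<title[^>]*>(.*?)<\/title>/im, 1]
  title && title.strip
end
```

The resulting hash maps directly onto column names, which should make the later database insert straightforward.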

James_Britt3 · 16 July 2009 03:36

Dylan wrote:

I'm not able to help with the parsing, but if you want to check all
files in a folder you can use this:

$all_files = []
Dir.chdir dir do
  $all_files += Dir["*"]
end

Might not Find be more useful overall?

http://www.ruby-doc.org/stdlib/libdoc/find/rdoc/classes/Find.html
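
A rough sketch of how Find could also gather the file name, MIME type and last update date asked about in the original question. The extension-to-MIME table and the collect_file_info name are assumptions made for illustration, and File.mtime reports the filesystem modification time, which may differ from any "last saved" date stored inside a Word document.

```ruby
require "find"

# Assumed extension-to-MIME map; extend it for the file types actually present.
MIME_BY_EXT = {
  ".doc"  => "application/msword",
  ".html" => "text/html",
  ".htm"  => "text/html"
}.freeze

# Walks +root+ recursively and returns one hash per regular file.
def collect_file_info(root)
  rows = []
  Find.find(root) do |path|
    next unless File.file?(path)   # skip directories and other non-files
    ext = File.extname(path).downcase
    rows << {
      file_name:        File.basename(path),
      mime_type:        MIME_BY_EXT.fetch(ext, "application/octet-stream"),
      last_update_date: File.mtime(path)
    }
  end
  rows
end
```

Unlike Dir["*"], Find.find descends into subdirectories, which helps if the 6000 documents are spread across several folders.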

--
James Britt

www.jamesbritt.com - Playing with Better Toys
www.ruby-doc.org - Ruby Help & Documentation
www.rubystuff.com - The Ruby Store for Ruby Stuff
www.neurogami.com - Smart application development
