extract repeated text from string


(Punit Jain) #1

Hi,

I am working on an issue where I need to extract repeated text from a
string:

The string is abcdefzfabcdefzfabcdefzf

I tried using a lookahead, /(?=(a.*f))/, but this extracts the groups as:

abcdefzfabcdefzfabcdefzf
abcdefzfabcdefzf
abcdefzf

However, I am looking for this output:
abcdefzf
abcdefzf
abcdefzf

Any clues?

Regards
Punit


(Hassan Schroeder) #2

Can you explain what the logic of the pattern is? This "works" for
your exact example:

2.5.1 (main):0 > sample
=> "abcdefzfabcdefzfabcdefzf"
2.5.1 (main):0 > sample.scan /(?=(a.*?f.*?f))/
=> [
  [0] [
    [0] "abcdefzf"
  ],
  [1] [
    [0] "abcdefzf"
  ],
  [2] [
    [0] "abcdefzf"
  ]
]
2.5.1 (main):0 >

but might not be universally applicable...
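
To see why the lazy version behaves differently, here is a small sketch comparing the two patterns on the sample string. The variable names are only illustrative, and the last pattern (a plain scan without the lookahead) assumes that non-overlapping matches are what is wanted.

```ruby
sample = "abcdefzfabcdefzfabcdefzf"

# Greedy: at each "a", .* runs to the last "f" in the string,
# so every capture extends to the end of the input.
greedy = sample.scan(/(?=(a.*f))/).flatten

# Lazy: .*? stops at the first "f" it can reach, so each capture is
# one unit ("a" ... first "f" ... second "f").
lazy = sample.scan(/(?=(a.*?f.*?f))/).flatten

# Without the lookahead, scan consumes each match and continues after it.
plain = sample.scan(/a.*?f.*?f/)

p greedy
p lazy
p plain
```

The lookahead only matters when matches may overlap; for back-to-back units like these, the plain scan returns the same three strings.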

···

On Wed, Jul 18, 2018 at 7:07 AM, Punit Jain <contactpunitjain@gmail.com> wrote:

--
Hassan Schroeder ------------------------ hassan.schroeder@gmail.com
twitter: @hassan
Consulting Availability : Silicon Valley or remote


(Robert K.) #3

This! The original question sounds a bit like Punit was looking for a
mechanism to identify repeated text in the input. As long as no
pattern for that text is given, regex is not the right tool for the
job.

If you know the first character of the repeated part (or the repeated
string always starts at a specific position) then you can cook
something:

irb(main):017:0> s
=> "abcabcabcdab"
irb(main):018:0> s.scan /((.+)\2)/
=> [["abcabc", "abc"]]
irb(main):019:0> s.scan /((.+)\2+)/
=> [["abcabcabc", "abc"]]

irb(main):020:0> s="abcdeabcdabcd"
=> "abcdeabcdabcd"
irb(main):021:0> s.scan /((.+)\2+)/
=> [["abcdabcd", "abcd"]]
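
If the whole string can be assumed to be one unit repeated back to back (an assumption, not something the original question guarantees), anchoring the backreference pattern at both ends yields the unit directly. `repeated_unit` is a hypothetical helper name used only for this sketch.

```ruby
# Returns the shortest unit whose repetition makes up the entire string,
# or nil when the string is not a pure repetition of some unit.
def repeated_unit(str)
  match = str.match(/\A(.+?)\1+\z/m)
  match && match[1]
end

s = "abcdefzfabcdefzfabcdefzf"
unit = repeated_unit(s)

# Rebuild the list of copies from the unit length.
copies = unit ? Array.new(s.length / unit.length, unit) : []

p unit
p copies
```

Because the pattern is anchored, a string such as "abcdeabcdabcd" yields nil instead of a partial match.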

Cheers

robert

···

On Wed, Jul 18, 2018 at 5:18 PM Hassan Schroeder <hassan.schroeder@gmail.com> wrote:

--
[guy, jim, charlie].each {|him| remember.him do |as, often| as.you_can
- without end}
http://blog.rubybestpractices.com/


(Lee Roberts) #4

I don't think a lookahead is necessary if your example matches the
real-world scenario. A scan with a regex for the string you are looking for
should return all the matches. Here's an example from the pry REPL:

[5] pry(main)> reggie = /abcdefz/
=> /abcdefz/
[6] pry(main)> stringie.scan(reggie)
=> []
[7] pry(main)> stringie = "abcdefz" * 20
=>
"abcdefzabcdefzabcdefzabcdefzabcdefzabcdefzabcdefzabcdefzabcdefzabcdefzabcdefzabcdefzabcdefzabcdefzabcdefzabcdefzabcdefzabcdefzabcdefzabcdefz"
[8] pry(main)> stringie.scan(reggie)
=> ["abcdefz",
"abcdefz",
"abcdefz",
"abcdefz",
...
"abcdefz"]
[9] pry(main)> stringie.scan(reggie).count
=> 20
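
When the repeated text is known in advance, as in the session above, the pattern can also be built from the literal string. A minimal sketch, where `needle` and `haystack` are illustrative names; `Regexp.escape` guards against characters that would otherwise be read as regex syntax.

```ruby
needle = "abcdefzf"
haystack = needle * 3

# Build the pattern from the literal text and collect every occurrence.
matches = haystack.scan(/#{Regexp.escape(needle)}/)

p matches
p matches.count
```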

I hope this is helpful and that I understood the question properly.

···

On Wed, Jul 18, 2018 at 11:18 AM Hassan Schroeder <hassan.schroeder@gmail.com> wrote:

Unsubscribe: <mailto:ruby-talk-request@ruby-lang.org?subject=unsubscribe>
<http://lists.ruby-lang.org/cgi-bin/mailman/options/ruby-talk>


(Punit Jain) #5

Hi Hassan,

I have a string variable containing the output of an EMC storage remote
command. The output looks like this:

*Director Identification: AB-1A*

    Director Type : FiberChannel
    Director Status : Online
    Director Slot No : 4

    Director Port: 5
        WWN Port Name :331123G56
        Director Port Status :PendOn
        SCSI Flags
          {
           Sequence(SEQ) :Disabled
           SCSI_Support1(OS2007) :Enabled
           }

      Director Port: 7
          WWN Port Name :3323H66
          Director Port Status :PendOn
          SCSI Flags
            {
             Sequence(SEQ) :Disabled
             SCSI_Support1(OS2007) :Enabled
             }

*Director Identification: AB-1B*

    Director Type : FiberChannel
    Director Status : Online
    Director Slot No : 6

    Director Port: 33
        WWN Port Name :331123G56
        Director Port Status :PendOn
        SCSI Flags
          {
           Sequence(SEQ) :Disabled
           SCSI_Support1(OS2007) :Enabled
           }

As you can see, there are two nested loops here:

1. An outer *Director Identification* loop
2. Within each *Director Identification*, an inner Director Port: loop

I need to extract each outer block and, for each one, its inner blocks to
process. Here is what I am doing:

cmdoutput_nonewline = cmdoutput.gsub("\n",'|')

directorids = cmdoutput_nonewline.scan(/(?=(Director Identification.*?\|))/)

puts "#{directorids.size}"

directorids.each do |directorid|

  puts directorid

end

This does not give the required output; instead it prints:

Director Identification: AB-1A|

Director Identification: AB-1B|
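
The lazy `.*?\|` stops at the first `|`, which is the end of the header line, so each capture contains only the header. One possible alternative, sketched below, is to split the text in front of each header line instead of scanning. The shortened sample input, the variable names, and the tolerance for an optional leading `*` are assumptions made for this illustration.

```ruby
cmdoutput = <<~TEXT
  Director Identification: AB-1A

      Director Type : FiberChannel

      Director Port: 5
          WWN Port Name :331123G56

      Director Port: 7
          WWN Port Name :3323H66

  Director Identification: AB-1B

      Director Type : FiberChannel

      Director Port: 33
          WWN Port Name :331123G56
TEXT

# Outer loop: cut the text in front of every "Director Identification:" line
# and drop anything that precedes the first header.
director_blocks = cmdoutput.split(/^(?=\*?Director Identification:)/)
                           .grep(/\A\*?Director Identification:/)

ports_by_director = director_blocks.each_with_object({}) do |block, result|
  # Inner loop: cut the block in front of every "Director Port:" line.
  header, *port_blocks = block.split(/^(?=[ \t]*Director Port:)/)
  id = header[/Director Identification:\s*([^\s*]+)/, 1]
  result[id] = port_blocks.map { |port| port[/Director Port:\s*(\d+)/, 1] }
end

p ports_by_director
```

Each element of `port_blocks` still holds the full text of one port, so further fields can be pulled out of it in the same way.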

Regards,
Punit

···

On Wed, Jul 18, 2018 at 8:48 PM, Hassan Schroeder <hassan.schroeder@gmail.com> wrote:



(Punit Jain) #6

Here is the actual use case with input data:

*Director Identification: AB-1A*

    Director Type : FiberChannel
    Director Status : Online
    Director Slot No : 4

    Director Port: 5
        WWN Port Name :331123G56
        Director Port Status :PendOn
        SCSI Flags
          {
           Sequence(SEQ) :Disabled
           SCSI_Support1(OS2007) :Enabled
           }

      Director Port: 7
          WWN Port Name :3323H66
          Director Port Status :PendOn
          SCSI Flags
            {
             Sequence(SEQ) :Disabled
             SCSI_Support1(OS2007) :Enabled
             }

*Director Identification: AB-1B*

    Director Type : FiberChannel
    Director Status : Online
    Director Slot No : 6

    Director Port: 33
        WWN Port Name :331123G56
        Director Port Status :PendOn
        SCSI Flags
          {
           Sequence(SEQ) :Disabled
           SCSI_Support1(OS2007) :Enabled
           }

I need to extract each Director Identification with its respective Director
Port entries, of which there can be one or many per identification.

Regards
Punit

···

On Wed, Jul 18, 2018 at 8:58 PM, Robert Klemme <shortcutter@googlemail.com> wrote:



(Hassan Schroeder) #7

> Here is the actual usecase with input data

LOL, this doesn't look much like your original question, but...

> Need to extract Director Identification with respective Director Port which
> can be 1 or many per identification.

What *exactly* does the output look like?

Just e.g. "Director Identification: AB-1B Director Port: 33" or more?

···

On Wed, Jul 18, 2018 at 9:13 AM, Punit Jain <contactpunitjain@gmail.com> wrote:



(Hassan Schroeder) #8

I would also ask if that indentation is consistent, so something like
this would work:

2.5.1 (main):0 > output.scan /\s{,2}(\w+ \w+):\s*([^\s]+)/
=> [
  [0] [
    [0] "Director Identification",
    [1] "AB-1A"
  ],
  [1] [
    [0] "Director Port",
    [1] "5"
  ],
  [2] [
    [0] "Director Port",
    [1] "7"
  ],
  [3] [
    [0] "Director Identification",
    [1] "AB-1B"
  ],
  [4] [
    [0] "Director Port",
    [1] "33"
  ]
]
2.5.1 (main):0 >
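
The flat list of pairs returned by the scan above can then be grouped per director. This is a minimal sketch that assumes the pairs arrive in document order; `pairs` is copied from the session output and `ports_by_director` is an illustrative name.

```ruby
pairs = [
  ["Director Identification", "AB-1A"],
  ["Director Port", "5"],
  ["Director Port", "7"],
  ["Director Identification", "AB-1B"],
  ["Director Port", "33"]
]

ports_by_director = {}
current = nil

pairs.each do |key, value|
  case key
  when "Director Identification"
    # A new identification starts a new group.
    current = value
    ports_by_director[current] = []
  when "Director Port"
    # Ignore ports that appear before any identification line.
    ports_by_director[current] << value if current
  end
end

p ports_by_director
```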

···

On Wed, Jul 18, 2018 at 9:18 AM, Hassan Schroeder <hassan.schroeder@gmail.com> wrote:



(Punit Jain) #9

Expected output:

"Director Identification":"AB-1A","Director Type":"FiberChannel","Director Status":"Online","Director Port":{ "Port":5,"WWN Port Name":"331123G56","SCSI Flags":{"Sequence(SEQ)":"Disabled"},"Director Port":{ "Port":7,"WWN Port Name":"3323H66","SCSI Flags":{"Sequence(SEQ)":"Disabled"}

"Director Identification":"AB-1B","Director Type":"FiberChannel","Director Status":"Online","Director Port":{ "Port":33,"WWN Port Name":"331123G56","SCSI Flags":{"Sequence(SEQ)":"Disabled"}

Regards,

Punit

···

On Wed, Jul 18, 2018 at 9:48 PM, Hassan Schroeder <hassan.schroeder@gmail.com> wrote:



(Saverio Miroddi) #10

Hello Punit,

This is very likely a JSON object. If so, you should be able to simply:

```ruby
require "json"

parsed_output = JSON.parse(output_string)

# [...]
```

With this you can easily manage `parsed_output`, which is a Hash.

You didn't copy/paste the text correctly, regardless of it being JSON or not. There are 4 opening braces and 2 closing ones.
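
A quick way to check whether a given string is valid JSON at all is to rescue the parser error. This is only a sketch; `parse_json_or_nil` is a hypothetical helper name, and the first sample string is a made-up, well-formed fragment rather than the text quoted below.

```ruby
require "json"

# Returns the parsed structure, or nil when the text is not valid JSON.
def parse_json_or_nil(text)
  JSON.parse(text)
rescue JSON::ParserError
  nil
end

valid = parse_json_or_nil('{"Director Identification":"AB-1A","Director Port":{"Port":5}}')
invalid = parse_json_or_nil("Director Identification: AB-1A")

p valid
p invalid
```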

Z
···

On 18.07.2018 18:55, Punit Jain wrote:



(Punit Jain) #11

You are right, Saverio, this is to be converted to JSON. I initially tried
the same, but got a parse error:

System/Library/Frameworks/Ruby.framework/Versions/2.0/usr/lib/ruby/2.0.0/json/common.rb:155:in
`parse': 757: unexpected token at 'Director Identification: AB-1A
(JSON::ParserError)

I think the input will require a lot of sanitization and conversion to the
required format. That's why I planned to go down the route of using a regex
with the scan method, but I am having trouble finding the right regex.

Regards,
Punit

···

On Wed, Jul 18, 2018 at 10:37 PM, Saverio M. <saverio.pub2@gmail.com> wrote:



(Hassan Schroeder) #12

> System/Library/Frameworks/Ruby.framework/Versions/2.0/usr/lib/ruby/2.0.0/json/common.rb:155:in
> `parse': 757: unexpected token at 'Director Identification: AB-1A
> (JSON::ParserError)

The input you showed is not remotely valid JSON.

> I think input will require lot of sanitization and converting to required
> format. Thats why I planned to go down the route of using regex with scan
> method, however facing problem in parsing with right regex.

Trying to create a single regex to parse this seems like a horrible
idea to me; time-consuming and bound to be brittle.

I would parse out the individual lines into `key: value` pairs and build
an object from that. Then write your formatter to take that object as
input and output the JSON you want.
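
A minimal sketch of that line-by-line approach follows. It is not a definitive implementation: `parse_directors` is a hypothetical helper, the sample input is shortened, and collecting the ports in a "Director Ports" array (rather than repeating a "Director Port" key) is a design assumption made so that the result is valid JSON.

```ruby
require "json"

# Hypothetical helper: turns the command output into an array of hashes,
# one per "Director Identification" block.
def parse_directors(text)
  directors = []
  director = port = section = nil

  text.each_line do |raw|
    line = raw.delete("*").strip
    next if line.empty? || line == "{"

    if line == "}"
      section = nil # end of a nested block such as "SCSI Flags"
      next
    end

    key, value = line.split(":", 2).map(&:strip)

    if value.nil?
      # A line without a colon (e.g. "SCSI Flags") opens a nested section.
      parent = port || director
      next unless parent
      section = {}
      parent[key] = section
    elsif key == "Director Identification"
      director = { key => value }
      directors << director
      port = section = nil
    elsif key == "Director Port"
      next unless director
      port = { "Port" => value.to_i }
      (director["Director Ports"] ||= []) << port
      section = nil
    else
      # Plain "key : value" line: store it at the innermost open level.
      target = section || port || director
      target[key] = value if target
    end
  end

  directors
end

output = <<~TEXT
  *Director Identification: AB-1A*

      Director Type : FiberChannel
      Director Status : Online
      Director Slot No : 4

      Director Port: 5
          WWN Port Name :331123G56
          Director Port Status :PendOn
          SCSI Flags
            {
             Sequence(SEQ) :Disabled
             SCSI_Support1(OS2007) :Enabled
             }

  *Director Identification: AB-1B*

      Director Port: 33
          WWN Port Name :331123G56
TEXT

directors = parse_directors(output)
puts JSON.pretty_generate(directors)
```

Keeping the parsing and the JSON formatting separate means the output shape can be changed later without touching the line handling.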

···

On Wed, Jul 18, 2018 at 10:21 AM, Punit Jain <contactpunitjain@gmail.com> wrote:
