Ruby Drops

This is the idea I am thinking of proposing for my Google Summer of Code project. This is from the Google Summer of Code thread:

I was struck with an idea yesterday that could theoretically be really nice for Ruby developers. I'm sure most of us are aware of the idea of keeping code DRY (Don't Repeat Yourself), and I think most people that know about it don't have too much of a problem with following it. Another idea that programmers follow, not as well, is to not reinvent the wheel, which for the sake of the rest of this e-mail I will refer to as DROP (Don't Repeat Other People).

My idea is to create an open source code repository, web site, and set of tools designed to help people to automate the process of factoring code out of their projects which they can all share. First, it helps them to find instances of code that need to be DRYed or DROPed by comparing lines of code across the entire code base in the repository and pointing out lines that are similar to things that have already been done before. If the programmer finds things within his program which he repeated, then it should be a simple matter of factoring out to another function or class within his code to DRY it. If he finds that somebody else has similar code, he can factor it out into a separate "project" in the repository to DROP it. People with similar code in the repository are notified so that they can update their individual projects accordingly if they desire to do so.

Using code that has been factored out into these external projects should be both easy to integrate and easy to keep up to date in each project. Though I'm not quite sure of the mechanics of how that would be done yet, I'm envisioning a script programmers can run that will bring all functions and classes they are using from external projects up to date in their own program. As it does this, it runs all the programmer's tests to make sure that it doesn't break something and pulls back to a previous revision if necessary. (As such, it would practically be a requirement that all code that takes advantage of this be unit tested.) This would also provide the benefit that factored out projects can be edited by anyone, like a wiki, without screwing everything up; any time something gets messed up or is incompatible with some projects, somebody will see when they try to update and can fix it themselves.
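To make the update-and-roll-back step concrete, here is a minimal sketch in Ruby. It only illustrates the idea above and is not a design for the real tool: the method name `update_with_rollback`, the `.bak` backup file, and the convention that the block returns true when the tests pass are all assumptions made for the example.

```ruby
require "fileutils"

# Replace the file at `path` with `new_source`, then yield so the caller can
# run the project's tests. If the block returns a falsy value or raises,
# the previous revision is restored.
# (Hypothetical helper; the names are not part of any existing tool.)
def update_with_rollback(path, new_source)
  backup = "#{path}.bak"
  FileUtils.cp(path, backup)      # keep the last working revision
  File.write(path, new_source)    # bring in the updated shared code

  passed = begin
    yield
  rescue StandardError
    false                         # a crashing test run counts as a failure
  end

  if passed
    FileUtils.rm(backup)          # update accepted
  else
    FileUtils.mv(backup, path)    # roll back to the previous revision
  end
  passed ? true : false
end
```

In practice the block might shell out to the project's test task, for example `update_with_rollback("lib/shared.rb", new_source) { system("rake", "test") }`; both the path and the command are only examples.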

The web site would show the projects in the repository, provide a method of discussion around the various bits of code, and give downloads and instructions for using the resource for yourself.

My hope is that this would be a tool that could speed up development, simplify and stabilize Ruby programs, and bring a collaborative atmosphere even to individual projects.

I'm making a thread for it because I'm looking for input (ideas, suggestions, etc.).

- Jake McArthur

Come on, it can't be that bad of an idea! Is this really going to go over that badly if I submit this as a project proposal?

- Jake McArthur

My idea is to create an open source code repository, web site, and set of tools designed to help people to automate the process of factoring code out of their projects which they can all share. First, it helps them to find instances of code that need to be DRYed or DROPed by comparing lines of code across the entire code base in the repository and pointing out lines that are similar to things that have already been done before.

Your idea sounds good to me. It'd have the additional benefit of helping people notice that they are writing bad code. Many people who understand the DRY principle in abstract haven't made the connections to realise all the situations it applies to. And repeating other people is easy, and it's hard to know that you are.

I'm new here, but I'll try to give feedback.

How will it find similar code? One simple issue is that people will name their variables and methods differently, so you'll want to somehow see the structure of a section of code and ignore a lot of details. But you can't ignore the details too much. Maybe (trivial example) someone wrote a "max" function and someone else in-lined it, and otherwise their code blocks are the same.
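One way to "see the structure and ignore the details" is to compare token streams with names and literals blanked out. The sketch below uses Ripper, the lexer in Ruby's standard library; that choice, the placeholder strings, and the method name `structural_fingerprint` are assumptions for illustration, not something proposed in this thread.

```ruby
require "ripper"

# Token types dropped entirely: they carry layout, not structure.
IGNORED = %i[on_sp on_comment on_ignored_nl].freeze

# Token types collapsed to a placeholder so naming differences vanish.
PLACEHOLDERS = {
  on_ident: "ID",
  on_int: "LIT",
  on_float: "LIT",
  on_tstring_content: "LIT"
}.freeze

# Returns the token texts of `source` with identifiers and simple literals
# replaced by placeholders. Two snippets with the same fingerprint have the
# same shape, whatever their variables are called.
def structural_fingerprint(source)
  Ripper.lex(source)
        .reject { |(_pos, type, _text)| IGNORED.include?(type) }
        .map { |(_pos, type, text)| PLACEHOLDERS.fetch(type, text) }
end
```

Note that this blanks out called method names too, so `list.first` and `list.last` look identical. That is exactly the "ignoring the details too much" risk, and it does nothing for the inlined "max" case, where the token streams genuinely differ.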

If the programmer finds things within his program which he repeated, then it should be a simple matter of factoring out to another function or class within his code to DRY it.

I don't think all code is simple to refactor like that. But maybe enough is for this to be useful. Maybe most is? I don't know.

If he finds that somebody else has similar code, he can factor it out into a separate "project" in the repository to DROP it. People with similar code in the repository are notified so that they can update their individual projects accordingly if they desire to do so.

Using code that has been factored out into these external projects should be both easy to integrate and easy to keep up to date in each project. Though I'm not quite sure of the mechanics of how that would be done yet, I'm envisioning a script programmers can run that will bring all functions and classes they are using from external projects up to date in their own program. As it does this, it runs all the programmer's tests to make sure that it doesn't break something and pulls back to a previous revision if necessary. (As such, it would practically be a requirement that all code that takes advantage of this be unit tested.)

I don't have much experience with unit tests. How well can they usually withstand arbitrary changes to code with subtle bugs?

This would also provide the benefit that factored out projects can be edited by anyone, like a wiki,

It's a bit off-topic, but I'm not sure how good an idea wikis are. Wikipedia gets a lot of vandalism. But worse: what happens when people have a legitimate disagreement about how some code should be written? "anyone can post anything" doesn't provide a way to resolve disagreement.

There could also be a risk of malicious code that people auto-update.

without screwing everything up; any time something gets messed up or is incompatible with some projects, somebody will see when they try to update and can fix it themselves.

The web site would show the projects in the repository, provide a method of discussion around the various bits of code, and give downloads and instructions for using the resource for yourself.

My hope is that this would be a tool that could speed up development, simplify and stabilize Ruby programs, and bring a collaborative atmosphere even to individual projects.

I wonder how well the code-similarity algorithm would work for non-Ruby code. Just curious how Ruby-specific the tests would be vs how general.

I'm making a thread for it because I'm looking for input (ideas, suggestions, etc.).

Hope that helped :)

-- Elliot Temple

On Apr 27, 2006, at 12:46 PM, Jake McArthur wrote:

[skip]

> First, it helps them to find instances of code that need to be
> DRYed or DROPed by comparing lines of code across the entire code
> base in the repository and pointing out lines that are similar to
> things that have already been done before. If the programmer finds
> things within his program which he repeated, then it should be a
> simple matter of factoring out to another function or class
> within his code to DRY it. If he finds that somebody else has
> similar code, he can factor it out into a separate "project" in the
> repository to DROP it. People with similar code in the repository
> are notified so that they can update their individual projects
> accordingly if they desire to do so.

I am afraid the problem here is: what would I do even if I found that some of
my code repeats code in the repository? I *had already written* the code, so
where is my benefit? To understand "Oops, I'm an idiot"? I already knew
:))

A much more useful repository would help me find the code I only *intend* to
write - and here code comparison isn't necessary, because I have nothing to
compare yet.

What do you think about this?

- Jake McArthur

Victor.

I think that this idea would work better as a social network which
tied together Ruby projects via tagging and rss feeds and all that
other yummy stuff.

Basically, rather than trying to accomplish the very hard task of
trying to compare ruby code, where there are often many ways to do
things (and let's not even consider meta-programming!) we tie
projects together through a bunch of community maintained meta-data
which helps support DROP.

This might include open commenting systems on the source code, so that
people can review the code live, things which automatically sense
similar projects or similar dependencies and things like that, and
just the general cool stuff that a well set up social network could
provide.

I think it would be technically impossible to implement a good
automated solution to find duplicated code. This is a more human
oriented option which could be helpful and a lot of fun.

Plus... you still have a project here... it could be implemented
nicely in Rails or Nitro or something :)

On 4/28/06, Jake McArthur <jake.mcarthur@gmail.com> wrote:

Come on, it can't be that bad of an idea! Is this really going to go
over that badly if I submit this as a project proposal?

How will it find similar code? One simple issue is that people will name their variables and methods differently, so you'll want to somehow see the structure of a section of code and ignore a lot of details. But you can't ignore the details too much. Maybe (trivial example) someone wrote a "max" function and someone else in-lined it, and otherwise their code blocks are the same.

I've already been working on this. Right now, I'm making a simple algorithm that works on arbitrary text and returns a number reflecting how similar two strings are. Even this alone has been giving fairly good results on code, even code that was written rather differently, but my plan is to use this algorithm to compare symbols and literals. A similar algorithm, working on a slightly larger scale, would compare entire lines of code for similar syntax, augmented by data from the first algorithm.
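The post does not say which algorithm is being used, so the following is only a stand-in: a minimal sketch of one common way to turn two arbitrary strings into a similarity number, using Levenshtein edit distance scaled to the range 0.0 to 1.0. The method names `levenshtein` and `similarity` are assumptions for the example.

```ruby
# Classic dynamic-programming edit distance: the minimum number of
# single-character insertions, deletions, and substitutions needed to
# turn `a` into `b`. Only two rows of the table are kept in memory.
def levenshtein(a, b)
  previous = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [previous[j] + 1,           # deletion
                  current[j - 1] + 1,        # insertion
                  previous[j - 1] + cost].min # substitution or match
    end
    previous = current
  end
  previous.last
end

# 1.0 means identical, 0.0 means nothing lines up at all.
def similarity(a, b)
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?
  1.0 - levenshtein(a, b).to_f / longest
end
```

A score like this could be applied to individual symbols and literals as described above, with a second pass combining the per-token scores into a per-line score.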

I'm still thinking about this. Suggestions, anybody?

I don't think all code is simple to refactor like that. But maybe enough is for this to be useful. Maybe most is? I don't know.

By far it is not, but all I meant is that there is no need to mess with the system to do it.

I don't have much experience with unit tests. How well can they usually withstand arbitrary changes to code with subtle bugs?

Well-tested code will not break unless a test was missed, and if a bug is found, writing a test to cover it will practically squish that particular bug permanently.

It's a bit off-topic, but I'm not sure how good an idea wikis are. Wikipedia gets a lot of vandalism. But worse: what happens when people have a legitimate disagreement about how some code should be written? "anyone can post anything" doesn't provide a way to resolve disagreement.

There could also be a risk of malicious code that people auto-update.

Disagreements could be resolved by simply forking off another project. Everybody is happy. And anyway, if everybody agrees on tests, and those tests pass, everybody should be happy anyway.

Well-tested projects will not be affected by malicious code because the system would see that tests fail and revert back to the last working version.

I wonder how well the code-similarity algorithm would work for non-Ruby code. Just curious how Ruby-specific the tests would be vs how general.

The algorithm I'm currently working on is language agnostic, but it doesn't benefit from syntax parsing and the like, as my later plans would.

- Jake McArthur

Jake,

I am working on something uncannily similar to what you describe. I
imagined it to also be wiki-integrated and to present a cleaned-up
version of test code that is human readable.

have a look at Liquid Development: The Social Evolution of Software

and tell me what you think

Cheers

--
Chiaroscuro
---
Liquid Development: http://liquiddevelopment.blogspot.com/

On 4/28/06, Jake McArthur <jake.mcarthur@gmail.com> wrote:

Come on, it can't be that bad of an idea! Is this really going to go
over that badly if I submit this as a project proposal?

- Jake McArthur

There are many benefits:

a) You help others to not repeat themselves (obvious).
b) You open parts of your code up so that others have reason to find and fix your bugs.
c) It creates a much more useful repository of code than ordinarily because this is code that people actually are using and maintaining, not just things people figured might be useful later.

The point isn't to see that you repeated somebody when you shouldn't have. The point is to see the repeat and save everybody the trouble and factor it into one central location. It's not too difficult; like you said, you already wrote the code. It's just a matter of branching it off and making it shiny. Code that has been factored out would, of course, be tagged and searchable so that people (or you) _can_ look for the code they "intend" to write.

- Jake McArthur

I am afraid the problem here is: what would I do even if I found that some of
my code repeats code in the repository? I *had already written* the code, so
where is my benefit? To understand "Oops, I'm an idiot"? I already knew
:))

A much more useful repository would help me find the code I only *intend* to
write - and here code comparison isn't necessary, because I have nothing to
compare yet.

What do you think about this?

It is a very good point (I just tried to write something like this).
Life shows that "dumb social" systems work faster and better than "smart
intellectual analysis".
At least it sounds like something we can *try* to do (while the
first idea sounded like "It would be nice if somebody had already done this, but
personally I would never even try").

Victor.

On 4/28/06, Jake McArthur <jake.mcarthur@gmail.com> wrote:
> Come on, it can't be that bad of an idea! Is this really going to go
> over that badly if I submit this as a project proposal?

I think that this idea would work better as a social network which
tied together Ruby projects via tagging and rss feeds and all that
other yummy stuff.

Basically, rather than trying to accomplish the very hard task of
trying to compare ruby code, where there are often many ways to do
things (and let's not even consider meta-programming!) we tie
projects together through a bunch of community maintained meta-data
which helps support DROP.

This might include open commenting systems on the source code, so that
people can review the code live, things which automatically sense
similar projects or similar dependencies and things like that, and
just the general cool stuff that a well set up social network could
provide.

I think it would be technically impossible to implement a good
automated solution to find duplicated code. This is a more human
oriented option which could be helpful and a lot of fun.

Plus... you still have a project here... it could be implemented
nicely in Rails or Nitro or something :)

There's always the copy/paste detector CPD for that:

http://pmd.sf.net/cpd.html

It's got a (very basic) Ruby parser.

Yours,

tom

On Fri, 2006-04-28 at 14:31 +0900, Gregory Brown wrote:

I think it would be technically impossible to implement a good
automated solution to find duplicated code. This is a more human
oriented option which could be helpful and a lot of fun.

CPD uses the Burrows-Wheeler transform to find exact matches. It has
some options to ignore identifiers and literals, although that results
in false positives sometimes...

Yours,

Tom

On Fri, 2006-04-28 at 23:30 +0900, Jake McArthur wrote:

> How will it find similar code? One simple issue is that people will
> name their variables and methods differently, so you'll want to
> somehow see the structure of a section of code and ignore a lot of
> details. But you can't ignore the details too much. Maybe (trivial
> example) someone wrote a "max" function and someone else in-lined
> it, and otherwise their code blocks are the same.

I've already been working on this. Right now, I'm making a simple
algorithm that works on arbitrary text and returns a number
reflecting how similar two strings are. Even this alone has been
giving fairly good results on code, even code that was written rather
differently, but my plan is to use this algorithm to compare symbols
and literals. A similar algorithm, working on a slightly larger
scale, would compare entire lines of code for similar syntax,
augmented by data from the first algorithm.

[... social network ...]

I do like the idea, but it's a different spirit from what I'm going for. There are tons of developer communities, and tons of source code repositories. While I don't think I've seen a social network for developers before, I don't really see it kicking off so well unless it happens to be language agnostic, which is fine except that it would only become that much more difficult to find anything that applies to your particular project.

I'm trying to do this as a learning experience, and as something that hasn't been done before. While I like some social networks, there are only so many that people can be active in. A web site like this would have to compete with other social networks (with overlapping functionality) in order to have anything worth using, and I know I can't do that. Because of that, I have to try something new.

This might include open commenting systems on the source code, so that
people can review the code live, things which automatically sense
similar projects or similar dependencies and things like that, and
just the general cool stuff that a well set up social network could
provide.

I thought of this, but I don't like the idea of having to go through so much extra trouble just to get rigged up for the system. And as for automatically sensing similar projects, that's just a larger scale version of what I'm trying to do already.

- Jake McArthur

it seems down at the moment, but this is close/perfect for your needs

   http://complearn.org/

google cache (until site up)

   http://72.14.207.104/search?q=cache:bmlzYI4W39sJ:www.complearn.org/+complearn&hl=en&gl=us&ct=clnk&cd=1

more links

   http://www.newscientist.com/article.ns?id=dn3602
   http://homepages.cwi.nl/~cilibrar/musicart/trnmag.com/Stories/2003/042303/Software_sorts_tunes_042303.html

i've played with it and, since there are command line tools and a ruby api, i
would think you could categorize text quite easily.
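CompLearn is built around the normalized compression distance (NCD): two texts count as similar if compressing them together costs little more than compressing either one alone. The sketch below is not CompLearn's API. It only illustrates the measure with Ruby's standard `zlib` library, and the method names `compressed_size` and `ncd` are made up for the example.

```ruby
require "zlib"

# Size in bytes of `text` after DEFLATE compression.
def compressed_size(text)
  Zlib::Deflate.deflate(text, Zlib::BEST_COMPRESSION).bytesize
end

# Normalized compression distance:
#   NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
# Values near 0 suggest the inputs share a lot of content; values near 1
# suggest they are unrelated.
def ncd(x, y)
  cx = compressed_size(x)
  cy = compressed_size(y)
  cxy = compressed_size(x + y)
  (cxy - [cx, cy].min).to_f / [cx, cy].max
end
```

The measure is language agnostic, which fits the approach described in the quoted post, but on very short snippets the compressor's fixed overhead distorts the result, so it is better suited to whole methods or files than to single lines.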

we are actually playing with this to identify spatial/temporal trends in
nighttime lights satellite imagery.

cheers.

-a

On Fri, 28 Apr 2006, Jake McArthur wrote:

I've already been working on this. Right now, I'm making a simple algorithm
that works on arbitrary text and returns a number reflecting how similar two
strings are. Even this alone has been giving fairly good results on code,
even code that was written rather differently, but my plan is to use this
algorithm to compare symbols and literals. A similar algorithm, working on a
slightly larger scale, would compare entire lines of code for similar
syntax, augmented by data from the first algorithm.

I'm still thinking about this. Suggestions, anybody?

--
be kind whenever possible... it is always possible.
- h.h. the 14th dalai lama

Jake McArthur wrote:

> How will it find similar code? One simple issue is that people will
> name their variables and methods differently, so you'll want to
> somehow see the structure of a section of code and ignore a lot of
> details. But you can't ignore the details too much. Maybe (trivial
> example) someone wrote a "max" function and someone else in-lined
> it, and otherwise their code blocks are the same.

I've already been working on this. Right now, I'm making a simple
algorithm that works on arbitrary text and returns a number
reflecting how similar two strings are. Even this alone has been
giving fairly good results on code, even code that was written rather
differently, but my plan is to use this algorithm to compare symbols
and literals. A similar algorithm, working on a slightly larger
scale, would compare entire lines of code for similar syntax,
augmented by data from the first algorithm.

I'm still thinking about this. Suggestions, anybody?

(my 1st ruby-talk post not from Google groups )
cyclomatic complexity may have some value as another input
http://saikuro.rubyforge.org/

also look at gonzui and doing some kind of vector space-LSI modelling
based on ruby keywords, core and std lib methodnames, etc.
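As a rough illustration of the vector-space part of that suggestion, the sketch below counts Ruby keywords in each snippet and compares the counts with cosine similarity. It is deliberately naive: it uses neither gonzui nor Saikuro, it skips the singular value decomposition that real LSI would add, and the keyword list and method names are assumptions for the example.

```ruby
# A small, incomplete sample of Ruby keywords used as vector dimensions.
KEYWORDS = %w[
  def end if else elsif unless while until case when
  do yield return class module begin rescue ensure
].freeze

# Count how often each keyword appears in `source`. This is a plain word
# scan, so keywords inside strings and comments are counted too.
def keyword_vector(source)
  counts = Hash.new(0)
  source.scan(/[a-z_]+/) { |word| counts[word] += 1 if KEYWORDS.include?(word) }
  KEYWORDS.map { |keyword| counts[keyword] }
end

# Cosine of the angle between two equal-length count vectors:
# 1.0 for the same keyword mix, 0.0 for no keywords in common.
def cosine_similarity(u, v)
  dot = u.zip(v).sum { |a, b| a * b }
  norm = Math.sqrt(u.sum { |a| a * a }) * Math.sqrt(v.sum { |b| b * b })
  norm.zero? ? 0.0 : dot / norm
end
```

Extending the dimensions with core and standard library method names, as suggested above, would only mean a longer word list.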

The cleaned-up code (which I call 'prozed') looks like the coloured
bits of code at:
http://liquiddevelopment.blogspot.com/2006/01/software-roadmap.html

proze and intent (my rSpec-like framework) produce stuff like:

###############################################
people = [
    "john",
    "mike",
    "Sam"
]

story "Turning people names to uppercase"
    people to upcase should be ["JOHN","MIKE","SAM"]
###############################################

and

###############################################
story "Using Cashflows"
        Let's now deal with cashflows

        with Conventional discount = continuously compounded yield

        we expect about 271.67748, as the
            present value of
              Cashflow [
                 100 at 1 year ,
                 100 at 2 years ,
                 100 at 3 years
              ],
              with interest rate at 5 percent
###############################################

this code is easily readable and the intentional code could be
automatically imported within the editor (FreeRIDE has a nice system
to write plugins to connect to the repository) while the appropriate
gems get downloaded in background.

What do you say? Shall we join forces?

On 4/28/06, chiaro scuro <kiaroskuro@gmail.com> wrote:

Jake,

I am working on something uncannily similar to what you describe. I
imagined it to also be wiki-integrated and to present a cleaned-up
version of test code that is human readable.

have a look at Liquid Development: The Social Evolution of Software

and tell me what you think

Cheers

--
Chiaroscuro
---
Liquid Development: http://liquiddevelopment.blogspot.com/

On 4/28/06, Jake McArthur <jake.mcarthur@gmail.com> wrote:
> Come on, it can't be that bad of an idea! Is this really going to go
> over that badly if I submit this as a project proposal?
>
> - Jake McArthur
>

--
Chiaroscuro
---
Liquid Development: http://liquiddevelopment.blogspot.com/

What if the code functioned exactly the same plus some nasty side effects like a root kit? Could that get through tests?
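A minimal sketch of the concern, with hypothetical names throughout: two versions of a shared helper return the same values, but one also does something extra behind the scenes. A test suite that only checks return values cannot tell them apart.

```ruby
# Stand-in for whatever a malicious edit might touch (files, network, ...).
LEAKED = []

def honest_max(a, b)
  a > b ? a : b
end

def tampered_max(a, b)
  LEAKED << [a, b]   # harmless here, but represents an unwanted side effect
  a > b ? a : b
end

# A typical return-value test suite, run against either implementation.
def passes_tests?(implementation)
  [[1, 2, 2], [5, 3, 5], [4, 4, 4]].all? do |a, b, expected|
    implementation.call(a, b) == expected
  end
end
```

Both `passes_tests?(method(:honest_max))` and `passes_tests?(method(:tampered_max))` come back true, so an automatic "update if the tests pass" step would accept the tampered version unless some test specifically checked for the side effect.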

-- Elliot Temple

On Apr 28, 2006, at 7:30 AM, Jake McArthur wrote:

Well-tested projects will not be affected by malicious code because the system would see that tests fail and revert back to the last working version.

You're right. It is very similar. We do, however, have slightly different ideas. Yours seems to be a repository based around nuggets of code which is searched by comparing the nuggets' "intent" with your own. I really like it too, but it's not the same.

Mine is to compare code directly, even code that normally wouldn't be classified as a stand-alone "nugget," like inline code inside large projects, code that is a bit interspersed with other code, etc. In this way, similarities within individual projects can be located and factored out. This approach seems to focus less on explicitly sharing _everything_ (and trying to make code work for _everybody_) and more on getting your own project done, with the improvement of the collective code base for everybody coming almost as a side-effect.

- Jake McArthur

On Apr 28, 2006, at 10:16 AM, chiaro scuro wrote:

Jake,

I am working on something uncannily similar to what you describe. I
imagined it to also be wiki-integrated and to present a cleaned-up
version of test code that is human readable.

have a look at Liquid Development: The Social Evolution of Software

and tell me what you think

Cheers

--
Chiaroscuro
---
Liquid Development: http://liquiddevelopment.blogspot.com/

On 4/28/06, Jake McArthur <jake.mcarthur@gmail.com> wrote:

Come on, it can't be that bad of an idea! Is this really going to go
over that badly if I submit this as a project proposal?

- Jake McArthur

Jake McArthur wrote:

There are many benefits:

a) You help others to not repeat themselves (obvious).
b) You open parts of your code up so that others have reason to find and fix your bugs.
c) It creates a much more useful repository of code than ordinarily because this is code that people actually are using and maintaining, not just things people figured might be useful later.

d) You find not only the bit you've already written, but also the bit that goes with it that you were about to write.

--
Alex

That's exactly what I'm going for... something that nobody else really wants to try. If nobody else will make something like this, then I want to make it, or at least try; we would never reap the benefits otherwise.

What better chance is there for this than Summer of Code? It is the kind of project everybody secretly wants to have but would never realistically find the time for unless they could do it as a kind of job, and who would pay for something this risky? Proposing it for Summer of Code is really the only way this could happen.

- Jake McArthur

On Apr 28, 2006, at 12:40 AM, Victor Shepelev wrote:

"It would be nice if somebody had already done this, but
personally I would never even try"

Victor.

Excellent find! Gives me some good algorithms to look up at the least, but maybe even some code to use? (I can't see right now if it is open source. No time right now. Gotta study for exams.)

- Jake McArthur

On Apr 28, 2006, at 9:37 AM, Tom Copeland wrote:

CPD uses the Burrows-Wheeler transform to find exact matches. It has
some options to ignore identifiers and literals, although that results
in false positives sometimes...

Yours,

Tom