You have a point. This would not be caught. I see only a few potential solutions, but they are kind of a hassle:
a) No world editing. The creator of a branch can give access to other people at will, but only explicitly.
b) No automatic running of tests. Updating to recent revisions of dependencies prompts the developer to review the changes to the code before running tests, and then gives an option to run the tests and/or revert to an older version that works.
c) Both of the above.
d) A sort of karma system in which a dependency update automatically updates and tests code that was edited by "trusted" developers, but falls back to method (b) for any code that has been edited ("tainted") by untrusted developers. A developer gains karma in the network whenever somebody decides to keep their revisions in their own projects. A developer loses karma whenever somebody updates a dependency, examines their code, and decides not to use it. You can define the karma threshold at which an update is automatically tested and integrated. A dependency itself may also be marked as editable only by those with high enough karma. This is my favorite idea here, but also the most difficult to implement.
e) Some programmatic way of checking for or eliminating the possibility of malicious code. (???)
- Jake McArthur
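
To make option (d) a bit more concrete, here is a minimal sketch of how the decision between automatic testing and manual review might look. Everything in it is hypothetical: the `KarmaLedger` class, the +1/-1 scoring, and the threshold value are assumptions for illustration, since the message above describes only the idea and not a design.

```ruby
# Hypothetical sketch of the karma idea in option (d).
# Class name, method names, and +1/-1 scoring are assumptions.
class KarmaLedger
  def initialize(threshold: 5)
    @threshold = threshold   # user-defined karma needed for automatic testing
    @karma = Hash.new(0)     # developer name => karma, everybody starts at 0
  end

  # Somebody decided to keep this developer's revision in their own project.
  def revision_kept(developer)
    @karma[developer] += 1
  end

  # Somebody examined this developer's revision and decided not to use it.
  def revision_rejected(developer)
    @karma[developer] -= 1
  end

  def trusted?(developer)
    @karma[developer] >= @threshold
  end

  # Auto-test only when every editor is trusted; a single untrusted
  # editor "taints" the update and forces review first, as in option (b).
  def update_policy(editors)
    editors.all? { |dev| trusted?(dev) } ? :auto_test : :manual_review
  end
end

ledger = KarmaLedger.new(threshold: 2)
2.times { ledger.revision_kept("alice") }
ledger.revision_rejected("mallory")

puts ledger.update_policy(["alice"])             # only a trusted editor
puts ledger.update_policy(["alice", "mallory"])  # tainted by an untrusted editor
```

A real system would also need to decide who is allowed to report "kept" and "rejected" events, which this sketch ignores.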
On Apr 28, 2006, at 11:12 AM, Elliot Temple wrote:
What if the code functioned exactly the same plus some nasty side effects like a root kit? Could that get through tests?
Just had a thought... Something similar _has_ been done before, but targeted at a different application. Take a look at services like MOSS <http://www.cs.berkeley.edu/~aiken/moss.html>, which was designed for plagiarism detection -- e.g. finding similar code across projects. There is a nice paper linked that talks about how they go about doing it.
Matt
On 28 Apr 2006, at 10:42 AM, Jake McArthur wrote:
That's exactly what I'm going for... something that nobody else really wants to try. If nobody else will make something like this, then I want to make it, or at least try; we would never reap the benefits otherwise.
What better chance is there for this than Summer of Code? It is the kind of project everybody secretly wants to have but could never realistically find the time for unless they could do it as a kind of job, and who would pay for something this risky? Really, the only way this could happen is if I propose it for Summer of Code.
- Jake McArthur
On Apr 28, 2006, at 12:40 AM, Victor Shepelev wrote:
"It would be nice if somebody had already done this, but
personally I would never even try"
Victor.
--
Matt Long mlong@acm.org / mtlong@csee.usf.edu
University of South Florida, CRASAR
GnuPG public key: http://www.csee.usf.edu/~mtlong/public_key.html
"In mathematics you don't understand things, you just get used to them."
- John von Neumann
Excellent find! Gives me some good algorithms to look up at the
least, but maybe even some code to use? (I can't see right now if it
is open source.
Yup, it's BSD-licensed; code is here:
http://pmd.sourceforge.net/xref/net/sourceforge/pmd/cpd/package-summary.html
No time right now. Gotta study for exams.)
Best of luck!
Yours,
Tom
I'm working on something similar also, although it's hardly even
started, and it isn't optimized for code sharing.
Also, somebody mentioned they were working on techniques to identify
similar strings. I think what you're looking for may be Levenshtein
distance.
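
For reference, here is a small sketch of Levenshtein distance using the standard dynamic-programming recurrence. The method name `levenshtein` is just illustrative, and this is not taken from any of the tools mentioned in this thread.

```ruby
# Levenshtein (edit) distance: the minimum number of single-character
# insertions, deletions, and substitutions needed to turn a into b.
def levenshtein(a, b)
  # previous[j] is the distance between the prefix of a processed so far
  # (initially empty) and the first j characters of b.
  previous = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]  # turning i characters of a into "" takes i deletions
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [
        previous[j] + 1,        # delete char_a
        current[j - 1] + 1,     # insert char_b
        previous[j - 1] + cost  # substitute, or keep on a match
      ].min
    end
    previous = current
  end
  previous.last
end

puts levenshtein("kitten", "sitting")  # → 3
```

This runs in time proportional to the product of the two lengths, so it suits short strings better than whole source files.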
--
Giles Bowkett
http://www.gilesgoatboy.org
On Apr 28, 2006, at 11:12 AM, Elliot Temple wrote:
> What if the code functioned exactly the same plus some nasty side
> effects like a root kit? Could that get through tests?
Just had a thought... Something similar _has_ been done before, but
targeted at a different application. Take a look at services like
MOSS <http://www.cs.berkeley.edu/~aiken/moss.html>, which was
designed for plagiarism detection -- e.g. finding similar code across
projects. There is a nice paper linked that talks about how they go
about doing it.
That is an interesting paper, thanks for the link! He lists a couple of
requirements:
1) whitespace independence - CPD has this for several languages (C, C++,
Java) since for those languages it uses JavaCC-generated parsers that
discard whitespace. For other languages (Ruby, PHP) it also discards
whitespace but does it a bit more clunkily, at a higher level in the
framework.
2) noise suppression - yup, this is important since you don't want to
catch little matches like "x.each do |y|". CPD allows you to set the
minimum match size; I usually start at about 100 and work my way lower
from there.
3) position independence - since you don't want moving things around to
affect the analysis. CPD mostly has this, I think.
Of course, the hard problem is fixing the duplicates once you find them;
that can be a delicate job.
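
The three requirements above can be illustrated with a rough sketch of k-gram fingerprinting plus a winnowing-style selection step. This is not CPD's or MOSS's actual code; the method names `fingerprints` and `similarity` and the defaults for `k` (minimum match size in characters) and `w` (window size) are assumptions made up for the example.

```ruby
require "set"

# Illustrative only: not CPD's or MOSS's actual code. k and w are made-up defaults.
def fingerprints(source, k: 5, w: 4)
  text = source.gsub(/\s+/, "")      # 1) whitespace independence
  return Set.new if text.length < k  # 2) noise suppression: runs shorter than k never match
  # String#hash is only stable within one Ruby process, which is enough here.
  hashes = (0..text.length - k).map { |i| text[i, k].hash }
  window = [w, hashes.length].min
  # 3) position independence: each window picks its minimum hash locally,
  #    so a chunk keeps most of its fingerprints wherever it is moved.
  Set.new(hashes.each_cons(window).map(&:min))
end

# Jaccard similarity of the two fingerprint sets, from 0.0 to 1.0.
def similarity(a, b, k: 5, w: 4)
  fa = fingerprints(a, k: k, w: w)
  fb = fingerprints(b, k: k, w: w)
  return 0.0 if fa.empty? || fb.empty?
  (fa & fb).size.to_f / (fa | fb).size
end

puts similarity("x = y + z * 2", "x=y+z*2")  # → 1.0, whitespace is ignored
```

Raising `k` plays the same role as raising the minimum match size: small incidental matches stop being reported.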
[SHAMELESS PLUG] If you're interested in reading more about CPD, I've
got a chapter on it in my book:
http://pmdapplied.com/
Yours,
Tom
Quote: "To date, the main application of Moss has been in detecting
plagiarism in programming classes."
Really... I've been known to do other things.
Must keep you busy, checking all those submissions... 
Matt
On 28 Apr 2006, at 12:09 PM, Matthew Moss wrote:
Quote: "To date, the main application of Moss has been in detecting
plagiarism in programming classes."
Really... I've been known to do other things.
--
Matt Long mlong@acm.org / mtlong@csee.usf.edu
University of South Florida, CRASAR
GnuPG public key: http://www.csee.usf.edu/~mtlong/public_key.html
The wars of the future will not be fought on the battlefield or at sea. They will be fought in space, or possibly on top of a very tall mountain. In either case, most of the actual fighting will be done by small robots. And as you go forth today remember always your duty is clear: To build and maintain those robots. Thank you.
-The Simpsons