Show HN: What parts of your code are not original
104 points by nunobrito on Feb 6, 2016 | 36 comments
Hello,

Are you curious to discover which snippets of your code were copied from Stackoverflow?

Where else on the Internet are those icons that you copied a long time ago?

Or simply to discover which licenses apply to the open source in your code?

There's an app for that: http://triplecheck.net/quantum/

Development of this tooling took over two years, and we archived over 630 TB of open source data from around the web. Some data sources have gone offline in the meantime, but we kept a copy for posterity.

Things to consider:
- Stackoverflow snippet detection is limited to Java at this moment
- However, snippet detection works for mainstream languages in other repositories (sourceforge, github, googlecode, etc)
- app is command line based (our UX skills suck), you need java installed
- please let me know if pricing is too high or too low. We are bootstrapped, since there is no VC then pricing == survival
- bugs will happen. Early edition, my apologies in advance for any bugs that surface
- privacy NOT guaranteed. I don't store your code, only fingerprints are sent to the server and these are NOT stored after scan is concluded. However, your data will be captured by network providers. Please don't scan critical code, there's a secure offline app. Details at http://triplecheck.net/what-we-do.html
- more than 300 open source licenses are detected

If the tool helped you: please retweet, upvote or just share your feedback and tips on how to make this grow from here. From one engineer to another: My personal thanks, I mean it.

-- Nuno



Cool!

And thanks for being up front about "privacy NOT guaranteed". Such a statement builds trust for me as an engineer.

The low pricing puts this in range for me if I would do independent software development and outsource part of the work. I would want to make sure I don't pay for development and somebody just copy-pastes some GPL code. That could also spell trouble the day I might want to be acquired by some big player and the problem surfaces in their audit.

I'm passionate about software freedom. I can really see the usefulness of a tool such as this being part of continuous integration to keep developers honest, especially if some part of development is outsourced to big software factories.

But I'm having a hard time convincing project managers etc. about the importance of license compliance. If you can help me with this I might be able to sell in a tool like this.


Indeed. License compliance is something very secretive; old-style management does not really want to hear that they are using GPL code. When the company gets acquired, it is a shock to discover otherwise. It would be better to understand what the GPL is all about rather than hiding from it. Last year we saw the acquisition of a company in Germany fail because they were not respecting copyleft licenses.

Would appreciate your help. How can we get in contact? There is a contact form on our website if you wish. Thank you! :-)


What makes me really sad is organizations where management sort of understands, but argues things like:

  * we have never cared about this since we started way back and it has worked fine!
  * it will cost us time and money, let's ignore it.
  * how could someone ever find out?
That said, having a tool that can clearly point to problems could be a big help when management changes and someone sympathetic to uncovering and fixing license issues comes in.

(Contact details are in my profile.)


If you are worried that a subcontractor would copy-paste some existing code, I would be more worried that they would reuse code between clients. Just imagine the lawsuit where a for-profit competitor found out that your flagship product has code that they previously bought through the same developer.


Subcontractors should always spell out that some parts of the delivered project will use MIT licenced code, or code licenced to the subcontracting company itself.

Personally, when I do business development and two companies end up co-authoring a library (like say a new plugin), we always specify what parts of the new library are for the company we helped write the plugin, and what parts are "ours".

Sometimes we'll write a bit and say "this is yours, we'll never use it again", and sometimes we'll say "we wrote this, you can use it, but we're free to give it to anyone else too".


This gives me severe flashbacks to clueless clients/PMs. We had projects we had to run through Black Duck -- "No open source code."

The reference implementation of the Mersenne Twister was once GPL, although it wasn't anymore at that time. Still, there are only so many ways you can implement a Mersenne Twister. So my implementation got flagged.


The worst thing is sometimes getting 30 different projects with the same snippet of code that none of them wrote: it was simply copied from Stackoverflow and then applied by each developer to their own code.

Have you had a chance to try out the report?

If you are scanning open source, there is a trick to ignore matches from the repository where it comes from.

Just create a file called "ignore.txt" inside the samples folder and list in that file the keywords that should blacklist positive matches. For example, if you are scanning the "Adblock Plus" code, add the keyword "adblockplus" to the ignore.txt file, and no matches from repositories containing "adblockplus" in their URL will be listed.
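
For readers wondering how such a keyword filter behaves, here is a minimal sketch in Python. It is not triplecheck's actual code: the function names, the shape of a match (a dict with a `url` field) and the one-keyword-per-line format of ignore.txt are all assumptions made for illustration.

```python
def load_keywords(text):
    """Parse ignore.txt content, assuming one keyword per line."""
    return [line.strip().lower() for line in text.splitlines() if line.strip()]


def filter_matches(matches, keywords):
    """Drop every match whose repository URL contains an ignored keyword."""
    return [
        m for m in matches
        if not any(k in m["url"].lower() for k in keywords)
    ]


# Hypothetical scan results for the "Adblock Plus" example.
matches = [
    {"file": "lib/filter.js", "url": "https://github.com/adblockplus/adblockplus"},
    {"file": "lib/utils.js", "url": "https://example.org/some-other-project"},
]
keywords = load_keywords("adblockplus\n")
remaining = filter_matches(matches, keywords)
print([m["url"] for m in remaining])
```

Only the match from the unrelated repository survives, which is the behaviour described for scanning a project against its own upstream.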

This works well for discovering which parts of an already open source project are not really original.


Why did you need to write an implementation of Mersenne Twister?


"No open source code."


Yeah, we got that. But why did you need a Mersenne Twister?


My understanding is that Mersenne Twister is kind of a gold standard for non-cryptographic PRNGs. Linear congruential generators are known to be rather poor, and only ideal where speed is critical and the quality doesn't matter.
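
To illustrate the quality gap, here is a small sketch in Python. The LCG constants are borrowed from the sample rand() in the C standard, but exposing the full state (low bits included) is a simplification chosen here to make the weakness visible. CPython's `random` module is a Mersenne Twister (MT19937), so it serves as the comparison.

```python
import random


def lcg(seed, a=1103515245, c=12345, m=2**31):
    """Textbook linear congruential generator that yields its full state."""
    state = seed
    while True:
        state = (a * state + c) % m
        yield state


gen = lcg(seed=1)
lcg_low_bits = [next(gen) & 1 for _ in range(16)]

# CPython's random.Random is a Mersenne Twister (MT19937).
mt = random.Random(1)
mt_low_bits = [mt.getrandbits(32) & 1 for _ in range(16)]

print("LCG low bits:", lcg_low_bits)  # strictly alternating 0, 1, 0, 1, ...
print("MT  low bits:", mt_low_bits)
```

With a power-of-two modulus and odd multiplier and increment, the lowest bit of the LCG state simply flips on every step, which is why real implementations discard the low bits.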


Could you make an example report for some well-known public project taken from github? Your current example report just includes a screenshot of some super generic graphs that don't mean anything.

Look at what viva64 is doing to promote their static analysis tool, they analyze open source repos all the time and write about the results: http://www.viva64.com/en/b/0366/ That gives me confidence that the product can actually find issues that might have relevance for me.


Thank you too, very helpful feedback. Will look into it. I've added an example on the page, direct download link is http://triplecheck.net/download/example.zip


Ok. I've downloaded it. Double clicking doesn't seem to run it, but if I open the jar from the command line, it will run. (OS X Mavericks)

Then I go to get an API key, so I sign up for mashape.com? Is this service somehow related to triplecheck.net? I need an API key, but nowhere do I see one.

I think your startup process needs to be simplified or better documented.

Edit: Ok, found the key. Now I need to put my code in the application bundle? Why can't I just select a folder?

Edit2: Apparently that wasn't the key or the app couldn't access the internet. Maybe no NSAppTransportSecurity in the .plist


Thank you for the feedback, very useful to see how a first time user runs the app.

> so I sign up for mashape.com? | Answer: Yes. Sorry about that. I think it makes sense as next step to run our own API management.

> put my code in the application bundle? Why can't I just select a folder? | Answer: Look for "settings.xml", there you can change to another folder

> Apparently that wasn't the key or the app couldn't access the internet. | Answer: Can you try running the jar from command line? The mac version needs to be fixed, the jar edition should work: http://triplecheck.net/download/quantum.zip

To see the UI: java -jar quantum.jar

To run from command line: java -jar quantum.jar scan

The API key should be inside settings.xml too. If you typed the wrong key, you can replace it there or just delete settings.xml to reset the app. Good luck, please let me know if this worked. Thanks.


"Everything is a remix" is an interesting take on this sort of thing[1]

[1] http://www.npr.org/2014/06/27/322910178/is-everything-a-remi...

While it's specific to music, the concept certainly applies to anything creative.


Actually, there are three more parts to it, discussing movies, computers, etc.


I've always wondered what the chances of two people writing the same chunk of code are.


I got to mark an assignment at uni that I'd also done. There seemed to be just three different versions that were "obviously" copied... then I saw one identical with mine! With just minor differences. I hadn't copied, and I couldn't see how they could have copied mine.

On reflection, there is much convention in code (this was C IIRC): loops, indices and coordinate variables are standard. And each level of a problem only has a few obvious ways to approach it. The same underlying algorithm - especially in a uni course about algorithms - also tends to suggest particular designs.

The eye of an octopus is strikingly similar to the human eye. Does that mean they are related, other than by objective?


This could be a research project to get hard data. From personal experience, I would say there are many cases where a given piece of code can only be written in one way. In other cases, people simply copied the code many years ago and no longer know where it came from.

For me, it is more relevant to compare the variable names. If both code snippets have very similar variable names, then one of them is likely a copy.
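
A rough sketch of that variable-name comparison in Python. The keyword list, the regex and the use of Jaccard similarity are assumptions for illustration, not a description of any particular tool.

```python
import re

# Small illustrative keyword set; a real tool would use the full list per language.
KEYWORDS = {"int", "for", "if", "else", "return", "while", "void"}


def identifiers(code):
    """Collect identifier-like tokens, ignoring a few language keywords."""
    return set(re.findall(r"[A-Za-z_]\w*", code)) - KEYWORDS


def name_overlap(code_a, code_b):
    """Jaccard similarity of the two identifier sets, from 0.0 to 1.0."""
    a, b = identifiers(code_a), identifiers(code_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


original = "int total = 0; for (int idx = 0; idx < count; idx++) total += prices[idx];"
suspect = "int total = 0; for (int idx = 0; idx < count; idx++) { total += prices[idx]; }"
unrelated = "int s = 0; for (int i = 0; i < n; i++) s += v[i];"

print(name_overlap(original, suspect))
print(name_overlap(original, unrelated))
```

The same loop written with independently chosen names shares no identifiers, while the reformatted copy shares all of them.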


I've seen cases where names were changed, but comments retained with the same typos as in the GPL package...


I've found a couple of Chinese companies offering compiled versions of my open source code before...They would change the class names and stuff but keep all the interface methods the same so it was pretty easy to figure out what they had done.


Yes. Found this once in the code from a Chinese developer. They really made a big effort to rewrite the code, but the comments were written in English that was too perfect, which made it obvious they were copied from elsewhere.


Rumor has it that two of the original UNIX guys (Thompson and Ritchie, I think) once wrote the exact same ~20 line program in PDP-11 assembly.


sorry, I accidentally mobile-downvoted. It's a good comment.


We could extrapolate this to the music field also.


"what parts of your code are not original?"

Let's start with the syntax, constructs, compilers, linkers, and use of assembler. Very little that's original any more. That's fine as most science and tech that's worth a shit is an increment on some prior development. Evolutionary, not revolutionary.


But what happens if somebody uploads parts of your code to stackoverflow?


Stackoverflow now owns it and you will have to comment your code with a link to the stackoverflow post. You will also have to send stackoverflow 1% of your salary.


I hope this is a joke



I'm just going to pretend I didn't read that.


The API key dispensary mechanism you are using does reject throwaway, anonymous email addresses from anonbox.net at registration time. Is that intentional?


Not intentional at all, that comes by default on Mashape.com

Sorry about the hassle


Do you detect only exact similarity? What if variable names, formatting is changed? What if code had been refactored quite a bit? Can you give more details on what exact algorithm you use?


Different algorithms are used.

1) binary comparison. Without knowing what type of file we are matching, we compare to other files and evaluate if the binary contents are similar (or preferably 100% equal)

2) snippet matching. For mainstream languages (C, Java, Javascript, Python, etc) we transform the code into anonymized blocks that ignore variable names, formatting and comment blocks. Then the code is compared for similarity. Code that is at least 80% similar still qualifies as a match.
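
The snippet matching idea can be sketched roughly as follows in Python. This is only an illustration of the general technique, not the tool's real algorithm: the tokenizer, the `ID`/`NUM` placeholders, the keyword subset and the use of `difflib` are all assumptions, and string literals are not handled.

```python
import re
from difflib import SequenceMatcher

KEYWORDS = {"if", "else", "for", "while", "return", "int", "void"}  # illustrative subset
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def anonymize(code):
    """Turn C-like source into tokens that ignore comments, layout and names."""
    code = re.sub(r"/\*.*?\*/", " ", code, flags=re.S)  # drop block comments
    code = re.sub(r"//[^\n]*", " ", code)                 # drop line comments
    tokens = []
    for tok in TOKEN.findall(code):
        if tok[0].isalpha() or tok[0] == "_":
            tokens.append(tok if tok in KEYWORDS else "ID")  # hide names
        elif tok[0].isdigit():
            tokens.append("NUM")                             # hide constants
        else:
            tokens.append(tok)
    return tokens


def similarity(code_a, code_b):
    """Ratio in [0, 1] between the two anonymized token sequences."""
    return SequenceMatcher(None, anonymize(code_a), anonymize(code_b)).ratio()


def is_match(code_a, code_b, threshold=0.8):
    return similarity(code_a, code_b) >= threshold


a = """
// sum the prices
int total = 0;
for (int i = 0; i < n; i++) {
    total += prices[i];
}
"""
b = "int s=0; for(int k=0;k<len;k++){ s+=values[k]; } /* renamed copy */"
c = "while (x) { x = next(x); }"

print(similarity(a, b), is_match(a, b))
print(similarity(a, c), is_match(a, c))
```

Renaming every variable, reformatting and changing the comments leaves the anonymized token sequence unchanged, so `a` and `b` still match, while the structurally different `c` does not.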

To provide context, we have the concept of code diversity, meaning that a given match needs to contain a relatively high number of different logical instructions in order to qualify as a match. For example, multiple IF statements will not qualify unless they contain other code within. If you change the order or add/remove code, we are still robust enough to detect it.
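
One way to picture the code diversity check, sketched in Python using the standard `ast` module on Python snippets. How the tool actually counts "logical instructions" is not described here, so the metric (distinct statement kinds) and the threshold of 3 are assumptions.

```python
import ast


def diversity(source):
    """Number of distinct statement kinds found in a Python snippet."""
    tree = ast.parse(source)
    kinds = {
        type(node).__name__
        for node in ast.walk(tree)
        if isinstance(node, ast.stmt)
    }
    return len(kinds)


def qualifies(source, min_kinds=3):
    """A snippet is only worth reporting if it is varied enough (assumed rule)."""
    return diversity(source) >= min_kinds


only_ifs = "if a:\n    pass\nif b:\n    pass\nif c:\n    pass\n"
richer = (
    "if a:\n"
    "    total = 0\n"
    "    for x in items:\n"
    "        total += x\n"
)

print(diversity(only_ifs), qualifies(only_ifs))
print(diversity(richer), qualifies(richer))
```

A run of bare IF statements yields only two statement kinds (`If` and `Pass`) and is rejected, while an IF containing an assignment and a loop is varied enough to count.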

For special cases where there is a known malicious intention to hide the code, I cross-match different algorithms and look specifically at variable names and comments inside the code. In such cases, an expert does a manual inspection, and it becomes truly difficult for a developer to escape detection of non-original code.

In fact, if someone is indeed able to hide code from triplecheck, then it has reached a level of sophistication at which no normal third party would be able to (easily) detect the plagiarism. In our experience there have been rare cases where only new techniques let us notice that a given company had managed to hide non-original code from our tooling.

In either case, we live and learn from such examples, and with each new iteration of the tooling it gets more difficult to evade (non-)originality detection.



