GitHub has launched Copilot, “your AI pair programmer”:
Trained on billions of lines of public code, GitHub Copilot puts the knowledge you need at your fingertips, saving you time and helping you stay focused.
GitHub Copilot landing page
There is discussion about whether Copilot infringes copyright (HackerNews). Although this is a discussion that has to happen at some point, I’m no copyright expert. The legal system does not always work as expected, and laws take years or decades to catch up. To me, this is the wrong discussion.
The discussion should be about ethics. GitHub themselves tangentially acknowledge something like this in the “Responsible AI” section of their FAQ (I would link to it, but there doesn’t seem to be any way to).
Apparently, the “frequently asked” questions around “responsible AI” are:
- “Can GitHub Copilot introduce insecure code in its suggestions?”
- “Does GitHub Copilot produce offensive outputs?”
- “How will advanced code generation tools like GitHub Copilot affect developer jobs?”
Ignoring the bile-inducing corporate/PR doublespeak:
- Unsurprisingly, the answer is yes, Copilot can introduce bad code which may be insecure. In any case, it’s a black box, you’ll never know for sure.
- Unsurprisingly, the answer is yes, Copilot produces offensive outputs, at least if swearing is offensive to you. In any case, it’s a black box, you’ll never know for sure.
- Who knows? In any case, it isn’t our problem as long as we make money!
Ok, let’s look at the “frequently asked” questions around the “training set”:
- “Why was GitHub Copilot trained on data from publicly available sources?”
Training machine learning models on publicly available data is considered fair use across the machine learning community.
GitHub Copilot landing page
Forgive me if I don’t hold the “machine learning community” in high regard, but this doesn’t sound promising. Even Wikipedia has a section on machine learning ethics, and, umm, it isn’t great.
Note that nowhere does GitHub ask “Should we do this?”, “Is it ethical to rip off other people’s code without credit or attribution, and avoid giving back?”, or “What are the effects of our attempt to make money, no matter the cost?”. (I’m biased, whatever.)
So, is it ethical to rip off other people’s code without credit or attribution, and avoid giving back?
Most “public code” - whatever that means - has licenses. Even ignoring the legal implications, a lot of people do not mind others copying their code, but would at least like to be credited or attributed. Some people have other opinions or ideologies. choosealicense.com, a page provided by GitHub, acknowledges this is a reason people choose certain licenses:
I care about sharing improvements.
choosealicense.com, one of the options for the prompt “Which of the following best describes your situation?”
As a side note, the page footer says “Curated with ❤️ by GitHub, Inc. and You!” - or not, because we’ll trample on your wishes if we can make money.
Hey, and look at this: even the MIT License requires that “the above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.” You can of course argue about “substantial portions”, but the point is that one of the most lenient licenses has the concept of “we’d like it if you say where this comes from” (at least, that’s how I read it).
It’s good to be nice. Once I was porting some code from C to Python (I think) and reached out to the original developer asking if they’d be okay with this. Not because the license required me to, simply out of courtesy. Maybe the author didn’t want a “fork” in a more popular language? Maybe they wanted some input as to where the repo is hosted? The point is software development is collaborative - even a one-person project is once it’s out in the world.
Effects on open source and machine learning
Of course, not everybody will see Copilot as an affront to open source. Or it might take a while.
In the meantime, migrating from GitHub isn’t hard. I think that’s a natural first step. Gitea exists, and apparently it isn’t difficult to mirror repos from GitHub to Gitea. GitLab exists, and it’s trivial to import projects from GitHub to GitLab. Turns out their moat isn’t so big, at least for me. But it isn’t even clear that switching to another provider will make code “safe” from being plagiarized by Copilot.
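For what it’s worth, the mirroring step is mostly plain git. Here’s a minimal sketch - it uses throwaway local repos as stand-ins for GitHub and Gitea, since the real remote URLs would be specific to your setup:

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"

# Stand-in for the original GitHub repo: a local repo with one commit.
git init -q source
git -C source -c user.email=a@b -c user.name=a \
    commit -q --allow-empty -m "initial"

# Stand-in for the Gitea/GitLab target: in practice this would be a URL
# like https://gitea.example.com/you/project.git (placeholder, not real).
git init -q --bare target.git

# A --mirror clone copies all refs (branches, tags, notes), and a
# --mirror push keeps the target in sync, including ref deletions.
git clone -q --mirror source mirror
git -C mirror remote set-url origin "$tmp/target.git"
git -C mirror push -q --mirror

# The target now contains the full history.
git -C target.git log --oneline
```

In practice you’d point `origin` at your Gitea or GitLab URL and re-run the push (say, from a cron job) to keep the mirror current - or use the import/mirror features those platforms provide.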
GitHub is playing a dangerous game, with chilling effects across the entire software industry. They clearly think they’re going to “win”, but what does that look like? Is Copilot worth the risk? It doesn’t matter; you can’t close Pandora’s box.
Here’s the billion dollar question: Are people going to think long and hard before open-sourcing something now? I know I will. Maybe people will even take down code. After all, it might be too late for the Copilot issue, but not for future “AI” crap.
Sadly, this short-term, selfish thinking is exactly what you’d expect from a US corporation. And it’s incompatible with open source, or at least my view of it.
At the same time, they also risk having a precedent set against the flimsy “fair use” argument. And personally, I can’t wait to see a license that forbids using code for machine learning/AI training. If I’m able to, I’ll re-license in a heartbeat.
If you’re an open source developer, I hope you’ll think a bit about what Copilot means for the open source community. Maybe consider migrating away from GitHub. And I’d love to see a broader discussion on ethics in our field.