How to make sure AI can't steal your work (or your data)

“The technology and vision beyond generative AI is amazing, but stealing the work of the world’s creators to build it is not.” - Ed Newton-Rex

Companies source data for training their AI models in many different ways.

Some of it is legitimately licensed, which means the company pays the data source to access and use its data through an API or another method. One example is the recent deal between Wikipedia and Amazon, Microsoft, Meta, Perplexity, and France’s Mistral AI, which gives those AI companies enterprise access to Wikipedia’s premium API.

The rest is content scraped from the internet without any consent. The one exception is content behind paywalls, which LLMs can’t access. You think they’re going to pay for any of that access the way they do in the fancy licensing agreements they sign with larger companies? Not a chance in hell.

But karma always comes to collect. In this instance, we are that karma.

Don’t let them have it their way.

In this edition of ✨Writefully Yours✨, you’ll learn:

- How to protect your work and intellectual property from AI
- Copyright 101
- How to make sure AI models don’t use your data for further training
- The type of data you should NEVER give AI

Licensing is the only way forward.

With the rise of generative AI, creators and websites are shutting off free access one by one to protect their copyright and make sure they’re being paid fairly. Licensing would help preserve an open internet, and it’s also something the future of AI will need to rely on.

Licensing agreements are how companies can legally use copyright-protected work. But AI companies have to fork over the cash for those, and they should! They don’t get a pass just because they’re operating under the guise of “technological evolution” and run in billionaire boys’ clubs that think they’re above the law.

To best understand this topic, watch this TED Talk, which I absolutely LOVE, from AI expert Ed Newton-Rex, the founder of Fairly Trained, a non-profit that certifies AI companies for fairer training data sourcing:

Side note for those wondering about some of the data sources for these LLMs: In the video, he mentions Common Crawl as a source that 64% of large language models get “free” copyrighted data from.

How to protect your work and data

Anytime you create something original and put it in a fixed form (written down, recorded, saved to a file), you automatically have copyright protection. You don’t have to register it unless you want to sue someone for copyright infringement. (Source: U.S. Copyright Office. Remember that I’m not a lawyer and this is not legal advice.)

Beyond that, how can you keep generative AI tools like ChatGPT from using your work and data without permission, since they seem to completely ignore the Copyright Act? 🙄

Well, there are a few ways.

Protecting your work:

- Register your copyright with the U.S. Copyright Office if you want the option to take legal action and have that added protection going forward.
  - Copyright registration for digital work
  - Copyright registration for literary work
- Put the content you want to protect behind a paywall of some sort, like Substack, Patreon, beehiiv, or paid subscriptions on IG. LLMs can’t access content that’s protected behind paywalls.
- Learn about licensing. Createdbyhumans.ai is a company that helps authors, book agents, and publishers license out their work, if that’s something they want to do.
- If you must use generative AI, please use standalone AI tools that you can load with protected data, projects, etc., to help you with future work. This way, you aren’t participating in using data from other creators, you’re protecting your own data, and you have more control over what the tool generates for you.
  - AnythingLLM and LocalAI are two of these tools, but I have yet to play around with them and really get in the weeds. I’ll update you once I do so you know whether they’re worth your time. 😉
  - These tools still have an environmental impact and raise other ethical concerns. They aren’t perfect.

Protecting your data:

Change the settings on any AI tools you use to opt out of having your data used to train their models. For this, I want to start with Gmail, since it’s probably where most of your sensitive data is.

Think about it: your Gmail has emails from your doctor’s office, bank, credit card company, mortgage company, tax servicers, and social media accounts, plus your photos, videos, login location history, you name it. Let’s nip that one in the bud first.

Change Gmail settings

Go to the Gmail app on your phone (or PC) and select the 3-line icon on the left-hand side of the search bar:

Then select “Settings” and scroll down to select “Data privacy”. (These screenshots are for iPhones, but other models should be similar.)

Toggle “Smart features” to the “Off” position (should be gray). Then click the arrow next to “Google Workspace smart features” to toggle off “Smart features in Google Workspace” and “Smart features in other Google products”.

Change ChatGPT settings

Go to chatgpt.com, click on your name at the bottom-right corner of the screen, and click “Settings”.
Swipe through the tabs and click “Data controls”. Where it says “Improve the model for everyone”, click “On” to toggle it off (my screenshot says Off, since I’ve already updated my settings). Then toggle off the voice settings under that.

Change Claude settings

Go to claude.ai, click where it says your name at the bottom-right corner of the screen, and click “Settings”.
Click the “Privacy” tab on the Settings screen. Then toggle off “Location metadata” and “Help improve Claude”.

If you’re comfortable, check out the other settings in all of these apps to further tailor your experience and data usage. Steps to get to the settings and data privacy screens on other AI apps should be similar.

If you’re having trouble with updating or finding these settings on any other app, comment below, and I’ll write another guide covering those tools as well!

The type of data you should never give AI

I won’t judge you for using AI, but I WILL judge you for what you tell it.

Under no circumstances should you ever give generative AI any of the following data:

- Personal financial data (account numbers, routing numbers, bank login info, retirement and investment logins, etc.), unless you’re entering it in the actual app of the institution that holds your funds.
- Healthcare records and doctor’s office portal logins (like MyChart). In other words, don’t ask a tool like Claude Cowork to log in to your MyChart to ask your doctor a question or schedule an appointment. You might want to do things like that yourself.
- Your Social Security number. Be careful about getting any generative AI, or any tool that isn’t your accounting software, to do your taxes. When you upload tax documents, you’re handing over not just your Social Security number but also additional data that the AI overlords really have no business knowing.
- Access to your personal devices. Never give an AI agent free rein on any of your devices.
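
If you do end up pasting text into an AI tool, one way to act on the list above is to scrub the text first. Here is a minimal Python sketch of that idea. It assumes Social Security numbers are written as 123-45-6789 and that account or routing numbers show up as unbroken runs of 9 to 17 digits. The `redact` function and both patterns are hypothetical names made up for illustration, not part of any tool mentioned in this post, and they will miss plenty of real-world formats.

```python
import re

# Assumption: SSNs appear in the common 3-2-4 hyphenated format.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Assumption: an unbroken run of 9 to 17 digits is a rough stand-in
# for routing and account numbers. This is a guess, not a standard.
ACCOUNT_PATTERN = re.compile(r"\b\d{9,17}\b")


def redact(text: str) -> str:
    """Replace SSN-like and account-like numbers with placeholders."""
    text = SSN_PATTERN.sub("[REDACTED SSN]", text)
    text = ACCOUNT_PATTERN.sub("[REDACTED NUMBER]", text)
    return text


if __name__ == "__main__":
    draft = "My SSN is 123-45-6789 and my account number is 000123456789."
    print(redact(draft))
    # prints: My SSN is [REDACTED SSN] and my account number is [REDACTED NUMBER].
```

Treat something like this as a safety net, not a guarantee. It only catches the patterns it was told about, so still read over anything before you paste it.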

What about the other tools we use every day that are connected to this data, like our bank apps, that also have AI built in?

You’d be surprised how much AI you use every day without realizing it. Nearly every app uses it.

Your bank uses it to update budget and spending visuals, your MyChart now has an AI assistant, your Netflix uses it to show you the shows and movies it thinks you’ll like, real estate apps like Redfin have an AI assistant to help look for homes, and the location data on your phone is used by surveillance companies that use AI for tracking.

You’ve actually been using AI ever since you first saw the autocomplete and search suggestions on Google (implemented in December 2004). Google uses something called Natural Language Processing for those features.

None of these things are necessarily bad. For instance, your banking and retirement accounts really do protect your data, and your deposits are either FDIC-insured (at banks) or NCUA-insured (at credit unions, which are my favorite). Plus, you don’t really have the option to opt out of any AI functionality in those tools.

I wrote a little bit about AI and financial planning in a past article of mine for CNET Money: “These AI Tools Are Helping Me Plan for Retirement. Here’s How It’s Going So Far”. It discusses how to use professional guidance and human oversight with retirement and banking AI tools.

Anyway, use your discretion and best judgement when using any tool and always check the settings to make sure you’re protecting your data.

I hope these processes and tools are helpful! If you want more content/guides like this (or maybe not like this), don’t ever be afraid to drop me a line.

Until next week,

Daniella 💜