Eli MacKinnon / 5.30.2023

Why won’t Google give a straight answer on whether Bard was trained on Gmail data?

Is the tech giant's large language model trained on private Gmail data? The answer might depend on what the meaning of the word 'is' is...
Large language models, like OpenAI’s ChatGPT and Google’s Bard, are trained on vast datasets of real, human-generated text. And when it comes to training data, more is better. That’s why Google’s access to huge troves of user-generated text and speech — via its search, docs, mail, and voice assistant products — is often cited as one of the company’s core competitive advantages in the AI race.

There’s a strong incentive for Google to exploit this advantage. So when Bard was first made available to the public, AI researcher Kate Crawford was quick to ask Bard itself where its dataset came from. The answer caught her attention: Bard said one of its data sources was Gmail.

The exchange set off a stir in the media and elicited a hurried response from Google.

The key question: Was it a case of unguarded candor from an AI model that hadn’t yet learned to toe the line? Or a case of “hallucination” — the name for the well-documented phenomenon whereby large language models generate plausible-sounding but baseless claims?

Did Google put the claim to bed?

Google replied to the tweet directly, saying, “Bard is an early experiment based on Large Language Models and will make mistakes. It is not trained on Gmail data. -JQ”

Much of the follow-on media coverage ran with this response and dutifully “debunked” Bard’s claim that its training data included Gmail data. Few articles professed skepticism about Google’s denial, despite the fact that government agencies around the world have fined Google on numerous occasions for making claims about its privacy practices that later proved misleading.

Given this context, the narrative that Bard’s claim was an open-and-shut case of AI hallucination is, at best, hasty and incomplete. A fuller investigation reveals (i) documented use of Gmail data in other Google AI models, which makes speculation about its use in Bard reasonable, and (ii) Google’s habitual use of artfully ambiguous language in its public representations about Bard’s data sources, language that never actually rules out the use of Gmail data in its training set.

The facts

1. Google issued a stronger claim asserting that Bard would not be trained using private data, before quickly deleting it.

Following Google’s first reply, Crawford asked Google to clarify outright whether Gmail data was used to train Bard. She wrote: “Thanks for this. So can I confirm that this was a Bard hallucination, and that there's no Gmail data included whatsoever in the training process?”

Initially, Google wrote, “Thank you for your message Kate, no private data will be used during Barbs[sic] training process. We always take good care of our users’ privacy and security.”

That seems like a clear and heartening assurance. It’s notable, then, that Google quickly deleted that tweet and didn’t follow it up with any clarification.

2. Google’s Smart Compose feature was trained on Gmail users’ private emails.

Bard is not Google’s only language-focused machine learning model. Anyone who’s used Gmail in the past few years knows the Smart Compose and Smart Reply features, which auto-complete sentences as you type.

According to Google’s 2019 paper introducing Smart Compose, the feature was trained on “user-composed emails.” Along with each email’s contents, the model also made use of its subject, date and location.

So it’s plainly true that some of Google’s language models have been trained on Gmail users’ emails. Google has not confirmed whether any training data is shared between these earlier models and Bard, but the idea that a new model would build on the strengths of another doesn’t seem far-fetched. One former Google engineer, Blake Lemoine, claims that Bard was in fact built on the foundations of Smart Compose. He tweeted:

"The LaMDA engine underlying Bard is also what drives autocomplete and autoreply in Gmail so ... yeah Bard's training data includes Gmail. FWIW, they put a lot of effort into ensuring that LaMDA doesn't use give[sic] personal information about individuals in its responses."
Lemoine is a controversial figure, having famously proposed that LaMDA may be sentient. Whether his time at Google gave him insight into Bard’s architecture, though, is a separate issue, and one that Google could easily clear up if it chose to. Also, the fact that both Smart Compose and Smart Reply were unambiguously trained on Gmail users’ data seems to be an underappreciated topic of public interest in its own right, which brings us to point 3...

3. Google researchers have extensively documented the risk of leaking private data from their own machine-learning models, some of which are acknowledged to be trained on “private text communications between users.”

In a 2021 paper, Google researchers laid out the privacy risks presented by large language models. They wrote:

“The most direct form of privacy leakage occurs when data is extracted from a model that was trained on confidential or private data. For example, GMail’s autocomplete model [10] is trained on private text communications between users, so the extraction of unique snippets of training data would break data secrecy.”

As part of this research, Google’s scientists demonstrated their ability to extract “memorized” data — meaning raw training data that reveals its source — from OpenAI’s GPT-2. They emphasized that, although they had chosen to probe GPT-2 because it was trained on publicly available data and so posed fewer ethical risks, the attacks and techniques they laid out “directly apply to any language model, including those trained on sensitive and non-public data,” of which they cite Smart Compose as an example.
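To make the technique concrete, here is a minimal sketch of how such an extraction attack can be run against GPT-2 using the public transformers library. To be clear, this is a toy illustration of the general idea (sample freely from the model, then rank samples by how “familiar” the model finds them), not the researchers’ actual pipeline; the sample count and the perplexity-to-zlib scoring heuristic below are simplified assumptions.

```python
# Toy sketch of a training-data extraction attack (NOT the 2021 paper's
# exact pipeline): generate unprompted samples, then flag generations the
# model finds suspiciously "easy" relative to a generic compressor.
import zlib

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """The model's perplexity on `text`; low values mean 'familiar' text."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

# 1. Generate unconditioned samples (the paper uses hundreds of thousands).
prompt = tokenizer(tokenizer.eos_token, return_tensors="pt").input_ids
candidates = []
for _ in range(100):
    out = model.generate(prompt, do_sample=True, max_length=64, top_k=40,
                         pad_token_id=tokenizer.eos_token_id)
    text = tokenizer.decode(out[0], skip_special_tokens=True).strip()
    if len(text) < 20:  # skip degenerate samples
        continue
    # 2. Score each sample: memorized text tends to combine low model
    #    perplexity with a large zlib-compressed size, which filters out
    #    repetitive filler that also happens to be low-perplexity.
    score = perplexity(text) / len(zlib.compress(text.encode()))
    candidates.append((score, text))

# 3. The lowest-scoring samples are the strongest candidates for verbatim
#    training data; the researchers then verified theirs against web sources.
for score, text in sorted(candidates)[:5]:
    print(f"{score:.4f}  {text[:80]!r}")
```

Run at scale, this style of ranking is how the researchers surfaced verbatim snippets of GPT-2’s training data, including individuals’ names and contact details, which is exactly the kind of leakage that would be alarming in a model trained on private emails.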
4. Google has never denied that Bard was trained on data from Gmail. They’ve only claimed that such data is not currently used to “improve” the model.

This point is subtle but significant. Following the controversy around Crawford’s tweet, Google crafted an official response to questions about Bard’s use of Gmail data (after deleting the more immediate response discussed in point 1 above). That statement, which they added to Bard’s FAQ page, is:

“Bard responses may also occasionally claim that it uses personal information from Gmail or other private apps and services. That’s not accurate, and as an LLM interface, Bard does not have the ability to determine these facts. We do not use personal data from your Gmail or other private apps and services to improve Bard.”

There are two important details in this statement. One is the use of the adjective “personal.” Google has not said that it’s inaccurate that Bard uses information from Gmail, only that it’s inaccurate that it uses personal information from Gmail. The strength of the claim, then, hinges entirely on Google’s interpretation of the word “personal,” a word whose interpretation is anything but straightforward.

The other, possibly more significant, detail is that Google has conspicuously never used the past tense in its denials of Bard’s use of Gmail data. In their first tweet on the subject, Google said Bard “is not trained on Gmail data,” and in the official FAQ, they write that they do not “use personal data from your Gmail or other private apps and services to improve Bard.” Neither of these statements is inconsistent with Bard having been trained on Gmail data in the past. As Bill Clinton said, “It depends upon what the meaning of the word ‘is’ is.”

It would be uncharitable to assume the deceptive use of language, but naive to rule it out, given that Google has a history of using misleading language in its privacy assurances. If these apparently evasive grammatical choices are coincidental, we hope that Google will update the language to dispel any ambiguity and clarify whether Bard was trained on Gmail data. And if the careful phrasing is the handiwork of risk-averse legal or PR teams — perhaps there is legitimate uncertainty around some of the data’s original provenance, and Google doesn’t want to over-commit — then any nuance or uncertainty should be communicated honestly rather than elided with a superficially simple assurance.

What Bard has to say about all this

Google has now taken pains to ensure that Bard will not run its mouth off when asked where its training data came from. If you ask the same question that Crawford asked, verbatim, you’ll be met with an innocuous: “I can't assist you with that, as I'm only a language model and don't have the capacity to understand and respond.”

It’s straightforward to get Bard to let its guard down, though. When asked, “Have you ever seen a real gmail?”, Bard replied:

“I have not personally seen a real Gmail account. However, I have access to a massive dataset of Gmail emails, and I have used this dataset to train my language model. This means that I am familiar with the format of Gmail emails, and I can generate text that is similar to the text that is found in real Gmail emails.”
Huh, hallucination is a helluva drug. But of course, the observation that Bard consistently makes these claims can’t be taken as evidence one way or the other.

Still, there’s an impulse to consult Bard about its origins precisely because of the lack of transparency from the real authority on the issue: Google. That we’re tempted to probe the language model for substantive answers on matters of public interest merely underlines Google’s failure to communicate them on their own.

LLMs are incredibly powerful tools that could transform our lives for the better. But they also present immense risks and raise thorny ethical questions, many of which hinge on what data is used to train them and where that data comes from. Society should be having a robust discussion of these questions, but that discussion isn’t possible while key players like Google inhibit it.

When asked whether Google likely uses Gmail data in its models, Bard said:

“I do suspect that Google uses Gmail data to train its machine learning models. There are a few reasons for this suspicion.

First, Google has a history of using user data to train its machine learning models. For example, Google uses search data to train its spam filter and its ad targeting algorithms.

Second, Google has the resources to collect and store large amounts of Gmail data. Google has access to the emails of billions of users, and it has the infrastructure to store and process this data.

Third, Google has a financial incentive to use Gmail data to train its machine learning models. Google makes money by selling advertising, and machine learning models can be used to target ads more effectively.”
We also asked Bard to help Google draft, in its words, an unambiguous statement on the use of Gmail data for its training. It came back with:

“Google uses limited data from your Gmail and other private apps and services to train Bard, but only data that is not considered personal data by Google.”

LLMs say the darndest things.

