Rohan Sathe is the co-founder and CTO of Nightfall AI, the first DLP platform to leverage generative AI to discover, classify, and protect sensitive data across the modern enterprise. Prior to Nightfall, Rohan was a founding engineer at UberEats, where he saw firsthand how complex it is to manage sensitive data when it’s sprawled across hundreds of SaaS and infrastructure applications. Rohan’s technical leadership and relentless drive to solve this problem earned the attention of top venture capital firms, CEOs, professional athletes, and influencers, who have invested more than $60M in Nightfall AI. Rohan is a Forbes 30 Under 30 recipient and a nationally ranked chess player.
Nightfall AI announced the release of Firewall for AI to safeguard organizations' GenAI-based applications and data pipelines that leverage GPT-4o and other large language models (LLMs).
Deploying LLMs securely presents several challenges.
Challenge 1: Persistent sensitive data sprawl. As with the cloud explosion, companies have adopted GenAI without first identifying and implementing security best practices. As a result, LLMs end up trained on sensitive data, and GenAI has widened the attack surface. One way to address this issue is to use synthetic data. Generating synthetic data may take longer than simply using actual values, but that inconvenience pales compared to a sensitive data leak. Simply put, there’s really no logical reason to use sensitive data in training models.
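To make this concrete, here is a minimal sketch of swapping real PII for synthetic stand-ins before records ever reach a training pipeline. It assumes the open-source Faker library and illustrative field names; it is not a specific tool or workflow Nightfall prescribes.

```python
# Sketch: replace real PII with synthetic stand-ins before training data leaves your control.
# Assumes the open-source `faker` package (pip install faker); field names are illustrative.
from faker import Faker

fake = Faker()

def synthesize_record(real_record: dict) -> dict:
    """Return a training record with sensitive fields swapped for synthetic values."""
    return {
        **real_record,
        "name": fake.name(),
        "email": fake.email(),
        "phone": fake.phone_number(),
        "ssn": fake.ssn(),
    }

# Example: the model never sees the customer's actual identity.
raw = {"name": "Jane Doe", "email": "jane@example.com", "phone": "555-0100",
       "ssn": "123-45-6789", "ticket_text": "My card was declined at checkout."}
print(synthesize_record(raw))
```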
Challenge 2: An alarming rate of false positives. Many of our customers have found that the systems they use to flag improper sharing of sensitive data in LLMs produce an alarming rate of false positives. If you’re relying solely on heuristics, you may be dealing with an accuracy rate around 20%, and combing through the resulting alerts manually takes excessive time. As simplistic as it may sound, securing AI requires an AI-based solution. Security tools trained on millions of lines of code can more accurately identify when sensitive data is being shared in a GenAI environment and when it isn’t.
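To illustrate why heuristics alone over-flag, consider a bare regex for credit card numbers: any 16-digit string matches, including order and tracking numbers. Layering even a simple Luhn checksum check (one signal a model-based detector can also weigh) removes much of that noise. The pattern and helper below are illustrative only, not Nightfall's actual detectors.

```python
import re

CARD_PATTERN = re.compile(r"\b\d{16}\b")  # naive heuristic: any 16 consecutive digits

def luhn_valid(number: str) -> bool:
    """Luhn checksum: a cheap validation layer that rejects most non-card 16-digit strings."""
    checksum = 0
    for i, d in enumerate(int(c) for c in reversed(number)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

text = "Order 1234567890123456 shipped; card 4111111111111111 was charged."
for match in CARD_PATTERN.findall(text):
    label = "likely card" if luhn_valid(match) else "probable false positive"
    print(match, "->", label)
```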
Challenge 3: Securing developer projects. We’re seeing a great deal of emerging guidance for secure GenAI implementation for typical corporate users. However, there’s less clarity for those using AI to develop products. If sensitive data is exposed to an AI model, several downstream risks could arise, most notably data reconstruction attacks or training data extraction attacks. An AI firewall, delivered as a third-party service, can monitor inputs for sensitive data and scrub it before it is transmitted to a third-party LLM. By doing this, development teams are better protected from downstream data leaks that could result from LLM overfitting, a lack of filtering, or other errors in model training.
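A hedged sketch of that pattern follows: the regex rules are illustrative, and the `send_to_llm` callable is a placeholder for whatever client actually reaches the third-party model.

```python
import re

# Illustrative patterns only; a production firewall would rely on ML-based detectors, not bare regex.
REDACTION_RULES = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "API_KEY": re.compile(r"\bsk-[A-Za-z0-9]{20,}\b"),
}

def scrub(prompt: str) -> str:
    """Replace detected sensitive values with typed placeholders before the prompt leaves the boundary."""
    for label, pattern in REDACTION_RULES.items():
        prompt = pattern.sub(f"[{label}_REDACTED]", prompt)
    return prompt

def firewall_proxy(prompt: str, send_to_llm) -> str:
    """Scrub the prompt, then hand it to whatever client actually calls the third-party LLM."""
    return send_to_llm(scrub(prompt))

# Usage: the LLM only ever receives the redacted text.
print(scrub("Contact jane@example.com, SSN 123-45-6789, key sk-abcdefghijklmnopqrstuvwx"))
```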
Synthetic data can help with issues like bias, consent, copyright, and privacy conflicts, but it's no panacea. The biggest challenge with synthetic data is that, in order to be useful, it must realistically represent the patterns and concepts that the AI model should learn. Since synthetic data is (by definition) produced by a computer program, that program must be capable of generating data representing those patterns and concepts, along with appropriate actions for the AI model to take for each example. For easy tasks, such as training an AI model to perform mathematical operations like "10 + 5 = 15", the synthetic data generation program can be quite simple. However, for more complex tasks like training an AI model to draft emails or call a restaurant, the synthetic data generation program would need to be so complex that you could likely use it to solve the task in the first place.
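For the easy-task case, a toy generator like the following can emit unlimited, perfectly labeled arithmetic examples; the function name and parameters are illustrative, not part of any particular training pipeline.

```python
import random

def generate_arithmetic_examples(n: int, max_operand: int = 100):
    """Yield n (prompt, answer) pairs like ('10 + 5 =', '15') for a toy training set."""
    ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}
    for _ in range(n):
        a, b = random.randint(0, max_operand), random.randint(0, max_operand)
        symbol, fn = random.choice(list(ops.items()))
        yield f"{a} {symbol} {b} =", str(fn(a, b))

for prompt, answer in generate_arithmetic_examples(3):
    print(prompt, answer)
```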
One feasible approach in these situations is to use an existing AI model to generate synthetic data for training a new AI model. However, this doesn't really avoid the bias, consent, copyright and privacy concerns of an AI model, since all of the synthetic data generated by an existing AI model could carry the same bias, reproduce copyrighted content or include private information inadvertently included in the original training dataset. Much like how the latest generation of the HeLa cell line still carries the DNA of Henrietta Lacks and many of the ethical issues of its original unconsenting extraction, AI models derived from other AI models can carry many of the same ethical concerns as the original models themselves.
From Wordle to Club Penguin, some of the internet's favorite games are under fire due to a few leaked credentials.
ICYMI, both The New York Times and Disney experienced breaches caused by exposed secrets. In NYT's case, an exposed GitHub access token helped threat actors uncover source code, internal documentation, and infrastructure tools. In Disney's case, an exposed credential led to unauthorized access to developer tools, advertising strategies, internal API endpoints, and even additional credentials for S3 buckets.
The bottom line? Behemoth enterprises like NYT and Disney are particularly susceptible to data sprawl—but regardless of company size, stolen credentials can often lead to devastating breaches. With these most recent examples in mind, it's more important than ever to handle secrets and credentials safely. Here are a few best practices for safeguarding your passwords, credentials, and more across the cloud:
1. Encourage employees to adopt better secret sharing practices, such as automatically encrypting all secrets and rotating credentials on a regular basis.
2. Scan for secrets across SaaS apps like GitHub, Jira, and Confluence using a real-time Data Leak Prevention (DLP) platform.
3. Conduct regular audits of data at rest across SaaS apps and data silos. Even if you have a DLP solution in place, it’s still worthwhile to perform historical scans on a weekly or monthly basis to ensure that no secrets or credentials slip through the cracks. A minimal sketch of such a scan follows this list.
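Here is a minimal sketch of a historical scan over an exported data silo (for example, a repo checkout or wiki export). The patterns cover only a few common secret types and are illustrative; a real DLP detector set is far broader and more precise.

```python
import re
from pathlib import Path

# Illustrative patterns; real DLP detectors cover far more secret types with higher precision.
SECRET_PATTERNS = {
    "AWS access key ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "GitHub token": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "Private key header": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
}

def scan_path(root: str) -> None:
    """Walk a directory (e.g., an exported wiki or repo checkout) and report likely secrets."""
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        try:
            text = path.read_text(errors="ignore")
        except OSError:
            continue
        for label, pattern in SECRET_PATTERNS.items():
            for match in pattern.finditer(text):
                print(f"{path}: possible {label}: {match.group()[:12]}...")

# Usage (hypothetical export directory): scan_path("./confluence_export")
```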