Summary of Privacy Preserving AI (Andrew Trask) | MIT Deep Learning Series

This is an AI generated summary. There may be inaccuracies.

00:00:00 - 01:00:00

Andrew Trask explores privacy-preserving AI and its potential to transform how data scientists and researchers work. He discusses the challenges of accessing private data and proposes techniques such as differential privacy and encrypted computation to make private data more accessible. Trask introduces tools such as PySyft and PyGrid that enable privacy-preserving machine learning and remote execution of computations. He emphasizes the importance of balancing privacy and utility and highlights the need for shared governance and personal privacy-budgeting infrastructure. Trask also discusses the potential of privacy-preserving AI to answer questions using data that cannot be seen, and the adoption and engineering work needed to build a secure and robust privacy infrastructure. The goal is to unlock untapped data while preserving privacy, enabling collaboration and advances across many domains.

  • 00:00:00 In this section, Andrew Trask discusses the concept of privacy-preserving AI and how it can change the way data scientists and researchers work. He starts by addressing the question of whether it is possible to answer questions using data that cannot be seen, using the example of analyzing tumor images. He highlights the challenges of accessing private data and the need to find financing and business partners to obtain such data. Trask then contrasts this with the ease of accessing other types of data, such as digitized handwritten digits. He proposes using various techniques to make private data more accessible and lower the entry barrier for researchers, with the ultimate goal of being able to easily access datasets like these through a simple installation process.
  • 00:05:00 In this section, the speaker discusses the tools and concepts related to privacy-preserving AI and machine learning. The first tool mentioned is PySyft, which extends major deep learning frameworks and allows for privacy-preserving machine learning. The speaker then introduces the concept of remote execution, where computation can occur on a remote machine without direct access. This allows for coordination of remote computations without physically seeing the data. The second tool discussed is PyGrid, which provides a platform for accessing and analyzing large datasets in a privacy-preserving manner. It allows for remote searches and provides detailed information about the data without exposing it directly. These tools aim to enable data science and question answering on data that is not directly accessible.
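The remote-execution idea can be illustrated with a toy in-process "worker" (plain Python, not the actual PySyft API; the class and method names here are invented for illustration): the data owner holds the data, and the data scientist holds only opaque pointers and receives aggregate results.

```python
class Worker:
    """Holds private data; executes requested computations locally."""
    def __init__(self):
        self._store = {}      # private data never leaves this dict
        self._next_id = 0

    def send(self, value):
        """Store a value remotely and hand back an opaque pointer id."""
        ptr = self._next_id
        self._store[ptr] = value
        self._next_id += 1
        return ptr

    def execute(self, fn, *ptrs):
        """Run fn on the referenced values; the result stays remote."""
        result = fn(*(self._store[p] for p in ptrs))
        return self.send(result)

    def get(self, ptr):
        """Explicitly request a value back: the only way data returns."""
        return self._store[ptr]

hospital = Worker()
x_ptr = hospital.send([98.6, 99.1, 101.2])   # the analyst never sees this list
mean_ptr = hospital.execute(lambda xs: sum(xs) / len(xs), x_ptr)
print(hospital.get(mean_ptr))                # only the aggregate is retrieved
```

In the real tools, the pointer additionally travels over a network and the worker can refuse a `get` request, which is where differential privacy and governance enter.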
  • 00:10:00 In this section, the speaker talks about privacy-preserving AI in the context of his previous work experience delivering AI services to corporations. He highlights the challenge of working with data that the home team can't see and introduces differential privacy as a solution. He explains that differential privacy allows for statistical analysis of a database without compromising the privacy of the data set. The speaker provides an example of querying a database while guaranteeing privacy and discusses the concept of sensitivity, which refers to the maximal amount that the output of a function can change when an individual is removed or replaced in the database. He also mentions the use of differential privacy in sensitive surveys, where people are likely to lie, to obtain more accurate results without violating privacy.
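The sensitivity notion described above can be probed empirically. A toy sketch (note: this measures the change at one particular database, whereas the formal definition takes the maximum over all neighboring databases):

```python
def sensitivity(query, database):
    """Max |query(db) - query(db with one person removed)| for this db."""
    full = query(database)
    return max(
        abs(full - query(database[:i] + database[i + 1:]))
        for i in range(len(database))
    )

ages = [31, 44, 27, 58, 39]        # made-up data for illustration
print(sensitivity(sum, ages))      # a sum can change by up to the largest value: 58
print(sensitivity(len, ages))      # a count changes by exactly 1
```

Low-sensitivity queries like counts need only a little noise to protect any individual; high-sensitivity queries like raw sums need much more.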
  • 00:15:00 In this section, the speaker introduces the concept of privacy-preserving AI techniques, focusing on differential privacy. He explains how randomized response, a technique involving coin flips, can be used to add noise to data and provide plausible deniability for individuals. The speaker also discusses local and global differential privacy, highlighting the trade-offs between privacy and accuracy. He mentions that differential privacy can be seen as a formal version of data anonymization and emphasizes the importance of not relying solely on data anonymization for privacy protection.
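The coin-flip scheme described above is classic randomized response, and it fits in a few lines. Each respondent's answer is deniable, yet the population statistic is recoverable, since `observed = 0.5 * true_rate + 0.25`:

```python
import random

random.seed(0)   # fixed seed so the sketch is reproducible

def randomized_response(truth):
    """First flip: heads -> answer honestly; tails -> flip again and
    report the second coin. Any single 'yes' is plausibly deniable."""
    if random.random() < 0.5:
        return truth
    return random.random() < 0.5

n = 100_000
true_rate = 0.30   # fraction who would truthfully answer 'yes'
answers = [randomized_response(random.random() < true_rate) for _ in range(n)]
observed = sum(answers) / n
estimate = 2 * observed - 0.5      # invert observed = 0.5*true + 0.25
print(round(estimate, 3))          # close to 0.30
```

The noise hurts any individual inference but averages out over many respondents, which is the privacy/accuracy trade-off the talk highlights.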
  • 00:20:00 In this section, Andrew Trask discusses the issue of privacy when dealing with anonymized datasets. He uses the example of the Netflix prize, where researchers were able to de-anonymize the dataset by comparing it to other publicly available datasets. Trask explains that just because a dataset appears anonymous, it doesn't mean that it can't be linked to other information. He introduces the concept of maximum epsilon, which is an upper bound on the statistical uniqueness of a dataset. By applying noise to the data, he suggests that a balance can be achieved between privacy and utility. Trask also mentions the use of single query approaches and synthetic datasets for preserving privacy. He emphasizes the importance of generalization in machine learning and how privacy-preserving techniques can still allow for meaningful insights without compromising individual identities.
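The noise-for-privacy trade-off governed by epsilon can be sketched with the Laplace mechanism, the standard way to release a numeric query under differential privacy (the numbers here are made up; the Laplace sample is drawn via the inverse CDF):

```python
import math
import random

random.seed(0)

def laplace_mechanism(true_answer, sensitivity, epsilon):
    """Add Laplace noise with scale sensitivity/epsilon: smaller epsilon
    means more noise, i.e. stronger privacy and lower utility."""
    scale = sensitivity / epsilon
    u = random.random() - 0.5                 # uniform in [-0.5, 0.5)
    noise = -scale * math.copysign(1, u) * math.log(1 - 2 * abs(u))
    return true_answer + noise

# A count query has sensitivity 1: one person changes the count by at most 1.
true_count = 42
noisy = laplace_mechanism(true_count, sensitivity=1, epsilon=0.5)
print(noisy)   # roughly 42, off by a few
```

Each released answer spends part of the epsilon budget, which is why the bullet above distinguishes single-query releases from repeated interactive access.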
  • 00:25:00 In this section, the speaker discusses the privacy budgeting mechanism in privacy-preserving AI. They mention that the privacy budget should not be set by the data scientist or the data owner, but rather by the individuals whose information is being used. However, they acknowledge that achieving this level of personal privacy budgeting infrastructure is aspirational. The speaker also addresses two weaknesses of remote execution: the model being put at risk and the challenge of performing joint computations across multiple data owners. They then introduce secure multi-party computation as a solution to these challenges and explain how it allows multiple people to combine their private inputs to compute a function without revealing their inputs to each other. They give an example using encrypted shares of a number to illustrate this concept.
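The "encrypted shares of a number" example can be sketched with additive secret sharing over a finite field: each share on its own is a uniformly random number, and only the full set of shareholders can reconstruct the secret.

```python
import random

Q = 2**31 - 1   # a large prime modulus; all arithmetic is mod Q

def share(secret, n_shareholders=3):
    """Split a secret into n random-looking shares that sum to it mod Q."""
    shares = [random.randrange(Q) for _ in range(n_shareholders - 1)]
    shares.append((secret - sum(shares)) % Q)
    return shares

def reconstruct(shares):
    """Only the complete set of shares reveals the secret."""
    return sum(shares) % Q

s = share(5)
print(s)               # three numbers, each individually meaningless
print(reconstruct(s))  # 5
```

This is the simplest SMPC building block; production protocols add machinery for multiplication between shared values and for dishonest participants.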
  • 00:30:00 In this section, the speaker explains how privacy-preserving AI works using encryption and shared governance. Each value is encrypted and cannot be decrypted unless all shareholders agree. Despite being encrypted, computations can still be performed on these numbers, for example, multiplying the shares by a public value. Different protocols support the different functions needed in machine learning. Large models and datasets can also be individually encrypted and shared, allowing for shared ownership and governance. The speaker also addresses the computational complexity and potential slowdown that come with encrypted computation. Nonetheless, the theoretical framework exists for answering questions using data that cannot be seen, which opens up new possibilities for privacy-preserving AI. The long-term goal is to create secure and robust infrastructure that is accessible to all.
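Computing on encrypted numbers, as described above, can be demonstrated with the same toy additive-sharing scheme: each shareholder adds or scales its own share locally, no one ever sees the underlying values, yet the result decodes correctly.

```python
import random

Q = 2**31 - 1   # field modulus; shares are integers mod Q

def share(x, n=3):
    s = [random.randrange(Q) for _ in range(n - 1)]
    return s + [(x - sum(s)) % Q]

def reconstruct(shares):
    return sum(shares) % Q

a, b = share(20), share(22)

# Addition of two shared secrets: shareholder i adds a[i] + b[i] locally.
summed = [(x + y) % Q for x, y in zip(a, b)]
# Multiplication by a public constant: each shareholder scales its share.
scaled = [(3 * x) % Q for x in a]

print(reconstruct(summed))   # 42  (= 20 + 22)
print(reconstruct(scaled))   # 60  (= 3 * 20)
```

Multiplying two *shared* values requires extra protocol rounds (e.g. precomputed multiplication triples), which is one source of the slowdown the speaker mentions.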
  • 00:35:00 In this section, the speaker discusses the concept of encrypting both the model and the dataset during AI training. This approach differs from previous methods that only encrypted the dataset. By encrypting both the model and dataset, the computations can still be performed inside the encryption without any degradation in accuracy. The speaker also mentions the concept of federated learning, praising Google's implementation and highlighting the two forms of federated learning: one with a fixed dataset and model for product development, and another more exploratory style where data is hosted in different private clouds. The speaker emphasizes the importance of combining federated learning with techniques like differential privacy to protect against information leakage.
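The aggregation step at the heart of federated learning can be sketched as a simple element-wise average of locally trained weights (a toy illustration; real deployments such as Google's add secure aggregation, weighting by client data size, and differential privacy, as the bullet above notes):

```python
def federated_average(client_weights):
    """Average a list of weight vectors element-wise: only model
    updates, never raw data, leave each client."""
    n = len(client_weights)
    return [sum(ws) / n for ws in zip(*client_weights)]

# Three hospitals each return locally trained weights (made-up numbers).
updates = [
    [0.9, -0.2, 0.4],
    [1.1,  0.0, 0.6],
    [1.0, -0.1, 0.5],
]
print(federated_average(updates))   # approximately [1.0, -0.1, 0.5]
```

The server updates the global model with this average and redistributes it, repeating over many rounds.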
  • 00:40:00 In this section, the speaker discusses the potential of privacy-preserving AI to answer questions using data that cannot be seen. This ability matters because people spend a significant portion of their lives answering questions that often involve personal data. The speaker highlights four different areas where this technology can be applied, including healthcare and finance, and compares the way data is handled in society today to city streets before modern sewage infrastructure, emphasizing the need for better data-privacy infrastructure.
  • 00:45:00 In this section, the speaker emphasizes the importance of privacy-preserving AI in answering important questions without compromising data security. He acknowledges that the idea may seem far-fetched, but believes that the basic building blocks are already in place and what is needed is adoption and engineering. The speaker also highlights the potential of open data for scientific advancement, citing historical examples where large datasets have led to significant progress in AI. He further explains that a vast amount of untapped data sits in data warehouses due to legal risks and commercial concerns, and proposes a solution that allows different entities to collaborate by securely accessing and utilizing each other's data. This approach has the potential to unlock a plethora of data and solve challenges across many domains.
  • 00:50:00 In this section, the speaker discusses privacy-preserving AI through three use cases. The first involves creating a gateway that protects and leverages users' data while increasing the accuracy of models. The second focuses on single-use accountability, where data is accessed only to answer a specific question, much as sniffer dogs at airports reveal only one bit of information without every bag being searched. This approach ensures privacy and limits potential misuse. The third explores encrypted services, such as messaging apps, where messages are encrypted on users' phones and sent directly to the intended recipient, providing secure communication. These use cases demonstrate the potential of privacy-preserving AI in various scenarios.
  • 00:55:00 In this section, the speaker discusses the concept of privacy-preserving AI, where machine learning, encrypted computation, and differential privacy are combined to create services that protect user data. The example given is a doctor's visit, where the doctor and patient's data sets are brought together to compute the appropriate treatment, without revealing the individual's personal medical information. The speaker introduces the idea of structured transparency, which involves input privacy, logic, and output privacy, allowing for end-to-end encrypted services. The implications of this approach are demonstrated through a skin cancer prediction model, where the machine learning model can classify whether a person has cancer without exposing their medical records. The speaker suggests that if such services can be scaled and trained on large data sets, it could empower individuals to have control over their personal information while receiving the same services.

01:00:00 - 01:10:00

In this section, Andrew Trask explains the concept of privacy-preserving AI and the importance of protecting sensitive information. He discusses the use of encrypted computations to safeguard data and mentions the need for differential privacy to address biases in models. Trask highlights the adoption of privacy-preserving AI techniques by organizations like OpenMined and the US Census. He also discusses future developments in privacy technology, challenges in encrypted computation, and the potential for individuals to assign a privacy budget. Trask emphasizes the need for improved computing and networking infrastructure to achieve greater individual control over personal privacy. Finally, he explores the communication between enterprises and the need for an accounting mechanism to prevent data misuse.

  • 01:00:00 In this section, the speaker discusses the concept of privacy in AI and how encrypted computations can help protect sensitive information. They explain that the diagnosis itself is not private, as it is revealed to the service provider, but the encrypted result can be decrypted by different key holders, depending on their purpose. They also mention the need for differential privacy to address potential biases in models without compromising privacy. The speaker notes that the US Census is already using differential privacy for data protection and highlights the growing adoption of privacy-preserving AI techniques in organizations like OpenMined.
  • 01:05:00 In this section, Andrew Trask discusses the importance of privacy in AI and the commercial incentives for protecting data. He notes that beyond privacy concerns, there are commercial reasons for protecting datasets and the unique statistical signal they possess. Trask also mentions the sponsorship of open-source grants and the hope for more in the future, and anticipates that the coming year will see the first pilots of privacy-preserving AI rolling out. He then addresses a technical question about nonlinear functions and their impact on performance under encryption, pointing to two trends in deep learning research for handling them: polynomial approximations and discrete comparison functions. He also mentions the challenges and constraints of encrypted computation. In response to a question about allowing individuals to assign a privacy budget, Trask expects adoption to happen in waves, with enterprises initially adopting privacy-preserving technologies for commercial reasons; he believes this enterprise adoption will mature privacy technology quickly. Trask also discusses the challenges of encrypted services and the need for improvements in computing and networking infrastructure to achieve individual control over personal privacy budgets.
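The polynomial-approximation trick mentioned above can be illustrated with the sigmoid activation: encrypted computation typically supports only addition and multiplication, so the nonlinearity is replaced by a low-degree polynomial (here a degree-3 Taylor expansion around zero, one common choice among several).

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def sigmoid_poly(x):
    """Degree-3 Taylor approximation of sigmoid around 0:
    sigma(x) ~ 1/2 + x/4 - x**3/48, reasonable for small |x|."""
    return 0.5 + x / 4 - x**3 / 48

for x in (-1.0, 0.0, 0.5, 1.0):
    print(x, sigmoid(x), sigmoid_poly(x))
```

The approximation degrades for large |x|, which is part of the performance/accuracy trade-off in encrypted deep learning; comparison-based protocols avoid this at the cost of extra communication rounds.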
  • 01:10:00 In this section, the speaker discusses the concept of privacy-preserving AI and the communication between different enterprises that would be necessary for its implementation. They suggest the need for an accounting mechanism to ensure that data is not being double-spent or misused. There are different possibilities for this mechanism, such as the creation of an app or institution, or the use of data banks that handle both financial and personal data. However, the speaker acknowledges that these ideas would require significant adoption and it is still unclear what the final solution would look like. They also mention that current recommendation systems can be improved by accessing private data without actually seeing it, which could lead to more beneficial and holistic recommendations.
