September 26, 2022

Toward Trustworthy Machine Learning: An Example in Defending against Adversarial Patch Attacks (2)

By Chong Xiang and Prateek Mittal

In our previous post, we discussed adversarial patch attacks and presented our first defense algorithm, PatchGuard. The PatchGuard framework (small receptive field + secure aggregation) has become the most popular defense strategy over the past year, subsuming a long list of defense instances (Clipped BagNet, De-randomized Smoothing, BagCert, Randomized Cropping, PatchGuard++, ScaleCert, Smoothed ViT, ECViT). In this post, we present a different way of building robust image classification models: PatchCleanser. Instead of using small receptive fields to suppress the adversarial effect, PatchCleanser directly masks out adversarial pixels in the input image. This design makes PatchCleanser compatible with any high-performance image classifier and allows it to achieve state-of-the-art defense performance.

PatchCleanser: Removing the Dependency on Small Receptive Fields

The limitation of small receptive fields. We have seen that the small receptive field plays an important role in PatchGuard: it limits the number of corrupted features and lays a foundation for robustness. However, the small receptive field also limits the information received by each feature; as a result, it hurts clean model performance (when there is no attack). For example, PatchGuard models (BagNet+robust masking) only achieve 55%-60% clean accuracy on the ImageNet dataset, while state-of-the-art undefended models, which all have large receptive fields, can achieve an accuracy of 80%-90%.

This huge drop in clean accuracy discourages the real-world deployment of PatchGuard-style defenses. A natural question to ask is:

Can we achieve strong robustness without the use of small receptive fields? 

YES, we can. We propose PatchCleanser with an image-space pixel masking strategy to make the defense compatible with any state-of-the-art image classification model (with larger receptive fields).

A pixel-masking defense strategy. The high-level idea of PatchCleanser is to apply pixel masks to the input image and evaluate model predictions on masked images. If a mask removes the entire patch, the attacker has no influence over the classification, and thus any image classifier can make an accurate prediction on the masked image. However, the challenge is: how can we mask out the patch, especially when the patch location is unknown? 

Pixel masking: the first attempt. A naive approach is to choose a mask and apply it to all possible image locations. If the mask is large enough to cover the entire patch, then at least one mask location can remove all adversarial pixels. 

We provide a simplified visualization below. When we apply masks to an adversarial image (top of the figure), the model prediction is correct (“dog”) when the mask removes the patch at the upper left corner. Meanwhile, the predictions on other masked images are incorrect since they are influenced by the adversarial patch; we see a prediction disagreement among different masked images. On the other hand, when we consider a clean image (bottom of the figure), the model predictions usually agree on the correct label since both we and the classifier can easily recognize the partially occluded dog.

visual examples of one-mask predictions

Based on these observations, we can use the disagreement in one-mask predictions to detect a patch attack; a similar strategy is used by the Minority Reports defense, which takes inconsistency in the prediction voting grid as an attack indicator. However, can we recover the correct prediction label instead of merely detecting an attack? Or equivalently, how can we know which mask removes the entire patch? Which class label should an image classifier trust — dog, cat, or fox?  

Pixel masking: the second attempt. The solution turns out to be super simple: we can perform a second round of masking on the one-masked images (see visual examples below). If the first-round mask already removes the patch (top of the figure), then our second-round masking is applied to a “clean” image, and thus all two-mask predictions will have a unanimous agreement. On the other hand, if the patch is not removed by the first-round mask (bottom of the figure), the image is still “adversarial”. We will then see a disagreement in two-mask predictions; we shall not trust the prediction labels.

visual examples of two-mask predictions
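To make the two-round procedure above concrete, here is a minimal sketch of double-masking inference. It is written against a generic PyTorch classifier and only illustrates the logic described in this post, not the authors’ reference implementation; the helper names and the convention that a mask is a tensor with zeros inside the masked region are our assumptions.

```python
import torch

def double_masking_predict(model, image, masks):
    """Sketch of double-masking inference.

    Assumptions: `model` maps a (1, C, H, W) tensor to class logits, `image`
    is a (C, H, W) tensor, and each mask in `masks` is an (H, W) tensor that
    is 0 inside the masked region and 1 elsewhere.
    """
    def predict(img):
        with torch.no_grad():
            return model(img.unsqueeze(0)).argmax(dim=1).item()

    # Round 1: one-mask predictions.
    one_mask_preds = [predict(image * m) for m in masks]
    majority = max(set(one_mask_preds), key=one_mask_preds.count)
    if all(p == majority for p in one_mask_preds):
        return majority  # unanimous agreement (the common case on clean images)

    # Round 2: re-examine every first-round mask that disagrees with the majority.
    for m1, p1 in zip(masks, one_mask_preds):
        if p1 == majority:
            continue
        two_mask_preds = [predict(image * m1 * m2) for m2 in masks]
        if all(p == p1 for p in two_mask_preds):
            # All two-mask predictions agree with this outlier, so its
            # first-round mask likely removed the patch: trust its label.
            return p1

    # No outlier is supported by unanimous two-mask agreement; fall back to
    # the majority of the one-mask predictions.
    return majority
```

On a clean image, the first round usually returns immediately with a unanimous label; the quadratic number of two-mask predictions is only needed when the one-mask predictions disagree.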

In our PatchCleanser paper, we further discuss how to generate a mask set such that at least one mask can remove the entire patch regardless of the patch location. We further prove that if the model predictions on all possible two-masked images are correct, PatchCleanser can always make correct predictions. 
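The mask-set construction can be sketched with a simple one-dimensional covering argument: a mask of width m slid with stride s covers every patch of width p as long as s ≤ m − p + 1. The snippet below turns that argument into code; the function name and the concrete numbers are hypothetical, and a 2-D mask set is obtained by taking the Cartesian product of the two per-axis sets (see the PatchCleanser paper for the exact construction and its proof).

```python
import math

def mask_positions_1d(n, p, k):
    """Sketch: mask width and start positions along one axis of length `n`
    so that any patch of width at most `p` is fully covered by at least one
    of `k` masks (1-D covering argument; not the paper's exact pseudocode)."""
    s = math.ceil((n - p + 1) / k)   # stride between mask positions
    m = p + s - 1                    # mask width needed for gap-free coverage
    positions = sorted({min(i * s, n - m) for i in range(k)})
    return m, positions

# Hypothetical example: a 224-pixel axis, patches up to 32 pixels wide,
# and a budget of 6 masks per axis.
m, pos = mask_positions_1d(224, 32, 6)   # m = 64, pos = [0, 33, 66, 99, 132, 160]
# A mask starting at j covers every patch starting in [j, j + m - p]; with
# stride s = m - p + 1, consecutive masks leave no uncovered patch location.
```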

PatchCleanser performance. In the figure below, we plot the defense performance on the 1000-class ImageNet dataset (against a 2%-pixel square patch anywhere on the image). We can see that PatchCleanser significantly outperforms prior works (which are all PatchGuard-style defenses with small receptive fields). Notably, (1) the certified robust accuracy of PatchCleanser (62.1%) is even higher than the clean accuracy of all prior defenses, and (2) the clean accuracy of PatchCleanser (83.9%) is similar to vanilla undefended models! These results further demonstrate the strength of defenses that are compatible with any state-of-the-art classification models (with large receptive fields). 

Clean accuracy and certified robust accuracy on the ImageNet dataset; certified robust accuracy evaluated against a 2%-pixel square patch anywhere on the image

(certified robust accuracy is a provable lower bound on model robust accuracy; see the PatchCleanser paper for more details)

Takeaways. In PatchCleanser, we demonstrate that small receptive fields are not necessary for strong robustness. We design an image-space pixel-masking strategy that is compatible with any image classifier. This compatibility allows us to use state-of-the-art image classifiers and achieve significant improvements over prior works.

Conclusion: Using Logical Reasoning for Building Trustworthy ML Systems

In the era of big data, we have been amazed at the power of statistical reasoning/learning: an AI model can automatically extract useful information from a large amount of data and significantly outperform manually designed models (e.g., hand-crafted features). However, these learned models can behave unexpectedly when they encounter “adversarial examples” and thus lack reliability for security-critical applications. In our posts, we demonstrate that we can additionally apply logical reasoning to statistically learned models to achieve strong robustness. We believe the combination of logical and statistical reasoning is a promising and important direction for building trustworthy ML systems.

Additional Reading

New Study Analyzing Political Advertising on Facebook, Google, and TikTok

By Orestis Papakyriakopoulos, Christelle Tessono, Arvind Narayanan, Mihir Kshirsagar

With the 2022 midterm elections in the United States fast approaching, political campaigns are poised to spend heavily to influence prospective voters through digital advertising. Online platforms such as Facebook, Google, and TikTok will play an important role in distributing that content. But our new study – How Algorithms Shape the Distribution of Political Advertising: Case Studies of Facebook, Google, and TikTok – which will appear at the Artificial Intelligence, Ethics, and Society conference in August, shows that the platforms’ tools for voluntary disclosures about political ads do not provide the transparency the public needs. More details can also be found on our website: campaigndisclosures.princeton.edu.

Our paper conducts the first large-scale analysis of public data from the 2020 presidential election cycle to critically evaluate how online platforms affect the distribution of political advertisements. We analyzed a dataset containing over 800,000 ads about the 2020 U.S. presidential election that ran in the two months prior to the election, which we obtained from the ad libraries of Facebook and Google; the companies created these libraries to offer more transparency about political ads. We also collected and analyzed 2.5 million TikTok videos from the same time period. The ad libraries were created by the platforms in an attempt to stave off potential regulation such as the Honest Ads Act, which sought to impose greater transparency requirements on platforms carrying political ads. But our study shows that these ad libraries fall woefully short of their own objectives of being more transparent about who pays for the ads and who sees the ads, as well as the objective of bringing greater transparency to the role of online platforms in shaping the distribution of political advertising.

We developed a three-part evaluative framework to assess the platform disclosures: 

1. Do the disclosures meet the platforms’ self-described objective of making political advertisers accountable?

2. How do the platforms’ disclosures compare against what the law requires for radio and television broadcasters?

3. Do the platforms disclose all that they know about the ad targeting criteria, the audience for the ads, and how their algorithms distribute or moderate content?

Our analysis shows that the ad libraries do not meet any of the objectives. First, the ad libraries offer only partial disclosures of the audience characteristics and targeting parameters of placed political ads. These disclosures do not allow us to understand how political advertisers reached prospective voters. For example, we compared ads in the ad libraries that were shown to different audiences with dummy ads that we created on the platforms (Figure 1). In many cases, we measured a significant difference in the calculated cost-per-impression between the two types of ads, which we could not explain with the available data.

  • Figure 1. We plot the generated cost-per-impression of ads in the ad libraries that were (1) targeted to all genders & ages on Google, (2) targeted to females, ages 25-34, on YouTube, (3) seen by all genders & ages in the US on Facebook, and (4) seen only by females of all ages located in California on Facebook. For Facebook, lower & upper bounds are provided for the impressions. For Google, lower & upper bounds are provided for cost & impressions, given the extensive “bucketing” of the parameters performed by the ad libraries when reporting them; these bounds are denoted in the figures with boxes, and points represent the median value of each box. We compare the generated cost-per-impression of ads with the cost-per-impression of a set of dummy ads we placed on the platforms with the exact same targeting parameters & audience characteristics. Black lines represent the upper and lower boundaries of an ad’s cost-per-impression as we extracted them from the dummy ads. We label an ad placement as “plausible targeting” when its cost-per-impression overlaps with the range we calculated, denoting that we can assume the ad library provides all relevant targeting parameters/audience characteristics about the ad. Similarly, a placement labeled as “unexplainable targeting” represents an ad whose cost-per-impression falls outside the upper and lower values that we calculated, meaning that the platform potentially does not disclose full information about the distribution of the ad.
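To make the “plausible targeting” versus “unexplainable targeting” labeling in Figure 1 concrete, here is a minimal sketch of the interval-overlap check it describes. The helper name and the dollar values are hypothetical illustrations, not the study's code or data; the only assumption is that each ad comes with lower and upper bounds on its cost-per-impression, both from the ad library and from the matching dummy ads.

```python
def label_targeting(ad_cpi_low, ad_cpi_high, dummy_cpi_low, dummy_cpi_high):
    """Hypothetical helper: label an ad-library entry by checking whether its
    cost-per-impression interval overlaps the interval measured from dummy
    ads placed with the same targeting parameters and audience."""
    overlaps = ad_cpi_low <= dummy_cpi_high and dummy_cpi_low <= ad_cpi_high
    return "plausible targeting" if overlaps else "unexplainable targeting"

# Made-up dollar-per-impression values, purely for illustration:
print(label_targeting(0.004, 0.006, 0.005, 0.009))  # plausible targeting
print(label_targeting(0.012, 0.015, 0.005, 0.009))  # unexplainable targeting
```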

Second, broadcasters are required to offer advertising space at the same price to political advertisers as they do to commercial advertisers. But we find that the platforms charged campaigns different prices for distributing ads. For example, on average, the Trump campaign on Facebook paid more per impression (~18 impressions/dollar) compared to the Biden campaign (~27 impressions/dollar). On Google, the Biden campaign paid more per impression compared to the Trump campaign. Unfortunately, while we attempted to control for factors that might account for different prices for different audiences, the data does not allow us to probe the precise reason for the differential pricing. 

Third, the platforms do not disclose detailed information about the audience characteristics that they make available to advertisers. They also do not explain how their algorithms distribute or moderate the ads. For example, we see that campaigns placed ads on Facebook that were ostensibly not targeted by age, yet the ads were not distributed uniformly across age groups. We also find that platforms applied their ad moderation policies inconsistently, with some instances of moderated ads being removed and others not, and without any explanation for the decision to remove an ad (Figure 2).

  • Figure 2. Comparison of different instances of moderated ads across platforms. The light blue bars show how many instances of a single ad were moderated, and maroon bars show how many instances of the same ad were not. The results suggest inconsistent moderation of content across platforms, with some instances of the same ad being removed and others not.

Finally, we observed new forms of political advertising that are not captured in the ad libraries. Specifically, campaigns appear to have used influencers to promote their messages without adequate disclosure. For example, on TikTok, we document how political influencers, who were often linked with PACs, generated billions of impressions from their political content. This new type of campaigning still remains unregulated and little is known about the practices and relations between influencers and political campaigns.  

In short, the platforms’ self-regulatory disclosures are inadequate, and we need more comprehensive disclosures from platforms to understand their role in the political process. Our key recommendations include:

– Requiring that each political entity registered with the FEC use a single, universal identifier for campaign spending across platforms to allow the public to track their activity.

– Developing a cross-platform data repository, hosted and maintained by a government or independent entity, that collects political ads, their targeting criteria, and the audience characteristics that received them. 

– Requiring platforms to disclose information that will allow the public to understand how the algorithms distribute content and how platforms price the distribution of political ads. 

– Developing a comprehensive definition of political advertising that includes influencers and other forms of paid promotional activity.

Toward Trustworthy Machine Learning: An Example in Defending against Adversarial Patch Attacks

By Chong Xiang and Prateek Mittal

Thanks to the stunning advancement of Machine Learning (ML) technologies, ML models are increasingly being used in critical societal contexts — such as in the courtroom, where judges look to ML models to determine whether a defendant is a flight risk, and in autonomous driving,  where driverless vehicles are operating in city downtowns. However, despite the advantages, ML models are also vulnerable to adversarial attacks, which can be harmful to society. For example, an adversary against image classifiers can augment an image with an adversarial pixel patch to induce model misclassification. Such attacks raise questions about the reliability of critical ML systems and have motivated the design of trustworthy ML models. 

In this 2-part post on trustworthy machine learning design, we will focus on ML models for image classification and discuss how to protect them against adversarial patch attacks. We will first introduce the concept of adversarial patches and then present two of our defense algorithms: PatchGuard in Part 1 and PatchCleanser in Part 2.

Adversarial Patch Attacks: A Threat in the Physical World

The adversarial patch attack, first proposed by Brown et al., targets image recognition models (e.g., image classifiers). The attacker aims to overlay an image with a carefully generated adversarial pixel patch to induce incorrect model predictions (e.g., misclassification). Below is a visual example of the adversarial patch attack against traffic sign recognition models: after attaching an adversarial patch, the model prediction incorrectly changes from “stop sign” to “speed limit 80 sign”.

a visual example of adversarial patch attacks taken from Yakura et al.

Notably, this attack can be realized in the physical world. An attacker can print and attach an adversarial patch to a physical object or scene. Any image taken from this scene then becomes an adversarial image. Just imagine that a malicious sticker attached to a stop sign confuses the perception system of an autonomous vehicle and eventually leads to a serious accident! This threat to the physical world motivates us to study mitigation techniques against adversarial patch attacks.

Unfortunately, security is never easy. An attacker only needs to find one strategy to break the entire system while a defender has to defeat as many attack strategies as possible. 

In the remainder of this post, we discuss how to make an image classification model as secure as possible: able to make correct and robust predictions against attackers who know everything about the defense and who might attempt to use an adversarial patch at any image location and with any malicious content.

This as-robust-as-possible notion is referred to as provable, or certifiable, robustness in the literature. We refer interested readers to the PatchGuard and PatchCleanser papers for its formal definitions and security guarantees.

PatchGuard: A Defense Framework Using Small Receptive Field + Secure Aggregation

PatchGuard is a defense framework for certifiably robust image classification against adversarial patch attacks. Its design is motivated by the following question:

How can we ensure that the model prediction is not hijacked by a small localized patch? 

We propose a two-step defense strategy: (1) small receptive fields and (2) secure aggregation. The use of small receptive fields limits the number of corrupted features, and secure aggregation on a partially corrupted feature map allows us to make robust final predictions.  

Step 1: Small Receptive Fields. The receptive field of an image classifier (e.g., CNN) is the region of the input image that a particular feature looks at (or is influenced by). The model prediction is based on the aggregation of features extracted from different regions of an image. By using a small receptive field, we can ensure that only a limited number of features “see” the adversarial patch. 

The example below illustrates that the adversarial patch can only corrupt one feature (the red vector on the right) when we use a model with small receptive fields, marked with red and green boxes over the images.

visual example of feature corruption under small receptive fields

We provide another example below for large receptive fields. The adversarial pixels appear in the receptive fields – the area inside the red boxes – of all four features and lead to a completely corrupted feature map, making it nearly impossible to recover the correct prediction. 
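The benefit of a small receptive field can be quantified with a standard sliding-window counting argument: along one image axis, a patch of width p can intersect the receptive fields of at most ⌈(p + r − 1)/s⌉ features, where r is the receptive-field size and s is the stride of the feature map. The sketch below illustrates this bound with hypothetical numbers; it is our paraphrase of the kind of analysis in the PatchGuard paper, not a quotation of its exact formulas.

```python
import math

def max_corrupted_features(p, r, s):
    """Per-axis upper bound on the number of features whose receptive
    fields overlap a patch of width p, for receptive-field size r and
    feature-map stride s."""
    return math.ceil((p + r - 1) / s)

# Hypothetical numbers: a 23-pixel-wide patch, 17-pixel receptive fields,
# stride 8 -> at most ceil(39 / 8) = 5 corrupted features per axis,
# i.e. at most a 5 x 5 block in the 2-D feature map.
per_axis = max_corrupted_features(23, 17, 8)   # 5
corrupted_block = per_axis ** 2                # 25
```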

Step 2: Secure Aggregation. The use of small receptive fields limits the number of corrupted features and translates the defense into a secure aggregation problem: how can we make a robust prediction based on a partially corrupted feature map? Here, we can use any off-the-shelf robust statistics techniques (e.g., clipping and median) for feature aggregation. 
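As a concrete illustration of off-the-shelf secure aggregation, the sketch below aggregates a spatial map of per-class evidence with clipping and with the spatial median. This is only a generic example of the robust-statistics idea mentioned above (the function names and the non-negative-evidence assumption are ours), not PatchGuard’s robust-masking algorithm, which is described next.

```python
import torch

def clipped_aggregation(feature_map, clip_max=1.0):
    """Cap each local class-evidence value before averaging so that a small
    number of corrupted features cannot dominate the global score.
    `feature_map` has shape (num_classes, H, W) with non-negative entries."""
    return feature_map.clamp(min=0.0, max=clip_max).mean(dim=(1, 2))

def median_aggregation(feature_map):
    """Use the spatial median per class; it is unaffected as long as fewer
    than half of the local features are corrupted."""
    return feature_map.flatten(start_dim=1).median(dim=1).values

# Either aggregation yields a (num_classes,) score vector; the predicted
# label is its argmax.
```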

In our paper, we further propose a more powerful secure aggregation technique named robust masking. Its design intuition is to identify and remove abnormally large features. This mechanism introduces a dilemma for the attacker: to launch a successful attack, the attacker either needs to introduce large malicious feature values, which will be removed by our defense, or use small feature values, which can evade the masking operation but are not malicious enough to cause misclassification.

This dilemma further allows us to analyze the defense robustness. For example, if we consider a square patch that occupies 1% of image pixels, we can calculate the largest number of corrupted features and quantitatively reason about the worst-case feature corruption (details in the paper). Our evaluation shows that, for 89.0% of images in the test set of the ImageNette dataset (a 10-class subset of the ImageNet dataset), our defense can always make correct predictions, even when the attacker has full access to our defense setup and can place a 1%-pixel square patch at any image location and with any malicious content. We note that the result of 89.0% is rigorously proved and certified in our paper, giving the defense theoretical and formal security guarantees. 

PatchGuard also scales to more challenging datasets like ImageNet: we can achieve certified robustness for 32.2% of ImageNet test images against a 1%-pixel square patch. Note that the ImageNet dataset contains images from 1000 different categories, which means that an image classifier that makes random predictions can only correctly classify roughly 1/1000 = 0.1% of images.

Takeaways

The high-level contribution of PatchGuard is a two-step defense framework: small receptive field and secure aggregation. This simple approach turns out to be a very powerful strategy: the PatchGuard framework subsumes most of the concurrent (Clipped BagNet, De-randomized Smoothing) and follow-up works (BagCert, Randomized Cropping, PatchGuard++, ScaleCert, Smoothed ViT, ECViT). We refer interested readers to our robustness leaderboard to learn more about state-of-the-art defense performance.

Conclusion

In this post, we discussed the threat of adversarial patch attacks and presented our PatchGuard defense algorithm (small receptive field and secure aggregation). This defense example demonstrates one of our efforts toward building trustworthy ML models for critical societal applications such as autonomous driving.

In the second part of this two-part post, we will present PatchCleanser — our second example for designing robust ML algorithms.