The authors would like to thank the reviewers for their valuable and constructive feedback, which has helped us improve the paper.
Review #154A:
· Dataset size- Since the required preprocessing time for each app is high (e.g., building the call graph, abstracting it, and generating the Markov chain for MaMaDroid), we decided to work with 15,000 samples in our experiments, similar to other studies […].
· The baseline attacks- The PK and Random attacks were chosen as baselines to put EvadeDroid's *evasion rate* in context. The PK attack is the strongest attack for evading DREBIN and Sec-SVM. Moreover, since EvadeDroid performs a random search, the Random attack was included to show that randomly mutating the features of apps cannot fool malware detectors.
· Black-box baseline attacks- We could not compare EvadeDroid with the only published *problem-space* black-box *Android* evasion attack [29] because its source code is not available, and the methodology is too vague to re-implement (e.g., it lacks details about the preprocessing and the attack itself). On the other hand, comparing EvadeDroid with an impractical evasion attack would not be meaningful, since such an attack may significantly alter the semantics of the programs.
· Evaluating against AV vendors- In the revision, we will add more details about this evaluation. Note that, in our evaluation, we only generated adversarial examples (AEs) for the samples that had already been detected as malware by the AVs. Moreover, although the details of the AV engines in VirusTotal (VT) are not publicly available, they do not work only for scanning specific types of files (e.g., APKs, PEs, etc.); indeed, in the real world, VT is widely used for labeling APKs.
@Veelasha: One of the reviewer comments about the AVs is “It only makes sense to include vendors that (also) specialize in Android malware”.
· Related work section- We will revise this section by discussing the similarities and differences with relevant studies to highlight the advantages of our work.
· Ethical responsibility about AVs- We agree that disclosing our results about AV vendors may have ethical implications; therefore, we will anonymise the vendor names in the revision.
· ZK technique- According to most threat models defined in the literature, in the ZK setting the target classifier is a black box that can only be queried by submitting inputs and observing the returned outputs. ZK seems more realistic than PK and LK since, in real life, most people can only scan a file with an AV product to obtain its label, without knowing the internals of the product.
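To make the ZK setting concrete, the sketch below shows what hard-label, query-only access looks like; it is an illustration rather than our implementation, and the `detector` callable is a hypothetical stand-in for whatever oracle the attacker can reach (an AV product, a VT query, or a deployed model).

```python
# Minimal sketch of zero-knowledge (ZK), hard-label access to a malware detector.
# The attacker submits an app and reads back only a label: no scores, no gradients,
# no feature values, no model internals.
from typing import Callable

def query_black_box(apk_path: str, detector: Callable[[str], bool]) -> str:
    """Submit an app to an opaque detector and receive only its final verdict."""
    flagged = detector(apk_path)               # the only interaction allowed in the ZK setting
    return "malware" if flagged else "benign"  # hard label only
```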
Review #154B:
· Dataset- We worked with the dataset provided in [20] (a.k.a. DREBIN20) because it is a recent public Android dataset that lets other researchers compare their techniques with ours. Note that the most important consideration for us was working with Android apps that are as new as possible.
@Veelasha: The reviewer comment is “The used dataset [26] contains samples collected in 2017 and 2018. Was there any evaluation conducted on a dataset containing samples from a broader period of time?”
· AEs for ESET-NOD32- As shown in Table 4, EvadeDroid usually needed more queries to generate AEs against ESET-NOD32 than against the other AVs; therefore, checking the executability of these manipulated apps was a fairer test, since the probability of an app crashing increases after applying more than one transformation.
· The number of queries- In the revision, we will discuss the number of queries in more detail. Note that we could not compare EvadeDroid's query count with similar attacks because of the lack of query-based, problem-space black-box attacks for Android. However, the number of queries used by EvadeDroid is much lower than that of the *semi-black-box* Android attack presented in [23].
· Real-world constraints- We believe our threat model (e.g., ZK settings, hard-label attacks) meets more of the constraints of real life, *though not all of them*. In the revision, we will clarify this claim further.
· Transferability and dynamic analysis- The goal of our threat model is to evade a *specific static* ML-based malware detector that is a black box for the attacker; however, we will perform further experiments to evaluate EvadeDroid in cases where the target model uses dynamic analysis or is unknown (transferability).
· Configuration of AV vendors- We will discuss VT further in the revision. Note that VT uses commercial AV products that are kept up to date. We do not know the exact details of the AV engines, but it seems that, for ordinary queries, VT returns the result of a static analysis, given how short the analysis time is. Moreover, various papers use VT for static analysis (e.g., […]).
· The size of changes in the AVs experiment- The size of the apps did not matter in that experiment, as our focus was only on the evasion rate.
· Black-box baseline methods- I am not sure whether this is possible.
@Veelasha: The reviewer's comments are “No baseline comparison with other black-box evasion attacks (e.g., GAN-based detection). It appears interesting (if not necessary) to compare the threat models and the evasion rates” and “compare with black-box baseline methods that use surrogate models.” I think using EvadeDroid to generate AEs with surrogate models would be enough for the reviewer.
Review #154C:
· Novelty- The novelty of the paper lies in our threat model, which not only differs from other proposals in the Android domain but is also more realizable in many real-world scenarios. Moreover, in contrast to other relevant studies, we directly manipulate objects in the problem space without using feature-space information. Our study offers a new way of finding the manipulations needed to generate successful adversarial examples in black-box settings and of applying them. Obviously, modifying objects in the problem space requires a transformation technique; this research selected code transplantation because, according to the discussion in [26] and [29], it is the best candidate for manipulating real objects.
· Evaluation settings- The focus of the paper is on practical Android evasion attacks that generate realizable adversarial examples, not on improving the robustness of ML-based malware detectors against evasion attacks.
· Gadgets- We will add more information about gadgets in the revision. In simple terms, a gadget is a code snippet extracted from a program (e.g., an APK) together with all of its dependencies.
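If useful, the revision could also illustrate the notion with a small, purely hypothetical structure (the field names below are ours for illustration and do not come from the extraction tool in [26]): a gadget bundles the extracted code slice with everything it needs to run inside a host app.

```python
# Hypothetical illustration of what a "gadget" bundles together; field names are
# illustrative only and do not reflect the actual extraction tool's format.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Gadget:
    entry_point: str                                        # the API call/method the slice is built around
    code_slice: str                                         # the extracted code snippet (e.g., smali)
    dependencies: List[str] = field(default_factory=list)   # classes/methods the slice needs to run
    resources: List[str] = field(default_factory=list)      # resources referenced by the slice

# Transplanting a gadget means injecting code_slice together with its dependencies and
# resources into a host APK so that the app still builds and runs correctly.
```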
Review #154D:
· EvadeDroid vs [26] – In the revision, we will compare EvadeDroid with [26]. In summary, the first key difference between EvadeDroid and [26] is our threat model, which assumes attackers operate in black-box settings, whereas the adversaries in [26] operate in white-box settings. [26] finds the most influential features for the target classifier and then applies transformations that affect these features to manipulate objects; our approach to finding the desired manipulations is different and novel because our attacker does not have this information. The second difference is that our attack directly generates problem-space perturbations, while the problem-space perturbations in [26] are derived from feature-space perturbations. The third difference is that [26] does not appear to be a general attack, since its transformations must correspond to the features of the target classifier, whereas the two are independent in EvadeDroid. Note that, in the paper, we mentioned that, in contrast to [28] or [29], the tool presented in [26] for manipulating Android apps can guarantee the executability and functionality of apps without breaking them. Indeed, we selected this tool [26] as a reliable transformation technique and extended it to extract our desired gadgets.
· Table 1 and literature review- The goal of this table is to summarize the feature-addition transformation techniques presented in the Android domain. In the revision, we will add a table to the Related Work section that summarizes the key contributions of relevant papers as well as of our study.
· VT’s labeling- In the revision, we will discuss how VT’s analysis is used to obtain accurate labels.
· KNN in MaMaDroid- The original paper [34] showed that KNN performs well while SVM performs poorly. Moreover, we have empirically confirmed that KNN works better than SVM.
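For transparency, our empirical check was along the lines of the simplified sketch below; the synthetic data is only a stand-in for the precomputed MaMaDroid features (Markov-chain transition probabilities per app), and the hyper-parameters shown are illustrative rather than our exact configuration.

```python
# Simplified sketch of the KNN-vs-SVM comparison on MaMaDroid-style features.
# make_classification is a stand-in for the real feature matrix (Markov-chain
# transition probabilities per app) and the real malware/benign labels.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

X, y = make_classification(n_samples=2000, n_features=121, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
svm = LinearSVC(C=1.0, max_iter=10000).fit(X_tr, y_tr)

print("KNN F1:", f1_score(y_te, knn.predict(X_te)))
print("SVM F1:", f1_score(y_te, svm.predict(X_te)))
```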
Review #154E:
· NN-Based classifier- We will discuss DNN-based malware detectors in the revision. We will try to evaluate EvadeDroid against one of the state-of-the-art DNN malware detectors.
· The optimization goal- We aim to evade a target malware detector while minimizing the number of transformations applied to manipulate an Android malware app. The number of transformations directly influences both the number of queries and the number of added features: applying fewer transformations means fewer queries are needed to check the effect of the transformations, and it also leads to fewer added features (a simplified sketch of this trade-off appears at the end of this section).
· Sample collection time- In the revision, we will add details about when the dataset was collected from AndroZoo. Note that we collected the Android apps from AndroZoo in recent months.
· Baseline black-box attack- Please refer to our answer to Reviewer A regarding *why we did not use an Android black-box attack as a baseline*. Note that the main obstacle to using the problem-space black-box attacks presented in other domains (e.g., [36], [41]) is that their transformations for manipulating malware programs cannot be applied in the Android domain.
· Attack consumption time- We will add further details about EvadeDroid's time consumption to Table 3.
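Regarding the optimization goal discussed above, the following is a highly simplified sketch (not the exact EvadeDroid algorithm) of why minimizing the number of applied transformations also bounds the number of queries and the number of added features; `is_detected` (the hard-label oracle) and `candidate_transformations` are hypothetical placeholders.

```python
# Highly simplified sketch: transformations are applied one at a time and the hard-label
# oracle is queried after each one, so the search stops as soon as the app evades.
# The number of applied transformations therefore equals the number of queries spent,
# and fewer transformations also means fewer injected (added) features.
# `is_detected` and `candidate_transformations` are hypothetical placeholders.
import random

def minimal_transformation_attack(malware_apk, candidate_transformations,
                                  is_detected, max_queries=50):
    adv_apk = malware_apk
    applied = []                                  # transformations applied so far
    for _ in range(max_queries):
        t = random.choice(candidate_transformations)
        adv_apk = t(adv_apk)                      # apply one more problem-space transformation
        applied.append(t)
        if not is_detected(adv_apk):              # one query per applied transformation
            return adv_apk, applied               # evaded; len(applied) == queries used
    return None, applied                          # failed within the query budget
```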