Benchmarks Discussion #2
Would two onnx network inputs and two onnx network outputs be accepted for the competition? Motivation: currently, most networks evaluated in the literature focus on image classification. However, in many practical scenarios, e.g. autonomous driving, people pay more attention to object detection or semantic segmentation. Considering the complexity of object detection, we propose a new simple Unet (four Conv2d layers, each followed by BN and ReLU). We advocate that tools should handle more practical architectures, and the simplified Unet is a first step towards this goal. Why do we need two onnx network inputs and two onnx network outputs? In image classification, we can easily say that an image is classified correctly or wrongly. In semantic segmentation, however, we need to calculate how many pixels are classified correctly and use a threshold to judge whether the task succeeds. Thus we have to use two onnx network inputs: one is the RGB values of the image as usual, and the other is the ground truth, used for statistical purposes, i.e. counting how many pixels are classified correctly. For the same reason, we have to design two onnx network outputs: one is the logits as usual, and the other is for statistical purposes. Question: it seems impossible to separate the two aspects, since we need to train the semantic segmentation model and also compare its output with the ground truth. How should we modify our benchmark so that more tools can be applied to it? |
Hi Jinyan, Thank you very much for the questions and comments. Verification of semantic segmentation networks remains challenging for most current verification techniques due to the huge output space (number of classified pixels). We have worked on this problem and implemented our approach in NNV. I am aware of another method from ETH Zurich. The metrics we have proposed to evaluate the robustness of a network under adversarial attacks such as brightening or darkening attacks are:
Important note!!! robustness is different from accuracy. We don't use ground truth segmented images with 100% accuracy for statistical evaluation of the robustness. The reason is that your trained network cannot achieve 100% accuracy. Therefore, using user-provided ground truth segmented images to evaluate the robustness is inappropriate. The solution is, with one image, we execute the network to get a ground truth output segmented image. Under bounded attack to the input image, we compute all possible classes of all pixels in the segmented output. Then, we compare it with the ground truth image to compute robustness value, robustness sensitivity, and robustness IoU. For the details, you can find our analysis here. (https://drive.google.com/file/d/1eiflLYjRtg4G0s4TglyCq21Z30OEaKFj/view) Another note is that, for upsampling, please use dilated or transposed convolution, don't use un-max-pooling. Tran |
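To make the statistical step concrete, here is a minimal sketch of how a robustness value could be computed from per-pixel reachable class sets. It assumes the verifier has already computed, for every pixel, the set of classes that pixel can take under the bounded attack; the function name and the 0/1 counting are our simplification of the metrics in the linked analysis, not NNV's actual API.

```python
def robustness_metrics(ground_truth, reachable_classes):
    """ground_truth: one class id per pixel, obtained by running the network
    on the clean image; reachable_classes: one set per pixel, holding every
    class the pixel may take under the bounded attack."""
    robust = [
        gt_cls for gt_cls, classes in zip(ground_truth, reachable_classes)
        if classes == {gt_cls}  # pixel provably keeps its clean-image class
    ]
    return len(robust) / len(ground_truth)

# Example: 4 pixels; the last one may flip between classes 1 and 2,
# so only 3 of 4 pixels are provably robust.
gt = [0, 1, 1, 2]
reach = [{0}, {1}, {1}, {1, 2}]
print(robustness_metrics(gt, reach))  # 0.75
```

The key point from the comment above is visible here: the "ground truth" is the network's own output on the clean image, so the metric measures robustness of the classification, not its accuracy.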
Thank you, Tran. We have one more question. Do we still need two onnx network inputs if we compare the output with the ground truth inside the onnx network? More specifically, we take the original image as input and execute the network to get a ground truth output. Then we replace the user-provided ground truth with this new ground truth. How can we merge the new ground truth and the disturbed images into one input? |
Hi Jinyan, When you have a single image and attack that image with some bounded unknown attack, for example brightening some pixels of the image, then what you get is NOT A SINGLE DISTURBED IMAGE but an infinite (yet bounded) set containing all possible disturbed images. With that, a pixel may be classified into many classes. The robustness verification is to verify whether the class of that pixel changes under the attack. Therefore, what we need to provide to a verification algorithm is the network, an input image, and a model of the attack, not a concrete disturbed image. If you provide a concrete disturbed image and a ground truth, we just need to execute the network on the disturbed image and analyze the result in comparison with the ground truth, as in a normal evaluation of accuracy. This execution-based robustness evaluation is fundamentally different from our verification purpose. Tran |
Hi all, We have a benchmark which we think would be interesting for the competition; however, we are still waiting to hear back from some stakeholders about whether it is okay for us to publish it. We are aware that the deadline for proposing benchmarks is today, but would it be possible to get a few days of extension? I am also not sure how many benchmarks each participant is allowed to publish this year? If it is more than one we are also happy to provide the same fully connected MNIST benchmarks as last year. Best, |
Hi all, I would like to propose the following two benchmarks, both based on CIFAR10 image classification using ResNets: I think we should also try to have the benchmarks ready a bit closer to the agreed-upon deadline than last year, so that all participants have sufficient (and the same) time to make the required adaptations in their tools/parsing code. Regarding the suggestions from @pomodoromjy, I think it would be important that the benchmark conforms to the vnnlib standard and has just one input and output. We would probably also want to make sure that it is not completely out of reach for all tools, to avoid having all tools run the full 6h on that benchmark. What do the other participants think? Cheers, |
Our team aims to propose a benchmark in collaboration with researchers from the non-ML community. We would also greatly appreciate a few days' extension. The rules document says the benchmarks need to be prepared by May 31. Is that the final hard deadline for benchmark submission (or maybe a few days before that)? The rules document also says that each team can provide up to 2 benchmarks (one benchmark may be selected from a non-participant's proposal and should not duplicate the other benchmark). I hope there will be more benchmarks from non-participants in the next few days for selection. And yes, I agree with @mnmueller that benchmarks need to conform to the vnnlib format. The format is indeed a bit restrictive and may be difficult to apply to some application scenarios (we had to work around this as well), but I guess for this year's competition we should keep it. |
Hi all, We want to submit a Neural Network for Systems (NN4Sys) benchmark (teaming up with Huan/Kaidi's team). We're almost there---we're tuning our specification difficulties for a more balanced benchmark. This benchmark will include four NN models: two from learned database indexes (see this paper) and two from cardinality estimation (see this paper). We will append the network architectures shortly. We design the specifications to reflect the semantic correctness of these system components. In particular, for the learned index, the spec requires the predicted data positions to be close to the actual data positions. For cardinality estimation, the spec requires the predicted cardinality to be accurate, namely within an error bound of the number of rows matching the SQL statement. |
Hi all. Yes the benchmark deadline on the website was updated to May 31, 2022, to match the rules document. Please try to stick with this date as participants need time to ensure their tools will run on all the benchmarks. |
Hi all, Regarding @mnmueller and @huanzhang12 's suggestions, we will try some new approaches for our scenario. A simple specification file would check that a single pixel of the output did not change classes. Could a more complicated specification file check that a certain percentage of the output pixels did not change classes? Best |
Please note that the website for the benchmark tests is currently experiencing some issues. I will investigate this tomorrow and post an update once it's fixed. Sorry for the inconvenience! |
The bug has been fixed; submitting benchmarks should work fine now. If you notice that your code is stuck (no change in the log files after several minutes), you now have the option to abort the execution and submit it again. |
Hi, We want to propose the following benchmarks: mnist_fc: https://vnncomp.christopher-brix.de/benchmark/details/53 The first benchmark is the same as proposed last year: three mnist fully connected networks with 2, 4 and 6 layers. The second benchmark verifies a CIFAR fully convolutional network against bias field perturbations. Bias field verification is explained in detail here. Quickly summarised: a bias field models a smooth noise and is, in this case, represented by a spatially varying 3rd-order polynomial. The bias field transformation is encoded by prepending a fully connected layer to the cifar network; the FC layer takes the 16 coefficients of the bias field polynomial as input. The verification problem considers perturbations of these coefficients. We do the encoding in generate_properties.py, thus any toolkit supporting the standard l-infinity specifications should also support this problem "out-of-the-box". We note that the encoding depends on the input image, so the random draws in this benchmark affect the onnx models generated by generate_properties.py, not the vnnlib files. |
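As a sketch of how such an encoding could work (our own illustration, not the benchmark's actual generate_properties.py): assume a multiplicative bias field built from a 4×4 tensor-product polynomial basis over normalized pixel coordinates, which yields exactly 16 coefficients. Because the clean image is fixed, applying the field to it is linear in the coefficients, so it can be expressed as a single fully connected layer.

```python
import numpy as np

def bias_field_layer(image):
    """Return (W, b) for a fully connected layer mapping the 16 polynomial
    coefficients to the bias-field-perturbed image (flattened). Assumes a
    multiplicative field B(x, y) = sum_{i,j<=3} c_{ij} x^i y^j applied to a
    fixed input image; this construction is our guess at the encoding."""
    h, w, ch = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    xs = xs / (w - 1) * 2 - 1  # normalize coordinates to [-1, 1]
    ys = ys / (h - 1) * 2 - 1
    # 16 basis functions x^i * y^j for i, j in {0, 1, 2, 3}
    basis = np.stack([(xs ** i) * (ys ** j)
                      for i in range(4) for j in range(4)], axis=-1)
    # Output pixel = image pixel * B(x, y), linear in the coefficients c:
    W = (image[..., None] * basis[:, :, None, :]).reshape(-1, 16)
    b = np.zeros(W.shape[0])
    return W, b

W, b = bias_field_layer(np.ones((32, 32, 3)))
print(W.shape)  # (3072, 16)
```

With this layer prepended, an l-infinity box over the 16 coefficients becomes a standard input specification, which matches the "out-of-the-box" claim above. Note that the first basis function is the constant 1, so the coefficient vector (1, 0, …, 0) reproduces the unperturbed image.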
Can you please clarify the following:
Thanks! |
@regkirov 1. Yes, the total sum over all instances should not be more than 6 hours. 2. Yes, please have an instances.csv file. Note this can be created with some randomness based on a seed passed in to the generation script that you provide (all benchmarks need this script this year). |
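For concreteness, the seeded generation could look like the following sketch. The file names (`onnx/net.onnx`, the vnnlib paths), the epsilon values, and the three-column (network, property, timeout) layout are our assumptions for illustration, not an official template.

```python
import csv
import random

def generate(seed, n_instances=3):
    """Build instances.csv rows; every random choice is drawn from a
    generator seeded with the passed-in seed, so results are reproducible."""
    rng = random.Random(seed)
    rows = []
    for i in range(n_instances):
        image_id = rng.randrange(10000)       # e.g. which image to verify
        epsilon = rng.choice([0.004, 0.008])  # e.g. perturbation radius
        spec = f"vnnlib/prop_{i}_img{image_id}_eps{epsilon}.vnnlib"
        # ... the actual (assert ...) constraints would be written to `spec` ...
        rows.append(["onnx/net.onnx", spec, 300])
    return rows

rows = generate(seed=0)
with open("instances.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```

The important property is determinism: running the script twice with the same seed must produce the same instances.csv, while different seeds vary the selected images and perturbations.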
Hello! I have the following error in the logs during benchmark submission:
Some inputs to our benchmark NN are Boolean. Does VNNLib only support |
@regkirov I think it's more of a restriction of the tools. I don't think any tools from last year would support such specs, as with booleans you could perhaps encode SAT problems. I'll note this down as an extension we can discuss. For now, would it make sense to encode them as reals between 0 and 1? |
Thanks @stanleybak Yes, for now it would not be a problem to encode all inputs as reals - I will do that and resubmit. In general, this could be a good extension, for example, to avoid having CEXs that are not realistic. Currently, for NNs with boolean/binary inputs, one would need to check every CEX against integrality constraints on such inputs. In other words, if a tool returns a CEX with the value, say, 0.4 for some input that can only be 0 or 1, this would be valid for the property spec where every input is Real, but not meaningful for the model, because values strictly between 0 and 1 can never occur. |
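The workaround discussed above has two parts: declare each boolean input as a real in [0, 1] in the vnnlib file, and check any returned counterexample against integrality afterwards. A sketch (the helper names and tolerance are ours; the `(assert ...)` lines follow the usual vnnlib `X_i` convention):

```python
def bool_input_bounds(indices):
    """Emit vnnlib lines relaxing boolean inputs to reals in [0, 1]."""
    lines = []
    for i in indices:
        lines.append(f"(assert (>= X_{i} 0.0))")
        lines.append(f"(assert (<= X_{i} 1.0))")
    return "\n".join(lines)

def cex_is_realistic(cex, bool_indices, tol=1e-6):
    """Reject counterexamples that assign fractional values to boolean
    inputs, e.g. 0.4 for an input that can only be 0 or 1."""
    return all(min(abs(cex[i]), abs(cex[i] - 1.0)) <= tol
               for i in bool_indices)

print(bool_input_bounds([0]))
print(cex_is_realistic({0: 0.4}, [0]))  # False: 0.4 is not a valid boolean
```

A counterexample that fails the integrality check is still valid for the relaxed spec, so the property's status remains "sat"; it just cannot be mapped back to a realistic model input.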
@ChristopherBrix @stanleybak Our benchmark includes several NNs with different input sizes. Due to this difference, the properties defined for one NN cannot be run on another NN. As discussed above, we provide an instances.csv file that maps vnnlib specs to the corresponding NNs. However, in the logs I see a mismatch. Does the automated procedure consider the instances.csv file? Is there some naming convention to be followed? If the csv is not considered, any suggestion on how to pass the check while avoiding the mismatch of properties and NNs? Thanks! |
You're right, instances.csv is currently not used, all properties are checked against every NN. I'll work on that over the weekend and post an update here once that's fixed. |
Hi everyone, I would like to propose a TLL NN benchmark based on the networks we used in a couple of our recent papers (e.g. FastBATLLNN). I have a repository with this benchmark, but the automatic validation currently fails with error For some additional, specific context: this error occurs while trying to unzip an ONNX network using Python's Assuming there isn't a bug in my code: is the disk space on the instance a hard constraint? If so -- and assuming there's sufficient interest in this benchmark -- then there are a few ways I can adapt our benchmark to require less disk space. For example, the benchmark is currently configured to generate one property per network, with the number of networks determined solely by the timeout -- I could easily generate multiple properties per network instead, and simply decompress fewer networks. Considering only the smaller network sizes is another possibility: the repository contains many more networks of each size than are matched to properties in the current benchmark. Of course I'm open to other ideas or modifications to help include this benchmark if there's interest. Thanks in advance, |
@regkirov The I've also fixed a bug with protobuf, a new version caused everything to fail. If you tried to submit in the last days and got an error message complaining about the protobuf version, please try again. |
@stanleybak Our benchmark (Collins Aerospace) passed the automated check, following the instruction I am posting the link here: https://vnncomp.christopher-brix.de/benchmark/details/72 Do you plan to consolidate all benchmarks in a single repo? Is there a hard deadline, after which the benchmark materials can no longer be updated? (e.g., timeouts, description, models, etc.) |
Hi all, The benchmark submission deadline was extended one week to June 7th. Please try to make this deadline as no further extensions will be possible. The tool submission will also be moved back one week so that participants have the expected amount of time to run their tools on the benchmarks. |
Hi everyone, I'm about to propose a few benchmarks (involving collision avoidance, some reinforcement learning benchmark problems) -- what does instances.csv need to contain? |
@vkrish1 the |
FYI: the AWS validation instance is suffering from the recent incompatibility bug between tensorflow and protobuf. Basically, protobuf >=3.20 introduces changes that break TF compatibility, so attempting to import TF will fail. See e.g. tensorflow/tensorflow#56077 and the release notes from TF 2.9.1 https://github.com/tensorflow/tensorflow/releases. This appears to prevent The solution is to use TF==2.9.1/2.8.2/2.6.5 etc., which should install protobuf < 3.20 as a dependency, or manually install protobuf==3.19.3 first, since it won't be upgraded by pip. I'm also still curious about the disk space available on the instance. Have there been any deliberations about that? |
Hi all, we'd like to propose a few benchmarks: a collision-avoidance task and two from OpenAI's Gym
Dubins Rejoin environment details:
|
@ChristopherBrix Any idea for this? Hopefully it's easy to get a bigger hard drive allocated. Also: Could you help add a pip package for the benchmark generation? I'm trying to finalize a benchmark based on vggnet16, but I need to |
Hi all, here's the current list of benchmarks I see from the website:
If there are any others, please post a comment here describing your benchmark and try to run it on the submission site. Given the issues with the benchmark submission site (see the previous 3-5 comments), I think we can extend submissions until Friday. If things aren't fixed by then, I'll manually go through the benchmarks to add them. |
Hi @huanzhang12 Would it be possible to remove the part of the model related to reshaping at the start of lindex.onnx and lindex_deep.onnx? It does not seem to serve any purpose when the input is already in the correct shape? Thank you @Cli212. @ChristopherBrix, could you please update the models in the main repo? Thanks! |
Benchmark The error:
|
@yodarocks1 Yes, sorry, I thought I fixed this but must not have pushed the change. The updated repo now uses 600 secs per instance. @ChristopherBrix could you update this one? https://github.com/stanleybak/reach_probability_benchmark2022 |
@pat676 @ChristopherBrix We updated the NN4sys benchmarks in response to #2 (comment) and #2 (comment). We've updated both onnx models and vnnlib files in our repository: https://github.com/Cli212/VNNComp22_NN4Sys |
@ChristopherBrix @stanleybak Another thing I am not completely sure about: will all the benchmarks in https://github.com/ChristopherBrix/vnncomp2022_benchmarks/tree/main/benchmarks be used for the competition? According to the competition rules, each team may propose up to two benchmarks, and one of them can be selected from an outside group. It is unclear to me which benchmarks are from participants and which are from outside groups. I haven’t seen anyone nominating a benchmark yet. |
@huanzhang12 We'll do a matching step to ensure that each benchmark has a nominating participant. I don't think we'll need to exclude any for this reason, as no one proposed an excessive number of benchmarks. |
I'll update NN4sys tomorrow. I'll need to find an alternative to LFS, as that doesn't scale due to GitHub quota limitations. But I guess I can just push the files to my university's file storage; I just need to implement that. |
Hi @ChristopherBrix @stanleybak , it looks like vggnet16_2022 still has nothing under |
Another issue is that in vnncomp2022_benchmarks, all the onnx models are stored in |
You need to run the |
Hi @ChristopherBrix just double checking. See the message above - a fix has been added to the collins_rul_cnn benchmark, but the version in vnncomp22_benchmarks has not been updated yet. Can you check? Thanks! |
Hi all, For the competition, do we have a pre-processing phase, not counted as part of the timeout, for loading the vnnlib specifications and converting the model to the correct format for our toolkit? The reason I'm asking is that the nn4sys benchmarks have some massive vnnlib files that take several minutes to parse with our implementation, and it takes us ~40 seconds to convert the vggnet16 from onnx to PyTorch. Also, the rl_benchmarks have some very short timeouts (1 second), in which case any pre-processing time may be significant. |
I think we don't want to use just a few seconds as the timeout, due to the time variance and unknown startup overhead of each tool. Maybe we should set a minimum timeout threshold like 20 or 30 seconds? @vkrish1 you mentioned you had updated timeouts in a previous comment #2 (comment), but it seems there are still many instances in rl_benchmark with timeouts of a few seconds? Could you double check? Thanks. Last year we had a "prepare_instance.sh" timeout, which was 30 seconds if I recall correctly. We probably need to increase the preparation timeout this year, maybe to 100 seconds (or more)? |
Hi @ChristopherBrix, The setup.sh file does not seem to fetch the mscn_2048d_dual.onnx model of the nn4sys benchmark yet. |
Yes, there is a As for small timeouts... I think 1 second may be a little bit small as well... hopefully we can update these to be at least 30 seconds. |
Hi @Cli212 There seems to be an issue with the ends parameter of Slice_9 in the mscn_128d_dual.onnx model. It is very large and leads to overflow errors in my onnxparser. Could you have a look at this? Thanks! |
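One likely cause (our guess, not confirmed for this model): ONNX exporters commonly emit Slice `ends` values of INT64_MAX to mean "until the end of the axis", so a parser can clamp `ends` to the actual axis length instead of letting it overflow. A minimal sketch:

```python
INT64_MAX = 2**63 - 1  # sentinel many exporters use for "slice to the end"

def clamp_slice(starts, ends, dims):
    """Clamp per-axis slice end indices to the axis length so downstream
    integer arithmetic cannot overflow."""
    return [(s, min(e, d)) for s, e, d in zip(starts, ends, dims)]

print(clamp_slice([0], [INT64_MAX], [128]))  # [(0, 128)]
```

Negative indices and steps would need the same normalization in a full parser; this sketch only covers the oversized-`ends` case reported above.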
My rl-benchmarks have been updated with larger timeouts (at least 30sec) - passed the tests and the repo was updated. |
Hi @naizhengtan, @Cli212, @ChristopherBrix. I am still having trouble with the nn4sys benchmark as mentioned in my comments above.
Is anyone else having trouble with this? If so, I believe we should consider removing these two models from the benchmark at this point. |
Sorry, I'll look into the first point and fix that right away |
The setup script should now work correctly, too. |
Thank you, got the model now. For my second problem, the bug was on my side. If anyone else is facing a similar issue, it seems like torch.LongTensor throws an overflow error in certain situations when dealing with INT_MAX, but reducing it to |
This was fixed some time ago, just forgot to post it here.
I'll do that tonight. |
@stanleybak Actually, you already updated the prepare instance timeout to 5 minutes last year (https://github.com/stanleybak/vnncomp2021/blob/main/run_single_instance.sh#L49). I just copied your code, so that should be fine. |
For unet_upsample_small.onnx in the carvana_unet_2022 benchmark, did anyone face issues with the Pad_113 layer: "Sizes of tensors must match except in dimension 1. Expected size 1 but got size 3 for tensor number 1 in the list"? Based on the torch.nn.functional.pad(input, pad, mode='constant', value=None) → Tensor signature, I found:
Am I missing something? Any help would be appreciated! Thanks! |
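A common source of such size-mismatch errors in UNet-style models is that the upsampled tensor and its skip connection differ in spatial size just before concatenation; the usual fix pads the smaller tensor to match. This sketch (shapes are illustrative, not taken from unet_upsample_small.onnx) computes the F.pad-style padding tuple:

```python
def align_pad(small_hw, big_hw):
    """Return (left, right, top, bottom) padding, in F.pad order for the
    last two dimensions, that grows a tensor of spatial size small_hw to
    match big_hw before torch.cat along the channel dimension."""
    dh = big_hw[0] - small_hw[0]
    dw = big_hw[1] - small_hw[1]
    return (dw // 2, dw - dw // 2, dh // 2, dh - dh // 2)

print(align_pad((31, 31), (32, 32)))  # (0, 1, 0, 1)
```

Note that torch.nn.functional.pad consumes the pad tuple from the last dimension backwards, so the first pair pads width and the second pads height; if the reported mismatch is actually in the channel dimension, padding will not help and the concatenated tensors' channel counts need to be checked instead.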
Hi Piyush, Have you tried running evaluate_network.py? You can set onnxpath to unet_upsample_small.onnx and suceeds_mask to upsample_net_pre_mask. It may demonstrate how the network works. Yonggang |
Thank you, Yonggang! |
Both tool participants and outsiders such as industry partners can propose benchmarks. All benchmarks must be in .onnx format and use .vnnlib specifications, as was done last year. Each benchmark must also include a script to randomly generate benchmark instances based on a random seed. For image classification, this is used to select the images considered. For other benchmarks, it could, for example, perturb the size of the input set or specification.
The purpose of this thread is to present your benchmarks and provide preliminary files to get feedback. Participants can then provide comments, for example suggesting that you simplify the structure of the network or remove unsupported layers.
To propose a new benchmark, please create a public git repository with all the necessary code.
The repository must be structured as follows:
Once this succeeds, please post a link to the completed benchmark site, for example this one.
Here is a link to last year's github issue where we discussed benchmarks.
Here is a link to the folder containing final benchmark files last year.