Benchmarks Discussion #2
Would two onnx network inputs and two onnx network outputs be accepted for the competition? Motivation: currently, most networks evaluated in the literature focus on image classification. However, in many practical scenarios, e.g. autonomous driving, people pay more attention to object detection or semantic segmentation. Considering the complexity of object detection, we propose a new simple Unet (four Conv2d layers, each followed by BN and ReLU). We advocate that tools should handle more practical architectures, and the simplified Unet is a first step towards this goal. Why do we need two onnx network inputs and two onnx network outputs? In image classification, we can easily say that an image is classified correctly or wrongly. In semantic segmentation, however, we need to calculate how many pixels are classified correctly and use a threshold to judge whether the task succeeds. Thus we have to use two onnx network inputs: one is the RGB values of the image as usual, and the other is the ground truth, used for statistical purposes, i.e. counting how many pixels are classified correctly. For the same reason, we have to design two onnx network outputs: one is the logits as usual, and the other is for statistical purposes. Question: it seems impossible to separate the two aspects, since we need to train the semantic segmentation model and also compare its output with the ground truth. How should we modify our benchmark so that more tools can be applied to it? |
Hi Jinyan, Thank you very much for the questions and comments. Verification of semantic segmentation networks remains challenging for most current verification techniques due to the huge output space (number of classified pixels). We have worked on this problem and implemented our approach in NNV. I am aware of another method from ETH Zurich. The metrics we have proposed to evaluate the robustness of a network under adversarial attacks such as brightening or darkening attacks are:
Important note!!! robustness is different from accuracy. We don't use ground truth segmented images with 100% accuracy for statistical evaluation of the robustness. The reason is that your trained network cannot achieve 100% accuracy. Therefore, using user-provided ground truth segmented images to evaluate the robustness is inappropriate. The solution is, with one image, we execute the network to get a ground truth output segmented image. Under bounded attack to the input image, we compute all possible classes of all pixels in the segmented output. Then, we compare it with the ground truth image to compute robustness value, robustness sensitivity, and robustness IoU. For the details, you can find our analysis here. (https://drive.google.com/file/d/1eiflLYjRtg4G0s4TglyCq21Z30OEaKFj/view) Another note is that, for upsampling, please use dilated or transposed convolution, don't use un-max-pooling. Tran |
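To make the statistical step concrete, here is a minimal sketch of how a robustness value could be computed from per-pixel reachable class sets. It assumes the verifier has already computed, for every pixel, the set of classes that pixel can take under the bounded attack; the function name and the 0/1 counting are our simplification of the metrics in the linked analysis, not NNV's actual API.

```python
def robustness_metrics(ground_truth, reachable_classes):
    """ground_truth: one class id per pixel, obtained by running the network
    on the clean image; reachable_classes: one set per pixel, holding every
    class the pixel may take under the bounded attack."""
    robust = [
        gt_cls for gt_cls, classes in zip(ground_truth, reachable_classes)
        if classes == {gt_cls}  # pixel provably keeps its clean-image class
    ]
    return len(robust) / len(ground_truth)

# Example: 4 pixels; the last one may flip between classes 1 and 2,
# so only 3 of 4 pixels are provably robust.
gt = [0, 1, 1, 2]
reach = [{0}, {1}, {1}, {1, 2}]
print(robustness_metrics(gt, reach))  # 0.75
```

The key point from the comment above is visible here: the "ground truth" is the network's own output on the clean image, so the metric measures robustness of the classification, not its accuracy.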
Thank you, Tran. We have one more question. Do we still need two onnx network inputs if we compare the output with the ground truth inside the onnx network? More specifically, we take the original image as input and execute the network to get a ground truth output. Then we replace the user-provided ground truth with this new ground truth. How can we merge the new ground truth and the disturbed images into one input? |
Hi Jinyan, When you have a single image and attack that image with some bounded unknown attack, for example brightening some pixels of the image, then what you get is NOT A SINGLE DISTURBED IMAGE but an infinite (yet bounded) set containing all possible disturbed images. With that, a pixel may be classified into many classes. The robustness verification is to verify whether the class of that pixel changes under the attack. Therefore, what we need to provide to a verification algorithm is the network, an input image, and a model of the attack, not a concrete disturbed image. If you provide a concrete disturbed image and a ground truth, we just need to execute the network on the disturbed image and analyze the result in comparison with the ground truth, as in a normal evaluation of accuracy. This execution-based robustness evaluation is fundamentally different from our verification purpose. Tran |
Hi all, We have a benchmark which we think would be interesting for the competition; however, we are still waiting to hear back from some stakeholders about whether it is okay for us to publish it. We are aware that the deadline for proposing benchmarks is today, but would it be possible to get a few days of extension? I am also not sure how many benchmarks each participant is allowed to publish this year? If it is more than one we are also happy to provide the same fully connected MNIST benchmarks as last year. Best, |
Hi all, I would like to propose the following two benchmarks, both based on CIFAR10 image classification using ResNets: I think we should also try to have the benchmarks ready a bit closer to the agreed-upon deadline than last year, so that all participants have sufficient (and the same) time to make the required adaptations in their tools/parsing code. Regarding the suggestions from @pomodoromjy, I think it would be important that the benchmark conforms to the vnnlib standard and has just one input and output. We would probably also want to make sure that it is not completely out of reach for all tools, to avoid having all tools run the full 6h on that benchmark. What do the other participants think? Cheers, |
Our team aims to propose a benchmark in collaboration with researchers from the non-ML community. We would also greatly appreciate a few days' extension. The rules document says the benchmarks need to be prepared by May 31. Is that the final hard deadline for benchmark submission (or maybe a few days before that)? The rules document also says that each team can provide up to 2 benchmarks (one benchmark may be selected from a non-participant's proposal and should not duplicate the other benchmark). I hope there will be more benchmarks from non-participants in the next few days for selection. And yes, I agree with @mnmueller that benchmarks need to conform to the vnnlib format. The format is indeed a bit restrictive and may be difficult to apply to some application scenarios (we had to work around this as well), but I guess for this year's competition we should keep it. |
Hi all, We want to submit a Neural Network for Systems (NN4Sys) benchmark (teaming up with Huan/Kaidi's team). We're almost there---we're tuning our specification difficulties for a more balanced benchmark. This benchmark will include four NN models: two from learned database indexes (see this paper) and two from cardinality estimation (see this paper). We will append the network architectures shortly. We design the specifications to reflect the semantic correctness of these system components. In particular, for the learned index, the spec requires the predicted data positions to be close to the actual data positions. For cardinality estimation, the spec requires the predicted cardinality to be accurate, namely within an error bound of the number of rows matching the SQL statement. |
Hi all. Yes the benchmark deadline on the website was updated to May 31, 2022, to match the rules document. Please try to stick with this date as participants need time to ensure their tools will run on all the benchmarks. |
Hi all, Regarding @mnmueller and @huanzhang12 's suggestions, we will try some new approaches for our scenario. A simple specification file would check that a single pixel of the output did not change classes. Could a more complicated specification file check that a certain percentage of the output pixels did not change classes? Best |
Please note that the website for the benchmark tests is currently experiencing some issues. I will investigate this tomorrow and post an update once it's fixed. Sorry for the inconvenience! |
The bug has been fixed; submitting benchmarks should work fine now. If you notice that your code is stuck (no change in the log files after several minutes), you now have the option to abort the execution and submit it again. |
Hi, We want to propose the following benchmarks: mnist_fc: https://vnncomp.christopher-brix.de/benchmark/details/53 The first benchmark is the same as proposed last year: three mnist fully connected networks with 2, 4 and 6 layers. The second benchmark verifies a CIFAR fully convolutional network against bias field perturbations. Bias field verification is explained in detail here. Quickly summarised: a bias field models a smooth noise and is, in this case, represented by a spatially varying 3rd-order polynomial. The bias field transformation is encoded by prepending a fully connected layer to the cifar network; the FC layer takes the 16 coefficients of the bias field polynomial as input. The verification problem considers perturbations of these coefficients. We do the encoding in generate_properties.py, thus any toolkit supporting the standard l-infinity specifications should also support this problem "out-of-the-box". We note that the encoding depends on the input image, so the random draws in this benchmark affect the onnx models generated by generate_properties.py, not the vnnlib files. |
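As a sketch of how such an encoding could work (our own illustration, not the benchmark's actual generate_properties.py): assume a multiplicative bias field built from a 4×4 tensor-product polynomial basis over normalized pixel coordinates, which yields exactly 16 coefficients. Because the clean image is fixed, applying the field to it is linear in the coefficients, so it can be expressed as a single fully connected layer.

```python
import numpy as np

def bias_field_layer(image):
    """Return (W, b) for a fully connected layer mapping the 16 polynomial
    coefficients to the bias-field-perturbed image (flattened). Assumes a
    multiplicative field B(x, y) = sum_{i,j<=3} c_{ij} x^i y^j applied to a
    fixed input image; this construction is our guess at the encoding."""
    h, w, ch = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    xs = xs / (w - 1) * 2 - 1  # normalize coordinates to [-1, 1]
    ys = ys / (h - 1) * 2 - 1
    # 16 basis functions x^i * y^j for i, j in {0, 1, 2, 3}
    basis = np.stack([(xs ** i) * (ys ** j)
                      for i in range(4) for j in range(4)], axis=-1)
    # Output pixel = image pixel * B(x, y), linear in the coefficients c:
    W = (image[..., None] * basis[:, :, None, :]).reshape(-1, 16)
    b = np.zeros(W.shape[0])
    return W, b

W, b = bias_field_layer(np.ones((32, 32, 3)))
print(W.shape)  # (3072, 16)
```

With this layer prepended, an l-infinity box over the 16 coefficients becomes a standard input specification, which matches the "out-of-the-box" claim above. Note that the first basis function is the constant 1, so the coefficient vector (1, 0, …, 0) reproduces the unperturbed image.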
Can you please clarify the following:
Thanks! |
@regkirov 1. Yes, the total sum over all instances should not be more than 6 hours. 2. Yes, please have an instances.csv file. Note this can be created with some randomness based on a seed passed in to the generation script that you provide (all benchmarks need this script this year). |
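For concreteness, the seeded generation could look like the following sketch. The file names (`onnx/net.onnx`, the vnnlib paths), the epsilon values, and the three-column (network, property, timeout) layout are our assumptions for illustration, not an official template.

```python
import csv
import random

def generate(seed, n_instances=3):
    """Build instances.csv rows; every random choice is drawn from a
    generator seeded with the passed-in seed, so results are reproducible."""
    rng = random.Random(seed)
    rows = []
    for i in range(n_instances):
        image_id = rng.randrange(10000)       # e.g. which image to verify
        epsilon = rng.choice([0.004, 0.008])  # e.g. perturbation radius
        spec = f"vnnlib/prop_{i}_img{image_id}_eps{epsilon}.vnnlib"
        # ... the actual (assert ...) constraints would be written to `spec` ...
        rows.append(["onnx/net.onnx", spec, 300])
    return rows

rows = generate(seed=0)
with open("instances.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```

The important property is determinism: running the script twice with the same seed must produce the same instances.csv, while different seeds vary the selected images and perturbations.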
Hello! I have the following error in the logs during benchmark submission:
Some inputs to our benchmark NN are Boolean. Does VNNLib only support |
@regkirov I think it's more of a restriction of the tools. I don't think any tools from last year would support such specs, as with booleans you could perhaps encode SAT problems. I'll note this down as an extension we can discuss. For now, would it make sense to encode them as reals between 0 and 1? |
Thanks @stanleybak Yes, for now it would not be a problem to encode all inputs as reals - I will do that and resubmit. In general, this could be a good extension, for example, to avoid having CEXs that are not realistic. Currently, for NNs with boolean/binary inputs, one would need to check every CEX against integrality constraints on such inputs. In other words, if a tool returns a CEX with the value, say, 0.4 for some input that can only be 0 or 1, this would be valid for the property spec where every input is Real, but not meaningful for the model, because values strictly between 0 and 1 can never occur. |
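The workaround discussed above has two parts: declare each boolean input as a real in [0, 1] in the vnnlib file, and check any returned counterexample against integrality afterwards. A sketch (the helper names and tolerance are ours; the `(assert ...)` lines follow the usual vnnlib `X_i` convention):

```python
def bool_input_bounds(indices):
    """Emit vnnlib lines relaxing boolean inputs to reals in [0, 1]."""
    lines = []
    for i in indices:
        lines.append(f"(assert (>= X_{i} 0.0))")
        lines.append(f"(assert (<= X_{i} 1.0))")
    return "\n".join(lines)

def cex_is_realistic(cex, bool_indices, tol=1e-6):
    """Reject counterexamples that assign fractional values to boolean
    inputs, e.g. 0.4 for an input that can only be 0 or 1."""
    return all(min(abs(cex[i]), abs(cex[i] - 1.0)) <= tol
               for i in bool_indices)

print(bool_input_bounds([0]))
print(cex_is_realistic({0: 0.4}, [0]))  # False: 0.4 is not a valid boolean
```

A counterexample that fails the integrality check is still valid for the relaxed spec, so the property's status remains "sat"; it just cannot be mapped back to a realistic model input.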
@ChristopherBrix @stanleybak Our benchmark includes several NNs with different input sizes. Due to this difference, the properties defined for one NN cannot be run on another NN. As discussed above, we provide an instances.csv file that maps vnnlib specs to the corresponding NNs. However, in the logs I see a mismatch. Does the automated procedure consider the instances.csv file? Is there some naming convention to be followed? If the csv is not considered, any suggestion on how to pass the check while avoiding the mismatch of properties and NNs? Thanks! |
You're right, instances.csv is currently not used, all properties are checked against every NN. I'll work on that over the weekend and post an update here once that's fixed. |
Hi everyone, I would like to propose a TLL NN benchmark based on the networks we used in a couple of our recent papers (e.g. FastBATLLNN). I have a repository with this benchmark, but the automatic validation currently fails with error For some additional, specific context: this error occurs while trying to unzip an ONNX network using Python's Assuming there isn't a bug in my code: is the disk space on the instance a hard constraint? If so -- and assuming there's sufficient interest in this benchmark -- then there are a few ways I can adapt our benchmark to require less disk space. For example, the benchmark is currently configured to generate one property per network, with the number of networks determined solely by the timeout -- I could easily generate multiple properties per network instead, and simply decompress fewer networks. Considering only the smaller network sizes is another possibility: the repository contains many more networks of each size than are matched to properties in the current benchmark. Of course I'm open to other ideas or modifications to help include this benchmark if there's interest. Thanks in advance, |
@regkirov The I've also fixed a bug with protobuf, a new version caused everything to fail. If you tried to submit in the last days and got an error message complaining about the protobuf version, please try again. |
@stanleybak Our benchmark (Collins Aerospace) passed the automated check, following the instruction I am posting the link here: https://vnncomp.christopher-brix.de/benchmark/details/72 Do you plan to consolidate all benchmarks in a single repo? Is there a hard deadline, after which the benchmark materials can no longer be updated? (e.g., timeouts, description, models, etc.) |
Hi all, The benchmark submission deadline was extended one week to June 7th. Please try to make this deadline as no further extensions will be possible. The tool submission will also be moved back one week so that participants have the expected amount of time to run their tools on the benchmarks. |
Hi everyone, I'm about to propose a few benchmarks (involving collision avoidance, some reinforcement learning benchmark problems) -- what does instances.csv need to contain? |
@vkrish1 the |
FYI: the AWS validation instance is suffering from the recent incompatibility bug between tensorflow and protobuf. Basically, protobuf >=3.20 introduces changes that break TF compatibility, so attempting to import TF will fail. See e.g. tensorflow/tensorflow#56077 and the release notes from TF 2.9.1 https://github.com/tensorflow/tensorflow/releases. This appears to prevent The solution is to use TF==2.9.1/2.8.2/2.6.5 etc., which should install protobuf < 3.20 as a dependency, or manually install protobuf==3.19.3 first, since it won't be upgraded by pip. I'm also still curious about the disk space available on the instance. Have there been any deliberations about that? |
Hi all, we'd like to propose a few benchmarks: a collision-avoidance task and two from OpenAI's Gym
Dubins Rejoin environment details:
|
@ChristopherBrix Any idea for this? Hopefully it's easy to get a bigger hard drive allocated. Also: Could you help add a pip package for the benchmark generation? I'm trying to finalize a benchmark based on vggnet16, but I need to |
Hi all, here's the current list of benchmarks I see from the website:
If there are any others, please post a comment here describing your benchmark and try to run it on the submission site. Given the issues with the benchmark submission site (see the previous 3-5 comments), I think we can extend submissions until Friday. If things aren't fixed by then, I'll manually go through the benchmarks to add them. |
Hi @huanzhang12 Would it be possible to remove the part of the model related to reshaping at the start of lindex.onnx and lindex_deep.onnx? It does not seem to serve any purpose when the input is already in the correct shape? Thank you @Cli212. @ChristopherBrix, could you please update the models in the main repo? Thanks! |
Benchmark The error:
|
@yodarocks1 Yes, sorry, I thought I fixed this but must not have pushed the change. The updated repo now uses 600 secs per instance. @ChristopherBrix could you update this one? https://github.com/stanleybak/reach_probability_benchmark2022 |
@pat676 @ChristopherBrix We updated the NN4sys benchmarks in response to #2 (comment) and #2 (comment). We've updated both onnx models and vnnlib files in our repository: https://github.com/Cli212/VNNComp22_NN4Sys |
@ChristopherBrix @stanleybak Another thing I am not completely sure about: will all the benchmarks in https://github.com/ChristopherBrix/vnncomp2022_benchmarks/tree/main/benchmarks be used for the competition? According to the competition rules, each team may propose up to two benchmarks, and one of them can be selected from an outside group. It is unclear to me which benchmarks are from participants and which are from outside groups. I haven’t seen anyone nominating a benchmark yet. |
@huanzhang12 We'll do a matching step to ensure that each benchmark has a nominating participant. I don't think we'll need to exclude any for this reason, as no one proposed an excessive number of benchmarks. |
I'll update NN4sys tomorrow. I'll need to find an alternative to LFS, as that doesn't scale due to GitHub quota limitations. But I guess I can just push the files to my university's file storage; I just need to implement that. |
Hi @ChristopherBrix @stanleybak , it looks like vggnet16_2022 still has nothing under |
Another issue is that in vnncomp2022_benchmarks, all the onnx models are stored in |
You need to run the |
Hi @ChristopherBrix just double checking. See the message above - a fix has been added to the collins_rul_cnn benchmark, but the version in vnncomp22_benchmarks has not been updated yet. Can you check? Thanks! |
Hi all, For the competition, do we have a pre-processing phase, not counted as part of the timeout, for loading the vnnlib specifications and converting the model to the correct format for our toolkit? The reason I'm asking is that the nn4sys benchmarks have some massive vnnlib files that take several minutes to parse with our implementation, and it takes us ~40 seconds to convert the vggnet16 from onnx to PyTorch. Also, the rl_benchmarks have some very short timeouts (1 second), in which case any pre-processing time may be significant. |
I think we don't want to use just a few seconds as the timeout, due to the time variance and unknown startup overhead of each tool. Maybe we should set a minimum timeout threshold like 20 or 30 seconds? @vkrish1 you mentioned you had updated timeouts in a previous comment #2 (comment), but it seems there are still many instances in rl_benchmark with timeouts of a few seconds? Could you double check? Thanks. Last year we had a "prepare_instance.sh" timeout, which was 30 seconds if I recall correctly. We probably need to increase the preparation timeout this year, maybe to 100 seconds (or more)? |
Hi @ChristopherBrix, The setup.sh file does not seem to fetch the mscn_2048d_dual.onnx model of the nn4sys benchmark yet. |
Yes, there is a As for small timeouts... I think 1 second may be a little bit small as well... hopefully we can update these to be at least 30 seconds. |
Hi @Cli212 There seems to be an issue with the ends parameter of Slice_9 in the mscn_128d_dual.onnx model. It is very large and leads to overflow errors in my onnxparser. Could you have a look at this? Thanks! |
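One likely cause (our guess, not confirmed for this model): ONNX exporters commonly emit Slice `ends` values of INT64_MAX to mean "until the end of the axis", so a parser can clamp `ends` to the actual axis length instead of letting it overflow. A minimal sketch:

```python
INT64_MAX = 2**63 - 1  # sentinel many exporters use for "slice to the end"

def clamp_slice(starts, ends, dims):
    """Clamp per-axis slice end indices to the axis length so downstream
    integer arithmetic cannot overflow."""
    return [(s, min(e, d)) for s, e, d in zip(starts, ends, dims)]

print(clamp_slice([0], [INT64_MAX], [128]))  # [(0, 128)]
```

Negative indices and steps would need the same normalization in a full parser; this sketch only covers the oversized-`ends` case reported above.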
My rl-benchmarks have been updated with larger timeouts (at least 30sec) - passed the tests and the repo was updated. |
Hi @naizhengtan, @Cli212, @ChristopherBrix. I am still having trouble with the nn4sys benchmark as mentioned in my comments above.
Is anyone else having trouble with this? If so, I believe we should consider removing these two models from the benchmark at this point. |
Sorry, I'll look into the first point and fix that right away |
The setup script should now work correctly, too. |
Thank you, got the model now. For my second problem, the bug was on my side. If anyone else is facing a similar issue, it seems like torch.LongTensor throws an overflow error in certain situations when dealing with INT_MAX, but reducing it to |
This was fixed some time ago, just forgot to post it here.
I'll do that tonight. |
@stanleybak Actually, you already updated the prepare instance timeout to 5 minutes last year (https://github.com/stanleybak/vnncomp2021/blob/main/run_single_instance.sh#L49). I just copied your code, so that should be fine. |
For unet_upsample_small.onnx in the carvana_unet_2022 benchmark, did anyone face issues with the Pad_113 layer: "Sizes of tensors must match except in dimension 1. Expected size 1 but got size 3 for tensor number 1 in the list"? Based on the torch.nn.functional.pad(input, pad, mode='constant', value=None) → Tensor signature, I found:
Am I missing something? Any help would be appreciated! Thanks! |
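A common source of such size-mismatch errors in UNet-style models is that the upsampled tensor and its skip connection differ in spatial size just before concatenation; the usual fix pads the smaller tensor to match. This sketch (shapes are illustrative, not taken from unet_upsample_small.onnx) computes the F.pad-style padding tuple:

```python
def align_pad(small_hw, big_hw):
    """Return (left, right, top, bottom) padding, in F.pad order for the
    last two dimensions, that grows a tensor of spatial size small_hw to
    match big_hw before torch.cat along the channel dimension."""
    dh = big_hw[0] - small_hw[0]
    dw = big_hw[1] - small_hw[1]
    return (dw // 2, dw - dw // 2, dh // 2, dh - dh // 2)

print(align_pad((31, 31), (32, 32)))  # (0, 1, 0, 1)
```

Note that torch.nn.functional.pad consumes the pad tuple from the last dimension backwards, so the first pair pads width and the second pads height; if the reported mismatch is actually in the channel dimension, padding will not help and the concatenated tensors' channel counts need to be checked instead.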
Hi Piyush, Have you tried running evaluate_network.py? You can set onnxpath to unet_upsample_small.onnx and suceeds_mask to upsample_net_pre_mask. It may demonstrate how the network works. Yonggang |
Thank you, Yonggang! |
Both tool participants and outsiders such as industry partners can propose benchmarks. All benchmarks must be in .onnx format and use .vnnlib specifications, as was done last year. Each benchmark must also include a script to randomly generate benchmark instances based on a random seed. For image classification, this is used to select the images considered. For other benchmarks, it could, for example, perturb the size of the input set or specification.
The purpose of this thread is to present your benchmarks and provide preliminary files to get feedback. Participants can then provide comments, for example suggesting that you simplify the structure of the network or remove unsupported layers.
To propose a new benchmark, please create a public git repository with all the necessary code.
The repository must be structured as follows:
Once this succeeds, please post a link to the completed benchmark site, for example this one.
Here is a link to last year's github issue where we discussed benchmarks.
Here is a link to the folder containing final benchmark files last year.