Benchmarks Discussion #2

Open · stanleybak opened this issue Apr 15, 2022 · 87 comments

Comments

@stanleybak
Copy link
Owner
stanleybak commented Apr 15, 2022

Both tool participants and outsiders such as industry partners can propose benchmarks. All benchmarks must be in .onnx format and use .vnnlib specifications, as was done last year. Each benchmark must also include a script to randomly generate benchmark instances based on a random seed. For image classification, this is used to select the images considered. For other benchmarks, it could, for example, perturb the size of the input set or specification.

The purpose of this thread is to present your benchmarks and provide preliminary files for feedback. Participants can then provide comments, for example suggesting that you simplify the structure of the network or remove unsupported layers.

To propose a new benchmark, please create a public git repository with all the necessary code.
The repository must be structured as follows:

  • It must contain a generate_properties.py file which accepts the seed as the only command line parameter.
  • There must be a folder with all .vnnlib files, which may be identical to the folder containing the generate_properties.py file.
  • There must be a folder with all .onnx files, which may be identical to the folder containing the generate_properties.py file.
  • The generate_properties.py file will be run using Python 3.8 on a t2.large AWS instance. (see https://vnncomp.christopher-brix.de/)
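
For illustration, a minimal generate_properties.py could look like the sketch below; the file names, the toy single-input property, and the instance count are placeholder assumptions, not part of the requirements above.

    import csv
    import os
    import random
    import sys

    def write_vnnlib(path, lo, hi):
        # toy single-input, single-output spec: an input box plus a violation condition on Y_0
        with open(path, "w") as f:
            f.write("(declare-const X_0 Real)\n(declare-const Y_0 Real)\n")
            f.write(f"(assert (>= X_0 {lo}))\n(assert (<= X_0 {hi}))\n")
            f.write("(assert (>= Y_0 0.0))\n")

    def main():
        seed = int(sys.argv[1])              # the seed is the only command line parameter
        random.seed(seed)
        os.makedirs("vnnlib", exist_ok=True)
        rows = []
        for i in range(10):                  # e.g. 10 randomly drawn instances
            c = random.uniform(-1.0, 1.0)
            spec = f"vnnlib/prop_{i}.vnnlib"
            write_vnnlib(spec, c - 0.01, c + 0.01)
            rows.append(["onnx/net.onnx", spec, 300])   # 300 s timeout per instance
        with open("instances.csv", "w", newline="") as f:
            csv.writer(f).writerows(rows)

    if __name__ == "__main__":
        main()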

Once this succeeds, please post a link to the completed benchmark site, for example this one.

Here is a link to last year's github issue where we discussed benchmarks.
Here is a link to the folder containing final benchmark files last year.

@pomodoromjy
Copy link

Would two onnx network inputs and two onnx network outputs be accepted for the competition?

Motivation

Currently, most networks evaluated in the literature focus on image classification. However, in many practical scenarios, e.g. autonomous driving, people pay more attention to object detection or semantic segmentation. Considering the complexity of object detection, we propose a new, simple Unet (four Conv2d layers, each followed by BN and ReLU). We advocate that tools should handle more practical architectures, and the simplified Unet is a first step towards this goal.

Why do we need two onnx network inputs and two onnx network outputs?

In an image classification task, we can easily say whether an image is classified correctly or wrongly. However, in a semantic segmentation task, we need to calculate how many pixels are classified correctly and use a threshold to judge whether the task succeeds. Thus we have to use two onnx network inputs: one input is the RGB values of the image as usual, and the other is the ground truth, which is used for statistical purposes, i.e., counting how many pixels are classified correctly. For the same reason, we have to design two onnx network outputs: one is the logits as usual, and the other is for statistical purposes.


Question

It seems impossible to separate the two aspects, since we need to train the semantic segmentation model and we also need to compare its output with the ground truth. How should we modify our benchmark so that more tools can be applied to it?

@trhoangdung
Copy link

Hi Jinyan,

Thank you very much for the questions and comments. Verification of semantic segmentation networks remains challenging for most verification techniques due to the huge output space (the number of classified pixels). We have worked on this problem and on its implementation in NNV. I am aware of another method from ETH Zurich. The metrics we have proposed to evaluate the robustness of a network under adversarial attacks such as brightening or darkening are:

  1. Robustness value: the percentage of robust pixels under an attack
  2. Robustness sensitivity: the average number of pixels affected (misclassified) if one pixel is attacked
  3. Robustness IoU: robust IoU, similar to the IoU used in training SSNs

Important note: robustness is different from accuracy. We don't use user-provided ground truth segmented images (with 100% accuracy) for the statistical evaluation of robustness. The reason is that your trained network cannot achieve 100% accuracy, so using user-provided ground truth segmented images to evaluate robustness is inappropriate. The solution is: given one image, we execute the network to obtain a ground truth output segmented image. Under a bounded attack on the input image, we compute all possible classes of all pixels in the segmented output. Then we compare it with that ground truth image to compute the robustness value, robustness sensitivity, and robustness IoU.

For the details, you can find our analysis here. (https://drive.google.com/file/d/1eiflLYjRtg4G0s4TglyCq21Z30OEaKFj/view)
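
As a purely illustrative sketch (not taken from NNV), the robustness value could be computed from per-pixel reachable class sets as follows, assuming a verifier has already produced those sets.

    def robustness_value(reachable_classes, nominal_output):
        # reachable_classes: H x W nested lists of sets of classes possible under the attack
        # nominal_output: H x W integer array (e.g. numpy) of classes predicted on the
        # unattacked image, used as the "ground truth" segmentation as described above
        h, w = nominal_output.shape
        robust = sum(
            1
            for i in range(h)
            for j in range(w)
            if reachable_classes[i][j] == {int(nominal_output[i, j])}
        )
        return robust / (h * w)  # fraction of pixels whose class provably cannot change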

Another note: for upsampling, please use dilated or transposed convolution; don't use max-unpooling.

Tran

@pomodoromjy
Copy link

Thank you, Tran.

We have one more question. Do we still need two onnx network inputs if we compare the output with the ground truth inside the onnx network? More specifically, we take the original image as input and execute the network to get a ground truth output. Then we replace the user-provided ground truth with this new ground truth. How can we merge the new ground truth and the disturbed images into one input?

@trhoangdung
Copy link

Hi Jinyan,

When you have a single image and attack that image by some bounded unknown attack, for example brightening some pixels of the image, then what you get is NOT A SINGLE DISTURBED IMAGE but an infinite (yet bounded) set containing all possible disturbed images. With that, a pixel may be classified into many classes. The robustness verification is to verify whether the class of that pixel changes under the attack.

Therefore, what we need to provide to a verification algorithm is the network, an input image, and a model of the attack, not a concrete disturbed image. If you provide a concrete disturbed image and a ground truth, we just need to execute the network on the disturbed image and compare the result with the ground truth, as in a normal evaluation of accuracy. This execution-based robustness evaluation is fundamentally different from our verification purpose.

Tran
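
To make the "model of the attack" idea concrete: a bounded brightening attack can be handed to a verifier as an input set rather than a single image. A toy sketch (function and parameter names are illustrative, not from any tool):

    import numpy as np

    def brightening_input_set(image, delta, mask=None):
        # image: array with values in [0, 1]; delta: maximum brightening amount;
        # mask: boolean array marking the pixels that may be attacked (default: all)
        mask = np.ones(image.shape, dtype=bool) if mask is None else mask
        lb = image.copy()
        ub = np.where(mask, np.clip(image + delta, 0.0, 1.0), image)
        return lb, ub  # the verifier reasons over the whole box [lb, ub], not one sample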

@pat676
Copy link
pat676 commented May 2, 2022

Hi all,

We have a benchmark which we think would be interesting for the competition; however, we are still waiting to hear back from some stakeholders about whether it is okay for us to publish it. We are aware that the deadline for proposing benchmarks is today, but would it be possible to get a few days of extension?

I am also not sure how many benchmarks each participant is allowed to publish this year. If it is more than one, we are also happy to provide the same fully connected MNIST benchmarks as last year.

Best,
Patrick

@mnmueller
Copy link
mnmueller commented May 2, 2022

Hi all,

I would like to propose the following two benchmarks, both based on CIFAR10 Image Classification using ResNets:
https://vnncomp.christopher-brix.de/benchmark/details/28
https://vnncomp.christopher-brix.de/benchmark/details/27
https://vnncomp.christopher-brix.de/benchmark/details/74
https://vnncomp.christopher-brix.de/benchmark/details/75
Edit: There was a typo in the epsilon for these, leading to far too hard properties. I updated them accordingly.

I think we should also try to have the benchmarks ready a bit closer to the agreed-upon deadline than last year, so that all participants have sufficient (and the same) time to make the required adaptations in their tools/parsing code.

Regarding the suggestions from @pomodoromjy, I think it would be important that the benchmark conforms to the vnnlib standard and has just one input and one output. We would probably also want to make sure that it is not completely out of reach for all tools, to avoid having every tool run the full 6h on that benchmark.

What do the other participants think?

Cheers,
Mark

@huanzhang12
Copy link

Our team aims to propose a benchmark in collaboration with researchers from outside the ML community. We would also greatly appreciate it if we could have a few days' extension. The rules document says the benchmarks need to be prepared by May 31. Is that the final hard deadline for benchmark submission (or maybe a few days before that)? The rules document also says that each team can provide up to 2 benchmarks (one benchmark may be selected from a non-participant's proposal and should not duplicate the other benchmark). I hope there will be more benchmarks from non-participants in the next few days for selection.

And yes, I agree with @mnmueller that benchmarks need to conform to the vnnlib format. The format is indeed a bit restrictive and may be difficult to apply to some application scenarios (we had to work around this as well), but I guess for this year's competition we should keep this format.

@naizhengtan
Copy link

Hi all,

We want to submit a Neural Network for Systems (NN4Sys) benchmark (team up with Huan/Kaidi's team). We're almost there---we're tuning our specification difficulties for a more balanced benchmark.

This benchmark will include four NN models: two from the learned database index (see this paper) and two from cardinality estimation (see this paper). We will append network architectures shortly.

We design the specifications to reflect the semantic correctness of these system components. In particular, for the learned index, the spec requires the predicted data positions to be close to the actual data positions. For cardinality estimation, the spec requires the predicted cardinality to be accurate, namely within an error bound of the number of rows matching the SQL statement.
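
Purely as an illustration of this kind of spec (the actual NN4Sys files may be structured differently), an output error bound around a target value can be written in vnnlib as a violation condition:

    def write_error_bound_spec(path, input_bounds, target, err):
        # input_bounds: list of (lower, upper) pairs, one per input variable
        lines = [f"(declare-const X_{i} Real)" for i in range(len(input_bounds))]
        lines.append("(declare-const Y_0 Real)")
        for i, (lo, hi) in enumerate(input_bounds):
            lines += [f"(assert (>= X_{i} {lo}))", f"(assert (<= X_{i} {hi}))"]
        # violation condition: the prediction leaves the allowed error band around the target
        lines.append(
            f"(assert (or (and (<= Y_0 {target - err})) (and (>= Y_0 {target + err}))))"
        )
        with open(path, "w") as f:
            f.write("\n".join(lines) + "\n")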

@stanleybak
Copy link
Owner Author

Hi all. Yes, the benchmark deadline on the website was updated to May 31, 2022, to match the rules document. Please try to stick to this date, as participants need time to ensure their tools will run on all the benchmarks.

@pomodoromjy
Copy link

Hi all,

Regarding @mnmueller's and @huanzhang12's suggestions, we will try some new approaches for our scenario.

A simple specification file would check that a single pixel of the output did not change class. Could a more complicated specification file check that a given percentage of the output pixels did not change class?
@trhoangdung, do you know the answer? Is it possible to compute robustness value, robustness sensitivity, and robustness IoU using vnnlib? How would this look?

Best
Jinyan

@ChristopherBrix
Copy link

Please note that the website for the benchmark tests is currently experiencing some issues. I will investigate this tomorrow and post an update once it's fixed. Sorry for the inconvenience!

@ChristopherBrix
Copy link

The bug has been fixed; submitting benchmarks should work fine now. If you notice that your code is stuck (no change in the log files after several minutes), you now have the option to abort the execution and submit it again.

@pat676
Copy link
pat676 commented May 12, 2022

Hi,

We want to propose the following benchmarks:

mnist_fc: https://vnncomp.christopher-brix.de/benchmark/details/53
cifar_biasfield: https://vnncomp.christopher-brix.de/benchmark/details/55

The first benchmark is the same as proposed last year: three MNIST fully connected networks with 2, 4 and 6 layers.

The second benchmark verifies a CIFAR fully convolutional network against bias field perturbations. Bias field verification is explained in detail here. Briefly summarised: a bias field models smooth noise and is, in this case, represented by a spatially varying 3rd-order polynomial.

The bias field transformation is encoded by prepending a fully connected layer to the CIFAR network; the FC layer takes the 16 coefficients of the bias field polynomial as input. The verification problem considers perturbations of these coefficients. We do the encoding in generate_properties.py, so any toolkit supporting the standard l-infinity specifications should also support this problem "out of the box".

We note that the encoding depends on the input image, so the random draws in this benchmark affect the onnx models generated by generate_properties.py, not the vnnlib files.
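
A rough sketch of that encoding idea, assuming a multiplicative bias field with 16 polynomial basis terms (the benchmark's actual construction may differ in its details):

    import numpy as np

    def bias_field_fc_weights(image_flat, basis):
        # image_flat: (P,) flattened image; basis: (P, 16) polynomial basis terms per pixel.
        # The prepended fully connected layer maps the 16 bias-field coefficients c to the
        # biased image: out[p] = image_flat[p] * sum_k c[k] * basis[p, k], so its weight
        # matrix is simply image_flat[:, None] * basis, with zero bias.
        return image_flat[:, None] * basis  # shape (P, 16)

    # Perturbing the 16 coefficients then becomes a standard box-constrained spec over a
    # 16-dimensional input, which standard l-infinity tooling already supports.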

@regkirov
Copy link
regkirov commented May 19, 2022

Can you please clarify the following:

  1. Should the timeouts for the properties of a single benchmark sum up to 6 hours, or less?
  2. Is the *_instances.csv file also required, where one specifies the mapping between properties and NNs, as well as the timeout for each entry?

Thanks!

@stanleybak
Copy link
Owner Author

@regkirov 1. Yes, the total sum over all instances should not be more than 6 hours. 2. Yes, please have an instances.csv file. Note that it can be created with some randomness based on a seed passed to the generation script that you provide (all benchmarks need such a script this year).
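
A small helper to sanity-check the budget (illustrative only; the submission site performs its own check, which later in this thread also enforces a lower bound of three hours):

    import csv

    def total_timeout(path="instances.csv"):
        # sum the per-instance timeouts (third column) of an instances.csv file
        with open(path) as f:
            return sum(float(row[2]) for row in csv.reader(f) if row)

    assert total_timeout() <= 6 * 60 * 60, "total timeout over all instances exceeds 6 hours"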

@regkirov
Copy link

Hello! I have the following error in the logs during benchmark submission:

AssertionError: failed parsing line: (declare-const X_9 Bool)

Some inputs to our benchmark NN are Boolean. Does VNNLib only support the Real type?

@stanleybak
Copy link
Owner Author

@regkirov I think it's more of a restriction of the tools. I don't think any tools from last year would support such specs, since with booleans you could perhaps encode SAT problems. I'll note this down as an extension we can discuss. For now, would it make sense to encode them as reals between 0 and 1?

@regkirov
Copy link

Thanks @stanleybak. Yes, for now it would not be a problem to encode all inputs as reals - I will do that and resubmit.

In general, this could be a good extension, for example to avoid counterexamples (CEX) that are not realistic. Currently, for NNs with boolean/binary inputs, one would need to check every CEX against the integrality constraints on such inputs. In other words, if a tool returns a CEX with the value, say, 0.4 for some input that can only be 0 or 1, this would be valid for a property spec where every input is Real, but not meaningful for the model, because values between 0 and 1 can never occur.
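
A counterexample post-check of the kind described above could look like this sketch (not part of any tool; names are illustrative):

    def respects_boolean_inputs(cex, boolean_indices, tol=1e-6):
        # cex: sequence of input values returned by a tool; boolean_indices: positions of
        # inputs that are encoded as Reals in [0, 1] but must really be 0 or 1
        return all(min(abs(cex[i]), abs(cex[i] - 1.0)) <= tol for i in boolean_indices)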

@regkirov
Copy link

@ChristopherBrix @stanleybak Our benchmark includes several NNs with different input sizes. Due to this difference, the properties defined for one NN cannot be run on another NN. As discussed above, we provide an instances.csv file that maps vnnlib specs to the corresponding NNs. However, in the logs I see a mismatch. Does the automated procedure consider the instances.csv file? Is there a naming convention to be followed? If the csv is not considered, any suggestion on how to pass the check while avoiding the mismatch of properties and NNs? Thanks!

@ChristopherBrix
Copy link

You're right, instances.csv is currently not used, all properties are checked against every NN. I'll work on that over the weekend and post an update here once that's fixed.

@jferlez
Copy link
jferlez commented May 25, 2022

Hi everyone,

I would like to propose a TLL NN benchmark based on the networks we used in a couple of our recent papers (e.g. FastBATLLNN).

I have a repository with this benchmark, but the automatic validation currently fails with the error OSError: [Errno 28] No space left on device. I suspect the problem is the size of the ONNX networks in the benchmark, since in uncompressed form they occupy ~2.5 GB.

For some additional, specific context: this error occurs while trying to unzip an ONNX network using Python's zipfile package. In an offline conversation, @stanleybak suggested that I use generate_properties.py to unzip each ONNX file as needed during property generation. Since they compress very well, this allows them to fit easily within the standard GitHub file-size limits. (The largest ONNX file is ~300 MB uncompressed vs. ~1.5 MB in zip form.)
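
A sketch of that on-demand decompression (file and directory names are placeholders; it assumes archives named <network>.zip each containing <network>.onnx):

    import zipfile
    from pathlib import Path

    def ensure_unzipped(zip_path, out_dir="onnx"):
        # Decompress one network only when a property is generated for it,
        # keeping disk usage far below the ~2.5 GB of all uncompressed networks.
        out = Path(out_dir)
        out.mkdir(exist_ok=True)
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(out)
        return out / (Path(zip_path).stem + ".onnx")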

Assuming there isn't a bug in my code: is the disk space on the instance a hard constraint?

If so -- and assuming there's sufficient interest in this benchmark -- then there are a few ways I can adapt our benchmark to require less disk space. For example, the benchmark is currently configured to generate one property per network, with the number of networks determined solely by the timeout -- I could easily generate multiple properties per network instead, and simply decompress fewer networks. Considering only the smaller network sizes is another possibility: the repository contains many more networks of each size than are matched to properties in the current benchmark.

Of course I'm open to other ideas or modifications to help include this benchmark if there's interest.

Thanks in advance,
James

@ChristopherBrix
Copy link

@regkirov The instances.csv file is now used to determine which combinations of models and properties to check. At the moment this file is expected to exist in the scripts directory; I'm working on making this configurable.
Once that's done, I'll check all benchmarks that have been submitted so far to make sure that their instances.csv files are correct.

I've also fixed a bug with protobuf; a new version caused everything to fail. If you tried to submit in the last few days and got an error message complaining about the protobuf version, please try again.

@regkirov
Copy link
regkirov commented May 31, 2022

@stanleybak Our benchmark (Collins Aerospace) passed the automated check; following the instructions, I am posting the link here: https://vnncomp.christopher-brix.de/benchmark/details/72

Do you plan to consolidate all benchmarks in a single repo? Is there a hard deadline after which the benchmark materials can no longer be updated (e.g., timeouts, description, models, etc.)?

@stanleybak
Copy link
Owner Author

Hi all,

The benchmark submission deadline was extended one week to June 7th. Please try to make this deadline as no further extensions will be possible. The tool submission will also be moved back one week so that participants have the expected amount of time to run their tools on the benchmarks.

@vkrish1
Copy link
vkrish1 commented Jun 1, 2022

Hi everyone, I'm about to propose a few benchmarks (involving collision avoidance and some reinforcement learning benchmark problems) -- what does instances.csv need to contain?
I saw that some benchmarks posted here had a third column (in addition to the vnnlib and onnx filenames) -- is that needed?
Thanks!

@stanleybak
Copy link
Owner Author

@vkrish1 The instances.csv file contains the list of benchmark instances: onnx network, vnnlib spec, and timeout in seconds. The total timeout of all instances should be less than six hours. The generation python script can create the instances.csv file based on a random seed given as a command line argument.
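
For illustration, an instances.csv following this three-column format might look like the lines below (network and spec file names here are placeholders):

    onnx/cartpole.onnx,vnnlib/cartpole_case_0.vnnlib,60
    onnx/cartpole.onnx,vnnlib/cartpole_case_1.vnnlib,60
    onnx/dubins_rejoin.onnx,vnnlib/rejoin_case_0.vnnlib,120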

@jferlez
Copy link
jferlez commented Jun 1, 2022

FYI: the AWS validation instance is suffering from the recent incompatibility between tensorflow and protobuf. Basically, protobuf >= 3.20 introduces changes that break TF compatibility, so attempting to import TF will fail. See e.g. tensorflow/tensorflow#56077 and the release notes for TF 2.9.1: https://github.com/tensorflow/tensorflow/releases.

This appears to prevent import onnx from succeeding as well.

The solution is to use TF==2.9.1/2.8.2/2.6.5 etc., which should install protobuf < 3.20 as a dependency, or manually install protobuf==3.19.3 first, since it won't be upgraded by pip.

I'm also still curious about the disk space available on the instance. Have there been any deliberations about that?

@vkrish1
Copy link
vkrish1 commented Jun 4, 2022

Hi all, we'd like to propose a few benchmarks: a collision-avoidance task and two from OpenAI's Gym.
(Edited to include the two other benchmarks.)

  1. Dubins Rejoin - based on the SafeRL Rejoin Task benchmark (paper reference below). The system represents a flight formation problem: a network is trained via reinforcement learning to guide a wingman aircraft to a certain radius around a lead aircraft. The input and output spaces are both 8-dimensional. The 8 network outputs are translated into tuples of actuators for controlling the wingman (rudder, throttle); each discrete-valued actuator has 4 options.
    The control tuple is determined from the argmax of the first 4 network outputs and the argmax of the last 4, yielding 16 possible combinations. The benchmarks are designed to validate a network output with respect to one or more of the 16 combinations, each of which is a conjunction of 6 terms (i.e. comparing the first actuator's logit to the remaining 3 in the first group and comparing the second actuator's logit to the remaining 3 in the second group). Each benchmark is a disjunction of one or more of these 6-term conjunctions (see the sketch after this list).

Dubins Rejoin environment details:
U. Ravaioli, J. Cunningham, J. McCarroll, V. Gangal, K. Dunlap, and K. Hobbs, “Safe reinforcement learning benchmark environments for aerospace control systems,” in 2022 IEEE Aerospace Conference, IEEE, 2022.

  2. CartPole - based on the OpenAI CartPole-v1 benchmark. The control network was trained using the StableBaselines3 library; the input space is 4-dimensional and the output space is binary.

  3. LunarLander - based on the OpenAI LunarLander-v4 benchmark. The control network was trained using the StableBaselines3 library; the input space is 8-dimensional and the output space is 4-dimensional.
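
A hypothetical sketch of how one such Dubins Rejoin property could be assembled (indices and variable names are illustrative, not taken from the actual benchmark files):

    def rejoin_combination_clause(i, j):
        # conjunction of 6 terms: Y_i is the argmax of Y_0..Y_3 and Y_{4+j} of Y_4..Y_7
        terms = [f"(>= Y_{i} Y_{k})" for k in range(4) if k != i]
        terms += [f"(>= Y_{4 + j} Y_{4 + k})" for k in range(4) if k != j]
        return "(and " + " ".join(terms) + ")"

    def rejoin_property(combinations):
        # disjunction over one or more of the 16 (i, j) actuator combinations
        clauses = " ".join(rejoin_combination_clause(i, j) for i, j in combinations)
        return f"(assert (or {clauses}))"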

@stanleybak
Copy link
Owner Author
stanleybak commented Jun 5, 2022

I'm also still curious about the disk space available on the instance. Have there been any deliberations about that?

@ChristopherBrix Any idea for this? Hopefully it's easy to get a bigger hard drive allocated.

Also: could you help add a pip package for the benchmark generation? I'm trying to finalize a benchmark based on vggnet16, but I need to pip3 install mxnet to generate the benchmarks. The benchmark repo I have is here. Another potential issue is that vggnet is about 500 MB, so I'm not including it in the repo, but instead fetching it with wget as part of generate_benchmarks.py.

@stanleybak
Copy link
Owner Author

Hi all, here's the current list of benchmarks I see from the website:

  1. mnist_fc
  2. Carvana-unet
  3. cifar_biasfield
  4. collins-rul-cnn
  5. SRI_ResNetA / SRI_ResNetB
  6. TLLVerifyBench
  7. dubins-rejoin / cartpole / lunarlander
  8. VGGNET16
  9. NN4Sys

If there are any others, please post a comment here describing your benchmark and try to run it on the submission site. Given the issues with the benchmark submission site (see the previous 3-5 comments), I think we can extend submissions until Friday. If things aren't fixed by then, I'll manually go through the benchmarks to add them.

@pat676
Copy link
pat676 commented Jun 21, 2022
Hi @huanzhang12, would it be possible to remove the part of the model related to reshaping at the start of lindex.onnx and lindex_deep.onnx? It does not seem to serve any purpose when the input is already in the correct shape.

Hi @pat676, sorry for the confusion caused. We have already removed this operation and updated the benchmark in our repository. Please find our updated onnx files here Cli212/VNNComp22_NN4Sys@5d9d801.

Thank you @Cli212. @ChristopherBrix, could you please update the models in the main repo? Thanks!

@yodarocks1
Copy link

The benchmark reach_prob_density errors out during testing due to too-short timeout durations. According to its instances.csv file, each instance is given 10 seconds before timeout. I assume this is meant to be 10 minutes, i.e. 600 seconds each? @stanleybak - based on the conversation above, I believe this is your benchmark?

The error:

Running reach_prob_density category from ./benchmarks/reach_prob_density/instances.csv
Category 'reach_prob_density' timeout sum: 360 seconds
reach_prob_density sum timeout (360) not in valid range [3*60*60, 6*60*60]

@stanleybak
Copy link
Owner Author

@yodarocks1 Yes, sorry, I thought I had fixed this but must not have pushed the change. The updated repo now uses 600 seconds per instance. @ChristopherBrix could you update this one? https://github.com/stanleybak/reach_probability_benchmark2022

@huanzhang12
Copy link

@pat676 @ChristopherBrix We updated the NN4sys benchmarks in response to #2 (comment) and #2 (comment). We've updated both onnx models and vnnlib files in our repository: https://github.com/Cli212/VNNComp22_NN4Sys
New test: https://vnncomp.christopher-brix.de/benchmark/details/229
But the "Exporting to GitHub" step failed due to the large models. @ChristopherBrix Could you please kindly update both the vnnlib and onnx files manuually in your repo?

@huanzhang12
Copy link

@ChristopherBrix @stanleybak Another thing I am not completely sure about: will all the benchmarks in https://github.com/ChristopherBrix/vnncomp2022_benchmarks/tree/main/benchmarks be used for the competition? According to the competition rules, each team may propose up to two benchmarks, and one of them can be selected from an outside group. It is unclear to me which benchmarks are from participants and which are from outside groups; I haven't seen anyone nominating a benchmark yet.

@stanleybak
Copy link
Owner Author

@huanzhang12 We'll do a matching step to ensure that each benchmark has a nominating participant. I don't think we'll need to exclude any for this reason, as no one proposed an excessive number of benchmarks.

@ChristopherBrix
Copy link

reach_prob_density has been updated.

I'll update NN4sys tomorrow; I'll need to find an alternative to LFS, as that doesn't scale due to GitHub quota limitations. But I guess I can just push the files to my university's file storage, I just need to implement that.

@ChristopherBrix
Copy link

NN4sys has been updated, and the handling of large files in general should be fixed now; it no longer relies on GitHub LFS.

@shizhouxing
Copy link
shizhouxing commented Jun 24, 2022

Hi @ChristopherBrix @stanleybak , it looks like vggnet16_2022 still has nothing under onnx/?

@shizhouxing
Copy link
shizhouxing commented Jun 24, 2022

Another issue is that in vnncomp2022_benchmarks, all the onnx models are stored in .onnx.gz format, but the instances.csv files use .onnx instead of .onnx.gz. I suggest adding at least a script or some documentation specifying what processing steps are needed after cloning the benchmark repository.

@ChristopherBrix
Copy link
ChristopherBrix commented Jun 24, 2022

You need to run the setup.sh script to fetch those external models. That wasn't documented anywhere, I'll fix that. This will also unzip all files.

@regkirov
Copy link

Hi @regkirov, it seems that some of the collins_aerospace benchmarks do not declare the right output variables. E.g., robustness_16perturbations_delta40_epsilon10_w40.vnnlib declares only "Y_1" but uses "Y_0". Could you look into this?

Hi @mnmueller - thanks for pointing this out! I've changed the output declaration from "Y_1" to "Y_0" and pushed the update to https://github.com/loonwerks/vnncomp2022. Cheers.

Hi @ChristopherBrix just double checking. See the message above - a fix has been added to the collins_rul_cnn benchmark, but the version in vnncomp22_benchmarks has not been updated yet. Can you check? Thanks!

@pat676
Copy link
pat676 commented Jun 28, 2022

Hi all,

For the competition, do we have a pre-processing phase, not counted as part of the timeout, for loading the vnnlib specifications and converting the model to the correct format for our toolkit?

The reason I'm asking is that the nn4sys benchmarks have some massive vnnlib files that take several minutes to parse with our implementation, and it takes us ~40 seconds to convert vggnet16 from onnx to PyTorch. Also, the rl_benchmarks have some very short timeouts (1 second), in which case any pre-processing time may be significant.

@huanzhang12
Copy link

I think we don't want to use just a few seconds as the timeout, due to the time variance and unknown startup overhead of each tool. Maybe we should set a minimum timeout threshold like 20 or 30 seconds? @vkrish1, you mentioned you had updated the timeouts in a previous comment #2 (comment), but it seems there are still many instances in rl_benchmark with timeouts of a few seconds? Could you double check? Thanks.

Last year we have a "prepare_instance.sh" timeout which was 30 seconds if I recall correctly. We probably need to increase the preparation timeout this year? maybe 100 seconds (or more)?

@pat676
Copy link
pat676 commented Jun 28, 2022

You need to run the setup.sh script to fetch those external models. That wasn't documented anywhere, I'll fix that. This will also unzip all files.

Hi @ChristopherBrix,

The setup.sh file does not seem to fetch the mscn_2048d_dual.onnx model of the nn4sys benchmark yet.

@stanleybak
Copy link
Owner Author

Yes, there is a prepare_instance.sh script you can use for conversion that does not get counted towards runtime (same as last year). This has a timeout itself but that shouldn't be a concern... Last year I think it was 60 seconds but 100 seconds is reasonable and an easy change (please update this in the evaluation scripts @ChristopherBrix).

As for small timeouts... I think 1 second may be a bit too small as well... hopefully we can update these to be at least 30 seconds.

@pat676
Copy link
pat676 commented Jun 29, 2022

Hi @Cli212

There seems to be an issue with the ends parameter of Slice_9 in the mscn_128d_dual.onnx model. It is very large and leads to overflow errors in my onnx parser. Could you have a look at this? Thanks!


@vkrish1
Copy link
vkrish1 commented Jun 29, 2022

My rl-benchmarks have been updated with larger timeouts (at least 30sec) - passed the tests and the repo was updated.

@pat676
Copy link
pat676 commented Jul 5, 2022

Hi @naizhengtan, @Cli212, @ChristopherBrix.

I am still having trouble with the nn4sys benchmark as mentioned in my comments above.

  1. I'm not able to download the mscn_2048d_dual.onnx model (it is not downloaded by setup.sh).
  2. The ends parameter of Slice_9 in mscn_128d_dual.onnx is extremely large and leads to overflow issues.

Is anyone else having trouble with this? If so, I believe we should consider removing these two models from the benchmark at this point.

@ChristopherBrix
Copy link

Sorry, I'll look into the first point and fix that right away

@huanzhang12
Copy link

@pat676

  1. The model can be downloaded here: https://github.com/Cli212/VNNComp22_NN4Sys/blob/master/model/mscn_2048d_dual.onnx
  2. The ends parameter of Slice_9 is within the onnx spec: https://github.com/onnx/onnx/blob/main/docs/Operators.md#slice - "For slicing to the end of a dimension with unknown size, it is recommended to pass in INT_MAX when slicing forward and 'INT_MIN' when slicing backward."

@ChristopherBrix
Copy link

The setup script should now work correctly, too.

@pat676
Copy link
pat676 commented Jul 5, 2022

Thank you, I got the model now. As for my second problem, the bug was on my side. If anyone else is facing a similar issue: it seems that torch.LongTensor throws an overflow error in certain situations when dealing with INT_MAX, but reducing it to INT_MAX - 1 seems to work.
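
For anyone converting such models, another simple workaround is to clamp very large "ends" values to the dimension size before building torch indices (an illustrative sketch, not taken from any particular tool):

    def clamp_slice_end(end, dim_size):
        # ONNX uses INT_MAX to mean "slice to the end of the dimension"; clamping to the
        # actual dimension size avoids int64 overflow when building torch index tensors
        return min(int(end), dim_size)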

@ChristopherBrix
Copy link

Hi @ChristopherBrix just double checking. See the message above - a fix has been added to the collins_rul_cnn benchmark, but the version in vnncomp22_benchmarks has not been updated yet. Can you check? Thanks!

This was fixed some time ago, just forgot to post it here.

Yes, there is a prepare_instance.sh script you can use for conversion that does not get counted towards runtime (same as last year). This has a timeout itself but that shouldn't be a concern... Last year I think it was 60 seconds but 100 seconds is reasonable and an easy change (please update this in the evaluation scripts @ChristopherBrix).

I'll do that tonight.

@ChristopherBrix
Copy link

@stanleybak Actually, you already updated the prepare instance timeout to 5 minutes last year (https://github.com/stanleybak/vnncomp2021/blob/main/run_single_instance.sh#L49). I just copied your code, so that should be fine.

@piyush-J
Copy link

For unet_upsample_small.onnx in carvana_unet_2022 benchmark, did anyone face any issues with Pad_113 layer: Sizes of tensors must match except in dimension 1. Expected size 1 but got size 3 for tensor number 1 in the list?

Based on the torch.nn.functional.pad(input, pad, mode='constant', value=None) → Tensor signature, I found:

  • The input is of size: torch.Size([1, 64, 30, 46])
  • The pad is of size 8: tensor([0, 0, 0, 0, 0, 0, 1, 1])
  • The value is 0

Am I missing something? Any help would be appreciated!

Thanks!

@pomodoromjy
Copy link

For unet_upsample_small.onnx in carvana_unet_2022 benchmark, did anyone face any issues with Pad_113 layer: Sizes of tensors must match except in dimension 1. Expected size 1 but got size 3 for tensor number 1 in the list?

Based on the torch.nn.functional.pad(input, pad, mode='constant', value=None) → Tensor signature, I found:

  • The input is of size: torch.Size([1, 64, 30, 46])
  • The pad is of size 8: tensor([0, 0, 0, 0, 0, 0, 1, 1])
  • The value is 0

Am I missing something? Any help would be appreciated!

Thanks!

Hi Piyush

Have you tried running evaluate_network.py? You can set onnxpath to unet_upsample_small.onnx and suceeds_mask to upsample_net_pre_mask. It may demonstrate how the network works.

Yonggang

@piyush-J
Copy link

Thank you, Yonggang!
