

downloading got Google Drive virus scan warning page rather than data files #3935

Open
forrestbao opened this issue May 15, 2022 · 7 comments
Labels
bug Something isn't working

Comments

@forrestbao


Short description
I am trying to load/download some datasets, mainly summarization ones. But instead of the *.tgz files, I got an HTML warning page saying that the file is too large for Google Drive to scan for viruses and asking whether I want to download it anyway.

Environment information

  • Operating System: Ubuntu Linux 20.04

  • Python version: 3.10.4

  • tensorflow-datasets/tfds-nightly version: tensorflow-datasets 4.5.2

  • tensorflow/tf-nightly version: tensorflow 2.8.0

  • Does the issue still exist with the latest tfds-nightly package (pip install --upgrade tfds-nightly)?

N/A

Reproduction instructions

CoLab link here https://colab.research.google.com/drive/1F5jHy8o0_va6aIvuaB6H-EqfiWUrC9Ld#scrollTo=k3k-fYTuxw54

Python code below

import tensorflow_datasets as tfds
tfds.load('cnn_dailymail', split='test')


Link to logs

I got the error message:

NonMatchingChecksumError: Artifact https://drive.google.com/uc?export=download&id=0BwmD_VLjROrfTHk4NFg2SndKcjQ, downloaded to /root/tensorflow_datasets/downloads/ucexport_download_id_0BwmD_VLjROrfTHk4NFg2SndKG8BdJPpt2iRo6Dpzz23CByJuAePEilB-pxbcBCHaWDs.tmp.cfa00e128d6c4efab209b7a281915239/uc, has wrong checksum. This might indicate:

  • The website may be down (e.g. returned a 503 status code). Please check the url.
  • For Google Drive URLs, try again later as Drive sometimes rejects downloads when too many people access the same URL. See Better Drive files download failure #1482
  • The original datasets files may have been updated. In this case the TFDS dataset builder should be updated to use the new files and checksums. Sorry about that. Please open an issue or send us a PR with a fix.
  • If you're adding a new dataset, don't forget to register the checksums as explained in: https://www.tensorflow.org/datasets/add_dataset#2_run_download_and_prepare_locally

Below is the downloaded file under my ~/tensorflow_datasets/download/<a random hash>:

<!DOCTYPE html><html><head><title>Google Drive - Virus scan warning</title><meta http-equiv="content-type" content="text/html; charset=utf-8"/><style nonce="sj0n0DT6hrEpyOgybFb8Iw">/* Copyright 2022 Google Inc. All Rights Reserved. */
.goog-inline-block{position:relative;display:-moz-inline-box;display:inline-block}* html .goog-inline-block{display:inline}*:first-child+html .goog-inline-block{display:inline}.goog-link-button{position:relative;color:#15c;text-decoration:underline;cursor:pointer}.goog-link-button-disabled{color:#ccc;text-decoration:none;cursor:default}body{color:#222;font:normal 13px/1.4 arial,sans-serif;margin:0}.grecaptcha-badge{visibility:hidden}.uc-main{padding-top:50px;text-align:center}#uc-dl-icon{display:inline-block;margin-top:16px;padding-right:1em;vertical-align:top}#uc-text{display:inline-block;max-width:68ex;text-align:left}.uc-error-caption,.uc-warning-caption{color:#222;font-size:16px}#uc-download-link{text-decoration:none}.uc-name-size a{color:#15c;text-decoration:none}.uc-name-size a:visited{color:#61c;text-decoration:none}.uc-name-size a:active{color:#d14836;text-decoration:none}.uc-footer{color:#777;font-size:11px;padding-bottom:5ex;padding-top:5ex;text-align:center}.uc-footer a{color:#15c}.uc-footer a:visited{color:#61c}.uc-footer a:active{color:#d14836}.uc-footer-divider{color:#ccc;width:100%}</style><link rel="icon" href="null"/></head><body><div class="uc-main"><div id="uc-dl-icon" class="image-container"><div class="drive-sprite-aux-download-file"></div></div><div id="uc-text"><p class="uc-warning-caption">Google Drive can't scan this file for viruses.</p><p class="uc-warning-subcaption"><span class="uc-name-size"><a href="/open?id=0BwmD_VLjROrfTHk4NFg2SndKcjQ">cnn_stories.tgz</a> (151M)</span> is too large for Google to scan for viruses. Would you still like to download this file?</p><form id="downloadForm" action="https://drive.google.com/uc?export=download&amp;id=0BwmD_VLjROrfTHk4NFg2SndKcjQ&amp;confirm=t" method="post"><input type="submit" id="uc-download-link" class="goog-inline-block jfk-button jfk-button-action" value="Download anyway"/></form></div></div><div class="uc-footer"><hr class="uc-footer-divider"></div></body></html>

Expected behavior
I was expecting to see progress bars of downloading for a while and then a return like this

 <PrefetchDataset element_spec={'abstract': TensorSpec(shape=(), dtype=tf.string, name=None), 'description': TensorSpec(shape=(), dtype=tf.string, name=None)}>

Additional context

This error repeats for many large datasets but not small ones. For example, I had no problem with cifar10. But I had the same issue with big_patent.

@forrestbao forrestbao added the bug Something isn't working label May 15, 2022
@parshinsh
parshinsh commented Aug 23, 2022

@forrestbao I'm facing the same problem when directly using wget to download large datasets from Google Drive on Linux. Did you find any solution for this error?

@Rushikesh-Malave-175
Rushikesh-Malave-175 commented Jun 7, 2023

Use this: https://github.com/Rushikesh-Malave-175/GD-Resume
I know I'm super late. It downloads only one file and only works on Windows, though.

@vinismarques

You can work around the "Google Drive can't scan this file for viruses." message if you know the download URL.

Just add the parameter confirm=t to the URL and it should work.

Example using OP's file:

wget 'https://drive.usercontent.google.com/download?id=0BwmD_VLjROrfTHk4NFg2SndKcjQ&export=download&authuser=1&confirm=t' -O cnn_stories.tgz
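If you prefer to do the same thing from Python rather than the shell, below is a minimal sketch of the confirm=t workaround using the requests library (requests is an extra dependency, not something tfds itself uses; the URL and filename are taken from the wget example above):

import requests

# confirm=t skips the "Google Drive can't scan this file for viruses" page.
url = ("https://drive.usercontent.google.com/download"
       "?id=0BwmD_VLjROrfTHk4NFg2SndKcjQ&export=download&confirm=t")

with requests.get(url, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    with open("cnn_stories.tgz", "wb") as f:
        for chunk in resp.iter_content(chunk_size=1 << 20):
            f.write(chunk)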

@rome-legacy

@vinismarques doesn't seem to work. When I inspect the URL behind the "Download anyway" button, I see a uuid parameter and an "at" parameter that I cannot decrypt.

@vinismarques

@rome-legacy the at parameter you see when inspecting the form HTML might not be necessary.

What worked for me was to add confirm=t to the URL directly. Take the URL of the page where you see the "Download anyway" button and add the confirm param. It will look something like this:

.../download?id=ID_HERE&export=download&confirm=t

It might also have an authuser param in the URL; you can keep it.

@GerrySant

I was encountering the same problem, so I made this bash script, which downloads the file without being blocked by the virus scan warning page:

#!/bin/bash
# Download a Google Drive file that triggers the "can't scan for viruses" page.

if [ $# -ne 2 ]; then
    echo "Usage: $0 <file_url> <destination_path>"
    exit 1
fi

file_url="$1"
destination_path="$2"

# Fetch the confirmation (virus scan warning) page.
confirmation_page=$(curl -s -L "$file_url")

# Extract the hidden form fields that the "Download anyway" button submits.
file_id=$(echo "$confirmation_page" | grep -oE "name=\"id\" value=\"[^\"]+" | sed 's/name="id" value="//')
file_confirm=$(echo "$confirmation_page" | grep -oE "name=\"confirm\" value=\"[^\"]+" | sed 's/name="confirm" value="//')
file_uuid=$(echo "$confirmation_page" | grep -oE "name=\"uuid\" value=\"[^\"]+" | sed 's/name="uuid" value="//')

# Rebuild the direct download URL with the extracted parameters.
download_url="https://drive.usercontent.google.com/download?id=$file_id&export=download&confirm=$file_confirm&uuid=$file_uuid"

curl -L -o "$destination_path" "$download_url"

if [ $? -eq 0 ]; then
    echo "Download completed successfully."
else
    echo "Download failed."
fi

To use it, simply perform the following steps:

  1. Create a '.sh' file (for example download_script.sh) and copy the above code.
  2. Give it execute permissions: chmod +x download_script.sh
  3. Run it as ./download_script.sh <file_url> <destination_path> (for example: ./download_script.sh 'https://drive.usercontent.google.com/download?id=**ID_HERE**&export=download' '/path/to/save/**FILE_NAME**.zip')
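For reference, here is a rough Python equivalent of the script above; it scrapes the same hidden form fields (id, confirm, uuid) from the confirmation page. It assumes the requests library is installed and that the field names match the warning page shown earlier in this issue; treat it as a sketch rather than a guaranteed interface:

import re
import sys
import requests

def drive_download(file_url, destination_path):
    # Fetch the "can't scan this file for viruses" confirmation page.
    page = requests.get(file_url, timeout=60).text
    # Pull the hidden form fields submitted by the "Download anyway" button.
    fields = dict(re.findall(r'name="(id|confirm|uuid)" value="([^"]*)"', page))

    download_url = (
        "https://drive.usercontent.google.com/download"
        f"?id={fields.get('id', '')}&export=download"
        f"&confirm={fields.get('confirm', 't')}&uuid={fields.get('uuid', '')}"
    )

    # Stream the actual file to disk.
    with requests.get(download_url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(destination_path, "wb") as f:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                f.write(chunk)

if __name__ == "__main__":
    drive_download(sys.argv[1], sys.argv[2])

Usage mirrors the bash version: python download_script.py '<file_url>' '<destination_path>'.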

@mvoelk
mvoelk commented May 17, 2024

I faced the issue with the reddit_tifu dataset. As a workaround, I downloaded the file manually and changed the URL in
.../lib/python3.10/site-packages/tensorflow_datasets/datasets/reddit_tifu/reddit_tifu_dataset_builder.py
from
_URL = "https://drive.google.com/uc?export=download&id=1ffWfITKFMJeqjT8loC8aiCLRNJpc_XnF"
to the local file path
_URL = "/home/user/Downloads/tifu_all_tokenized_and_filtered.json"
