The need
While using the sra-download workflow (short description and code) to download some publicly available metagenomes, I realized that the current workflow does not verify MD5 checksums to check for file corruption.
To ensure data integrity, I manually checked the MD5 checksums of the files against the ENA metadata checksums (see Additional Info [1]). The checksums did not match, raising concerns about potential file corruption. While investigating, I reached out to @mschecht, who developed the workflow.
The issue is that the sra-download workflow downloads unzipped files (specifically, a .sra file containing the fastq files) and gzips them post-download. However, the MD5 checksums from ENA are for the .fastq.gz files. Unzipping and re-zipping changes the checksum (see Additional Info [2]), making the checksums provided by ENA useless here.
Further investigation led me to this issue on obtaining MD5 checksums from NCBI: ncbi/sra-tools#531. These checksums are for the .sra files and should suffice to ensure the files were not corrupted during download.
After discussing with @mschecht, we plan to include an MD5 checksum check in the workflow for the .sra files stored in the temporary 01_NCBI_SRA directory before extracting the .fastq files.
The solution
Integrate a new rule with an MD5 checksum verification step in the sra-download workflow to check the .sra files in the temp directory before fasterq-dump extracts the FASTQ files.
For this, we could do something similar to:
Use curl to retrieve the MD5 checksums and accession numbers from the NCBI API:
```bash
#!/bin/bash
# Read each accession number from accession_numbers.txt
while IFS= read -r accession; do
    # Query the NCBI SDL API and save the response as JSON
    curl "https://locate.ncbi.nlm.nih.gov/sdl/2/retrieve?filetype=run&location-type=forced&location=s3.us-east-1&accept-charges=aws&acc=${accession}" -o "${accession}.json"
done < accession_numbers.txt
```
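The full response format is not reproduced here, but an abridged, made-up sketch showing only the fields the `jq` commands below read might look like this (all values are placeholders, not real data):

```json
{
  "version": "2",
  "result": [
    {
      "bundle": "SRR0000000",
      "files": [
        {
          "name": "SRR0000000",
          "md5": "0123456789abcdef0123456789abcdef"
        }
      ]
    }
  ]
}
```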
Extract the MD5 checksums and accession numbers from the .json files:
```bash
#!/bin/bash
# Output file for checksums
output_file="checksums.txt"

# Clear the output file if it exists
> "$output_file"

# Loop through each JSON file
for json_file in *.json; do
    # Extract the accession number and MD5 checksum from the JSON file
    accession=$(jq -r '.result[0].bundle' "$json_file")
    md5=$(jq -r '.result[0].files[0].md5' "$json_file")

    # Append the extracted information to the output file
    echo "$accession $md5" >> "$output_file"
done
```
Compare the MD5 checksums with the checksums of the downloaded .sra files:
```bash
#!/bin/bash
# Read the checksums file and compare MD5 checksums of the .sra files
while IFS=' ' read -r accession md5_expected; do
    # Compute the MD5 checksum of the downloaded .sra file
    md5_actual=$(md5sum "${accession}.sra" | awk '{ print $1 }')

    # Compare the checksums
    if [ "$md5_expected" = "$md5_actual" ]; then
        echo "$accession: Checksums match"
    else
        echo "$accession: Checksums do not match"
    fi
done < checksums.txt
```
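Put together, these three steps could become a single verification rule. A rough sketch of what such a Snakemake rule might look like follows; the rule name, the flag-file convention, and the `01_NCBI_SRA/{accession}/{accession}.sra` path layout are assumptions for illustration, not the workflow's actual names:

```
rule verify_sra_md5:
    input:
        sra = "01_NCBI_SRA/{accession}/{accession}.sra"
    output:
        # hypothetical flag file a downstream fasterq-dump rule could require
        ok = touch("01_NCBI_SRA/{accession}/{accession}.md5_ok")
    shell:
        """
        expected=$(curl -s "https://locate.ncbi.nlm.nih.gov/sdl/2/retrieve?filetype=run&location-type=forced&location=s3.us-east-1&accept-charges=aws&acc={wildcards.accession}" \
            | jq -r '.result[0].files[0].md5')
        actual=$(md5sum {input.sra} | awk '{{ print $1 }}')
        if [ "$expected" != "$actual" ]; then
            echo "{wildcards.accession}: MD5 mismatch" >&2
            exit 1
        fi
        """
```

Making the rule that runs fasterq-dump depend on the flag file would ensure extraction only happens after a successful checksum check.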
Beneficiaries
Anyone using this workflow to download nucleotide sequence data from NCBI SRA who wants to ensure that their files have not been corrupted during downloading.
Additional info
[1] How to get ENA metadata with MD5 checksums
Use the following script to download the metadata (including MD5 checksums) for projects from ENA:
```python
import requests
import os

# Define the list of project accessions
project_accessions = ['PRJEB1787', 'PRJEB8682', 'PRJNA656268', 'PRJNA385854', 'PRJEB52452', 'PRJNA385855', 'PRJNA385855', 'PRJNA16339', 'PRJNA352737', 'PRJEB2064']

# Define the API endpoint for the project metadata
url = "https://www.ebi.ac.uk/ena/portal/api/search"

# Define the common parameters for the API request
fields = 'accession,altitude,assembly_quality,assembly_software,base_count,binning_software,bio_material,broker_name,cell_line,cell_type,center_name,checklist,collected_by,collection_date,completeness_score,contamination_score,country,cultivar,culture_collection,depth,description,dev_stage,ecotype,elevation,environment_biome,environment_feature,environment_material,environmental_sample,experiment_accession,experiment_alias,experiment_title,experimental_factor,fastq_aspera,fastq_bytes,fastq_ftp,fastq_galaxy,fastq_md5,first_created,first_public,germline,host,host_body_site,host_genotype,host_gravidity,host_growth_conditions,host_phenotype,host_sex,host_status,host_tax_id,identified_by,instrument_model,instrument_platform,investigation_type,isolate,isolation_source,last_updated,lat,library_layout,library_name,library_selection,library_source,library_strategy,location,lon,mating_type,nominal_length,nominal_sdev,ph,project_name,protocol_label,read_count,run_accession,run_alias,salinity,sample_accession,sample_alias,sample_collection,sample_description,sample_material,sample_title,sampling_campaign,sampling_platform,sampling_site,scientific_name,secondary_sample_accession,secondary_study_accession,sequencing_method,serotype,serovar,sex,specimen_voucher,sra_aspera,sra_bytes,sra_ftp,sra_galaxy,sra_md5,strain,study_accession,study_alias,study_title,sub_species,sub_strain,submission_accession,submitted_aspera,submitted_bytes,submitted_format,submitted_ftp,submitted_galaxy,submitted_host_sex,submitted_md5,target_gene,tax_id,taxonomic_classification,taxonomic_identity_marker,temperature,tissue_lib,tissue_type,variety'

common_params = {
    'result': 'read_run',
    'fields': fields,
    'format': 'tsv',
    'limit': '0'
}

# Loop through each project accession
for project_accession in project_accessions:
    # Update the query parameter for each project
    params = common_params.copy()
    params['query'] = f'study_accession={project_accession}'

    # Send the API request
    response = requests.get(url, params=params)

    # Check if the request was successful
    if response.status_code == 200:
        # Define the path to the output file
        output_file_path = os.path.join('..', 'data', f'{project_accession}_metadata.csv')

        # Save the metadata to the specified file
        with open(output_file_path, 'w') as file:
            file.write(response.text)
        print(f"Metadata for {project_accession} has been successfully downloaded and saved to '{output_file_path}'.")
    else:
        print(f"Failed to retrieve metadata for {project_accession}: {response.status_code}")
        print(response.text)
```
[2] Investigation on unzipping and re-zipping changing checksums
The sra-download workflow downloads the unzipped files and then zips them. The MD5 checksums in the metadata file are for the zipped files. The question is whether unzipping and re-zipping changes the checksum. Let's test it:
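The original test output is not shown here, but the effect is easy to reproduce with a small sketch (file names and contents below are made up for illustration). The key point is that gzip records metadata such as the input file's modification time in the archive header, so byte-identical content can compress to different bytes:

```shell
#!/bin/bash
# Demonstrate that re-compressing identical data can change the MD5 checksum.
printf '@read1\nACGT\n+\nIIII\n' > reads.fastq

# Compress once
gzip -c reads.fastq > a.fastq.gz

# Same content, but a new modification time on the input file
sleep 1
touch reads.fastq
gzip -c reads.fastq > b.fastq.gz

# The two .gz files hold identical data, yet their checksums differ
md5sum a.fastq.gz b.fastq.gz
```

Decompressing both files yields identical content, which is exactly why checksums of the compressed files cannot be compared across independent compression runs.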
MD5 checksums change due to unzipping and re-zipping, hence the provided checksums are not applicable. Using the MD5 checksums of the .sra files ensures integrity.
Alternative to using SRA
As a sanity check (and to exclude the possibility that the md5 checksums I got from ENA are incorrect), I tried a different method for downloading some of the data with wget and the FTP links to each run. For these downloads, the MD5 checksums matched.
This is how one would use the ENA API and wget to avoid needing to rely on the SRA toolkit.
One needs the FTP links (obtained in [1]), which one would add to a file such as ftp-links-raw-files.txt, which could look like:
and then you could simply iterate through it:
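The original loop is not shown, but a minimal sketch could be as follows (it assumes ftp-links-raw-files.txt holds one full download URL per line, as taken from the fastq_ftp column of the ENA metadata; the guard simply makes the script a no-op if the file is absent):

```shell
#!/bin/bash
# Download every run listed in ftp-links-raw-files.txt, one link per line.
if [ -f ftp-links-raw-files.txt ]; then
    while IFS= read -r link; do
        wget "$link"
    done < ftp-links-raw-files.txt
fi
```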
Then check the MD5 checksums of all downloaded .fastq files with md5sum *.fastq and compare those against the MD5 checksums given in the ENA metadata.
And while this is a very viable option, for which we could also write a workflow, it would be nice to also have the SRA option.