Code used for the analyses of the data paper associated with data_release_001

arms-mbon/code_release_001

code_release_001

Here you will find all code that was used for the analysis presented in ARMS-MBON's first data paper. Note that this is not the code used to create files for EurOBIS submissions, but the code used for the exploration of the sequencing data of data_release_001.

Curating taxonomy and adding NCBI IDs

Processing and analysis in R

The following R scripts merge the data (read count, taxonomy, and fasta files) from individual PEMA runs we provide in the analysis_release_001 GitHub repository for each marker gene:

These scripts perform the following tasks:

  • Merging of all data from individual PEMA runs.
  • As no confidence threshold was applied within PEMA for taxonomic assignments of COI ASVs, all rank assignments with a confidence value below 0.8 were discarded for this marker gene.
  • Given that reference databases use different taxonomic levels and contain assignments that may not represent an actual classification (e.g., "Class.xy_X"), rank assignments were curated (i.e., assignments containing Xs, "_sp", etc. were set to NA), actual species assignments were generated, and a final rank order was established.
  • Final ASV/OTU count tables, taxonomy tables and fasta files were generated for each gene's data set. These files can be found in final_count_taxonomy_fasta_files. Note that the ASVs/OTUs there do NOT equal the ASV/OTU IDs in the PEMA output files. New IDs were re-assigned after merging all data of all PEMA runs for each gene.
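The curation steps listed above can be sketched roughly as follows. This is a minimal Python illustration, not the code of the R scripts: the rank list, the `curate` function, the regular expression for placeholder labels and the example assignment are all assumptions made for this sketch. Only the 0.8 cutoff and the placeholder examples ("Class.xy_X", "_sp") come from the description above.

```python
import re

# Assumed rank order for illustration; the R scripts establish their own final order.
RANKS = ["phylum", "class", "order", "family", "genus", "species"]

# Cutoff stated above for COI rank assignments.
MIN_CONFIDENCE = 0.8

# Assumed patterns for placeholder labels such as "Class.xy_X" or "Genus_sp".
PLACEHOLDER = re.compile(r"(_X+$|_sp\.?$)")


def curate(assignment, confidence=None):
    """Return a copy of `assignment` with unreliable ranks set to None (NA)."""
    curated = {}
    for rank in RANKS:
        name = assignment.get(rank)
        conf = None if confidence is None else confidence.get(rank)
        if name is None or PLACEHOLDER.search(name):
            # Missing or placeholder label: treat as not assigned.
            curated[rank] = None
        elif conf is not None and conf < MIN_CONFIDENCE:
            # Assignment below the confidence cutoff: discard.
            curated[rank] = None
        else:
            curated[rank] = name
    return curated


# Made-up example assignment and confidence values.
example = {
    "phylum": "Arthropoda",
    "class": "Class.xy_X",
    "family": "Balanidae",
    "genus": "Balanus",
    "species": "Balanus_sp",
}
example_conf = {"phylum": 0.99, "class": 0.9, "family": 0.85, "genus": 0.6, "species": 0.5}
result = curate(example, example_conf)
```

Passing no confidence dictionary skips the threshold step, which mirrors the fact that the cutoff was only applied to the COI data.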

The R script gene_analysis.R uses the files found in final_count_taxonomy_fasta_files, the sample_data.txt and the palette.txt file and performs all subsequent exploration of the sequencing data as presented in the manuscript, including:

  • removal of certain samples/replicates and erroneous/contaminant sequences
  • manual correction of phylum level assignments
  • generating all data/results presented in the manuscript
  • generating species occurrences (i.e., species occurrences with at least two sequence reads of COI and 18S data; data of each gene were then pooled) to screen for sensitive, non-indigenous and red-listed taxa (see below for respective code of the actual screening process)
  • statistics
  • generating plots

Further information on each step is given as comments within the R script.
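The species occurrence step (at least two sequence reads, then pooling COI and 18S) could look roughly like the Python sketch below. It assumes that reads are summed per species and sample before the threshold is applied; gene_analysis.R may apply it differently. The function name, data layout and toy values are hypothetical.

```python
from collections import defaultdict


def species_occurrences(counts, taxonomy, min_reads=2):
    """Return the set of (sample, species) pairs with at least `min_reads` reads.

    counts:   {asv_id: {sample: read_count}}
    taxonomy: {asv_id: species name, or None if no species was assigned}
    """
    totals = defaultdict(int)
    for asv_id, per_sample in counts.items():
        species = taxonomy.get(asv_id)
        if species is None:
            continue  # ASVs/OTUs without a species assignment are skipped
        for sample, reads in per_sample.items():
            totals[(sample, species)] += reads
    return {key for key, reads in totals.items() if reads >= min_reads}


# Toy data, made up for illustration only.
coi_counts = {"ASV1": {"S1": 5, "S2": 1}, "ASV2": {"S1": 7}}
coi_taxonomy = {"ASV1": "Species A", "ASV2": None}
ssu_counts = {"OTU1": {"S2": 3}, "OTU2": {"S1": 2}}
ssu_taxonomy = {"OTU1": "Species B", "OTU2": "Species A"}

# Pool the occurrences of both genes as a set union.
pooled = species_occurrences(coi_counts, coi_taxonomy) | species_occurrences(
    ssu_counts, ssu_taxonomy
)
```

Using a set for the pooled result means a species detected by both genes in the same sample is counted once.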

Screening for species listed in AMBI, IUCN/HELCOM Red Lists and WRiMS

The gene_analysis.R script generates (among many other files) a list of species occurring in the COI and 18S data sets. We used LifeWatch Belgium's e-Lab services (https://www.lifewatch.be/data-services/), via "Taxon match services" -> "Taxon match World Register of Marine Species (WoRMS)", to obtain the correct, accepted species names and AphiaIDs as present in WoRMS. The results were read back into R and used in the gene_analysis.R script to generate files with species occurrences per observatory, which were then screened against the three databases mentioned below. See the respective part in gene_analysis.R for further details.

The three databases used to screen against are:

  • AZTI’s Marine Biotic Index (AMBI; Borja et al., 2000, 2019) for species very sensitive to disturbance
  • the World Register of Introduced Marine Species (WRiMS; Costello et al., 2021, 2024) for species with alien status at the place of observation
  • the Red Lists of the International Union for Conservation of Nature (IUCN) and Baltic Marine Environment Protection Commission (Helsinki Commission, HELCOM) for species registered as Near Threatened, Vulnerable, Endangered or Critically Endangered

For the AMBI and IUCN/HELCOM scans, we used the REST web services provided by the World Register of Marine Species (WoRMS; Ahyong et al., 2023), more specifically the call AphiaAttributesByAphiaID, via the following script:

  • WormsAttributes4ARMSdata.py takes as input a CSV file with at least one column of AphiaIDs for the species of interest. It returns a set of attributes ("Species importance to society", "IUCN RedList Category", "IUCN Criteria", "IUCN Year Accessed", "HELCOM RedList Category", "AMBI ecological group", "Environmental position") for those species as found in WoRMS, using its REST API. Usage is documented within the code. The script is run on the command line, with the input file name and the number of the column holding the AphiaIDs written into the top of the code. The specific version used for the manuscript of data_release_001 is WormsAttributes4ARMSdata_4release001.py; in addition to the computed attributes, this version also outputs some of the input columns.
  • The resulting files are SpeciesListAttributesCOI.csv and SpeciesListAttributes18S.csv.
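For orientation, a lookup of this kind might be sketched as below. This is not WormsAttributes4ARMSdata.py: the function names are hypothetical, and the base URL, the `include_inherited` query parameter and the response fields (`measurementType`, `measurementValue`, `children`) reflect our understanding of the public WoRMS REST service and should be checked against its documentation. The sample response is made up for illustration.

```python
import json
import urllib.request

# Assumed base URL of the WoRMS REST service.
BASE_URL = "https://www.marinespecies.org/rest"

# Attribute names listed in the script description above.
WANTED = {
    "Species importance to society",
    "IUCN RedList Category",
    "IUCN Criteria",
    "IUCN Year Accessed",
    "HELCOM RedList Category",
    "AMBI ecological group",
    "Environmental position",
}


def attributes_url(aphia_id, include_inherited=False):
    """Build the URL of the AphiaAttributesByAphiaID call for one AphiaID."""
    flag = "true" if include_inherited else "false"
    return f"{BASE_URL}/AphiaAttributesByAphiaID/{aphia_id}?include_inherited={flag}"


def fetch_attributes(aphia_id, timeout=30):
    """Query WoRMS (network access needed); an empty body yields an empty list."""
    with urllib.request.urlopen(attributes_url(aphia_id), timeout=timeout) as resp:
        body = resp.read()
    return json.loads(body) if body else []


def flatten(attributes):
    """Yield (type, value) for every attribute, including nested children."""
    for attr in attributes or []:
        yield attr.get("measurementType"), attr.get("measurementValue")
        yield from flatten(attr.get("children"))


def select(attributes, wanted=WANTED):
    """Keep only the attributes of interest as a {type: value} dictionary."""
    return {t: v for t, v in flatten(attributes) if t in wanted}


# Made-up response fragment, shaped like the assumed JSON structure.
sample = [
    {"measurementType": "AMBI ecological group", "measurementValue": "I", "children": None},
    {
        "measurementType": "IUCN RedList Category",
        "measurementValue": "Vulnerable",
        "children": [
            {"measurementType": "IUCN Criteria", "measurementValue": "A2b", "children": []}
        ],
    },
    {"measurementType": "Body size", "measurementValue": "10", "children": None},
]
selected = select(sample)
```

In practice one would call `select(fetch_attributes(aphia_id))` for each AphiaID in the input CSV and write one output row per species.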

For the WRiMS scan, we used the Jupyter notebook in the IJI invasive checker GitHub repository, with the specific input files in the ARMSrun2 folder there. The ARMS_SpeciesPerObservatory_18S.xlsx input file found there is created with the gene_analysis.R script (see above). Note that the file created with this script differs slightly from the file initially used in the ARMSrun2 folder. The gene_analysis.R script creates a file containing only unique species presences per ARMS unit, while the file actually used for the WRiMS scan also contains absences and duplicate species occurrences (initially, all occurrences in all samples of an ARMS unit were retrieved without removing duplicates). This does not change the final results. The input file name ends with "_18S" because the code used for the WRiMS scan was initially written to require this exact file name; the file nevertheless contains both COI and 18S species occurrences. The result of the WRiMS scan is the ARMS_SpeciesPerObservatory_wrims.xlsx file.
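The reason the two input files lead to the same results can be shown with a small Python sketch: dropping absences and duplicate occurrences leaves the set of species presences per ARMS unit unchanged. The record layout, function name and values are hypothetical and do not reflect the actual column structure of the xlsx files.

```python
def unique_presences(records):
    """Return sorted unique (arms_unit, species) pairs that are marked present.

    records: iterable of (arms_unit, species, present) tuples.
    """
    return sorted({(unit, species) for unit, species, present in records if present})


# Made-up records resembling the initially used file: duplicates and absences included.
raw_records = [
    ("Unit1", "Species A", True),
    ("Unit1", "Species A", True),   # duplicate occurrence from a second sample
    ("Unit1", "Species B", False),  # absence
    ("Unit2", "Species B", True),
]

# Made-up records resembling the file created by gene_analysis.R: unique presences only.
clean_records = [
    ("Unit1", "Species A", True),
    ("Unit2", "Species B", True),
]

same_result = unique_presences(raw_records) == unique_presences(clean_records)
```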

The files resulting from the database scans are read into R within the gene_analysis.R script and processed further for analysis and visualisation.
