code_release_001

Here you will find all code that was used for the analysis presented in ARMS-MBON's first data paper. Note that this is not the code used to create files for EurOBIS submissions, but the code used for the exploration of the sequencing data of data_release_001.

Curating taxonomy and adding NCBI IDs

FixPEMAtaxassigments_18S_taxonomist.py curates the 18S taxonomic outputs (see the repo processing_batch1/updated_taxonomic_assignments for more detail) and retrieves the NCBI IDs for the scientific names.
FixPEMAtaxassigments_COI_taxonomist.py adds the species level from the PEMA taxonomic assignment output files to the COI OTU tables and finds the NCBI IDs for those species (see the repo processing_batch1/updated_taxonomic_assignments for more detail).

Processing and analysis in R

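The following R scripts merge the data (read counts, taxonomy and fasta files) from the individual PEMA runs provided in the analysis_release_001 GitHub repository, with one script per marker gene:

18S_merge_tables_first_step_for_data_paper.R
COI_merge_tables_first_step_for_data_paper.R
ITS_merge_tables_first_step_for_data_paper.R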

These scripts perform the following tasks:

Merging of all data from individual PEMA runs.
As no confidence threshold was applied within PEMA for taxonomic assignments of COI ASVs, all rank assignments with a confidence value below 0.8 were discarded for this marker gene.
Given that reference databases use different taxonomic levels and contain assignments that may not represent an actual classification (e.g., "Class.xy_X"), rank assignments were curated (i.e., assignments containing Xs, "_sp", etc. were set to NA), actual species assignments were generated and a final rank order was established.
Final ASV/OTU count tables, taxonomy tables and fasta files were generated for each gene's data set. These files can be found in final_count_taxonomy_fasta_files. Note that the ASV/OTU IDs there do NOT match the ASV/OTU IDs in the PEMA output files: new IDs were assigned after merging the data of all PEMA runs for each gene.
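
The threshold, curation and re-numbering steps listed above can be sketched in Python as follows. This is a minimal illustration under stated assumptions and not the logic of the R scripts themselves: the rank list, the placeholder patterns, the ID prefix and all function names are hypothetical, and only the 0.8 threshold comes from the description above.

```python
import re

# Hypothetical final rank order; the R scripts define the actual one.
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]

# Confidence threshold stated above for COI rank assignments.
MIN_CONFIDENCE = 0.8

# Assumed patterns for labels that are not real classifications,
# e.g. "Class.xy_X" or "Genus_sp".
PLACEHOLDER = re.compile(r"(_X+$)|(_sp\.?$)", re.IGNORECASE)


def curate_assignment(assignment):
    """Return {rank: name or None} for one ASV/OTU.

    `assignment` maps a rank to a (name, confidence) tuple. A rank is set to
    None (the NA of the R scripts) when it is missing, falls below the
    threshold, or carries a placeholder label.
    """
    curated = {}
    for rank in RANKS:
        name, confidence = assignment.get(rank, (None, 0.0))
        if name is None or confidence < MIN_CONFIDENCE or PLACEHOLDER.search(name):
            curated[rank] = None
        else:
            curated[rank] = name
    return curated


def reassign_ids(sequences, prefix="ASV"):
    """Give each unique sequence of the merged runs a new sequential ID."""
    unique = sorted(set(sequences))
    return {seq: f"{prefix}{i}" for i, seq in enumerate(unique, start=1)}
```

Because the IDs are derived from the merged set of sequences, they are independent of the numbering inside any single PEMA run, which is why they differ from the IDs in the PEMA output files.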

The R script gene_analysis.R uses the files found in final_count_taxonomy_fasta_files, the sample_data.txt and the palette.txt file and performs all subsequent exploration of the sequencing data as presented in the manuscript, including:

removal of certain samples/replicates and erroneous/contaminant sequences
manual correction of phylum level assignments
generating all data/results presented in the manuscript
generating species occurrences (i.e., species occurrences with at least two sequence reads in the COI and 18S data; the data of both genes were then pooled) to screen for sensitive, non-indigenous and red-listed taxa (see below for the code used in the actual screening process)
statistics
generating plots

Further information on each step is given as comments within the R script.
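
The species occurrence step in the list above can be sketched in Python as follows. This is a hedged illustration and not a translation of gene_analysis.R: the function names and the tuple layout are hypothetical, and the sketch assumes that the "at least two sequence reads" rule is applied to the reads summed over all ASVs/OTUs of one species within one sample.

```python
# Minimum number of reads for a species occurrence, as stated above.
MIN_READS = 2


def species_occurrences(counts):
    """Return the set of (sample, species) pairs with at least MIN_READS reads.

    `counts` is an iterable of (sample, species, reads) tuples, one per
    ASV/OTU and sample. Entries without a species assignment are skipped.
    """
    totals = {}
    for sample, species, reads in counts:
        if species is None:
            continue
        key = (sample, species)
        totals[key] = totals.get(key, 0) + reads
    return {key for key, total in totals.items() if total >= MIN_READS}


def pool_genes(*per_gene_occurrences):
    """Pool the occurrence sets of several marker genes into one set."""
    pooled = set()
    for occurrences in per_gene_occurrences:
        pooled |= occurrences
    return pooled
```

Here the threshold is applied per gene before pooling, following the order given in the list above.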

Screening for species listed in AMBI, IUCN/HELCOM Red Lists and WRiMS

The gene_analysis.R script generates (among many other files) a list of the species occurring in the COI and 18S data sets. We used LifeWatch Belgium's e-Lab services (https://www.lifewatch.be/data-services/), via "Taxon match services" -> "Taxon match World Register of Marine Species (WoRMS)", to obtain the correct, accepted species names and AphiaIDs as present in WoRMS. The results were read back into R and used in the gene_analysis.R script to generate files with species occurrences per observatory, which were then screened against the three databases mentioned below. See the respective part in gene_analysis.R for further details.

The three databases used to screen against are:

AZTI’s Marine Biotic Index (AMBI; Borja et al., 2000, 2019) for species very sensitive to disturbance
the World Register of Introduced Marine Species (WRiMS; Costello et al., 2021, 2024) for species with alien status at the place of observation
the Red Lists of the International Union for Conservation of Nature (IUCN) and Baltic Marine Environment Protection Commission (Helsinki Commission, HELCOM) for species registered as Near Threatened, Vulnerable, Endangered or Critically Endangered

For the AMBI and IUCN/HELCOM scans, we used the REST web services provided by the World Register of Marine Species (WoRMS; Ahyong et al., 2023), more specifically the call AphiaAttributesByAphiaID, via the following script:

WormsAttributes4ARMSdata.py takes as input a CSV file with at least one column of AphiaIDs for the species of interest. For those species it returns a set of attributes as found in WoRMS, using its REST APIs: "Species importance to society", "IUCN RedList Category", "IUCN Criteria", "IUCN Year Accessed", "HELCOM RedList Category", "AMBI ecological group" and "Environmental position". Usage is documented within the code. Run it on the command line after writing the input file name and the number of the column holding the AphiaIDs into the top of the code. The specific version of this code used for the manuscript of data_release_001 is WormsAttributes4ARMSdata_4release001.py. This version outputs some of the input columns in addition to what is computed by the code.
The resulting files are SpeciesListAttributesCOI.csv and SpeciesListAttributes18S.csv.
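
For orientation, a call to AphiaAttributesByAphiaID could look like the Python sketch below. It is not the code of WormsAttributes4ARMSdata.py. The base URL and the response field names (measurementType, measurementValue, children) reflect our understanding of the WoRMS REST service and should be checked against its documentation; the function names are hypothetical.

```python
import json
import urllib.request

# Assumed base URL of the WoRMS REST service.
WORMS_REST = "https://www.marinespecies.org/rest"

# Attribute names listed in the description above.
WANTED = {
    "Species importance to society",
    "IUCN RedList Category",
    "IUCN Criteria",
    "IUCN Year Accessed",
    "HELCOM RedList Category",
    "AMBI ecological group",
    "Environmental position",
}


def fetch_attributes(aphia_id):
    """Fetch the attribute tree of one AphiaID (requires network access)."""
    url = f"{WORMS_REST}/AphiaAttributesByAphiaID/{aphia_id}"
    with urllib.request.urlopen(url, timeout=30) as response:
        body = response.read()
    # An empty body is treated as "no attributes found".
    return json.loads(body.decode("utf-8")) if body else []


def flatten(attributes):
    """Flatten the nested attribute tree into (type, value) pairs."""
    rows = []
    for attribute in attributes or []:
        rows.append((attribute.get("measurementType"), attribute.get("measurementValue")))
        rows.extend(flatten(attribute.get("children")))
    return rows


def select_attributes(attributes):
    """Keep only the attributes of interest, as {type: [values]}."""
    selected = {}
    for measurement_type, value in flatten(attributes):
        if measurement_type in WANTED:
            selected.setdefault(measurement_type, []).append(value)
    return selected
```

A typical use would be select_attributes(fetch_attributes(aphia_id)) for each AphiaID read from the input CSV. Flattening is needed because some attributes are assumed to be nested as children of others, for example the criteria below a Red List category.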

For the WRiMS scan, we used the Jupyter notebook on IJI invasive checker GH, with the specific input files in the ARMSrun2 folder there. The ARMS_SpeciesPerObservatory_18S.xlsx input file found there is created with the gene_analysis.R script (see above). Note that the file created with this script differs slightly from the file initially used in the ARMSrun2 folder. The gene_analysis.R script creates a file containing only unique species presences per ARMS unit, while the file actually used for the WRiMS scan also contains absences and duplicate species occurrences (initially, all occurrences of all samples of an ARMS unit were retrieved without removing duplicates). This does not change the final results. The input file name ends with "_18S" because the code used for the WRiMS scan was initially written to require this exact file name. However, the file created here contains both COI and 18S species occurrences. The result of the WRiMS scan is the ARMS_SpeciesPerObservatory_wrims.xlsx file.
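
The difference between the two input files can be illustrated with the short Python sketch below. It is a hypothetical example, not code from gene_analysis.R or the notebook: the record layout (ARMS unit, species, presence flag) and the function name are assumptions.

```python
def unique_presences(records):
    """Reduce occurrence records to unique (ARMS unit, species) presences.

    `records` is an iterable of (arms_unit, species, present) tuples.
    Absences are dropped and duplicates collapse because a set is returned.
    """
    return {(unit, species) for unit, species, present in records if present}
```

A file that still contains absences and duplicate occurrences reduces to the same set of presences as the deduplicated file, which is consistent with the statement above that the final results are unaffected.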

The files resulting from the database scans are read into R within the gene_analysis.R script and processed further for analysis and visualisation.
