Code for the paper: "Genomic resources for Mediterranean fishes"


Pierre-Edouard Guerin, Stephanie Manel

Montpellier, 2017-2019

Submitted to Molecular Ecology Resources, 2019


Prerequisites

Software

Singularity container

See https://www.sylabs.io/docs/ for instructions to install Singularity.

Download the container

singularity pull --name snpsdata_analysis.simg shub://Grelot/reserveBenefit--snpsdata_analysis:snpsdata_analysis

Run the container

singularity run snpsdata_analysis.simg

Data files

We work on three species: Mullus surmuletus, Diplodus sargus and Serranus cabrilla. In the file names below, species is a wildcard that stands for any of these three species.

  • genome assembly (.fasta)

  • SNP data from RADseq (.vcf)

Filtering SNPs

Only one randomly selected SNP was retained per locus, and a locus was retained only if it was present in at least 85% of individuals. Individuals with excessive coverage depth (>1,000,000x) or with >30% missing data were filtered out. We kept loci with an observed heterozygosity of at most 0.6.
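The locus-presence threshold described above can be sketched with awk alone. This is an illustration, not the authors' script: the function name `filter_loci_by_call_rate` is hypothetical, and it assumes an uncompressed VCF in which the genotype (GT) is the first subfield of each sample column and a missing genotype starts with a dot.

```shell
# Hypothetical helper (not part of this repository): keep VCF records that
# are genotyped in at least MIN_RATE of the individuals.
# Usage: filter_loci_by_call_rate input.vcf 0.85 > output.vcf
filter_loci_by_call_rate() {
    awk -F'\t' -v min_rate="${2:-0.85}" '
        /^#/ { print; next }                  # pass header lines through unchanged
        {
            called = 0
            n = NF - 9                        # sample columns start at field 10
            for (i = 10; i <= NF; i++) {
                split($i, sub_fields, ":")    # assumption: GT is the first subfield
                if (sub_fields[1] !~ /^\./) called++
            }
            # keep the record only if enough individuals have a genotype
            if (n > 0 && called / n >= min_rate) print
        }
    ' "$1"
}
```

With the default threshold of 0.85, a record is printed only when at least 85% of the sample columns carry a called genotype; header lines are always kept so the output remains a VCF.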

Filtering steps (IBD paper)

  1. Remove loci with inbreeding coefficient Fis > 0.5 or < -0.5
  2. Keep all pairs of loci that are closer than 5000 bp
  3. Keep pairs of loci with linkage disequilibrium > 0.8
  4. Keep SNPs with a minimum minor allele frequency (MAF) of 1%
  5. Remove loci that deviate significantly (p-value < 0.01) from the genotype frequencies expected under Hardy-Weinberg equilibrium (random mating)
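The per-locus quantities behind steps 1 and 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline in filter_vcf.sh: the helper names `locus_stats` and `keep_loci` are hypothetical, SNPs are assumed to be biallelic and diploid, and Fis is taken as 1 - Ho/He with He = 2p(1-p). The linkage disequilibrium and Hardy-Weinberg steps are not covered here.

```shell
# Hypothetical helper: print CHROM, POS, Ho, He, Fis and MAF for each SNP.
# Assumptions: biallelic diploid genotypes, GT is the first subfield.
locus_stats() {
    awk -F'\t' '
        /^#/ { next }                                      # skip header lines
        {
            n_called = 0; n_het = 0; n_alt = 0
            for (i = 10; i <= NF; i++) {
                split($i, f, ":")
                if (split(f[1], a, "[/|]") != 2) continue  # not a diploid call
                if (a[1] == "." || a[2] == ".") continue   # missing genotype
                n_called++
                if (a[1] != a[2]) n_het++
                n_alt += (a[1] != "0") + (a[2] != "0")
            }
            if (n_called == 0) next
            p   = n_alt / (2 * n_called)                   # alternate allele frequency
            ho  = n_het / n_called                         # observed heterozygosity
            he  = 2 * p * (1 - p)                          # expected heterozygosity
            maf = (p < 0.5) ? p : 1 - p
            fis = (he > 0) ? sprintf("%.4f", 1 - ho / he) : "NA"
            printf "%s\t%s\t%.4f\t%.4f\t%s\t%.4f\n", $1, $2, ho, he, fis, maf
        }
    ' "$1"
}

# Hypothetical helper: list loci with -0.5 <= Fis <= 0.5 and MAF >= 0.01.
keep_loci() {
    locus_stats "$1" |
        awk -F'\t' '$5 != "NA" && $5 >= -0.5 && $5 <= 0.5 && $6 >= 0.01 { print $1 "\t" $2 }'
}
```

Monomorphic loci have He = 0, so their Fis is reported as NA and they are dropped by `keep_loci`, which is consistent with the MAF threshold.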

Filtering steps (genome paper)

  1. Keep all pairs of loci that are closer than 5000 bp
  2. Keep pairs of loci with linkage disequilibrium > 0.8
  3. Keep SNPs with a minimum minor allele frequency (MAF) of 1%

INPUTS:

OUTPUTS:

  • species.lmiss: table of the number of missing individuals per locus
  • species.imiss: table of the number of missing loci per individual
  • species.idepth: table of the mean locus coverage depth per individual
  • species.geno.ld: table of linkage disequilibrium (r²) values
  • species.snps.fisloc_rm.vcf
  • species.fisloc_rm.ld_5000.log
  • species.fisloc_rm.ld_5000.recode.vcf
  • species.fisloc_rm.ld_5000.r2.recode.vcf
  • species.fisloc_rm.ld_5000.r2.maf001.recode.vcf
  • species.fisloc_rm.ld_5000.r2.maf001.hwe.recode.vcf: final filtered snps
  • speciesfiltering_count_snps_report.tsv: number of SNPs at each filtering step

cd filter_vcf
bash filter_vcf.sh

Description of SNPs on the genome

Generate tables

  1. Split the genome into windows of 400 kbp.
  2. Count the number of SNPs located in each genome window.
  3. Count the number of reads for each SNP in each individual.
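Steps 1 and 2 can be sketched with awk. This is an illustration rather than the content of snpsontothegenome/command.sh: the function name `count_snps_per_window` and both input tables are assumptions (a tab-separated table of scaffold names and lengths, and a tab-separated table of SNP scaffolds and 1-based positions). The per-individual read counts of step 3 are not covered.

```shell
# Hypothetical helper: count SNPs per fixed-size genome window.
#   $1: tab-separated table of scaffold name and scaffold length (assumed input)
#   $2: tab-separated table of SNP scaffold and 1-based position (assumed input)
#   $3: window size in bp (default: 400000)
# Output: BED-like rows of scaffold, 0-based start, end, number of SNPs.
# Note: the SNP table is assumed to be non-empty.
count_snps_per_window() {
    awk -F'\t' -v OFS='\t' -v size="${3:-400000}" '
        NR == FNR {                               # first file: SNP positions
            w = int(($2 - 1) / size)              # 0-based window index
            count[$1 SUBSEP w]++
            next
        }
        {                                         # second file: scaffold lengths
            for (start = 0; start < $2; start += size) {
                end = (start + size < $2) ? start + size : $2   # clip last window
                w = start / size
                print $1, start, end, count[$1 SUBSEP w] + 0
            }
        }
    ' "$2" "$1"
}
```

Windows without any SNP are still printed with a count of 0, so the output covers every scaffold from start to end.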

INPUTS:

  • species.fasta: genome fasta file of species
  • species.vcf: SNPs from radseq data of species
  • species.gff3: genome annotation of species, with the coordinates of coding regions and related information

OUTPUTS:

  • speciescoverage.bed: a table with one row per 400,000 bp genome window of species, giving the window coordinates (scaffold, start position, end position) and its coverage (number of SNPs)
  • speciesmeandepth.bed: a table with one row per SNP, giving the genome window, the coordinates (scaffold, start position, end position) and the coverage depth (number of reads) of the SNP in each individual
  • speciescoords.snps.bed: coordinates (scaffold, position) of SNPs on the genome
  • speciescoding.snps.bed: SNPs located in a coding region

bash snpsontothegenome/command.sh

Build the figure

Rscript snpsontothegenome/figure_cover_genome.R

Average distance between SNP loci

INPUTS:

  • speciescoords.snps.bed: coordinates (scaffold, position) of SNPs on the genome

Rscript snpsontothegenome/average_distance_loci.R
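The statistics reported in distance_loci.csv can be approximated outside R with a short shell sketch. This is not a port of average_distance_loci.R: the helper names `consecutive_distances` and `distance_summary` are hypothetical, the input is assumed to have the scaffold in column 1 and the position in column 2, and distances are assumed to be measured only between consecutive SNPs on the same scaffold.

```shell
# Hypothetical helper: print the distance between consecutive SNPs,
# restricted to pairs located on the same scaffold.
consecutive_distances() {
    sort -k1,1 -k2,2n "$1" |
        awk 'NR > 1 && $1 == prev_scaffold { print $2 - prev_pos }
             { prev_scaffold = $1; prev_pos = $2 }'
}

# Hypothetical helper: mean, median, sample sd, max and min of the distances.
distance_summary() {
    consecutive_distances "$1" | sort -n | awk '
        { d[NR] = $1; sum += $1; sumsq += $1 * $1 }
        END {
            if (NR == 0) exit                     # no pair of SNPs on a scaffold
            mean = sum / NR
            median = (NR % 2) ? d[(NR + 1) / 2] : (d[NR / 2] + d[NR / 2 + 1]) / 2
            var = (NR > 1) ? (sumsq - NR * mean * mean) / (NR - 1) : 0
            if (var < 0) var = 0                  # guard against rounding error
            printf "mean=%.2f median=%s sd=%.2f max=%s min=%s\n",
                   mean, median, sqrt(var), d[NR], d[1]
        }'
}
```

Calling `distance_summary speciescoords.snps.bed` would print one summary line per file; whether the R script uses the sample or the population standard deviation is not stated here, so the sd value may differ slightly.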

SNPs located/not in coding regions

Count the number of lines in the file speciescoding.snps.bed (each line is a SNP located in a coding region).

SNPs located/not in mitochondrial regions

............

Results

  • distance_loci.csv : mean, median and sd distance between consecutive loci

|          | mean             | median | sd               | max    | min  |
|----------|------------------|--------|------------------|--------|------|
| diplodus | 35388.9078430345 | 23751  | 34996.9143024498 | 459616 | 5000 |
| mullus   | 30716.8684498214 | 20930  | 29189.8335674228 | 384550 | 5002 |
| serran   | 28239.7585528699 | 19084  | 27013.2843728281 | 403508 | 733  |

  • summary_snps.csv: number of SNPs, average distance between consecutive loci (in bp) and number of SNPs located in a coding region for each species

| species  | number_snps | average_distance_bp | number_coding_snps |
|----------|-------------|---------------------|--------------------|
| diplodus | 20074       | 35389               | 11978              |
| mullus   | 15710       | 30717               | 10304              |
| serranus | 21101       | 28240               | 13107              |