WIP: stdin support for FASTA/FASTQ files #2029

RaeedA · 2024-05-24T22:13:34Z

Issue #2017
Adding support for FASTA/FASTQ files to make them accept stdin streaming, as well as adding compatibility for other file formats in the future.
Modified the FASTQ file parser to match the FASTA structure, specifically moving the parser functionality out of the reader and into a dedicated helper function.
Making most sniffer/parsers pass out consumed lines.
Coverage still in progress.

Please complete the following checklist:

I have read the contribution guidelines.
I have documented all public-facing changes in the changelog.
This pull request includes code, documentation, or other content derived from external source(s). If this is the case, ensure the external source's license is compatible with scikit-bio's license. Include the license in the licenses directory and add a comment in the code giving proper attribution. Ensure any other requirements set forth by the license and/or author are satisfied.
- It is your responsibility to disclose code, documentation, or other content derived from external source(s). If you have questions about whether something can be included in the project or how to give proper attribution, include those questions in your pull request and a reviewer will assist you.
This pull request does not include code, documentation, or other content derived from external source(s).

Note: This document may also be helpful to see some of the things code reviewers will be verifying when reviewing your pull request.

Stdin fixes

codecov · 2024-05-24T22:28:03Z

Codecov Report

Attention: Patch coverage is 91.30435% with 14 lines in your changes are missing coverage. Please review.

Project coverage is 98.44%. Comparing base (e3a27f5) to head (1b3629b).
Report is 2 commits behind head on main.

Files	Patch %	Lines
skbio/io/_iosources.py	54.54%	5 Missing ⚠️
skbio/io/util.py	70.00%	3 Missing ⚠️
skbio/io/format/_base.py	77.77%	2 Missing ⚠️
skbio/io/registry.py	92.00%	2 Missing ⚠️
skbio/io/format/fasta.py	94.73%	1 Missing ⚠️
skbio/io/format/fastq.py	97.77%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2029      +/-   ##
==========================================
- Coverage   98.59%   98.44%   -0.16%     
==========================================
  Files         182      182              
  Lines       30885    31156     +271     
  Branches     7512     7583      +71     
==========================================
+ Hits        30452    30671     +219     
- Misses        418      469      +51     
- Partials       15       16       +1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

qiyunzhu

@RaeedA Great job! It looks to me quite meaningful. I did some tests using the attached real-world FASTQ file as input, and got mixed result.

First, the original use case suggested by @wasade in #2017 , appears to be working. My test script test.py has the following content:

import sys
import skbio

seq_gen = skbio.read(sys.stdin, format='fastq')

print('1st batch')
i = 0
while i < 5:
    print(next(seq_gen).metadata['id'])
    i += 1

print('2nd batch')
i = 0
while i < 5:
    print(next(seq_gen).metadata['id'])
    i += 1

The following command works and produces the correct screen output:

cat input.fastq | python test1.py

Then I tried a different use case. The following script test1.py is the currently accepted skbio usage:

import sys
from skbio import DNA
with open(sys.argv[1], 'r') as f:
    seq = DNA.read(f, seq_num=5)
print(seq)

One can run it like:

python test1.py input.fastq

I tried to edit it to support standard input (test2.py):

import sys
from skbio import DNA
seq = DNA.read(sys.stdin, seq_num=5)
print(seq)

And run:

cat input.fastq | python test2.py

The error message is:

io.UnsupportedOperation: underlying stream is not seekable

I specified the format (i.e., skipping sniffing?) (test3.py):

import sys
from skbio import DNA
seq = DNA.read(sys.stdin, seq_num=5, format='fastq')
print(seq)

The error message is:

skbio.io._exception.FASTQFormatError: Expected sequence (@) header line at start of file: 'TATGCAATGAATATTAAGTTTACAACCGCAAATACTGACCCAAATGAATTTGCTGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCTTGTCGAATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAACCGCAACAAGGCTCGGCGA'

input.fastq.zip

Therefore, I think that the current PR is very close to the right solution, but some subtle places still need some work.

skbio/io/_iosources.py

qiyunzhu · 2024-05-27T17:09:33Z

skbio/io/format/_base.py

    for line in _line_generator(fh, skip_blanks=False):
+        consumed += [line]


This can be replaced with:

consumed.append(line)

Is this more efficient than just doing +=? I'm using += in other places as sometimes I need to add multiple lines, and wanted to keep it consistent.

append is just a "pythonic" way :) I don't know if it's more efficient or not. PS: if you want to add multiple elements, you can do consumed.extend([line1, line2, line3...]).

skbio/io/format/fasta.py

wasade · 2024-05-30T20:21:18Z

skbio/io/registry.py

+                try:
+                    # Attempt to reset and sniff file
+                    backup = file.tell()
+                    res = sniffer(file, **io_kwargs)
+                    is_format = res[0]
+                    skwargs = res[1]
+                    file.seek(backup)
+                except io.UnsupportedOperation:
+                    # File is not seekable, so get consumed lines to reset it
+                    is_format, skwargs, consumed = sniffer(file, **io_kwargs)


peek for stdin (see sys.stdin.buffer.peek) and many other IO types (not io.StringIO AFAIK) would work without advancing the pointer too. But, it's read request is not assured to be what you want in terms of bytes

Yeah, I've tested a couple other methods of reading nonseekables including this but it just seems that reconstructing it as a chain is the best.

wasade · 2024-05-30T20:21:49Z

skbio/io/registry.py

@@ -914,7 +939,8 @@ def wrapped_sniffer(
                        # Some formats may have headers which indicate their
                        # format sniffers should be able to rely on the
                        # filehandle to point at the beginning of the file.
-                        fh.seek(0)
+                        if self.header:
+                            fh.seek(0)


Is this still safe for stdin which is not seekable?

I think that @RaeedA added a seekable flag to different input types to control this behavior.

That was the original plan, I moved some things around in the current changes to make it work more intuitively. It's now in a try/catch block.

wasade · 2024-05-30T20:26:27Z

skbio/io/tests/test_util.py

+class TestReadStandardInput(unittest.TestCase):
+    num_text = ["ONE", "TWO", "THREE", "FOUR", "FIVE", "SIX", "SEVEN"]
+
+    def test_stdin_read_fasta(self):


These tests should stay, but I recommend additionally setting up an independent integration test run through the Makefile. Something like the below. The inline python though could also be placed in a minor script.

#!/bin/bash cat skbio/io/tests/test_data/some_fasta.fna | \ python -c "import skbio, sys; print([r.metadata['id'] for r in skbio.read(sys.stdin, format='fasta')])" > /tmp/obs_stdin.ids grep "^>" skbio/io/tests/test_data/some_fasta.fna | tr -d ">" > /tmp/exp_stdin.ids obs_md5=$(md5sum /tmp/obs_stdin.ids) exp_md5=$(md5sum /tmp/exp_stdin.ids) if [[ ${obs_md5} != ${exp_md5} ]]; then exit 1 fi

I'm not super familiar with these, but I've been looking for a way to do just this! I'll take a look at implementing this for testing.

wasade · 2024-06-03T22:36:35Z

Does this work?How to check if file object is random accessstackoverflow.comOn Jun 3, 2024, at 15:05, Raeed Azom ***@***.***> wrote: @RaeedA commented on this pull request. In skbio/io/registry.py:

@@ -914,7 +939,8 @@ def wrapped_sniffer(

# Some formats may have headers which indicate their # format sniffers should be able to rely on the # filehandle to point at the beginning of the file. - fh.seek(0) + if self.header: + fh.seek(0) That was the original plan, I moved some things around in the current changes to make it work more intuitively. It's now in a try/catch block. —Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you were mentioned.Message ID: ***@***.***>

qiyunzhu · 2024-06-03T22:39:12Z

@wasade According to the Python documentation, seekable is a valid check. It says: "Return True if the stream supports random access".

wasade · 2024-06-03T23:03:23Z

Right, and it would be preferable to perform the explicit check rather than a try/except

wasade · 2024-06-03T23:04:15Z

Oh I see, the SO post didn't actually come through in email -- this one

qiyunzhu · 2024-06-03T23:46:26Z

Right, and it would be preferable to perform the explicit check rather than a try/except

I agree with that. In the current PR there are both. Redundancy may not be bad. But @RaeedA you may want to check to make sure we don't solely rely on try/except. (I am okay with solely relying on seekable, if that works.)

RaeedA · 2024-06-04T00:44:25Z

Hey! I'm actually not using the seekable call here, as I've found that it behaves irregularly with some data formats. I've instead been using tell() wrapped inside of a try/catch, as tell will consistently produce an UnsupportedOperation error that I can use to detect seekable versus unseekable.

qiyunzhu · 2024-06-04T03:07:40Z

@RaeedA I tend to think that using try/except is fine, given that you found its result consistent, whereas that of seekable isn't.

RaeedA and others added 12 commits May 7, 2024 12:04

adding seekable functions to the IOSource class

0ac11d8

get seekable instead of reader when not seekable

04e965e

modifying changelog

66d81af

Adding comment and letting commit hooks fix formatting

bde320d

making format compatible with ruff

263f891

Removing code that could lead to memory issues

a804f3a

working implimentation of stdin reading

d201771

fixes for errors caused by streaming

5432dca

adding example fasta and fastq files

10f4312

Merge pull request #2 from RaeedA/stdin_fixes

df8713e

Stdin fixes

modifying test to match updated fasta file

a83028a

adding comments and cleaning up code

1b3629b

qiyunzhu requested changes May 27, 2024

View reviewed changes

wasade reviewed May 30, 2024

View reviewed changes

RaeedA closed this Jul 2, 2024

RaeedA deleted the stdin_support branch July 2, 2024 00:50

RaeedA restored the stdin_support branch July 2, 2024 00:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WIP: stdin support for FASTA/FASTQ files #2029

WIP: stdin support for FASTA/FASTQ files #2029

		for line in _line_generator(fh, skip_blanks=False):
		consumed += [line]

WIP: stdin support for FASTA/FASTQ files #2029

WIP: stdin support for FASTA/FASTQ files #2029

Conversation

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment