[go: nahoru, domu]

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: stdin support for FASTA/FASTQ files #2029

Closed
wants to merge 12 commits into from

Conversation

RaeedA
Copy link
Contributor
@RaeedA RaeedA commented May 24, 2024

Issue #2017
Adding support for FASTA/FASTQ files to make them accept stdin streaming, as well as adding compatibility for other file formats in the future.
Modified the FASTQ file parser to match the FASTA structure, specifically moving the parser functionality out of the reader and into a dedicated helper function.
Making most sniffer/parsers pass out consumed lines.
Coverage still in progress.

Please complete the following checklist:

  • I have read the contribution guidelines.

  • I have documented all public-facing changes in the changelog.

  • This pull request includes code, documentation, or other content derived from external source(s). If this is the case, ensure the external source's license is compatible with scikit-bio's license. Include the license in the licenses directory and add a comment in the code giving proper attribution. Ensure any other requirements set forth by the license and/or author are satisfied.

    • It is your responsibility to disclose code, documentation, or other content derived from external source(s). If you have questions about whether something can be included in the project or how to give proper attribution, include those questions in your pull request and a reviewer will assist you.
  • This pull request does not include code, documentation, or other content derived from external source(s).

Note: This document may also be helpful to see some of the things code reviewers will be verifying when reviewing your pull request.

Copy link
codecov bot commented May 24, 2024

Codecov Report

Attention: Patch coverage is 91.30435% with 14 lines in your changes are missing coverage. Please review.

Project coverage is 98.44%. Comparing base (e3a27f5) to head (1b3629b).
Report is 2 commits behind head on main.

Files Patch % Lines
skbio/io/_iosources.py 54.54% 5 Missing ⚠️
skbio/io/util.py 70.00% 3 Missing ⚠️
skbio/io/format/_base.py 77.77% 2 Missing ⚠️
skbio/io/registry.py 92.00% 2 Missing ⚠️
skbio/io/format/fasta.py 94.73% 1 Missing ⚠️
skbio/io/format/fastq.py 97.77% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2029      +/-   ##
==========================================
- Coverage   98.59%   98.44%   -0.16%     
==========================================
  Files         182      182              
  Lines       30885    31156     +271     
  Branches     7512     7583      +71     
==========================================
+ Hits        30452    30671     +219     
- Misses        418      469      +51     
- Partials       15       16       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor
@qiyunzhu qiyunzhu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RaeedA Great job! It looks to me quite meaningful. I did some tests using the attached real-world FASTQ file as input, and got mixed result.

First, the original use case suggested by @wasade in #2017 , appears to be working. My test script test.py has the following content:

import sys
import skbio

seq_gen = skbio.read(sys.stdin, format='fastq')

print('1st batch')
i = 0
while i < 5:
    print(next(seq_gen).metadata['id'])
    i += 1

print('2nd batch')
i = 0
while i < 5:
    print(next(seq_gen).metadata['id'])
    i += 1

The following command works and produces the correct screen output:

cat input.fastq | python test1.py

Then I tried a different use case. The following script test1.py is the currently accepted skbio usage:

import sys
from skbio import DNA
with open(sys.argv[1], 'r') as f:
    seq = DNA.read(f, seq_num=5)
print(seq)

One can run it like:

python test1.py input.fastq

I tried to edit it to support standard input (test2.py):

import sys
from skbio import DNA
seq = DNA.read(sys.stdin, seq_num=5)
print(seq)

And run:

cat input.fastq | python test2.py

The error message is:

io.UnsupportedOperation: underlying stream is not seekable

I specified the format (i.e., skipping sniffing?) (test3.py):

import sys
from skbio import DNA
seq = DNA.read(sys.stdin, seq_num=5, format='fastq')
print(seq)

The error message is:

skbio.io._exception.FASTQFormatError: Expected sequence (@) header line at start of file: 'TATGCAATGAATATTAAGTTTACAACCGCAAATACTGACCCAAATGAATTTGCTGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCTTGTCGAATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAACCGCAACAAGGCTCGGCGA'

input.fastq.zip

Therefore, I think that the current PR is very close to the right solution, but some subtle places still need some work.

skbio/io/_iosources.py Show resolved Hide resolved
for line in _line_generator(fh, skip_blanks=False):
consumed += [line]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be replaced with:

consumed.append(line)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this more efficient than just doing +=? I'm using += in other places as sometimes I need to add multiple lines, and wanted to keep it consistent.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

append is just a "pythonic" way :) I don't know if it's more efficient or not. PS: if you want to add multiple elements, you can do consumed.extend([line1, line2, line3...]).

skbio/io/format/fasta.py Show resolved Hide resolved
Comment on lines +542 to +551
try:
# Attempt to reset and sniff file
backup = file.tell()
res = sniffer(file, **io_kwargs)
is_format = res[0]
skwargs = res[1]
file.seek(backup)
except io.UnsupportedOperation:
# File is not seekable, so get consumed lines to reset it
is_format, skwargs, consumed = sniffer(file, **io_kwargs)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

peek for stdin (see sys.stdin.buffer.peek) and many other IO types (not io.StringIO AFAIK) would work without advancing the pointer too. But, it's read request is not assured to be what you want in terms of bytes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I've tested a couple other methods of reading nonseekables including this but it just seems that reconstructing it as a chain is the best.

@@ -914,7 +939,8 @@ def wrapped_sniffer(
# Some formats may have headers which indicate their
# format sniffers should be able to rely on the
# filehandle to point at the beginning of the file.
fh.seek(0)
if self.header:
fh.seek(0)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still safe for stdin which is not seekable?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that @RaeedA added a seekable flag to different input types to control this behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was the original plan, I moved some things around in the current changes to make it work more intuitively. It's now in a try/catch block.

class TestReadStandardInput(unittest.TestCase):
num_text = ["ONE", "TWO", "THREE", "FOUR", "FIVE", "SIX", "SEVEN"]

def test_stdin_read_fasta(self):
Copy link
Collaborator
@wasade wasade May 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests should stay, but I recommend additionally setting up an independent integration test run through the Makefile. Something like the below. The inline python though could also be placed in a minor script.

#!/bin/bash

cat skbio/io/tests/test_data/some_fasta.fna | \
    python -c "import skbio, sys; print([r.metadata['id'] for r in skbio.read(sys.stdin, format='fasta')])" > /tmp/obs_stdin.ids

grep "^>" skbio/io/tests/test_data/some_fasta.fna | tr -d ">" > /tmp/exp_stdin.ids

obs_md5=$(md5sum /tmp/obs_stdin.ids)
exp_md5=$(md5sum /tmp/exp_stdin.ids)

if [[ ${obs_md5} != ${exp_md5} ]]; then
    exit 1
fi

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not super familiar with these, but I've been looking for a way to do just this! I'll take a look at implementing this for testing.

@wasade
Copy link
Collaborator
wasade commented Jun 3, 2024 via email

@qiyunzhu
Copy link
Contributor
qiyunzhu commented Jun 3, 2024

@wasade According to the Python documentation, seekable is a valid check. It says: "Return True if the stream supports random access".

@wasade
Copy link
Collaborator
wasade commented Jun 3, 2024

Right, and it would be preferable to perform the explicit check rather than a try/except

@wasade
Copy link
Collaborator
wasade commented Jun 3, 2024

Oh I see, the SO post didn't actually come through in email -- this one

@qiyunzhu
Copy link
Contributor
qiyunzhu commented Jun 3, 2024

Right, and it would be preferable to perform the explicit check rather than a try/except

I agree with that. In the current PR there are both. Redundancy may not be bad. But @RaeedA you may want to check to make sure we don't solely rely on try/except. (I am okay with solely relying on seekable, if that works.)

@RaeedA
Copy link
Contributor Author
RaeedA commented Jun 4, 2024

Hey! I'm actually not using the seekable call here, as I've found that it behaves irregularly with some data formats. I've instead been using tell() wrapped inside of a try/catch, as tell will consistently produce an UnsupportedOperation error that I can use to detect seekable versus unseekable.

@qiyunzhu
Copy link
Contributor
qiyunzhu commented Jun 4, 2024

@RaeedA I tend to think that using try/except is fine, given that you found its result consistent, whereas that of seekable isn't.

@RaeedA RaeedA closed this Jul 2, 2024
@RaeedA RaeedA deleted the stdin_support branch July 2, 2024 00:50
@RaeedA RaeedA restored the stdin_support branch July 2, 2024 00:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants