-
Notifications
You must be signed in to change notification settings - Fork 270
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: stdin support for FASTA/FASTQ files #2029
Conversation
Stdin fixes
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2029 +/- ##
==========================================
- Coverage 98.59% 98.44% -0.16%
==========================================
Files 182 182
Lines 30885 31156 +271
Branches 7512 7583 +71
==========================================
+ Hits 30452 30671 +219
- Misses 418 469 +51
- Partials 15 16 +1 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RaeedA Great job! It looks to me quite meaningful. I did some tests using the attached real-world FASTQ file as input, and got mixed result.
First, the original use case suggested by @wasade in #2017 , appears to be working. My test script test.py
has the following content:
import sys
import skbio
seq_gen = skbio.read(sys.stdin, format='fastq')
print('1st batch')
i = 0
while i < 5:
print(next(seq_gen).metadata['id'])
i += 1
print('2nd batch')
i = 0
while i < 5:
print(next(seq_gen).metadata['id'])
i += 1
The following command works and produces the correct screen output:
cat input.fastq | python test1.py
Then I tried a different use case. The following script test1.py
is the currently accepted skbio usage:
import sys
from skbio import DNA
with open(sys.argv[1], 'r') as f:
seq = DNA.read(f, seq_num=5)
print(seq)
One can run it like:
python test1.py input.fastq
I tried to edit it to support standard input (test2.py
):
import sys
from skbio import DNA
seq = DNA.read(sys.stdin, seq_num=5)
print(seq)
And run:
cat input.fastq | python test2.py
The error message is:
io.UnsupportedOperation: underlying stream is not seekable
I specified the format (i.e., skipping sniffing?) (test3.py
):
import sys
from skbio import DNA
seq = DNA.read(sys.stdin, seq_num=5, format='fastq')
print(seq)
The error message is:
skbio.io._exception.FASTQFormatError: Expected sequence (@) header line at start of file: 'TATGCAATGAATATTAAGTTTACAACCGCAAATACTGACCCAAATGAATTTGCTGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCTTGTCGAATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAACCGCAACAAGGCTCGGCGA'
Therefore, I think that the current PR is very close to the right solution, but some subtle places still need some work.
for line in _line_generator(fh, skip_blanks=False): | ||
consumed += [line] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be replaced with:
consumed.append(line)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this more efficient than just doing +=? I'm using += in other places as sometimes I need to add multiple lines, and wanted to keep it consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
append
is just a "pythonic" way :) I don't know if it's more efficient or not. PS: if you want to add multiple elements, you can do consumed.extend([line1, line2, line3...])
.
try: | ||
# Attempt to reset and sniff file | ||
backup = file.tell() | ||
res = sniffer(file, **io_kwargs) | ||
is_format = res[0] | ||
skwargs = res[1] | ||
file.seek(backup) | ||
except io.UnsupportedOperation: | ||
# File is not seekable, so get consumed lines to reset it | ||
is_format, skwargs, consumed = sniffer(file, **io_kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
peek
for stdin
(see sys.stdin.buffer.peek
) and many other IO types (not io.StringIO
AFAIK) would work without advancing the pointer too. But, it's read request is not assured to be what you want in terms of bytes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I've tested a couple other methods of reading nonseekables including this but it just seems that reconstructing it as a chain is the best.
@@ -914,7 +939,8 @@ def wrapped_sniffer( | |||
# Some formats may have headers which indicate their | |||
# format sniffers should be able to rely on the | |||
# filehandle to point at the beginning of the file. | |||
fh.seek(0) | |||
if self.header: | |||
fh.seek(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this still safe for stdin
which is not seekable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that @RaeedA added a seekable
flag to different input types to control this behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was the original plan, I moved some things around in the current changes to make it work more intuitively. It's now in a try/catch block.
class TestReadStandardInput(unittest.TestCase): | ||
num_text = ["ONE", "TWO", "THREE", "FOUR", "FIVE", "SIX", "SEVEN"] | ||
|
||
def test_stdin_read_fasta(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These tests should stay, but I recommend additionally setting up an independent integration test run through the Makefile
. Something like the below. The inline python though could also be placed in a minor script.
#!/bin/bash
cat skbio/io/tests/test_data/some_fasta.fna | \
python -c "import skbio, sys; print([r.metadata['id'] for r in skbio.read(sys.stdin, format='fasta')])" > /tmp/obs_stdin.ids
grep "^>" skbio/io/tests/test_data/some_fasta.fna | tr -d ">" > /tmp/exp_stdin.ids
obs_md5=$(md5sum /tmp/obs_stdin.ids)
exp_md5=$(md5sum /tmp/exp_stdin.ids)
if [[ ${obs_md5} != ${exp_md5} ]]; then
exit 1
fi
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not super familiar with these, but I've been looking for a way to do just this! I'll take a look at implementing this for testing.
Does this work?How to check if file object is random accessstackoverflow.comOn Jun 3, 2024, at 15:05, Raeed Azom ***@***.***> wrote:
@RaeedA commented on this pull request.
In skbio/io/registry.py:
@@ -914,7 +939,8 @@ def wrapped_sniffer(
# Some formats may have headers which indicate their
# format sniffers should be able to rely on the
# filehandle to point at the beginning of the file.
- fh.seek(0)
+ if self.header:
+ fh.seek(0)
That was the original plan, I moved some things around in the current changes to make it work more intuitively. It's now in a try/catch block.
—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you were mentioned.Message ID: ***@***.***>
|
@wasade According to the Python documentation, |
Right, and it would be preferable to perform the explicit check rather than a try/except |
Oh I see, the SO post didn't actually come through in email -- this one |
I agree with that. In the current PR there are both. Redundancy may not be bad. But @RaeedA you may want to check to make sure we don't solely rely on try/except. (I am okay with solely relying on seekable, if that works.) |
Hey! I'm actually not using the seekable call here, as I've found that it behaves irregularly with some data formats. I've instead been using tell() wrapped inside of a try/catch, as tell will consistently produce an UnsupportedOperation error that I can use to detect seekable versus unseekable. |
@RaeedA I tend to think that using try/except is fine, given that you found its result consistent, whereas that of seekable isn't. |
Issue #2017
Adding support for FASTA/FASTQ files to make them accept stdin streaming, as well as adding compatibility for other file formats in the future.
Modified the FASTQ file parser to match the FASTA structure, specifically moving the parser functionality out of the reader and into a dedicated helper function.
Making most sniffer/parsers pass out consumed lines.
Coverage still in progress.
Please complete the following checklist:
I have read the contribution guidelines.
I have documented all public-facing changes in the changelog.
This pull request includes code, documentation, or other content derived from external source(s). If this is the case, ensure the external source's license is compatible with scikit-bio's license. Include the license in the
licenses
directory and add a comment in the code giving proper attribution. Ensure any other requirements set forth by the license and/or author are satisfied.This pull request does not include code, documentation, or other content derived from external source(s).
Note: This document may also be helpful to see some of the things code reviewers will be verifying when reviewing your pull request.