Download REBASE files on install or use #972

Open
vincentdavis opened this issue Oct 13, 2016 · 8 comments

Comments

@vincentdavis
Contributor

From @peterjc "Avoid the legal grey area about the REBASE files by downloading them either at install time or usage." see New restriction analysis library, v2 #268

  • Is updating the Restriction_Dictionary part of building a new release? If it is, I do not see where this is documented or automated.

Proposal:

  1. Review or rewrite rebase_update.py
  2. Make rebase_update.py part of the release build process, maybe something like:
$ python setup.py build --rebase_update
  3. Deprecate the inclusion of the Restriction_Dictionary in the git repo while continuing to include it in the release. Then, in the future, consider not including it in the release but only downloading it on install.
@peterjc
Member
peterjc commented Oct 13, 2016

Currently the REBASE update has been done periodically on an ad hoc basis, largely because there is no active maintainer of the restriction code. While it generally has worked fine, I recall it needing some manual intervention where a new enzyme has something strange (e.g. a previously unused special character appears in the name etc).

It looks like it has been updated twice since the original contribution:

Doing an update now would be a good test of how well documented I have left this, and a useful contribution to Biopython in itself.

Long term, I wondered about fetching the REBASE files semi-automatically (as part of a module re-write) a bit like the NCBI DTD file downloads for the Entrez parser.

@MarkusPiotrowski
Contributor

I have updated it recently and the update went into 1.67. Also, I made some changes to rebase_update.py so that the year of the update is included in the header:

# Used REBASE emboss files version 605 (2016).
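A minimal sketch of how such a header line could be derived from the comment block at the top of an EMBOSS-format REBASE file. This is not the actual code in rebase_update.py; the function name `rebase_header_comment` and the regular expressions are assumptions made for illustration, and the sample text is an abridged copy of the emboss_e.610 file header.

```python
import re

# Abridged header of an EMBOSS-format REBASE file (emboss_e.610).
SAMPLE_HEADER = """\
#
# REBASE version 610                                              emboss_e.610
#
# Rich Roberts                                                    Sep 29 2016
"""


def rebase_header_comment(header_text):
    """Build a 'Used REBASE emboss files ...' comment from a file header.

    Hypothetical helper: the version is taken from the 'REBASE version NNN'
    line and the year from the first line that ends in a four-digit year.
    """
    version = re.search(r"REBASE version (\d+)", header_text)
    year = re.search(r"\b((?:19|20)\d{2})\s*$", header_text, re.MULTILINE)
    if version is None or year is None:
        raise ValueError("header does not look like a REBASE EMBOSS file")
    return "# Used REBASE emboss files version %s (%s)." % (
        version.group(1),
        year.group(1),
    )


print(rebase_header_comment(SAMPLE_HEADER))
# prints: # Used REBASE emboss files version 610 (2016).
```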

In the Restriction cookbook it is written that the file will be updated with every new Biopython version, but this is obviously not the case. Actually, I wonder how often this is really necessary: while the number of known restriction enzymes has increased considerably (I think I recall that there are more than 100 new enzymes in Restriction_Dictionary.py), only a handful of them are commercially available.

@peterjc
Member
peterjc commented Oct 13, 2016

As far as I know, updating REBASE was never written down as part of the Biopython release process - but that isn't a bad idea: http://biopython.org/wiki/Building_a_release

Apologies @MarkusPiotrowski - I missed 8852490 while looking over the log.

@MarkusPiotrowski
Contributor
MarkusPiotrowski commented Oct 14, 2016

I guess it was the intention of the original author of the Restriction module to update Restriction_Dictionary regularly. And then he may have lost interest.

Actually, I don't understand what exactly the problem with the REBASE files is. As I understand it, the people at REBASE are quite happy with other packages distributing their files, as long as this is mentioned and their latest paper is cited. They also say that they would provide files in other formats if asked (so we could ask them to provide Biopython-specific files, if we want to).
For a new Restriction module I would suggest using the original REBASE files (in whatever format, but EMBOSS seems fine) instead of converting the data into another file, and then trying to implement a sort of lazy loading for the enzymes. We would then distribute the up-to-date, original REBASE files with the Biopython package.
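A rough sketch of what lazy loading from an EMBOSS-style enzyme file could look like. Everything here is an assumption for illustration: the class name `LazyEnzymes`, the whitespace-separated field layout (name, pattern, len, ncuts, blunt, c1 to c4), and the two sample records are not taken from Biopython or checked against a current REBASE release.

```python
from collections.abc import Mapping

# Assumed field layout of an emboss_e record, after the enzyme name.
FIELDS = ("pattern", "len", "ncuts", "blunt", "c1", "c2", "c3", "c4")

# Illustrative records only; real files carry a long comment header.
SAMPLE = """\
# name pattern len ncuts blunt c1 c2 c3 c4
EcoRI GAATTC 6 2 0 1 5 0 0
EcoRV GATATC 6 2 1 3 3 0 0
"""


class LazyEnzymes(Mapping):
    """Index raw lines by enzyme name; parse a record on first access."""

    def __init__(self, text):
        self._raw = {}     # name -> unparsed line
        self._parsed = {}  # name -> parsed record (filled on demand)
        for line in text.splitlines():
            if line.strip() and not line.startswith("#"):
                name = line.split(None, 1)[0]
                self._raw[name] = line

    def __getitem__(self, name):
        if name not in self._parsed:
            values = self._raw[name].split()[1:]  # KeyError if unknown
            record = dict(zip(FIELDS, values))
            for key in FIELDS[1:]:
                record[key] = int(record[key])
            self._parsed[name] = record
        return self._parsed[name]

    def __iter__(self):
        return iter(self._raw)

    def __len__(self):
        return len(self._raw)


enzymes = LazyEnzymes(SAMPLE)
print(enzymes["EcoRI"]["pattern"])
```

Only the enzymes that are actually requested get parsed, so the cost of reading the file stays small even if the file grows.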

@peterjc
Member
peterjc commented Oct 14, 2016

The EMBOSS format data files we use start:

#  
# REBASE version 610                                              emboss_e.610
#  
#     =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
#     REBASE, The Restriction Enzyme Database   http://rebase.neb.com
#     Copyright (c)  Dr. Richard J. Roberts, 2016.   All rights reserved.
#     =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
#  
# Rich Roberts                                                    Sep 29 2016

However http://rebase.neb.com/rebase/rebhelp.html currently says:

Those seeking to distribute REBASE files with their software packages are welcome to do so, providing it is clear to your users that they are not being charged for the REBASE data. It should be transparent that REBASE is a free and independent resource, with the following bibliographical reference:

LATEST REVIEW: PDF file...
Roberts, R.J., Vincze, T., Posfai, J., Macelis, D.
REBASE-a database for DNA restriction and modification: enzymes, genes and genomes.
Nucleic Acids Res. 43: D298-D299 (2015).

OFFICIAL REBASE WEB SITE: http://rebase.neb.com

I wonder if the wording changed? Sadly the site's robots.txt has blocked checking with archive.org

@vincentdavis
Contributor Author

@peterjc @MarkusPiotrowski Thanks for all the input!
Ok, distributing the files seems fine. The size does not look large but maybe gzip them.
The FTP code seems overly complex. I'll continue to simplify and make part of the release process.

Regarding the usage of the files and the conversion to a different format (another, future issue): I was thinking that loading the data into a simple SQLite db would be nice. This would make it easy to query on many different features/properties when looking for an enzyme. Then, when choosing an enzyme, load the data for that enzyme into a dict.
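A minimal sketch of the SQLite idea, using only the standard library. The table name, columns, and helper functions are hypothetical and chosen for illustration; a real schema would follow whatever properties the REBASE files provide.

```python
import sqlite3

# Hypothetical rows: (name, recognition site, site length, blunt ends?)
ENZYMES = [
    ("EcoRI", "GAATTC", 6, 0),
    ("EcoRV", "GATATC", 6, 1),
    ("SmaI", "CCCGGG", 6, 1),
    ("NotI", "GCGGCCGC", 8, 0),
]


def build_db(rows):
    """Create an in-memory database holding one row per enzyme."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    conn.execute(
        "CREATE TABLE enzyme ("
        "name TEXT PRIMARY KEY, site TEXT, size INTEGER, blunt INTEGER)"
    )
    conn.executemany("INSERT INTO enzyme VALUES (?, ?, ?, ?)", rows)
    return conn


def find_names(conn, size, blunt):
    """Query by properties: names of enzymes with a given site size and end type."""
    cur = conn.execute(
        "SELECT name FROM enzyme WHERE size = ? AND blunt = ? ORDER BY name",
        (size, int(blunt)),
    )
    return [row["name"] for row in cur]


def load_enzyme(conn, name):
    """Load the data for one chosen enzyme into a plain dict."""
    row = conn.execute(
        "SELECT * FROM enzyme WHERE name = ?", (name,)
    ).fetchone()
    return dict(row) if row is not None else None


conn = build_db(ENZYMES)
print(find_names(conn, 6, True))
print(load_enzyme(conn, "NotI"))
```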

@peterjc
Member
peterjc commented Oct 14, 2016

SQLite might be overkill - fresh eyes on #268 would be good, as I have failed to set aside the time to look at it.

@MarkusPiotrowski
Contributor

At the moment the REBASE files are not distributed (and they are also not in the repository). rebase_update.py fetches the EMBOSS-formatted files from the REBASE FTP server. ranacompiler.py reads the relevant data from these files and writes a new Restriction_Dictionary.py. This is then copied (either by ranacompiler.py, if it can, or manually) into the Bio\Restriction folder.
Just try running rebase_update.py and you will see what's going on.
