Sovisu+ publications harvester as microservice
SoVisu+ Harvester is distributed under the terms of the CeCILL v2.1 license (GPL compatible).
The SoVisu+ Harvester project is intended to provide a unified interface to the various scholarly publication databases and repositories that are used by an institutional research information system in general and the SoVisu+ project in particular.
SoVisu+ Harvester is implemented as a microservice that can be deployed in a containerized environment.
It is intended to institutions that have already created and actively maintain a repository of matched identifiers for authors, structures and research projects.
The SoVisu+ Harvester is designed to receive requests containing a list of identifiers for a so-called "research entity" (wich can be an author, research structure, research institution or research project) and to return a list of references to publications that are associated with the research entity.
The list of accepted identifiers is not exhaustive as the service is extensible by design.
- Idref
- Idhal
- ORCID
- ResearcherID
- ...
The references are obtained through a set of modular harvesters that are designed to query various scholarly publication databases and repositories :
- Data.idref.fr
- Hal
- Scanr
- Web of Science
- OpenAlex
- ... The harvesters are launched in parallel and their results are returned on the fly to the client, allowing for a fast response time and an asynchronous display of the results.
Requests and results are registered in the database for monitoring purposes. The harvesting history may be consulted through the web interface.
The research entities may be submitted to the harvester in 4 ways :
- Manually, through the web interface
- By querying the REST API asynchronoulsy
- By querying the REST API synchronoulsy
- Through AMQP messages delivered through a message broker (such as RabbitMQ)
The structure of the JSON output complies with the SciencePlus publications model.
Please note that references deduplication is out of scope of SoVisu+ Harvester and should be handled by another SoVisu+ component.
Server side :
- Python 3.10 with asyncio
- FastAPI with Pydantic
- PostgreSQL with asyncpg and alembic
- RabbitMQ (through aio-pika)
- Poetry
- Pytest
- Black
Client side (admin interface) :
- Npm
- Webpack
- Bootstrap
- Jest
Install Postgresql, RabbitMQ and the web server you want to use as a front-end.
Note that poetry is not required as requirements are exported to requirements.txt.
Clone the projet, copy .env.example to .env and .test.env and update them. All the values defined in the app/settings classes (AppSettings, TestSettings, DevSettings...) can be overriden either through .env files or through environment variables (the latter takes precedence over the former).
As postgres user :
CREATE
DATABASE your_db_name;
CREATE
USER your_user_name WITH PASSWORD 'your_secret';
GRANT ALL PRIVILEGES ON DATABASE
your_db_name to your_user_name;
Repeat for test database.
Update .env and .test.env with credentials
DB_USER="your_user_name"
DB_NAME="your_db_name"
DB_PASSWORD="your_secret"
The project uses alembic for schema versioning and migration.
At development time, migrations are generated automatically from the models (see app/models and alembic/versions).
APP_ENV=DEV alembic revision --autogenerate -m "Explain what you did to the model"
At deployment time, migrations are applied to the database.
APP_ENV=DEV alembic upgrade head
The project uses poetry for dependency management.
poetry install
The project uses pytest for testing.
From project root :
APP_ENV=TEST pytest
or with coverage
APP_ENV=TEST coverage run --source=app -m pytest
coverage report --show-missing
The project uses webpack for assets compilation.
Copy app/templates/src/js/env.js.example to app/templates/src/js/env.js and update it with your values before compiling the assets.
From app/templates :
- Dependencies installation
npm install
- Assets compilation
npm run build
or
npm run build
From project root :
APP_ENV=DEV uvicorn app.main:app --reload
or
APP_ENV=DEV python3 app/main.py
To update the translation files, run the following command from the project root :
pybabel extract --mapping babel.cfg --output-file=locales/admin.pot .
To init a po file for a new language :
pybabel init --domain=admin --input-file=locales/admin.pot --output-dir=locales --locale=NEW_one
To update the .po files with the new strings :
pybabel update --domain=admin --input-file=locales/admin.pot --output-dir=locales
To compile the .po files to .mo files :
pybabel compile --domain=admin --directory=locales --use-fuzzy
See Babel commad line documentation for more information.
The documentation is written in reStructuredText and compiled with Sphinx.
To export the documentation to HTML, run the following command from the docs
directory :
python -m sphinx -b html source build/html
Then, copy the content of the docs/build/html
directory to the server of your choice.
The documentation is automatically published on ReadTheDocs at each push on the dev-main
branch.
The configuration settings are defined in the docs/source/conf.py
file, which is specified in the .readthedocs.yml
file as the entry point.
To export the documentation to Confluence through the Confluence Publisher plugin,
- Copy
docs/source/confluence/conf.py.example
todocs/source/confluence/conf.py
and update it with your values - Run the following command from the
docs
directory :
python -m sphinx -b confluence -c source/confluence source build/confluence -E -a