Using metadata to boost the performance of ExtractiveReader #5640

Open
sjrl opened this issue Aug 28, 2023 · 6 comments
Labels
2.x (Related to Haystack v2.0), P3 (Low priority, leave it in the backlog), type:feature (New feature or request)

Comments

@sjrl
Contributor
sjrl commented Aug 28, 2023

Is your feature request related to a problem? Please describe.
I would like to be able to use meta information to provide context to the TransformersReader or the FARMReader to boost question-answering performance, similar to how we can use embed_meta_fields to boost the performance of EmbeddingRetrievers. Sometimes meta information is needed to distinguish between similar documents.

We have had multiple clients face this exact problem: they are retrieving information from lots of legal PDF files that contain a lot of boilerplate text and often define things like the company name only once, at the beginning of a 60-page PDF.

Describe the solution you'd like
As motivation, I'd like to walk through an example where being able to add meta information from a document to the Reader at query time would be beneficial. Pretend I have two docs with a similar structure that contain similar information, but about two different companies:

Document 1 (comes from pear_llc_contract.pdf)

```
# meta info
meta = {"additional_context": "This passage is about the company Pear, from the year 2020."}
# content of Document
Company ID: 312521124141
Deal amount: 100k
Two leading organizations have joined forces in a groundbreaking partnership that promises to revolutionize their respective industries. The agreement, which was finalized after months of negotiations, will see the companies collaborate on a range of exciting initiatives that will benefit both parties and their customers.
```

Document 2 (comes from rainforest_contract.pdf)

```
# meta info
meta = {"additional_context": "This passage is about the company Rainforest, from the year 2019."}
# content of Document
Company ID: 847584923
Deal amount: 60k
The deal is expected to generate significant benefits for both companies, including increased revenue, improved operational efficiency, and enhanced customer experience. It is also expected to create new jobs and stimulate economic growth in the regions where the companies operate.
```
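For concreteness, here is a minimal sketch of how these two example documents could be constructed with Haystack's Document dataclass (the content strings are abbreviated, and additional_context is just the example meta key from above):

```python
from haystack import Document

# Sketch: the two example documents above, with the contextual meta field
# attached. Content is abbreviated for brevity.
doc1 = Document(
    content="Company ID: 312521124141\nDeal amount: 100k\nTwo leading organizations have joined forces ...",
    meta={"additional_context": "This passage is about the company Pear, from the year 2020."},
)
doc2 = Document(
    content="Company ID: 847584923\nDeal amount: 60k\nThe deal is expected to generate significant benefits ...",
    meta={"additional_context": "This passage is about the company Rainforest, from the year 2019."},
)
```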

I would like to ask the question "What is the company ID of Pear LLC?" However, nowhere in the content of the documents are the names of the companies involved in the deal specified. So if I provide these two documents to a FARMReader, I have about a 50/50 chance of getting the correct answer.

However, if I could specify a new parameter (e.g. embed_meta_fields, like we can for EmbeddingRetrievers):

```python
reader = ExtractiveReader(model="deepset/deberta-v3-large-squad2", embed_meta_fields=["additional_context"])
```

then the FARMReader would have the necessary context to answer the question.
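To make the proposal concrete, here is a minimal sketch of the preprocessing the reader could perform internally if such a parameter existed. Note that embed_meta_fields is not an actual ExtractiveReader parameter today, and prepend_meta_fields is a hypothetical helper:

```python
from haystack import Document

def prepend_meta_fields(doc: Document, embed_meta_fields: list[str]) -> str:
    """Hypothetical helper: build the text the QA model would actually see."""
    prefix = "\n".join(
        str(doc.meta[field]) for field in embed_meta_fields if field in doc.meta
    )
    return f"{prefix}\n{doc.content}" if prefix else doc.content

doc = Document(
    content="Company ID: 312521124141\nDeal amount: 100k\n...",
    meta={"additional_context": "This passage is about the company Pear, from the year 2020."},
)
model_input = prepend_meta_fields(doc, embed_meta_fields=["additional_context"])
# -> "This passage is about the company Pear, from the year 2020.\nCompany ID: ..."
```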

Additional context

  • This is a similar idea to how we can use PromptTemplates to provide context to the PromptNode, and in PromptTemplates we can already add meta information from the Document into the prompt using special variables. I think extending this to an extractive reader would still be very beneficial because Sol has still seen considerable interest in extractive models.
  • However, one difference is that we should consider preventing the ExtractiveReader from returning the additional_context as an answer, since the additional_context will not be present in the Document returned to the user.
@ZanSara
Contributor
ZanSara commented Aug 28, 2023

Hey @sjrl, could this be a feature of ExtractiveReader, rather than FARMReader? We're trying to bring feature parity between them, so new features should be added to ExtractiveReader directly.

If so, let's change the title and mark this as a Haystack 2.x feature request. If not, let's figure out why 🙂

@sjrl
Contributor Author
sjrl commented Aug 28, 2023

Yes definitely. This could be a feature for ExtractiveReader.

@sjrl sjrl changed the title Using metadata to boost the performance of FARMReader Using metadata to boost the performance of ExtractiveReader Aug 28, 2023
@sjrl sjrl added type:feature New feature or request 2.x Related to Haystack v2.0 labels Aug 28, 2023
@Timoeller
Contributor

I don't understand why it should be a meta field. Can't this info be added to documents during preprocessing? In any case, if it is urgent for any of the clients, feel free to open a lightweight PR. I would prefer, though, to handle it outside of the Reader.

@Timoeller Timoeller added the P3 Low priority, leave it in the backlog label Sep 29, 2023
@sjrl
Contributor Author
sjrl commented Sep 29, 2023

I don't understand why it should be a meta field.

I think we often won't want this additional information to be returned as an answer by the reader. Hence this point from my original description:

  • However, one difference is that we should consider preventing the ExtractiveReader from returning the additional_context as an answer, since the additional_context will not be present in the Document returned to the user.

That's why just directly adding it to the preprocessed document would not work.

I would prefer, though, to handle it outside of the Reader.

Given that we would preferably not allow this additional text to be returned as an answer, I think it would be better to integrate this within the ExtractiveReader.

What do you think?
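One possible way to implement the restriction inside the reader (a sketch only, not existing Haystack code): since the additional context would be prepended to the document text, candidate spans that start inside that prefix can be dropped, and the remaining offsets shifted back into the original document's coordinates:

```python
def filter_and_shift_spans(spans, prefix_len):
    """Hypothetical post-processing step inside the reader.

    spans: list of (start, end, score) character offsets into the text
    that was fed to the model, i.e. prefix + original document content.
    """
    kept = []
    for start, end, score in spans:
        if start < prefix_len:
            # The span begins in the prepended context, which is not part of
            # the Document returned to the user, so it cannot be an answer.
            continue
        kept.append((start - prefix_len, end - prefix_len, score))
    return kept
```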

@Timoeller
Contributor

Mh, still not sure about this. In the prompt, users can check what was passed to the model. With Extractive QA we want to ensure even more that the user can check the predictions properly. Without the additional_context this might not be possible.
I think having additional_context inside the document would be fine (with a clear indication that it was added?).

What I like about this idea is that it is designed similarly to embed_meta_fields of the embedders.

Feel free to open a lightweight PR for this feature.

@sjrl
Contributor Author
sjrl commented Oct 2, 2023

What I like about this idea is that it is designed similarly to embed_meta_fields of the embedders.

I would say that embed_meta_fields obscures the addition of the metadata to the text. The embed_meta_fields feature only adds the text at indexing time, and at search time the end user doesn't see that this meta info was prepended to the document.

In the prompt, users can check what was passed to the model. With Extractive QA we want to ensure even more that the user can check the predictions properly.

However, this is a really good point. Maybe a compromise could be that we add the additional_context to the document in the returned Haystack Answer so the user can see it, but we still restrict the model from returning the additional_context as part of the answer?
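A rough sketch of what that compromise could look like in the returned answer (the dict shape is illustrative only, not Haystack's actual Answer schema):

```python
answer = {
    "answer": "312521124141",  # span extracted from the original document content only
    "document_id": "pear-llc-contract-doc-1",  # hypothetical id
    "meta": {
        # The context that was prepended for the model is surfaced here so the
        # user can verify the prediction, even though it can never be the span.
        "additional_context": "This passage is about the company Pear, from the year 2020.",
    },
}
```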
