ENH: accepts ETen-B5 and UniCNS-UTF16 encodings #2721

pubpub-zz · 2024-06-21T19:07:46Z

Related to #2356

closes py-pdf#2356
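For readers unfamiliar with the change, the idea of accepting a predefined CMap name and decoding with a matching codec can be sketched as below. This is a minimal illustration only: the table name `PREDEFINED_CMAP_CODECS`, the helper `decode_with_cmap`, and the choice of `cp950` for ETen-B5 are assumptions made for the example and are not taken from the pypdf source.

```python
# Hypothetical lookup table: PDF predefined CMap name -> Python codec name.
# The entries are illustrative assumptions, not pypdf's actual mapping.
PREDEFINED_CMAP_CODECS = {
    "/ETen-B5-H": "cp950",           # Big5 with ETen extensions, horizontal (assumed codec)
    "/ETen-B5-V": "cp950",           # same character set, vertical writing
    "/UniCNS-UTF16-H": "utf-16-be",  # UTF-16BE for traditional Chinese, horizontal
    "/UniCNS-UTF16-V": "utf-16-be",  # same encoding, vertical writing
}


def decode_with_cmap(encoding_name: str, raw: bytes) -> str:
    """Decode raw string bytes with the codec mapped to a predefined CMap name."""
    try:
        codec = PREDEFINED_CMAP_CODECS[encoding_name]
    except KeyError:
        raise ValueError(f"unsupported predefined CMap: {encoding_name}") from None
    # Undecodable bytes become U+FFFD instead of raising.
    return raw.decode(codec, errors="replace")


if __name__ == "__main__":
    sample = "中文"
    print(decode_with_cmap("/ETen-B5-H", sample.encode("cp950")))
    print(decode_with_cmap("/UniCNS-UTF16-H", sample.encode("utf-16-be")))
```

An encoding name that is missing from the table raises `ValueError` in this sketch; a real reader might prefer to log a warning and fall back to a default codec.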

stefan6419846 · 2024-06-22T08:41:22Z

There are three aspects I am not sure about:

* Do we really need the TBC comments inside the mapping?
* If we have possibly public PDF files, shouldn't we add at least a basic test?
* We should not close PdfReader - Extract images from specific pages #2536 with this - there are still unsupported encodings left, as indicated by the "TBC" comments as well.

pubpub-zz · 2024-06-22T10:04:29Z

> There are three aspects I am not sure about:
> * Do we really need the `TBC` comments inside the mapping?

The TBC comments are only there while we wait for feedback from @actuary-chen.

> * If we have possibly public PDF files, shouldn't we add at least a basic test.

I did not focus on this, as it should not be easily subject to regressions, but I agree it would be better to have one.

> * We should not close [PdfReader - Extract images from specific pages #2536](https://github.com/py-pdf/pypdf/discussions/2536) with this - there are still unsupported encodings left, as indicated by the "TBC" comments as well.

I dislike the idea of having a catch-all (garbage-collecting) issue on this subject: we need a test file to confirm the proper encoding in each case, so I would prefer new issues to be raised case by case.

pubpub-zz · 2024-06-22T10:11:33Z

I've removed all TBC comments. Let's wait a little for feedback from @actuary-chen on the last entries.

codecov · 2024-06-22T10:20:18Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.14%. Comparing base (a512408) to head (fdbf37c).
Report is 1 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #2721   +/-   ##
=======================================
  Coverage   95.14%   95.14%           
=======================================
  Files          51       51           
  Lines        8547     8547           
  Branches     1703     1703           
=======================================
  Hits         8132     8132           
  Misses        261      261           
  Partials      154      154


stefan6419846 · 2024-06-22T18:27:21Z

> I dislike the idea of having a catch-all (garbage-collecting) issue on this subject: we need a test file to confirm the proper encoding in each case, so I would prefer new issues to be raised case by case.

I initially opened the corresponding issue to discuss how this could be done in general or whether there might be any official test documents which would allow us to cover all cases without having lots of small commits for it.

actuary-chen · 2024-06-22T21:25:34Z

I can only confirm that the message "pypdf._cmap: implementation of advance cmap ...." no longer appears. However, I cannot verify whether the text is decoded correctly, because I feed it into an embedding model for a vector database.
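A check like the one described above (no warning emitted during extraction) can be automated by capturing log records while the extraction runs. The following is a minimal sketch using only the standard `logging` module. The helper name `run_and_collect_warnings` is hypothetical, and the logger name `pypdf._cmap` is an assumption inferred from the message prefix quoted above.

```python
import logging
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


class _ListHandler(logging.Handler):
    """Logging handler that stores formatted messages in a list."""

    def __init__(self) -> None:
        super().__init__(level=logging.WARNING)
        self.messages: List[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


def run_and_collect_warnings(
    logger_name: str, func: Callable[[], T]
) -> Tuple[T, List[str]]:
    """Run func() and return its result plus warnings logged on logger_name."""
    logger = logging.getLogger(logger_name)
    handler = _ListHandler()
    logger.addHandler(handler)
    try:
        result = func()
    finally:
        # Always detach the handler, even if func() raises.
        logger.removeHandler(handler)
    return result, handler.messages


# Hypothetical usage with pypdf ("sample.pdf" is a placeholder file name):
#
#   from pypdf import PdfReader
#   text, warnings = run_and_collect_warnings(
#       "pypdf._cmap",
#       lambda: PdfReader("sample.pdf").pages[0].extract_text(),
#   )
#   assert not warnings
```

An empty warning list only shows that the encoding name was recognized. Whether the decoded text is correct still has to be checked against known content.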


actuary-chen · 2024-06-23T00:01:30Z

It looks good after I retrieved some texts from the database.


@pubpub-zz

## What's new

### New Features (ENH)
- Accept ETen-B5 and UniCNS-UTF16 encodings (#2721) by @pubpub-zz
- Add decode_as_image() to ContentStreams (#2615) by @pubpub-zz
- context manager for PdfReader (#2666) by @tibor-reiss
- Add capability to set font and size in fields (#2636) by @pubpub-zz
- Allow to pass input file without named argument (#2576) by @pubpub-zz

### Bug Fixes (BUG)
- Fix deprecation for Ressources when using old constants (#2705) by @stefan6419846
- Fix images issue 4 bits encoding and LUT starting with UTF16_BOM (#2675) by @pubpub-zz
- Reading large compressed images takes huge time to process (#2644) by @snanda85
- Highlighted Text Cannot Be Printed (#2604) by @Nifury
- Fix UnboundLocalError on malformed pdf (#2619) by @farjasju

### Documentation (DOC)
- Various improvements on docstrings and examples by @j-t-1

### Robustness (ROB)
- Cope with missing Standard 14 fonts in fields (#2677) by @pubpub-zz
- Improve inline image extraction (#2622) by @pubpub-zz
- Cope with loops in Fields tree (#2656) by @pubpub-zz
- Discard /I in choice fields for compatibility with Acrobat (#2614) by @pubpub-zz
- Cope with some issues in pillow (#2595) by @pubpub-zz
- Cope with some image extraction issues (#2591) by @pubpub-zz

### Maintenance (MAINT)
- Deprecate interiour_color with replacement interior_color (#2706) by @j-t-1
- Add deprecate_with_replacement to PdfWriter.find_bookmark (#2674) by @j-t-1

### Code Style (STY)
- Change Link to be a non-markup annotation (#2714) by @j-t-1

[Full Changelog](4.2.0...4.3.0)

ENH: accepts ETen-B5 and UniCNS-UTF16 encodings

54fbcd7

closes py-pdf#2356

from comments

fdbf37c

pubpub-zz requested a review from stefan6419846 June 23, 2024 07:47

stefan6419846 approved these changes Jun 23, 2024

stefan6419846 merged commit 81f35f9 into py-pdf:main Jun 23, 2024
17 checks passed

pubpub-zz deleted the iss2356 branch June 23, 2024 09:27
