Conversation
id(obj) can and will hit the same value for different objects, if they do not exist at the same time. In CPython, its just a memory address. When that collission occurs, PdfWriter when asked to insert a page from one PdfReader object may silently confuse it with one from a since-deleted PdfReader object, thus corrupting output in not-easily-noticed ways. Replacing dict[id(obj)] with dict[weakref(obj)] should suffice, as we do not replace entries - we only ever insert or delete.
|
@pubpub-zz Thanks for pointing me there. I only now noticed the almost identical PR #1995 .. suggested before I even ran into the bug. What was applied instead, #1841 did not resolve the problem because it only touches one out of two (maybe three?) opportunities to get it wrong: PageObject.clone()->PdfObject._reference_clone() and IndirectObject.clone() duplicate that code. I expect that (costly) workaround can be removed if the weakref fix works. |
|
Also. Have you try first to clone the pages before calling merge_page ? |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #2558 +/- ##
==========================================
+ Coverage 94.71% 94.73% +0.01%
==========================================
Files 50 50
Lines 8237 8257 +20
Branches 1646 1651 +5
==========================================
+ Hits 7802 7822 +20
Misses 267 267
Partials 168 168 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
That is how I know both code paths are exposed. Have the page clone() first, then |
|
@biredel also there is a trick in the code : _id_translated values are not only Therefore, looking at your test code I recommend a test calling Can you give feedback ? |
What worked best for me was a low-core-count amd64 virtual machine, on a 3.10 series CPython, high iteration count and reading
I saw that and recommend this is discarded once the reason for keeping that reference is resolved.
Yes; positive; confirming. I did do that. That is how I know both code paths are exposed. You can trigger the bug via pagein.clone() too, once there is at least one id of since-deleted objects in the dictionary keys. It is much easier to trigger the bug using merge_page only[1] so that is what my sample code does, but one such call suffices to potentially corrupt the output. [1] I suspect that is because CPython is much more likely to recycle id() if the new object fits into the memory slot of a no longer referenced object. |
|
What is the state of this PR? |
pypdf uses id(obj) to keep track of objects in
_id_translated. This identifier is not unique for different objects, only for objects existing at the same time. e.g. In CPython id(obj), is just a memory address. This cannot be used to tell whether we already have referenced a page or not. It might point to a page of a since-deleted PdfReader, silently corrupting out output.The bug manifests rarely, and the corrupted output document usually looks OK at first glance (but has missing/duplicated content), the following is my best reproducer so far, adapted from https://stackoverflow.com/questions/78186323/
The attached draft replaces
dict[id(obj)]withdict[weakref(obj)]- For the simple insert/lookup/delete cases, that behaves the same. Yet no-longer-existing objects are cleared automatically, removing the possible collision.There is another instance of id(obj) outside of
__repr__()methods inPageObject.hash_value_data- I have not determined whether that can cause related problems.Essentially resubmitting #1995 as a replacement for #1841 which has memory usage side-effects and does not fully avoid #1788 anyway.