WARNING:root:Found repetitions in sample 0

Question

jacobmarks opened this issue 10 months ago · 18 comments

Tried applying to individual pages of the PDF for EPR paper https://cds.cern.ch/record/405662/files/PhysRev.47.777.pdf.

While the first page works and I get a print out of the text, pages 2-4 don't work. I get errors like this:

504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3483.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
  0%|                                                       | 0/1 [00:00<?, ?it/s]WARNING:root:Found repetitions in sample 0
INFO:root:Processing file pg_0004.pdf with 1 pages
[MISSING_PAGE_EMPTY:1]

Answer 1 · 2023-08-31T05:12:51.000Z

#11 (comment)

I meet the same problem. There is nothing but

[MISSING_PAGE_EMPTY:1]

[MISSING_PAGE_EMPTY:2]

[MISSING_PAGE_EMPTY:3]

in the mmd file.

Answer 2 · 2023-08-31T06:59:09.000Z

Hi @jacobmarks
After some investigation I found two reasons:

The failure detection is sensitive to out-of-domain samples. Sometimes this leads to needlessly removed pages (here page 2, see below).
The model is not able to handle the integrals in your example PDF very well.

I will look into the failure detection heuristic again. Thank you for bringing this to my attention.

Raw output

of lanthanum is 7/2, hence the nuclear magnetic moment as determined by this analysis is 2.5 nuclear magnetons. This is in fair agreement with the value 2.8 nuclear magnetons determined from La III hyperfine structures by the writer and N. S. Grace.9

Footnote 9: M. F. Crawford and N. S. Grace, Phys. Rev. 47, 536 (1935).

Footnote 10: M. F. Crawford and N. S. Grace, Phys. Rev. 47, 536 (1935).

This investigation was carried out under the supervision of Professor G. Breit, and I wish to thank him for the invaluable advice and assistance so freely given. I also take this opportunity to acknowledge the award of a Fellowship by the Royal Society of Canada, and to thank the University of Wisconsin and the Department of Physics for the privilege of working here.

1 Can Quantum-Mechanical Description of Physical Reality Be Considered Complete?

A. Einstein, B. Podolsky and N. Rosen, Institute for Advanced Study, Princeton, New Jersey

(Received March 25, 1935)

In a complete theory there is an element corresponding to each element of reality. A sufficient condition for the reality of a physical quantity is the possibility of predicting it with certainty, without disturbing the system. In quantum mechanics in the case of two physical quantities described by non-commuting operators, the knowledge of one precludes the knowledge of the other. Then either (1) the description of reality given by the wave function in quantum mechanics is not complete or (2) these two quantities cannot have simultaneous reality. Consideration of the problem of making predictions concerning a system on the basis of measurements made on another system that had previously interacted with it leads to the result that if (1) is false then (2) is also false. One is thus led to conclude that the description of reality as given by a wave function is not complete.

Whatever the meaning assigned to the term complete, the following requirement for a complete theory seems to be a necessary one : every element of the physical reality must have a counterpart in the physical theory. We shall call this the condition of completeness. The second question is thus easily answered, as soon as we are able to decide what are the elements of the physical reality.

The elements of the physical reality cannot be determined by a priori philosophical considerations, but must be found by an appeal to results of experiments and measurements. A comprehensive definition of reality is, however, unnecessary for our purpose. We shall be satisfied with the following criterion, which we regard as reasonable. If, without in any way disturbing a system, we can predict with certainty (i.e., with probability equal to unity) the value of a physical quantity, then there exists an element of physical reality corresponding to this physical quantity_. It seems to us that this criterion, while far from exhausting all possible ways of recognizing a physical reality, at least provides us with one such way, whenever the conditions set down in it occur. Regarded not as a necessary, but merely as a sufficient, condition of reality, this criterion is in agreement with classical as well as quantum-mechanical ideas of reality.

To illustrate the ideas involved let us consider the quantum-mechanical description of the behavior of a particle having a single degree of freedom. The fundamental concept of the theory is the concept of state, which is supposed to be completely characterized by the wave function (\psi), which is a function of the variables chosen to describe the particle's behavior. Corresponding to each physically observable quantity (A) there is an operator, which may be designated by the same letter.

If (\psi) is an eigenfunction of the operator (A), that is, if

[\psi^{\prime}!=!A\psi!=!a\psi, \tag{1}]

where (a) is a number, then the physical quantity (A) has with certainty the value (a) whenever the particle is in the state given by (\psi). In accordance with our criterion of reality, for a particle in the state given by (\psi) for which Eq. (1) holds, there is an element of physical reality corresponding to the physical quantity (A). Let, for example,

[\psi!=!e^{(2\pi i/\lambda),p_{0}\pi}, \tag{2}]

where (h) is Planck's constant, (p_{0}) is some constant number, and (x) the independent variable. Since the operator corresponding to the momentum of the particle is

[\dot{p}!=!(h/2\pi i)\partial/\partial x, \tag{3}]

we obtain

[\psi^{\prime}!=!\dot{p}\psi!=!(h/2\pi i)\partial\psi/\partial x!=!\dot{p }\psi. \tag{4}]

Thus, in the state given by Eq. (2), the momentum has certainly the value (\dot{p}_{0}). It thus has meaning to say that the momentum of.the particle in the state given by Eq. (2) is real.

On the other hand if Eq. (1) does not hold, we can no longer speak of the physical quantity (A) having a particular value. This is the case, for example, with the coordinate of the particle. The operator corresponding to it, say (q), is the operator of multiplication by the independent variable. Thus,

[g\psi!=!x\psi!\neq!a\psi. \tag{5}]

In accordance with quantum mechanics we can only say that the relative probability that a measurement of the coordinate will give a result lying between (a) and (b) is

[P(a,b)!=!\int_{a}^{b}!!!!!!!!!!!!!!!!!!!!!!!!!! !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! !!sight this assumption is entirely reasonable, for the information obtainable from a wave function seems to correspond exactly to what can be measured without altering the state of the system. We shall show, however, that this assumption, together with the criterion of reality given above, leads to a contradiction.

For this purpose let us suppose that we have two systems, I and II, which we permit to interact from the time (t!=!0) to (t!=!T), after which time we suppose that there is no longer any interaction between the two parts. We suppose further that the states of the two systems before (t!=!0) were known. We can then calculate with the help of Schrodinger's equation the state of the combined system I+II at any subsequent time; in particular, for any (t!>!T). Let us designate the corresponding wave function by (\Psi). We cannot, however, calculate the state in which either one of the two systems is left after the interaction. This, according to quantum mechanics, can be done only with the help of further measurements, by a process known as the reduction of the wave packet. Let us consider the essentials of this process.

Let (a_{1}), (a_{2}), (a_{3}), (\cdots) be the eigenvalues of some physical quantity (A) pertaining to system I and (u_{1}(x_{1})), (u_{2}(x_{1})), (u_{3}(x_{1})), (\cdots) the corresponding eigenfunctions, where (x_{1}) stands for the variables used to describe the first system. Then (\Psi), considered as a function of (x_{1}), can be expressed as

[\Psi(x_{1},,x_{2})!=!\sum_{n=1}^{\infty},\psi_{n}(x_{2})u_{n}(x_{1}), \tag{7}]

where (x_{2}) stands for the variables used to describe the second system. Here (\psi_{n}(x_{2})) are to be regarded merely as the coefficients of the expansion of (\Psi) into a series of orthogonal functions (u_{n}(x_{1})). Suppose now that the quantity (A) is measured and it is found that it has the value (a_{k}). It is then concluded that after the measurement the first system is left in the state given by the wave function (u_{k}(x_{1})), and that the second system is left in the state given by the wave function (\psi_{k}(x_{2})). This is the process of reduction of the wave packet; the wave packet given by the infinite series (7) is reduced to a single term (\psi_{k}(x_{2})u_{k}(x_{1})).

The set of functions (u_{n}(x_{1})) is determined by the choice of the physical quantity (A). If, instead of this, we had chosen another quantity, say (B), having the eigenvalues (b_{1}), (b_{2}), (b_{3}), (\cdots) and eigenfunctions (v_{1}(x_{1})), (v_{2}(x_{1})), (v_{3}(x_{1})), (\cdots) we should have obtained, instead of Eq. (7), the expansion

[\Psi(x_{1},,x_{2})!=!\sum_{s=1}^{\infty},\varphi_{s}(x_{2})v_{s}(x_{1}), \tag{8}]

where (\varphi_{s})'s are the new coefficients. If now the quantity (B) is measured and is found to have the value (b_{r}), we conclude that after the measurement the first system is left in the state given by (v_{r}(x_{1})) and the second system is left in the state given by (\varphi_{r}(x_{2})).

We see therefore that, as a consequence of two different measurements performed upon the first system, the second system may be left in states with two different wave functions. On the other hand, since at the time of measurement the two systems no longer interact, no real change can take place in the second system in consequence of anything that may be done to the first system. This is, of course, merely a statement of what is meant by the absence of an interaction between the two systems. Thus, it is possible to assign two different wave functions (in our example (\psi_{k}) and (\varphi_{r})) to the same reality (the second system after the interaction with the first).

Now, it may happen that the two wave functions, (\psi_{k}) and (\varphi_{r}), are eigenfunctions of two non-commuting operators corresponding to some physical quantities (P) and (Q), respectively. That this may actually be the case can best be shown by an example. Let us suppose that the two systems are two particles, and that

[\Psi(x_{1},,x_{2})!=!\int_{-\infty}^{\infty}!!e^{(2\pi i/k),(x_{1}-x_{2} +x_{0}),p}d\dot{p}, \tag{9}]

where (x_{0}) is some constant. Let (A) be the momentum of the first particle; then, as we have seen in Eq. (4), its eigenfunctions will be

[u_{p}(x_{1})!=!e^{(2\pi i/k),p\pi_{1}} \tag{10}]

corresponding to the eigenvalue (\dot{p}). Since we have here the case of a continuous spectrum, Eq. (7) will now be written [\Psi(x_{1},,x_{2})=\int_{-\infty}^{\infty}!!!!!!!!!!!!!!!!!!! !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! !!

Answer 3 · 2023-08-31T18:23:40.000Z

Thanks for looking into this @lukas-blecher ! Really appreciate your prompt response

Answer 4 · 2023-09-09T15:38:05.000Z

I am also getting a lot of '[MISSING_PAGE_FAIL:{n}]' markers in the output file and 'WARNING:root:Skipping page {n} due to repetitions.' in the terminal.
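
For anyone who wants to check quickly which pages were dropped, here is a minimal sketch that scans the text of an .mmd file for the two marker forms quoted in this thread. The function name `find_missing_pages` and the sample string are made up for illustration; only the marker formats come from the outputs above.

```python
import re

# Marker formats copied from the outputs quoted in this thread:
# [MISSING_PAGE_EMPTY:n] and [MISSING_PAGE_FAIL:n]
MARKER_RE = re.compile(r"\[MISSING_PAGE_(EMPTY|FAIL):(\d+)\]")


def find_missing_pages(mmd_text):
    """Return a sorted list of (page_number, kind) for every marker found."""
    found = {(int(m.group(2)), m.group(1)) for m in MARKER_RE.finditer(mmd_text)}
    return sorted(found)


# Hypothetical sample standing in for the contents of an .mmd file.
sample = "[MISSING_PAGE_EMPTY:1]\n\nsome text\n\n[MISSING_PAGE_FAIL:3]\n"
print(find_missing_pages(sample))
```

To use it on a real file, read the .mmd file into a string first and pass that string in.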

Answer 5 · 2023-09-10T18:04:35.000Z

I'm experiencing a similar issue of missing pages during PDF to Markdown conversion, but with some nuances.

System Information

Nougat Version: 0.1.4
OS: Windows 11
GPU: NVIDIA 1050 mobile with 4GB VRAM
RAM: 16GB
Pytorch Version: 2.0.1+cu118
NVIDIA-SMI: Driver version 537.13
CUDA Version: 12.2

Issue Details

Paper Tested (original, complete pdf): Springer Article
Command Used: nougat \path\to\file.pdf -o \path\to\folder\ --markdown
Expected Behavior: Complete conversion of all pages.
Actual Behavior: Missed pages: 6, 8-10, 21, 23, 26-33, 35-36, 39, 46-49.
I then created a new PDF paper with only the missing pages and tested that. None of the new pdf pages were processed (it missed all of them)
Output (for the pdf with only missing pages):
C:\Users\Me>nougat C:\Users\Me\Downloads\papers\missed.pdf -o C:\Users\Me\Downloads\papers --markdown
C:\Users\Me\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ..\aten\src\ATen\native\TensorShape.cpp:3484.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
0%| | 0/21 [00:00<?, ?it/s]WARNING:root:Found repetitions in sample 0
INFO:root:Processing file C:\Users\Me\Downloads\papers\missed.pdf with 21 pages
5%|███▉ | 1/21 [00:08<02:54, 8.70s/it]WARNING:root:Found repetitions in sample 0
WARNING:root:Skipping page 2 due to repetitions.
10%|███████▉ | 2/21 [00:28<04:45, 15.05s/it]WARNING:root:Found repetitions in sample 0
14%|███████████▊ | 3/21 [00:35<03:24, 11.36s/it]WARNING:root:Found repetitions in sample 0
19%|███████████████▊ | 4/21 [00:41<02:36, 9.20s/it]W...

Additional Notes

This issue is consistently reproducible.
Running the same paper through the online HuggingFace demo resulted in fewer missing pages.

Let me know if you need any more info.
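
A small sketch for turning the "Missed pages" string above into an explicit page list, for example before extracting those pages into a separate PDF. The helper name `parse_page_ranges` is an assumption for illustration, and the extraction step itself would need a PDF library and is not shown.

```python
def parse_page_ranges(spec):
    """Expand a string like '6, 8-10, 21' into [6, 8, 9, 10, 21]."""
    pages = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # Inclusive range such as '8-10'.
            start, end = (int(x) for x in part.split("-", 1))
            pages.extend(range(start, end + 1))
        else:
            pages.append(int(part))
    return pages


# The missed pages reported above.
missed = parse_page_ranges("6, 8-10, 21, 23, 26-33, 35-36, 39, 46-49")
print(len(missed))
```

The list has 21 entries, which matches the "with 21 pages" line in the log for the PDF built from the missed pages.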

Answer 6 · 2023-09-11T14:03:56.000Z

I actually noticed this as well, that the quality of the output generated by Nougat was better in the Huggingface demo than on my own computer. Maybe this is a false positive though?

Remember that Nougat was trained on Arxiv scientific papers in the STEM field, so if you feed it a magazine article in PDF format that only includes text and some graphics, don't be surprised if it fails here and there. Out-of-domain content seems to not be super robust yet. If your doc includes content that is not similar (enough) to the training data it will not work correctly, just like a dog you've taught to sit will not necessarily stay.

Answer 7 · 2023-09-11T14:59:39.000Z

Did you see my doc? It's pretty academic and Arxiv-like. I wonder if it's the architecture of my old hardware that's the issue.

Answer 8 · 2023-09-11T15:04:57.000Z

I did see your doc. Looks like it should be processed perfectly fine? Did the shape of the LaTeX-rendered integral sign change drastically at one point? It's unlikely to be a hardware issue IMO: the processes being run are the same, so the only difference with better hardware (video card) would be more VRAM and therefore lower inference time, not a change in output quality.

Have you tried with the Huggingface demo? How did the output of the Huggingface demo compare with the results you achieved?

Answer 9 · 2023-09-15T10:17:18.000Z

Hi, thanks for the detailed report. However, I was unable to reproduce it. My best guess is that it's the GPU. Can you try to convert some of the failed pages with CPU only (set the batch size to 0 with -b 0)?
I have noticed some very slight differences when running on full precision, but nothing as drastic as what you are describing.

Edit: Was able to reproduce on CPU. Will investigate now, thanks!

Answer 10 · 2023-09-15T11:35:22.000Z

I was able to confirm that this is again a case of a false positive failure detection. I've added the --no-skipping flag. Please try and rerun your conversion with this flag on, thank you
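
A minimal sketch of the rerun, for anyone scripting it. It only builds the command line from flags that appear in this thread (`-o`, `--markdown`, `--no-skipping`, and `-b 0` for the CPU-only test suggested earlier). The helper name `build_nougat_command` and the paths are placeholders, not part of Nougat.

```python
import shlex


def build_nougat_command(pdf_path, out_dir, no_skipping=True, cpu_only=False):
    """Build the argument list for a nougat run using flags quoted in this thread."""
    cmd = ["nougat", pdf_path, "-o", out_dir, "--markdown"]
    if no_skipping:
        # Flag added by the maintainer to turn off the failure detection.
        cmd.append("--no-skipping")
    if cpu_only:
        # Batch size 0, suggested earlier in the thread for a CPU-only run.
        cmd += ["-b", "0"]
    return cmd


# Placeholder paths; substitute your own PDF and output folder.
cmd = build_nougat_command("missed.pdf", "out")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once Nougat is installed.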

Answer 11 · 2023-09-15T11:43:37.000Z

@lukas-blecher what specifically does the --no-skipping flag do (other than not skip pages)? I know your reply was to @sm18lr88, but I'm interested too, since this sort of issue happens to me quite a bit as well^^

Answer 12 · 2023-09-15T11:49:22.000Z

In short, it won't apply the failure detection heuristic described in the paper. I still haven't fully grasped the problem at hand, but for some reason PyTorch gives different values depending on the device you're using to compute.
Since I chose the threshold etc. on the specific GPU type I used to test the model, I never noticed this issue myself.

But what this also means is that true positives won't be caught either. So you might get a lot of repetitions for out-of-domain PDFs (plus the computation time will be longer because we aren't stopping in the middle of the generation anymore).
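
To illustrate how slightly different values can flip a fixed threshold, here is a generic sketch. It is not Nougat's heuristic and not a diagnosis of this bug: it only shows that floating-point addition depends on the order of operations, which is one general way two devices can produce slightly different numbers. `THRESHOLD` and `is_flagged` are hypothetical names.

```python
# The same three numbers summed in two different orders.
values = [0.1, 0.2, 0.3]
left_to_right = (values[0] + values[1]) + values[2]
right_to_left = values[0] + (values[1] + values[2])

# Hypothetical cutoff, standing in for a threshold tuned on one device.
THRESHOLD = 0.6


def is_flagged(score):
    """Flag a page as failed when its score exceeds the cutoff."""
    return score > THRESHOLD


print(left_to_right == right_to_left)
print(is_flagged(left_to_right), is_flagged(right_to_left))
```

The two sums differ only in the last bits, but one lands just above the cutoff and the other does not, so the same input would be skipped in one case and kept in the other.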

Answer 13 · 2023-09-15T11:54:52.000Z

Is this also why the Hugging Face demo seems to give me the best results? Where does that run? I'm pretty convinced that it does not run locally.

So hardware seems to have an effect on the quality of the output, correct?

Also, repetitions for out of domain documents, this is already a known issue so with this new flag, will this get worse, stay the same or improve?

Answer 14 · 2023-09-15T12:17:26.000Z

The hardware doesn't really change the output but it does change the repetition / failure detection. So you should get the same results as the HF demo.

What will change for out-of-domain documents is that the repetition will not be detected during the generation. There are still some rule-based postprocessing functions that will detect some of them though.
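
As a rough idea of what a rule-based check can look like, here is a minimal sketch that reports when a text ends in a short pattern repeated many times, like the runs of "!" in the raw output earlier in this thread. This is not Nougat's actual postprocessing; the function name `trailing_repetition` and the `max_period` and `min_repeats` defaults are assumptions chosen for illustration.

```python
def trailing_repetition(text, max_period=20, min_repeats=10):
    """Return (pattern, count) if text ends in a short repeated pattern, else None."""
    stripped = text.rstrip()
    for period in range(1, max_period + 1):
        # Not enough characters left to hold min_repeats copies.
        if len(stripped) < period * min_repeats:
            break
        pattern = stripped[-period:]
        count = 0
        pos = len(stripped)
        # Walk backwards while consecutive chunks equal the pattern.
        while pos >= period and stripped[pos - period:pos] == pattern:
            count += 1
            pos -= period
        if count >= min_repeats:
            return pattern, count
    return None


# Shortened stand-in for the degenerate equation in the raw output.
sample = "P(a,b)=\\int_{a}^{b}" + "!" * 40
print(trailing_repetition(sample))
print(trailing_repetition("A normal sentence."))
```

A check like this only catches repetitions at the end of a page, so it would miss loops that the model eventually breaks out of.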

Answer 15 · 2023-09-18T17:54:55.000Z

As I still continue to experience missing pages (which I'm sure will improve over time), I find myself turning to Claude (v2) and asking it to convert pdfs into the format I desire. It's very good at it.

Answer 16 · 2023-12-24T18:07:26.000Z

FYI, I believe Claude 2 preprocesses PDF files using Mathpix (at least on the official website).

Answer 17 · 2023-12-26T19:53:45.000Z

Well, for whatever reason it was doing an excellent job. IDK if it was my GPU/CUDA compatibility or what.

Answer 18 · 2024-02-19T14:26:29.000Z

I have the same problem.