How do I extract `PageLabel` from a PDF?
Atreyagaurav opened this issue · 11 comments
I see that there is a datatype [PageLabel](https://docs.rs/pdf/latest/pdf/object/struct.PageLabel.html#) in the library, but I can't figure out any way to read it from a PDF. I know the PDF has page labels, because I can see them if I convert the PDF into a text-editor-friendly format and open it. Beamer-created PDFs have them too.
Edit: A more general question as well: how do I extract data from a Stream?
The case for PageLabels is something like this:
%% Object stream: object 17, index 15; original object ID: 2148
<<
/Metadata 1548 0 R
/Names 16 0 R
/OpenAction 386 0 R
/Outlines 1537 0 R
/PageLabels <<
/Nums [
0
<<
/P <feff0031>
>>
1
<<
/P <feff0032>
>>
3
<<
/P <feff0033>
>>
4
<<
/P <feff0034>
>>
6
<<
/P <feff0035>
>>
7
<<
/P <feff0036>
>>
8
<<
/P <feff0037>
>>
11
<<
/P <feff0038>
>>
13
<<
/P <feff0039>
>>
16
<<
/P <feff00310030>
>>
17
<<
/P <feff00310031>
>>
18
<<
/P <feff00310032>
>>
21
<<
/P <feff00310033>
>>
23
<<
/P <feff00310034>
>>
24
<<
/P <feff00310035>
>>
25
<<
/P <feff00310036>
>>
26
<<
/P <feff00310037>
>>
27
<<
/P <feff00310038>
>>
28
<<
/P <feff00310039>
>>
29
<<
/P <feff00320030>
>>
30
<<
/P <feff00320031>
>>
31
<<
/P <feff00320032>
>>
32
<<
/P <feff00320033>
>>
33
<<
/P <feff00320034>
>>
34
<<
/P <feff0031>
>>
36
<<
/P <feff0032>
>>
37
<<
/P <feff0033>
>>
38
<<
/P <feff0034>
>>
39
<<
/P <feff0035>
>>
40
<<
/P <feff0036>
>>
41
<<
/P <feff0037>
>>
42
<<
/P <feff0038>
>>
]
>>
/PageMode /UseOutlines
/Pages 1536 0 R
/Type /Catalog
>>
endstream
endobj
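For anyone reading the dump above: each /P value is a PDF hex string, and the leading feff is the UTF-16BE byte-order mark, so <feff00310030> is just the text "10". Below is a minimal std-only sketch of that decoding. The function name decode_pdf_hex_string is made up for illustration and is not part of the pdf crate; it only handles the BOM-prefixed UTF-16BE case seen in this file.

```rust
/// Decode a PDF hex string such as "<feff00310030>" into text.
/// Hypothetical helper: it only handles strings that start with the
/// UTF-16BE byte-order mark (FE FF), as in the dump above.
fn decode_pdf_hex_string(raw: &str) -> Option<String> {
    // Strip the angle brackets and drop any whitespace between digits.
    let mut hex: String = raw
        .trim()
        .strip_prefix('<')?
        .strip_suffix('>')?
        .chars()
        .filter(|c| !c.is_whitespace())
        .collect();

    // Reject anything that is not a hex digit.
    if !hex.chars().all(|c| c.is_ascii_hexdigit()) {
        return None;
    }
    // PDF treats a missing final digit as 0.
    if hex.len() % 2 == 1 {
        hex.push('0');
    }

    // Turn each pair of hex digits into one byte.
    let bytes: Vec<u8> = (0..hex.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&hex[i..i + 2], 16).ok())
        .collect::<Option<Vec<u8>>>()?;

    // Require the UTF-16BE byte-order mark, then decode the rest.
    if bytes.len() < 2 || bytes[0] != 0xFE || bytes[1] != 0xFF {
        return None;
    }
    let body = &bytes[2..];
    if body.len() % 2 != 0 {
        return None;
    }
    let units: Vec<u16> = body
        .chunks_exact(2)
        .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
        .collect();
    String::from_utf16(&units).ok()
}

fn main() {
    for raw in ["<feff0031>", "<feff00310030>", "<feff00320034>"].iter() {
        println!("{} -> {:?}", raw, decode_pdf_hex_string(raw));
    }
}
```

Strings without the byte-order mark (PDFDocEncoding) are deliberately left out here and come back as None.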
Looks like it is the catalog.
https://docs.rs/pdf/latest/pdf/file/struct.File.html#method.get_root
And pagelabels needs to be added there.
I see the catalog, but everything there is a Ref. I can get some Stream from the root, but I want to know how to get the information out of it programmatically, because it just says Ref for everything.
Ref can be dereferenced with the resolver.
file.resolver().get(ref)
I added page_labels to the Catalog.
> Ref can be dereferenced with the resolver.
Yes, but I get more Ref (or PlainRef) values. How do I know what kind of data each one holds, and how do I convert it into usable data? Debug printing just gives the output below. Suppose I want to search for PageLabels manually by looking at object streams: all I get are these. Even if I get inner from there, I have no idea what data type it's supposed to be.
RcRef { inner: PlainRef { id: 5807, gen: 0 }, data: () }
> I added page_labels to the Catalog.
I don't see any commits. Where can I try that?
Some examples would be useful there, such as getting custom tags from a PDF.
Also, for now I went with poppler-rs for my program, as it seems to give the page labels, although I had to get them for each page instead of from the document itself.
This is the sample code I tried:
use std::path::PathBuf;
use pdf;
use pdf::object::Resolve;

fn main() {
    let path = PathBuf::from("/path/to/slides.pdf");
    let file = pdf::file::FileOptions::cached().open(path).unwrap();
    println!(
        "{:?}",
        file.resolver()
            .get(file.get_root().metadata.unwrap())
            .unwrap()
    );
}
Oops. I didn't check the terminal again after hitting return.
If you are working with PlainRefs, you just have to fetch them and see what it actually is.
Resolver::resolve, I think, would be the function to call.
To read the Metadata field, again use resolver::get and then call data() on the stream you get.
Well, the code I added is incorrect.
Yeah, I saw it's added but it doesn't extract the info.
println!("{:#?}", file.get_root().page_labels);
Gives me this:
Some(
    NameTree {
        limits: None,
        node: Intermediate(
            [],
        ),
    },
)
It is working as of 5c19ff6.
See the end of examples/names.rs for an example.
Thank you, it works. It looks like Beamer page numbers are saved as the prefix, so I did something like this:
if let Some(ref labels) = catalog.page_labels {
    labels.walk(&resolver, &mut |page: i32, label| {
        println!(
            "{page} -> {:?}",
            label.prefix.as_ref().unwrap().to_string_lossy()
        );
    });
}
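One thing to keep in mind with the output of that walk: the keys in /Nums are zero-based page indices where a label range starts, and a range runs until the next key. Since these entries have only /P and no /S numbering style, the label is just the prefix, so every page in a range shares it (which is how Beamer overlay frames end up with the same number). A rough std-only sketch of that lookup follows; label_for_page and sample_labels are made-up names, and the map holds only the first few entries from the dump in the original post.

```rust
use std::collections::BTreeMap;

/// Return the label prefix that applies to a zero-based page index.
/// Hypothetical helper: assumes prefix-only labels (no /S style), so the
/// label of a page is the prefix of the nearest range start at or before it.
fn label_for_page(nums: &BTreeMap<u32, String>, page_index: u32) -> Option<&str> {
    nums.range(..=page_index)
        .next_back()
        .map(|(_, prefix)| prefix.as_str())
}

/// The first five /Nums entries from the catalog dump, already decoded.
fn sample_labels() -> BTreeMap<u32, String> {
    [(0u32, "1"), (1, "2"), (3, "3"), (4, "4"), (6, "5")]
        .iter()
        .map(|&(start, prefix)| (start, prefix.to_string()))
        .collect()
}

fn main() {
    let labels = sample_labels();
    for page in 0..7u32 {
        println!("page index {} -> {:?}", page, label_for_page(&labels, page));
    }
}
```

With the sample data, page indices 1 and 2 both map to "2", and 4 and 5 both map to "4".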