yob/pdf-reader

Whitespaces removed with certain fonts


Given the PDF file sample.pdf, which has a few lines of text set in different fonts, when I try to extract the text on the first page with

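require 'pdf-reader'

# Open in binary mode so the PDF bytes are read unmodified on every platform
file = File.open('./tmp/sample.pdf', 'rb')
reader = PDF::Reader.new(file)
puts reader.pages.first.text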

I get

Spaces with font Courier bold
Spaces with font Courier normal
Spaces with font Times-Roman bold
Spaces with font Times-Roman normal
Spaces with font Helvetica bold
Spaces with font Helvetica normal

SpaceswithfontLatobold
SpaceswithfontLatonormal

Notice that the whitespace has been removed from the text set in Lato.

I was expecting the whitespace to be preserved:

Spaces with font Lato bold
Spaces with font Lato normal

Is this because Lato's space glyph is not wide enough to meet the criterion in PDF::Reader::TextRun#+?

if (other.x - endx) <( font_size * 0.2)
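
To make the suspicion concrete, here is a minimal sketch of how a threshold like the one quoted above would drop spaces. It does not use the library: `Run` and `join_words` are hypothetical stand-ins, and the widths and gaps are made-up numbers, not measurements of Lato. It only assumes that adjacent runs are merged without a space when the gap between them is less than 20% of the font size.

```ruby
# Hypothetical stand-in for a positioned run of text; not the library's class.
Run = Struct.new(:x, :width, :font_size, :text) do
  def endx
    x + width
  end

  # Same comparison as the line quoted above: a gap narrower than
  # 20% of the font size is treated as "no space between the runs".
  def +(other)
    if (other.x - endx) < (font_size * 0.2)
      Run.new(x, other.endx - x, font_size, text + other.text)
    else
      Run.new(x, other.endx - x, font_size, "#{text} #{other.text}")
    end
  end
end

# Lay the words out left to right with a fixed gap, then merge them.
# word_width and gap are illustrative values in the same units as font_size.
def join_words(words, font_size:, word_width:, gap:)
  x = 0.0
  runs = words.map do |word|
    run = Run.new(x, word_width, font_size, word)
    x += word_width + gap
    run
  end
  runs.reduce(:+).text
end

words = %w[Spaces with font Lato]

# At 12pt the threshold is 2.4, so a 3.0 gap keeps the spaces...
puts join_words(words, font_size: 12.0, word_width: 30.0, gap: 3.0)
# => Spaces with font Lato

# ...while a 2.0 gap falls below it and the words run together.
puts join_words(words, font_size: 12.0, word_width: 30.0, gap: 2.0)
# => SpaceswithfontLato
```

If Lato's space advance is narrower than 0.2 times the font size, this would reproduce the output I am seeing, but I have not confirmed the glyph width.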