bubibubi/ExtractTablesFromPdf

Coordinate Y is not correct

CJ1789 opened this issue · 17 comments

I want to extract only the line positions, so that I can then extract text from the PDF in the desired area. But the library takes the top-left corner as (0,0), while iTextSharp takes the bottom-left corner as (0,0) when I try to extract text, so I am not able to get the correct text.

Please help me, I am stuck.

The code does not work for all files.
You could start by comparing the results of other software (online) to see if it can extract the data.
Then you can tailor the code to your pdf.
Usually there is a lot of work to do to parse a pdf exactly as you need (so it is worth it if you need to extract data from a lot of pdfs).

Can you please tell me why you transform point x and y like this:

    public double TransformX(double x, double y)
    {
        return a * x + c * y + e;

    }

    public double TransformY(double x, double y)
    {
        return b * x + d * y + f;

    }

and why, in the case of 0 rotation, the point's y is changed to Y = 800 - Y:

    public Point Rotate(int pageRotation)
    {
        switch (pageRotation)
        {
            case 0:
                return new Point(X, 800 - Y);
            case 90:
                return new Point(Y, X);
            case 180:
                return new Point(X, Y);
            default:
                return this;
        }
    }

The first transformation is from the pdf guide.
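For reference, the PDF specification describes a transformation matrix by six numbers [a b c d e f] and applies it to a point written as a row vector. Written out, this gives the same two expressions as TransformX and TransformY above:

```latex
\begin{bmatrix} x' & y' & 1 \end{bmatrix}
=
\begin{bmatrix} x & y & 1 \end{bmatrix}
\begin{bmatrix} a & b & 0 \\ c & d & 0 \\ e & f & 1 \end{bmatrix}
\quad\Longrightarrow\quad
x' = a x + c y + e, \qquad y' = b x + d y + f
```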

About the second question: 0 as page rotation means no rotation. I prefer to have the origin in the upper-left corner, while the pdf origin is the lower-left corner. 800 - y flips the point vertically (800 works for me, you can use a different literal). Otherwise you have to do this in the 180 rotation.
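A minimal sketch of the vertical flip described here, assuming the page height is used in place of the literal 800. The names `CoordinateFlip`, `FlipY` and `pageHeight` are made up for the example and are not part of the library. The value 842 is only an assumed A4 height in points.

```csharp
using System;

public static class CoordinateFlip
{
    // Converts a y coordinate between a top-left origin and a bottom-left
    // origin. The mapping is its own inverse, so the same call converts in
    // both directions. pageHeight must use the same units as y.
    public static double FlipY(double y, double pageHeight)
    {
        return pageHeight - y;
    }

    public static void Main()
    {
        double pageHeight = 842; // assumption: A4 portrait height in points
        double yTopLeft = 100;   // measured downward from the top edge

        double yBottomLeft = FlipY(yTopLeft, pageHeight);
        Console.WriteLine(yBottomLeft);                    // 742
        Console.WriteLine(FlipY(yBottomLeft, pageHeight)); // 100
    }
}
```

When the text is read with iTextSharp, the height could probably be taken from `PdfReader.GetPageSize(pageNumber).Height` instead of a hard-coded constant, so that both libraries agree on the same flip.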

How to do 180 rotation?

You could swap the two rotations:
0 => y
180 => 800 - y

But then I think you'll find several things not working (the other functions expect the origin to be in the upper-left corner).
Anyway, if you see that for some reason everything is already flipped, you could try it.
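As a rough illustration of the swap suggested above, the sketch below keeps y unchanged for rotation 0 and applies the flip for rotation 180. The `Point` type here is a hypothetical stand-in reduced to what the example needs, and `RotateSwapped` and `flipConstant` are names invented for the sketch, not library members.

```csharp
using System;

// Hypothetical stand-in for the library's Point type.
public readonly struct Point
{
    public double X { get; }
    public double Y { get; }

    public Point(double x, double y) { X = x; Y = y; }

    // Variant with the 0 and 180 cases swapped: 0 leaves y as it is,
    // 180 applies the vertical flip. flipConstant plays the role of 800.
    public Point RotateSwapped(int pageRotation, double flipConstant)
    {
        switch (pageRotation)
        {
            case 0:
                return new Point(X, Y);
            case 90:
                return new Point(Y, X);
            case 180:
                return new Point(X, flipConstant - Y);
            default:
                return this;
        }
    }
}

public static class Demo
{
    public static void Main()
    {
        var p = new Point(50, 100);
        Point r0 = p.RotateSwapped(0, 800);
        Point r180 = p.RotateSwapped(180, 800);
        Console.WriteLine($"{r0.X}, {r0.Y}");     // 50, 100
        Console.WriteLine($"{r180.X}, {r180.Y}"); // 50, 700
    }
}
```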

I am not getting the right answer in either case. Please help me with what to do. Is 800 - y the general way to flip a pdf, or did you get this value from your pdf?

c - y
c is from my pdf.
The condition to determine c is c - y > 0. It is also used for rendering (debug), so it can't be something like 1000000 - y.

What is c? I mean, how can I identify it for my pdf?

Can I send you my pdf?

c means a literal, a constant.
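One possible way to pick such a constant for a given pdf, sketched under the assumption that the y coordinates of the extracted points are already available: take the largest y and add a small positive margin. That keeps c - y > 0 for every point without making c so large that the debug rendering becomes unusable. `FlipConstant`, `Choose` and `margin` are hypothetical names for this example only.

```csharp
using System;
using System.Linq;

public static class FlipConstant
{
    // Returns a constant c with c - y > 0 for every y in ys.
    // margin must be positive; it is the gap left above the largest y.
    public static double Choose(double[] ys, double margin)
    {
        if (ys == null || ys.Length == 0)
            throw new ArgumentException("At least one y value is required.", nameof(ys));
        if (margin <= 0)
            throw new ArgumentOutOfRangeException(nameof(margin), "Margin must be positive.");
        return ys.Max() + margin;
    }

    public static void Main()
    {
        double[] ys = { 120, 640, 910 }; // made-up sample coordinates
        double c = Choose(ys, 10);
        Console.WriteLine(c); // 920

        // Every flipped value stays positive.
        foreach (double y in ys)
            Console.WriteLine(c - y); // 800, 280, 10
    }
}
```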

Yes, send me your pdf. I can have a look...

Send me your email address, please.

७_१२_6.pdf
७_१२_7.pdf
७_१२_8.pdf
७_१२_9.pdf
७_१२_10.pdf
७_१२_11.pdf
७_१२_data on 2 pages.pdf

I want to determine the vertical line position of line 3, i.e. line[2], and line 6, i.e. line[5].

Ok, I had a look at the first pdf.
You can do the same thing by updating the source code and using the BuildTablesFromPdf.Renderer app.
The table on the first page is not really a table because it is not well aligned, so the library detects more cells than there are.
Also, there is an issue with text positioning. I'm probably ignoring a pdf statement that locates the text in the right place.

On the second page there is a different issue: the coordinates are wrong. I'm probably ignoring a pdf statement that I should consider.
After solving this issue you will also have the same text positioning issue as on the first page.

I will probably try to fix it, but I'm not sure, and I don't know when.
If you fix it and share the code, it will be really appreciated.

ok, thnx

I got correct Y.

Could you send me the code?
THX!!!

Just have to modify it by adding 150.

But now a new issue has come up. I am able to extract text from the pdf, but some characters are not being identified. Can you help me with that?

Example
अधिकार as अ\0धकार
महाराष्ट्र as महारा\0\0
क्षेत्र as \0े\0

Hello,

The code runs perfectly for the first page of the pdf, but what should I do for the second page? I am not getting the correct Y there. Please help me.