How can I get information from checkboxes in tables?
ddotta opened this issue · 3 comments
Prework
- Read and agree to the code of conduct and contributing guidelines.
- If there is already a relevant issue, whether open or closed, comment on the existing thread instead of posting a new issue.
Question
I'm trying to extract data from a PDF document that contains tables with checkboxes (see my reproducible example below).
The extract_tables() function works well and manages to identify the tables in the PDF document, but I only get NA for all the checkboxes.
Is there any way of identifying which boxes are checked?
Many thanks for your help! 🙏
Reproducible example
Here's my PDF: test.pdf
And my code:
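library(tabulapdf)

# Path to the PDF that contains the tables with checkboxes
fichier <- "test.pdf"
# Extract every detected table as a tibble (one list element per table)
tableaux <- extract_tables(fichier, output = "tibble")
bases_de_conjoncture <- tableaux[[1]]
sources <- tableaux[[2]]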
What I get:
# A tibble: 33 × 3
`CERISE (Espace de Production des données)` ...2 ...3
<chr> <chr> <chr>
1 Préciser ci-dessous la liste des sources statistiques (cf. liste sur GEDSI) NA NA
2 Rubrique Source Producteur Chargé d'étude
3 000_Referentiels NA NA
4 0010_Balsa_IAA NA NA
5 0020_Balsa_EA NA NA
6 0030_Balsa_v2_EA NA NA
7 0040_Geo NA NA
8 0050_BDNU NA NA
9 010_Territoires NA NA
10 1010_Enquete_TERUTI NA NA
11 020_Meteorologie NA NA
12 2010_Conj_meteo NA NA
13 030_Structures_exploitations NA NA
14 3010_Enquetes_Structures NA NA
15 3020_Recensements NA NA
16 040_Pratiques_agricoles NA NA
17 4000_Pratiques_Culturales NA NA
18 4010_Pratiques_grandes_cultures NA NA
19 4040_Pratiques_arboriculture NA NA
20 4050_Pratiques_elevage NA NA
21 4060_Conso_energie_EA NA NA
22 4070_Conso_energie_EDT_CUMA NA NA
23 050_Productions_vegetales NA NA
24 5010_Terres_labourables NA NA
25 5030_Conj_Prairies NA NA
26 5040_Conj_viticole NA NA
27 5050_Conj_fruits NA NA
28 5060_Conj_legumes NA NA
29 060_Productions_viandes_oeufs NA NA
30 6010_Enquetes_cheptels NA NA
31 6020_Abattage_gros_animaux NA NA
32 6030_Abattage_volailles_lapins NA NA
33 6035_Abattages NA NA
I managed to do what I wanted with pdftools::pdf_text(), at the cost of some complications.
It would be very useful if this could be implemented directly in extract_tables().
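For anyone trying the same route, here is a minimal sketch of what a pdf_text()-based workaround might look like. It is not the code used in this thread. It assumes the checkboxes come through in the extracted text as Unicode glyphs (U+2610 for unchecked, U+2611 or U+2612 for checked), which depends entirely on how the PDF was produced. If the boxes are drawn as vector graphics or stored as form fields, this will find nothing. The helper name parse_checkbox_lines() and the demo lines are made up for illustration.

```r
# Hypothetical helper (not part of tabulapdf or pdftools).
# Assumption: each checkbox appears in the text as one of these glyphs:
#   U+2610 = unchecked, U+2611 / U+2612 = checked.
parse_checkbox_lines <- function(lines) {
  box_pattern <- "[\u2610\u2611\u2612]"
  # For every line, collect the checkbox glyphs in reading order
  glyphs <- regmatches(lines, gregexpr(box_pattern, lines))
  with_boxes <- which(lengths(glyphs) > 0)

  # No checkbox glyph anywhere: return an empty result with the same columns
  if (length(with_boxes) == 0) {
    return(data.frame(label = character(), box = integer(),
                      checked = logical(), stringsAsFactors = FALSE))
  }

  rows <- lapply(with_boxes, function(i) {
    data.frame(
      # Row label = text before the first checkbox glyph on the line
      label = trimws(sub(paste0(box_pattern, ".*$"), "", lines[i])),
      # Position of the box within the line (1 = first box, 2 = second, ...)
      box = seq_along(glyphs[[i]]),
      checked = glyphs[[i]] %in% c("\u2611", "\u2612"),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Made-up lines, only to show the shape of the result
demo <- c("Header without boxes",
          "row_a  \u2612  \u2610",
          "row_b  \u2610  \u2611")
res <- parse_checkbox_lines(demo)

# With a real file (requires the pdftools package):
# pages <- pdftools::pdf_text("test.pdf")
# lines <- unlist(strsplit(pages, "\n"))
# parse_checkbox_lines(lines)
```

The result has one row per checkbox, so the box column can be mapped back to the table columns by position. Printing a few lines of the pdf_text() output first is a quick way to check whether the glyph assumption holds for a given document.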
Hi @ddotta, thanks for reporting this. How did you manage to do this?
@pachadotdev
Here's a solution. It's not very optimized, but it does what I want:
https://gist.github.com/ddotta/8e828145355bb87e78d83191b747b2e0