ChartReader

A fully automated end-to-end framework for extracting data from bar plots and other figures in scientific research papers. It uses OpenCV for image processing and Amazon Rekognition for text detection in images.

Figure extraction

pdffigures2 is used to extract/download images (charts + tables) from the research papers.

Image set

Bar plots used are here: https://drive.google.com/drive/u/1/folders/154sgx3M49NoKOoOjoppsSuvqd2WzqZqX

Chart classification (Accuracy: 84.08%)

Data preparation

Step 1: The google_images_download Python module is used to download Google Images results for each chart type (area chart, line chart, bar plot, pie chart, Venn diagram, etc.) based on the corresponding keywords.

$ git clone https://github.com/Joeclinton1/google-images-download.git
$ cd google-images-download && python setup.py install

Step 2: The downloaded images are carefully reviewed and the incorrect images are removed.

The training data and model files are available here:
training corpus: https://drive.google.com/drive/u/1/folders/1M8kwdQE7bpjpdT08ldBURFdzLaQR9n5h
model: https://drive.google.com/drive/u/1/folders/1GVW_MtFFYT-Tj44p0_QLKM7hVnn_AcKI

Below is the count of images for each type:

Plot type	Count
BarGraph	528
VennDiagram	364
PieChart	355
ScatterGraph	335
TreeDiagram	297
FlowChart	293
Map	276
ParetoChart	329
BubbleChart	311
LineGraph	300
AreaGraph	299
NetworkDiagram	321
BoxPlot	312

Training phase:

Pretrained VGG-19, ResNet152V2, InceptionV3 and EfficientNetB3 models are trained on these images and then run on the test images to classify them into 13 types, such as bar chart, line graph and pie chart.

Accuracy is calculated using stratified five-fold cross-validation, and the table below gives the accuracy of each model. All four models reach an accuracy of around 84%. Training accuracy and loss curves were captured during the training phase for each cross-validation fold.

Model	Training parameters	Accuracy
VGG-19	47,736,845	84.08% (+/- 0.49%)
ResNet152V2	143,428,621	83.54% (+/- 1.19%)
InceptionV3	91,358,605	84.53% (+/- 1.61%)
EfficientNetB3	107,331,693	84.53% (+/- 0.92%)

Results (predictions on test data)

The following are 100 randomly picked images that were predicted as bar plots. The highlighted images (6 of the 100) are incorrectly classified as bar plots.

Axes Detection (Accuracy: 80.22%) [1006/1254 correct]

First, the image is converted into a black-and-white (binary) image, and the longest run of continuous 1s along each row and each column is obtained.
Next, the maximum of these longest-run values over all columns is picked.
A threshold (= 10) is assumed, and the first column whose longest run of 1s falls in the range [max - threshold, max + threshold] is taken as the y-axis.
A similar approach is followed for the x-axis, except that the last row whose longest run of 1s falls in the range [max - threshold, max + threshold] is picked.
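The procedure above can be sketched in plain Python as follows. This is a minimal illustration, not the project's code: the function names are hypothetical, the image is assumed to be a list of rows of 0/1 values, and 1 is assumed to mark a dark (ink) pixel so that an axis appears as a long run of 1s.

```python
def longest_run(values):
    """Length of the longest run of consecutive 1s in a sequence of 0/1 values."""
    best = current = 0
    for v in values:
        current = current + 1 if v == 1 else 0
        best = max(best, current)
    return best


def detect_axes(binary, threshold=10):
    """Return (y_axis_column, x_axis_row) for a binary image given as rows of 0/1.

    Assumption: 1 marks an ink pixel, so each axis is a long run of 1s.
    """
    rows, cols = len(binary), len(binary[0])

    # Longest run of 1s down every column and along every row.
    col_runs = [longest_run([binary[r][c] for r in range(rows)]) for c in range(cols)]
    row_runs = [longest_run(binary[r]) for r in range(rows)]

    col_max, row_max = max(col_runs), max(row_runs)

    # y-axis: FIRST column whose run lies in [max - threshold, max + threshold].
    y_axis = next(c for c, run in enumerate(col_runs) if abs(run - col_max) <= threshold)
    # x-axis: LAST row whose run lies in [max - threshold, max + threshold].
    x_axis = max(r for r, run in enumerate(row_runs) if abs(run - row_max) <= threshold)
    return y_axis, x_axis
```

Picking the first matching column and the last matching row favors the left-most and bottom-most long lines, which is where the axes usually sit even when tall bars or grid lines produce runs of similar length.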

We experimented with threshold values of 0, 5, 10 and 12 and found that a threshold of 10 gives the best results for axes detection. The table below shows the accuracy of axes detection for each threshold value.

Threshold	0	5	10	12
Accuracy (%)	73.2	78.8	80.22	79.26

Results

Both x and y axes are detected correctly for 1006 images out of 1254 images (test data set). Below are some of the failed cases in axes detection.

Text detection

Amazon Rekognition is used to detect text in the image through its DetectText API. Only text detections with confidence >= 80 are considered.
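A minimal sketch of the confidence filter is shown below. It assumes the boto3 client for Rekognition and the `TextDetections` / `Confidence` fields of the DetectText response; the helper names `detect_text` and `confident_detections` are hypothetical and are not taken from the project.

```python
MIN_CONFIDENCE = 80.0


def detect_text(image_bytes):
    """Call the Rekognition DetectText API (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the filter below works without AWS access

    client = boto3.client("rekognition")
    return client.detect_text(Image={"Bytes": image_bytes})


def confident_detections(response, min_confidence=MIN_CONFIDENCE):
    """Keep only the detections whose confidence meets the cutoff."""
    return [
        d for d in response.get("TextDetections", [])
        if d["Confidence"] >= min_confidence
    ]
```

Keeping the filter separate from the API call makes it possible to test the cutoff on a saved response without contacting AWS.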

Double-pass algorithm for text detection

To improve text detection, a double-pass algorithm is employed.

Detect text using the detect_text AWS Rekognition API, and keep only the text boxes with confidence >= 80.
Fill the polygons corresponding to these text boxes with white.
Run text detection a second time on the new image, and again keep only the detections with confidence >= 80.
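The control flow of the two passes can be sketched as below. Several things here are assumptions made for illustration: the detector is passed in as a callable `detect(image)` returning dictionaries with hypothetical keys `text`, `confidence` and `polygon` (pixel coordinates), the image is a list of rows of grayscale values, the enclosing rectangle of each polygon is whitened rather than the exact polygon, and the results of both passes are simply concatenated.

```python
def whiten_polygons(image, polygons, white=255):
    """Return a copy of the image with each polygon's enclosing rectangle whitened.

    Simplification: the axis-aligned rectangle around the polygon is filled,
    not the exact polygon.
    """
    out = [row[:] for row in image]
    height, width = len(out), len(out[0])
    for polygon in polygons:
        xs = [x for x, _ in polygon]
        ys = [y for _, y in polygon]
        for y in range(max(0, min(ys)), min(height - 1, max(ys)) + 1):
            for x in range(max(0, min(xs)), min(width - 1, max(xs)) + 1):
                out[y][x] = white
    return out


def double_pass(image, detect, min_confidence=80.0):
    """Run detection, mask what was found, then run detection again."""
    first = [d for d in detect(image) if d["confidence"] >= min_confidence]
    masked = whiten_polygons(image, [d["polygon"] for d in first])
    second = [d for d in detect(masked) if d["confidence"] >= min_confidence]
    return first + second  # assumption: both passes contribute to the final result
```

Masking the confident detections removes them from the second pass, so text that was previously crowded out by neighboring text has a better chance of being detected.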

Bounding Box calculation

The bounding box reported for vertical or angled text is problematic. Therefore, the bounding box is calculated from the polygon coordinates (vertices) in the AWS Rekognition output.
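One way to derive an axis-aligned box from the polygon is sketched below. It assumes that the polygon is a list of points with `X` and `Y` keys expressed as fractions of the image width and height, as in the Rekognition geometry output; the function name and the (x, y, width, height) return convention are hypothetical.

```python
def bbox_from_polygon(polygon, image_width, image_height):
    """Axis-aligned (x, y, width, height) box in pixels enclosing a polygon.

    Assumption: each point is {"X": ..., "Y": ...} with values given as
    fractions of the image width and height.
    """
    xs = [p["X"] * image_width for p in polygon]
    ys = [p["Y"] * image_height for p in polygon]
    left, top = min(xs), min(ys)
    return (round(left), round(top), round(max(xs) - left), round(max(ys) - top))
```

Because the box is built from the extreme vertices, it encloses the text regardless of its rotation.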

Label Detection

X-labels:

Filter the text boxes that are below the x-axis and to the right of the y-axis.
Run a sweeping line from the x-axis (detected by the axes detection algorithm) to the bottom of the image, and check where the sweeping line intersects the maximum number of text boxes.
This maximum intersection gives the bounding boxes for all of the x-labels.
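The sweeping-line step can be sketched as follows. The function name is hypothetical, boxes are assumed to be (x, y, width, height) tuples in pixel coordinates with the origin at the top-left corner, and ties are resolved in favor of the line closest to the x-axis.

```python
def sweep_x_labels(boxes, x_axis_row, y_axis_col, image_height):
    """Return the boxes crossed by the horizontal line that hits the most boxes."""
    # Keep only the boxes below the x-axis and to the right of the y-axis.
    candidates = [b for b in boxes if b[1] > x_axis_row and b[0] > y_axis_col]

    best = []
    for line_y in range(x_axis_row + 1, image_height):
        hit = [b for b in candidates if b[1] <= line_y <= b[1] + b[3]]
        if len(hit) > len(best):  # strict comparison keeps the first maximum
            best = hit
    return best
```

The same routine could be reused for the x-text by starting the sweep below the bottom edge of the x-labels instead of at the x-axis.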

X-text:

Filter the text boxes that are below the x-labels.
Run a sweeping line from the x-labels to the bottom of the image, and check where the sweeping line intersects the maximum number of text boxes.
This maximum intersection gives the bounding boxes for all the x-text.

Y-labels:

Filter the text boxes that are to the left of the y-axis.
Run a sweeping line leftward from the y-axis, and check where the sweeping line intersects the maximum number of text boxes.
Pick the text boxes at this maximum intersection, and filter them further using a Python regex to keep only numeric values.
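A sketch of the leftward sweep followed by the numeric filter is given below. The regular expression is an assumption (the pattern used by the project is not stated here), as are the function name and the (text, (x, y, width, height)) input format.

```python
import re

# Assumed pattern: optional sign, then an integer or decimal number.
NUMERIC = re.compile(r"^[+-]?(\d+\.?\d*|\.\d+)$")


def sweep_y_labels(items, y_axis_col):
    """items: list of (text, (x, y, w, h)). Returns the numeric y-label items."""
    # Keep only the boxes that lie entirely to the left of the y-axis.
    candidates = [(t, b) for t, b in items if b[0] + b[2] < y_axis_col]

    best = []
    for line_x in range(y_axis_col - 1, -1, -1):  # move left from the y-axis
        hit = [(t, b) for t, b in candidates if b[0] <= line_x <= b[0] + b[2]]
        if len(hit) > len(best):
            best = hit

    # Discard anything that is not a plain number (thousands separators removed).
    return [(t, b) for t, b in best if NUMERIC.match(t.replace(",", ""))]
```

The candidates that are not returned as y-labels are the ones that would be treated as y-text.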

Y-text:

Filter the text boxes which are to the left of y-axis.
The remaining text boxes that are not classified as y-labels will be considered as y-text.

Legend detection

Step 1: Filter the text boxes that are above the x-axis and to the right of the y-axis.
Step 2: Clean the text to remove 'I' characters. These appear because the AWS Rekognition OCR API detects the error bars in the charts as 'I'.
Step 3: Use an appropriate regex to discard the numerical values. These are mostly the numbers printed on top of the bars to denote the bar value.
Step 4: Merge the remaining text boxes (with an x-value threshold of 10) so that every multi-word legend is part of a single bounding box.
Step 5: Group the bounding boxes so that each member of a group is horizontally or vertically aligned with at least one other member of the group.
Step 6: The largest of the groups obtained in Step 5 gives the bounding boxes for all the legends.
Step 7: Legend text can be parsed and obtained from these bounding boxes.
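The cleaning, merging and grouping steps can be sketched as below, taking as input the text boxes already filtered to the plot area. All names, the regular expression and the alignment tolerances (`y_tol`, `tol`) are assumptions for illustration; only the x-gap threshold of 10 comes from the description above. Items are (text, (x, y, width, height)) tuples.

```python
import re

NUMERIC = re.compile(r"^[+-]?\d*\.?\d+%?$")  # assumed pattern for bar-value text


def clean(items):
    """Drop stray 'I' detections (error bars) and purely numeric text."""
    return [(t, b) for t, b in items
            if t.strip() != "I" and not NUMERIC.match(t.strip())]


def merge_words(items, x_gap=10, y_tol=5):
    """Merge boxes on the same text line whose horizontal gap is at most x_gap."""
    merged = []
    for text, (x, y, w, h) in sorted(items, key=lambda it: it[1][0]):
        for i, (mt, (mx, my, mw, mh)) in enumerate(merged):
            if abs(y - my) <= y_tol and abs(x - (mx + mw)) <= x_gap:
                top, bottom = min(y, my), max(y + h, my + mh)
                merged[i] = (mt + " " + text, (mx, top, x + w - mx, bottom - top))
                break
        else:
            merged.append((text, (x, y, w, h)))
    return merged


def aligned(a, b, tol=5):
    """True if two boxes share (roughly) a left edge or a top edge."""
    return abs(a[0] - b[0]) <= tol or abs(a[1] - b[1]) <= tol


def largest_aligned_group(items, tol=5):
    """Largest set of boxes in which each box is aligned with another member."""
    groups, unvisited = [], list(range(len(items)))
    while unvisited:
        stack, group = [unvisited.pop()], []
        while stack:
            i = stack.pop()
            group.append(i)
            near = [j for j in unvisited if aligned(items[i][1], items[j][1], tol)]
            for j in near:
                unvisited.remove(j)
            stack.extend(near)
        groups.append(sorted(group))
    return [items[i] for i in max(groups, key=len)]
```

Chaining the three helpers as `largest_aligned_group(merge_words(clean(items)))` yields the candidate legend entries with their merged bounding boxes.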

Data extraction

Value-tick ratio calculation:

This ratio is used to calculate the y-values of each bar plot using the pixel projection method. Y-axis ticks are detected from the text bounding boxes to the left of the y-axis.

Text detection of numeric values is not perfect, so once the pixel positions of the ticks and the corresponding y-label values are obtained, outliers are removed by assuming a normal distribution and discarding values that deviate strongly from the mean. Then, the mean distance between the ticks is calculated, followed by the mean value of the actual y-label ticks. Finally, the value-tick ratio is calculated from these two means.
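The sketch below shows one plausible reading of this calculation, and it should be treated as an assumption rather than the project's formula: the ratio is taken as the mean difference between consecutive y-label values divided by the mean pixel distance between consecutive ticks, and outliers are removed with a z-score cutoff. The function names and the `z_max` parameter are hypothetical.

```python
from statistics import mean, pstdev


def drop_outliers(values, z_max=2.0):
    """Remove values more than z_max standard deviations from the mean."""
    if len(values) < 3:
        return list(values)
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return list(values)
    return [v for v in values if abs(v - mu) / sigma <= z_max]


def value_tick_ratio(ticks, z_max=2.0):
    """ticks: list of (pixel_row, label_value) pairs for the y-axis ticks.

    Assumed definition: value units per pixel, computed as
    mean(label value gap) / mean(pixel gap) between consecutive ticks.
    """
    ordered = sorted(ticks)  # top of the image first
    pixel_gaps = [b[0] - a[0] for a, b in zip(ordered, ordered[1:])]
    value_gaps = [abs(b[1] - a[1]) for a, b in zip(ordered, ordered[1:])]
    return mean(drop_outliers(value_gaps, z_max)) / mean(drop_outliers(pixel_gaps, z_max))
```

For example, ticks 40 pixels apart whose labels differ by 10 give a ratio of 0.25 value units per pixel under this definition.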

Pattern (or color) estimation

Step 1: Whiten all the bounding boxes of the text in the image.
Step 2: Convert the resulting image into a binary image.
Step 3: Find contours (and their bounding rectangles) in the resulting image.
Step 4: For each legend, find the nearest bounding box to the left of the legend and at the same height.
Step 5: In the original image, find the major color (or pattern) in the nearest bounding box obtained for each legend in Step 4.
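The last two steps can be sketched in plain Python as follows. The names, the vertical tolerance `y_tol` and the data layout are assumptions: boxes are (x, y, width, height) tuples, the image is a list of rows of (r, g, b) tuples, and the "major color" is taken to be the most frequent pixel value inside the box.

```python
from collections import Counter


def nearest_left_box(legend_box, contour_boxes, y_tol=5):
    """Closest contour box that lies left of the legend text at the same height."""
    lx, ly, lw, lh = legend_box
    legend_mid = ly + lh / 2
    candidates = [
        b for b in contour_boxes
        if b[0] + b[2] <= lx and abs((b[1] + b[3] / 2) - legend_mid) <= y_tol
    ]
    # The nearest box is the one whose right edge is closest to the legend text.
    return max(candidates, key=lambda b: b[0] + b[2], default=None)


def major_color(image, box):
    """Most frequent (r, g, b) value inside the box of the original image."""
    x, y, w, h = box
    pixels = [image[r][c] for r in range(y, y + h) for c in range(x, x + w)]
    return Counter(pixels).most_common(1)[0][0]
```

Counting exact pixel values works for flat colors; patterned legend swatches would need a texture descriptor instead, which is outside the scope of this sketch.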

Getting bar plot for each legend

All the pixel values of the image are divided into clusters. Prior to clustering, all the white pixels are removed, and the bounding boxes found for each legend by the above procedure are whitened.
The number of clusters is determined by the number of legends detected. The colors finalized in the above procedure form the initial clusters.
The given plot is then simplified into multiple plots, one per cluster, each of which is a simple bar plot. That is, clustering converts a stacked bar chart into multiple simpler bar plots.
The contours of each plot are obtained, followed by the bounding rectangles of those contours.
For each label, the closest bounding rectangle is picked.
The height of each bounding box is recorded with the help of the merged rectangles obtained by the above procedure. The value-tick ratio is then used to calculate the y-values from these heights.
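A simplified sketch of the per-legend split and the final conversion is shown below. It is not the project's clustering code: each non-white pixel is assigned once to the nearest legend color (a single assignment step rather than a full iterative clustering), and the y-value of a bar is assumed to be its pixel height multiplied by the value-tick ratio, which presumes the y-axis starts at zero. All names are hypothetical.

```python
def nearest_color(pixel, centers):
    """Index of the legend color closest to the pixel (squared RGB distance)."""
    return min(
        range(len(centers)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(pixel, centers[i])),
    )


def split_by_color(image, centers, white=(255, 255, 255)):
    """Return one 0/1 mask per legend color; white pixels are ignored."""
    masks = [[[0] * len(row) for row in image] for _ in centers]
    for r, row in enumerate(image):
        for c, pixel in enumerate(row):
            if pixel == white:
                continue
            masks[nearest_color(pixel, centers)][r][c] = 1
    return masks


def bar_values(bar_boxes, ratio):
    """Convert bar heights in pixels to y-values (assumes the axis starts at 0)."""
    return [round(h * ratio, 2) for (_x, _y, _w, h) in bar_boxes]
```

Each mask can then be processed like a simple single-series bar plot: find its contours, take their bounding rectangles, and pass the rectangles to `bar_values`.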

The figure below shows the data extraction results for a sample image.

Reporting results

The results (axes, legends, labels, values, captions and file names) are written to an Excel sheet.

The table below shows the evaluation metrics.

Parameter	Accuracy	True Positive Rate
Legends	0.8054	0.8054
X-axis ticks	0.9755	0.9755
Y-axis ticks	0.6815	0.6815
height/value ratio	0.8919	0.8919
Y-axis label	0.7758	0.7758
X-axis label	0.7129	0.7129
Data correlation	0.6470	0.7504
Data values	0.2158	0.4095
