This project is designed to research possibilities for automated or semi-automated transcription of herbarium sheet text, both handwritten and typed.
- Create a Python 3.8 virtual environment. For example, in Anaconda Terminal:
conda create -n <envname> python=3.8 pip
conda activate <envname>
pip install -r requirements.txt
- Install the needed nltk corpora by running the `requirements-nltk.py` script. For example, in the same Anaconda window, execute: `python requirements-nltk.py`
- Set up Google Cloud Vision credentials. (Optional, but required to generate new GCV analyses.)
  - Download your Google Cloud Service Account Key. Save the file in this main directory, e.g. `service_account_token.json`. (See Google Cloud help center for more guidance with creating a key.)
  - Copy the `Configuration-plain.cfg` file as `Configuration.cfg`.
    - Edit the file to add the name of your service account token (in the `GOOGLE_CLOUD_VISION_API` section, under `serviceAccountTokenPath={name-of-your-service-account-token}`).
    - Update any other settings if desired.
- Set up Amazon Web Services credentials. (Optional, but required to generate new AWS analyses.)
  - Set up your Amazon Web Services account and store your credentials in the proper location. (This location is OS-dependent; please consult the AWS documentation for full instructions.)
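
As a rough illustration of how the token path set in `Configuration.cfg` might be read back, here is a minimal sketch using Python's standard `configparser`. The section name `GOOGLE_CLOUD_VISION_API` and the key `serviceAccountTokenPath` come from the steps above; the helper name `load_token_path` is hypothetical, and the project's own loading code may work differently.

```python
import configparser
from pathlib import Path


def load_token_path(config_file: str = "Configuration.cfg") -> Path:
    """Return the service account token path named in the config file.

    Hypothetical helper for illustration; not taken from the project code.
    """
    parser = configparser.ConfigParser()
    # ConfigParser lower-cases option names by default; keep the camelCase key.
    parser.optionxform = str
    # read() returns the list of files it could parse (empty if none).
    if not parser.read(config_file):
        raise FileNotFoundError(f"Could not read {config_file}")
    return Path(parser["GOOGLE_CLOUD_VISION_API"]["serviceAccountTokenPath"])
```

A missing section or key raises `KeyError`, which is usually a clearer failure than an authentication error later on.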
Currently available: Google Cloud Vision and Amazon Web Services Textract
- Download the ground truth information for your dataset, plus URLs of the images.
  - In a web browser, log into a Symbiota portal (Fern Portal or Bryophyte Portal) and download an occurrence file, which will be a CSV file containing image data and transcriptions of the herbarium sheet text.
    - As an authenticated user, click "Search Collections" under the "Search" tab. Uncheck the box to the left of "Select/Deselect all collections" to deselect everything, and then check the box for the desired dataset (e.g. "Field Museum of Natural History"). Click the search button.
    - Use the search fields to narrow down your export as needed. Some common examples:
      - Collector's Last Name (e.g. Steyermark).
    - Click "Table Display" to load the query results. Near the top right, click the down-arrow button to open a pop-up window for exporting the data.
      - For "Structure," select "Darwin Core."
      - For "Data Extensions," deselect "include Determination History" and select "include Image Records."
      - (Compression should already be checked, and "CSV" selected for file format.)
      - For "Character Set" select "UTF-8 (unicode)."
      - Click the "Download Records" button.
- Download your image set.
  - Run the script `utilities\join_occurrence_file_with_image_urls.py`, pointing to the ZIP file you just downloaded. A new CSV file (e.g. "occurrence_file_with_images.csv") is created in the same directory.
  - Run the script `utilities\download_images_from_csv.py`, pointing to (1) the "occurrence_file_with_images.csv" file, and (2) the desired directory for the downloaded image set.
- Gather language data.
  - Run the script `utilities\detect_language.py`, pointing to (1) the desired directory for the downloaded image set, and (2) the "occurrence_file_with_images.csv" file. This generates a timestamped CSV file called "detect_language_data.csv", which contains the document-level language for each sample, as well as any detected languages and the confidence of each detection.
  - If ground truth for the detected labels exists, you can run the script `utilities\language_validation.py`, pointing to (1) the ground truth occurrence file and (2) the detect_language_date.csv file. A language_validation_date.csv file is then generated, showing the accuracy of the language detection.
- Retrieve and save OCR data for your image set.
  - Run the script `gather_ocr_data_from_cloud_platforms.py`, pointing to the folder of images downloaded in the previous step, and the "occurrence_with_image_urls" file.
  - (To cut down on cloud usage, the program attempts to find already existing OCR response objects, and any OCR responses generated will be saved in the `ocr_responses` folder, with one subfolder for each cloud platform.) For a brand-new set of images, this script will take 23-30 seconds per image.
  - The file "occurrence_with_ocr-<yyyy_mm_dd-hh_mm_ss>.csv" will be saved in the folder `test_results`.
  - To generate annotated images for each image and cloud platform, e.g. if you want to visualize and manually compare, add a flag 'True' to the script (e.g. `python gather_ocr_data_from_cloud_platforms.py images True`). These images are saved within a new subfolder called `cloud_ocr-[yyyy_mm_dd-hh_mm_ss]`.
- Compare OCR data to ground truth data.
  - Run the script `prep_comparison_data.py` with the "occurrences_with_ocr" file generated in the previous step. This script saves 2 new files to the `test_results` folder:
    - The occurrence file with 3 added columns, saved as "occurrence_with_ocr_and_scores-<yyyy_mm_dd-hh-mm-ss>.csv":
      - `labelText` - ground truth data (compiled from the human-created transcriptions in the occurrence file)
      - `awsMatchingScore` - The total "score" for the AWS Textract platform's OCR text found in this image. (Roughly, this score gives 1 pt for an exact match, 0.5 pt for a 60-99% match, and no points for any match <60%.)
      - `gcvMatchingScore` - The total "score" for the GCV platform's OCR text found in this image.
    - A file called "compare_word_by_word-<yyyy_mm_dd-hh_mm_ss>.csv", which shows the best OCR match (fuzzy match ratio) for each word in the ground truth text (taken from the occurrence file).
  - (Note: because of the label-finding feature, this script takes roughly 23-30 seconds per image on an average personal computer.)
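
The scoring rule described for `awsMatchingScore` and `gcvMatchingScore` (1 pt for an exact match, 0.5 pt for a 60-99% match, nothing below 60%) could be sketched roughly as follows. This is an illustration under assumptions, not the code in `prep_comparison_data.py`: it uses the standard library's `difflib.SequenceMatcher` as a stand-in for whichever fuzzy-matching library the project uses, splits text on whitespace, ignores case, and the function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List


def best_match_ratio(word: str, ocr_words: List[str]) -> int:
    """Best fuzzy match ratio (integer 0-100) of `word` against any OCR word."""
    return max(
        (round(100 * SequenceMatcher(None, word.lower(), w.lower()).ratio())
         for w in ocr_words),
        default=0,  # no OCR words at all means no match
    )


def matching_score(ground_truth: str, ocr_text: str) -> float:
    """Total score for one image: 1 per exact word, 0.5 per 60-99% match."""
    ocr_words = ocr_text.split()
    score = 0.0
    for word in ground_truth.split():
        ratio = best_match_ratio(word, ocr_words)
        if ratio == 100:
            score += 1.0
        elif ratio >= 60:
            score += 0.5
    return score
```

For example, `matching_score("PLANTS OF GUATEMALA", "PLANTS OF GUATEMALA")` returns `3.0`, one point per exactly matched word.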
- After the first three steps above, run `create_images_for_zooniverse.py` using the folder of downloaded images (and an optional 2nd argument to ID the folders that will be created -- highly recommended!). This script will generate two new folders: `processed_images-{optional_id}` (label images with words boxed by blue squares) and `processed_images_nn-{optional_id}` (images of each word, cropped, for use in machine learning once labelled). The first folder will also contain a file `zooniverse_manifest.csv`, which should be uploaded to Zooniverse with the images.
- Upload the `processed_images-{optional_id}` folder with the CSV file to Zooniverse as a subject set.
- Once volunteers have finished the workflow for this subject set, export the data ("Request new classification export CSV").
/
- `gather_ocr_data_from_cloud_platforms.py` - see "Comparing OCR platforms on analyzing herbarium sheet labels" in Example Workflows
- `prep_comparison_data.py` - see "Comparing OCR platforms on analyzing herbarium sheet labels" in Example Workflows
- `calculate_changes.py` - quick visualization (in terminal), comparing OCR platforms' performance.
- `compare_ocrs.py` - deprecated?
- `create_images_for_zooniverse.py` - Given a folder of images, create images to spec for the latest Zooniverse project.
- `crop_images_of_words.py` - (in development) Use Zooniverse results and herbarium images to create a dataset of labeled word images.
- `Configuration-plain.cfg` - see Environment Setup
- `requirements.txt` - see Environment Setup
/imageprocessor/
- Contains classes for handling, parsing, and visualizing OCR data.
image_annotator.py
image_processor.py
/labelcorpus/
- (not in use) Contains files for creating and applying text corpora.
analyze_corpus.py
make_corpus_from_occurence_file.py
/nameresolution/
- (not in use) Contains files for fuzzy text matching/error correction, specifically for scientific names and synonym resolution.
fuzzy_text_matching.py
taxon_binomial_name_matching.py
/utilities/
- Contains files for quickly loading and saving commonly used data types, as well as some scripts which have specific uses, such as for parsing files and batch downloading.
/image_preparation/
- `data_loader.py` - Quickly load various data types which are common in this repo.
- `data_processor.py` - Quickly save various data types which are common in this repo.
- `join_occurrence_file_with_image_urls.py` - Given a Fern Portal occurrence file, find and verify URLs for each image.
- `download_images_from_csv.py` - After running the previous script, download the images for a given URL.
- `detect_language.py` - Extracts language data from the Google Cloud API call
- `language_validation.py` - Compare the detect language CSV to a ground truth occurrence CSV
- `quick_crop_labels.py` - Quickly and roughly crop the bottom right corner of a set of herbarium sheet images.
- `timer.py` - Quick timer class for tracking program execution time.
/imageprocessor_objects/
- Stores all pickled ImageProcessor objects.
/test_results/
- Stores various processed CSV files and annotated images.
Use this program to send each image file to all available cloud-based platforms, for OCR processing.
Input:
- A single image file on the local computer
- OR
- A folder of image files on the local computer
N.B. about file naming and cloud server usage: To reduce cloud computing costs, the program always searches the `ocr_responses` folder for an existing response object before sending a query to the cloud service. Responses are stored in that folder under the base of the image file name, e.g. the OCR response for `cat-and-dog.jpg` is saved as `cat-and-dog.pickle`. If `cat-and-dog.jpg` is run through this script again, it will import the pickle (and print a message to the console, "Using previously pickled response object for cat-and-dog").
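
The check-the-cache-first behavior described in this note can be sketched as below. This is a minimal illustration, not the project's implementation: the function name `get_ocr_response` and the `query_cloud` callable are hypothetical, and the one-subfolder-per-platform layout is assumed to match the `ocr_responses` folder.

```python
import pickle
from pathlib import Path
from typing import Any, Callable


def get_ocr_response(image_path: str, platform: str,
                     query_cloud: Callable[[str], Any],
                     cache_root: str = "ocr_responses") -> Any:
    """Load a pickled response if one exists; otherwise query and cache it."""
    base_name = Path(image_path).stem  # 'cat-and-dog.jpg' -> 'cat-and-dog'
    cache_file = Path(cache_root) / platform / f"{base_name}.pickle"
    if cache_file.exists():
        print(f"Using previously pickled response object for {base_name}")
        with cache_file.open("rb") as f:
            return pickle.load(f)
    # Only reached for images with no saved response: this is the billable call.
    response = query_cloud(image_path)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    with cache_file.open("wb") as f:
        pickle.dump(response, f)
    return response
```

Because the cache key is only the base file name, two different images that share a name (in different folders, say) would reuse the same pickle.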
Outputs:
- The response for any new cloud query is saved in the folder `ocr_responses`, with one sub-folder for each cloud service, e.g. `aws` and `gcv`.
- The other outputs are all saved to the `test_results` folder, in a new subfolder called `cloud_ocr-[timestamp]`.
- The complete text output is saved as `ocr_texts.csv`, with one row per image. For AWS and GCV respectively, a line break (`\n`) character separates each "line" or "paragraph" of OCR data. (Both AWS and GCV will generate extended character sets and non-Latin characters, e.g. Latin letters with diacritics, Korean, Arabic, etc.)
- Unless flagged false (see example usage), one copy of each image is generated per cloud platform, with annotations indicating the "words" found by all platforms. The program is configured to draw (1) a thin black box around each line/paragraph, (2) a green line at the start of each detected word, and (3) a red line at the end of each detected word. (This can be adjusted in the `draw_comparison_image` function.)
Example usage:
- `python gather_ocr_data_from_cloud_platforms.py oneimage.jpg`
- `python gather_ocr_data_from_cloud_platforms.py image_folder`
- `python gather_ocr_data_from_cloud_platforms.py image_folder True` (Same functionality as the previous example)
- `python gather_ocr_data_from_cloud_platforms.py image_folder false` (Optional second argument to skip the creation of the annotated images. Case-insensitive, will detect "false", "no", or "n".)
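
The handling of the optional second argument might look something like the following sketch. The accepted values ("false", "no", "n", case-insensitive) come from the usage notes above; the function name `parse_args` and treating every other value as "generate the images" are assumptions for illustration.

```python
from typing import List, Tuple

# Values that switch off annotated-image creation (compared case-insensitively).
NEGATIVE_FLAGS = {"false", "no", "n"}


def parse_args(argv: List[str]) -> Tuple[str, bool]:
    """Return (image file or folder, whether to create annotated images)."""
    if len(argv) < 2:
        raise SystemExit(
            "usage: python gather_ocr_data_from_cloud_platforms.py <image-or-folder> [flag]"
        )
    target = argv[1]
    annotate = True  # annotated images are created unless flagged false
    if len(argv) > 2:
        annotate = argv[2].strip().lower() not in NEGATIVE_FLAGS
    return target, annotate
```

Called as `parse_args(sys.argv)`, this returns `("image_folder", False)` for the last example above and `("image_folder", True)` for the two before it.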
Example output for `python gather_ocr_data_from_cloud_platforms.py oneimage.jpg`:
Saved in the folder `./test_results/cloud_ocr-<yyyy-mm-dd_hh-mm-ss>/`:
- Image saved as `oneimage-annotated<datestamp>.jpg` in sub-folder `aws`.
- Image saved as `oneimage-annotated<datestamp>.jpg` in sub-folder `gcv`.
- CSV file `ocr_texts.csv`:
barcode | gcv | aws |
---|---|---|
C12345678F | 31160 PLANTS OF GUATEMALA ...(etc) | 31160 PLANTS OF GUATEMALA ...(etc) |
... | ... | ... |
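
Since each cell of `ocr_texts.csv` can hold several OCR lines separated by `\n`, the file needs a CSV writer that quotes embedded line breaks. A minimal sketch with the standard `csv` module is shown below; the column names follow the table above, while the function name `write_ocr_texts` and the shape of the `results` dictionary are assumptions made for illustration.

```python
import csv
from typing import Dict, List


def write_ocr_texts(path: str, results: Dict[str, Dict[str, List[str]]]) -> None:
    """Write one row per image; `results` maps barcode -> platform -> OCR lines."""
    # newline="" lets the csv module control line endings and quoting itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["barcode", "gcv", "aws"])
        writer.writeheader()
        for barcode, by_platform in results.items():
            writer.writerow({
                "barcode": barcode,
                # Join each platform's lines/paragraphs with a line break.
                "gcv": "\n".join(by_platform.get("gcv", [])),
                "aws": "\n".join(by_platform.get("aws", [])),
            })
```

When reading the file back, open it with `newline=""` as well so that the line breaks inside quoted cells survive intact.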
Using the downloaded occurrence information (as a ZIP file), this program joins the full occurrence record with the URL of the high resolution image for each row.
Input:
- A ZIP file exported from the Fern Portal. See workflows above for detailed information.
Output:
- The results are saved as a new file in the same directory, with the file name `occurrence_file_with_images-[timestamp].csv`. This file is the same as the `occurrences.csv` file, with one additional column, `image_url`, taken from the `images.csv` file.
Example usage:
`python utilities/join_occurrence_file_with_image_urls.py resources/occur_download.zip`
Example output:
Saved in `resources/` as `occurrence_file_with_images-2021_04_14-10_58_13.csv`.
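
The join itself can be sketched with the standard library alone. The file names `occurrences.csv` and `images.csv`, the added `image_url` column, and the timestamped output name come from the description above. The linking columns (`id` in the occurrence file, `coreid` and `accessURI` in the image file) are assumptions based on common Darwin Core Archive exports and should be checked against a real download. The function name is hypothetical, and the URL verification that the actual script performs is left out.

```python
import csv
import io
import zipfile
from datetime import datetime
from pathlib import Path


def join_occurrences_with_images(zip_path: str) -> Path:
    """Write occurrences.csv plus an `image_url` column next to the ZIP file."""
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open("images.csv") as raw:
            images = csv.DictReader(io.TextIOWrapper(raw, encoding="utf-8", newline=""))
            # Assumed columns: `coreid` links to the occurrence, `accessURI` is the URL.
            url_by_id = {}
            for row in images:
                url_by_id.setdefault(row["coreid"], row["accessURI"])  # keep first image
        with archive.open("occurrences.csv") as raw:
            reader = csv.DictReader(io.TextIOWrapper(raw, encoding="utf-8", newline=""))
            fieldnames = list(reader.fieldnames or []) + ["image_url"]
            # Occurrences without an image get an empty URL rather than being dropped.
            rows = [dict(row, image_url=url_by_id.get(row["id"], "")) for row in reader]

    stamp = datetime.now().strftime("%Y_%m_%d-%H_%M_%S")
    out_path = Path(zip_path).parent / f"occurrence_file_with_images-{stamp}.csv"
    with out_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return out_path
```

Keeping only the first image per occurrence is a simplification; a record with several images would need a policy for which URL to keep.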
This program uses fuzzy match ratio to find the closest name match based on World Flora Online.
Input:
- A text file of generated binomial names (genus and species), e.g. as generated by OCR, with one name per line.
Output:
- The results are saved with the name `[original_filename]-name_match_results.csv` in the current working directory.
- This file has 3 columns:
  - The original text string from the input file
  - A list showing the highest ratio match (or multiple options, if tied)
  - The highest ratio achieved by those matches (an integer value 0-100, representing %)
Example usage:
python nameresolution/taxonomic_name.py file_of_OCR_names_to_match.txt
Example output:
Saved as `file_of_OCR_names_to_match-name_match_results.csv`:
search_query | best_matches | best_match_ratio |
---|---|---|
Adiantum pedatum | ['adiantum pedatum'] | 100 |
Polypodium virginiangan | ['polypodium virginianum'] | 89 |
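
The three output columns can be reproduced in miniature with the sketch below. It is an illustration under assumptions: `difflib.SequenceMatcher` from the standard library stands in for the project's fuzzy-matching library (so ratios may differ slightly from the real output), the reference list is passed in directly rather than loaded from World Flora Online, and the function name `best_name_matches` is hypothetical.

```python
from difflib import SequenceMatcher
from typing import Iterable, List, Tuple


def best_name_matches(query: str, reference_names: Iterable[str]) -> Tuple[List[str], int]:
    """Return (names tied for the highest ratio, that ratio as an integer 0-100)."""
    query = query.lower()
    best_ratio = -1
    best_names: List[str] = []
    for name in reference_names:
        candidate = name.lower()
        ratio = round(100 * SequenceMatcher(None, query, candidate).ratio())
        if ratio > best_ratio:
            best_ratio, best_names = ratio, [candidate]  # new best, reset ties
        elif ratio == best_ratio:
            best_names.append(candidate)  # tie: keep all options
    return best_names, max(best_ratio, 0)
```

Comparing every query against the full reference list is linear in its size, which is fine for a sketch but slow for a large taxonomy.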
This project is being developed for the Grainger Bioinformatics Center at the Field Museum by Beth McDonald (Machine Learning Engineer, @emcdona1) and Sean Cullen (Botany Collections Intern, @SeanCullen11), under the guidance of Dr. Rick Ree and Dr. Matt von Konrat.
Original codebase for a GUI system with a local database developed by Keshab Panthi (@kpanthi), Northeastern Illinois University.