This extension lets you spot-check extracted data against source PDFs. It is useful when developing an OCR or document analysis pipeline with Python and pandas. It provides a custom editor that shows a record on the left half and the source PDF on the right half.
- Take random samples from a Python script and show them in a custom editor:
  - Install this extension.
  - Install the Python library vscode-spot-check: pip install vscode-spot-check
  - Add these lines to your Python script:

        from vscode_spot_check import print_samples

        ...

        if __name__ == '__main__':
            dataframe = do_your_processing()
            print_samples(
                dataframe,
                resolve_source_path=lambda row: row.filepath,
                resolve_pageno=lambda row: row.pageno,
            )
            ...

  - Run the command Open with Spot Check while your Python script is open.
- Python >= 3.8
This extension contributes the following settings:
- spot-check.pythonInterpreterPath: path to the Python interpreter this extension uses to run your Python script. Defaults to python.
- spot-check.pythonPaths: array of paths to add to PYTHONPATH during script execution. In any path, you can substitute the variable workspaceFolder.
- spot-check.cwd: current working directory during script execution. You can also substitute the variable workspaceFolder here.
Print random samples from a dataframe. This function does nothing unless the first command-line argument to the script is "printSamples" (which is how the extension invokes the script), so you don't need to comment out and later restore the call when running the script for a different purpose.
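The argument check described above can be pictured with a minimal sketch. This is not the library's actual implementation: should_print_samples and print_samples_sketch are hypothetical names, and the only behavior taken from the description is that the first argument after the script name must be "printSamples".

```python
import sys


def should_print_samples(argv):
    """Return True when the first argument after the script name is "printSamples"."""
    return len(argv) > 1 and argv[1] == "printSamples"


def print_samples_sketch(rows, argv=None):
    """Hypothetical stand-in for print_samples: does nothing on a normal run."""
    argv = sys.argv if argv is None else argv
    if not should_print_samples(argv):
        # Normal run: skip sampling so the rest of the script continues.
        return False
    for row in rows:
        print(row)
    return True
```

With a guard like this, running the script directly leaves the call inert, while an invocation that passes "printSamples" as the first argument triggers the output.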
Arguments:
- data (pandas.DataFrame): the data to sample
- resolve_source_path (func(pandas.Series) -> str): given a row from the data, this function must return the absolute path to the source PDF file
- resolve_pageno (func(pandas.Series) -> int): optional. Given a row from the data, this function must return the page number of the PDF.
- number_of_samples (int): number of samples to produce with each invocation. Defaults to 100.
- sort (bool): sort the samples according to the original row order. Defaults to True.
- exit_on_success (bool): exit the script after this function prints samples successfully. It prevents any code that comes after this function from running to reduce side effects. Defaults to True.
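When the dataframe stores relative paths or non-integer page numbers, the two resolver arguments can be written as named functions rather than lambdas. The sketch below is only an illustration under stated assumptions: PDF_ROOT is a hypothetical directory, the column names filepath and pageno are borrowed from the usage example, and a SimpleNamespace stands in for a pandas.Series row so the snippet runs without pandas.

```python
from pathlib import Path
from types import SimpleNamespace

# Hypothetical directory that holds the source PDFs (an assumption for this sketch).
PDF_ROOT = Path("/data/scans")


def resolve_source_path(row):
    """Return the path to the row's source PDF, anchored at PDF_ROOT if relative."""
    path = Path(row.filepath)
    if not path.is_absolute():
        path = PDF_ROOT / path
    return str(path)


def resolve_pageno(row):
    """Return the page number as a plain int, even if it is stored as a float."""
    return int(row.pageno)


# Stand-in for one dataframe row; a real row would be a pandas.Series.
row = SimpleNamespace(filepath="invoices/0001.pdf", pageno=3.0)
print(resolve_source_path(row))
print(resolve_pageno(row))
```

These functions could then be passed as resolve_source_path=resolve_source_path and resolve_pageno=resolve_pageno in the print_samples call.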
Includes web assets in package
Initial release