Unstructured-IO/unstructured
Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
HTML · Apache-2.0
Issues
bug/unable to instantiate HuggingFaceEmbeddingEncoder from unstructured.embed.huggingface
#3731 opened by mattseddon - 3
broken inference source code for 'hi_res', AttributeError: 'list' object has no attribute 'element_coords', the same code worked with previous versions of unstructured
#3718 opened by Arslan-Mehmood1 - 0
bug/file timeout when partition
#3727 opened by AustinZzx - 0
Minio + Unstructured + Weaviate
#3726 opened by naelsen - 7
bug/partition_msg halts for attachments with UNK type
#3671 opened by S1M0N38 - 3
bug/reading html file returns empty list
#3708 opened by lwollenbergfuzzy - 0
When parsing doc or docx files, could image tags be added?
#3719 opened by sph116 - 3
bug/API Fails out of the box ingesting a PDF - "File type application/octet-stream is not supported"
#3673 opened by ReMuSoMeGA93 - 0
bug/partition_html removes HTML tables when parsing
#3717 opened by deku0818 - 1
Facing punkt error for PY3_tab
#3597 opened by Asif-droid - 1
bug/wrong-example-file-path-in-readme: Get "No such file or directory" error by following the steps in the README
#3713 opened by shaofengshi - 0
bug/Extract ppt failed by api
#3707 opened by JohnJyong - 0
bug/md page breaks
#3601 opened by johnayoub-wtw - 0
Simplify Element type by use of Pydantic?
#3702 opened by ctrahey - 1
feat/Writing back the unstructured extracted partitions to the same file format
#3699 opened by SinaRanjkeshzade - 0
bug/certain htmls cannot be parsed
#3697 opened by AraiYuno - 1
feat/option to load extraction models once instead of every time the partition pdf function is called
#3698 opened by hasansalimkanmaz - 0
bug/Titles not included in chunks by-title
#3688 opened by dividor - 4
feat/numpy_2
#3684 opened by mgraczyk - 2
bug/<TypeError: unstructured.partition.common.add_element_metadata() got multiple values for keyword argument 'coordinates'>
#3665 opened by MrForExample - 0
bug/application/octet-stream not supported
#3677 opened by jeremydiba - 2
bug/Auto partition fails on text files which are empty or contain only whitespace
#3674 opened by tc360950 - 0
bug/Extensions .mdx and .markdown not supported
#3670 opened by butasebi - 0
bug/html parsing incorrectly categorizing text
#3666 opened by bhoppeadoy - 0
bug/`partition_xlsx` function raises TypeError with `infer_table_structure = False` and `find_subtable = False`
#3662 opened by bawgz - 1
bug/<502 Bad Gateway Error>
#3654 opened by shriharshan - 6
bug/Cannot partition doc files with multi-byte names
#3652 opened by Snowman-s - 0
Not extracting data using api_url in aws marketplace
#3653 opened by shriharshan - 1
bug/AttributeError: 'lxml.etree._ProcessingInstruction' object has no attribute 'is_phrasing'
#3642 opened by skehlet - 2
bug/Unable to download NLTK data
#3617 opened by TaylorN15 - 0
error in reading and parsing elements from file
#3640 opened by prashanthin - 2
bug/partition docx with table fails
#3628 opened by Klaijan - 2
algorithm underlying table extract and parsing
#3576 opened by naarkhoo - 0
bug/sorting order for PDF partitioning strategy `fast`
#3619 opened by simonschoe - 0
bug/In the case of AWS Lambda, the check fails because nltkdir is not writable.
#3612 opened by cds-code - 0
feat/support utf-8 responses
#3611 opened by davidgilbertson - 1
bug/skipping-figures
#3606 opened by joelgwebber - 0
feat/chunking by page
#3613 opened by saeedesmaili - 4
bug/Docker image import unstructured failed
#3590 opened by guici123 - 0
translate error: IndexError: index out of range in self
#3607 opened by vkbbkvvkb - 0
bug/file type detection fallback strategy not working
#3596 opened by WHALEEYE - 0
bug(docx): O(N^2) time on large section+paragraph count
#3592 opened by scanny - 2
Docker image is missing dnf or yum
#3589 opened by guici123 - 2
cannot import name 'is_temp_file_path' from 'unstructured.utils' (/usr/local/lib/python3.10/dist-packages/unstructured/utils.py)
#3580 opened by regismvargas - 1