Task #11980 (new)
Extract text from large PDFs for indexing
|Reported by:||jballanco-x||Owned by:||jballanco-x|
|Keywords:||search, full text indexing||Cc:|
A PDF file may contain many images, pushing its total size past the large-file rejection limit (see #11979) even though its text content alone is below that limit. We need a mechanism that extracts the text from such files and checks the extracted text against the size limit independently of the parent file's size. If the text alone is below the cut-off, we should still index it.
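A rough sketch of the check described above, for discussion only. The function name `indexable_text`, the `extract_text` callable, and the `limit` parameter are all hypothetical; the actual cut-off is whatever #11979 defines, and the extractor could be any PDF-to-text tool. Measuring the text as UTF-8 bytes and rejecting text that is exactly at the limit are also assumptions.

```python
from typing import Callable, Optional


def indexable_text(
    path: str,
    extract_text: Callable[[str], str],  # hypothetical: any PDF-to-text function
    limit: int,                          # hypothetical: the cut-off from #11979, in bytes
) -> Optional[str]:
    """Return the extracted text if it is below the limit, else None.

    The size of the PDF itself is deliberately not consulted here:
    only the extracted text is measured against the limit.
    """
    text = extract_text(path)
    # Assumption: the limit is compared against the UTF-8 encoded size.
    text_size = len(text.encode("utf-8"))
    if text_size >= limit:
        return None  # text alone is too large, so skip indexing
    return text
```

A caller would run this only for PDFs that were rejected by the whole-file size check, and pass any non-None result on to the indexer.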