Task #11980 (new)
Opened 10 years ago
Last modified 8 years ago
Extract text from large PDFs for indexing
Reported by: | jballanco-x | Owned by: | jballanco-x |
---|---|---|---|
Priority: | major | Milestone: | Asynchronous |
Component: | Search | Version: | 4.4.10 |
Keywords: | search, full text indexing | Cc: | |
Resources: | n.a. | Referenced By: | n.a. |
References: | n.a. | Remaining Time: | n.a. |
Sprint: | n.a. |
Description
A PDF file may contain many images, pushing its size over the large-file rejection limit (see #11979) even though its text content alone is below the limit. We need a mechanism that extracts the text from such files and checks it against the size limit independently of the parent file. If the text alone is below the cut-off, we should still index it.
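A minimal sketch of the proposed check, not the actual implementation. The function name `indexable_text`, the `extract_text` callable (a stand-in for whatever PDF text extractor is chosen), and the `MAX_INDEXABLE_BYTES` value are all hypothetical; the real limit is whatever #11979 defines. It also assumes the limit is measured in bytes of UTF-8 encoded text and that text exactly at the limit is still accepted.

```python
from typing import Callable, Optional

# Hypothetical value; the real cut-off is the one introduced by #11979.
MAX_INDEXABLE_BYTES = 1024


def indexable_text(
    path: str,
    extract_text: Callable[[str], str],
    max_bytes: int = MAX_INDEXABLE_BYTES,
) -> Optional[str]:
    """Return the text to index for the file at `path`, or None to skip it.

    `extract_text` is a placeholder for the PDF text extractor.
    """
    text = extract_text(path)
    # Measure the extracted text, not the parent file, so an image-heavy
    # PDF with little text is still indexed.
    text_size = len(text.encode("utf-8"))
    if text_size > max_bytes:
        return None
    return text
```

With a fake extractor that returns a short string, `indexable_text("scan.pdf", lambda p: "short text")` returns the text regardless of how large `scan.pdf` is on disk; it returns `None` only when the extracted text itself exceeds the limit.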
Change History (8)
comment:1 Changed 10 years ago by jballanco-x
comment:2 Changed 10 years ago by jballanco-x
- Milestone changed from 5.0.1 to 5.0.2
comment:3 Changed 10 years ago by jballanco-x
Referencing ticket #11936 has changed sprint.
comment:4 Changed 9 years ago by jamoore
Referencing ticket #11936 has changed sprint.
comment:8 Changed 8 years ago by jamoore
- Milestone changed from 5.x to Asynchronous
comment:6 Changed 8 years ago by jamoore
Referencing ticket #11936 has changed sprint.
comment:7 Changed 8 years ago by jamoore
Referencing ticket #11936 has changed sprint.
Referencing ticket #11936 has changed sprint.