Task #11980 (new)

Opened 5 years ago

Last modified 3 years ago

Extract text from large PDFs for indexing

Reported by:	jballanco-x	Owned by:	jballanco-x
Priority:	major	Milestone:	Asynchronous
Component:	Search	Version:	4.4.10
Keywords:	search, full text indexing	Cc:
Resources:	n.a.	Referenced By:	n.a.
References:	n.a.	Remaining Time:	n.a.
Sprint:	n.a.

Description

A PDF file may contain many images, which can push its total size over the large-file rejection limit (see #11979) even though its text content alone is below that limit. We need a mechanism that extracts the text from such files and checks the extracted text against the size limit, independently of the size of the parent file. If the text alone is below the cut-off, we should still index it.
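
A minimal sketch of the check described above, not the project's actual implementation. The names `indexable_text`, `extract_text` and `MAX_INDEXABLE_BYTES` are hypothetical, the 10 MiB value is a placeholder for whatever limit #11979 defines, and the PDF text extractor is passed in as a callable because the ticket does not name a library. Measuring the text as UTF-8 bytes is also an assumption.

```python
from typing import Callable, Optional

# Placeholder cut-off (assumption); the real limit is the one from #11979.
MAX_INDEXABLE_BYTES = 10 * 1024 * 1024


def indexable_text(
    path: str,
    extract_text: Callable[[str], str],
    max_bytes: int = MAX_INDEXABLE_BYTES,
) -> Optional[str]:
    """Return the text to index for the PDF at `path`, or None to skip it.

    The size of the PDF itself is deliberately ignored: only the
    extracted text is compared against the limit.
    """
    # `extract_text` is a hypothetical hook wrapping some PDF library.
    text = extract_text(path)
    # Compare the encoded size of the text alone against the cut-off.
    if len(text.encode("utf-8")) > max_bytes:
        return None
    return text


if __name__ == "__main__":
    # Stand-in extractor for demonstration; a real one would parse the PDF.
    def fake_extractor(_path: str) -> str:
        return "short text from an image-heavy PDF"

    result = indexable_text("scan.pdf", fake_extractor, max_bytes=1024)
    print("index" if result is not None else "skip")
```

Injecting the extractor keeps the size check testable without a real PDF and leaves the choice of extraction library open.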

References

Referenced by:
← User Story (#11936): Improve search indexing robustness, performance, and reliability

Change History (8)

comment:1 Changed 5 years ago by jballanco-x

Referencing ticket #11936 has changed sprint.

comment:2 Changed 5 years ago by jballanco-x

Milestone changed from 5.0.1 to 5.0.2

comment:3 Changed 5 years ago by jballanco-x

Referencing ticket #11936 has changed sprint.

comment:4 Changed 5 years ago by jamoore

Referencing ticket #11936 has changed sprint.

comment:5 Changed 5 years ago by jamoore

Milestone changed from 5.1.0-m4 to 5.x

Pushing out.

comment:8 Changed 3 years ago by jamoore

Milestone changed from 5.x to Asynchronous

comment:6 Changed 3 years ago by jamoore

Referencing ticket #11936 has changed sprint.

comment:7 Changed 3 years ago by jamoore

Referencing ticket #11936 has changed sprint.
