Task #11979 (closed)
Reject large files from Indexing
Reported by: | jballanco-x | Owned by: | jballanco-x |
---|---|---|---|
Priority: | blocker | Milestone: | 5.0.0 |
Component: | Search | Version: | 4.4.10 |
Keywords: | n.a. | Cc: | |
Resources: | n.a. | Referenced By: | n.a. |
References: | n.a. | Remaining Time: | n.a. |
Sprint: | n.a. |
Description
It has been observed that attempting to index very large files (size > heap) can corrupt the Lucene index for an image, such that any files, tags, or other metadata added to the image after the large file are no longer searchable.
Until we can narrow down why these indexes become corrupted or craft a work-around, we should reject large files from indexing to prevent the corruption.
Change History (5)
comment:1 Changed 10 years ago by jballanco-x
comment:2 Changed 10 years ago by jballanco-x
Testing indicates that the maximum indexable file size varies with heap size. With a 256MB heap, a 126MB file indexes fine but a 256MB file causes index corruption. With a 512MB heap, the 256MB file is fine but a 512MB file is not. Setting half of the heap space as the default maximum for now while investigation into the exact cut-off continues.
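As a rough illustration of the half-heap default described in this comment, the check could be sketched as below. The class and method names (`IndexSizeGuard`, `limitForHeap`, `shouldIndex`) are hypothetical and are not taken from the pull request. Note also that `Runtime.maxMemory()` may report slightly less than the configured `-Xmx` value on some JVMs, so the effective limit can be a little under half of the nominal heap.

```java
import java.io.File;

public class IndexSizeGuard {

    /** Half-heap rule from the ticket: the limit is half of the given heap size. */
    static long limitForHeap(long heapBytes) {
        return heapBytes / 2;
    }

    /** Default limit for the running JVM, based on the maximum heap it may use. */
    static long defaultMaxFileSize() {
        return limitForHeap(Runtime.getRuntime().maxMemory());
    }

    /** True when a file of this size is small enough to hand to the indexer. */
    static boolean shouldIndex(long fileSizeBytes, long maxFileSizeBytes) {
        return fileSizeBytes <= maxFileSizeBytes;
    }

    public static void main(String[] args) {
        long limit = defaultMaxFileSize();
        for (String path : args) {
            File f = new File(path);
            // Assumption: oversized files are skipped, not partially indexed.
            String verdict = shouldIndex(f.length(), limit) ? "index" : "reject";
            System.out.println(verdict + ": " + path + " (" + f.length() + " bytes)");
        }
    }
}
```

With this rule, a 256MB heap yields a 128MB limit, so the 126MB file from the tests is accepted and the 256MB file is rejected. A 512MB heap yields a 256MB limit, which accepts the 256MB file and rejects the 512MB one, matching the observations above.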
comment:3 Changed 10 years ago by jballanco-x
Pull request sent: https://github.com/openmicroscopy/openmicroscopy/pull/2107
comment:4 Changed 10 years ago by jballanco-x
- Resolution set to fixed
- Status changed from new to closed
PR merged. Ready for 5.0.0
comment:5 Changed 10 years ago by Josh Moore <josh@…>
(In [dac7d403a79109bf954f2c8ac25a90ad4b3fd8f3/ome.git] on branch develop) Merge pull request #2142 from jballanc/rebased/develop/limit-indexed-file-size
Limit max file size for FullTextParser (see #11979) (rebased onto develop)
First step in this task is to figure out exactly where the file-size cut-off should be. So far testing indicates that files larger than 296 MB with a 256 MB heap are too large. Further testing will attempt to discern if "large" is independent of heap size and/or total size of files already indexed.
We should endeavor to find a good default cut-off, but the exact cut-off will be end-user configurable.
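One way the end-user configurable cut-off could sit on top of a default is sketched below. This is only an assumption about the shape of such a setting: the property name `example.search.max_file_size` and the class `MaxFileSizeConfig` are hypothetical placeholders, not the names used in the merged change.

```java
public class MaxFileSizeConfig {

    // Hypothetical property name, used here for illustration only.
    static final String PROPERTY = "example.search.max_file_size";

    /**
     * Returns the configured limit in bytes, or the fallback when the
     * setting is missing, blank, non-numeric, or not positive.
     */
    static long resolve(String configured, long fallback) {
        if (configured == null || configured.trim().isEmpty()) {
            return fallback;
        }
        try {
            long value = Long.parseLong(configured.trim());
            return value > 0 ? value : fallback;
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        // Fallback: half of the maximum heap available to this JVM.
        long fallback = Runtime.getRuntime().maxMemory() / 2;
        long limit = resolve(System.getProperty(PROPERTY), fallback);
        System.out.println("Max indexable file size: " + limit + " bytes");
    }
}
```

Falling back silently on a malformed value keeps indexing running with a safe default. A stricter deployment might prefer to fail at startup instead.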