User Story #3164 (closed)
BUG: Searching returns no results for wildcard searches
Reported by: | atarkowska | Owned by: | jamoore |
---|---|---|---|
Priority: | blocker | Milestone: | 5.0.3 |
Component: | Services | Keywords: | n.a. |
Cc: | jrswedlow, java@…, wmoore | Story Points: | n.a. |
Sprint: | n.a. | Importance: | n.a. |
Total Remaining Time: | n.a. | Estimated Remaining Time: | n.a. |
Description (last modified by jmoore)
This is caused by the FullTextAnalyzer (ticket:1010) not being used for wildcard searches:
- http://wiki.apache.org/lucene-java/LuceneFAQ#What_wildcard_search_support_is_available_from_Lucene.3F
- http://jira.atlassian.com/browse/JRA-15006
- http://stackoverflow.com/questions/2432486/lucene-wildcard-queries
- http://markmail.org/message/nfpoofjlyu4fkgcl#query:+page:1+mid:bkbjgtjvp5xz5igo+state:results
There's a proposed workaround in lucene-misc-2.4.1.jar called the AnalyzingQueryParser, which does pass the search string to the analyzer, even with wildcards. Using it, however, may need more investigation; the JIRA link above in particular illustrates some of the issues one can run into.
Extending Ola's test with the following tests:
texts = ("*earch", "*h", "search tif", "search",
         "test", "tag", "t*", "search_test",
         "*test*.tif", "search*tif", "s .tif",
         ".tif", "tif", "*tif",
         "s*.tif", "*.tif")
I see the following terms fail for the default QueryParser:
- *.tif
- search*tif
- s*.tif
- *test*.tif
For the new AnalyzingQueryParser:
- *earch
- *h
- search*tif
- s*.tif
- *test*.tif
So, we can get "*.tif" back, but at the cost of "*earch" and "*h". With further investigation, we can probably come up with something that makes each of these cases pass, but other searches may then start to fail.
Possibly related is #1011 which would not use an analyzer at all on some fields like Image.name so that the underscores in "search_test_1.tif" don't get removed.
I'll commit the extended test and the lucene-misc jars and we can discuss further.
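The default QueryParser failures listed above can be reproduced with a toy model. This is only a sketch under stated assumptions, not the OMERO or Lucene code: `analyze` is a hypothetical stand-in for the FullTextAnalyzer that lowercases and splits on non-alphanumerics, wildcard terms skip the analyzer and are compared to single tokens with `fnmatch`, and the indexed name "search_test_1.tif" is the example mentioned above.

```python
import re
from fnmatch import fnmatchcase


def analyze(text):
    """Toy stand-in for the index analyzer: lowercase, split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches(query, indexed_text):
    """Return True if any whitespace-separated term of the query hits a token."""
    tokens = analyze(indexed_text)
    for term in query.split():
        if "*" in term or "?" in term:
            # Wildcard terms bypass the analyzer and are compared to tokens as-is,
            # so punctuation such as "." or "_" can never match.
            if any(fnmatchcase(tok, term.lower()) for tok in tokens):
                return True
        else:
            # Plain terms are analyzed, so punctuation is stripped first.
            if any(piece in tokens for piece in analyze(term)):
                return True
    return False


# Wildcard queries taken from the extended test above.
queries = ["*earch", "*h", "t*", "*test*.tif", "search*tif", "*tif", "s*.tif", "*.tif"]
failing = [q for q in queries if not matches(q, "search_test_1.tif")]
print(failing)
```

Under this model the four queries that fail are the same four listed for the default QueryParser: the index holds the tokens "search", "test", "1" and "tif", and no single token contains a "." or spans two words.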
Update
This issue is apparently not restricted to leading wildcards; it also affects other forms of wildcard searches. Moving to 4.3 for review.
Matching: test-project-a-b-c
=============================================
Query                     Found  Ok?
test                      21     GOOD
test-project              21     GOOD
test\-project             21     GOOD
test-                     21     GOOD
test-project-             21     GOOD
test\-project\-           21     GOOD
test*                     21     GOOD
test-project*             0      FAIL
test\-project*            0      FAIL
test-*                    0      FAIL
test-project-*            0      FAIL
test\-project\-*          0      FAIL
name:test*                21     GOOD
name:test-project*        0      FAIL
name:test\-project*       0      FAIL
name:test*                21     GOOD
name:test name:project    21     GOOD
test project              21     GOOD
test* project*            21     GOOD
test- project-            21     GOOD
test-* project-*          0      FAIL
test-project-a-b-c        21     GOOD
a-b-c                     21     GOOD
a b c                     21     GOOD
t*                        21     GOOD
p*                        21     GOOD
a*                        21     GOOD
b*                        21     GOOD
c*                        21     GOOD
t* p*                     21     GOOD
proj*                     21     GOOD
tes* proj*                21     GOOD
tes*-project              0      FAIL
test-proj*                0      FAIL
Change History (19)
comment:1 Changed 13 years ago by atarkowska
- Component changed from General to Services
- Priority changed from minor to blocker
comment:2 Changed 13 years ago by atarkowska
comment:3 Changed 13 years ago by jmoore
- Description modified (diff)
- Summary changed from BUG: Searching returns no results to BUG: Searching returns no results for leading wildcard term
comment:3 Changed 13 years ago by jmoore
comment:4 Changed 13 years ago by jmoore
comment:6 Changed 13 years ago by atarkowska
I'm not sure if this was known, but in a collaborative group I am able to use "*tif", etc. If I switch to a read-only or private group, no results are returned.
comment:7 Changed 13 years ago by jmoore
comment:8 Changed 13 years ago by jmoore
comment:9 Changed 13 years ago by jburel
- Sprint changed from 2010-10-28 (18) to 2010-11-11 (19)
Moved from sprint 2010-10-28 (18)
comment:10 Changed 13 years ago by jmoore
- Milestone changed from OMERO-Beta4.2.1 to Unscheduled
- Sprint 2010-11-11 (19) deleted
Not making any code modifications to support this for 4.2.1. I've linked this under #2097 (4.2+ search fixes) and am moving to "unscheduled". Hopefully, we will have a large search review after big images.
comment:11 Changed 13 years ago by jmoore
- Description modified (diff)
- Milestone changed from Unscheduled to OMERO-Beta4.3
comment:12 Changed 13 years ago by jmoore
- Description modified (diff)
- Summary changed from BUG: Searching returns no results for leading wildcard term to BUG: Searching returns no results for wildcard searches
comment:13 Changed 13 years ago by jmoore
- Description modified (diff)
comment:14 Changed 13 years ago by jmoore
- Type changed from Task to User Story
This is going to take significant time and in fact may not be fixable in 4.3. Turning into a story.
comment:15 Changed 13 years ago by jmoore
- Milestone changed from OMERO-Beta4.3 to Unscheduled
comment:16 Changed 10 years ago by jamoore
- Cc jrswedlow java@… wmoore added
- Milestone changed from Unscheduled to 5.0.3
- Resolution set to fixed
- Status changed from new to closed
Considering the improvements in wildcard handling in 5.0.3, I'm closing this. Of course, as outlined in https://trello.com/c/INhtQu6q/21-search-tng there are still other improvements that can be made (via the analyzer and n-grams) but the base issues here are taken care of.
comment:17 Changed 10 years ago by pwalczysko
@jamoore Fine with closing this, because principally this works now. The only questions remaining here are the *.svs type queries, i.e. the "wildcard immediately followed by a non-alphanumeric" problem. I understand that this is being hacked in clients atm to give "svs".
comment:18 Changed 10 years ago by dlindner
Yes, "*.svs" effectively does a search for "svs" (it's handled on the server).
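One way a query such as "*.svs" could end up as a search for "svs" is to split wildcard terms on the same characters the analyzer drops, and to discard pieces that are nothing but wildcards. The sketch below only illustrates that idea; `normalize_wildcard_query` is a hypothetical helper, not the server's actual implementation.

```python
import re


def normalize_wildcard_query(query):
    """Hypothetical helper: split terms like an analyzer would, but keep * and ?."""
    pieces = []
    for term in query.lower().split():
        # Break on anything that is neither alphanumeric nor a wildcard character.
        for piece in re.split(r"[^0-9a-z*?]+", term):
            # Skip empty pieces and pieces made only of wildcards,
            # e.g. the lone "*" left over from "*.svs".
            if piece.strip("*?"):
                pieces.append(piece)
    return pieces


print(normalize_wildcard_query("*.svs"))
print(normalize_wildcard_query("test-project*"))
```

With this normalization "test-project*" becomes the two terms "test" and "project*". A term such as "search*tif" is left unchanged, so it would still have to match a single indexed token.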
comment:19 Changed 5 years ago by mtbcarroll
The current "BROKEN" in https://github.com/openmicroscopy/openmicroscopy/blob/v5.5.0-m5/components/tools/OmeroPy/test/integration/test_search.py#L140 may be of interest.
(In [8385]) test, see #3164