Bug #1010 (new)
Opened 16 years ago
Last modified 16 years ago
StandardAnalyzer from Lucene does not parse appropriately for our application — at Initial Version
Reported by: jamoore
Owned by: jamoore
Priority: blocker
Cc: cxallan, jburel, wmoore
Sprint: n.a.
Total Remaining Time: n.a.
Description
After difficulties finding images with simple queries, it was decided that we need our own analyzer for milestone:3.0-Beta3.
A filename like "csfv-gfp01_1_r3d_d3d" was not findable with a search for "gfp*", since the StandardAnalyzer parsed the whole string to a single token of type=<NUM>.
Print tokening: foo bar [(foo,0,3,type=<ALPHANUM>), (bar,4,7,type=<ALPHANUM>)]
Print tokening: foo/bar [(foo,0,3,type=<ALPHANUM>), (bar,4,7,type=<ALPHANUM>)]
Print tokening: foo-bar [(foo,0,3,type=<ALPHANUM>), (bar,4,7,type=<ALPHANUM>)]
Print tokening: foo_bar [(foo,0,3,type=<ALPHANUM>), (bar,4,7,type=<ALPHANUM>)]
Print tokening: foo.bar [(foo.bar,0,7,type=<HOST>)]
Print tokening: 26.8.06-antiCSFV/CSFV-GFP/CSFV-GFP01_1_R3D_D3D.dv [(26.8.06-anticsfv,0,16,type=<NUM>), (csfv,17,21,type=<ALPHANUM>), (gfp,22,25,type=<ALPHANUM>), (csfv-gfp01_1_r3d_d3d,26,46,type=<NUM>), (dv,47,49,type=<ALPHANUM>)]
Print tokening: ...FRAP-23.8.05/IAGFP-Noc01_R3D.dv [(frap-23.8.05/iagfp-noc01_r3d,3,31,type=<NUM>), (dv,32,34,type=<ALPHANUM>)]
Print tokening: ...FRAP-23.8.05/IAGFP-Noc02_R3D.dv [(frap-23.8.05/iagfp-noc02_r3d,3,31,type=<NUM>), (dv,32,34,type=<ALPHANUM>)]
Print tokening: ...FRAP-23.8.05/IAGFP-Noc03_R3D.dv [(frap-23.8.05/iagfp-noc03_r3d,3,31,type=<NUM>), (dv,32,34,type=<ALPHANUM>)]
Print tokening: ...FRAP-23.8.05/IAGFP-Noc04_R3D.dv [(frap-23.8.05/iagfp-noc04_r3d,3,31,type=<NUM>), (dv,32,34,type=<ALPHANUM>)]
Print tokening: ...FRAP-23.8.05/IAGFP-Noc05_R3D.dv [(frap-23.8.05/iagfp-noc05_r3d,3,31,type=<NUM>), (dv,32,34,type=<ALPHANUM>)]
Print tokening: will/Desktop/CSFV-GFP01_2_R3D_D3D.dv [(desktop,5,12,type=<ALPHANUM>), (csfv-gfp01_2_r3d_d3d,13,33,type=<NUM>), (dv,34,36,type=<ALPHANUM>)]
Print tokening: will/Desktop/CSFV-GFP01_4_R3D_D3D.dv [(desktop,5,12,type=<ALPHANUM>), (csfv-gfp01_4_r3d_d3d,13,33,type=<NUM>), (dv,34,36,type=<ALPHANUM>)]
Print tokening: will/Desktop/CSFV-GFP01_5_R3D_D3D.dv [(desktop,5,12,type=<ALPHANUM>), (csfv-gfp01_5_r3d_d3d,13,33,type=<NUM>), (dv,34,36,type=<ALPHANUM>)]
Print tokening: will/Desktop/CSFV-GFP01_1_R3D_D3D.dv [(desktop,5,12,type=<ALPHANUM>), (csfv-gfp01_1_r3d_d3d,13,33,type=<NUM>), (dv,34,36,type=<ALPHANUM>)]
Print tokening: will/Desktop/CSFV-GFP01_3_R3D_D3D.dv [(desktop,5,12,type=<ALPHANUM>), (csfv-gfp01_3_r3d_d3d,13,33,type=<NUM>), (dv,34,36,type=<ALPHANUM>)]
Print tokening: Documents/biology-data/CSFV-GFP01_2_R3D_D3D.dv [(documents,0,9,type=<ALPHANUM>), (biology,10,17,type=<ALPHANUM>), (data,18,22,type=<ALPHANUM>), (csfv-gfp01_2_r3d_d3d,23,43,type=<NUM>), (dv,44,46,type=<ALPHANUM>)]
Print tokening: Documents/biology-data/CSFV-GFP01_4_R3D_D3D.dv [(documents,0,9,type=<ALPHANUM>), (biology,10,17,type=<ALPHANUM>), (data,18,22,type=<ALPHANUM>), (csfv-gfp01_4_r3d_d3d,23,43,type=<NUM>), (dv,44,46,type=<ALPHANUM>)]
Print tokening: Documents/biology-data/CSFV-GFP01_5_R3D_D3D.dv [(documents,0,9,type=<ALPHANUM>), (biology,10,17,type=<ALPHANUM>), (data,18,22,type=<ALPHANUM>), (csfv-gfp01_5_r3d_d3d,23,43,type=<NUM>), (dv,44,46,type=<ALPHANUM>)]
Print tokening: Documents/biology-data/CSFV-GFP01_1_R3D_D3D.dv [(documents,0,9,type=<ALPHANUM>), (biology,10,17,type=<ALPHANUM>), (data,18,22,type=<ALPHANUM>), (csfv-gfp01_1_r3d_d3d,23,43,type=<NUM>), (dv,44,46,type=<ALPHANUM>)]
Print tokening: Documents/biology-data/CSFV-GFP01_3_R3D_D3D.dv [(documents,0,9,type=<ALPHANUM>), (biology,10,17,type=<ALPHANUM>), (data,18,22,type=<ALPHANUM>), (csfv-gfp01_3_r3d_d3d,23,43,type=<NUM>), (dv,44,46,type=<ALPHANUM>)]
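The failure mode in the output above can be reproduced without Lucene at all: the analyzer emits the whole filename as one token, so a prefix query on "gfp" has nothing to match. The sketch below is plain stdlib Java (the `splitOnPunctuation` helper is hypothetical, not a Lucene API); it contrasts the single token we actually get with the tokens we would get if every non-alphanumeric character were a boundary.

```java
import java.util.Arrays;
import java.util.List;

public class TokenDemo {
    // Hypothetical helper: split the way we would like filenames to be
    // tokenized -- at every run of non-alphanumeric characters.
    static List<String> splitOnPunctuation(String text) {
        return Arrays.asList(text.toLowerCase().split("[^a-z0-9]+"));
    }

    public static void main(String[] args) {
        // StandardAnalyzer emitted the whole filename as one <NUM> token,
        // so a prefix query "gfp*" has nothing to match against:
        String singleToken = "csfv-gfp01_1_r3d_d3d";
        System.out.println(singleToken.startsWith("gfp")); // false

        // Splitting at underscores and hyphens yields a "gfp01" token
        // that the same prefix query would hit:
        List<String> tokens = splitOnPunctuation(singleToken);
        System.out.println(tokens); // [csfv, gfp01, 1, r3d, d3d]
    }
}
```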
From the javadocs:
A grammar-based tokenizer constructed with JFlex.

This should be a good tokenizer for most European-language documents:

- Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token.
- Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
- Recognizes email addresses and internet hostnames as one token.

Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer.
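Following the javadoc's advice to maintain our own tokenizer, here is a minimal sketch of the splitting rule such a grammar could encode: treat every character that is not a letter or digit as a token boundary, with no special casing of hyphens next to numbers (the "product number" rule that bites us above). This is plain stdlib Java with an illustrative class name, not a drop-in Lucene Tokenizer; a real implementation would wrap the same loop in Lucene's analyzer machinery.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of a filename-friendly tokenizer:
// every non-alphanumeric character ends the current token.
public class FilenameTokenizer {
    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : input.toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

With this rule, "CSFV-GFP01_1_R3D_D3D.dv" tokenizes to [csfv, gfp01, 1, r3d, d3d, dv], so a "gfp*" prefix query matches the gfp01 token.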