
Bug #1010 (closed)

Opened 16 years ago

Closed 16 years ago

StandardAnalyzer from Lucene does not parse appropriately for our application

Reported by: jamoore Owned by: jamoore
Priority: blocker Cc: cxallan, jburel, wmoore
Sprint: n.a.
Total Remaining Time: n.a.

Description (last modified by jmoore)

After difficulties finding images with simple queries, it was decided that we need our own analyzer for milestone:3.0-Beta3.

A filename like "csfv-gfp01_1_r3d_d3d" could not be found with a search for "gfp*" because the StandardAnalyzer parsed the whole name into a single token of type=<NUM>.

Print tokening: foo bar
[(foo,0,3,type=<ALPHANUM>), (bar,4,7,type=<ALPHANUM>)]
Print tokening: foo/bar
[(foo,0,3,type=<ALPHANUM>), (bar,4,7,type=<ALPHANUM>)]
Print tokening: foo-bar
[(foo,0,3,type=<ALPHANUM>), (bar,4,7,type=<ALPHANUM>)]
Print tokening: foo_bar
[(foo,0,3,type=<ALPHANUM>), (bar,4,7,type=<ALPHANUM>)]
Print tokening: foo.bar
[(foo.bar,0,7,type=<HOST>)]
Print tokening: 26.8.06-antiCSFV/CSFV-GFP/CSFV-GFP01_1_R3D_D3D.dv
[(26.8.06-anticsfv,0,16,type=<NUM>), (csfv,17,21,type=<ALPHANUM>), (gfp,22,25,type=<ALPHANUM>), (csfv-gfp01_1_r3d_d3d,26,46,type=<NUM>), (dv,47,49,type=<ALPHANUM>)]
Print tokening: ...FRAP-23.8.05/IAGFP-Noc01_R3D.dv
[(frap-23.8.05/iagfp-noc01_r3d,3,31,type=<NUM>), (dv,32,34,type=<ALPHANUM>)]
Print tokening: ...FRAP-23.8.05/IAGFP-Noc02_R3D.dv
[(frap-23.8.05/iagfp-noc02_r3d,3,31,type=<NUM>), (dv,32,34,type=<ALPHANUM>)]
Print tokening: ...FRAP-23.8.05/IAGFP-Noc03_R3D.dv
[(frap-23.8.05/iagfp-noc03_r3d,3,31,type=<NUM>), (dv,32,34,type=<ALPHANUM>)]
Print tokening: ...FRAP-23.8.05/IAGFP-Noc04_R3D.dv
[(frap-23.8.05/iagfp-noc04_r3d,3,31,type=<NUM>), (dv,32,34,type=<ALPHANUM>)]
Print tokening: ...FRAP-23.8.05/IAGFP-Noc05_R3D.dv
[(frap-23.8.05/iagfp-noc05_r3d,3,31,type=<NUM>), (dv,32,34,type=<ALPHANUM>)]
Print tokening: will/Desktop/CSFV-GFP01_2_R3D_D3D.dv
[(desktop,5,12,type=<ALPHANUM>), (csfv-gfp01_2_r3d_d3d,13,33,type=<NUM>), (dv,34,36,type=<ALPHANUM>)]
Print tokening: will/Desktop/CSFV-GFP01_4_R3D_D3D.dv
[(desktop,5,12,type=<ALPHANUM>), (csfv-gfp01_4_r3d_d3d,13,33,type=<NUM>), (dv,34,36,type=<ALPHANUM>)]
Print tokening: will/Desktop/CSFV-GFP01_5_R3D_D3D.dv
[(desktop,5,12,type=<ALPHANUM>), (csfv-gfp01_5_r3d_d3d,13,33,type=<NUM>), (dv,34,36,type=<ALPHANUM>)]
Print tokening: will/Desktop/CSFV-GFP01_1_R3D_D3D.dv
[(desktop,5,12,type=<ALPHANUM>), (csfv-gfp01_1_r3d_d3d,13,33,type=<NUM>), (dv,34,36,type=<ALPHANUM>)]
Print tokening: will/Desktop/CSFV-GFP01_3_R3D_D3D.dv
[(desktop,5,12,type=<ALPHANUM>), (csfv-gfp01_3_r3d_d3d,13,33,type=<NUM>), (dv,34,36,type=<ALPHANUM>)]
Print tokening: will/Desktop/CSFV-GFP01_3_R3D_D3D.dv
[(desktop,5,12,type=<ALPHANUM>), (csfv-gfp01_3_r3d_d3d,13,33,type=<NUM>), (dv,34,36,type=<ALPHANUM>)]
Print tokening: Documents/biology-data/CSFV-GFP01_2_R3D_D3D.dv
[(documents,0,9,type=<ALPHANUM>), (biology,10,17,type=<ALPHANUM>), (data,18,22,type=<ALPHANUM>), (csfv-gfp01_2_r3d_d3d,23,43,type=<NUM>), (dv,44,46,type=<ALPHANUM>)]
Print tokening: Documents/biology-data/CSFV-GFP01_4_R3D_D3D.dv
[(documents,0,9,type=<ALPHANUM>), (biology,10,17,type=<ALPHANUM>), (data,18,22,type=<ALPHANUM>), (csfv-gfp01_4_r3d_d3d,23,43,type=<NUM>), (dv,44,46,type=<ALPHANUM>)]
Print tokening: Documents/biology-data/CSFV-GFP01_5_R3D_D3D.dv
[(documents,0,9,type=<ALPHANUM>), (biology,10,17,type=<ALPHANUM>), (data,18,22,type=<ALPHANUM>), (csfv-gfp01_5_r3d_d3d,23,43,type=<NUM>), (dv,44,46,type=<ALPHANUM>)]
Print tokening: Documents/biology-data/CSFV-GFP01_1_R3D_D3D.dv
[(documents,0,9,type=<ALPHANUM>), (biology,10,17,type=<ALPHANUM>), (data,18,22,type=<ALPHANUM>), (csfv-gfp01_1_r3d_d3d,23,43,type=<NUM>), (dv,44,46,type=<ALPHANUM>)]
Print tokening: Documents/biology-data/CSFV-GFP01_3_R3D_D3D.dv
[(documents,0,9,type=<ALPHANUM>), (biology,10,17,type=<ALPHANUM>), (data,18,22,type=<ALPHANUM>), (csfv-gfp01_3_r3d_d3d,23,43,type=<NUM>), (dv,44,46,type=<ALPHANUM>)]
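
To illustrate why the single <NUM> token defeats a "gfp*" search, here is a minimal Python sketch. It is not Lucene code and not the implementation from r2487; it assumes a hypothetical analyzer that lowercases the input and splits at every character that is not an ASCII letter or digit. The names tokenize and matches_prefix are invented for this example.

```python
import re


def tokenize(text):
    """Lowercase the text and split at every run of non-alphanumeric characters.

    Assumption: ASCII-only splitting, chosen purely for illustration.
    """
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


def matches_prefix(tokens, query):
    """Emulate a trailing-wildcard query such as 'gfp*' against a token list."""
    prefix = query.rstrip("*").lower()
    return any(t.startswith(prefix) for t in tokens)


# The single token reported in the output above for the StandardAnalyzer.
standard_tokens = ["csfv-gfp01_1_r3d_d3d"]
# Tokens produced by the hypothetical splitting analyzer.
split_tokens = tokenize("CSFV-GFP01_1_R3D_D3D")

print(split_tokens)
print(matches_prefix(standard_tokens, "gfp*"))
print(matches_prefix(split_tokens, "gfp*"))
```

With the name kept as one token, no token starts with "gfp", so the prefix query finds nothing. Once the name is split at "-" and "_", the token "gfp01" exists and the same query matches.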

From the javadocs:

* A grammar-based tokenizer constructed with JFlex
*
* <p> This should be a good tokenizer for most European-language documents:
*
* <ul>
*   <li>Splits words at punctuation characters, removing punctuation. However, a 
*     dot that's not followed by whitespace is considered part of a token.
*   <li>Splits words at hyphens, unless there's a number in the token, in which case
*     the whole token is interpreted as a product number and is not split.
*   <li>Recognizes email addresses and internet hostnames as one token.
* </ul>
*
* <p>Many applications have specific tokenizer needs.  If this tokenizer does
* not suit your application, please consider copying this source code
* directory to your project and maintaining your own grammar-based tokenizer.
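
The hyphen rule quoted above can be sketched in a few lines of Python. This is a deliberately simplified model of that one sentence of the javadoc, not the JFlex grammar itself, which also treats other punctuation such as "_", "/" and "." specially. The function name split_hyphens is hypothetical.

```python
def split_hyphens(word):
    """Split at hyphens unless the word contains a digit.

    Simplified reading of the javadoc: a hyphenated word with a number in it
    is treated as a product number and kept whole.
    """
    if any(ch.isdigit() for ch in word):
        return [word]
    return [part for part in word.split("-") if part]


print(split_hyphens("foo-bar"))
print(split_hyphens("csfv-gfp01"))
```

Under this rule "foo-bar" splits into two words while "csfv-gfp01" stays whole, which matches the pattern in the output above where "biology-data" is split but the image names containing digits are not.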

Change History (3)

comment:1 Changed 16 years ago by jmoore

  • Status changed from new to assigned

r2487 has an initial implementation for testing.

comment:2 Changed 16 years ago by jmoore

  • Description modified (diff)

comment:3 Changed 16 years ago by jmoore

  • Resolution set to fixed
  • Status changed from assigned to closed

Still not optimal, but working for most queries. Will need further evaluation (and tickets) in milestone:3.0-Beta4.
