Warning: Can't synchronize with repository "(default)" (/home/git/ome.git does not appear to be a Git repository.). Look in the Trac log for more information.
Notice: In order to edit this ticket you need to be either: a Product Owner, The owner or the reporter of the ticket, or, in case of a Task not yet assigned, a team_member"

Task #10338 (closed)

Opened 11 years ago

Closed 11 years ago

Checksum choice on import

Reported by: bpindelski Owned by:
Priority: major Milestone: 5.0.0-beta1
Component: OmeroFs Version: 5.0.0-beta1
Keywords: n.a. Cc: fs@…
Resources: n.a. Referenced By: n.a.
References: n.a. Remaining Time: n.a.
Sprint: FS demo 4.4

Description (last modified by bpindelski)

The user should be able to select a specific checksum algorithm (being aware of the speed/reliability trade-off - that will have to be added to the docs) to be used system-wide. This information has to be passed to the lower layers to init the checksum class with a specific algorithm. The setting should be changeable through the CLI interface.

In a later stage we might want an extra setting of the algorithm for non-image files (i.e. weaker checksum for .doc etc. files).

Change History (27)

comment:1 Changed 11 years ago by jmoore

  • Sprint set to FS Demo 2

comment:2 Changed 11 years ago by jmoore

  • Sprint FS Demo 2 deleted
  • Summary changed from Bug: Checksum choice on import to Checksum choice on import

Blazej, I'm going to be slightly more optimistic and say this isn't a bug but a new feature! :) Also moving out of the demo 2 sprint. Haven't created a demo 3 sprint, but likely we can move it there, no?

comment:3 Changed 11 years ago by bpindelski

Good call, Josh. It is indeed new code that's being added to address this issue.

comment:4 Changed 11 years ago by jmoore

  • Sprint set to FS Demo 3

comment:5 Changed 11 years ago by mtbcarroll

  • Cc mtbcarroll added

comment:6 Changed 11 years ago by mtbcarroll

In making the checksum algorithm client-selectable an obvious issue is how to adjust the API of raw file stores, because they have checksumming built in. Presumably one needn't select an algorithm for reading. At the moment, the checksumming starts on the first write.

For code that shouldn't allow algorithm selection, I'd prefer to hard-code the selection in the caller rather than having raw file stores default to SHA-1 (maybe scripts, user pics?), because then it forces callers to think about the choice and prevents us from accidentally defaulting when we meant to pass in the selection.

Would it be okay to add a setChecksumAlgorithm or suchlike to the raw file store operations: it must be called before the first write? For normal imports, I'd guess that the algorithm can be chosen in ImportSettings and then ManagedImportProcessI or somesuch could set the algorithm on clients' behalf when handing them a store for each file they upload.

comment:7 Changed 11 years ago by mtbcarroll

  • Cc fs@… added; mtbcarroll removed

comment:8 Changed 11 years ago by mtbcarroll

The latest commit at http://github.com/mtbc/openmicroscopy/commits/checksum-selection-server (including the commit message) shows the current state of my work, in case it needs to be picked up and made at least somewhat usable while I am away this week.

comment:9 Changed 11 years ago by jamoore

Mark: since RFS isn't responsible for creating the original files, is there anything wrong with doing the algorithm detection via the OriginalFile instance itself?

comment:10 Changed 11 years ago by mtbcarroll

That sounds like an excellent idea that should reduce the changes and complexity. I will give it a try.

comment:11 Changed 11 years ago by mtbcarroll

https://github.com/openmicroscopy/openmicroscopy/pull/995 includes the model changes for storing checksum choice in OriginalFile.

comment:12 Changed 11 years ago by bpindelski

  • Description modified (diff)

comment:13 Changed 11 years ago by mtbcarroll

Further work on the hash and hasher columns of originalfile is in https://github.com/openmicroscopy/openmicroscopy/pull/1020

comment:14 Changed 11 years ago by mtbcarroll

https://github.com/openmicroscopy/openmicroscopy/pull/1031 includes management of default checksum algorithm. Clients may now pass null to accept the server default.

This raises a few issues for me. The shorter term one is that my previous code did not allow a null so all the import paths I could find had the client explicitly specify SHA1. This now needs to be undone so that those clients' activity is, where appropriate, instead in accordance with the server default. We can test this by setting the default to be MD5, exercising a variety of upload paths, seeing if any SHA1 hashes appear in the originalfile table, and reviewing the instances in which they do.

comment:15 Changed 11 years ago by mtbcarroll

In the longer term I believe that there is an issue of clients that do not have an implementation for the server's default checksum algorithm and are now handed an import process that is demanding to use it. After some consideration of use cases, and tradeoffs between what's easy and what's desirable, I would suggest that the interaction should be that,

  • the server returns to requiring the client to specify the checksum algorithm to create an import process
  • the client is still offered the getChecksumAlgorithms API call to find out which are available, so that it may be sure to select one that is supported server-side.

Either the client asks the server which algorithms are supported, and it chooses one, or it lets the server default apply as follows,

  1. rather than a single default, the server is configured with a list of supported algorithms sorted in order of preference
  2. the client tells the server which algorithms it supports
  3. the server tells the client which algorithm to set (the first in the list that the client supports) in requesting an import process.

comment:16 Changed 11 years ago by Josh Moore <josh@…>

(In [5813fdb75dff3cd32efe2ec7893a0f200966a931/ome.git] on branch develop) Merge pull request #1031 from bpindelski/chksm-setting

Set checksum algorithm for import using bin/omero (see #10338)

comment:17 Changed 11 years ago by jburel

I am still not convinced that the users will be able to understand which algorithm to choice.

  • Having it configurable in the server yes,
  • returning the algo used.
  • Displaying how strong the algo is: maybe
  • Letting the user select: probably not
Version 0, edited 11 years ago by jburel (next)

comment:18 Changed 11 years ago by jamoore

  • Sprint changed from FS Demo 3 to FS Demo 4

Moving to next FS milestone.

comment:19 Changed 11 years ago by mtbcarroll

For when the server default algorithm is changed and clients are happy to accept the server default, some test plan should check where file uploads still result in non-default SHA1 being used.

comment:20 Changed 11 years ago by jburel

Is this still something we wish to pursue for demo 4.x, Great to have but I see that more like a long term goal

Moving it out of 4.x

Last edited 11 years ago by jburel (previous) (diff)

comment:21 Changed 11 years ago by jburel

  • Sprint FS demo 4.x deleted

comment:22 Changed 11 years ago by bpindelski

Looking at the comments in this ticket, I don't think we have a clear strategy on how to move forward with the checksum work. If I understand correctly, the only thing left to do would be to look at implementing the "different checksum per file type" scenario, and that's not a trivial task. I'd say there are more pressing matters, but happy to hear other opinions.

comment:23 Changed 11 years ago by mtbcarroll

One "left to do" thing would plausibly be comment 15; Josh has recently been doing some refactoring that might interact with it.

comment:24 follow-up: Changed 11 years ago by mtbcarroll

I don't know if Josh's refactoring affected this; I don't mind owning comment 15 factored out into a new ticket.

comment:25 in reply to: ↑ 24 Changed 11 years ago by bpindelski

Replying to mtbcarroll:

I don't know if Josh's refactoring affected this; I don't mind owning comment 15 factored out into a new ticket.

If you're ok with creating an extra ticket and associating it with the story, please do. I'm happy to also route some resources back to looking at checksums.

comment:26 Changed 11 years ago by mtbcarroll

Thank you. #10766 and #11202 may now capture all that is worth retaining from this ticket. Does anyone see any other reasons to keep this ticket open?

comment:27 Changed 11 years ago by jamoore

  • Resolution set to fixed
  • Sprint set to FS demo 4.4
  • Status changed from new to closed
  • Version set to 5.0.0-beta1

I don't see any reason not to. Thanks all.

Note: See TracTickets for help on using tickets. You may also have a look at Agilo extensions to the ticket.

1.3.13-PRO © 2008-2011 Agilo Software all rights reserved (this page was served in: 0.71198 sec.)

We're Hiring!