id summary reporter owner description type status priority milestone component resolution keywords cc rd_points sprint story_priority
10335 FS checksums for data integrity bpindelski "This story captures all tickets related to the work needed to put in place a checksum system that guarantees data integrity during an FS upload of a file.

=== Goal ===

To guarantee file integrity by calculating checksums during and after upload, and to offer the user a choice between checksum speed and security. This requirement does not supersede the integrity guaranteed on the transport layer of the OSI model, but enhances it.

=== Proposed workflow ===

During upload of a single image, a checksum of a specific type (low/medium/high security) is computed for each element of the image from the byte content of the file being uploaded. The checksum type and value are attached to the file before upload and transmitted together with the data. On the receiving side, the server reads the checksum type and calculates the value using the same algorithm. If the checksums match, the image is considered valid.

Exceptions:
 - checksum mismatch after upload: image considered invalid, error returned to client, the user can stop the import process,
 - checksum calculation fails client-side due to algorithm error: import process stops automatically,
 - checksum calculation fails server-side due to algorithm error: error returned to client, the user can stop the import process,
 - checksum fails on corrupted file after n-th round of verification: error returned to client,
 - checksum capability mismatch between client and server: lowest common denominator chosen, user informed about the algorithm chosen.

=== Implementation context ===

The checksum has to be calculated “on-the-fly” to avoid duplicating file I/O operations. Error correction is not a requirement at the current stage. External libraries can be used (with the caveat of supporting JDK 5).
The transmission medium does not influence the choice of checksum algorithm, but in the future the file type might. The other language bindings (Python, C++) have to be considered during the design stage: clients will want to work with a unified checksum naming scheme, and implementations of the checksum algorithms must be available in both Python and C++.

=== Proposed algorithms ===

Ordered by computational cost:
 - Adler-32,
 - CRC-32,
 - MD5,
 - Murmur hash,
 - SHA1." story closed critical 5.x OmeroFs fixed omero-team@…
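
The workflow proposed in this ticket could be sketched in Python roughly as follows. This is only an illustration using the standard library, not the OMERO implementation: the names `ALGORITHMS`, `upload_with_checksum`, `verify` and `negotiate` are hypothetical, the algorithm labels stand in for whatever unified naming scheme is eventually agreed, and Murmur hash is left out because the Python standard library does not provide it. The `negotiate` helper assumes that the most expensive algorithm supported by both sides is the one wanted; the ticket itself does not specify how the common algorithm is picked.

```python
import hashlib
import io
import zlib

CHUNK_SIZE = 64 * 1024  # assumed chunk size, not taken from the ticket


class _ZlibChecksum:
    """Gives zlib's running checksums a hashlib-like update/hexdigest API."""

    def __init__(self, func, start):
        self._func = func
        self._value = start

    def update(self, data):
        # zlib.adler32 and zlib.crc32 accept the previous value to continue from
        self._value = self._func(data, self._value)

    def hexdigest(self):
        return format(self._value & 0xFFFFFFFF, '08x')


# Hypothetical unified names, in the order given in the ticket (Murmur omitted).
ALGORITHMS = {
    'Adler-32': lambda: _ZlibChecksum(zlib.adler32, 1),
    'CRC-32': lambda: _ZlibChecksum(zlib.crc32, 0),
    'MD5': hashlib.md5,
    'SHA1': hashlib.sha1,
}


def checksum_stream(stream, algorithm, sink=None):
    """Read the stream once, hashing each chunk and optionally forwarding it."""
    hasher = ALGORITHMS[algorithm]()
    while True:
        chunk = stream.read(CHUNK_SIZE)
        if not chunk:
            break
        hasher.update(chunk)
        if sink is not None:
            sink.write(chunk)
    return hasher.hexdigest()


def upload_with_checksum(source, sink, algorithm):
    """Client side: send the data and return (checksum type, checksum value)."""
    return algorithm, checksum_stream(source, algorithm, sink)


def verify(received, algorithm, expected):
    """Server side: recompute with the same algorithm and compare."""
    return checksum_stream(received, algorithm) == expected


def negotiate(client_supported, server_supported):
    """Assumption: pick the most expensive algorithm both sides support."""
    common = [name for name in ALGORITHMS
              if name in client_supported and name in server_supported]
    if not common:
        raise ValueError('no checksum algorithm in common')
    return common[-1]
```

A client would call `negotiate` first, pass the result to `upload_with_checksum`, and send the returned type and value along with the data; a `False` result from `verify` on the server corresponds to the checksum mismatch case listed under Exceptions. Because the file is hashed chunk by chunk as it is forwarded, it is read only once.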