Fix slight variations in threaded float operations
Operations in the Neon backend have both a vector path and a scalar path. The vector path processes most of the data, and the scalar path processes the leftover elements that do not fill a whole vector. For floating-point operations in particular, the two paths may produce very slightly different results. With multithreading, an image is divided into parts, one per thread, and the way it is divided determines which elements end up on the vector path and which on the scalar path. Since the threading may be non-deterministic in how it divides up the image, this non-determinism could leak into the output values: the same input could give slightly different results from one run to the next. This could cause subtle bugs.
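
The effect of the split on path selection can be illustrated with a small sketch. This is not the backend's code: the function names, the vector width of 4 (four float32 lanes in a 128-bit register), and the chunk boundaries are all assumptions chosen for illustration. It only computes which element indices would fall to the scalar path for a given division of the work.

```python
VECTOR_WIDTH = 4  # assumption: four float32 lanes per 128-bit register


def scalar_tail_indices(start, end, width=VECTOR_WIDTH):
    """Return the indices in [start, end) left over for the scalar path."""
    # The vector path covers as many whole vectors as fit in the chunk.
    vector_end = start + ((end - start) // width) * width
    return list(range(vector_end, end))


def scalar_indices_for_split(boundaries, width=VECTOR_WIDTH):
    """Collect scalar-path indices across chunks [b0, b1), [b1, b2), ..."""
    indices = []
    for start, end in zip(boundaries, boundaries[1:]):
        indices.extend(scalar_tail_indices(start, end, width))
    return indices


# Ten elements divided between two hypothetical threads in two ways.
split_a = scalar_indices_for_split([0, 5, 10])
split_b = scalar_indices_for_split([0, 6, 10])
print(split_a)  # [4, 9]
print(split_b)  # [4, 5]
```

Elements 5 and 9 switch paths between the two splits, so if the vector and scalar paths round differently, those output values can differ depending on how the work happened to be divided.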