There are actually very few solutions to share big data sets efficiently.
Some research teams use Amazon S3. However, Amazon S3 does no provide a tool to manage a catalog of data sets, and is not suitable for data protected by trade secret (unless it is encrypted).
Most corporations tend to useMicrosoft Azure's data lake technology. However, Microsoft Azure can be fairly expensive and does not satisfy certain trade secret requirements due to the CLOUD Act.
One could also use Github as a data lake thanks to Git Large File Storage. There is even an open source implementation for Gitlab which could be combined with Sheepdog block storage and Nexedi's gitlab acceleration patch (17 times faster) to form a high performance platform.
XXX explain why those solutions are insufficient XXX