
Towards reliable SFTP interfaces

Interfacing two applications through SFTP in a reliable way can be very tricky. This blog post lists possible approaches to reliability and the fundamental principles that must be enforced.
  • Last Update: 2019-03-07
  • Version: 001
  • Language: en

Many companies use the SFTP protocol and an SFTP server (S) to exchange files between two systems and thus interface two applications (A and B).

The interface logic can be one-way:

A -> S -> B

Or can be two-way:

A <--> S <--> B

In most cases we are aware of, this type of interface is unreliable for various reasons. We will try in this article to list various approaches to make a reliable SFTP interface, based on the experience of other reliable interface techniques.

Reliable interfaces

There are numerous ways to create a reliable interface:

  • CFT;
  • exchange table in a transactional database;
  • transactional HTTP API;
  • synchronisation protocol;
  • log-file or graph.

CFT is a well known tool that solves many problems found in file transfer between A and B. It provides the guarantee that:

  • a file is transferred entirely or not transferred;
  • each file sent leaves a trace in a log;
  • each file received leaves a trace in a log.

It is thus possible to ensure that all files sent have been received.

An exchange table in a relational database has similar advantages:

  • data is inserted into the table entirely or not at all (transactional database);
  • the list of inserted data can be queried any time;
  • reception of data can also be acknowledged in a transactional way, by inserting an acknowledgement line.
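The exchange-table pattern above can be sketched with an in-memory SQLite database standing in for the shared transactional store; table and column names are illustrative, not a standard.

```python
import sqlite3

# In-memory SQLite stands in for the shared transactional database.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE message (id INTEGER PRIMARY KEY, sender TEXT, payload TEXT);
    CREATE TABLE acknowledgement (message_id INTEGER REFERENCES message(id));
""")

# A inserts data: thanks to the transaction, it is stored entirely or not at all.
with db:
    db.execute("INSERT INTO message (sender, payload) VALUES (?, ?)",
               ("A", '{"order": 42}'))

# B can list everything ever inserted but not yet acknowledged,
# then acknowledge transactionally by inserting an acknowledgement line.
unacked = db.execute("""
    SELECT m.id FROM message m
    LEFT JOIN acknowledgement a ON a.message_id = m.id
    WHERE a.message_id IS NULL""").fetchall()
with db:
    for (message_id,) in unacked:
        db.execute("INSERT INTO acknowledgement (message_id) VALUES (?)",
                   (message_id,))
```

Because both the message and its acknowledgement are ordinary rows, the full exchange history can be queried and audited at any time.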

A transactional HTTP protocol (ex. Zope transactions) is a way to implement HTTP in an application server so that HTTP requests have the same transactional properties as a database. Data is either transferred entirely or not at all. Each upload leaves a trace in a log file, and so does each download. If the data is in addition self-certified (using a SHA hash or a signature), downloaded data can be checked for consistency.

Synchronisation protocols such as SyncML or jIO, and to some extent Embulk, ensure that all parties share the same view of the same data subset at any time. By running synchronisation recurringly, the states of A and B eventually become consistent, even though synchronisation itself is not transactional.

Protocols such as git, or any kind of shared, read-only log file or graph, ensure that all parties share the same data consistently. All parties can thus verify that the state of each application (A or B) is consistent with the common log file or graph.

All cases we described ensure the integrity and traceability of data transferred by A and B. However, this does not ensure that the interface between A and B is fully consistent. 

Each data D that is transferred should in addition be well formed, for example by matching a JSON schema or an XML DTD. Beyond being well formed, the semantics of the data exchange should be validated using validation functions FA and FB, which compare the state of an application (ex. FA for A) to the history of all data that was transferred:

FA(A(t), D(0), D(1), ..., D(t)) == True

It is mandatory to define the schema and the validation functions FA and FB to ensure that an interface is consistent.
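A hypothetical sketch of such a validation function, assuming for illustration that the state of A is simply the set of order ids it believes were sent, and that each D(t) is a small JSON message:

```python
import json

def validate_schema(d):
    """Well-formedness check, standing in for a JSON Schema or XML DTD."""
    return isinstance(d, dict) and isinstance(d.get("order"), int)

def FA(state_a, history):
    """Return True iff A's state is consistent with the history D(0)..D(t)."""
    if not all(validate_schema(d) for d in history):
        return False
    # Replaying the full transfer history must reproduce A's state.
    return state_a == {d["order"] for d in history}

history = [json.loads(s) for s in ('{"order": 1}', '{"order": 2}')]
assert FA({1, 2}, history)          # A's state matches what was transferred
assert not FA({1, 2, 3}, history)   # order 3 was never transferred by A
```

The second assertion illustrates the common failure described below: A's state claims a transfer that never left a trace in the history.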

Without them, the project teams behind A and B endlessly spend their time trying to prove that the other party is wrong, instead of both having a way to prove that they are right. Schemas and validation functions play the same role in interfaces as unit or functional tests in application development.

The role of validation function FA is essential in particular to prevent a common problem: B never receives data D from A because A did not even transfer it. Without FA, this problem can never be solved.

The problems of SFTP

SFTP servers have a few issues that lead to unreliable interfaces:

  • there is no guarantee that all data was uploaded (by A), because the file might be visible before A has finished uploading it;
  • there is no guarantee that all data was downloaded (by B), because the file might be visible before A has finished uploading it, or because of a network glitch;
  • some SFTP servers erase data after the first download (by B), so there is no way to list all data that was exchanged;
  • there is no way for A to know whether B downloaded a file;
  • there is no way for B to know which files A uploaded if files are erased after download (for example by a rogue user who has the same credentials as B);
  • there is no way to get a full copy of the files exchanged by A and B if the SFTP server erases files after the first download.

There is thus no way to ensure a reliable interface between A and B by using only SFTP without extra protocols.

SFTP is a bit like UDP compared to TCP. It does not guarantee any kind of integrity in communication.

Towards reliable SFTP interfaces

Obviously, the first step towards reliability is to design a well specified interface:

  • define schema for each type of data D which is exchanged;
  • define a function F for each application (A, B) which ensures consistency of the interface on each side.

This step is not specific to SFTP. Without it, it is impossible to ever reach reliability.

Let us now solve problems specific to SFTP.

There is no guarantee that all data was uploaded

This problem can be solved with different techniques:

  • rename each file after upload (ex. day1.upload -> day1.data);
  • upload a hash for each file (ex. day1.sha).
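Both techniques can be sketched together, with a local directory standing in for the SFTP server (with a real SFTP client such as paramiko, put() and rename() would play the same roles):

```python
import hashlib
from pathlib import Path
from tempfile import TemporaryDirectory

def upload(server: Path, name: str, data: bytes) -> None:
    # Write under a temporary name so B never sees a partial file.
    (server / (name + ".upload")).write_bytes(data)
    # Publish the hash sidecar for later verification by B.
    digest = hashlib.sha256(data).hexdigest()
    (server / (name + ".sha")).write_text(digest)
    # Atomically publish the final name only once the upload is complete.
    (server / (name + ".upload")).rename(server / (name + ".data"))

with TemporaryDirectory() as tmp:
    server = Path(tmp)
    upload(server, "day1", b"some payload")
    published = sorted(p.name for p in server.iterdir())
```

The rename is the key step: B only ever processes files matching *.data, which by construction are complete.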

There is no guarantee that all data was downloaded

B can compare SHA(day1.data), computed after the download, with the content of day1.sha uploaded by A.
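The check on B's side is a one-liner; a truncated download changes the digest and is detected immediately:

```python
import hashlib

def download_is_complete(data: bytes, sidecar: str) -> bool:
    """Compare the digest of the downloaded bytes with the .sha sidecar."""
    return hashlib.sha256(data).hexdigest() == sidecar.strip()

payload = b"some payload"
sidecar = hashlib.sha256(payload).hexdigest()  # what A uploaded as day1.sha
assert download_is_complete(payload, sidecar)
assert not download_is_complete(payload[:-1], sidecar)  # truncated download
```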

There is no way to list all data that was exchanged

Solution 1: remove the "erase after download" feature of the SFTP server.

Solution 2: add to the interface

  • a "message list query" that lets B query A the list of files which A ever sent.

Solution 3:

  • A sends every day to B the list of all files that were ever uploaded
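Solution 3 amounts to a reconciliation step on B's side: compare A's daily manifest with the files B actually holds. A minimal sketch, with an illustrative one-file-name-per-line manifest format:

```python
def missing_files(manifest: str, downloaded: set[str]) -> set[str]:
    """Files A claims to have uploaded but B never received."""
    sent = {line.strip() for line in manifest.splitlines() if line.strip()}
    return sent - downloaded

# A's daily manifest of everything it ever uploaded.
manifest = "day1.data\nday2.data\nday3.data\n"

# B detects that day2.data was lost (e.g. erased before download).
assert missing_files(manifest, {"day1.data", "day3.data"}) == {"day2.data"}
```

The same function, with the arguments swapped around, covers the symmetric case below where B reports to A what it downloaded.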

There is no way for A to know if B downloaded the file

Solution 1: rename each file after download (ex. day1.data -> day1.received)

Solution 2: add to the interface

  • a "message received query" that lets A ask B for the list of files B ever downloaded.

Solution 3:

  • B sends every day to A the list of all files that were ever downloaded

There is no way to get a full copy of files exchanged by A and B

Solution 1: remove the "erase after download" feature of the SFTP server.

Solution 2: add to the interface

  • a "message resend query" that lets B request A to resend a message.

Additional thoughts

Defining a schema for Excel files or for COBOL-style data can be difficult. Yet, it is not impossible and it is mandatory.

Here are some suggestions of approaches:

  • convert Excel to XML and use a DTD or XSD;
  • use a COBOL-to-XML or COBOL-to-JSON conversion utility (ex. the xfw tool by Nexedi, used in many applications).

Regarding the validation function, it is implemented in ERP5 in a simple way: we store all files we receive and all files we send in dedicated modules, and each document that relies on them (ex. a Packing List) has an aggregate relation to the corresponding file. This way, we can ensure that ERP5 received everything, and ERP5 can also resend everything. We even keep a trace of all access logs to the SFTP server, so that we can prove, for example, that an SFTP directory was empty at a given date.

On systems that do not keep a trace of what they send or receive, the validation function needs to be developed. Yet, without it, there is no way to ensure that the interface works reliably.

Some people might consider blockchains as a kind of log file or graph. Actually, only the Merkle tree is really useful here to ensure that data is consistent and equal on both sides, not the proof of work. Blockchains without proof of work are actually very similar to git; it is thus easier to use git.