
Synthesizing series of transactions with a Generative Adversarial Network
At our company we have developed a system, and done research on methodologies, to synthesize tabular data, in particular bank transactions from customer accounts. Our current solution is based on deep neural networks and has three main components, which use techniques such as Generative Adversarial Networks and Long Short-Term Memory cells. The solution can be used either to synthesize tabular data treating the rows (transactions) as independent (see our previous blog post on the topic), or to synthesize series of transactions. In the first case, the model generates data that reproduce the histograms of the individual columns and the relationships between them. In the second case, the temporal aspects must also be considered.
In this post I describe an exercise aimed at synthesizing complete series of transactions with a single model. For an introduction to the utility of synthetic data in various scenarios, see our previous blog posts (see references).
Bank transaction data set
- transaction type. Defined by three columns: type, which mainly indicates whether the transaction is a debit or a credit and takes three categorical values; operation, a categorical column with five values; and k-symbol, a categorical column with eight values. Both operation and k-symbol contain missing values, which were treated as an additional category. All the categorical columns were one-hot encoded.
- amount. Positive real number. During pre-processing it was transformed to logarithmic scale and scaled to the range [-1, 1].
- balance. Real number. Like amount, it was transformed to logarithmic scale and scaled to the range [-1, 1].
- date. Date object with a resolution of one day, indicating when each transaction takes place. In this exercise the date was encoded during pre-processing using the following strategy: extra transaction types that mark the start of a new month were inserted into the series (these additional entries are not associated with any real transaction), and the day of the month of each transaction was treated as a float and scaled to the [-1, 1] range. The date was reconstructed during post-processing from a given start date, the marks indicating the beginning of a new month, and the day of the month associated with each synthetic transaction.
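To make the pre-processing concrete, here is a minimal sketch of the amount scaling and the date encoding described above. The function names, the `MONTH_MARK` token, and the `max_log` bound are illustrative assumptions, not the original code.

```python
import math

# Hypothetical token used for the inserted "start of a new month" entries.
MONTH_MARK = "<new_month>"

def scale_amount(amount, max_log=15.0):
    """Map a positive amount to logarithmic scale, then to [-1, 1].
    max_log is an assumed upper bound on log1p(amount) for this data set."""
    return 2.0 * math.log1p(amount) / max_log - 1.0

def unscale_amount(x, max_log=15.0):
    """Invert scale_amount (used during post-processing)."""
    return math.expm1((x + 1.0) * max_log / 2.0)

def encode_days(transactions):
    """Insert a month marker whenever the month changes, and scale the
    day of the month to [-1, 1]. Each transaction is (year, month, day, type);
    markers carry no day value."""
    encoded, prev_month = [], None
    for year, month, day, ttype in transactions:
        if (year, month) != prev_month:
            encoded.append((MONTH_MARK, None))
            prev_month = (year, month)
        encoded.append((ttype, 2.0 * (day - 1) / 30.0 - 1.0))
    return encoded
```

During post-processing the procedure runs in reverse: starting from a given date, each `MONTH_MARK` advances the month, and the scaled day value is mapped back to a day of the month.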
Model architecture
From our work with deep learning models for synthesizing tabular data, we have learned that Generative Adversarial Networks are the main path to focus on. Our previous experience also indicates that training such models is complex, and that using the Wasserstein loss with gradient penalty (WGAN-GP) eases the task. Finally, another useful ingredient for a model capable of producing series of transactions is recurrent capacity.
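As a reminder of what the WGAN-GP objective looks like, here is a toy numpy sketch of the critic loss. It uses a deliberately simple linear critic f(x) = x @ w, whose input gradient is w everywhere, so the gradient penalty can be written in closed form; a real model would compute the gradient at the interpolated points via automatic differentiation. This is an illustration of the loss, not our production code.

```python
import numpy as np

def wgan_gp_critic_loss(w, real, fake, lam=10.0, rng=None):
    """WGAN-GP critic loss for a toy linear critic f(x) = x @ w.

    Wasserstein term: the critic should score real samples higher than
    fake ones. Gradient penalty: the critic's input gradient should have
    unit norm at points interpolated between real and fake samples."""
    rng = rng or np.random.default_rng(0)
    wasserstein = np.mean(fake @ w) - np.mean(real @ w)
    # Interpolated points (shown for completeness; for a linear critic the
    # gradient is the constant vector w at every point).
    eps = rng.uniform(size=(real.shape[0], 1))
    interp = eps * real + (1.0 - eps) * fake
    grad_norm = np.linalg.norm(w)  # ||grad_x f(interp)|| for linear f
    penalty = lam * (grad_norm - 1.0) ** 2
    return wasserstein + penalty
```

With a unit-norm w the penalty vanishes and only the Wasserstein term remains; as ||w|| drifts away from 1, the penalty grows quadratically, which is what keeps the critic approximately 1-Lipschitz during training.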
[Figure: model architecture. Image from Zinan Lin et al. 2019]
Results
The next figure shows the correlation between the amount and the balance. It must be said that many of the features in the figure stem from the tight relationship between the amount and the type of transaction, which means that the network has learned the correct amount range for each type. In any case, there are also correlations within each transaction type that are reproduced. On the other hand, with the current parameters, the network cannot properly synthesize the constant low amounts that are independent of the balance for certain types, although the magnitude of the operation is well captured in this case too.
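A simple way to quantify this kind of comparison is to compute the amount-balance correlation within each transaction type, for the real and the synthetic data separately, and compare the two. The following helper is an illustrative sketch (the function name and array layout are assumptions):

```python
import numpy as np

def per_type_correlation(amount, balance, ttype):
    """Pearson correlation between amount and balance within each
    transaction type. Types with fewer than two samples are skipped."""
    out = {}
    for t in np.unique(ttype):
        mask = ttype == t
        if mask.sum() > 1:
            out[t] = float(np.corrcoef(amount[mask], balance[mask])[0, 1])
    return out
```

Running this on both the real and the synthesized series and comparing the resulting dictionaries gives a per-type summary of how well the in-type correlations are reproduced.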
Temporal aspects
A final aspect to consider is how well the model reproduces when each transaction takes place. The following figures show the histograms of the day of the month for the different transaction types. The fraction of each type in the real and synthesized data is indicated in each panel. One can observe that for the more frequent types the day is well reproduced, whereas it is not as well captured for the less represented types.
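The per-panel quantities above can be sketched as follows: a normalized day-of-month histogram per transaction type, the fraction of the data that each type represents, and a distance (here total variation, one possible choice) for comparing the real and synthetic histograms. Function names are illustrative assumptions.

```python
import numpy as np

def day_histograms(day, ttype, bins=31):
    """Per transaction type: (normalized day-of-month histogram, fraction
    of the data belonging to that type)."""
    hists = {}
    for t in np.unique(ttype):
        mask = ttype == t
        counts, _ = np.histogram(day[mask], bins=bins, range=(1, bins + 1))
        hists[t] = (counts / counts.sum(), mask.mean())
    return hists

def total_variation(p, q):
    """Total variation distance between two normalized histograms."""
    return 0.5 * float(np.abs(p - q).sum())
```

Comparing `day_histograms` of the real and synthetic series type by type makes the observation above measurable: frequent types should show a small distance, rare types a larger one.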
Conclusions
The goal of this exercise was to test the potential of a single deep neural network for synthesizing complete series of banking transactions. This post illustrates the results: with the right architecture and recurrent capacity, it is possible to build a single model that performs well on the data used.
It is worth mentioning that, compared with the data set used here, present-day bank transactions are significantly richer, both in the information available for each entry and in the frequency and volume of operations. These factors may make training a single model more challenging. For example, a single model may fail to capture specific details that turn out to be important, and modifying the training strategy to improve some aspects of the synthesis may have unwanted effects elsewhere in the model. For these reasons, when working with more complex data sets, it can be beneficial to divide the system into components, each focused on reproducing a different dimension of the data.
As a final comment, various circumstances can prevent the use or sharing of real data. In these situations, having systems capable of generating realistic data on demand can be crucial. Synthetic data can prove helpful in testing, software development, or even in the early stages of a machine learning solution. If you are interested in any of the topics discussed, please check out the AI Team offering or contact us.
References
- Boogie Software AI Services
- Boogie Software Data Synthesizer Solution
- Deep learning method for synthesis of tabular data
- Synthesizing Demographic Data
- Generating High-fidelity, Synthetic Time Series Datasets with DoppelGANger, Zinan Lin et al. 2019, arXiv:1909.13403
- The open bank data used in this article can be found at https://sorry.vse.cz/~berka/challenge/