Bank Statement Analysis – We've tried to build a tool to categorize the narrations appearing in the bank statements into a logical manner. The bank account statements are attached here for analysis. These are consolidated and extracted bank account statements of various bank accounts. It would be able to categorise similar transactions on the basis of narrations.
Transaciton details in the corporate bank accounts would be read and grouped to find out the inflow and outflow of money.
We can extend our project to a point where it can find recurring, similar, transactions by discovering similarities in the description field of transaction. More data will lead to a better generalization.
We've provided a command line interface to the user in which he can choose one option at a time from the following list of options:
-
Bank/Firm wise analytics
-
Transaction Mode
Visualized heatmap for showing the correlation between account numbers and transaction modes.
- Transfer Type (FDRL)
Visualized pie charts for showing percentage internal and national fund transfer in FDRL.
- Charges/Tax
Visualized pie chart for showing percentage share of each type of charges/cess like tax, gst, income etc.
- Direct Transactions
Visualized bar chart for showing counts of direct transaction for different account numbers.
We've done a one time labelling for transactions in the given data. This result can be used for similar datasets as well.
- A : Bank/firm names
- B : Mode of transactions
- C : Type of transactions in FDRL
- D : Taxes/cess/charges
- E : Direct transactions (account to account)
{
"indiaforensic":"A",
"cashdep":"B",
"fdrl":"A",
"internal":"C",
...
...
"hdfc":"A",
"net":"D",
"epayment":"B",
"loan":"C",
"cess":"D"
}
pip install -r requirements.txt
apoorv@apoorv:~/Desktop$ python final_cmd.py --help
usage: final_cmd.py [-h] [--master_summary MASTER_SUMMARY]
[--mode_of_transfer MODE_OF_TRANSFER]
[--medium_of_transfer MEDIUM_OF_TRANSFER]
[--types_of_tax TYPES_OF_TAX]
[--direct_payments DIRECT_PAYMENTS]
Process some integers.
optional arguments:
-h, --help show this help message and exit
--master_summary MASTER_SUMMARY
Infow and outflow summary for each bank/firm. Add 'm'
to execute this part.
--mode_of_transfer MODE_OF_TRANSFER
Heat map and deposit/withdraw statistics of bank
transactions. Add 'hm' to execute this part.
--medium_of_transfer MEDIUM_OF_TRANSFER
Specifically for FDRL. Add 'mt' to execute this part.
--types_of_tax TYPES_OF_TAX
Plots pie chart for tax charges. Add 'tx' to execute
this part.
--direct_payments DIRECT_PAYMENTS
Plots bar chart for direct payments. Add 'dp' to
execute this part.
Labelling done for this dataset can be extended in classification applications. Basically we can train a model to recognize the text input and associate it with the output categories.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
# some sample data
df=pd.DataFrame([
['MasterCard online Transaction at GOOGLE *TinderPlus', 'dating'],
['CHECKCARD AMAZON AMAZN.COM/BILLWA 125679318', 'shopping'],
['CHECKCARD AMAZON AMAZN.COM/BILLWA 467924720', 'shopping'],
['Visa Fandango.com CA Fandango.com 787879089', 'entertainment'],
['VISA Amazon web services aws.amazon.coWA 12321321', 'work'],
['Mastercard Dr Jimmy Smits DDS', 'medical'],
# ... lots more rows
], columns=['Description','category'])
# convert the input text to something that sklearn can compute on using a TfidfVectorizer
tfidf=TfidfVectorizer()
x_train=tfidf.fit_transform(df.Description)
# we need to also encode the "target" as something the algorithm can handle (numbers)
le=LabelEncoder()
y_train=le.fit_transform(df.category)
# here's the actual ML algorithm
classifier=RandomForestClassifier(n_jobs=-1)
# train the model on your historical data
classifier.fit(x_train.todense(), y_train)
# here's our "new" data that we want to get categories for, we need to treat it the same way
txt_predict=['Amazon web services', 'Harris Teeter', 'Amazon']
x_predict=tfidf.transform(txt_predict)
# do the magic prediction!
predicted=classifier.predict(x_predict.todense())
# predict() output is just a bunch of numbers, we need to turn it back into words
actual_answers=le.inverse_transform(predicted)
print(actual_answers)