okfn-brasil/serenata-toolbox

Add `if` statement to avoid dropping 'Flight ticket issue' expenses

Closed this issue · 14 comments

@jtemporal figured out why the dataset is missing subquota 999, 'Flight ticket issue' (see #106). According to her findings:

What happens is that there is a filter that cuts out receipts with reimbursement_value equal to 0, because this means that the document was not reimbursed.

So it is indeed not a bug. The reason: subquota 999, 'Flight ticket issue', does not generate a reimbursement value. According to the Chamber of Deputies:

Os gastos com bilhete aéreo (...) também não são objeto de reembolso e, por isso, não há emissão individual de nota fiscal. O valor gasto é debitado automaticamente do valor da cota do respectivo parlamentar.

Flight ticket expenses (...) are also not subject to reimbursement and, therefore, no individual invoice is issued. The amount spent is automatically deducted from the amount of the respective member's quota.

I understand the mission of this project regarding reimbursement and how this work flows around reimbursement values. But by taking it strictly, we disregard expenses for which the congressperson does not have to be reimbursed; that is, we disregard subquotas in which the congressperson has a monthly amount to deduct from.

In this category, although congresspersons do not have to pay first and get the value reimbursed later, there is public money being spent. And a lot of it: over R$ 100 million during the current term, putting Flight ticket issue in second place among subquotas with most expenses.

As an example of the relevance of having this subquota in our dataset, a few years ago there was a public scandal called "Farra das passagens", about congresspersons using this specific subquota to issue tickets for their family members and friends.

So I ask you guys: although dropping Flight ticket issue from our dataset is not a bug, shouldn't we reconsider having it back?

Very good and important point, @rodolfo-viana! I'm almost sure this detail was unnoticed until now… Surely we need to take that into account.

although dropping Flight ticket issue from our dataset is not a bug, shouldn't we reconsider having it back?

I believe it was never our intention to cut out an entire sub quota.

this detail was unnoticed until now

And I agree with this.

In this category, although congresspersons do not have to pay first and get the value reimbursed later, there is public money being spent.

You are completely right! I believe the way to go here is to find a way to cut out receipts that weren't reimbursed and still keep the 999 subquota expenses in our dataset.

@cuducos any ideas?

I believe this subquota was cut out (not by the Serenata team, of course) while the Chamber was setting up the second version of its open data. I say that because I had read a notebook that covered Flight ticket issue: https://github.com/datasciencebr/serenata-de-amor/blob/master/develop/2016-08-13-irio-descriptive-analysis.ipynb

I guess that when they changed their dataset something went wrong, although another analysis had led to a positive result: https://github.com/datasciencebr/serenata-de-amor/blob/master/develop/2017-05-21-luizcavalcanti-chamber-ceap-api-version-comparison.ipynb

Anyway, if I can help somehow, just let me know.

@cuducos any ideas?

Is the data available in the new version of the API or only in the XML version?

If it is, I think we simply enhance the if statement that excludes reimbursements with zero values to keep them if subquota is 999. And surely mention that in the documentation because people will ask about that.
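
A minimal sketch of that enhanced condition with pandas, under stated assumptions: the column name `subquota_number` and the helper `drop_unreimbursed` are hypothetical, chosen for illustration, and the toolbox's real filter may look different. Only `reimbursement_value` and the subquota number 999 come from this thread.

```python
import pandas as pd

FLIGHT_TICKET_ISSUE = 999  # subquota number discussed in this thread


def drop_unreimbursed(receipts):
    """Drop rows with zero reimbursement value, except subquota 999.

    Hypothetical helper: `subquota_number` is an assumed column name.
    """
    reimbursed = receipts['reimbursement_value'] != 0
    flight_tickets = receipts['subquota_number'] == FLIGHT_TICKET_ISSUE
    # Keep a row if it was reimbursed OR if it is a flight ticket issue
    return receipts[reimbursed | flight_tickets]


receipts = pd.DataFrame({
    'subquota_number': [1, 1, 999, 999],
    'reimbursement_value': [100.0, 0.0, 0.0, 0.0],
})
kept = drop_unreimbursed(receipts)
```

In this toy example the unreimbursed row from subquota 1 is dropped, while both 999 rows survive despite their zero reimbursement value.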

Is the data available in the new version of the API or only in the XML version?

Yes, it is. I checked to find out whether we were cutting it out or whether, by any chance, the Chamber was. It is our own filter that was dropping lines with 0 reimbursement value.

If it is, I think we simply enhance the if statement that excludes reimbursements with zero values to keep them if subquota is 999.

That was my initial idea!

I took the liberty to rename this issue since we agreed on an approach to have the 999 sub quota in the final dataset ;)

I checked the .csv files and compared 999 to other subquotas. It lacks values in three columns:

  • batch_number (that we hardly use),
  • reimbursement_number (ditto), and
  • document_id

I believe rows without a document_id are being dropped in reimbursements.py:

    def group(self, receipts):
        print('Dropping rows without document_value or reimbursement_number…')
        subset = ('document_value', 'reimbursement_number')
        receipts = receipts.dropna(subset=subset)

        groupby_keys = ('year', 'applicant_id', 'document_id')
        receipts = receipts.dropna(subset=subset + groupby_keys)

Is it the issue?
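
One way to test this hypothesis in isolation is a toy DataFrame, sketched below. The data is made up for illustration and `subquota_number` is an assumed column name; only the `groupby_keys` come from the snippet above. It shows what `dropna` does when `document_id` is empty, not what the real dataset contains.

```python
import numpy as np
import pandas as pd

# Made-up rows: the two 999 rows have no document_id, as suspected above
receipts = pd.DataFrame({
    'year': [2017, 2017, 2017],
    'applicant_id': [1, 1, 2],
    'subquota_number': [1, 999, 999],
    'document_id': [10.0, np.nan, np.nan],
    'document_value': [50.0, 300.0, 420.0],
})

groupby_keys = ['year', 'applicant_id', 'document_id']

# dropna removes every row with a missing value in any of the subset columns
remaining = receipts.dropna(subset=groupby_keys)
dropped = receipts[~receipts.index.isin(remaining.index)]
```

If the real 999 rows also come without a `document_id`, they would end up in `dropped` in the same way.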

I wouldn't be so sure they are being dropped. Can anyone confirm whether subquota 999 rows have document_ids in the source?

The thing is that document_id is not documented anywhere in their material. We guess it's a kind of unique identifier for the reimbursement. As flight tickets are not actually reimbursed, maybe they never receive this identifier at all…

It seems this def drops rows missing reimbursement_number and document_id, both of which are nonexistent in 999.

Anyway, if I can help somehow, let me know. :)

It seems this def drops rows missing reimbursement_number and document_id, both of which are nonexistent in 999.

So this is the problem — they don't come with a document_id. That's awful. Anyway… pandas can catch that quite easily I guess. Is that right, @jtemporal?

We're gonna have to discuss this along with Jarbas architecture too — the whole API is based on the uniqueness of document_id.

Just to explain the process I went through to come to this idea: I downloaded the .csv file with this year's expenses, opened it in Excel, picked 999 and other subquotas, looked up which columns are filled for these other subquotas and empty for 999, and found the three mentioned above.
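
The same comparison can be sketched in pandas instead of Excel. Everything below is illustrative: the helper names are hypothetical, `subquota_number` is an assumed column name, and the rows are made up rather than taken from the real .csv.

```python
import numpy as np
import pandas as pd


def always_empty_columns(df):
    """Return the set of columns that hold no value at all in df."""
    empty = df.isnull().all()
    return set(empty[empty].index)


def columns_missing_only_in(receipts, subquota):
    """Columns empty for the given subquota but filled in other subquotas."""
    is_target = receipts['subquota_number'] == subquota
    in_target = always_empty_columns(receipts[is_target])
    elsewhere = always_empty_columns(receipts[~is_target])
    return in_target - elsewhere


# Made-up rows mimicking the pattern described above
receipts = pd.DataFrame({
    'subquota_number': [1, 3, 999, 999],
    'batch_number': [7.0, 8.0, np.nan, np.nan],
    'reimbursement_number': [70.0, 80.0, np.nan, np.nan],
    'document_id': [10.0, 11.0, np.nan, np.nan],
    'document_value': [50.0, 20.0, 300.0, 420.0],
})
missing = columns_missing_only_in(receipts, 999)
```

Running the helper against the actual file, loaded with `pd.read_csv`, would confirm or refute the three columns listed above.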

I am not sure if it is different in .xml. I believe it is not.

Is that right, @jtemporal?

I guess so, but I'll test it today with @rodolfo-viana ;)

Yep! A talk about Jarbas architecture is required. To have subquota 999 (and also 10 and 11, for that matter) back in our dataset, we need to study the implications and maybe revisit Jarbas's whole structure. Maybe generating a separate dataset for these subquotas for now could be a way.

Jarbas used to have a composed unique ID with year, applicant_id and document_id. No problem in recreating something like that. I think it's good to have a heads-up about it, but the first thing is to generate the data, bring it in and see what crashes (on our local machines). The main question is not about Jarbas itself, it's about the data (what is the unique ID for each row? Just the sequential index?). Even if there's none we can work around it (e.g. no detail view, only list views).
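
To make the question concrete, here is a rough sketch of one option: compose the key from year, applicant_id and document_id when a document_id exists, and fall back to the sequential index otherwise. The `row_key` column and the `add_row_key` helper are hypothetical and not part of Jarbas or the toolbox, and the sketch assumes the composed key is unique among rows that do have a document_id.

```python
import numpy as np
import pandas as pd


def add_row_key(receipts):
    """Add a hypothetical `row_key` column identifying each row."""
    receipts = receipts.reset_index(drop=True)
    has_document = receipts['document_id'].notnull()

    # Composed key, as Jarbas used to do: year, applicant_id, document_id
    composed = (
        receipts['year'].astype(str) + '-'
        + receipts['applicant_id'].astype(str) + '-'
        + receipts['document_id'].fillna(0).astype(int).astype(str)
    )
    # Fallback for rows without document_id: the sequential index
    sequential = 'row-' + receipts.index.to_series().astype(str)

    receipts['row_key'] = np.where(has_document, composed, sequential)
    return receipts


# Made-up rows: one regular receipt and two without document_id
receipts = pd.DataFrame({
    'year': [2017, 2017, 2017],
    'applicant_id': [1, 1, 2],
    'document_id': [10.0, np.nan, np.nan],
})
keyed = add_row_key(receipts)
```

A sequential fallback is only stable if the rows arrive in the same order on every import, so this is a starting point for the discussion rather than a solution.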