same code base for fetching prices and importing transactions?

Question

same code base for fetching prices and importing transactions?

blais opened this issue 6 years ago · 5 comments

Original report by Johannes Harms (Bitbucket: johannesjh, GitHub: johannesjh).

Could / should we use the same code base for fetching prices and importing transactions? Using importers for importing downloaded prices could help reduce duplicate code. The only thing missing is to extend the importer base classes to support fetching (of not only prices, but also transactions).

@blais: I would be glad to hear your thoughts on this, including the previous design rationale to split price-fetching and transaction-importing into separate modules.
@seltzered: I am posting this as follow-up to beancount/beanprice#2 "Bean-price: support fetch over range of dates", because I did not want to take the other issue off-topic.

Motivation:
My personal observation is that importing prices is very similar to importing transactions.

fetching: While prices are usually fetched automatically, I found this is not always possible. In a similar way, transactions could be fetched automatically, but that's not always possible (or worth the effort)
Importing (identifying, filing, extracting): These steps are nearly identical for prices and transactions.

Example: I wrote this importer for yahoo prices:

Fetching: As mentioned above, automatic fetching is not always easy. In this case, I found it easier to manually fetch prices by scraping the HTML table using artoo.js.
Importing: The code snippet below illustrates that it makes perfectly sense to import prices using import functionality.

#!python

"""
Imports prices from CSV that was scraped from yahoo finance
"""
# pylint: disable=C0411,C0330


from _pydecimal import DecimalException
import csv
import logging
from typing import Dict, Iterable, NamedTuple

from beancount.core.amount import Amount
from beancount.core.data import Price, new_metadata, sorted as sorted_entries
from beancount.core.number import D
from beancount.ingest.cache import _FileMemo
from beancount.ingest.importer import ImporterProtocol
from beancount.ingest.importers.csv import Col
from beancount.ingest.importers.mixins.identifier import IdentifyMixin
from beancount.utils.date_utils import parse_date_liberally

logger = logging.getLogger(__name__)  # pylint: disable=C0103

Row = NamedTuple(
    "Row", [("file_name", str), ("line_number", int), ("data", Dict)]
)


class PricesImporter(ImporterProtocol):
    """Imports prices from CSV"""

    def __init__(self, **kwargs):  # pylint: disable=R0913
        """
        Initializes the importer.
        """
        # gets required arguments:
        self.columns = kwargs.pop("columns")
        self.commodity = kwargs.pop("commodity")
        self.currency = kwargs.pop("currency")

        # gets optional arguments:
        self.debug = kwargs.pop("debug", False)
        self.csv_dialect = kwargs.get("csv_dialect", None)
        self.dateutil_kwds = kwargs.get("dateutil_kwds", None)
        super().__init__(**kwargs)

    def extract(self, file: _FileMemo, existing_entries=None):
        """Extracts price entries from CSV file"""
        rows = self.read_lines(file.name)
        price_entries = sorted_entries(self.get_price_entries(rows))
        return price_entries

    def read_lines(self, file_name: str) -> Iterable[Row]:
        """Parses CSV lines into Row objects"""
        with open(file_name) as file:
            reader = csv.DictReader(file, dialect=self.csv_dialect)
            for row in reader:
                yield Row(file_name, reader.line_num, row)

    def get_price_entries(self, lines: Iterable[Row]) -> Iterable[Price]:
        """Converts Row objects to beancount Price objects"""
        for line in lines:
            try:
                self.validate_line(line)
                meta = self.build_metadata(line.file_name, line.line_number)
                date = self.parse_date(line.data[self.columns[Col.DATE]])
                amount = self.parse_amount(line.data[self.columns[Col.AMOUNT]])
                amount_with_currency = Amount(amount, self.currency)
                yield Price(  # pylint: disable=E1102
                    meta, date, self.commodity, amount_with_currency
                )
            except (ValueError, DecimalException, AssertionError) as exception:
                logger.warning(
                    "Skipped CSV line due to %s exception at %s line %d: %s",
                    exception.__class__.__name__,
                    line.file_name,
                    line.line_number,
                    line.data,
                )

    def validate_line(self, row):
        """Validates CSV rows. If invalid, an AssertionError is thrown."""
        data = row.data
        assert data[self.columns[Col.AMOUNT]]

    def build_metadata(self, file_name, line_number):
        """Constructs beancount metadata"""
        line_number = str(line_number)
        return new_metadata(
            file_name,
            line_number,
            {"source_file": file_name, "source_line": line_number}
            if self.debug
            else None,
        )

    def parse_date(self, date_str):
        """Parses the date string"""
        return parse_date_liberally(date_str, self.dateutil_kwds)

    def parse_amount(self, amount_str):  # pylint: disable=R0201
        """Parses an amount string to decimal"""
        return D(amount_str)


class YahooFinancePricesImporter(IdentifyMixin, PricesImporter):
    """
    Imports CSV scraped from finance.yahoo.com

    Usage:

    Scrape historical prices using artoo.js, for example from:
    https://finance.yahoo.com/quote/EXS2.DE/history?p=EXS2.DE

    artoo.scrapeTable('table[data-test="historical-prices"]', {
      headers: 'th',
      done: artoo.saveCsv
    })

    Then run this importer to convert the scraped csv file to beancount prices.
    """

    def __init__(self, **kwargs):
        kwargs.setdefault(
            "columns", {Col.DATE: "Date", Col.AMOUNT: "Adj Close**"}
        )
        self.matchers = [
            ("content", r"Date,Open,High,Low,Close\*,Adj Close.*")
        ]
        super().__init__(**kwargs)


class TecdaxImporter(YahooFinancePricesImporter):
    """
    Imports CSV scraped from:
    https://finance.yahoo.com/quote/EXS2.DE/history?p=EXS2.DE
    """

    def __init__(self, **kwargs):
        kwargs.setdefault("commodity", "TECDAX")
        kwargs.setdefault("currency", "EUR")
        super().__init__(**kwargs)

In my opinion, the above code illustrates that prices and transactions could use the same import process. I would therefore like to propose: Let's use importers for importing downloaded prices. And let's extend the importer base classes to support fetching of not only prices, but also transactions.

Answer 1 · 2019-01-03T23:25:57.000Z

Original comment by Martin Blais (Bitbucket: blais, GitHub: blais).

I think these two processes are unrelated: fetching prices don't usually involve a file, so identify / filing doesn't mean anything.
Also, prices may be fetching over specific dates or date ranges (informed by the contents of one's ledger), while importing converts documents.
You're welcome to use the importers to fetch prices if you like.
Maybe something to think about for the next rewrite (I'm not sure how much would be gained by merging those two things now).

Answer 2 · 2019-01-04T15:41:21.000Z

Original comment by Johannes Harms (Bitbucket: johannesjh, GitHub: johannesjh).

Hello Martin, thank you for your thoughts!

What can specifically be gained by merging the two codebases is: A smaller overall codebase, re-use of tools and terminology (fetch - identify - file - extract does make sense for prices and transactions, in my opinion), the additional usecase of fetching of transactions (not just fetching prices), as well as the additional usecase of identifying & filing (manually) downloaded price data.

I think these two processes are unrelated

Hm, funny. I think that, on the contrary, the processes for prices and transactions are remarkably similar. Both follow the same sequence of steps. In Detail:

Fetching: Beancount allows to fetch prices, but beancount importers currently have not method for fetching transactions. It would be nice if beancount could provide infrastructure (i.e., python methods to be overwritten and a commandline api) for fetching not just prices, but also transactions.
Identifying and filing: Users may wish to file whatever was manually downloaded or automatically fetched. For example, csv files, json-rest api responses, pdfs etc., the content of which can be prices, transactions, balances, etc. Beancount already allows to file fetched transactions and balances in a folder that corresponds to an account. In a similar way, it makes sense to file fetched prices in a folder that represents a commodity.
Extracting/converting: It is the identical process of parsing a source format (CSV, PDF, whatever) and converting it to beancount entries.

I agree that this ticket is not of immediate importance (feel free to close it), but I do see long-term benefits of merging the two codebases. - So thank you for your thoughts and consideration!

Answer 3 · 2019-01-04T19:48:22.000Z

Original comment by Martin Blais (Bitbucket: blais, GitHub: blais).

Nothing against doing it this way, but you can already currently use the ingest framework to import price directives (there aren't any restrictions on the kind of directives an importer can return).

(There's normally no "document" to fetch when getting price information. Sure, you could store some of the raw information in a file though, I doubt that would be useful. Anyway, you could do it like that indeed.)

The only bit that's missing is a "fetching" stage, and perhaps some mechanics for selecting currencies from the command-line (maybe that's something that can be input to the creation of the importers list).

One issue is that there isn't much to add in terms of generic support for "fetching." It's just too varied or limited. I never saw anything that could be provided in common. There's already urllib and for stuff like mechanize or webdriver/selenium, but it's just way too platform-specific, and the user could always run this from their implementation of the fetching stage.

Perhaps just a hook and a fetch command might be sufficient to add though. I'll admit I recently have converted some of my importers from documents to using an API, in which case I do need to manually fetch by running some code (as opposed to visiting a website and downloading a file). I currently just run a script separately to do that, but maybe it's time to create a command so I can just fetch from it. I think that would suit your use case as well, and open the door to fetching price information in this way.

I'll keep this open so I remember to add a command.

Answer 4 · 2021-02-26T13:29:13.000Z

The new command line parsing code already allows to implement arbitrary commands. The documentation will provide an example. I don't think there is anything else to do to support the fetching of prices through the same command line interface that is generic enough to be implemented in Beangulp. I see the workflows of importing transactions from statements and fetching prices fairly well separated thus I don't think there is much value in trying to unify them under the same tooling. Even more so now that price fetching has been spun off to Beanprice.

Answer 5 · 2021-02-26T13:56:16.000Z

Totally agree. They're separate things. Moreover, the price fetching tool is not as closely tied to Beancount.

It would, however, benefit from a similar treatment, e.g. simplification, and the ability to run just the one importer in a similar way.