iamlemec/fastpat

Some suggestions on downloading

Opened this issue · 6 comments

mIgLLL commented

1. In the Windows environment, the unzip function in fetch.py does not work.
2. Network volatility will interrupt the file download loop.

For the first issue, I rewrote the code, and it may help.

import os
import zipfile

def fetch_file(zurl, output, overwrite=False, dryrun=False, unzip=False):
    system = print if dryrun else os.system

    # create the output directory if it doesn't exist yet
    if not dryrun and not os.path.exists(output):
        print(f'Creating directory {output}')
        os.makedirs(output)

    _, zname = os.path.split(zurl)
    zpath = os.path.join(output, zname)
    fetch = overwrite or not os.path.isfile(zpath)

    if fetch:
        print(f'Fetching {zname}')
        # here I adjust the command to go through clash; you can ignore the proxy
        # flag or expose it as a separate argument
        system(f'curl -o {zpath} {zurl} --ssl-no-revoke -x 127.0.0.1:7890')

    if fetch or unzip:
        print(f'Unzipping {zname}')
        # use zipfile instead of the external unzip command so it works on Windows
        with zipfile.ZipFile(zpath, 'r') as zip_ref:
            zip_ref.extractall(output)

    return fetch

For the second issue, maybe wrapping the download in a try...except block would help (see the sketch below).

If some files are missing, simply rerunning the code to fill them in is fine, but leaving the whole download task unfinished because of one failure is quite unreasonable.
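
A minimal sketch of that idea, assuming a fetch_file like the one above (the retry count and delay are arbitrary, and any exception, including a bad or partial zip after a failed download, just triggers another attempt):

import time

def fetch_file_with_retry(zurl, output, retries=3, delay=10, **kwargs):
    # keep the loop alive when the network flakes out: retry a few times,
    # then skip this file so the rest of the downloads can continue
    for attempt in range(retries):
        try:
            return fetch_file(zurl, output, **kwargs)
        except Exception as e:
            print(f'Attempt {attempt + 1} failed for {zurl}: {e}')
            time.sleep(delay)
    print(f'Giving up on {zurl}; rerun later to fill in the gap')
    return False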

Hope it helps, and thanks again for your contribution.

mIgLLL commented

By the way, the download links in "tmapply_files.txt" are out of date.

You can update the links by checking "https://bulkdata.uspto.gov/data/trademark/dailyxml/applications/".
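
As a rough illustration (not how fastpat itself generates the list), one could regenerate it by scraping that index page for .zip links; the one-URL-per-line output format is just an assumption here:

import re
import urllib.request

# hypothetical sketch: pull every .zip href from the USPTO index page and
# write the full URLs out one per line (adjust to whatever format
# tmapply_files.txt actually uses)
base = 'https://bulkdata.uspto.gov/data/trademark/dailyxml/applications/'
with urllib.request.urlopen(base) as resp:
    html = resp.read().decode('utf-8', errors='replace')

names = sorted(set(re.findall(r'href="([^"]+\.zip)"', html)))
with open('tmapply_files.txt', 'w', encoding='utf-8') as f:
    for name in names:
        url = name if name.startswith('http') else base + name
        f.write(url + '\n')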

Lastly, your code for storing the info into the CSV has encoding bugs.

Here is a fixed version of "ChunkWriter" in tools\tables.py

class ChunkWriter:
    def __init__(self, path, schema, chunk_size=1000, output=False):
        self.path = path
        self.schema = schema
        self.chunk_size = chunk_size
        self.output = output
        self.items = []
        self.i = 0
        self.j = 0
        # open with an explicit encoding; utf-8-sig also prepends a BOM
        self.file = open(self.path, 'w+', encoding='utf-8-sig')
        header = ','.join(schema)
        self.file.write(f'{header}\n')

iamlemec commented

Thank you! I don't get to test things much on Windows, so it's good to have some feedback there. I just committed some changes in 936814a that handle things from your first comment. I decided to implement both fetch and unzip using native Python libraries. As for the proxy situation, can you still control that under urllib by setting appropriate environment variables? I haven't been in a situation where I needed proxies recently, so I'm not super knowledgeable there, but am happy to help out if I can.
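
For what it's worth, urllib's default opener does consult the standard proxy environment variables (through urllib.request.getproxies()), so something along these lines should route requests through a local proxy; the address is just the clash default mentioned above:

import os
import urllib.request

# set the standard proxy variables before the first request; urllib's default
# ProxyHandler picks them up via urllib.request.getproxies()
os.environ['HTTP_PROXY'] = 'http://127.0.0.1:7890'   # clash address from the comment above
os.environ['HTTPS_PROXY'] = 'http://127.0.0.1:7890'

with urllib.request.urlopen('https://bulkdata.uspto.gov/') as resp:
    print(resp.status)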

As for the second comment, for the non-apply, non-grant streams, the USPTO does this annoying thing where they rename all the files on a year basis, meaning you have to re-download everything. So I just updated apply/grant for now, but will get to assign/maint/tmapply in a minute.

So the only change on the second code snippet is going from utf-8 to utf-8-sig? I've never seen that before. Is that change really necessary? It seems to work fine for me with utf-8.

mIgLLL commented

Yes, the only change is from utf-8 to utf-8-sig; errors occur when using utf-8. I am using the application data, so maybe you can test on that. In my experience, utf-8-sig is usually safer than utf-8.

iamlemec commented

What kind of errors are you seeing with the existing code? Do they occur when you're writing, or when reading later? In the current codebase, we aren't specifying the encoding, so it will default to locale.getpreferredencoding(False). I found this blog post documenting some related issues:

https://jdhao.github.io/2018/12/03/text_file_read_write_on_windows/

It seems like Linux always returns utf-8, while Windows returns utf-8 on an English-language version and cp936 on a Chinese-language version (at least for simplified).

So I think any problems would be solved here by simply adding encoding='utf-8'. The only thing utf-8-sig does differently is add an invisible byte-order mark (BOM) at the beginning of the file, but a BOM isn't actually required for UTF-8, since it's a byte-oriented encoding and byte order doesn't come into play.
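
The difference is easy to see directly with the standard codecs (nothing here is specific to this repo):

# utf-8-sig is identical to utf-8 apart from the three BOM bytes EF BB BF at the start
print('abc'.encode('utf-8'))      # b'abc'
print('abc'.encode('utf-8-sig'))  # b'\xef\xbb\xbfabc'

# on reading, utf-8-sig strips a leading BOM if present, while plain utf-8 keeps it
print(b'\xef\xbb\xbfabc'.decode('utf-8-sig'))  # 'abc'
print(b'\xef\xbb\xbfabc'.decode('utf-8'))      # '\ufeffabc'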

mIgLLL commented

I ran into problems when parsing the files. The error is "in pandas._libs.writers.write_csv_rows
UnicodeEncodeError: 'gbk' codec can't encode character '\u2003' in position 534: illegal multibyte sequence." I asked new Bing, and it told me to use utf-8-sig instead; that works. (I recovered this record from new Bing, thanks to them for keeping my questions.)

Since the USPTO has nothing to do with Chinese-language text, I think the English data itself is the issue.

iamlemec commented

Thanks for the info. It looks like open on your system defaults to the gbk encoding, which has trouble with unusual Unicode characters like \u2003 (em space). This can be fixed by explicitly passing encoding='utf-8'. Going with utf-8-sig isn't necessary in this case and might cause issues on non-Windows systems. I'm going to push that change, but let me know if things work. Happy to help out if there are still issues!
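
For anyone hitting the same error, here is a minimal reproduction of the behavior described above (the filename is arbitrary):

# \u2003 (em space) has no representation in gbk, the locale default on
# Chinese-language Windows, but encodes fine as utf-8
text = 'foo\u2003bar'

# explicit utf-8 works regardless of the system locale
with open('out.csv', 'w', encoding='utf-8') as f:
    f.write(text)

# without encoding=..., open() falls back to locale.getpreferredencoding(False);
# under a gbk locale that raises UnicodeEncodeError, matching the traceback above
try:
    text.encode('gbk')
except UnicodeEncodeError as e:
    print(e)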