HTML changes on the responses makes the script fail.

Question

HTML changes on the responses makes the script fail.

lockcda opened this issue 2 years ago · 4 comments

I detected the script bridges_get.py started to fail because the HTML generated by https://bridges.torproject.org/bridges?transport=obfs4%27 has changed. The following patch makes the script work again:

diff --git a/bridges_get.py b/bridges_get.py
index 6e61078..c0fea45 100755
--- a/bridges_get.py
+++ b/bridges_get.py
@@ -161,7 +161,7 @@ while bridges == False:
 
     # look for the bridges if the captcha was beaten
     html = str(reply.read())
-    q = re.findall(r'<div class="bridge-lines" id="bridgelines">(.*?)</div>',
+    q = re.findall(r'<div[^>]*id="bridgelines"[^>]*>(.*?)</div>',
                     html,
                     re.DOTALL)
     try:
@@ -170,7 +170,7 @@ while bridges == False:
 
         for l in b:
             # clean string for newlines and spaces
-            _b = l.strip().replace('\\n', '')
+            _b = l.replace('\\n', '').strip()
             if _b != '':
                 bridges = _b
     # captcha failed, try again

The patch changes the regular expression to find the correct div (the class attribute changed and I modified the regular expression so it doesn't depend on the class anymore) and the strip command of each bridge line should be run after cleaning \n from the string.

Answer 1 · 2022-08-28T23:40:26.000Z

Thank you, lockcda, for that patch.
However, we found on our test system another issue with the script, producing the following error:

Traceback (most recent call last):
  File "/home/torbox/torbox/bridges_get.py", line 47, in <module>
    from mechanize import Browser
ModuleNotFoundError: No module named 'mechanize'

This issue exists before and after applying the patch. Currently, we are not sure if this is something specific to our test system or if others can reproduce that error. Reinstalling the mechanize module didn't fix it.

Answer 2 · 2022-08-29T00:20:34.000Z

Thank you, lockcda, for that patch. However, we found on our test system another issue with the script, producing the following error:
Traceback (most recent call last):
  File "/home/torbox/torbox/bridges_get.py", line 47, in <module>
    from mechanize import Browser
ModuleNotFoundError: No module named 'mechanize' 
This issue exists before and after applying the patch. Currently, we are not sure if this is something specific to our test system or if others can reproduce that error. Reinstalling the mechanize module didn't fix it.

I can't reproduce that in my system so (if you're also using the master branch) it seems to a be system specific problem. Did you update your system recently?

Answer 3 · 2022-08-30T20:20:13.000Z

I found the bug. It seems that version 0.4.8 of mechanize will not be correctly installed. Even several tries to install it failed.
We will switch back to 0.4.7 and add your patch, which works great. Thanks for helping!

Answer 4 · 2022-08-30T20:34:23.000Z

Good to know, thank you. Maybe it would be better to use the following regular expression:

     q = re.findall(r'<div[^>]+id="bridgelines"[^>]*>(.*?)</div>',
                     html,
                     re.DOTALL)

Note I replaced a * by a +, because after <div there will always be (at least) a space.