Write up a short analysis of different presentations of the sender line in email metadata
Closed this issue · 5 comments
this is something to do by hand for now. But now that we have DEQ 14's URLs all in a row in the large list of text file names, you can experiment with pulling the sender lines automatically once you have a sense of the variations, which will start getting us toward regular expressions for the sender line. When we discussed this on 4/1, we counted at least three versions of how this line appears.
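As a starting point for those experiments, here is a minimal sketch of what a set of regular expressions for the sender line could look like. The three line shapes below are assumptions for illustration only (the actual variations in DEQ 14 still need to be confirmed by hand), and `SENDER_PATTERNS` and `parse_sender` are hypothetical names, not anything from the project code.

```python
import re

# Hypothetical sender-line shapes; replace these with the variations
# actually observed in the DEQ 14 text files.
SENDER_PATTERNS = [
    # e.g.  From: Jane Doe <jane.doe@example.com>
    #       From: "Doe, Jane" <jane.doe@example.com>
    re.compile(r'^From:\s*"?(?P<name>[^"<]+?)"?\s*<(?P<email>[^>]+)>\s*$'),
    # e.g.  From: jane.doe@example.com
    re.compile(r'^From:\s*(?P<email>\S+@\S+)\s*$'),
    # e.g.  From: Doe, Jane
    re.compile(r'^From:\s*(?P<name>[^<>@]+?)\s*$'),
]


def parse_sender(line):
    """Return (name, email) for a sender line, or None if no pattern matches.

    Either element of the tuple may be None when the line does not carry it.
    """
    for pattern in SENDER_PATTERNS:
        match = pattern.match(line.strip())
        if match:
            groups = match.groupdict()
            return groups.get("name"), groups.get("email")
    return None
```

Trying the patterns in order, most specific first, keeps each one simple, and a line that returns `None` is a hint that there is another variation to add.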
...and here is the code for reformatting/standardizing sender names into [Last, First]. it's not perfect, but it's a start!!
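For anyone reading along without the notebook open, a rough sketch of the general idea (not Hannah's actual code) might look like the following. It assumes names arrive either as "First Last" or already as "Last, First", and it treats the final word as the last name, which will be wrong for some people. `to_last_first` is a hypothetical name.

```python
def to_last_first(name):
    """Standardize a sender name to 'Last, First' (a rough heuristic).

    Assumes the name is either 'First [Middle] Last' or already 'Last, First'.
    """
    # Drop stray quotes and collapse runs of whitespace.
    name = " ".join(name.replace('"', "").split())
    if not name:
        return name
    if "," in name:
        # Already in 'Last, First' order: just tidy the spacing.
        last, first = [part.strip() for part in name.split(",", 1)]
        return f"{last}, {first}" if first else last
    parts = name.split(" ")
    if len(parts) == 1:
        # A single word cannot be split into first and last.
        return parts[0]
    # Treat the final word as the last name, everything before it as first.
    return f"{parts[-1]}, {' '.join(parts[:-1])}"
```

For example, `to_last_first("Jane Q Doe")` gives `"Doe, Jane Q"`, and a name that is already `"Doe, Jane"` comes back unchanged.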
This looks really impressive, Hannah! I think your question at the end is a good one--can you lock the list so running the code doesn't keep adding? maybe there's something about creating a JSON that would be more protective, but I don't know.
here's what googling brought up: it seems like the key might be "locking", although that's really about threads: a lock makes sure only one thread at a time can touch a shared object, so steps happen in sequence rather than simultaneously. but I can kind of see how there might be an application to use from that, like "if this object already has content, don't add to it."
https://hackernoon.com/synchronization-primitives-in-python-564f89fee732
https://stackoverflow.com/questions/22422357/do-i-need-a-lock-block-to-iterate-on-a-shared-list
https://www.bogotobogo.com/python/Multithread/python_multithreading_Synchronization_Lock_Objects_Acquire_Release.php
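The "if this object already has content, don't add to it" idea can also be sketched without any threading locks. This is only an illustration, assuming the list grows because the same cell gets re-run against the same data; `add_unique` and `sender_names` are hypothetical names.

```python
def add_unique(names, new_names):
    """Append only the names that are not already in the list."""
    seen = set(names)
    for name in new_names:
        if name not in seen:
            names.append(name)
            seen.add(name)
    return names


sender_names = []
batch = ["Doe, Jane", "Smith, Al", "Doe, Jane"]

add_unique(sender_names, batch)
add_unique(sender_names, batch)  # simulates running the same cell a second time
```

After both calls `sender_names` still holds just the two distinct names. Another option is to create the empty list in the same cell that fills it, so every run starts from scratch.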
thanks for these! I actually ended up realizing I had the wrong idea of what the problem was. This version fixes that problem (I think) and also is able to pick up more sender names.
I believe the problem was the last line in the second cell and the way it strung the emails together. I trial-and-error'd it away.
(also note that the linked version is using a different sample of deq14)
ok! so I played with it some more and this third version picks up more names, and also tells you what it weeded out. This is still on just a sample of deq14, but I am going to run it on the full deq14 and see what happens then.