Data ethics and data privacy are integral to any data project. There are obvious cases, such as protecting the privacy of individuals' health records under HIPAA. There are also many gray areas surrounding what constitutes personally identifiable information (PII), which arise across industries including advertising, finance, and consumer goods. You may have noticed that starting around the summer of 2018, many websites began displaying privacy policy notices asking you to accept the use of cookies. This was a result of Europe's GDPR legislation. You are also probably aware of the Cambridge Analytica debacle surrounding the 2016 United States presidential election. As a data practitioner, it is your responsibility to uphold data ethics in a fast-changing environment.
You will be able to:
- Determine whether or not a data science procedure meets an ethics standard
If the data you are handling is valuable, then security should be a primary concern. Data breaches are all too common, and such leaks of sensitive information could often have been avoided if businesses and organizations had followed standard security protocols. While there are thousands of such cases, two of the breaches that have most captured the public's attention are Cambridge Analytica's misuse of Facebook data to influence political elections and Equifax's 2017 breach, which exposed the Social Security numbers and other sensitive records of roughly 147 million individuals.
PII stands for personally identifiable information. While some data, such as one's Social Security number and medical records, is clearly PII, other pieces of data may or may not qualify depending on the jurisdiction. In the United States, for example, there are two key federal laws: the Health Insurance Portability and Accountability Act (HIPAA) and the Privacy Act of 1974. While in theory these acts govern the use, collection, and maintenance of personal data, the scope of what constitutes PII, and the regulations surrounding handling and using such data, is generally antiquated. For example, a user's IP address has been categorized as non-PII by several U.S. courts, despite being a near-unique identifier for most individuals' home internet connections. Protections were further eroded by the FCC's rollback of net neutrality rules under Chairman Ajit Pai, which took effect in mid-2018. Aside from federal jurisdiction, several states, most notably California, have their own data protection laws to the benefit of users and consumers.
GDPR stands for the General Data Protection Regulation. It was passed by the European Union on April 14th, 2016 and went into effect on May 25th, 2018. GDPR protects the data rights of individuals in the European Union and is an example of how legislation will have to change and adapt to the online digital era of the 21st century. GDPR implements more expansive definitions of what constitutes PII and sets fines of up to 4% of a company's annual global revenue, or €20 million, whichever is greater.
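To make that fine structure concrete, here is a toy calculation (the revenue figure is entirely made up) showing how the ceiling works:

```python
# Hypothetical illustration of the GDPR fine ceiling: up to 4% of annual
# global revenue or €20 million, whichever is greater.
annual_revenue_eur = 2_000_000_000  # made-up company revenue

fine_cap = max(0.04 * annual_revenue_eur, 20_000_000)
print(f"Maximum possible fine: €{fine_cap:,.0f}")  # €80,000,000 in this example
```

Note that for a smaller company, the €20 million floor would be the binding figure, which is why the regulation specifies "whichever is greater."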
There are two primary practices you should follow when dealing with PII and other sensitive data. The first is to encrypt sensitive data; when in doubt, encrypt. The second is to ask yourself what level of information you really need. Large organizations often have dedicated data cleaning teams that scrub sensitive fields such as names and addresses before passing the data off to analysts and others to mine. Ultimately, any well-thought-out strategy will include multiple layers, safeguards, and other measures to ensure data is safe and secure. A minimal sketch of what such scrubbing might look like follows below.
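The snippet below is one simple sketch of the scrubbing idea; the records, column names, and salt value are all made up for illustration. It replaces raw identifiers with salted hashes so analysts can still join records without ever seeing the underlying PII:

```python
import hashlib

import pandas as pd

# Hypothetical customer records; every value here is made up
df = pd.DataFrame({
    "name": ["Ada Lovelace", "Alan Turing"],
    "email": ["ada@example.com", "alan@example.com"],
    "purchase_total": [120.50, 89.99],
})

def pseudonymize(value, salt="replace-with-a-secret-salt"):
    """Replace a raw identifier with a salted SHA-256 digest so records
    can still be joined without exposing the underlying PII."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

# Scrub the PII columns before handing the data off for analysis
df["user_id"] = df["email"].apply(pseudonymize)
df = df.drop(columns=["name", "email"])
print(df)
```

Keep in mind that hashing is one layer among many; a real pipeline would also encrypt data at rest and in transit and restrict who can access the salt.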
When collecting data, it is important to ensure you are not gathering it in a manner that will generate bias. For example, if Data Scientists are not careful with how they phrase survey questions, they can generate misleading results. A poll question like "How poorly has Politician X performed when it comes to the economy?" adds a negative connotation to the question. That phrasing might lead people to rate Politician X worse than if they had simply been asked "How has Politician X performed when it comes to the economy?"
In some cases, choosing which variables to collect and how to define them can also contain bias. You’ll notice that in some of the datasets we use, gender is represented as a binary value and race is referenced in an insensitive manner. This is an artifact of the societal conditions at the time the data was collected. As soon-to-be Data Scientists, it will be your responsibility to ensure that data collection is done in an inclusive manner.
People often trust algorithms and their output based on measurements such as "this algorithm has 99.9% accuracy." However, while algorithms such as linear regression are mathematically sound and powerful tools, the resulting models are simply reflections of the data fed into them. Logistic regression and other algorithms are used to inform a wide range of decisions, including whether to grant someone a loan, how severe a criminal sentence should be, and whether to hire an individual for a job. (Do a quick search online for algorithmic bias, or check out some of the articles below.) In all of these scenarios, it is important to remember that the algorithm simply reflects the underlying data. If an algorithm is trained on a dataset in which African Americans have faced disproportionate criminal prosecution, the algorithm will perpetuate those racial injustices. Similarly, algorithms trained on data reflecting a gender pay gap will continue to promote that bias. For these reasons, substantial thought and analysis regarding problem setup and the resulting model is incredibly important.
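One concrete way to start that analysis is to audit a model's decisions across groups. The sketch below uses made-up loan-approval outputs (the groups and decisions are hypothetical, not real results) to compute per-group selection rates and a disparate impact ratio:

```python
import pandas as pd

# Hypothetical approve/deny decisions from a trained loan model; the
# values below are made up purely to illustrate the audit.
results = pd.DataFrame({
    "group": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "approved": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Selection rate (share approved) for each group
rates = results.groupby("group")["approved"].mean()
print(rates)

# Disparate impact ratio; the common "four-fifths rule" flags ratios < 0.8
ratio = rates.min() / rates.max()
print(f"Disparate impact ratio: {ratio:.2f}")
```

A check like this is only a first pass; a low ratio doesn't prove discrimination, and a high one doesn't rule it out, but it surfaces disparities worth investigating before a model is deployed.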
Below is a handful of resources providing further information regarding some of the topics discussed here.
Aside from practices that are overtly illegal under current legislation, data privacy and ethics raises a myriad of thought experiments. For example, should IP addresses or cookies be considered PII? How should security camera footage be handled? What about vehicles such as Google Street View cars, which capture video and pictures of public places? Some companies are even photographing license plates to track car movements. Should they be allowed to maintain massive databases of such information? What regulations should be placed on these and other potentially sensitive datasets?
All of these examples question where and when limits should be placed on data. Science fiction stories such as *1984* are much more accurate than one might expect. Moreover, injustices and questionable practices still abound. For example, despite the public outcry at debacles like Cambridge Analytica, many companies with nearly identical practices still exist, such as Applecart in New York City, which collects and sells user data to the Republican party, amongst others.
You should also identify some news sources to help you stay up to date on tech trends.
One great resource is the Electronic Frontier Foundation (EFF).
EFF recently put together an article called Fix it Already, outlining fixable mishaps by technology companies that continue to be ignored. Take a look at the article here and get involved to put pressure on these organizations and your representatives to shape up. Here's a quick preview of their list:
Finally, note that online data can at times include offensive or inappropriate material. For example, if you acquire data from an API such as Twitter's, you may encounter lewd or offensive content. While many of these services will eventually screen out and remove particularly egregious cases, plenty of trolls still exist.
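If you need to do your own screening, a first pass can be as simple as filtering records against a blocklist. The sketch below is a minimal, assumption-laden illustration (the blocklist terms and sample texts are placeholders); real moderation pipelines use trained classifiers and human review on top of anything like this:

```python
# Placeholder terms standing in for an actual blocklist
BLOCKLIST = {"offensive_term_1", "offensive_term_2"}

def is_clean(text):
    """Return True if the text contains no blocklisted terms."""
    words = {word.strip(".,!?").lower() for word in text.split()}
    return words.isdisjoint(BLOCKLIST)

# Made-up sample records standing in for scraped tweets
tweets = ["A perfectly fine tweet", "Something with offensive_term_1 in it"]
screened = [t for t in tweets if is_clean(t)]
print(screened)  # only the clean tweet survives
```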
There are a multitude of resources for getting involved with data privacy and ethics; here are a few to get you started.
In this lesson, you got a preview of some of the many issues regarding data privacy and ethics. From GDPR to being aware of your own data aura, there's plenty to keep you busy and on your toes in this fascinating corner of the data industry.