The Data-Driven Scientist
With Python comes responsibility
I recently joined a book club with a few people from my data science cohort, and the first book we indulged in was a quick read by DJ Patil and Hilary Mason called Data Driven. It was only 31 pages, so it took about a night to get through. After reading the book, I highlighted a few points of interest and discussed them with my group. Now, I’d like to archive my biggest takeaways. If you’re interested in giving it a read, perhaps this will help you get more out of it, and if you’ve already read it and have any highlights of your own, I’d love to read them in the comments!
First, the book goes over what a data scientist is and what skills one needs in order to adequately be labeled a data scientist. The three most important ones are mathematics, computing, and communication. For math, a good data scientist is proficient specifically in statistics and algebra. From my experience, statistics can greatly accelerate your data insights by way of hypothesis testing, among other tests, which help you find possible correlations in the data and better train your model. And of course you cannot train any sort of model without knowledge of computing. For data science, Python seems to be the most popular programming language right now. It is user-friendly and very intuitive compared to many other languages. And after gathering all of your insights using math and computation, those findings need to be communicated through the appropriate chain of command. What is meant by this is that not everyone is a data scientist, so not everyone will understand what data scientists do. An excellent data scientist is therefore not only good at gathering actionable insights, but also at communicating those insights clearly and effectively for the betterment of the company. They must be able to ask and answer the right questions.
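To make the hypothesis testing point a bit more concrete, here is a minimal sketch of one such test: a permutation test for whether two variables are correlated. This is my own illustration, not something from the book. The function names, the `ad_spend` and `sales` variables, and the numbers themselves are all made up for the example, and it uses only the Python standard library.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def permutation_test(x, y, n_permutations=5000, seed=0):
    """Return (observed r, two-sided p-value).

    Null hypothesis: x and y are unrelated, so shuffling y should
    produce correlations as strong as the observed one fairly often.
    """
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    observed = pearson_r(x, y)
    shuffled = list(y)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson_r(x, shuffled)) >= abs(observed):
            at_least_as_extreme += 1
    # Add one to both counts so the p-value is never exactly zero.
    p_value = (at_least_as_extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical numbers, invented purely for illustration.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
sales = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

r, p = permutation_test(ad_spend, sales)
print(f"r = {r:.3f}, p = {p:.4f}")
```

A small p-value here only says the relationship is unlikely to be a fluke of the shuffle. It does not say one variable causes the other, which is exactly the kind of thing that has to be communicated carefully.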
Next, the book goes into what it means to be data-driven and the democratization of data. Walmart is used as an example of a data-driven company. I found it fascinating to learn how they were really the pioneers of using data deliberately. I really had to think about it. Yeah! You know how Walmart always seems to be on trend with the season and always seems to be changing up its store displays? These kinds of moves scream action taken as a result of customer data analysis. The book does not detail the methods used to garner these actionable insights, but it seems likely that time series analysis would play a huge part.
Democratizing data, as the book describes it, is where “everyone in an organization has access to as much data as legally possible.” This brings us to the subtitle of this blog: with Python comes responsibility. What’s the most evil thing that can be done with data? The point of this question is to get the scientist to think outside the box and push the boundaries of creativity. This question also knocks on the door of moral philosophy. How does one ensure ethics are being upheld in project production?
Team organization and optimization are very detailed parts of the book, so I will not spoil it for you. 😉
My favorite part, however, is toward the end and references the moral use of data with regard to politics. This part took me a bit longer to understand, but what I was able to get from it was this: it is good and preferable to use data to gain valuable insights, but it should never be skewed to fit a particular narrative. When that happens, the data is no longer speaking for itself. The way I see it, data is the unconscious (and sometimes conscious) movement of the masses. That was Walmart’s approach, and it has proven to be extraordinarily lucrative for them. They followed the data and restructured their whole operation accordingly. This seems to be a simpler approach than pushing something that just won’t budge. Unfortunately, we see data being pushed or exaggerated to fit wild stories. We see this mainly in politics, where data becomes more of a weapon than a tool.