It’s easy to focus on the number-crunching part of data science. It’s a part I quite enjoy, and part of why I chose data science: I love the logical and mathematical parts! However, it’s important to remember the fundamental goal, which is to communicate the findings that can be drawn from the numbers.
We started learning about visualizations using the matplotlib library, and quickly added Seaborn to the toolbox. Seaborn can give a pretty good visual just from putting in the data! While the end goal for a visualization might be to share the well-formatted results in a presentation, having the ability to easily see comparisons of the data can also help guide where to focus our data analysis.
Before you can plot the data, you first have to clean the data. Datasets, especially large ones, are often imperfect. In particular, there will almost always be values that weren’t collected for some number of points. When pandas reads these missing data points, it fills the relevant cell with a value called NaN, or “not a number.” NaNs cannot be plotted: some plotting functions will raise an error, while others will quietly drop those points, so it’s best to deal with them first. With pandas, it’s easy to fill in these missing values so the data can be plotted. It depends on the data set, but usually the median is a good choice: it is a central value and, unlike the mean, is not going to be skewed by outliers. Here is the simple bit of code to fix these missing values:
df['column_name'] = df['column_name'].fillna(df['column_name'].median())
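To see why the median is usually the safer fill value, here is a minimal sketch on a made-up column. The `revenue` name and its numbers are purely hypothetical, chosen so that one value is missing and one is a large outlier.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing value and one large outlier (500.0).
df = pd.DataFrame({"revenue": [10.0, 12.0, np.nan, 11.0, 500.0]})

# Both statistics skip NaN by default.
mean_value = df["revenue"].mean()      # pulled upward by the outlier
median_value = df["revenue"].median()  # stays near the typical values

print(f"mean: {mean_value}, median: {median_value}")

# Fill the gap with the median and assign the result back to the column.
df["revenue"] = df["revenue"].fillna(median_value)
print(df)
```

Filling with the mean here would drop a value above 100 into a column where most entries sit near 11, which is exactly the kind of skew the median avoids.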
For some data sets, before you can get a median value, it’s important to make sure that values that should be numbers are registered as numbers. $2,400,000.00 is easy for a person to read and understand as a number, but to a computer the dollar sign and commas make it a piece of text rather than a number. You have to strip the punctuation out, and then tell the program to convert what is left into numerical values. The relevant code is:
df['column_name'] = df['column_name'].apply(lambda x: x.replace('$', '').replace(',', ''))
df['column_name'] = pd.to_numeric(df['column_name'])
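As a variation on the same idea, here is a small sketch using pandas’ vectorized string methods instead of `apply`. The sample values are invented for illustration, and `errors="coerce"` is my own addition: it turns anything that still isn’t a number into NaN instead of raising an error, which may or may not be what you want for your data.

```python
import pandas as pd

# Hypothetical sample values, including one that is not a number at all.
df = pd.DataFrame(
    {"production_budget": ["$2,400,000.00", "$150,000", "unknown"]}
)

# Strip the dollar signs and commas as literal characters (not regex patterns).
cleaned = (
    df["production_budget"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
)

# Convert to numbers; entries that cannot be parsed become NaN.
df["production_budget"] = pd.to_numeric(cleaned, errors="coerce")
print(df)
```

The leftover NaN can then be handled with the same `fillna` approach as any other missing value.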
Once the computer knows it’s looking at numbers and doesn’t trip over any missing values, we can plot our data!
Some common types of visualizations:
Scatterplots are very good for showing correlations and relationships between variables, such as how the budget of a movie can affect the revenue that movie makes. Each data point would be for one movie: one budget, one revenue. Scatterplots can be used to form models and thus make predictions: if we put X in, how much Y can we expect to get out of it?
And after reading the data set into the notebook as tn_budgets_df, this plot was made with just one line of code:
sns.scatterplot(x='production_budget',y='worldwide_gross',data=tn_budgets_df)
So wonderfully simple!
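For anyone who wants to try this without the original data set, here is a self-contained sketch. The column names match the ones above, but the numbers are made up, and the axis labels and title are my own additions.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the real data: one row per movie.
tn_budgets_df = pd.DataFrame(
    {
        "production_budget": [1e6, 5e6, 2e7, 5e7, 1.5e8],
        "worldwide_gross": [3e6, 4e6, 6e7, 1.2e8, 6e8],
    }
)

fig, ax = plt.subplots()
sns.scatterplot(
    x="production_budget", y="worldwide_gross", data=tn_budgets_df, ax=ax
)

# Readable labels matter once the plot leaves the notebook.
ax.set_xlabel("Production budget ($)")
ax.set_ylabel("Worldwide gross ($)")
ax.set_title("Budget vs. worldwide gross (toy data)")
```

In a notebook the figure shows up on its own; in a script you would add `plt.show()` or `fig.savefig(...)` at the end.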
Another way to talk about correlation is the correlation coefficient, which is a numeric value that describes how strongly correlated two variables are. Scatterplots are a way to actually see what that correlation looks like in an intuitive way, which is especially important when sharing information with people who aren’t data scientists. It’s hard for anyone to picture what a correlation coefficient means without a graph, though technically trained people will at least have a logical understanding of it. And if you can’t communicate the meaning of your data, then it won’t do you any good!
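For reference, the coefficient itself is a one-liner in pandas. This is a minimal sketch on invented numbers, assuming the Pearson coefficient (the pandas default), which ranges from -1 to 1.

```python
import pandas as pd

# Hypothetical toy data: gross rises with budget, but not perfectly.
movies = pd.DataFrame(
    {
        "production_budget": [1e6, 5e6, 2e7, 5e7, 1.5e8],
        "worldwide_gross": [3e6, 4e6, 6e7, 1.2e8, 6e8],
    }
)

# Series.corr computes the Pearson correlation coefficient by default.
r = movies["production_budget"].corr(movies["worldwide_gross"])
print(f"correlation coefficient: {r:.3f}")
```

A value near 1 or -1 means the points fall close to a straight line, and a value near 0 means there is little linear relationship. That single number is what the scatterplot lets people see directly.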
It’s easy to feel like a lack of correlation is a lack of a result, but knowing that two variables are unrelated is information in itself. Discovering that there is no pattern still means you have learned something about the data set.
Barplots are good for showing relationships between categories. Sometimes the variable you want to compare across is not numeric, and thus cannot be plotted on the axis of a scatterplot. The following visual provides an easy way to see the difference in median revenue based on the genre of movie.
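A plot like this can be sketched in a few lines. The `genre` column and all of the numbers below are hypothetical, and computing the medians with `groupby` first is just one way to do it, not necessarily how the original plot was made.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data: revenue in millions of dollars, one row per movie.
movies = pd.DataFrame(
    {
        "genre": ["Action", "Action", "Action", "Comedy", "Comedy",
                  "Drama", "Drama", "Drama"],
        "worldwide_gross": [100, 300, 200, 50, 70, 20, 40, 90],
    }
)

# Compute the median revenue for each genre, one row per genre.
medians = movies.groupby("genre")["worldwide_gross"].median().reset_index()

fig, ax = plt.subplots()
sns.barplot(x="genre", y="worldwide_gross", data=medians, ax=ax)
ax.set_xlabel("Genre")
ax.set_ylabel("Median worldwide gross ($ millions)")
```

Computing the medians up front also gives you a small table to check the bar heights against.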
Over the past few years, I realized that I did not have the skills that most employers in my area were desperate for (coding coding coding!), and decided it was time to branch out from the chemical engineering knowledge I had gained in college. I have a smattering of experience with both coding and manual data analysis, and greatly enjoyed the experiences I did have, so data science seems like an ideal opportunity for me. (It also helps that it’s the kind of coding that uses the most math!) Data science is a great opportunity to gain new knowledge and skills while working in ways I already know I am passionate about. And, importantly, that knowledge and those skills will be valuable to more than just myself!
I started researching bootcamps, and even in a field as new as data science, there were so many choices! Flatiron provided resources so I could start learning even before I applied, which was just the first sign of how much they value providing resources to interested students. I really connected with the way they make learning new information accessible to anyone who is willing to seek it out and work to teach themselves.
Given this, I decided to attend Flatiron’s Data Science program to open a new chapter in my career. I originally intended to attend the in-person course in San Francisco, but I was able to switch into the online course when everything everywhere went off the rails because of COVID-19 in March 2020. I definitely appreciate the ability to learn while stuck in social distancing mode, as well as the ability to communicate with course leads and other students; having access to support like that makes such a difference.
I started my course on April 13th, 2020, so I’m only a couple of days in at this point, but I am definitely looking forward to where it takes me!