Analyzing StackOverflow Survey 2019 and 2020

Yogesh Kumar
8 min read · Jan 11, 2021

Source: ElegantThemes

This is my first ever blog post. In it, I use the datasets from the StackOverflow Surveys of 2019 and 2020 (https://insights.stackoverflow.com/survey) for my first project in Udacity’s Data Science Nanodegree, which involves following the CRISP-DM approach to answer a set of questions.

CRISP-DM stands for Cross-Industry Standard Process for Data Mining. It has 6 steps:
1) Business Understanding
2) Data Understanding
3) Data Preparation
4) Modeling the Data
5) Evaluating the results
6) Deployment

1. Business Understanding

In this step, I formulated the business questions that I wanted to answer. My main motivation for the project was to answer the following questions:
- Which languages saw the biggest rise in usage from 2019–2020?
- Has the deciding factor ‘Remote Work / Work from home’ increased over the years and due to the pandemic?
- When did most people write their first line of code?
- Which group of people use StackOverflow for their work the most?
- What factors are associated with a high salary?

2. Data Understanding

In this phase, I went through the data to see what it looked like and picked the columns that I needed for my analysis.
I used two datasets, from the years 2019 and 2020.
The columns that I found most useful for answering my questions were:

Age1stCode
LanguageWorkedWith
LanguageDesireNextYear
SOVisitFreq
JobFactors
YearsCodePro
OrgSize
EdLevel
DevType

Here is a description of the columns that I am interested in.

Descriptions of each selected column
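The column selection described above can be sketched with pandas roughly as follows. This is a minimal illustration, not the exact code from the project: the helper name `select_columns` and the file name in the comment are assumptions, `SOVisitFreq` is the spelling used in the public survey files, and the toy frame contains made-up rows just to show the call.

```python
import pandas as pd

# Columns of interest, as listed above.
COLUMNS = [
    "Age1stCode", "LanguageWorkedWith", "LanguageDesireNextYear",
    "SOVisitFreq", "JobFactors", "YearsCodePro", "OrgSize",
    "EdLevel", "DevType",
]


def select_columns(df, columns=COLUMNS):
    """Keep only the requested columns that are present in the frame."""
    keep = [c for c in columns if c in df.columns]
    return df[keep].copy()


# With the real data this would be something like (hypothetical file name):
# df_2019 = select_columns(pd.read_csv("survey_results_public_2019.csv"))

# Toy frame with made-up rows, only to demonstrate the helper.
toy = pd.DataFrame({
    "Respondent": [1, 2],
    "Age1stCode": ["14", "16"],
    "OrgSize": ["20 to 99 employees", None],
})
selected = select_columns(toy)
```

Columns that are missing from a given year are skipped silently here, which keeps the same helper usable for both survey files.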

3. Data Preparation

In this phase, I prepared the data by cleaning it for analysis in the later steps. I used the tool pandas_profiling, which produces a nice exploratory data analysis report on all the columns in the data. Below are a few screenshots of the result from this tool.

Output Overview from pandas_profiling tool
This shows the basic details about the columns of interest
This shows the correlations between the columns of interest

Based on this report, I then removed the duplicate rows from the table, which would otherwise have led to slightly incorrect answers.
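Removing duplicate rows can be done with a single pandas call. The sketch below is illustrative only: the helper name `drop_duplicate_rows` is an assumption and the toy rows are made up.

```python
import pandas as pd


def drop_duplicate_rows(df):
    """Return a copy without fully duplicated rows, plus the number removed."""
    deduped = df.drop_duplicates().reset_index(drop=True)
    return deduped, len(df) - len(deduped)


# Toy frame with made-up rows: the first two rows are identical.
toy = pd.DataFrame({
    "Age1stCode": ["14", "14", "16"],
    "OrgSize": ["2-9 employees", "2-9 employees", "2-9 employees"],
})
clean, removed = drop_duplicate_rows(toy)
```

One caveat: if the respondent identifier has already been dropped, two different respondents who gave identical answers in the remaining columns are also treated as duplicates, so it may be safer to deduplicate before selecting columns.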
Now, on to answering the questions.

Q1. Which languages saw the biggest rise in usage from 2019–2020?

From the charts above, it is evident that the Top 10 languages have stayed the same:

  • C
  • Java
  • HTML/CSS
  • SQL
  • Python
  • Bash/Shell/PowerShell
  • C#
  • PHP
  • C++
  • R

But the question was which languages climbed a few places in the ranking over the year. They are:

  • Go
  • Swift

I myself have been interested in learning one of these languages (Go).
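A rank comparison like the one above could be computed along these lines. This is a sketch under the assumption that `LanguageWorkedWith` holds semicolon-separated answers; the function names are illustrative and the toy frames are made up.

```python
import pandas as pd


def language_ranks(df, column="LanguageWorkedWith"):
    """Rank languages by the share of respondents listing them (1 = most used)."""
    answered = df[column].dropna()
    # Each answer is a semicolon-separated string such as "Python;SQL;Go".
    counts = answered.str.split(";").explode().str.strip().value_counts()
    share = counts / len(answered)
    return share.rank(ascending=False, method="min").astype(int)


def rank_changes(df_old, df_new, column="LanguageWorkedWith"):
    """Positive values mean the language moved up between the two surveys."""
    old = language_ranks(df_old, column)
    new = language_ranks(df_new, column)
    # Only languages present in both years can be compared.
    change = (old - new).dropna().astype(int)
    return change.sort_values(ascending=False)


# Toy frames with made-up answers.
toy_old = pd.DataFrame({"LanguageWorkedWith": ["Python;SQL", "Python;SQL", "Python;Go"]})
toy_new = pd.DataFrame({"LanguageWorkedWith": ["Python;Go", "Python;Go", "Python;SQL"]})
changes = rank_changes(toy_old, toy_new)
```

Splitting on the delimiter compares whole language names, so "Java" is not counted inside "JavaScript" and "C" is not counted inside "C++" or "C#", which a plain substring search would do.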

Q2. Has the deciding factor ‘Remote Work / Work from home’ increased over the years and due to the pandemic?

The two graphs above look similar, although the ranges on the y-axis differ. Hence, I looked at the relative frequency of each answer to this question instead.

From the graphs above, it seems that the share of respondents choosing the job factor ‘Remote Work Options’ stayed roughly the same, at around 28–29%. This is likely because the 2020 data was collected in February 2020 (when the Covid pandemic had just started). This might change in the next survey, which is due in the coming months.
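The relative frequency used here, which makes two years with different respondent counts comparable, can be sketched as below. The helper name `job_factor_share` and the answer label "Remote work options" are assumptions to be checked against the actual data, and the toy answers are made up.

```python
import pandas as pd


def job_factor_share(df, factor, column="JobFactors"):
    """Share of respondents who answered the question and picked `factor`."""
    answered = df[column].dropna()
    # Answers are semicolon-separated; compare whole items, not substrings.
    picked = answered.str.split(";").apply(
        lambda items: factor in [item.strip() for item in items]
    )
    return picked.mean()


# Toy frame with made-up answers; one respondent skipped the question.
toy = pd.DataFrame({"JobFactors": [
    "Remote work options;Office environment or company culture",
    "Office environment or company culture",
    None,
    "Remote work options",
]})
share = job_factor_share(toy, "Remote work options")
```

Dividing by the number of people who answered the question, rather than by all rows, keeps missing answers from pulling the share down.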

Q3. When did most people write their first line of code?

It can be seen from the graphs (which follow almost the same pattern, except for ages 14 and 16 by a small margin) that most people write their first line of code during their teenage years (13–19 years), and that some people started coding quite early in life (at the age of 12).
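A distribution like this can be derived from `Age1stCode` roughly as follows. This is a sketch with an illustrative function name and made-up toy values; it assumes that non-numeric answers (for example "Younger than 5 years") are simply left out.

```python
import pandas as pd


def first_code_age_distribution(df, column="Age1stCode"):
    """Relative frequency of each numeric age at which respondents first coded."""
    # Non-numeric answers become NaN and are dropped.
    ages = pd.to_numeric(df[column], errors="coerce").dropna().astype(int)
    return ages.value_counts(normalize=True).sort_index()


# Toy frame with made-up answers.
toy = pd.DataFrame({"Age1stCode": ["14", "16", "14", "Younger than 5 years", None, "12"]})
dist = first_code_age_distribution(toy)

# Share of respondents who started between 13 and 19 years of age.
teen_share = dist[(dist.index >= 13) & (dist.index <= 19)].sum()
```

Because the result is normalised, the 2019 and 2020 distributions can be plotted on the same axis even though the number of respondents differs.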

Q4. Which group of people use StackOverflow for their work the most?

2019
2020

The above charts show that people with more years of coding experience visit StackOverflow less often, while people starting their careers visit it multiple times a day (this is as expected).
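One way to relate experience to visit frequency is a row-normalised cross-tabulation. The sketch below is an assumption about how this could be done, not the project's code: the experience bands, the function name, and the handling of the text answers in `YearsCodePro` are my own choices, and the toy rows are made up.

```python
import pandas as pd

BAND_LABELS = ["0-2", "3-5", "6-10", "11-20", "21+"]


def visit_frequency_by_experience(df):
    """Row-normalised table: experience band x StackOverflow visit frequency."""
    # Map the two text answers to numbers so the column can be binned.
    years = df["YearsCodePro"].replace(
        {"Less than 1 year": "0", "More than 50 years": "51"}
    )
    years = pd.to_numeric(years, errors="coerce")
    bands = pd.cut(years, bins=[-1, 2, 5, 10, 20, 60], labels=BAND_LABELS)
    table = pd.crosstab(bands.astype(object), df["SOVisitFreq"], normalize="index")
    # Keep the bands in their natural order instead of alphabetical order.
    return table.reindex([b for b in BAND_LABELS if b in table.index])


# Toy frame with made-up answers.
toy = pd.DataFrame({
    "YearsCodePro": ["Less than 1 year", "1", "2", "15", "25", None],
    "SOVisitFreq": [
        "Multiple times per day", "Multiple times per day", "Daily or almost daily",
        "A few times per week", "A few times per week", "Daily or almost daily",
    ],
})
table = visit_frequency_by_experience(toy)
```

Each row of the table sums to 1, so it reads as "of the people in this experience band, what fraction visits this often".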

Q5. What factors are associated with a high salary?

I am only posting the graphs for 2019, since they are really close to the 2020 graphs.

Salaries for Different Jobs
Salaries for different Organization sizes
Salaries for different degrees

As can be seen from the DevType graphs, both years suggest that Engineering Managers, followed by Senior Executives and Site Reliability Engineers, are paid the most.

From the Organization graph, it can be seen that the bigger the company, the higher the income. Sometimes, being a freelancer might provide you with a salary equivalent to that of working in a mid-size company.

From the Education Level graph, it can be seen that the top-paid degrees are Ph.D., Bachelor's, and Master's. However, based on this data, we cannot conclude that just having a Bachelor's degree would give you a higher salary or that pursuing a Master's degree might reduce it. The data only provides the median salary of all people with the respective degree (or no degree). Also, one might have 10 years of experience with just a Bachelor's degree and earn more than a person with a Master's degree who has just started their career. So, these salary values should not be taken as absolute, but as what your salary could be if you have the degree and work for some years to get there. That’s just my 2 cents on this; please feel free to correct me.
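The median salaries per group discussed above could be computed with a helper like the one below. This is a sketch under assumptions: `ConvertedComp` is taken to be the salary column (it is not in the column list above), the function name is illustrative, and the toy numbers are made up.

```python
import pandas as pd


def median_salary_by(df, group_column, salary_column="ConvertedComp", multi=False):
    """Median salary and respondent count per category of `group_column`."""
    data = df[[group_column, salary_column]].dropna()
    if multi:
        # Multi-select answers such as DevType are semicolon-separated;
        # a respondent counts once for every role they selected.
        split = data[group_column].str.split(";")
        data = data.assign(**{group_column: split}).explode(group_column)
    summary = data.groupby(group_column)[salary_column].agg(["median", "count"])
    return summary.sort_values("median", ascending=False)


# Toy frame with made-up salaries.
toy = pd.DataFrame({
    "DevType": [
        "Engineering manager;Developer, back-end",
        "Developer, back-end",
        "Engineering manager",
        None,
    ],
    "ConvertedComp": [100.0, 60.0, 120.0, 80.0],
})
summary = median_salary_by(toy, "DevType", multi=True)
```

For single-answer columns such as `OrgSize` or `EdLevel`, the same helper can be called with `multi=False`. Reporting the count next to the median makes it easier to spot groups that are too small to say much about.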

4. Modeling the Data

There are a lot of missing values in various columns. I am not removing or imputing any of them, as I am not building a model.
Since all of my questions were answered during the data analysis stage, I did not model any data.

5. Evaluate the Results

To summarize:

Which languages saw the biggest rise in usage from 2019–2020?

  • The languages that have jumped a few places over the year are Go and Swift.

Has the deciding factor ‘Remote Work / Work from home’ increased over the years and due to the pandemic?

  • It seems that the share of respondents choosing the job factor ‘Remote Work Options’ stayed roughly the same, at around 28–29%. This is likely because the 2020 data was collected in February 2020 (when the Covid pandemic had just started). This might change in the next survey, which is due in the coming months.

When did most people write their first line of code?

  • Most people write their first line of code during their teenage years (13–19 years), and some people started coding quite early in life (at the age of 12). There are also people who start coding after their mid-50s, although very few of them.

Which group of people use StackOverflow for their work the most?

  • People with more years of coding experience visit StackOverflow less often, whereas people starting their careers visit it multiple times a day.

What factors are associated with a high salary?

  • As can be seen from the DevType graphs (2019 & 2020), both years suggest that Engineering Managers, followed by Senior Executives and Site Reliability Engineers, are paid the most.
  • From the Organization graphs, it can be seen that the bigger the company, the higher the income. Sometimes, being a freelancer might provide you with a salary equivalent to that of working in a mid-size company.
  • From the Education Level graphs, it can be seen that the top-paid degrees are Ph.D., Bachelor's, and Master's. However, based on this data, we cannot conclude that just having a Bachelor's degree would give you a higher salary or that pursuing a Master's degree might reduce it. The data only provides the median salary of all people with the respective degree (or no degree). Also, one might have 10 years of experience with just a Bachelor's degree and earn more than a person with a Master's degree who has just started their career. So, these salary values should not be taken as absolute, but as what your salary could be if you have the degree and work for some years to get there. That’s just my 2 cents on this; please feel free to correct me.
  • Thus, based on the data above, I would say that experience (years) and company size play a bigger role in determining your salary, given that you have at least one degree. I am not taking into account people who do not have a degree and still get paid more, as I think they are outliers (not everyone could do that).

6. Deploy

This is the stage where I wrote this article to communicate my results to you.

In this project, I have done mostly exploratory analysis. In upcoming projects, I will use modeling techniques to perform explanatory analysis and then present the results.

For a deeper look into the code, please visit my GitHub.

I hope you enjoyed it. Thank you very much for reading this.

Yogesh Kumar

A CS Masters Student at the Saarland University