DS Week 4

Data Science for Social Science Research

Najah

Ashoka University

Contents

  • Joining data frames
    • left_join()
    • right_join()
    • full_join()
    • inner_join()
    • anti_join()
  • Sample Case

Joining data frames

  • left_join(): all rows from x
  • right_join(): all rows from y
  • full_join(): all rows from both x and y
  • inner_join(): all rows from x where there are matching values in y, return all combination of multiple matches in the case of multiple matches
  • anti_join(): return all rows from x where there are not matching values in y, never duplicate rows of x

For the next few slides…

id value_x
1 x1
2 x2
3 x3
id value_y
1 y1
2 y2
4 y4

left_join()

id value_x value_y
1 x1 y1
2 x2 y2
3 x3 NA

right_join()

id value_x value_y
1 x1 y1
2 x2 y2
4 NA y4

full_join()

id value_x value_y
1 x1 y1
2 x2 y2
3 x3 NA
4 NA y4

inner_join()

id value_x value_y
1 x1 y1
2 x2 y2

anti_join()

id value_x
3 x3

Sample Case

Percentage of the STEM graduates among the total direct recruits in the civil services for every year?

  • STEM - engineering and natural science, natural sciences and Medical sciences

  • First degree

  • Degree at the time of entry

calling the library

library(tidyverse)

reading the data

profile = read.csv("./slides/ias-profile.csv")

education = read.csv("./data/ias-education.csv")

Overview

glimpse(education)
Rows: 22,149
Columns: 9
$ ID                  <chr> "TN003700", "TN003700", "UP011100", "UP011100", "U…
$ Name                <chr> "Ms. Anna R Malhotra", "Ms. Anna R Malhotra", "Shr…
$ Cadre               <chr> "Tamil Nadu", "Tamil Nadu", "Uttar Pradesh", "Utta…
$ Reference_Value     <int> 0, 1, 0, 1, 2, 3, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,…
$ Qualification       <chr> "Graduate", "P.G.", "Certificate", "Graduate", "P.…
$ Subject             <chr> "English Lit", "N.A.", "French", "Physics", "Econo…
$ Category_of_Subject <chr> "Languages and Fine Arts", "N.A.", "Languages and …
$ Division            <chr> "Ordinary", "First", "Ordinary", "Second", "Ordina…
$ Source              <chr> "Supremo", "Supremo", "Supremo", "Supremo", "Supre…

filter to the first degree

edu_ug = education %>% filter(Reference_Value ==0)
ID Name Cadre Reference_Value Qualification Subject Category_of_Subject Division Source
TN003700 Ms. Anna R Malhotra Tamil Nadu 0 Graduate English Lit Languages and Fine Arts Ordinary Supremo
UP011100 Shri Pk Kaul Uttar Pradesh 0 Certificate French Languages and Fine Arts Ordinary Supremo
MH006800 Shri B G Deshmukh Maharashtra 0 Graduate Botany Natural Sciences Ordinary Supremo
TN003900 Shri Ram Chander Tamil Nadu 0 Graduate Law Governance, Law and Management Third Supremo
AP004800 Shri Gv Ramakrishna Andhra Pradesh 0 Graduate Chemistry Natural Sciences First Supremo
BH005400 Shri N Nagamani Bihar 0 Graduate Mathematics Engineering, Technology and Mathematics First Supremo

Join the data

profile_edu = inner_join(profile, edu_ug, by = "ID")

Summary table

edu_summary = profile_edu %>% 
  
  ## limit data to direct recruitment
  filter(Source_of_Recruitment == "Direct Recruitment") %>% 
  
  # select relevant vars
  select(ID, Category_of_Subject) %>% 
  
  ## add a new column stem_grad
  mutate(stem_grad = ifelse(Category_of_Subject %in% c("Engineering,
                                                       Technology and Mathematics", "Natural Sciences","Medical" )  , "yes", "no")) %>% 
  
  # grouping the var  
  group_by(stem_grad) %>% 
  # counting by group
  summarise(count = n())
stem_grad count
no 5896
yes 1241

manipulations on the summary table

edu_summary %>% 
  mutate(
    
    # adding up the count
    sum_count = sum(count),
    
    #getting the fraction
   frac = count/sum_count,
   
   # converting to percentage
   perc = frac*100,
   
   # rounding the perc value
   pc_round = round(perc,2)) %>% 
  
  select(stem_grad, pc_round) %>% 
  # renaming the cols
  rename("STEM Graduates" = stem_grad,
         "Percentage" = pc_round) 
STEM Graduates Percentage
no 82.61
yes 17.39