Class notes, 2025-05-05

Published

May 5, 2025

2024 data

datos_indec <- readRDS("/cloud/project/data/datos_indec.rds")
filename <- "../data/usu_hogar_T324.txt"
file.exists(filename)
[1] TRUE
datos2 <- readr::read_delim(filename, delim = ";")
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
Rows: 16650 Columns: 88
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
chr  (6): CODUSU, MAS_500, IV1_ESP, IV7_ESP, II7_ESP, II8_ESP
dbl (80): ANO4, TRIMESTRE, NRO_HOGAR, REALIZADA, REGION, AGLOMERADO, PONDERA...
num  (1): IPCF
lgl  (1): IV3_ESP

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Raw comparison:

ingresos2024 <- datos2$ITF
ingresos2023 <- datos_indec$ITF
mean(ingresos2023)
[1] 245610.7
mean(ingresos2024)
[1] 769401.6

Visualize

hist(ingresos2023)

hist(ingresos2024)

my_data <- data.frame(
  ingreso = c(ingresos2023,ingresos2024),
  year = c(rep("2023", length(ingresos2023)),
            rep("2024", length(ingresos2024))
  )   
)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(ggplot2)

my_data %>% 
  filter(ingreso > 0) %>% 
  ggplot(aes(ingreso, fill = year))+
  geom_histogram(alpha = .5)+
  scale_x_log10()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
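
The `stat_bin()` message is a reminder that the default of 30 bins is arbitrary. Note also that `geom_histogram()` stacks groups by default, so the two years are not drawn on top of each other. A minimal sketch of setting `bins` explicitly and overlaying the groups with `position = "identity"` follows. The data are simulated with `rlnorm()`, and the name `sim_data` and the value `bins = 40` are assumptions chosen for illustration, not part of the class material.

```r
library(ggplot2)

set.seed(1)
# Simulated stand-in for two years of positive incomes (not the INDEC data)
sim_data <- data.frame(
  ingreso = c(rlnorm(500, meanlog = 12, sdlog = 1),
              rlnorm(500, meanlog = 13, sdlog = 1)),
  year = rep(c("2023", "2024"), each = 500)
)

p <- ggplot(sim_data, aes(ingreso, fill = year)) +
  # position = "identity" overlays the groups instead of stacking them;
  # bins = 40 is an arbitrary choice that silences the stat_bin() message
  geom_histogram(alpha = .5, bins = 40, position = "identity") +
  scale_x_log10()

p
```

With overlaid, semi-transparent bars, a shift between the two distributions is easier to see than in a stacked histogram.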

Question

  • Are there more observations from 2023 or from 2024?

    my_data %>% 
      group_by(year) %>% 
      count()
    # A tibble: 2 × 2
    # Groups:   year [2]
      year      n
      <chr> <int>
    1 2023  16656
    2 2024  16650
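my_data_normalizada <- my_data %>% 
  group_by(year) %>% 
  # Standardize income within each year; scale() returns a one-column matrix,
  # so as.numeric() keeps the new column a plain numeric vector
  mutate(ingreso_normalizado = as.numeric(scale(ingreso)))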

Verify the result

my_data_normalizada %>% 
  filter(year == "2023") %>% 
  pull(ingreso_normalizado) %>% 
  mean()
[1] -1.453445e-17
my_data_normalizada %>% 
  filter(year == "2024") %>% 
  pull(ingreso_normalizado) %>% 
  mean()
[1] -4.013567e-19
my_data_normalizada$ingreso_normalizado %>% sd()
[1] 0.999985

Doing the transformation “by hand”

my_data3 <- data.frame(
  ingreso = c(scale(ingresos2023),scale(ingresos2024)),
  year = c(rep("2023", length(ingresos2023)),
            rep("2024", length(ingresos2024))
  )   
)
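
The call to `scale()` hides the arithmetic. As a sketch of what it computes by default (subtract the mean, divide by the standard deviation), the hypothetical helper `z_score()` below does the same step explicitly and is checked against `scale()`. The input is simulated, not the INDEC data.

```r
# Standardize a vector by hand: subtract the mean, divide by the sd
z_score <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

set.seed(1)
x <- rlnorm(1000, meanlog = 12, sdlog = 1)  # simulated incomes, not INDEC data

z_manual <- z_score(x)
z_scale  <- as.numeric(scale(x))  # scale() returns a one-column matrix

all.equal(z_manual, z_scale)  # TRUE: both approaches give the same values
mean(z_manual)                # approximately 0, up to floating-point error
sd(z_manual)                  # 1, up to floating-point error
```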
my_data_normalizada %>% 
  ggplot(aes(ingreso_normalizado, fill=year, color = year))+
  geom_histogram()+
  scale_x_log10()
Warning in transformation$transform(x): NaNs produced
Warning in scale_x_log10(): log-10 transformation introduced infinite values.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 20196 rows containing non-finite outside the scale range
(`stat_bin()`).

my_data_normalizada %>% 
  ggplot(aes(ingreso_normalizado, color = year))+
  geom_density(alpha = .5)+
  scale_x_log10()
Warning in transformation$transform(x): NaNs produced
Warning in scale_x_log10(): log-10 transformation introduced infinite values.
Warning: Removed 20196 rows containing non-finite outside the scale range
(`stat_density()`).
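
The warnings in the two plots arise because standardized values are centered at zero: every below-average observation is negative, and `log10()` is undefined for negative numbers, so ggplot2 drops those rows. The sketch below reproduces the effect on simulated data and shows one possible alternative, taking the logarithm first and standardizing afterwards. The alternative is a suggestion, not something done in class, and with real income data any zeros would still need to be filtered out before taking logs.

```r
set.seed(1)
x <- rlnorm(1000, meanlog = 12, sdlog = 1)  # simulated positive incomes

z <- as.numeric(scale(x))
sum(z < 0)                           # count of negative standardized values
log_z <- suppressWarnings(log10(z))  # log10 of a negative number is NaN
sum(is.nan(log_z))                   # same count: each negative value is lost

# Alternative: log-transform first, then standardize
z_log <- as.numeric(scale(log10(x)))
sum(!is.finite(z_log))               # 0: no observation is dropped
```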

Standard error

filename <- "../data/usu_individual_T324.txt"
file.exists(filename)
[1] TRUE
my_data4 <- readr::read_delim(filename, delim = ";")
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
Rows: 47564 Columns: 177
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
chr   (6): CODUSU, MAS_500, CH05, CH14, PP04D_COD, PP09C_ESP
dbl (169): ANO4, TRIMESTRE, NRO_HOGAR, COMPONENTE, H15, REGION, AGLOMERADO, ...
num   (1): IPCF
lgl   (1): PP09A_ESP

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
edad <- my_data4$CH06
mean(edad)
[1] 36.47019
N <- length(edad)

SE <- (sd(edad)/sqrt(N))

SE
[1] 0.1017988
conf95 <- c(mean(edad)-SE*1.96, mean(edad) + SE*1.96)
conf95
[1] 36.27066 36.66971
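
The computation above follows the usual formulas for the standard error of a mean and a normal-approximation 95% interval, written out here with the numbers from the output. This treats the records as a simple random sample, which is an assumption of the sketch; 1.96 is the rounded 97.5% quantile of the standard normal distribution.

```latex
\mathrm{SE} = \frac{s}{\sqrt{N}}, \qquad
\mathrm{CI}_{95\%} = \bar{x} \pm 1.96\,\mathrm{SE}

\bar{x} = 36.47019,\quad \mathrm{SE} = 0.1017988,\quad N = 47564

36.47019 \pm 1.96 \times 0.1017988 \approx 36.47019 \pm 0.1995
\;\Rightarrow\; [36.27,\ 36.67]
```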
edad2 <- sample(edad, 100)
mean(edad2)
[1] 37.24
N <- length(edad2)

SE <- (sd(edad2)/sqrt(N))

SE
[1] 2.169351
conf95 <- c(mean(edad2)-SE*1.96, mean(edad2) + SE*1.96)
conf95
[1] 32.98807 41.49193
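
The two computations above differ only in the sample size, so the steps can be wrapped in a function. The following is a minimal sketch: `mean_ci()` is a hypothetical helper, not part of any package, and the ages are simulated rather than taken from the `CH06` column. It uses `qnorm()` instead of the hard-coded 1.96.

```r
# Hypothetical helper: mean, standard error and normal-approximation CI
mean_ci <- function(x, level = 0.95) {
  x <- x[!is.na(x)]                # drop missing values
  n <- length(x)
  se <- sd(x) / sqrt(n)            # standard error of the mean
  z <- qnorm(1 - (1 - level) / 2)  # about 1.96 for level = 0.95
  m <- mean(x)
  c(mean = m, se = se, lower = m - z * se, upper = m + z * se)
}

set.seed(1)
# Simulated ages, a stand-in for the CH06 column (not the INDEC data)
edad_sim <- sample(0:100, 50000, replace = TRUE)

full  <- mean_ci(edad_sim)               # all 50000 observations
small <- mean_ci(sample(edad_sim, 100))  # a subsample of 100

full
small
```

Because the standard error shrinks with the square root of the sample size, the interval from 100 observations is much wider than the one from the full vector, as in the class output.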

What we covered

  • Visualizing distributions

  • The standard error and how to compute it

  • Probability

Homework for next class

  • Choose a variable (other than age) from the INDEC data

  • Visualize it

  • Compute its mean

  • Compute the 95% confidence interval for that mean

Readings

  • Chapter 4

  • Chapter 7