Class notes 2025-05-05

Published

May 5, 2025

2024 data

datos_indec <- readRDS("/cloud/project/data/datos_indec.rds")
filename <- "../data/usu_hogar_T324.txt"
file.exists(filename)

[1] TRUE

datos2 <- readr::read_delim(filename, delim = ";")

Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)

Rows: 16650 Columns: 88
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
chr  (6): CODUSU, MAS_500, IV1_ESP, IV7_ESP, II7_ESP, II8_ESP
dbl (80): ANO4, TRIMESTRE, NRO_HOGAR, REALIZADA, REGION, AGLOMERADO, PONDERA...
num  (1): IPCF
lgl  (1): IV3_ESP

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Raw comparison:

# ITF: total household income (ingreso total familiar)
ingresos2024 <- datos2$ITF
ingresos2023 <- datos_indec$ITF

mean(ingresos2023)

[1] 245610.7

mean(ingresos2024)

[1] 769401.6

Visualize

hist(ingresos2023)

hist(ingresos2024)

my_data <- data.frame(
  ingreso = c(ingresos2023,ingresos2024),
  year = c(rep("2023", length(ingresos2023)),
            rep("2024", length(ingresos2024))
  )   
)

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(ggplot2)

my_data %>% 
  filter(ingreso > 0) %>% 
  ggplot(aes(ingreso, fill = year))+
  geom_histogram(alpha = .5)+
  scale_x_log10()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
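
The message above asks for an explicit bin width. A minimal sketch of how that could look, using synthetic log-normal data rather than the INDEC incomes (the `toy` data frame and its parameters are assumptions made for illustration):

```r
library(ggplot2)

# Synthetic stand-in for the income data (log-normal, two groups); not the INDEC values
set.seed(1)
toy <- data.frame(
  ingreso = c(rlnorm(500, meanlog = 12, sdlog = 0.8),
              rlnorm(500, meanlog = 13, sdlog = 0.8)),
  year = rep(c("2023", "2024"), each = 500)
)

# With scale_x_log10(), binwidth is measured in log10 units:
# 0.1 means each bar spans a factor of 10^0.1 (about 1.26) in income.
# position = "identity" overlays the two groups instead of stacking them.
p <- ggplot(toy, aes(ingreso, fill = year)) +
  geom_histogram(binwidth = 0.1, alpha = 0.5, position = "identity") +
  scale_x_log10()
p
```

Setting `binwidth` silences the `stat_bin()` message. The default position for `geom_histogram()` is stacking, so `position = "identity"` is what makes `alpha` useful when comparing two years.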

Question

Are there more observations from 2023 or from 2024?

my_data %>% 
  group_by(year) %>% 
  count()

# A tibble: 2 × 2
# Groups:   year [2]
  year      n
  <chr> <int>
1 2023  16656
2 2024  16650

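# Standardize income within each year (mean 0, sd 1 per group).
# scale() returns a one-column matrix; as.numeric() keeps the column a plain vector.
my_data_normalizada <- my_data %>% 
  group_by(year) %>% 
  mutate(ingreso_normalizado = as.numeric(scale(ingreso)))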

Check the result

my_data_normalizada %>% 
  filter(year == "2023") %>% 
  pull(ingreso_normalizado) %>% 
  mean()

[1] -1.453445e-17

my_data_normalizada %>% 
  filter(year == "2024") %>% 
  pull(ingreso_normalizado) %>% 
  mean()

[1] -4.013567e-19

my_data_normalizada$ingreso_normalizado %>% sd()

[1] 0.999985
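
The pooled standard deviation is slightly below 1 even though each year has standard deviation exactly 1. A sketch of why, using synthetic data with the same group sizes as above (the distributions are arbitrary assumptions, only the sizes come from the notes):

```r
# Synthetic data with the same group sizes as the notes; the values are made up
set.seed(1)
n1 <- 16656
n2 <- 16650
z <- c(as.numeric(scale(rnorm(n1, 50, 10))),
       as.numeric(scale(rexp(n2))))

# Each group has mean 0 and sd 1, so the pooled sum of squares is (n1 - 1) + (n2 - 1),
# while sd() on the combined vector divides by (n1 + n2 - 1)
pooled_sd   <- sd(z)
expected_sd <- sqrt((n1 + n2 - 2) / (n1 + n2 - 1))

pooled_sd
expected_sd
```

The two values should agree up to floating-point error, and the small gap from 1 comes only from the denominators.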

Do the transformation “by hand”

my_data3 <- data.frame(
  ingreso = c(scale(ingresos2023),scale(ingresos2024)),
  year = c(rep("2023", length(ingresos2023)),
            rep("2024", length(ingresos2024))
  )   
)
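
What `scale()` computes for a single vector can be written out explicitly. A small sketch with made-up numbers (the vector `x` is an assumption, not course data):

```r
# Made-up values standing in for one year's incomes
x <- c(120, 80, 200, 150, 95)

# z-score by hand: subtract the mean, divide by the sample standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() does the same, but returns a one-column matrix with attributes
z_scale <- as.numeric(scale(x))

all.equal(z_manual, z_scale)
```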

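# Caution: standardized values are zero or negative at or below the group mean,
# so a log10 axis cannot display them; this is the source of the warnings below.
my_data_normalizada %>% 
  ggplot(aes(ingreso_normalizado, fill = year, color = year))+
  geom_histogram()+
  scale_x_log10()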

Warning in transformation$transform(x): NaNs produced

Warning in scale_x_log10(): log-10 transformation introduced infinite values.

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Warning: Removed 20196 rows containing non-finite outside the scale range
(`stat_bin()`).

my_data_normalizada %>% 
  ggplot(aes(ingreso_normalizado, color = year))+
  geom_density(alpha = .5)+
  scale_x_log10()

Warning in transformation$transform(x): NaNs produced

Warning in scale_x_log10(): log-10 transformation introduced infinite values.

Warning: Removed 20196 rows containing non-finite outside the scale range
(`stat_density()`).
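
The warnings above come from taking the log of standardized values, many of which are negative. One possible alternative, sketched here on synthetic data, is to take logs first and standardize afterwards, then plot on a linear axis (the `toy` data frame and the `log_z` column are hypothetical names):

```r
library(ggplot2)

# Synthetic positive incomes for two years; not the INDEC values
set.seed(1)
toy <- data.frame(
  ingreso = c(rlnorm(500, 12, 0.8), rlnorm(500, 13, 0.8)),
  year = rep(c("2023", "2024"), each = 500)
)

# Take logs first (all values are positive here), then standardize within each year
toy$log_z <- ave(log10(toy$ingreso), toy$year,
                 FUN = function(v) as.numeric(scale(v)))

# Standardized values can be negative, so keep a linear x axis
p <- ggplot(toy, aes(log_z, color = year)) +
  geom_density()
p
```

With real income data, zeros would need to be filtered out before the log step, as was done earlier with `filter(ingreso > 0)`.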

Standard error

filename <- "../data/usu_individual_T324.txt"
file.exists(filename)

[1] TRUE

my_data4 <- readr::read_delim(filename, delim = ";")

Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)

Rows: 47564 Columns: 177
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
chr   (6): CODUSU, MAS_500, CH05, CH14, PP04D_COD, PP09C_ESP
dbl (169): ANO4, TRIMESTRE, NRO_HOGAR, COMPONENTE, H15, REGION, AGLOMERADO, ...
num   (1): IPCF
lgl   (1): PP09A_ESP

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# CH06 is the age variable
edad <- my_data4$CH06

mean(edad)

[1] 36.47019

N <- length(edad)

# Standard error of the mean: sd / sqrt(n)
SE <- sd(edad) / sqrt(N)

SE

[1] 0.1017988

# Approximate 95% confidence interval: mean ± 1.96 standard errors
conf95 <- c(mean(edad) - SE * 1.96, mean(edad) + SE * 1.96)

conf95

[1] 36.27066 36.66971
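
The same steps can be wrapped in a small function. This is a sketch only: `mean_ci` is a hypothetical helper, not something from the course materials, and it uses the normal approximation with `qnorm()` in place of the hard-coded 1.96.

```r
# Hypothetical helper: normal-approximation confidence interval for a mean
mean_ci <- function(x, level = 0.95) {
  x  <- x[!is.na(x)]                 # drop missing values
  se <- sd(x) / sqrt(length(x))      # standard error of the mean
  z  <- qnorm(1 - (1 - level) / 2)   # about 1.96 when level = 0.95
  c(lower = mean(x) - z * se, upper = mean(x) + z * se)
}

# Made-up ages, only to show the call
ages <- c(23, 35, 41, 29, 62, 18, 47, 55, 33, 38)
mean_ci(ages)
```

With a sample this small, a t-based interval (for example from `t.test(ages)$conf.int`) would be somewhat wider; the normal approximation is reasonable for samples the size of the ones in these notes.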

# Random sample of 100 ages; no seed is set, so the values below change on each run
edad2 <- sample(edad, 100)

mean(edad2)

[1] 37.24

N <- length(edad2)

# Same formula, now with n = 100
SE <- sd(edad2) / sqrt(N)

SE

[1] 2.169351

conf95 <- c(mean(edad2) - SE * 1.96, mean(edad2) + SE * 1.96)

conf95

[1] 32.98807 41.49193
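
Going from 47564 observations to 100 made the standard error about 21 times larger (0.1018 versus 2.169), roughly the square root of the ratio of sample sizes. A sketch of that relationship on a synthetic population (the uniform "ages" and the helper `se_of_mean` are assumptions for illustration):

```r
set.seed(42)
# Synthetic "population" of ages; not the INDEC data
population <- round(runif(50000, min = 0, max = 90))

# Standard error of the mean for a vector
se_of_mean <- function(x) sd(x) / sqrt(length(x))

# Draw one sample per size and compute its standard error
sizes <- c(100, 1000, 10000)
se_by_size <- sapply(sizes, function(n) se_of_mean(sample(population, n)))

data.frame(n = sizes, se = se_by_size)
```

Each tenfold increase in sample size should shrink the standard error by a factor of about sqrt(10), a little over 3.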

What we covered

Visualizing distributions
Standard error and how to compute it
Probability

Homework for next class

Choose a variable (other than age) from the INDEC data
Visualize it
Compute its mean
Compute the 95% confidence interval

Readings

Chapter 4
Chapter 7