R data import (DBI / HTTP data download) - DataCamp course notes


1. Connecting to a MySQL database with DBI

> # Load the DBI package
> library(DBI)
> 
> # Connect to the MySQL database: con
> con <- dbConnect(RMySQL::MySQL(), 
+                  dbname = "tweater", 
+                  host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com", 
+                  port = 3306,
+                  user = "student",
+                  password = "datacamp")
> 
> # Build a vector of table names: tables
> tables <- dbListTables(con)
> 
> # Display structure of tables
> str(tables)
 chr [1:3] "comments" "tweats" "users"

The DBI package provides a standard interface so that database-backend packages such as RMySQL can all be used through the same set of functions (the sketch below swaps in a different backend).

  • dbListTables: returns the names of the tables in the connected database.
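
Because the interface is standardized, the same DBI verbs work against any backend. A minimal sketch, assuming the RSQLite package is installed (not part of the course):

library(DBI)

# Same DBI verbs, different backend: an in-memory SQLite database
con_lite <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
dbWriteTable(con_lite, "mtcars", mtcars)
dbListTables(con_lite)   # "mtcars"
dbDisconnect(con_lite)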

2. Importing a table from the database

> # Import the users table from tweater: users
> users <- dbReadTable(con, "users")
> 
> # Print users
> users
  id      name     login
1  1 elisabeth  elismith
2  2      mike     mikey
3  3      thea   teatime
4  4    thomas tomatotom
5  5    oliver olivander
6  6      kate  katebenn
7  7    anjali    lianja
  • dbReadTable: in the code above, reads the "users" table into R as a data frame.

3. Importing every table in the database into a list

> # Get table names
> table_names <- dbListTables(con)
> 
> # Import all tables
> # tables <- lapply(table_names, dbReadTable, conn = con)  # how the course does it
> tables <- lapply(table_names, FUN = function(table_name){
+   dbReadTable(name = table_name, conn = con)})
> 
> # Print out tables
> tables
[[1]]
     id tweat_id user_id            message
1  1022       87       7              nice!
2  1000       77       7             great!
3  1011       49       5            love it
4  1012       87       1   awesome! thanks!
5  1010       88       6              yuck!
6  1026       77       4      not my thing!
7  1004       49       1  this is fabulous!
8  1030       75       6           so easy!
9  1025       88       2             oh yes
10 1007       49       3           serious?
11 1020       77       1 couldn't be better
12 1014       77       1       saved my day

[[2]]
  id user_id
1 75       3
2 88       4
3 77       6
4 87       5
5 49       1
6 24       7
                                                                 post
1                                       break egg. bake egg. eat egg.
2                           wash strawberries. add ice. blend. enjoy.
3                       2 slices of bread. add cheese. grill. heaven.
4               open and crush avocado. add shrimps. perfect starter.
5 nachos. add tomato sauce, minced meat and cheese. oven for 10 mins.
6                              just eat an apple. simply and healthy.
        date
1 2015-09-05
2 2015-09-14
3 2015-09-21
4 2015-09-22
5 2015-09-22
6 2015-09-24

[[3]]
  id      name     login
1  1 elisabeth  elismith
2  2      mike     mikey
3  3      thea   teatime
4  4    thomas tomatotom
5  5    oliver olivander
6  6      kate  katebenn
7  7    anjali    lianja
  • lapply() combined with dbListTables() imports every table in the database; a named-list variant is sketched below.
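
A small convenience, my addition rather than the course's: naming the list by table makes each data frame addressable by name.

tables <- setNames(lapply(table_names, dbReadTable, conn = con), table_names)
tables$users   # same result as dbReadTable(con, "users")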

4. Sending SQL queries to the database

> # Import tweat_id column of comments where user_id is 1: elisabeth
> elisabeth <- dbGetQuery(con, "SELECT tweat_id FROM comments WHERE user_id = 1")
> 
> # Print  elisabeth
> elisabeth
  tweat_id
1       87
2       49
3       77
4       77
  • dbGetQuery() sends an SQL query to the database and returns the result as a data frame.
  • Note on WHERE conditions: most programming languages test equality with "==", but SQL uses a single "=" (a parameterized alternative is sketched below).
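
Building queries by pasting values into strings invites SQL injection. DBI also accepts a params argument for parameterized queries, assuming a backend that supports them (RMariaDB does; older RMySQL support is patchy). A sketch, not from the course:

# The placeholder syntax (? here) depends on the backend
elisabeth <- dbGetQuery(con,
                        "SELECT tweat_id FROM comments WHERE user_id = ?",
                        params = list(1))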

5. Sending SQL queries (date conditions)

> # Import the post column of tweats where date is later than '2015-09-21': latest
> latest <- dbGetQuery(con, "SELECT post FROM tweats WHERE date > '2015-09-21'")
> 
> # Print latest
> latest
                                                                 post
1               open and crush avocado. add shrimps. perfect starter.
2 nachos. add tomato sauce, minced meat and cheese. oven for 10 mins.
3                              just eat an apple. simply and healthy.
> 
  • Returns only the rows that satisfy the date condition.

6. Sending SQL queries (AND)

> # Create data frame specific
> specific <- dbGetQuery(con, "SELECT message FROM comments WHERE tweat_id = 77 AND user_id > 4")
> 
> # Print specific
> specific
  message
1  great!
  • Combining conditions with AND.

7. Sending SQL queries (names shorter than five characters)

> # Create data frame short
> short <- dbGetQuery(con, "SELECT id, name FROM users WHERE char_length(name) < 5")
> 
> # Print short
> short
  id name
1  2 mike
2  3 thea
3  6 kate
  • char_length(name): the character length of the name column.

8. DBI

1) dbGetQuery() is a DBI function; when you call it, the actual work is carried out by the backend package, here RMySQL.

 

2) Under the hood, the query runs in three steps:

  • dbSendQuery
  • dbFetch
  • dbClearResult

dbSendQuery() sends the query, and dbFetch() retrieves rows from the result set. Every fetch consumes the rows it returns, so after repeated dbFetch() calls have pulled everything, the result set is empty. When you are done, dbClearResult() releases the result set and returns TRUE. A chunked variant is sketched after the transcript below.

> res <- dbSendQuery(con, "SELECT * FROM comments WHERE user_id > 4")
> # Use dbFetch() twice
> dbFetch(res, n = 2)
    id tweat_id user_id message
1 1022       87       7   nice!
2 1000       77       7  great!
> dbFetch(res)
    id tweat_id user_id  message
1 1011       49       5  love it
2 1010       88       6    yuck!
3 1030       75       6 so easy!
> # Clear res
> dbClearResult(res)
[1] TRUE
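
The same three-step pattern scales to results too large for memory by fetching in chunks; a sketch using DBI's dbHasCompleted(), assuming con is still open:

res <- dbSendQuery(con, "SELECT * FROM comments")
while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 5)   # pull 5 rows at a time
  print(nrow(chunk))             # process each chunk here
}
dbClearResult(res)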

3) Disconnecting when you are done

> long_tweats <- dbGetQuery(con, "SELECT post, date FROM tweats WHERE char_length(post) >40")
> # Print long_tweats
> print(long_tweats)
                                                                 post       date
1                           wash strawberries. add ice. blend. enjoy. 2015-09-14
2                       2 slices of bread. add cheese. grill. heaven. 2015-09-21
3               open and crush avocado. add shrimps. perfect starter. 2015-09-22
4 nachos. add tomato sauce, minced meat and cheese. oven for 10 mins. 2015-09-22
> # Disconnect from the database
> dbDisconnect(con)
[1] TRUE
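
A defensive pattern, my addition rather than the course's: wrapping connect, query, and disconnect in one function guarantees the connection is released even if the query errors.

query_tweater <- function(sql) {
  con <- dbConnect(RMySQL::MySQL(),
                   dbname = "tweater",
                   host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
                   port = 3306, user = "student", password = "datacamp")
  on.exit(dbDisconnect(con), add = TRUE)  # runs even if dbGetQuery() fails
  dbGetQuery(con, sql)
}

query_tweater("SELECT COUNT(*) FROM comments")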

1. Importing data from the web with readr

# Load the readr package 
library(readr) 

# Import the csv file: pools 
url_csv <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/swimming_pools.csv" 
pools <- read_csv(url_csv) 

# Import the txt file: potatoes 
url_delim <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/potatoes.txt" 
potatoes <- read_tsv(url_delim) 

# Print pools and potatoes 
print(pools) 
print(potatoes)

2. utils::read.csv vs readr::read_csv

> # https URL to the swimming_pools csv file.
> url_csv <- "https://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/swimming_pools.csv"
> # Import the file using read.csv(): pools1
> pools1 <- read.csv(url_csv)
> # Load the readr package
> library(readr)
> # Import the file using read_csv(): pools2
> pools2 <- read_csv(url_csv)
Parsed with column specification:
cols(
  Name = col_character(),
  Address = col_character(),
  Latitude = col_double(),
  Longitude = col_double()
)
> # Print the structure of pools1 and pools2
> str(pools1)
'data.frame':	20 obs. of  4 variables:
 $ Name     : Factor w/ 20 levels "Acacia Ridge Leisure Centre",..: 1 2 3 4 5 6 19 7 8 9 ...
 $ Address  : Factor w/ 20 levels "1 Fairlead Crescent, Manly",..: 5 20 18 10 9 11 6 15 12 17 ...
 $ Latitude : num  -27.6 -27.6 -27.6 -27.5 -27.4 ...
 $ Longitude: num  153 153 153 153 153 ...
> str(pools2)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame':	20 obs. of  4 variables:
 $ Name     : chr  "Acacia Ridge Leisure Centre" "Bellbowrie Pool" "Carole Park" "Centenary Pool (inner City)" ...
 $ Address  : chr  "1391 Beaudesert Road, Acacia Ridge" "Sugarwood Street, Bellbowrie" "Cnr Boundary Road and Waterford Road Wacol" "400 Gregory Terrace, Spring Hill" ...
 $ Latitude : num  -27.6 -27.6 -27.6 -27.5 -27.4 ...
 $ Longitude: num  153 153 153 153 153 ...
 - attr(*, "spec")=
  .. cols(
  ..   Name = col_character(),
  ..   Address = col_character(),
  ..   Latitude = col_double(),
  ..   Longitude = col_double()
  .. )

read_csv() imports strings as plain characters; unlike read.csv() before R 4.0, it does not coerce them to factors (base-R equivalent below).
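
Staying with base R, the same character behaviour is available by switching the factor conversion off (since R 4.0 this is the default):

pools1_chr <- read.csv(url_csv, stringsAsFactors = FALSE)
str(pools1_chr)   # Name and Address come back as chr, like read_csv()'s output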

 

3. Downloading a file, then reading it locally

readxl cannot read data directly from a URL, so you first download the file with download.file() and then read the local copy with readxl.

> # Downloading files -------------------------------------------------------
> # readxl does not yet support importing directly from the web,
> # so it seems you have to go through download.file() as below
> # (or use the gdata package instead).
> # Load the readxl package
> library(readxl)
> # Specification of url: url_xls
> url_xls <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/latitude.xls"
> # Download file behind URL, name it local_latitude.xls
> download.file(url_xls, destfile = "local_latitude.xls")
trying URL 'http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/latitude.xls'
Content type '' length 37888 bytes (37 KB)
downloaded 37 KB

> # Import the local .xls file with readxl: excel_readxl
> excel_readxl <- read_excel("local_latitude.xls")
Error:
  filepath: C:\Users\bok\Google 드라이브(ysjjhb@gmail.com)\2. 분석\6. datacamp\local_latitude.xls
  libxls error: Unable to open file

The file did not download correctly, so reading it fails. Maybe the DataCamp URL needs credentials, but either way the download comes out corrupted.
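
A likely culprit, my guess rather than the course's: on Windows, download.file() writes in text mode by default, which corrupts binary files such as .xls. Re-downloading in binary mode is worth a try:

# mode = "wb" writes the file as binary, which .xls requires
download.file(url_xls, destfile = "local_latitude.xls", mode = "wb")
excel_readxl <- read_excel("local_latitude.xls")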

 

4. Downloading an RData file

> # https URL to the wine RData file.
> url_rdata <- "https://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/wine.RData"
> # Download the wine file to your working directory
> # To load the RData straight into memory instead of saving it to disk,
> # use the url() function in place of download.file() (a sketch follows this transcript)
> download.file(url_rdata, "wine_local.RData")
trying URL 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/wine.RData'
Content type '' length 4781 bytes
downloaded 4781 bytes

> # Load the wine data into your workspace using load()
> load("wine_local.RData")
> # Print out the summary of the wine data
> summary(wine)
    Alcohol        Malic acid        Ash        Alcalinity of ash   Magnesium     
 Min.   :11.03   Min.   :0.74   Min.   :1.360   Min.   :10.60     Min.   : 70.00  
 1st Qu.:12.36   1st Qu.:1.60   1st Qu.:2.210   1st Qu.:17.20     1st Qu.: 88.00  
 Median :13.05   Median :1.87   Median :2.360   Median :19.50     Median : 98.00  
 Mean   :12.99   Mean   :2.34   Mean   :2.366   Mean   :19.52     Mean   : 99.59  
 3rd Qu.:13.67   3rd Qu.:3.10   3rd Qu.:2.560   3rd Qu.:21.50     3rd Qu.:107.00  
 Max.   :14.83   Max.   :5.80   Max.   :3.230   Max.   :30.00     Max.   :162.00  
 Total phenols     Flavanoids    Nonflavanoid phenols Proanthocyanins
 Min.   :0.980   Min.   :0.340   Min.   :0.1300       Min.   :0.410  
 1st Qu.:1.740   1st Qu.:1.200   1st Qu.:0.2700       1st Qu.:1.250  
 Median :2.350   Median :2.130   Median :0.3400       Median :1.550  
 Mean   :2.292   Mean   :2.023   Mean   :0.3623       Mean   :1.587  
 3rd Qu.:2.800   3rd Qu.:2.860   3rd Qu.:0.4400       3rd Qu.:1.950  
 Max.   :3.880   Max.   :5.080   Max.   :0.6600       Max.   :3.580  
 Color intensity       Hue           Proline      
 Min.   : 1.280   Min.   :1.270   Min.   : 278.0  
 1st Qu.: 3.210   1st Qu.:1.930   1st Qu.: 500.0  
 Median : 4.680   Median :2.780   Median : 672.0  
 Mean   : 5.055   Mean   :2.604   Mean   : 745.1  
 3rd Qu.: 6.200   3rd Qu.:3.170   3rd Qu.: 985.0  
 Max.   :13.000   Max.   :4.000   Max.   :1680.0  

The RData file downloads without problems.
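
As the comment in the transcript says, the local copy can be skipped entirely by opening a connection with url(); a minimal sketch:

# load() accepts a connection, so the RData file never touches disk
load(url(url_rdata))
summary(wine)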


1. Downloading data with the httr package

library(httr)

# Get the url, save response to resp
url <- "http://www.example.com/"
resp <- GET(url)

# Print resp
resp

# Get the raw content of resp: raw_content
raw_content <- content(resp, as = "raw")

# Print the head of raw_content
head(raw_content)

 

The content() function extracts just the body from a response; it can return it in three forms: "raw", "text", or "parsed" (sketched below).
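
A quick sketch of the three forms, reusing the example.com response from above; note that what "parsed" returns depends on the response's Content-Type (for HTML, a parsed document via xml2):

content(resp, as = "raw")     # raw bytes
content(resp, as = "text")    # a single character string (the HTML source)
content(resp, as = "parsed")  # httr picks a parser based on Content-Type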

> ## At first it's not obvious what the code below fetches;
> ## there is no HTML page here. The response turns out to be JSON from the OMDb API.
> # Get the url
> url <- "http://www.omdbapi.com/?apikey=72bc447a&t=Annie+Hall&y=&plot=short&r=json"
> resp <- GET(url)
> # Print resp
> resp
Response [http://www.omdbapi.com/?apikey=72bc447a&t=Annie+Hall&y=&plot=short&r=json]
  Date: 2020-01-04 15:29
  Status: 200
  Content-Type: application/json; charset=utf-8
  Size: 929 B

> # Print content of resp as text
> content(resp, as = "text")
[1] "{\"Title\":\"Annie Hall\",\"Year\":\"1977\",\"Rated\":\"PG\",\"Released\":\"20 Apr 1977\",\"Runtime\":\"93 min\",\"Genre\":\"Comedy, Romance\",\"Director\":\"Woody Allen\",\"Writer\":\"Woody Allen, Marshall Brickman\",\"Actors\":\"Woody Allen, Diane Keaton, Tony Roberts, Carol Kane\",\"Plot\":\"Neurotic New York comedian Alvy Singer falls in love with the ditzy Annie Hall.\",\"Language\":\"English, German\",\"Country\":\"USA\",\"Awards\":\"Won 4 Oscars. Another 26 wins & 8 nominations.\",\"Poster\":\"https://m.media-amazon.com/images/M/MV5BZDg1OGQ4YzgtM2Y2NS00NjA3LWFjYTctMDRlMDI3NWE1OTUyXkEyXkFqcGdeQXVyMjUzOTY1NTc@._V1_SX300.jpg\",\"Ratings\":[{\"Source\":\"Internet Movie Database\",\"Value\":\"8.0/10\"},{\"Source\":\"Rotten Tomatoes\",\"Value\":\"97%\"},{\"Source\":\"Metacritic\",\"Value\":\"92/100\"}],\"Metascore\":\"92\",\"imdbRating\":\"8.0\",\"imdbVotes\":\"240,490\",\"imdbID\":\"tt0075686\",\"Type\":\"movie\",\"DVD\":\"28 Apr 1998\",\"BoxOffice\":\"N/A\",\"Production\":\"United Artists\",\"Website\":\"N/A\",\"Response\":\"True\"}"
> # Print content of resp
> content(resp)
$Title
[1] "Annie Hall"

$Year
[1] "1977"

$Rated
[1] "PG"

$Released
[1] "20 Apr 1977"

$Runtime
[1] "93 min"
...
...
...

1. Importing JSON data (1)

# Load the jsonlite package
library(jsonlite)

# wine_json is a JSON
wine_json <- '{"name":"Chateau Migraine", "year":1997, "alcohol_pct":12.4, "color":"red", "awarded":false}'

# Convert wine_json into a list: wine
wine <- fromJSON(wine_json)

# Print structure of wine
str(wine)

2. Importing JSON data (2)

> # jsonlite is preloaded
> library(jsonlite)
> # Definition of quandl_url
> quandl_url <- "https://www.quandl.com/api/v3/datasets/WIKI/FB/data.json?auth_token=i83asDsiWUUyfoypkgMz"
> # Import Quandl data: quandl_data
> quandl_data <- fromJSON(quandl_url)
> # Print structure of quandl_data
> str(quandl_data)
List of 1
 $ dataset_data:List of 10
  ..$ limit       : NULL
  ..$ transform   : NULL
  ..$ column_index: NULL
  ..$ column_names: chr [1:13] "Date" "Open" "High" "Low" ...
  ..$ start_date  : chr "2012-05-18"
  ..$ end_date    : chr "2018-03-27"
  ..$ frequency   : chr "daily"
  ..$ data        : chr [1:1472, 1:13] "2018-03-27" "2018-03-26" "2018-03-23" "2018-03-22" ...
  ..$ collapse    : NULL
  ..$ order       : NULL
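
Note that the data element arrives as a character matrix. A sketch, my addition, of reshaping it into a data frame using the column names the API supplies:

fb <- as.data.frame(quandl_data$dataset_data$data,
                    stringsAsFactors = FALSE)
names(fb) <- quandl_data$dataset_data$column_names
head(fb[, c("Date", "Open")])   # numeric columns still need as.numeric()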

Fetching data from two different URLs and comparing values

> # Definition of the URLs
> url_sw4 <- "http://www.omdbapi.com/?apikey=72bc447a&i=tt0076759&r=json"
> url_sw3 <- "http://www.omdbapi.com/?apikey=72bc447a&i=tt0121766&r=json"
> # Import two URLs with fromJSON(): sw4 and sw3
> sw4 <- fromJSON(url_sw4)
> sw3 <- fromJSON(url_sw3)
> # Print out the Title element of both lists
> sw4$Title
[1] "Star Wars: Episode IV - A New Hope"
> sw3$Title
[1] "Star Wars: Episode III - Revenge of the Sith"
> # Is the release year of sw4 later than sw3?
> sw4$Year > sw3$Year
[1] FALSE
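
One caveat, my note rather than the course's: $Year is a character string, so the comparison above is lexicographic. Four-digit years happen to sort the same way either way, but converting first is safer:

as.numeric(sw4$Year) > as.numeric(sw3$Year)   # still FALSE here, but robust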

3. The different structures fromJSON() returns

> # Challenge 1
> json1 <- '[1, 2, 3, 4, 5, 6]'
> fromJSON(json1)
[1] 1 2 3 4 5 6
> # Challenge 2
> json2 <- '{"a": [1, 2, 3], "b": [4, 5, 6]}'
> fromJSON(json2)
$a
[1] 1 2 3

$b
[1] 4 5 6

> # Challenge 1
> json1 <- '[[1, 2], [3, 4]]'
> fromJSON(json1)
     [,1] [,2]
[1,]    1    2
[2,]    3    4
> # Challenge 2
> json2 <- '[{"a": 1, "b": 2}, 
+ {"a": 3, "b": 4}, 
+ {"a": 5, "b": 6}]'
> fromJSON(json2)
  a b
1 1 2
2 3 4
3 5 6
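
These simplification rules can be switched off with fromJSON()'s simplify* arguments when you want the raw nested lists; a sketch:

# Keep nested arrays as a list of vectors instead of collapsing to a matrix
str(fromJSON('[[1, 2], [3, 4]]', simplifyMatrix = FALSE))

# Keep an array of objects as a list instead of binding rows into a data frame
str(fromJSON('[{"a": 1}, {"a": 2}]', simplifyDataFrame = FALSE))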

4. Converting a data.frame to JSON with toJSON()

# URL pointing to the .csv file
url_csv <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/water.csv"

# Import the .csv file located at url_csv
water <- read.csv(url_csv, stringsAsFactors = FALSE)

# Convert the data file according to the requirements
water_json <- toJSON(water)

# Print out water_json
water_json

5. Two ways to print toJSON() output

# Convert mtcars to a pretty JSON: pretty_json
pretty_json <- toJSON(mtcars, pretty = TRUE)

# Print pretty_json
pretty_json

# Minify pretty_json: mini_json
mini_json <- toJSON(mtcars, pretty = FALSE)

# Print mini_json
mini_json

It prints in two forms: pretty and minified.

## pretty format ##
{
    "mpg": 22.8,
    "cyl": 4,
    "disp": 108,
    "hp": 93,
    "drat": 3.85,
    "wt": 2.32,
    "qsec": 18.61,
    "vs": 1,
    "am": 1,
    "gear": 4,
    "carb": 1,
    "_row": "Datsun 710"
  }
  
## minified format ##
  {"mpg":21,"cyl":6,"disp":160,"hp":110,"drat":3.9,"wt":2.62,"qsec":16.46,"vs":0,"am":1,"gear":4,"carb":4,"_row":"Mazda RX4"}
