R data import (DBI / HTTP data download) - DataCamp course notes


1. Connecting to a MySQL database with DBI

> # Load the DBI package
> library(DBI)
> 
> # Connect to the MySQL database: con
> con <- dbConnect(RMySQL::MySQL(), 
+                  dbname = "tweater", 
+                  host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com", 
+                  port = 3306,
+                  user = "student",
+                  password = "datacamp")
> 
> # Build a vector of table names: tables
> tables <- dbListTables(con)
> 
> # Display structure of tables
> str(tables)
 chr [1:3] "comments" "tweats" "users"

The DBI package provides a standard interface so that database-backend packages such as RMySQL can all be used through the same set of functions (the sketch below swaps in a different backend).

  • dbListTables: returns the names of the tables in the connected database.
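
Because the interface is standardized, the same DBI verbs work against any backend. A minimal sketch, assuming the RSQLite package is installed (not part of the course):

library(DBI)

# Same DBI verbs, different backend: an in-memory SQLite database
con_lite <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
dbWriteTable(con_lite, "mtcars", mtcars)
dbListTables(con_lite)   # "mtcars"
dbDisconnect(con_lite)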

2. Importing a table from the database

> # Import the users table from tweater: users
> users <- dbReadTable(con, "users")
> 
> # Print users
> users
  id      name     login
1  1 elisabeth  elismith
2  2      mike     mikey
3  3      thea   teatime
4  4    thomas tomatotom
5  5    oliver olivander
6  6      kate  katebenn
7  7    anjali    lianja
  • dbReadTable: in the code above, reads the "users" table into R as a data frame.

3. Importing every table in the database into a list

> # Get table names
> table_names <- dbListTables(con)
> 
> # Import all tables
> # tables <- lapply(table_names, dbReadTable, conn = con)  # how the course does it
> tables <- lapply(table_names, FUN = function(table_name){
+   dbReadTable(name = table_name, conn = con)})
> 
> # Print out tables
> tables
[[1]]
     id tweat_id user_id            message
1  1022       87       7              nice!
2  1000       77       7             great!
3  1011       49       5            love it
4  1012       87       1   awesome! thanks!
5  1010       88       6              yuck!
6  1026       77       4      not my thing!
7  1004       49       1  this is fabulous!
8  1030       75       6           so easy!
9  1025       88       2             oh yes
10 1007       49       3           serious?
11 1020       77       1 couldn't be better
12 1014       77       1       saved my day

[[2]]
  id user_id
1 75       3
2 88       4
3 77       6
4 87       5
5 49       1
6 24       7
                                                                 post
1                                       break egg. bake egg. eat egg.
2                           wash strawberries. add ice. blend. enjoy.
3                       2 slices of bread. add cheese. grill. heaven.
4               open and crush avocado. add shrimps. perfect starter.
5 nachos. add tomato sauce, minced meat and cheese. oven for 10 mins.
6                              just eat an apple. simply and healthy.
        date
1 2015-09-05
2 2015-09-14
3 2015-09-21
4 2015-09-22
5 2015-09-22
6 2015-09-24

[[3]]
  id      name     login
1  1 elisabeth  elismith
2  2      mike     mikey
3  3      thea   teatime
4  4    thomas tomatotom
5  5    oliver olivander
6  6      kate  katebenn
7  7    anjali    lianja
  • lapply() combined with dbListTables() imports every table in the database; a named-list variant is sketched below.
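
A small convenience, my addition rather than the course's: naming the list by table makes each data frame addressable by name.

tables <- setNames(lapply(table_names, dbReadTable, conn = con), table_names)
tables$users   # same result as dbReadTable(con, "users")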

4. Sending SQL queries to the database

> # Import tweat_id column of comments where user_id is 1: elisabeth
> elisabeth <- dbGetQuery(con, "SELECT tweat_id FROM comments WHERE user_id = 1")
> 
> # Print  elisabeth
> elisabeth
  tweat_id
1       87
2       49
3       77
4       77
  • dbGetQuery() sends an SQL query to the database and returns the result as a data frame.
  • Note on WHERE conditions: most programming languages test equality with "==", but SQL uses a single "=" (a parameterized alternative is sketched below).
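
Building queries by pasting values into strings invites SQL injection. DBI also accepts a params argument for parameterized queries, assuming a backend that supports them (RMariaDB does; older RMySQL support is patchy). A sketch, not from the course:

# The placeholder syntax (? here) depends on the backend
elisabeth <- dbGetQuery(con,
                        "SELECT tweat_id FROM comments WHERE user_id = ?",
                        params = list(1))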

5. Sending SQL queries (date conditions)

> # Import the post column of tweats where date is later than '2015-09-21': latest
> latest <- dbGetQuery(con, "SELECT post FROM tweats WHERE date > '2015-09-21'")
> 
> # Print latest
> latest
                                                                 post
1               open and crush avocado. add shrimps. perfect starter.
2 nachos. add tomato sauce, minced meat and cheese. oven for 10 mins.
3                              just eat an apple. simply and healthy.
> 
  • Returns only the rows that satisfy the date condition.

6. Sending SQL queries (AND)

> # Create data frame specific
> specific <- dbGetQuery(con, "SELECT message FROM comments WHERE tweat_id = 77 AND user_id > 4")
> 
> # Print specific
> specific
  message
1  great!
  • Combining conditions with AND.

7. Sending SQL queries (names shorter than five characters)

> # Create data frame short
> short <- dbGetQuery(con, "SELECT id, name FROM users WHERE char_length(name) < 5")
> 
> # Print short
> short
  id name
1  2 mike
2  3 thea
3  6 kate
  • char_length(name): the character length of the name column.

8. DBI

1) dbGetQuery() is a DBI function; when you call it, the actual work is carried out by the backend package, here RMySQL.

 

2) Under the hood, the query runs in three steps:

  • dbSendQuery
  • dbFetch
  • dbClearResult

dbSendQuery() sends the query, and dbFetch() retrieves rows from the result set. Every fetch consumes the rows it returns, so after repeated dbFetch() calls have pulled everything, the result set is empty. When you are done, dbClearResult() releases the result set and returns TRUE. A chunked variant is sketched after the transcript below.

> res <- dbSendQuery(con, "SELECT * FROM comments WHERE user_id > 4")
> # Use dbFetch() twice
> dbFetch(res, n = 2)
    id tweat_id user_id message
1 1022       87       7   nice!
2 1000       77       7  great!
> dbFetch(res)
    id tweat_id user_id  message
1 1011       49       5  love it
2 1010       88       6    yuck!
3 1030       75       6 so easy!
> # Clear res
> dbClearResult(res)
[1] TRUE
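
The same three-step pattern scales to results too large for memory by fetching in chunks; a sketch using DBI's dbHasCompleted(), assuming con is still open:

res <- dbSendQuery(con, "SELECT * FROM comments")
while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 5)   # pull 5 rows at a time
  print(nrow(chunk))             # process each chunk here
}
dbClearResult(res)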

3) Disconnecting when you are done

> long_tweats <- dbGetQuery(con, "SELECT post, date FROM tweats WHERE char_length(post) >40")
> # Print long_tweats
> print(long_tweats)
                                                                 post       date
1                           wash strawberries. add ice. blend. enjoy. 2015-09-14
2                       2 slices of bread. add cheese. grill. heaven. 2015-09-21
3               open and crush avocado. add shrimps. perfect starter. 2015-09-22
4 nachos. add tomato sauce, minced meat and cheese. oven for 10 mins. 2015-09-22
> # Disconnect from the database
> dbDisconnect(con)
[1] TRUE
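
A defensive pattern, my addition rather than the course's: wrapping connect, query, and disconnect in one function guarantees the connection is released even if the query errors.

query_tweater <- function(sql) {
  con <- dbConnect(RMySQL::MySQL(),
                   dbname = "tweater",
                   host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
                   port = 3306, user = "student", password = "datacamp")
  on.exit(dbDisconnect(con), add = TRUE)  # runs even if dbGetQuery() fails
  dbGetQuery(con, sql)
}

query_tweater("SELECT COUNT(*) FROM comments")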

1. Importing data from the web with readr

# Load the readr package 
library(readr) 

# Import the csv file: pools 
url_csv <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/swimming_pools.csv" 
pools <- read_csv(url_csv) 

# Import the txt file: potatoes 
url_delim <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/potatoes.txt" 
potatoes <- read_tsv(url_delim) 

# Print pools and potatoes 
print(pools) 
print(potatoes)

2. utils::read.csv vs readr::read_csv

> # https URL to the swimming_pools csv file.
> url_csv <- "https://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/swimming_pools.csv"
> # Import the file using read.csv(): pools1
> pools1 <- read.csv(url_csv)
> # Load the readr package
> library(readr)
> # Import the file using read_csv(): pools2
> pools2 <- read_csv(url_csv)
Parsed with column specification:
cols(
  Name = col_character(),
  Address = col_character(),
  Latitude = col_double(),
  Longitude = col_double()
)
> # Print the structure of pools1 and pools2
> str(pools1)
'data.frame':	20 obs. of  4 variables:
 $ Name     : Factor w/ 20 levels "Acacia Ridge Leisure Centre",..: 1 2 3 4 5 6 19 7 8 9 ...
 $ Address  : Factor w/ 20 levels "1 Fairlead Crescent, Manly",..: 5 20 18 10 9 11 6 15 12 17 ...
 $ Latitude : num  -27.6 -27.6 -27.6 -27.5 -27.4 ...
 $ Longitude: num  153 153 153 153 153 ...
> str(pools2)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame':	20 obs. of  4 variables:
 $ Name     : chr  "Acacia Ridge Leisure Centre" "Bellbowrie Pool" "Carole Park" "Centenary Pool (inner City)" ...
 $ Address  : chr  "1391 Beaudesert Road, Acacia Ridge" "Sugarwood Street, Bellbowrie" "Cnr Boundary Road and Waterford Road Wacol" "400 Gregory Terrace, Spring Hill" ...
 $ Latitude : num  -27.6 -27.6 -27.6 -27.5 -27.4 ...
 $ Longitude: num  153 153 153 153 153 ...
 - attr(*, "spec")=
  .. cols(
  ..   Name = col_character(),
  ..   Address = col_character(),
  ..   Latitude = col_double(),
  ..   Longitude = col_double()
  .. )

read_csv() imports strings as plain characters; unlike read.csv() before R 4.0, it does not coerce them to factors (base-R equivalent below).
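
Staying with base R, the same character behaviour is available by switching the factor conversion off (since R 4.0 this is the default):

pools1_chr <- read.csv(url_csv, stringsAsFactors = FALSE)
str(pools1_chr)   # Name and Address come back as chr, like read_csv()'s output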

 

3. Downloading a file, then reading it locally

readxl cannot read data directly from a URL, so you first download the file with download.file() and then read the local copy with readxl.

> # Downloading files -------------------------------------------------------
> # readxl does not yet support importing directly from the web,
> # so it seems you have to go through download.file() as below
> # (or use the gdata package instead).
> # Load the readxl package
> library(readxl)
> # Specification of url: url_xls
> url_xls <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/latitude.xls"
> # Download file behind URL, name it local_latitude.xls
> download.file(url_xls, destfile = "local_latitude.xls")
trying URL 'http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/latitude.xls'
Content type '' length 37888 bytes (37 KB)
downloaded 37 KB

> # Import the local .xls file with readxl: excel_readxl
> excel_readxl <- read_excel("local_latitude.xls")
Error:
  filepath: C:\Users\bok\Google 드라이브(ysjjhb@gmail.com)\2. 분석\6. datacamp\local_latitude.xls
  libxls error: Unable to open file

The file did not download correctly, so reading it fails. Maybe the DataCamp URL needs credentials, but either way the download comes out corrupted.
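
A likely culprit, my guess rather than the course's: on Windows, download.file() writes in text mode by default, which corrupts binary files such as .xls. Re-downloading in binary mode is worth a try:

# mode = "wb" writes the file as binary, which .xls requires
download.file(url_xls, destfile = "local_latitude.xls", mode = "wb")
excel_readxl <- read_excel("local_latitude.xls")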

 

4. Downloading an RData file

> # https URL to the wine RData file.
> url_rdata <- "https://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/wine.RData"
> # Download the wine file to your working directory
> # To load the RData straight into memory instead of saving it to disk,
> # use the url() function in place of download.file() (a sketch follows this transcript)
> download.file(url_rdata, "wine_local.RData")
trying URL 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/wine.RData'
Content type '' length 4781 bytes
downloaded 4781 bytes

> # Load the wine data into your workspace using load()
> load("wine_local.RData")
> # Print out the summary of the wine data
> summary(wine)
    Alcohol        Malic acid        Ash        Alcalinity of ash   Magnesium     
 Min.   :11.03   Min.   :0.74   Min.   :1.360   Min.   :10.60     Min.   : 70.00  
 1st Qu.:12.36   1st Qu.:1.60   1st Qu.:2.210   1st Qu.:17.20     1st Qu.: 88.00  
 Median :13.05   Median :1.87   Median :2.360   Median :19.50     Median : 98.00  
 Mean   :12.99   Mean   :2.34   Mean   :2.366   Mean   :19.52     Mean   : 99.59  
 3rd Qu.:13.67   3rd Qu.:3.10   3rd Qu.:2.560   3rd Qu.:21.50     3rd Qu.:107.00  
 Max.   :14.83   Max.   :5.80   Max.   :3.230   Max.   :30.00     Max.   :162.00  
 Total phenols     Flavanoids    Nonflavanoid phenols Proanthocyanins
 Min.   :0.980   Min.   :0.340   Min.   :0.1300       Min.   :0.410  
 1st Qu.:1.740   1st Qu.:1.200   1st Qu.:0.2700       1st Qu.:1.250  
 Median :2.350   Median :2.130   Median :0.3400       Median :1.550  
 Mean   :2.292   Mean   :2.023   Mean   :0.3623       Mean   :1.587  
 3rd Qu.:2.800   3rd Qu.:2.860   3rd Qu.:0.4400       3rd Qu.:1.950  
 Max.   :3.880   Max.   :5.080   Max.   :0.6600       Max.   :3.580  
 Color intensity       Hue           Proline      
 Min.   : 1.280   Min.   :1.270   Min.   : 278.0  
 1st Qu.: 3.210   1st Qu.:1.930   1st Qu.: 500.0  
 Median : 4.680   Median :2.780   Median : 672.0  
 Mean   : 5.055   Mean   :2.604   Mean   : 745.1  
 3rd Qu.: 6.200   3rd Qu.:3.170   3rd Qu.: 985.0  
 Max.   :13.000   Max.   :4.000   Max.   :1680.0  

The RData file downloads without problems.
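
As the comment in the transcript says, the local copy can be skipped entirely by opening a connection with url(); a minimal sketch:

# load() accepts a connection, so the RData file never touches disk
load(url(url_rdata))
summary(wine)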


1. Downloading data with the httr package

library(httr)

# Get the url, save response to resp
url <- "http://www.example.com/"
resp <- GET(url)

# Print resp
resp

# Get the raw content of resp: raw_content
raw_content <- content(resp, as = "raw")

# Print the head of raw_content
head(raw_content)

 

The content() function extracts just the body from a response; it can return it in three forms: "raw", "text", or "parsed" (sketched below).
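
A quick sketch of the three forms, reusing the example.com response from above; note that what "parsed" returns depends on the response's Content-Type (for HTML, a parsed document via xml2):

content(resp, as = "raw")     # raw bytes
content(resp, as = "text")    # a single character string (the HTML source)
content(resp, as = "parsed")  # httr picks a parser based on Content-Type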

> ## At first it's not obvious what the code below fetches;
> ## there is no HTML page here. The response turns out to be JSON from the OMDb API.
> # Get the url
> url <- "http://www.omdbapi.com/?apikey=72bc447a&t=Annie+Hall&y=&plot=short&r=json"
> resp <- GET(url)
> # Print resp
> resp
Response [http://www.omdbapi.com/?apikey=72bc447a&t=Annie+Hall&y=&plot=short&r=json]
  Date: 2020-01-04 15:29
  Status: 200
  Content-Type: application/json; charset=utf-8
  Size: 929 B

> # Print content of resp as text
> content(resp, as = "text")
[1] "{\"Title\":\"Annie Hall\",\"Year\":\"1977\",\"Rated\":\"PG\",\"Released\":\"20 Apr 1977\",\"Runtime\":\"93 min\",\"Genre\":\"Comedy, Romance\",\"Director\":\"Woody Allen\",\"Writer\":\"Woody Allen, Marshall Brickman\",\"Actors\":\"Woody Allen, Diane Keaton, Tony Roberts, Carol Kane\",\"Plot\":\"Neurotic New York comedian Alvy Singer falls in love with the ditzy Annie Hall.\",\"Language\":\"English, German\",\"Country\":\"USA\",\"Awards\":\"Won 4 Oscars. Another 26 wins & 8 nominations.\",\"Poster\":\"https://m.media-amazon.com/images/M/MV5BZDg1OGQ4YzgtM2Y2NS00NjA3LWFjYTctMDRlMDI3NWE1OTUyXkEyXkFqcGdeQXVyMjUzOTY1NTc@._V1_SX300.jpg\",\"Ratings\":[{\"Source\":\"Internet Movie Database\",\"Value\":\"8.0/10\"},{\"Source\":\"Rotten Tomatoes\",\"Value\":\"97%\"},{\"Source\":\"Metacritic\",\"Value\":\"92/100\"}],\"Metascore\":\"92\",\"imdbRating\":\"8.0\",\"imdbVotes\":\"240,490\",\"imdbID\":\"tt0075686\",\"Type\":\"movie\",\"DVD\":\"28 Apr 1998\",\"BoxOffice\":\"N/A\",\"Production\":\"United Artists\",\"Website\":\"N/A\",\"Response\":\"True\"}"
> # Print content of resp
> content(resp)
$Title
[1] "Annie Hall"

$Year
[1] "1977"

$Rated
[1] "PG"

$Released
[1] "20 Apr 1977"

$Runtime
[1] "93 min"
...
...
...

1. Importing JSON data (1)

# Load the jsonlite package
library(jsonlite)

# wine_json is a JSON
wine_json <- '{"name":"Chateau Migraine", "year":1997, "alcohol_pct":12.4, "color":"red", "awarded":false}'

# Convert wine_json into a list: wine
wine <- fromJSON(wine_json)

# Print structure of wine
str(wine)

2. Importing JSON data (2)

> # jsonlite is preloaded
> library(jsonlite)
> # Definition of quandl_url
> quandl_url <- "https://www.quandl.com/api/v3/datasets/WIKI/FB/data.json?auth_token=i83asDsiWUUyfoypkgMz"
> # Import Quandl data: quandl_data
> quandl_data <- fromJSON(quandl_url)
> # Print structure of quandl_data
> str(quandl_data)
List of 1
 $ dataset_data:List of 10
  ..$ limit       : NULL
  ..$ transform   : NULL
  ..$ column_index: NULL
  ..$ column_names: chr [1:13] "Date" "Open" "High" "Low" ...
  ..$ start_date  : chr "2012-05-18"
  ..$ end_date    : chr "2018-03-27"
  ..$ frequency   : chr "daily"
  ..$ data        : chr [1:1472, 1:13] "2018-03-27" "2018-03-26" "2018-03-23" "2018-03-22" ...
  ..$ collapse    : NULL
  ..$ order       : NULL
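
Note that the data element arrives as a character matrix. A sketch, my addition, of reshaping it into a data frame using the column names the API supplies:

fb <- as.data.frame(quandl_data$dataset_data$data,
                    stringsAsFactors = FALSE)
names(fb) <- quandl_data$dataset_data$column_names
head(fb[, c("Date", "Open")])   # numeric columns still need as.numeric()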

Fetching data from two different URLs and comparing values

> # Definition of the URLs
> url_sw4 <- "http://www.omdbapi.com/?apikey=72bc447a&i=tt0076759&r=json"
> url_sw3 <- "http://www.omdbapi.com/?apikey=72bc447a&i=tt0121766&r=json"
> # Import two URLs with fromJSON(): sw4 and sw3
> sw4 <- fromJSON(url_sw4)
> sw3 <- fromJSON(url_sw3)
> # Print out the Title element of both lists
> sw4$Title
[1] "Star Wars: Episode IV - A New Hope"
> sw3$Title
[1] "Star Wars: Episode III - Revenge of the Sith"
> # Is the release year of sw4 later than sw3?
> sw4$Year > sw3$Year
[1] FALSE
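
One caveat, my note rather than the course's: $Year is a character string, so the comparison above is lexicographic. Four-digit years happen to sort the same way either way, but converting first is safer:

as.numeric(sw4$Year) > as.numeric(sw3$Year)   # still FALSE here, but robust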

3. The different structures fromJSON() returns

> # Challenge 1
> json1 <- '[1, 2, 3, 4, 5, 6]'
> fromJSON(json1)
[1] 1 2 3 4 5 6
> # Challenge 2
> json2 <- '{"a": [1, 2, 3], "b": [4, 5, 6]}'
> fromJSON(json2)
$a
[1] 1 2 3

$b
[1] 4 5 6

> # Challenge 1
> json1 <- '[[1, 2], [3, 4]]'
> fromJSON(json1)
     [,1] [,2]
[1,]    1    2
[2,]    3    4
> # Challenge 2
> json2 <- '[{"a": 1, "b": 2}, 
+ {"a": 3, "b": 4}, 
+ {"a": 5, "b": 6}]'
> fromJSON(json2)
  a b
1 1 2
2 3 4
3 5 6
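
These simplification rules can be switched off with fromJSON()'s simplify* arguments when you want the raw nested lists; a sketch:

# Keep nested arrays as a list of vectors instead of collapsing to a matrix
str(fromJSON('[[1, 2], [3, 4]]', simplifyMatrix = FALSE))

# Keep an array of objects as a list instead of binding rows into a data frame
str(fromJSON('[{"a": 1}, {"a": 2}]', simplifyDataFrame = FALSE))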

4. Converting a data.frame to JSON with toJSON()

# URL pointing to the .csv file
url_csv <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/water.csv"

# Import the .csv file located at url_csv
water <- read.csv(url_csv, stringsAsFactors = FALSE)

# Convert the data file according to the requirements
water_json <- toJSON(water)

# Print out water_json
water_json

5. Two ways to print toJSON() output

# Convert mtcars to a pretty JSON: pretty_json
pretty_json <- toJSON(mtcars, pretty = TRUE)

# Print pretty_json
pretty_json

# Minify pretty_json: mini_json
mini_json <- toJSON(mtcars, pretty = FALSE)

# Print mini_json
mini_json

It prints in two forms: pretty and minified.

## pretty format ##
{
    "mpg": 22.8,
    "cyl": 4,
    "disp": 108,
    "hp": 93,
    "drat": 3.85,
    "wt": 2.32,
    "qsec": 18.61,
    "vs": 1,
    "am": 1,
    "gear": 4,
    "carb": 1,
    "_row": "Datsun 710"
  }
  
## minified format ##
  {"mpg":21,"cyl":6,"disp":160,"hp":110,"drat":3.9,"wt":2.62,"qsec":16.46,"vs":0,"am":1,"gear":4,"carb":4,"_row":"Mazda RX4"}
