R data import (DBI / HTTP data download) - DataCamp notes
1. Connecting to a MySQL database with DBI
> # Load the DBI package
> library(DBI)
>
> # Connect to the MySQL database: con
> con <- dbConnect(RMySQL::MySQL(),
+ dbname = "tweater",
+ host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
+ port = 3306,
+ user = "student",
+ password = "datacamp")
>
> # Build a vector of table names: tables
> tables <- dbListTables(con)
>
> # Display structure of tables
> str(tables)
chr [1:3] "comments" "tweats" "users"
The DBI package provides a standard interface so that database driver packages such as RMySQL can all be used in the same way.
- dbListTables: returns the tables in the connected database.
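Because DBI only defines the interface, the same verbs work with any driver package. A minimal sketch, assuming the RSQLite package is installed (the in-memory database and the mtcars table are only for illustration):
# Connect to an in-memory SQLite database through the RSQLite driver
library(DBI)
con_lite <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
# Exactly the same DBI functions as with RMySQL
dbWriteTable(con_lite, "mtcars", mtcars)   # write a data frame as a table
dbListTables(con_lite)                     # "mtcars"
head(dbReadTable(con_lite, "mtcars"))      # read the table back
dbDisconnect(con_lite)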
2. Importing a table from the database
> # Import the users table from tweater: users
> users <- dbReadTable(con, "users")
>
> # Print users
> users
id name login
1 1 elisabeth elismith
2 2 mike mikey
3 3 thea teatime
4 4 thomas tomatotom
5 5 oliver olivander
6 6 kate katebenn
7 7 anjali lianja
- dbReadTable: in the code above, reads the "users" table into R.
3. Importing all database tables as a list
> # Get table names
> table_names <- dbListTables(con)
>
> # Import all tables
> # tables <- lapply(table_names, dbReadTable, conn = con)  # the version used in the course
> tables <- lapply(table_names, FUN = function(table_name){
+ dbReadTable(name = table_name, conn = con)})
>
> # Print out tables
> tables
[[1]]
id tweat_id user_id message
1 1022 87 7 nice!
2 1000 77 7 great!
3 1011 49 5 love it
4 1012 87 1 awesome! thanks!
5 1010 88 6 yuck!
6 1026 77 4 not my thing!
7 1004 49 1 this is fabulous!
8 1030 75 6 so easy!
9 1025 88 2 oh yes
10 1007 49 3 serious?
11 1020 77 1 couldn't be better
12 1014 77 1 saved my day
[[2]]
id user_id
1 75 3
2 88 4
3 77 6
4 87 5
5 49 1
6 24 7
post
1 break egg. bake egg. eat egg.
2 wash strawberries. add ice. blend. enjoy.
3 2 slices of bread. add cheese. grill. heaven.
4 open and crush avocado. add shrimps. perfect starter.
5 nachos. add tomato sauce, minced meat and cheese. oven for 10 mins.
6 just eat an apple. simply and healthy.
date
1 2015-09-05
2 2015-09-14
3 2015-09-21
4 2015-09-22
5 2015-09-22
6 2015-09-24
[[3]]
id name login
1 1 elisabeth elismith
2 2 mike mikey
3 3 thea teatime
4 4 thomas tomatotom
5 5 oliver olivander
6 6 kate katebenn
7 7 anjali lianja
- All tables in the database are imported by combining dbListTables and lapply.
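A small refinement (my own sketch, not from the course): naming the list elements after their tables makes the result easier to navigate.
# Keep the table names on the list elements
tables <- setNames(lapply(table_names, dbReadTable, conn = con), table_names)
tables$users   # now a table can be picked out by name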
4. Sending SQL queries to the database
> # Import tweat_id column of comments where user_id is 1: elisabeth
> elisabeth <- dbGetQuery(con, "SELECT tweat_id FROM comments WHERE user_id = 1")
>
> # Print elisabeth
> elisabeth
tweat_id
1 87
2 49
3 77
4 77
- dbGetQuery sends an SQL query to the database and returns the result as a data frame (a parameterised variant is sketched below).
- Note on WHERE conditions: most programming languages use "==" for equality tests, but SQL uses a single "=".
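A related sketch (assuming the driver supports parameter binding): DBI can also bind values to placeholders instead of pasting them into the SQL string, which avoids quoting and injection problems.
# The ? placeholder is filled in from params rather than written into the query
elisabeth <- dbGetQuery(con,
                        "SELECT tweat_id FROM comments WHERE user_id = ?",
                        params = list(1))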
5. Sending SQL queries to the database (date condition)
> # Import post column of tweats where date is higher than '2015-09-21': latest
> latest <- dbGetQuery(con, "SELECT post FROM tweats where date > '2015-09-21'")
>
> # Print latest
> latest
post
1 open and crush avocado. add shrimps. perfect starter.
2 nachos. add tomato sauce, minced meat and cheese. oven for 10 mins.
3 just eat an apple. simply and healthy.
>
- Only the rows that satisfy the date condition are returned.
6. Sending SQL queries to the database (AND condition)
> # Create data frame specific
> specific <- dbGetQuery(con, "SELECT message from comments where tweat_id=77 and user_id>4")
>
> # Print specific
> specific
message
1 great!
- Combining conditions with AND.
7. Sending SQL queries to the database (names shorter than 5 characters)
> # Create data frame short
> short <- dbGetQuery(con, "select id, name from users where char_length(name) < 5")
>
> # Print short
> short
id name
1 2 mike
2 3 thea
3 6 kate
- char_length(name): the number of characters in the name column.
8. DBI
1) dbGetQuery is a DBI function; when it runs, the actual work is carried out by the driver package, here RMySQL.
2) Under the hood the query runs in three steps:
- dbSendQuery
- dbFetch
- dbClearResult
dbSendQuery sends the query, and dbFetch pulls rows from the result set; each call consumes the rows it returns. Once repeated dbFetch calls have retrieved everything, the result set is empty, and dbClearResult releases it, returning TRUE on success. (Whether all rows have been fetched can be checked with dbHasCompleted, sketched after the example below.)
> res <- dbSendQuery(con, "SELECT * FROM comments WHERE user_id > 4")
> # Use dbFetch() twice
> dbFetch(res, n = 2)
id tweat_id user_id message
1 1022 87 7 nice!
2 1000 77 7 great!
> dbFetch(res)
id tweat_id user_id message
1 1011 49 5 love it
2 1010 88 6 yuck!
3 1030 75 6 so easy!
> # Clear res
> dbClearResult(res)
[1] TRUE
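To know when a result set has been drained without fetching blindly, DBI also offers dbHasCompleted(); a short sketch of fetching in fixed-size chunks:
res <- dbSendQuery(con, "SELECT * FROM comments WHERE user_id > 4")
while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 2)   # at most 2 rows per iteration
  print(chunk)
}
dbClearResult(res)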
3) Disconnecting after use
> long_tweats <- dbGetQuery(con, "SELECT post, date FROM tweats WHERE char_length(post) >40")
> # Print long_tweats
> print(long_tweats)
post date
1 wash strawberries. add ice. blend. enjoy. 2015-09-14
2 2 slices of bread. add cheese. grill. heaven. 2015-09-21
3 open and crush avocado. add shrimps. perfect starter. 2015-09-22
4 nachos. add tomato sauce, minced meat and cheese. oven for 10 mins. 2015-09-22
> # Disconnect from the database
> dbDisconnect(con)
[1] TRUE
1. Importing data from the web with the readr package
# Load the readr package
library(readr)
# Import the csv file: pools
url_csv <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/swimming_pools.csv"
pools <- read_csv(url_csv)
# Import the txt file: potatoes
url_delim <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/potatoes.txt"
potatoes <- read_tsv(url_delim)
# Print pools and potatoes
print(pools)
print(potatoes)
2. utils::read.csv vs readr::read_csv
> # https URL to the swimming_pools csv file.
> url_csv <- "https://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/swimming_pools.csv"
> # Import the file using read.csv(): pools1
> pools1 <- read.csv(url_csv)
> # Load the readr package
> library(readr)
> # Import the file using read_csv(): pools2
> pools2 <- read_csv(url_csv)
Parsed with column specification:
cols(
Name = col_character(),
Address = col_character(),
Latitude = col_double(),
Longitude = col_double()
)
> # Print the structure of pools1 and pools2
> str(pools1)
'data.frame': 20 obs. of 4 variables:
$ Name : Factor w/ 20 levels "Acacia Ridge Leisure Centre",..: 1 2 3 4 5 6 19 7 8 9 ...
$ Address : Factor w/ 20 levels "1 Fairlead Crescent, Manly",..: 5 20 18 10 9 11 6 15 12 17 ...
$ Latitude : num -27.6 -27.6 -27.6 -27.5 -27.4 ...
$ Longitude: num 153 153 153 153 153 ...
> str(pools2)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 20 obs. of 4 variables:
$ Name : chr "Acacia Ridge Leisure Centre" "Bellbowrie Pool" "Carole Park" "Centenary Pool (inner City)" ...
$ Address : chr "1391 Beaudesert Road, Acacia Ridge" "Sugarwood Street, Bellbowrie" "Cnr Boundary Road and Waterford Road Wacol" "400 Gregory Terrace, Spring Hill" ...
$ Latitude : num -27.6 -27.6 -27.6 -27.5 -27.4 ...
$ Longitude: num 153 153 153 153 153 ...
- attr(*, "spec")=
.. cols(
.. Name = col_character(),
.. Address = col_character(),
.. Latitude = col_double(),
.. Longitude = col_double()
.. )
read_csv keeps character columns as characters; it does not convert them to factors.
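For comparison, read.csv() can also be told not to create factors (and since R 4.0.0 that is its default behaviour); a minimal sketch:
pools1b <- read.csv(url_csv, stringsAsFactors = FALSE)
str(pools1b)   # Name and Address are now chr, matching read_csv()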
3. Downloading a file and then reading it
readxl cannot read data directly from a URL, so the file is first downloaded with download.file and then read with readxl.
> # Downloading files -------------------------------------------------------
> # readxl does not yet support reading data directly from the web,
> # so the file is downloaded first with download.file as below
> # (the gdata package is an alternative route).
> # Load the readxl package
> library(readxl)
> # Specification of url: url_xls
> url_xls <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/latitude.xls"
> # Download file behind URL, name it local_latitude.xls
> download.file(url_xls, destfile = "local_latitude.xls")
trying URL 'http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/latitude.xls'
Content type '' length 37888 bytes (37 KB)
downloaded 37 KB
> # Import the local .xls file with readxl: excel_readxl
> excel_readxl <- read_excel("local_latitude.xls")
Error:
filepath: C:\Users\bok\Google 드라이브(ysjjhb@gmail.com)\2. 분석\6. datacamp\local_latitude.xls
libxls error: Unable to open file
The downloaded file is corrupted, so read_excel fails. Rather than a permissions problem with the DataCamp URL, the likely culprit is that the binary .xls file was downloaded in text mode (a possible fix is sketched below).
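A likely fix (my assumption, not verified against the course): on Windows, download.file() uses text mode unless the file extension is one it recognises as binary, and .xls is not on that list. Passing mode = "wb" forces a binary download:
# Download the .xls file in binary mode, then read it
download.file(url_xls, destfile = "local_latitude.xls", mode = "wb")
excel_readxl <- read_excel("local_latitude.xls")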
4. Downloading an RData file
> # https URL to the wine RData file.
> url_rdata <- "https://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/wine.RData"
> # Download the wine file to your working directory
> # To load the RData straight into memory instead of saving it to disk,
> # use the url() function in place of download.file (see the sketch after this block).
> download.file(url_rdata, "wine_local.RData")
trying URL 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/wine.RData'
Content type '' length 4781 bytes
downloaded 4781 bytes
> # Load the wine data into your workspace using load()
> load("wine_local.RData")
> # Print out the summary of the wine data
> summary(wine)
Alcohol Malic acid Ash Alcalinity of ash Magnesium
Min. :11.03 Min. :0.74 Min. :1.360 Min. :10.60 Min. : 70.00
1st Qu.:12.36 1st Qu.:1.60 1st Qu.:2.210 1st Qu.:17.20 1st Qu.: 88.00
Median :13.05 Median :1.87 Median :2.360 Median :19.50 Median : 98.00
Mean :12.99 Mean :2.34 Mean :2.366 Mean :19.52 Mean : 99.59
3rd Qu.:13.67 3rd Qu.:3.10 3rd Qu.:2.560 3rd Qu.:21.50 3rd Qu.:107.00
Max. :14.83 Max. :5.80 Max. :3.230 Max. :30.00 Max. :162.00
Total phenols Flavanoids Nonflavanoid phenols Proanthocyanins
Min. :0.980 Min. :0.340 Min. :0.1300 Min. :0.410
1st Qu.:1.740 1st Qu.:1.200 1st Qu.:0.2700 1st Qu.:1.250
Median :2.350 Median :2.130 Median :0.3400 Median :1.550
Mean :2.292 Mean :2.023 Mean :0.3623 Mean :1.587
3rd Qu.:2.800 3rd Qu.:2.860 3rd Qu.:0.4400 3rd Qu.:1.950
Max. :3.880 Max. :5.080 Max. :0.6600 Max. :3.580
Color intensity Hue Proline
Min. : 1.280 Min. :1.270 Min. : 278.0
1st Qu.: 3.210 1st Qu.:1.930 1st Qu.: 500.0
Median : 4.680 Median :2.780 Median : 672.0
Mean : 5.055 Mean :2.604 Mean : 745.1
3rd Qu.: 6.200 3rd Qu.:3.170 3rd Qu.: 985.0
Max. :13.000 Max. :4.000 Max. :1680.0
The RData file downloads without problems (for .RData URLs, download.file switches to binary mode automatically).
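As the comments above mention, the file does not have to be saved to disk at all: load() accepts a connection, so the RData can be read straight from the URL.
# Load the RData directly into the workspace without a local copy
load(url(url_rdata))
summary(wine)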
1. Downloading data with the httr package
library(httr)
# Get the url, save response to resp
url <- "http://www.example.com/"
resp <- GET(url)
# Print resp
resp
# Get the raw content of resp: raw_content
raw_content <- content(resp, as = "raw")
# Print the head of raw_content
head(raw_content)
The content() function extracts just the body from the response object; it can return it in three forms: "raw", "text", or "parsed".
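A short sketch of the three forms on the example.com response (what "parsed" returns depends on the Content-Type; for an HTML page it is an xml_document):
content(resp, as = "raw")      # raw vector of bytes
content(resp, as = "text")     # one long character string
content(resp, as = "parsed")   # parsed according to the Content-Type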
> ## The URL below is not an HTML page but an API endpoint (OMDb),
> ## so the response body is JSON describing a movie.
> # Get the url
> url <- "http://www.omdbapi.com/?apikey=72bc447a&t=Annie+Hall&y=&plot=short&r=json"
> resp <- GET(url)
> # Print resp
> resp
Response [http://www.omdbapi.com/?apikey=72bc447a&t=Annie+Hall&y=&plot=short&r=json]
Date: 2020-01-04 15:29
Status: 200
Content-Type: application/json; charset=utf-8
Size: 929 B
> # Print content of resp as text
> content(resp, as = "text")
[1] "{\"Title\":\"Annie Hall\",\"Year\":\"1977\",\"Rated\":\"PG\",\"Released\":\"20 Apr 1977\",\"Runtime\":\"93 min\",\"Genre\":\"Comedy, Romance\",\"Director\":\"Woody Allen\",\"Writer\":\"Woody Allen, Marshall Brickman\",\"Actors\":\"Woody Allen, Diane Keaton, Tony Roberts, Carol Kane\",\"Plot\":\"Neurotic New York comedian Alvy Singer falls in love with the ditzy Annie Hall.\",\"Language\":\"English, German\",\"Country\":\"USA\",\"Awards\":\"Won 4 Oscars. Another 26 wins & 8 nominations.\",\"Poster\":\"https://m.media-amazon.com/images/M/MV5BZDg1OGQ4YzgtM2Y2NS00NjA3LWFjYTctMDRlMDI3NWE1OTUyXkEyXkFqcGdeQXVyMjUzOTY1NTc@._V1_SX300.jpg\",\"Ratings\":[{\"Source\":\"Internet Movie Database\",\"Value\":\"8.0/10\"},{\"Source\":\"Rotten Tomatoes\",\"Value\":\"97%\"},{\"Source\":\"Metacritic\",\"Value\":\"92/100\"}],\"Metascore\":\"92\",\"imdbRating\":\"8.0\",\"imdbVotes\":\"240,490\",\"imdbID\":\"tt0075686\",\"Type\":\"movie\",\"DVD\":\"28 Apr 1998\",\"BoxOffice\":\"N/A\",\"Production\":\"United Artists\",\"Website\":\"N/A\",\"Response\":\"True\"}"
> # Print content of resp
> content(resp)
$Title
[1] "Annie Hall"
$Year
[1] "1977"
$Rated
[1] "PG"
$Released
[1] "20 Apr 1977"
$Runtime
[1] "93 min"
...
...
...
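For JSON responses like this one, content(resp) relies on jsonlite behind the scenes; the same nested list can be produced explicitly, which makes the fromJSON() used in the next section feel familiar:
library(jsonlite)
fromJSON(content(resp, as = "text"), simplifyVector = FALSE)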
1. Importing JSON data (1)
# Load the jsonlite package
library(jsonlite)
# wine_json is a JSON
wine_json <- '{"name":"Chateau Migraine", "year":1997, "alcohol_pct":12.4, "color":"red", "awarded":false}'
# Convert wine_json into a list: wine
wine <- fromJSON(wine_json)
# Print structure of wine
str(wine)
2. Importing JSON data (2)
> # jsonlite is preloaded
> library(jsonlite)
> # Definition of quandl_url
> quandl_url <- "https://www.quandl.com/api/v3/datasets/WIKI/FB/data.json?auth_token=i83asDsiWUUyfoypkgMz"
> # Import Quandl data: quandl_data
> quandl_data <- fromJSON(quandl_url)
> # Print structure of quandl_data
> str(quandl_data)
List of 1
$ dataset_data:List of 10
..$ limit : NULL
..$ transform : NULL
..$ column_index: NULL
..$ column_names: chr [1:13] "Date" "Open" "High" "Low" ...
..$ start_date : chr "2012-05-18"
..$ end_date : chr "2018-03-27"
..$ frequency : chr "daily"
..$ data : chr [1:1472, 1:13] "2018-03-27" "2018-03-26" "2018-03-23" "2018-03-22" ...
..$ collapse : NULL
..$ order : NULL
Importing data from two different URLs and comparing values
> # Definition of the URLs
> url_sw4 <- "http://www.omdbapi.com/?apikey=72bc447a&i=tt0076759&r=json"
> url_sw3 <- "http://www.omdbapi.com/?apikey=72bc447a&i=tt0121766&r=json"
> # Import two URLs with fromJSON(): sw4 and sw3
> sw4 <- fromJSON(url_sw4)
> sw3 <- fromJSON(url_sw3)
> # Print out the Title element of both lists
> sw4$Title
[1] "Star Wars: Episode IV - A New Hope"
> sw3$Title
[1] "Star Wars: Episode III - Revenge of the Sith"
> # Is the release year of sw4 later than sw3?
> sw4$Year > sw3$Year
[1] FALSE
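One caveat: every field in the parsed OMDb response is a character string, so the comparison above is alphabetical rather than numeric (it still gives the right answer here); converting first makes the intent explicit.
as.integer(sw4$Year) > as.integer(sw3$Year)   # 1977 > 2005 is FALSE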
3. The different shapes that fromJSON returns
> # Challenge 1
> json1 <- '[1, 2, 3, 4, 5, 6]'
> fromJSON(json1)
[1] 1 2 3 4 5 6
> # Challenge 2
> json2 <- '{"a": [1, 2, 3], "b": [4, 5, 6]}'
> fromJSON(json2)
$a
[1] 1 2 3
$b
[1] 4 5 6
> # Challenge 1
> json1 <- '[[1, 2], [3, 4]]'
> fromJSON(json1)
[,1] [,2]
[1,] 1 2
[2,] 3 4
> # Challenge 2
> json2 <- '[{"a": 1, "b": 2},
+ {"a": 3, "b": 4},
+ {"a": 5, "b": 6}]'
> fromJSON(json2)
a b
1 1 2
2 3 4
3 5 6
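This automatic simplification can be switched off with fromJSON()'s simplify arguments; a small sketch using the same json2:
# Keep the raw structure: a list of three named lists instead of a data frame
fromJSON(json2, simplifyDataFrame = FALSE)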
4. Converting a data.frame to JSON with toJSON
# URL pointing to the .csv file
url_csv <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/water.csv"
# Import the .csv file located at url_csv
water <- read.csv(url_csv, stringsAsFactors = FALSE)
# Convert the data file according to the requirements
water_json <- toJSON(water)
# Print out water_json
water_json
5. Two ways to print toJSON output
# Convert mtcars to a pretty JSON: pretty_json
pretty_json <- toJSON(mtcars, pretty = TRUE)
# Print pretty_json
pretty_json
# Minify pretty_json: mini_json
mini_json <- toJSON(mtcars, pretty = FALSE)
# Print mini_json
mini_json
toJSON output can be printed in two forms: pretty and minified.
## pretty form ##
{
"mpg": 22.8,
"cyl": 4,
"disp": 108,
"hp": 93,
"drat": 3.85,
"wt": 2.32,
"qsec": 18.61,
"vs": 1,
"am": 1,
"gear": 4,
"carb": 1,
"_row": "Datsun 710"
}
## minified form ##
{"mpg":21,"cyl":6,"disp":160,"hp":110,"drat":3.9,"wt":2.62,"qsec":16.46,"vs":0,"am":1,"gear":4,"carb":4,"_row":"Mazda RX4"}