forked from tidyverse/rvest
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME.Rmd
105 lines (75 loc) · 3.34 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "README-"
)
```
# rvest <img src='man/figures/logo.png' align="right" height="139" />
<!-- badges: start -->
[![CRAN status](https://www.r-pkg.org/badges/version/rvest)](https://cran.r-project.org/package=rvest)
[![Travis build status](https://travis-ci.org/tidyverse/rvest.svg?branch=master)](https://travis-ci.org/tidyverse/rvest)
[![Codecov test coverage](https://codecov.io/gh/tidyverse/rvest/branch/master/graph/badge.svg)](https://codecov.io/gh/tidyverse/rvest?branch=master)
<!-- badges: end -->
## Overview
rvest helps you scrape information from web pages. It is designed to work with [magrittr](https://github.com/smbache/magrittr) to make it easy to express common web scraping tasks, inspired by libraries like [beautiful soup](https://www.crummy.com/software/BeautifulSoup/).
```{r, message = FALSE}
library(rvest)
lego_movie <- read_html("http://www.imdb.com/title/tt1490017/")
rating <- lego_movie %>%
html_nodes("strong span") %>%
html_text() %>%
as.numeric()
rating
cast <- lego_movie %>%
html_nodes("#titleCast .primary_photo img") %>%
html_attr("alt")
cast
poster <- lego_movie %>%
html_nodes(".poster img") %>%
html_attr("src")
poster
```
## Installation
Install the release version from CRAN:
```{r, eval = FALSE}
install.packages("rvest")
```
Or the development version from GitHub
```{r, eval = FALSE}
# install.packages("devtools")
devtools::install_github("tidyverse/rvest")
```
## Key functions
The most important functions in rvest are:
* Create an html document from a url, a file on disk or a string containing
html with `read_html()`.
* Select parts of a document using CSS selectors: `html_nodes(doc, "table td")`
(or if you've a glutton for punishment, use XPath selectors with
`html_nodes(doc, xpath = "//table//td")`). If you haven't heard of
[selectorgadget](http://selectorgadget.com/), make sure to read
`vignette("selectorgadget")` to learn about it.
* Extract components with `html_name()` (the name of the tag), `html_text()`
(all text inside the tag), `html_attr()` (contents of a single attribute) and
`html_attrs()` (all attributes).
* (You can also use rvest with XML files: parse with `xml()`, then extract
components using `xml_node()`, `xml_attr()`, `xml_attrs()`, `xml_text()`
and `xml_name()`.)
* Parse tables into data frames with `html_table()`.
* Extract, modify and submit forms with `html_form()`, `set_values()` and
`submit_form()`.
* Detect and repair encoding problems with `guess_encoding()` and
`repair_encoding()`.
* Navigate around a website as if you're in a browser with `html_session()`,
`jump_to()`, `follow_link()`, `back()`, `forward()`, `submit_form()` and
so on. (This is still a work in progress, so I'd love your feedback.)
To see examples of these function in use, check out the demos.
## Inspirations
* Python: [RoboBrowser](http://robobrowser.readthedocs.org/en/latest/readme.html),
[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/).
## Code of Conduct
Please note that the rvest project is released with a [Contributor Code of Conduct](https://rvest.tidyverse.org/CODE_OF_CONDUCT.html). By contributing to this project, you agree to abide by its terms.