R
package to extract data from PDF figures
The easiest way to install the development version of scrapR
is to use the devtools
package:
# install.packages("devtools")
library(devtools)
install_github("adamkucharski/scrapR")
library(scrapR)
# load dependencies
# install.packages("readr")
# install.packages("grImport")
# install.packages("magrittr")
library(grImport)
library(readr)
Note that the dependency grImport
requires the ghostscript PDF interpreter to be installed. You can check which version you have installed (if any) by running $ gs -v
on the command line. If required, installation can be done via homebrew with $ brew install ghostscript
.
First you need a figure to extract data from. If you want a simple test figure, you can run:
simulate_PDF_data()
to generate a simulated set of lines and output as figure1.pdf
.
Next, navigate to the directory containing your PDF figure and import the data:
load_PDF_data(file_name="figure1.pdf")
This will output a raw RDS file and a figure ([FIGURENAME].guide.pdf
) with the different vector components labelled with numbers.
If the data fails to import, it's probably because the vector graphic has too many surrounding features. In this case, use an editor like Affinity/Illustrator etc. to delete unnecessary surrounding content, making sure to leave the lines with data you want and at least four tick marks (2 on x-axis, 2 on y-axis), which will be used to calibrate the scale.
Once you've run load_PDF_data()
, edit/create [FIGURE NAME].guide.csv
so numbers match up with two x-axis tick marks and two y-axis tick marks, and specify which data you want to extract:
point | value | axis |
---|---|---|
5 | 5 | x |
10 | 30 | x |
13 | 200 | y |
16 | 800 | y |
2 | NA | data |
18 | NA | data |
Then extract the data using the RDS file and guide CSV file.
extract_PDF_data(file_name = "figure1.pdf")
The resulting data for the line(s) will be output as [FIGURENAME].csv
, with each line grouped by index
. The above function also has an option to adjust for x and/or y axes on logarithmic scale.