Collecting Data from an API

Data is essential when building applications.

Let's collect some data that we can use to write an application.

Continue reading if you want to:

  • Query an API.
  • Handle some HTTP errors.
  • Collect some data and save it as JSON.
  • Write an application in Rust.

If you are looking for the complete code you can find it here.

Requirements

Before going further let's define how we can address the requirements of the application to avoid doing more than necessary.

  • Finding data: we need some open data that we can collect from an API. It needs to be open because, as with music and films, we need permission to use it.
  • Collecting data: we can do this using an HTTP client. It's worth noting that a data structure transmitted via HTTP is serialized as a string.
  • Storing data: this can be done using the filesystem, by converting the strings we collect to a suitable data structure and writing it to a file. The format we choose will depend on what we want to do with the data later on.

Finding Data

First we need an API we can query for some data. If you have not already chosen one, a good place to start is this repository.

For this exercise, we are going to get some cat facts. Taking a look through the repository we can see a repository licence, which looks permissive enough to use this data.

We can use a web browser to call a GET endpoint. Pasting the endpoint https://cat-fact.herokuapp.com/facts/random into the address bar returns the following response.

{
  "used": false,
  "source": "api",
  "type": "cat",
  "deleted": false,
  "_id": "591f98703b90f7150a19c151",
  "__v": 0,
  "text": "Cat families usually play best in even numbers. Cats and kittens should be aquired in pairs whenever possible.",
  "updatedAt": "2020-06-30T20:20:33.478Z",
  "createdAt": "2018-01-04T01:10:54.673Z",
  "status": {
    "verified": true,
    "sentCount": 1
  },
  "user": "5a9ac18c7478810ea6c06381"
}

Writing an Application in Rust

Setting up the project

We are going to write our application in Rust. If you haven't already, you can install Rust using the instructions here.

Next, we want to check that the installation provided the Cargo package manager. We can do this by running:

$ cargo --version

If that printed a version number, we can then initialize a new project that uses the current folder as the name of the project by running:

$ cargo init --bin

One thing I like about Rust is the ecosystem: functionality that other languages have built in can be provided through crates until the Rust language team adopts it.

I would recommend installing cargo-edit for adding packages the same way you might in JavaScript when running $ npm install --save.

$ cargo install cargo-edit

To address the problem our application will solve, we can use the following crates together:

  • The HTTP client we can use to send the request for our data is very helpfully called reqwest
  • Filesystem interactions will be provided by a JSON file store called jfs
  • To convert our strings to data structures and data structures to strings we can use serde
    • For dealing with JSON data structures we can use serde_json
  • To avoid worrying about how we implement errors for now we can use anyhow

$ cargo add reqwest serde serde_json jfs anyhow

The output will look like:

      Adding reqwest v0.10.7 to dependencies
      Adding serde v1.0.115 to dependencies
      Adding serde_json v1.0.57 to dependencies
      Adding jfs v0.6.2 to dependencies
      Adding anyhow v1.0.32 to dependencies

Because we did not specify version numbers for these libraries, we will want to check which versions Cargo is using, because that determines which version of the documentation we need to look at.

Inside the Cargo.toml we can see:

[dependencies]
reqwest = "0.10"
serde = "1.0.115"
serde_json = "1.0.57"
jfs = "0.6.2"
anyhow = "1.0.32"

Next we can build the application, which also installs the dependencies, by running:

$ cargo build

From here onwards, however, I am going to run:

$ cargo run

This will build and execute the code for us so we can get some feedback from the output:

    Finished dev [unoptimized + debuginfo] target(s) in 0.06s
     Running `target/debug/data-collection-rust`
Hello, world!

Normally if we had tests written then we could have them watch for new file changes but that is out of scope for the article today.

Calling the API

We will start by replicating calling the API in our application.

In a similar way to how the Web Browser was our client before, we must have a client that will interact with the API.

 fn main() {
-    println!("Hello, world!");
+    let client = reqwest::blocking::Client::new();
 }

Running this code, we encounter an error.

error[E0433]: failed to resolve: could not find `blocking` in `reqwest`
 --> src/main.rs:2:27
  |
2 |     let client = reqwest::blocking::Client::new();
  |                           ^^^^^^^^ could not find `blocking` in `reqwest`

When using some crates we must specify the features that our application will use in the Cargo.toml.

 [dependencies]
-reqwest = "0.10"
+reqwest = { version = "0.10", features = ["blocking"] }

Ok, so let's add another line to call our endpoint.

     let client = reqwest::blocking::Client::new();
+    let uri = "https://cat-fact.herokuapp.com/fact/random";
+    let response = client.get(uri).send()?;

Upon running our application again we see another error:

error[E0277]: the `?` operator can only be used in a function that returns `Result` or `Option` (or another type that implements...

We're using the ? operator here to handle calling a function that could return an error. Let's do as the compiler suggests:

-fn main() {
+fn main() -> Result {

Seems that isn't exactly what the compiler wants:

error[E0107]: wrong number of type arguments: expected 2, found 0
 --> src/main.rs:1:14
  |
1 | fn main() -> Result {
  |              ^^^^^^ expected 2 type arguments

Result takes two type arguments: the type returned when our code succeeds, and the type of the error returned when it does not. This is where we can use anyhow for the error type:

-fn main() -> Result {
+fn main() -> Result<(), anyhow::Error> {

Ok, so we have another compiler error...

error[E0308]: mismatched types
 --> src/main.rs:1:14
  |
1 | fn main() -> Result<(), anyhow::Error> {
  |    ----      ^^^^^^^^^^^^^^^^^^^^^^^^^ expected enum `std::result::Result`, found `()`
  |    |
  |    implicitly returns `()` as its body has no tail or `return` expression

It seems we need to add the following as the final expression, for any successful outcome that doesn't return an error.

     let response = client.get(uri).send()?;
+    Ok(())
 }
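
To see the same pattern without any network access, here is a minimal standard-library sketch. The parse_port function and its inputs are hypothetical and not part of the application; only the use of ?, Result, and a final Ok value mirrors the code above.

```rust
use std::num::ParseIntError;

// Hypothetical helper: parses a port number and hands any parse error
// back to the caller instead of panicking.
fn parse_port(raw: &str) -> Result<u16, ParseIntError> {
    // `?` unwraps the Ok value, or returns early with the Err value.
    let port = raw.trim().parse::<u16>()?;
    Ok(port)
}

fn main() -> Result<(), ParseIntError> {
    let port = parse_port("8080")?;
    println!("parsed port {}", port);
    // The body must end with a Result, so success is spelled Ok(()).
    Ok(())
}
```

Because main returns a Result, an Err that reaches it is printed and the process exits with a non-zero status.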

So we probably want to know what the response looks like. We can take a quick look using the following macro:

+    dbg!(response);
     Ok(())

And it appears that I might have mistyped the url, as we receive a not found error.

[src/main.rs:5] response = Response {
    url: "https://cat-fact.herokuapp.com/fact/random",
    status: 404,
    headers: {
        "server": "Cowboy",
        "connection": "keep-alive",
        "x-powered-by": "Express",
        "access-control-allow-origin": "*",
        "content-security-policy": "default-src 'none'",
        "x-content-type-options": "nosniff",
        "content-type": "text/html; charset=utf-8",
        "content-length": "150",
        "set-cookie": "connect.sid=s%3A5IS9zYZqbamwJECS6C5JrdcDfIBJ8epX.Lbh4Zl5C21jdFOyih1RgS1%2FiZr2c8jxbEc1l1XiwTvo; Path=/; HttpOnly",
        "date": "Tue, 25 Aug 2020 17:46:27 GMT",
        "via": "1.1 vegur",
    },
}

I checked the API documentation and indeed I had mistyped the url.

-    let uri = "https://cat-fact.herokuapp.com/fact/random";
+    let uri = "https://cat-fact.herokuapp.com/facts/random";

After correction, we get the correct status code.

    url: "https://cat-fact.herokuapp.com/facts/random",
    status: 200,
    headers: {

It would probably be a good idea to handle errors when we don't get a 200 response. Let's check the response value so we can add a condition.

-    dbg!(response);
+    dbg!(response.status());

Ok.

[src/main.rs:5] response.status() = 200

Now let's add a conditional.

-    dbg!(response.status());
+    if(response.status() == 200) {
+        println!("{}", response.status());
+    }

Cool.

200 OK

However, we want the application to return an error in this case. It seems reqwest can help us detect it, so let's force the request to fail again by adding the typo back in.

-    let uri = "https://cat-fact.herokuapp.com/facts/random";
+    let uri = "https://cat-fact.herokuapp.com/fact/random";
     let response = client.get(uri).send()?;
-    if(response.status() == 200) {
+    if(response.status().is_client_error() || response.status().is_server_error()) {
         println!("{}", response.status());
     } 

And let's have the application return the error it encounters to avoid running any other code. We can do this by using a macro bundled with anyhow.

+use anyhow::anyhow;
+
 fn main() -> Result<(), anyhow::Error> {
     let client = reqwest::blocking::Client::new();
     let uri = "https://cat-fact.herokuapp.com/fact/random";
     let response = client.get(uri).send()?;
     if(response.status().is_client_error() || response.status().is_server_error()) {
-        println!("{}", response.status());
+        return Err(anyhow!("Server responded with: {}", response.status()));
     } 
     Ok(())

Seems the compiler is warning us about something:

warning: unnecessary parentheses around `if` condition

After removing the parentheses, the application now exits with an error when it receives a client or server error status code:

Error: Server responded with: 404 Not Found
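
The two status checks correspond to the 4xx and 5xx ranges. The sketch below spells that out using only the standard library; check_status is a hypothetical helper that works on a plain u16 rather than reqwest's status type, so its message shows only the number and not the reason phrase.

```rust
// Hypothetical helper mirroring the condition used above:
// 400-499 are client errors and 500-599 are server errors.
fn check_status(code: u16) -> Result<(), String> {
    let is_client_error = (400..500).contains(&code);
    let is_server_error = (500..600).contains(&code);
    if is_client_error || is_server_error {
        return Err(format!("Server responded with: {}", code));
    }
    Ok(())
}

fn main() {
    for &code in [200u16, 404, 503].iter() {
        match check_status(code) {
            Ok(()) => println!("{} is fine", code),
            Err(message) => println!("{}", message),
        }
    }
}
```

Note that a redirect such as 302 passes this check, which is one reason to test for the error ranges rather than for equality with 200.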

Great. Next, after restoring the correct URL, let's look at getting our facts out of the response.

Deserializing Data

Let's start by trying to parse the body's text with serde_json.

+    let string = serde_json::from_str(response.text());
     Ok(())

The compiler complains...

error[E0308]: mismatched types
  --> src/main.rs:10:39
   |
10 |     let string = serde_json::from_str(response.text());
   |                                       ^^^^^^^^^^^^^^^ expected `&str`, found enum `std::result::Result`

We can handle the possible error from both function calls with the ? operator, which gives us access to the successful value, hopefully of type &str.

-    let string = serde_json::from_str(response.text());
+    let string = serde_json::from_str(response.text()?)?;

error[E0308]: try expression alternatives have incompatible types
  --> src/main.rs:10:39
   |
10 |     let string = serde_json::from_str(response.text()?);
   |                                       ^^^^^^^^^^^^^^^^
   |                                       |
   |                                       expected `&str`, found struct `std::string::String`
   |                                       help: consider borrowing here: `&response.text()?`

Wow, the compiler tells us what we might be able to do to fix the problem. Ok, so if we borrow the text by writing &response.text()? instead...

warning: unused variable: `string`
  --> src/main.rs:10:9
   |
10 |     let string = serde_json::from_str(&response.text()?)?;
   |         ^^^^^^ help: if this is intentional, prefix it with an underscore: `_string`
   |
   = note: `#[warn(unused_variables)]` on by default

warning: 1 warning emitted

    Finished dev [unoptimized + debuginfo] target(s) in 3.20s
     Running `target/debug/data-collection-rust`
Error: invalid type: map, expected unit at line 1 column 0

Ok, nice, so we only have one warning, which we can ignore because we plan to use the variable. The error, however, might take some figuring out. It turns out we can tell Rust which type to parse into by annotating the variable with the Value type provided by serde_json (imported with use serde_json::Value;). This will only work if the body is valid JSON.

-    let string = serde_json::from_str(&response.text()?)?;
+    let string: Value = serde_json::from_str(&response.text()?)?;
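
The standard library has a close analogue of this annotation-driven parsing. In the sketch below, str::parse is generic over its output type in much the same way as serde_json::from_str, and the annotation on each binding decides what is produced. The parse_both helper and its input are hypothetical.

```rust
use std::num::ParseIntError;

// Hypothetical helper: the same text is parsed into two different types,
// chosen purely by the type annotation on each binding.
fn parse_both(raw: &str) -> Result<(u32, i64), ParseIntError> {
    let small: u32 = raw.parse()?;
    let large: i64 = raw.parse()?;
    Ok((small, large))
}

fn main() -> Result<(), ParseIntError> {
    // An owned String is borrowed as &str with `&`, as the compiler suggested.
    let owned = String::from("42");
    let (small, large) = parse_both(&owned)?;
    println!("{} {}", small, large);
    Ok(())
}
```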

So the code compiles, we can print out what string now holds to check what the response body looks like.

+    dbg!(string);

[src/main.rs:12] string = Object({
    "__v": Number(
        0,
    ),
    "_id": String(
        "591f97d48dec2e14e3c20aff",
    ),
    "createdAt": String(
        "2018-01-04T01:10:54.673Z",
    ),
    "deleted": Bool(
        false,
    ),
    "source": String(
        "api",
    ),
    "status": Object({
        "sentCount": Number(
            1,
        ),
        "verified": Bool(
            true,
        ),
    }),
    "text": String(
        "Cats have the largest eyes of any mammal.",
    ),
    "type": String(
        "cat",
    ),
    "updatedAt": String(
        "2020-08-23T20:20:01.611Z",
    ),
    "used": Bool(
        false,
    ),
    "user": String(
        "5a9ac18c7478810ea6c06381",
    ),
})

Awesome, so this looks more like the kind of data we might want to use later. serde_json even shows us which JSON type it parsed each field as.

Determining the types

Rust is a strongly typed language, so we define the types that our application needs to know about, and should do so when we can because the compiler is not always able to infer them. Defining the structure of our data is also useful for future reference: when it comes to expanding our application, we will need to understand the shape of our data.

Having checked the serde_json documentation we will need to make the following changes:

  1. At the top of the main.rs
    -use serde_json::Value;
    +use serde::{Deserialize, Serialize};
  2. Further down the main.rs
    -    let string: Value = serde_json::from_str(&response.text()?)?;
    +    let string: CatFact = serde_json::from_str(&response.text()?)?;
        Ok(())
    }
    +
    +#[derive(Debug, Serialize, Deserialize)]
    +struct CatFact {
    +    used: bool,
    +    source: String,
    +    r#type: String,
    +    deleted: bool,
    +    _id: String,
    +    __v: i32,
    +    text: String,
    +    updatedAt: String,
    +    createdAt: String,
    +    status: Status,
    +    user: String
    +}
    +
    +#[derive(Debug, Serialize, Deserialize)]
    +struct Status {
    +    verified: bool,
    +    sentCount: i32
    +}
  3. In the Cargo.toml:
    -serde = "1.0.115"
    +serde = { version = "1.0.115", features = ["derive"] }

Notice that when attempting to define the types for a JSON record, if the field name (also known as a key) happens to be a reserved keyword then the compiler handily points this out.

error: expected identifier, found keyword `type`
  --> src/main.rs:34:5
   |
34 |     type: String,
   |     ^^^^ expected identifier, found keyword
   |
help: you can escape reserved keywords to use them as identifiers
   |
34 |     r#type: String,
   |     ^^^^^^

error: aborting due to previous error

error: could not compile `playground-data-collection-rust`.
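
A minimal sketch of the raw-identifier escape on its own, using only the standard library. The Animal struct and describe function are hypothetical and unrelated to the cat facts API; they only show that the r# prefix is needed wherever the field is named.

```rust
// `type` is a keyword, so the field has to be written as `r#type`.
#[derive(Debug)]
struct Animal {
    r#type: String,
    name: String,
}

fn describe(animal: &Animal) -> String {
    // The prefix is required when reading the field as well.
    format!("{} is a {}", animal.name, animal.r#type)
}

fn main() {
    let animal = Animal {
        r#type: String::from("cat"),
        name: String::from("Tom"),
    };
    println!("{}", describe(&animal));
}
```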

So when we run the application and dbg!(string) we see we have a cat fact.

[src/main.rs:12] string = CatFact {
    used: false,
    source: "api",
    type: "cat",
    deleted: false,
    etc...

Then if we want to access a specific field we use dot notation.

+    dbg!(string.text);
     Ok(())

Now we have the data we can store it somewhere.

Persisting the data locally

If we want the data to be used after our application has finished running, we need to consider using a persistence layer. This is useful if the application crashes, or we have another application that will use the data elsewhere. For this exercise let's consider the scope of how we can store data.

  • We can write a new file for each dataset that we collect.
  • We restrict the number of calls by writing a single record at a time, so our program is synchronous.
  • We can use a loop to run our program for the prescribed number of times, and we pause the thread's execution so that we don't flood the API with requests. This can be based on whether the API we are calling has a throttling limit.

If we increase the collection frequency in the future, we might want to consider a different storage layer that takes scalability into account.

We have chosen to use jfs which will use our filesystem to store the data. If we wanted to analyze our data straight away we could have considered sqlite.
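
To make the idea of one file per record concrete, here is a rough standard-library sketch. It is not how jfs is implemented; save_record, the directory name, and the key are all assumptions made for illustration, and the record is passed in as an already serialized JSON string.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

// Hypothetical helper: writes one serialized JSON record to
// `<dir>/<key>.json` and returns the path it wrote to.
fn save_record(dir: &Path, key: &str, json: &str) -> io::Result<PathBuf> {
    fs::create_dir_all(dir)?;
    let path = dir.join(format!("{}.json", key));
    fs::write(&path, json)?;
    Ok(path)
}

fn main() -> io::Result<()> {
    // The directory, key, and record below are made up for this sketch.
    let dir = std::env::temp_dir().join("cat-facts-sketch");
    let path = save_record(&dir, "example-key", r#"{"text":"Cats purr."}"#)?;
    println!("wrote {}", path.display());
    Ok(())
}
```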

Ok so let's add the ability to save our data structure.

Start by importing:

 use anyhow::anyhow;
+use jfs::Store;

And we can print out the key that it uses as the file name.

+    let db: Store = Store::new("data")?;
+    let key = db.save(&string)?;
+    dbg!(key);
     Ok(())

Let's add some logging capability so that we can let our application log the result of what it's doing.

$ cargo add log env_logger

Then add the following to main.rs.

 use jfs::Store;
+#[macro_use]
+extern crate log;
...
+    env_logger::init();
+    info!("Starting up");
....
-    dbg!(key);
+    info!("Written one file with key: {}", key);

When we run our application with the RUST_LOG environment variable set:

RUST_LOG=info cargo run

it will now print out some details for us.

[2020-08-27T17:54:41Z INFO  data_collection_rust] Starting up
[2020-08-27T17:54:42Z INFO  data_collection_rust] Written one file with key: 032bfc7b-f1c8-4cdd-bb9a-f29b3f1fa9c4

Great, but what if we want to keep collecting items, can we make the application do this?

Collecting more than one item

So now to collect more than one item.

We can add a loop that will run infinitely. You will notice that I have moved items out of the loop as it should be more efficient to only run them once at start-up.

fn main() -> Result<(), anyhow::Error> {
     info!("Starting up");
     let client = reqwest::blocking::Client::new();
     let uri = "https://cat-fact.herokuapp.com/facts/random";
-    let response = client.get(uri).send()?;
-    if response.status().is_client_error() || response.status().is_server_error() {
-        return Err(anyhow!("Server responded with: {}", response.status()));
-    }
-    let string: CatFact = serde_json::from_str(&response.text()?)?;
     let db: Store = Store::new("data")?;
-    let key = db.save(&string)?;
-    info!("Written one file with key: {}", key);
+    loop {
+        let response = client.get(uri).send()?;
+        if response.status().is_client_error() || response.status().is_server_error() {
+            return Err(anyhow!("Server responded with: {}", response.status()));
+        }
+        let string: CatFact = serde_json::from_str(&response.text()?)?;
+        let key = db.save(&string)?;
+        info!("Written one file with key: {}", key);
+    }
     Ok(())
 }

Great, but we should probably slow down our requests so that we don't flood the service we're using or get throttled by it.

[2020-08-28T13:17:57Z INFO  data_collection_rust] Starting up
[2020-08-28T13:18:04Z INFO  data_collection_rust] Written one file with key: 6ec26b1e-51e9-46d7-92fc-5b5d848f3b85
[2020-08-28T13:18:04Z INFO  data_collection_rust] Written one file with key: e731bec5-ef6e-4f64-93c9-abd2df4d837b
[2020-08-28T13:18:05Z INFO  data_collection_rust] Written one file with key: fecd2a8b-2f3c-4f3b-a15e-f87f577a1116
[2020-08-28T13:18:05Z INFO  data_collection_rust] Written one file with key: b2220b56-ed18-4a60-a79a-b096f308cfae
[2020-08-28T13:18:05Z INFO  data_collection_rust] Written one file with key: c38acf34-edee-4e24-938f-2bf3f6b08e7d
[2020-08-28T13:18:05Z INFO  data_collection_rust] Written one file with key: 1e0f9ca5-d2fe-44a0-801b-869bee21233c

We can do this by making the thread this process runs in sleep for some time. So let's import this functionality.

 use serde::{Deserialize, Serialize};
+use std::thread;
+use std::time::Duration;

And add this call at the end of the loop body.

         info!("Written one file with key: {}", key);
+        thread::sleep(Duration::from_millis(5000));

Then when we run our application.

[2020-08-28T13:24:08Z INFO  data_collection_rust] Starting up
[2020-08-28T13:24:08Z INFO  data_collection_rust] Written one file with key: 30183e77-c25c-453c-922b-e027ba9a0e54
[2020-08-28T13:24:14Z INFO  data_collection_rust] Written one file with key: f1b63d41-8452-442f-8e3d-c4fea7bf80d0

Great, you can see the time difference between the two requests is greater now. We could read this duration from an environment variable later if we wanted.
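
One possible shape for that, sketched with the standard library only. The variable name COLLECT_INTERVAL_MS and the interval_from helper are assumptions and not something the application defines; the default matches the 5000 milliseconds used above.

```rust
use std::env;
use std::time::Duration;

const DEFAULT_INTERVAL_MS: u64 = 5000;

// Hypothetical helper: turns an optional raw value into a Duration,
// falling back to the default when it is missing or not a number.
fn interval_from(raw: Option<&str>) -> Duration {
    let millis = raw
        .and_then(|value| value.trim().parse::<u64>().ok())
        .unwrap_or(DEFAULT_INTERVAL_MS);
    Duration::from_millis(millis)
}

fn main() {
    // COLLECT_INTERVAL_MS is an assumed name chosen for this sketch.
    let raw = env::var("COLLECT_INTERVAL_MS").ok();
    let interval = interval_from(raw.as_deref());
    println!("sleeping {:?} between requests", interval);
}
```

Keeping the parsing separate from the env::var lookup makes the fallback behaviour easy to check without touching the process environment.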

Compiling this version also produces a warning:

warning: unreachable expression
  --> src/main.rs:26:5
   |
16 | /     loop {
17 | |         let response = client.get(uri).send()?;
18 | |         if response.status().is_client_error() || response.status().is_server_error() {
19 | |             return Err(anyhow!("Server responded with: {}", response.status()));
...  |
24 | |         thread::sleep(Duration::from_millis(5000));
25 | |     }
   | |_____- any code following this expression is unreachable
26 |       Ok(())
   |       ^^^^^^ unreachable expression
   |
   = note: `#[warn(unreachable_code)]` on by default

The compiler is warning us that our function is never going to return Ok(()) because it never leaves the loop. We can add a counter and a break condition to resolve this.

For our use case, we can use an unsigned integer since we know we will never have a negative number when incrementing our counter. We also make it mutable because we want to change it.

     info!("Starting up");
+    let mut count = 0u32;
     loop {
+        count += 1;

And add the break condition.

         thread::sleep(Duration::from_millis(5000));
+        if count == 5 {
+            break;
+        } else {
+            continue;
+        }
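
As an alternative to a manual counter, a for loop over a range expresses the same bound. The sketch below is only an illustration: collect_n is a hypothetical helper, the closure stands in for the request and save steps, and the String error type is a simplification of the anyhow error used in the application.

```rust
use std::thread;
use std::time::Duration;

// Hypothetical helper: runs `collect` a fixed number of times, pausing
// between runs, and stops at the first error.
fn collect_n<F>(times: u32, pause: Duration, mut collect: F) -> Result<u32, String>
where
    F: FnMut(u32) -> Result<(), String>,
{
    let mut completed = 0;
    for attempt in 1..=times {
        collect(attempt)?;
        completed += 1;
        // There is no need to sleep after the final attempt.
        if attempt < times {
            thread::sleep(pause);
        }
    }
    Ok(completed)
}

fn main() -> Result<(), String> {
    let done = collect_n(5, Duration::from_millis(10), |attempt| {
        println!("collected item {}", attempt);
        Ok(())
    })?;
    println!("finished after {} items", done);
    Ok(())
}
```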

Future work

Some things that we could do next:

  • Add a test that will mock calling the API.
  • Create a data factory for generating random test data.
  • Use a storage layer such as Amazon S3 to enable scaling.

Thanks for reading, I hope this helped you, please reach out to me on Twitter if you have any questions and/or suggestions!

Misc

Response

Response {
    url: "https://cat-fact.herokuapp.com/facts/random",
    status: 200,
    headers: {
        "server": "Cowboy",
        "connection": "keep-alive",
        "x-powered-by": "Express",
        "access-control-allow-origin": "*",
        "content-type": "application/json; charset=utf-8",
        "content-length": "305",
        "etag": "W/\"131-4MvaJDAXqtlkSUWatxfF1BCPTek\"",
        "set-cookie": "connect.sid=s%3Ap0LnvYyEQUN9plmg--3mnx7DNd1dyhI4.Ms1oCBB5sWMAMwCtgfye572bD%2FBfxlCRkRRoq8cLzEY; Path=/; HttpOnly",
        "date": "Tue, 18 Aug 2020 20:04:01 GMT",
        "via": "1.1 vegur",
    },
}

Compiler Warnings

This type of warning doesn't stop the code from compiling. It's a bit like JavaScript's ESLint telling us that our code isn't idiomatic.

warning: structure field `sentCount` should have a snake case name
  --> src/main.rs:49:5
   |
49 |     sentCount: i32
   |     ^^^^^^^^^ help: convert the identifier to snake case: `sent_count`

warning: 3 warnings emitted

This can be disabled whilst developing if the noise gets in the way with the following at the top of the file.

#![allow(non_snake_case)]