Portuguese tutorials #1

Closed
wants to merge 3 commits into from
143 changes: 143 additions & 0 deletions literate_notebooks/src-PT-BR/01_constructors.jl
@@ -0,0 +1,143 @@
# # Introduction to DataFrames
# **[Bogumił Kamiński](http://bogumilkaminski.pl/about/), May 23, 2018**
#
# Let's get started by loading the `DataFrames` package.

using DataFrames

# ## Constructors and conversion

#-

# ### Constructors
#
# In this section, you'll see many ways to create a `DataFrame` using the `DataFrame()` constructor.
#
# First, we could create an empty DataFrame,

DataFrame() # empty DataFrame

# Or we could call the constructor using keyword arguments to add columns to the `DataFrame`.

DataFrame(A=1:3, B=rand(3), C=randstring.([3,3,3]))

# We can create a `DataFrame` from a dictionary, in which case keys from the dictionary will be sorted to create the `DataFrame` columns.

x = Dict("A" => [1,2], "B" => [true, false], "C" => ['a', 'b'])
DataFrame(x)
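
# A minimal sketch of where the alphabetical column order comes from, using only Base Julia: a `Dict` has no guaranteed iteration order, so sorting its keys gives a reproducible order. (The names `d` and `sorted_keys` are illustrative and not part of the original tutorial.)

d = Dict("C" => ['a', 'b'], "A" => [1, 2], "B" => [true, false])
sorted_keys = sort(collect(keys(d))) # alphabetical, whatever the insertion order was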

# Rather than explicitly creating a dictionary first, as above, we could pass `DataFrame` arguments with the syntax of dictionary key-value pairs.
#
# Note that in this case, we use symbols to denote the column names and arguments are not sorted. For example, `:A`, the symbol, produces `A`, the name of the first column here:

DataFrame(:A => [1,2], :B => [true, false], :C => ['a', 'b'])

# Here we create a `DataFrame` from a vector of vectors, and each vector becomes a column.

DataFrame([rand(3) for i in 1:3])

# For now, we can construct a `DataFrame` from a `Vector` of atoms, which creates a `DataFrame` with a single row. In future releases of DataFrames.jl, this will throw an error.

DataFrame(rand(3))

# Instead, use a transposed vector if you have a vector of atoms (this way you effectively pass a two-dimensional array to the constructor, which is supported).

DataFrame(transpose([1, 2, 3]))

# Pass a second argument to give the columns names.

DataFrame([1:3, 4:6, 7:9], [:A, :B, :C])

# Here we create a `DataFrame` from a matrix,

DataFrame(rand(3,4))

# and here we do the same but also pass column names.

DataFrame(rand(3,4), Symbol.('a':'d'))

# We can also construct an uninitialized DataFrame.
#
# Here we pass column types, names and number of rows; we get `missing` in column :C because `Any >: Missing`.

DataFrame([Int, Float64, Any], [:A, :B, :C], 1)

# Here we create a `DataFrame`, but column `:C` is `#undef` and Jupyter has a problem displaying it. (This works OK at the REPL.)
#
# This will be fixed in the next release of DataFrames!

DataFrame([Int, Float64, String], [:A, :B, :C], 1)

# To initialize a `DataFrame` with column names but no rows, use

DataFrame([Int, Float64, String], [:A, :B, :C], 0)

# This syntax gives us a quick way to create a homogeneous `DataFrame`.

DataFrame(Int, 3, 5)

# This example is similar, but has nonhomogeneous columns.

DataFrame([Int, Float64], 4)

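# Finally, we can create a `DataFrame` by copying an existing `DataFrame`.
#
# Note that `copy` creates a shallow copy.

x = DataFrame(x) # `x` is still the `Dict` defined earlier; convert it so that we copy a `DataFrame`
y = DataFrame(x)
z = copy(x)
(x === y), (x === z), isequal(x, z)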
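
# To illustrate what "shallow" means, here is a sketch that uses plain Base Julia vectors instead of a `DataFrame` (the names `cols` and `cols_copy` are illustrative): the outer container is new, but the inner vectors are shared.

cols = Any[[1, 2], ["a", "b"]]
cols_copy = copy(cols)
cols_copy === cols, cols_copy[1] === cols[1] # a different container holding the same inner vector
cols_copy[1][1] = 100 # mutating an inner vector through the copy ...
cols[1]               # ... is visible through the original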

# ### Conversion to a matrix
#
# Let's start by creating a `DataFrame` with two rows and two columns.

x = DataFrame(x=1:2, y=["A", "B"])

# We can create a matrix by passing this `DataFrame` to `Matrix`.

Matrix(x)

# This would work even if the `DataFrame` had some `missing`s:

x = DataFrame(x=1:2, y=[missing,"B"])

#-

Matrix(x)

# In the two previous matrix examples, Julia created matrices with elements of type `Any`. We can see more clearly that the type of matrix is inferred when we pass, for example, a `DataFrame` of integers to `Matrix`, creating a 2D `Array` of `Int64`s:

x = DataFrame(x=1:2, y=3:4)

#-

Matrix(x)

# In this next example, Julia correctly identifies that `Union` is needed to express the type of the resulting `Matrix` (which contains `missing`s).

x = DataFrame(x=1:2, y=[missing,4])

#-

Matrix(x)
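
# The same kind of inference can be seen without a `DataFrame`. This is a sketch that assumes Julia 1.0 or later, where `missing` is part of Base (the name `m` is illustrative):

m = [1 missing; 2 4]
eltype(m) == Union{Missing, Int} # the element type is a `Union` of `Missing` and `Int`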

# Note that we can't force a conversion of `missing` values to `Int`s!

Matrix{Int}(x)

# ### Handling of duplicate column names
#
# We can pass the `makeunique` keyword argument to allow passing duplicate names (they get deduplicated).

df = DataFrame(:a=>1, :a=>2, :a_1=>3; makeunique=true)

# Otherwise, duplicates will not be allowed in the future.

df = DataFrame(:a=>1, :a=>2, :a_1=>3)

# A constructor that is passed column names as keyword arguments is a corner case.
# You cannot pass `makeunique` to allow duplicates here.

df = DataFrame(a=1, a=2, makeunique=true)

76 changes: 76 additions & 0 deletions literate_notebooks/src-PT-BR/02_basicinfo.jl
@@ -0,0 +1,76 @@
# # Introduction to DataFrames
# **[Bogumił Kamiński](http://bogumilkaminski.pl/about/), May 23, 2018**

using DataFrames # load package

# ## Getting basic information about a data frame
#
# Let's start by creating a `DataFrame` object, `x`, so that we can learn how to get information on that data frame.

x = DataFrame(A = [1, 2], B = [1.0, missing], C = ["a", "b"])

# The standard `size` function works to get dimensions of the `DataFrame`,

size(x), size(x, 1), size(x, 2)

# as well as `nrow` and `ncol` from R; `length` gives the number of columns.

nrow(x), ncol(x), length(x)

# `describe` gives basic summary statistics of data in your `DataFrame`.

describe(x)

# Use `showcols` to get information about columns stored in a `DataFrame`.

showcols(x)

# `names` will return the names of all columns,

names(x)

# and `eltypes` returns their types.

eltypes(x)

# Here we create a large `DataFrame`

y = DataFrame(rand(1:10, 1000, 10));

# and then we can use `head` to peek into its top rows

head(y)

# and `tail` to see its bottom rows.

tail(y, 3)

# ### Most elementary get and set operations
#
# Given the `DataFrame`, `x`, here are three ways to grab one of its columns as a `Vector`:

x[1], x[:A], x[:, 1]

# To grab one row as a DataFrame, we can index as follows.

x[1, :]

# We can grab a single cell or element with the same syntax used to grab an element of an array.

x[1, 1]

# A range of cells can be assigned a scalar,

x[1:2, 1:2] = 1
x

# a vector whose length equals the number of assigned rows,

x[1:2, 1:2] = [1,2]
x

# or another data frame of matching size.

x[1:2, 1:2] = DataFrame([5 6; 7 8])
x

112 changes: 112 additions & 0 deletions literate_notebooks/src-PT-BR/03_missingvalues.jl
@@ -0,0 +1,112 @@
# # Introduction to DataFrames
# **[Bogumił Kamiński](http://bogumilkaminski.pl/about/), May 23, 2018**

using DataFrames # load package

# ## Handling missing values
#
# A singleton type `Missings.Missing` allows us to deal with missing values.

missing, typeof(missing)

# Arrays automatically create an appropriate union type.

x = [1, 2, missing, 3]

# `ismissing` checks whether the passed value is missing.

ismissing(1), ismissing(missing), ismissing(x), ismissing.(x)

# We can extract the type combined with `Missing` from a `Union` via `Missings.T`.
#
# (This is useful for arrays!)

eltype(x), Missings.T(eltype(x))

# `missing` comparisons produce `missing`.

missing == missing, missing != missing, missing < missing

# This is also true when `missing`s are compared with values of other types.

1 == missing, 1 != missing, 1 < missing

# `isequal`, `isless`, and `===` produce results of type `Bool`.

isequal(missing, missing), missing === missing, isequal(1, missing), isless(1, missing)
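
# A sketch of why this matters, assuming Julia 1.0 or later, where `missing` is part of Base (the names `v`, `n_missing` and `positions` are illustrative): code that needs a `Bool` should test with `ismissing` or `isequal` rather than `==`.

v = [1, 2, missing, 3]
n_missing = count(ismissing, v)                  # how many values are missing
positions = findall(e -> isequal(e, missing), v) # indices of the missing values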

# In the next few examples, we see that many (not all) functions handle `missing`.

map(x -> x(missing), [sin, cos, zero, sqrt]) # part 1

#-

map(x -> x(missing, 1), [+, - , *, /, div]) # part 2

#-

map(x -> x([1,2,missing]), [minimum, maximum, extrema, mean, any, float]) # part 3

# `skipmissing` returns an iterator that skips missing values. We can use `collect` and `skipmissing` to create an array that excludes these missing values.

collect(skipmissing([1, missing, 2, missing]))
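
# `collect` is not always needed, because reductions accept the iterator directly. This is a sketch that assumes Julia 1.0 or later (the names `w`, `total` and `largest` are illustrative):

w = [1, missing, 2, missing]
total = sum(skipmissing(w))       # adds up only the non-missing values
largest = maximum(skipmissing(w)) # largest non-missing value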

# Similarly, here we combine `collect` and `Missings.replace` to create an array that replaces all missing values with some value (`NaN` in this case).

collect(Missings.replace([1.0, missing, 2.0, missing], NaN))

# Another way to do this:

coalesce.([1.0, missing, 2.0, missing], NaN)

# Caution: `nothing` would also be replaced here (for Julia 0.7, a more sophisticated behavior of `coalesce` that avoids this problem is planned).

coalesce.([1.0, missing, nothing, missing], NaN)
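
# `coalesce` is also handy for turning a comparison that may yield `missing` into a plain Boolean mask. This is a sketch that assumes Julia 1.0 or later (the names `u` and `mask` are illustrative):

u = [1, 2, missing, 3]
mask = coalesce.(u .> 1, false) # `missing > 1` is `missing`; treat it as `false`
u[mask]                         # keeps only the values known to be greater than 1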

# You can use `recode` if you have homogeneous output types.

recode([1.0, missing, 2.0, missing], missing=>NaN)

# You can use `unique` or `levels` to get unique values with or without missings, respectively.

unique([1, missing, 2, missing]), levels([1, missing, 2, missing])

# In this next example, we convert `x` to `y` with `allowmissing`, where `y` has a type that accepts missings.

x = [1,2,3]
y = allowmissing(x)

# Then, we convert back with `disallowmissing`. This would fail if `y` contained missing values!

z = disallowmissing(y)
x,y,z
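
# Conceptually, these functions only change the element type of the array. Here is a sketch of the same round trip in Base Julia, assuming Julia 1.0 or later (the names `a`, `b` and `c` are illustrative, and this is not necessarily how the package implements it):

a = [1, 2, 3]
b = convert(Vector{Union{Int, Missing}}, a) # `b` can now hold `missing`
c = convert(Vector{Int}, b)                 # would throw if `b` contained `missing`
eltype(a), eltype(b), eltype(c)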

# In this next example, we show that the type of each column in `x` is initially `Int64`. After using `allowmissing!` to accept missing values in columns 1 and 3, the types of those columns become `Union`s of `Int64` and `Missings.Missing`.

x = DataFrame(Int, 2, 3)
println("Before: ", eltypes(x))
allowmissing!(x, 1) # make first column accept missings
allowmissing!(x, :x3) # make :x3 column accept missings
println("After: ", eltypes(x))

# In this next example, we'll use `completecases` to find all the rows of a `DataFrame` that have complete data.

x = DataFrame(A=[1, missing, 3, 4], B=["A", "B", missing, "C"])
println(x)
println("Complete cases:\n", completecases(x))

# We can use `dropmissing` or `dropmissing!` to remove the rows with incomplete data from a `DataFrame` and either create a new `DataFrame` or mutate the original in-place.

y = dropmissing(x)
dropmissing!(x)
[x, y]

# When we call `showcols` on a `DataFrame` with dropped missing values, the columns still allow missing values.

showcols(x)

# Since we've excluded missing values, we can safely use `disallowmissing!` so that the columns will no longer accept missing values.

disallowmissing!(x)
showcols(x)

64 changes: 64 additions & 0 deletions literate_notebooks/src-PT-BR/04_loadsave.jl
@@ -0,0 +1,64 @@
# # Introduction to DataFrames
# **[Bogumił Kamiński](http://bogumilkaminski.pl/about/), May 23, 2018**

using DataFrames # load package

# ## Load and save DataFrames
# We do not cover all features of the packages. Please refer to their documentation to learn them.
#
# Here we'll load `CSV` to read and write CSV files and `JLD`, which allows us to work with a Julia native binary format.

using CSV
using JLD

# Let's create a simple `DataFrame` for testing purposes,

x = DataFrame(A=[true, false, true], B=[1, 2, missing],
C=[missing, "b", "c"], D=['a', missing, 'c'])


# and use `eltypes` to look at the columnwise types.

eltypes(x)

# Let's use `CSV` to save `x` to disk; make sure `x.csv` does not conflict with some file in your working directory.

CSV.write("x.csv", x)

# Now we can see how it was saved by reading `x.csv`.

print(read("x.csv", String))

# We can also load it back. `use_mmap=false` disables memory mapping so that on Windows the file can be deleted in the same session.

y = CSV.read("x.csv", use_mmap=false)

# When loading in a `DataFrame` from a `CSV`, all columns allow `Missing` by default. Note that the column types have changed!

eltypes(y)

# Now let's save `x` to a file in a binary format; make sure that `x.jld` does not exist in your working directory.

save("x.jld", "x", x)

# After loading in `x.jld` as `y`, `y` is identical to `x`.

y = load("x.jld", "x")

# Note that the column types of `y` are the same as those of `x`!

eltypes(y)

# Next, we'll create the files `bigdf.csv` and `bigdf.jld`, so be careful that you don't already have these files on disk!
#
# In particular, we'll time how long it takes us to write a `DataFrame` with 10^3 rows and 10^2 columns to `.csv` and `.jld` files. *You can expect JLD to be faster!* Use `compress=true` to reduce file sizes.

bigdf = DataFrame(Bool, 10^3, 10^2)
@time CSV.write("bigdf.csv", bigdf)
@time save("bigdf.jld", "bigdf", bigdf)
getfield.(stat.(["bigdf.csv", "bigdf.jld"]), :size)
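
# The size check on the last line above can be tried in isolation. This is a sketch using only Base Julia, assuming Julia 1.0 or later (the temporary file and the names `path` and `sizes` are illustrative): `filesize(path)` reports the same number as `stat(path).size`.

path = tempname()
write(path, "A,B\n1,2\n") # a tiny CSV-like file
sizes = (filesize(path), stat(path).size)
rm(path) # remove the temporary file again
sizes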

# Finally, let's clean up. Do not run the next cell unless you are sure that it will not erase your important files.

foreach(rm, ["x.csv", "x.jld", "bigdf.csv", "bigdf.jld"])
