-
Notifications
You must be signed in to change notification settings - Fork 2
/
01_python_tools_ds.Rmd
662 lines (442 loc) · 20.5 KB
/
01_python_tools_ds.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
---
jupyter:
jupytext:
formats: ipynb,Rmd
text_representation:
extension: .Rmd
format_name: rmarkdown
format_version: '1.2'
jupytext_version: 1.4.2
kernelspec:
display_name: Python 3
language: python
name: python3
---
# Python tools for data science
(last updated 2020-05-18)
## The PyData Stack
The Python Data Stack comprises a set of packages that makes Python a powerful data science language. These include
+ Numpy: provides arrays and matrix algebra
+ Scipy: provides scientific computing capabilities
+ matplotlib: provides graphing capabilities
These were the original stack that was meant to replace Matlab. However, these were meant to tackle purely numerical data, and the kinds of heterogeneous data we regularly face needed more tools. These were added more recently.
+ Pandas: provides data analytic structures like the data frame, as well as basic descriptive statistical capabilities
+ statsmodels: provides a fairly comprehensive set of statistical functions
+ scikit-learn: provides machine learning capabilities
This is the basic stack of packages we will be using in this workshop. Additionally we will use a few packages that add some functionality to the data science process. These include
+ seaborn: Better statistical graphs
+ plotly: Interactive graphics
+ biopython: Python for bioinformatics
We may also introduce the package `rpy2` which allows one to run R from within Python. This can be useful since many bioinformatic pipelines are already implemented in R.
> The [PyData stack](https://scipy.org) also includes `sympy`, a symbolic mathematics package emulating Maple
## Numpy (numerical and scientific computing)
We start by importing the Numpy package into Python using the alias `np`.
```{python 01-python-tools-ds-2}
import numpy as np
```
Numpy provides both arrays (vectors, matrices, higher dimensional arrays) and vectorized functions which are very fast. Let's see how this works.
```{python 01-python-tools-ds-3}
z = [1,2,3,4,5,6,7,8,9.3,10.6] # This is a list
z_array = np.array(z)
z_array
```
Now, we have already seen functions in Python earlier. In Numpy, there are functions that are optimized for arrays, that can be accessed directly from the array objects. This is an example of *object-oriented programming* in Python, where functions are provided for particular *classes* of objects, and which can be directly accessed from the objects. We will use several such functions over the course of this workshop, but we won't actually talk about how to do this program development here.
> Numpy functions are often very fast, and are *vectorized*, i.e., they are written to work on vectors of numbers rather than single numbers. This is an advantage in data science since we often want to do the same operation to all elements of a column of data, which is essentially a vector
We apply the functions `sum`, `min` (minimum value) and `max` (maximum value) to `z_array`.
```{python 01-python-tools-ds-4}
z_array.sum()
```
```{python 01-python-tools-ds-5}
z_array.min()
```
```{python 01-python-tools-ds-6}
z_array.max()
```
The versions of these functions in Numpy are optimized for arrays and are quite a bit faster than the corresponding functions available in base Python. When doing data work, these are the preferred functions.
These functions can also be used in the usual function manner:
```{python 01-python-tools-ds-7}
np.max(z_array)
```
Calling `np.max` ensures that we are using the `max` function from numpy, and not the one in base Python.
### Numpy data types
Numpy arrays are homogeneous in type.
```{python 01-python-tools-ds-8}
np.array(['a','b','c'])
```
```{python 01-python-tools-ds-9}
np.array([1,2,3,6,8,29])
```
But, what if we provide a heterogeneous list?
```{python 01-python-tools-ds-10}
y = [1,3,'a']
np.array(y)
```
So what's going on here? Upon conversion from a heterogeneous list, numpy converted the numbers into strings. This is necessary since, by definition, numpy arrays can hold data of a single type. When one of the elements is a string, numpy casts all the other entities into strings as well. Think about what would happen if the opposite rule was used. The string 'a' doesn't have a corresponding number, while both numbers 1 and 3 have corresponding string representations, so going from string to numeric would create all sorts of problems.
> The advantage of numpy arrays is that the data is stored in a contiguous section of memory, and you can be very efficient with homogeneous arrays in terms of manipulating them, applying functions, etc. However, `numpy` does provide a "catch-all" `dtype` called `object`, which can be any Python object. This `dtype` essentially is an array of pointers to actual data stored in different parts of the memory. You can get to the actual objects by extracting them. So one could do <a name='object'></a>
```{python 01-python-tools-ds-11}
np.array([1,3,'a'], dtype='object')
```
> which would basically be a valid `numpy` array, but would go back to the actual objects when used, much like a list. We can see this later if we want to transform a heterogeneous `pandas` `DataFrame` into a `numpy` array. It's not particularly useful as is, but it prevents errors from popping up during transformations from `pandas` to `numpy`.
### Generating data in numpy
We had seen earlier how we could generate a sequence of numbers in a list using `range`. In numpy, you can generate a sequence of numbers in an array using `arange` (which actually creates the array rather than provide an iterator like `range`).
```{python 01-python-tools-ds-12}
np.arange(10)
```
You can also generate regularly spaced sequences of numbers between particular values
```{python 01-python-tools-ds-13}
np.linspace(start=0, stop=1, num=11) # or np.linspace(0, 1, 11)
```
You can also do this with real numbers rather than integers.
```{python 01-python-tools-ds-14}
np.linspace(start = 0, stop = 2*np.pi, num = 10)
```
More generally, you can transform lists into `numpy` arrays. We saw this above for vectors. For matrices, you can provide a list of lists. Note the double `[` in front and back.
```{python 01-python-tools-ds-15}
np.array([[1,3,5,6],[4,3,9,7]])
```
You can generate an array of 0's
```{python 01-python-tools-ds-16}
np.zeros(10)
```
This can easily be extended to a two-dimensional array (a matrix), by specifying the dimension of the matrix as a tuple.
```{python 01-python-tools-ds-17}
np.zeros((10,10))
```
You can also generate a matrix of 1s in a similar manner.
```{python 01-python-tools-ds-18}
np.ones((3,4))
```
In matrix algebra, the identity matrix is important. It is a square matrix with 1's on the diagonal and 0's everywhere else.
```{python 01-python-tools-ds-19}
np.eye(4)
```
You can also create numpy vectors directly from lists, as long as lists are made up of atomic elements of the same type. This means a list of numbers or a list of strings. The elements can't be more composite structures, generally. One exception is a list of lists, where all the lists contain the same type of atomic data, which, as we will see, can be used to create a matrix or 2-dimensional array.
```{python 01-python-tools-ds-20}
a = [1,2,3,4,5,6,7,8]
b = ['a','b','c','d','3']
np.array(a)
```
```{python 01-python-tools-ds-21}
np.array(b)
```
#### Random numbers
Generating random numbers is quite useful in many areas of data science. All computers don't produce truly random numbers but generate *pseudo-random* sequences. These are completely deterministic sequences defined algorithmically that emulate the properties of random numbers. Since these are deterministic, we can set a *seed* or starting value for the sequence, so that we can exactly reproduce this sequence to help debug our code. To actually see how things behave in simulations we will often run several sequences of random numbers starting at different seed values.
The seed is set by the `RandomState` function within the `random` submodule of numpy. Note that all Python names are case-sensitive.
```{python 01-python-tools-ds-22}
rng = np.random.RandomState(35) # set seed
rng.randint(0, 10, (3,4))
```
We have created a 3x4 matrix of random integers between 0 and 10 (in line with slicing rules, this includes 0 but not 10).
We can also create a random sample of numbers between 0 and 1.
```{python 01-python-tools-ds-23}
rng.random_sample((5,2))
```
We'll see later how to generate random numbers from particular probability distributions.
### Vectors and matrices
Numpy generates arrays, which can be of arbitrary dimension. However the most useful are vectors (1-d arrays) and matrices (2-d arrays).
In these examples, we will generate samples from the Normal (Gaussian) distribution, with mean 0 and variance 1.
```{python 01-python-tools-ds-24}
A = rng.normal(0,1,(4,5))
```
We can compute some characteristics of this matrix's dimensions. The number of rows and columns are given by `shape`.
```{python 01-python-tools-ds-25}
A.shape
```
The total number of elements are given by `size`.
```{python 01-python-tools-ds-26}
A.size
```
If we want to create a matrix of 0's with the same dimensions as `A`, we don't actually have to compute its dimensions. We can use the `zeros_like` function to figure it out.
```{python 01-python-tools-ds-27}
np.zeros_like(A)
```
We can also create vectors by only providing the number of rows to the random sampling function. The number of columns will be assumed to be 1.
```{python 01-python-tools-ds-28}
B = rng.normal(0, 1, (4,))
B
```
#### Extracting elements from arrays
The syntax for extracting elements from arrays is almost exactly the same as for lists, with the same rules for slices.
**Exercise:** State what elements of B are extracted by each of the following statements
```
B[:3]
B[:-1]
B[[0,2,4]]
B[[0,2,5]]
```
For matrices, we have two dimensions, so you can slice by rows, or columns or both.
```{python 01-python-tools-ds-29}
A
```
We can extract the first column by specifying `:` (meaning everything) for the rows, and the index for the column (reminder, Python starts counting at 0)
```{python 01-python-tools-ds-30}
A[:,0]
```
Similarly the 4th row can be extracted by putting the row index, and `:` for the column index.
```{python 01-python-tools-ds-31}
A[3,:]
```
All slicing operations work for rows and columns
```{python 01-python-tools-ds-32}
A[:2,:2]
```
#### Array operations
We can do a variety of vector and matrix operations in `numpy`.
First, all usual arithmetic operations work on arrays, like adding or multiplying an array with a scalar.
```{python 01-python-tools-ds-33}
A = rng.randn(3,5)
A
```
```{python 01-python-tools-ds-34}
A + 10
```
We can also add and multiply arrays __element-wise__ as long as they are the same shape.
```{python 01-python-tools-ds-35}
B = rng.randint(0,10, (3,5))
B
```
```{python 01-python-tools-ds-36}
A + B
```
```{python 01-python-tools-ds-37}
A * B
```
You can also do **matrix multiplication**. Recall what this is.
If you have a matrix $A_{m x n}$ and another matrix $B_{n x p}$, as long as the number of columns of $A$ and rows of $B$ are the same, you can multiply them ($C_{m x p} = A_{m x n}B_{n x p}$), with the (i,j)-th element of C being
$$ c_{ij} = \sum_{k=1}^n a_{ik}b_{kj}, i= 1, \dots, m; j = 1, \dots, p$$
In `numpy` the operant for matrix multiplication is `@`.
In the above examples, `A` and `B` cannot be multiplied since they have incompatible dimensions. However, we can take the *transpose* of `B`, i.e. flip the rows and columns, to make it compatible with `A` for matrix multiplication.
```{python 01-python-tools-ds-38}
A @ np.transpose(B)
```
```{python 01-python-tools-ds-39}
np.transpose(A) @ B
```
More generally, you can *reshape* a `numpy` array into a new shape, provided it is compatible with the number of elements in the original array.
```{python 01-python-tools-ds-40}
D = rng.randint(0,5, (4,4))
D
```
```{python 01-python-tools-ds-41}
D.reshape(8,2)
```
```{python 01-python-tools-ds-42}
D.reshape(1,16)
```
This can also be used to cast a vector into a matrix.
```{python 01-python-tools-ds-43}
e = np.arange(20)
E = e.reshape(5,4)
E
```
> One thing to note in all the reshaping operations above is that the new array takes elements of the old array **by row**. See the examples above to convince yourself of that.
#### Statistical operations on arrays
You can sum all the elements of a matrix using `sum`. You can also sum along rows or along columns by adding an argument to the `sum` function.
```{python 01-python-tools-ds-44}
A = rng.normal(0, 1, (4,2))
A
```
```{python 01-python-tools-ds-45}
A.sum()
```
You can sum along rows (i.e., down columns) with the option `axis = 0`
```{python 01-python-tools-ds-46}
A.sum(axis=0)
```
You can sum along columns (i.e., across rows) with `axis = 1`.
```{python 01-python-tools-ds-47}
A.sum(axis=1)
```
> Of course, you can use the usual function calls: `np.sum(A, axis = 1)`
We can also find the minimum and maximum values.
```{python 01-python-tools-ds-48}
A.min(axis = 0)
```
```{python 01-python-tools-ds-49}
A.max(axis = 0)
```
We can also find the **position** where the minimum and maximum values occur.
```{python 01-python-tools-ds-50}
A.argmin(axis=0)
```
```{python 01-python-tools-ds-51}
A.argmax(axis=0)
```
We can sort arrays and also find the indices which will result in the sorted array. I'll demonstrate this for a vector, where it is more relevant
```{python 01-python-tools-ds-52}
a = rng.randint(0,10, 8)
a
```
```{python 01-python-tools-ds-53}
np.sort(a)
```
```{python 01-python-tools-ds-54}
np.argsort(a)
```
```{python 01-python-tools-ds-55}
a[np.argsort(a)]
```
`np.argsort` can also help you find the 2nd smallest or 3rd largest value in an array, too.
```{python 01-python-tools-ds-56}
ind_2nd_smallest = np.argsort(a)[1]
a[ind_2nd_smallest]
```
```{python 01-python-tools-ds-57}
ind_3rd_largest = np.argsort(a)[-3]
a[ind_3rd_largest]
```
You can also sort strings in this way.
```{python 01-python-tools-ds-58}
m = np.array(['Aram','Raymond','Elizabeth','Donald','Harold'])
np.sort(m)
```
If you want to sort arrays **in place**, you can use the `sort` function in a different way.
```{python 01-python-tools-ds-59}
m.sort()
m
```
#### Putting arrays together
We can put arrays together by row or column, provided the corresponding axes have compatible lengths.
```{python 01-python-tools-ds-60}
A = rng.randint(0,5, (3,5))
B = rng.randint(0,5, (3,5))
print('A = ', A)
print('B = ', B)
```
```{python 01-python-tools-ds-61}
np.hstack((A,B))
```
```{python 01-python-tools-ds-62}
np.vstack((A,B))
```
Note that both `hstack` and `vstack` take a **tuple** of arrays as input.
#### Logical/Boolean operations
You can query a matrix to see which elements meet some criterion. In this example, we'll see which elements are negative.
```{python 01-python-tools-ds-63}
A < 0
```
This is called **masking**, and is useful in many contexts.
We can extract all the negative elements of A using
```{python 01-python-tools-ds-64}
A[A<0]
```
This forms a 1-d array. You can also count the number of elements that meet the criterion
```{python 01-python-tools-ds-65}
np.sum(A<0)
```
Since the entity `A<0` is a matrix as well, we can do row-wise and column-wise operations as well.
### Beware of copies
One has to be a bit careful with copying objects in Python. By default, if you just assign one object to a new name, it does a *shallow copy*, which means that both names point to the same memory. So if you change something in the original, it also changes in the new copy.
```{python 01-python-tools-ds-66}
A[0,:]
```
```{python 01-python-tools-ds-67}
A1 = A
A1[0,0] = 4
A[0,0]
```
To actually create a copy that is not linked back to the original, you have to make a *deep copy*, which creates a new space in memory and a new pointer, and copies the original object to the new memory location
```{python 01-python-tools-ds-68}
A1 = A.copy()
A1[0,0] = 6
A[0,0]
```
You can also replace sub-matrices of a matrix with new data, provided that the dimensions are compatible. (Make sure that the sub-matrix we are replacing below truly has 2 rows and 2 columns, which is what `np.eye(2)` will produce)
```{python 01-python-tools-ds-69}
A[:2,:2] = np.eye(2)
A
```
#### Reducing matrix dimensions
Sometimes the output of some operation ends up being a matrix of one column or one row. We can reduce it to become a vector. There are two functions that can do that, `flatten` and `ravel`.
```{python 01-python-tools-ds-70}
A = rng.randint(0,5, (5,1))
A
```
```{python 01-python-tools-ds-71}
A.flatten()
```
```{python 01-python-tools-ds-72}
A.ravel()
```
So why two functions? I'm not sure, but they do different things behind the scenes. `flatten` creates a **copy**, i.e. a new array disconnected from `A`. `ravel` creates a **view**, so a representation of the original array. If you then changed a value after a `ravel` operation, you would also change it in the original array; if you did this after a `flatten` operation, you would not.
### Broadcasting in Python
Python deals with arrays in an interesting way, in terms of matching up dimensions of arrays for arithmetic operations. There are 3 rules:
1. If two arrays differ in the number of dimensions, the shape of the smaller array is padded with 1s on its _left_ side
2. If the shape doesn't match in any dimension, the array with shape = 1 in that dimension is stretched to match the others' shape
3. If in any dimension the sizes disagree and none of the sizes are 1, then an error is generated
```{python 01-python-tools-ds-73}
A = rng.normal(0,1,(4,5))
B = rng.normal(0,1,5)
```
```{python 01-python-tools-ds-74}
A.shape
```
```{python 01-python-tools-ds-75}
B.shape
```
```{python 01-python-tools-ds-76}
A - B
```
B is 1-d, A is 2-d, so B's shape is made into (1,5) (added to the left). Then it is repeated into 4 rows to make it's shape (4,5), then the operation is performed. This means that we subtract the first element of B from the first column of A, the second element of B from the second column of A, and so on.
You can be explicit about adding dimensions for broadcasting by using `np.newaxis`.
```{python 01-python-tools-ds-77}
B[np.newaxis,:].shape
```
```{python 01-python-tools-ds-78}
B[:,np.newaxis].shape
```
#### An example (optional, intermediate/advanced))
This can be very useful, since these operations are faster than for loops. For example:
```{python 01-python-tools-ds-79}
d = rng.random_sample((10,2))
d
```
We want to find the Euclidean distance (the sum of squared differences) between the points defined by the rows. This should result in a 10x10 distance matrix
```{python 01-python-tools-ds-80}
d.shape
```
```{python 01-python-tools-ds-81}
d[np.newaxis,:,:]
```
creates a 3-d array with the first dimension being of length 1
```{python 01-python-tools-ds-82}
d[np.newaxis,:,:].shape
```
```{python 01-python-tools-ds-83}
d[:, np.newaxis,:]
```
creates a 3-d array with the 2nd dimension being of length 1
```{python 01-python-tools-ds-84}
d[:,np.newaxis,:].shape
```
Now for the trick, using broadcasting of arrays. These two arrays are incompatible without broadcasting, but with broadcasting, the right things get repeated to make things compatible
```{python 01-python-tools-ds-85}
dist_sq = np.sum((d[:,np.newaxis,:] - d[np.newaxis,:,:]) ** 2)
```
```{python 01-python-tools-ds-86}
dist_sq.shape
```
```{python 01-python-tools-ds-87}
dist_sq
```
Whoops! we wanted a 10x10 matrix, not a scalar.
```{python 01-python-tools-ds-88}
(d[:,np.newaxis,:] - d[np.newaxis,:,:]).shape
```
What we really want is the 10x10 distance matrix.
```{python 01-python-tools-ds-89}
dist_sq = np.sum((d[:,np.newaxis,:] - d[np.newaxis,:,:]) ** 2, axis=2)
```
You can verify what is happening by creating `D = d[:,np.newaxis,:]-d[np.newaxis,:,:]` and then looking at `D[:,:,0]` and `D[:,:,1]`. These are the difference between each combination in the first and second columns of d, respectively. Squaring and summing along the 3rd axis then gives the sum of squared differences.
```{python 01-python-tools-ds-90}
dist_sq
```
```{python 01-python-tools-ds-91}
dist_sq.shape
```
```{python 01-python-tools-ds-92}
dist_sq.diagonal()
```
### Conclusions moving forward
It's important to understand numpy and arrays, since most data sets we encounter are rectangular. The notations and operations we saw in numpy will translate to data, except for the fact that data is typically heterogeneous, i.e., of different types. The problem with using numpy for modern data analysis is that if you have mixed data types, it will all be coerced to strings, and then you can't actually do any data analysis.
The solution to this issue (which is also present in Matlab) came about with the `pandas` package, which is the main workhorse of data science in Python