10X_1.3_Million_Brain_Cells_from_E18_Mice.md
10X_1.3_Million_Brain_Cells_from_E18_Mice_parallel.md
BipolarCell2016.md
Dockerfile
README.md
dense_to_sparse.py
hdf5_to_sparse.py
hdf5_to_sparse_tasks.tsv

README.md

Load Data

Single-cell RNA-seq data is sparse: most gene-by-cell counts are zero. The code in this section loads the data into a "Tidy Data" table schema with, at a minimum, columns for cell identifier, gene name, and transcript count.

The general steps in this process are:

Convert source data to long, sparse format.

For example, given a CSV file in dense matrix format, such as:

,cell1,cell2,cell3
gene1,0.0,0.0,3.0
gene2,0.0,0.0,0.0
gene3,1.0,0.0,2.0

reshape it as a tidy data CSV with one row per nonzero entry:

cell,gene,trans_cnt
cell1,gene3,1.0
cell3,gene1,3.0
cell3,gene3,2.0
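
The reshaping above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not necessarily how dense_to_sparse.py in this directory works; the function name `dense_to_sparse` and the in-memory sample input are assumptions made for the example. It assumes genes are rows, cells are columns, and every value parses as a number.

```python
import csv
import io


def dense_to_sparse(dense_file, sparse_file):
    """Reshape a dense gene-by-cell CSV into long (cell, gene, trans_cnt) rows.

    Hypothetical helper for illustration only.
    """
    reader = csv.reader(dense_file)
    # The header row is an empty corner field followed by the cell identifiers.
    cells = next(reader)[1:]
    writer = csv.writer(sparse_file, lineterminator="\n")
    writer.writerow(["cell", "gene", "trans_cnt"])
    for row in reader:
        if not row:
            continue  # skip blank lines
        gene, counts = row[0], row[1:]
        for cell, count in zip(cells, counts):
            # Keep only nonzero entries; dropping zeros is what makes the output sparse.
            if float(count) != 0.0:
                writer.writerow([cell, gene, count])


# The dense example matrix from above, held in memory instead of a file.
dense = io.StringIO(
    ",cell1,cell2,cell3\n"
    "gene1,0.0,0.0,3.0\n"
    "gene2,0.0,0.0,0.0\n"
    "gene3,1.0,0.0,2.0\n"
)
sparse = io.StringIO()
dense_to_sparse(dense, sparse)
print(sparse.getvalue())
```

Because the sketch streams the input one gene at a time, it emits rows in gene order rather than the cell order shown above. The set of rows is the same, and row order should not matter once the file is loaded into a table.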

Load the reshaped data to BigQuery.

bq --project PROJECT-ID load --autodetect DATASET-NAME.TABLE-NAME \
  gs://BUCKET-NAME/PATH/TO/LONG/SPARSE/FILE.csv

Here are instructions to load specific datasets, each demonstrating a different technique:

Bipolar Cell 2016 - uses a Compute Engine instance with the Container-Optimized OS VM image to run Dockerized R
10X Genomics 1.3 Million Brain Cells from E18 Mice - uses a Compute Engine instance with the default VM image to run a Python script
10X Genomics 1.3 Million Brain Cells from E18 Mice - faster via parallel execution - uses Cloud Shell with dsub for batch computing
