Importing dgCMatrix data from GEO

Hello,

I am trying to upload a dgCMatrix dataset from GEO (GEO Accession viewer). I downloaded the rds file and tried demultiplexing it, but the barcodes do not have any prefix, and two additional files (metadata and umap) could not be uploaded to Cellenics. Should I find a way to merge these files into a single Seurat rds file? Alternatively, is there a demultiplexing command I could use to convert this rds file into the right 10X folders?

Thank you!

Eric

Hi Eric,

Based on the data you provided, it seems that the barcodes in your dataset indeed have suffixes (e.g., “1_1”, “2_1”, “3_1”, “4_1”, etc.). By utilizing the sample regular expression ([ACGT]+-1)_([[:digit:]]+), you should be able to demultiplex your dataset effectively, obtaining 22 distinct samples.

It’s important to cross-reference the paper and its supplemental material for detailed information on how these suffixes correlate with the actual sample names. If the information provided there doesn’t clearly map these suffixes to sample names, I’d suggest reaching out to the authors for clarification.

Regarding the additional metadata and UMAP files you mentioned, here’s how you can handle them for use in Cellenics:

  • For the metadata file, you can upload this as cell-level metadata to Cellenics. Use the “Add data” button in the Data Management page and upload your file in the .tsv format.
  • The umap file, which contains embeddings, cannot be directly uploaded to Cellenics. To replicate the embeddings presented in the original study, you would need to apply the same filtering, normalization, and clustering parameters outlined in their methods. You can manually adjust different parameters in Data Processing.
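
As a rough illustration of the first bullet, a tab-delimited metadata table can be re-saved with a .tsv extension in R. This is only a sketch: the file names, column names, and values below are made up for the example, and the exact column layout Cellenics expects should be checked against its documentation.

```r
# Hypothetical input: a small tab-delimited metadata table standing in
# for the metadata file downloaded from GEO (columns are invented).
meta_in <- tempfile(fileext = ".txt")
writeLines(c("barcode\tsample\tcell_type",
             "AAACCCACAAGAAACT-1_1_1\t1_1\ttype_a",
             "TTTGTTGGTACCGTTT-1_11_1\t11_1\ttype_b"), meta_in)

# Read the table; read.delim() assumes a header row and tab separators.
meta <- read.delim(meta_in, stringsAsFactors = FALSE)

# Write it back out as a plain .tsv without quotes or row names.
meta_out <- file.path(tempdir(), "cell_metadata.tsv")
write.table(meta, meta_out, sep = "\t", quote = FALSE, row.names = FALSE)
```

read.delim() also reads gzip-compressed files directly, so a .txt.gz download should not need to be unpacked first.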

I hope this information helps guide your next steps in analyzing your dataset. If you have any more questions or need further assistance, feel free to ask.

Thank you Sara,

However, I am not familiar with regular expression syntax. I have been trying different permutations, and I either get error messages or the demultiplexing does not remove the barcode.

Here is what I thought was the closest:

data.pfx <- gsub(([ACGT]+-1)_([[:digit:]]+), "\\1", colnames(data.count), perl=TRUE)

To which I get

Error: unexpected '[' in "data.pfx <- gsub((["

This regular expression does not result in an error message, but does not remove the barcodes either:

data.pfx <- gsub("([ACGT]+-1)_([[:digit:]]+)", "\\1", colnames(data.count), perl=TRUE)

head(samplenames):
[1] "AAACCCACAAATGAAC-1_1" "AAACCCACAAGAAACT-1_1" "AAACCCACAGCCTTCT-1_1" "AAACCCAGTAGCGCTC-1_1" "AAACCCATCCATATGG-1_1"
[6] "AAACCCATCCCGTTGT-1_1"

I also tried using the sample data, but the Lambrechts lab site is no longer online, so I cannot download the data to see what a normal run looks like.

Thanks again!

Hi Eric,

Sorry for the late reply. The code you are using assumes that the sample ID is a prefix of each barcode, whereas in this dataset the samples are encoded as suffixes.

To solve this, I would suggest following the tutorial here: How to demultiplex an rds object and convert it to 10X files (count matrices)?, but with a couple of changes.

  • The regex data.pfx <- gsub("(.+)_[A-Z]+-1$", "\\1", colnames(data.count), perl=TRUE) needs to be changed to data.pfx <- gsub("([ACGT]+-1)_([[:digit:]]+)", "\\2", colnames(data.count), perl=TRUE). Note that "\\1" is replaced with "\\2" because the samples are in suffixes instead of prefixes, so the second capture group is the one to keep.

  • Also, you need to change this line: DropletUtils::write10xCounts(path = paste0(getwd(),"/demultiplexed/",samples[i]), x = obj[,grep(paste0("^",samples[i],".*"),colnames(obj))], type = "sparse", version="3")
    to DropletUtils::write10xCounts(path = paste0(getwd(),"/demultiplexed/",samples[i]), x = obj[,grep(paste0(samples[i], "$"),colnames(obj))], type = "sparse", version="3") so that the samples are demultiplexed correctly.
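
The effect of switching from "\\1" to "\\2" can be checked on a few toy barcodes before running it on the full matrix. This is a minimal sketch; the barcodes below are illustrative stand-ins in a BARCODE-1_<sample> layout, not values read from the actual rds file.

```r
# Toy barcodes: 16-base barcode, "-1", then "_" and a sample suffix.
barcodes <- c("AAACCCACAAGAAACT-1_1_1",
              "TTTGTTGGTACCGTTT-1_11_1",
              "AAACCCACAAATGAAC-1_2_1")

# Group 1 captures "BARCODE-1", group 2 captures the first run of digits
# after the underscore. Whatever follows the match is left in place.
pattern <- "([ACGT]+-1)_([[:digit:]]+)"

# Keeping group 2 drops the barcode and leaves the sample suffix.
data.pfx <- gsub(pattern, "\\2", barcodes, perl = TRUE)

# Keeping group 1 instead leaves the barcode in the result.
kept.barcode <- gsub(pattern, "\\1", barcodes, perl = TRUE)

# Distinct sample IDs found in the toy input.
samples <- unique(data.pfx)
print(samples)
```

On the real object, length(samples) should equal the number of samples expected from the paper or the metadata file.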

I hope this helps. Let me know if you have any other questions.

Hi Sara,

Yes, now the samples were demultiplexed and I was able to upload them, thank you very much!

I have a new question: I was inspecting the barcodes.tsv files, and I noticed that there might be some barcodes associated with another sample that were not demultiplexed. For example, sample 1_1 contained barcodes from both 1_1 and 11_1 (probably because 11_1 also contains 1_1!). Should I modify the regex, or can I manually remove the 11_1 samples from the barcodes file?

Thank you once again, and no worries about the prior late reply, I realize that this is taking some of your time away from other projects!

Hi Eric,

I took a deeper look into your data and noticed there’s a file named “GSE183839_EXPORT_snRNAseq_metadata.txt.gz” which offers insight into the association of barcodes with specific samples. You can check that to see which samples to expect. I would not remove the “11_1” samples. The regular expression I provided earlier is designed to differentiate samples based on their unique identifiers, such as “11_1” and “1_1”, by separating the first part of the barcode up to “-1_”, from the remaining section. This approach ensures that barcodes like “AAACCCACAAGAAACT-1_1_1” and “TTTGTTGGTACCGTTT-1_11_1” are assigned correctly to their respective samples because “-1_1_1” differs from “-1_11_1”.

It’s important to ensure you’re utilizing the regex correctly: data.pfx <- gsub("([ACGT]+-1)_([[:digit:]]+)", "\\2", colnames(data.count), perl=TRUE).
This expression should effectively demultiplex the samples. I tested it and I confirm it’s working as expected.

Please take a moment to double-check your code. If you still face issues after reviewing it and checking the metadata file, feel free to reach out again.
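
One place worth checking is the column-selection step that follows the regex. The sketch below uses toy barcodes (not the real data) and assumes the BARCODE-1_<sample> layout. A suffix pattern such as "1_1$" has no left boundary, so it also matches barcodes that end in "11_1". Selecting columns by exact equality on the extracted sample ID, or anchoring the pattern on the "-1_" separator, avoids that overlap. The variable names are illustrative.

```r
# Toy barcodes from two samples whose IDs share an ending: "1_1" and "11_1".
barcodes <- c("AAACCCACAAGAAACT-1_1_1", "TTTGTTGGTACCGTTT-1_11_1")

# Sample ID per barcode, extracted with the regex from this thread.
sample_of <- gsub("([ACGT]+-1)_([[:digit:]]+)", "\\2", barcodes, perl = TRUE)

target <- "1_1"

# Suffix-only match: both barcodes end in "1_1", so both are selected.
loose <- grep(paste0(target, "$"), barcodes)

# Exact comparison on the extracted sample ID: only the first is selected.
exact <- which(sample_of == target)

# Alternative: include the "-1_" separator so the ID is bounded on the left.
anchored <- grep(paste0("-1_", target, "$"), barcodes)

print(list(loose = loose, exact = exact, anchored = anchored))
```

In the write10xCounts() call, this would correspond to subsetting with obj[, data.pfx == samples[i]] instead of a grep() on the column names, assuming data.pfx was computed from colnames(obj) in the same order.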