Friday, 13 September 2013

R read.table loops row column entries to next row

This is the first time I have encountered this problem with read.table: for
rows with a very large number of columns, read.table wraps the extra column
entries onto the following rows.
I have a .txt file with rows of unequal length. For reference, this is the
.txt file I am reading:
http://www.broadinstitute.org/gsea/msigdb/download_file.jsp?filePath=/resources/msigdb/4.0/c5.bp.v4.0.symbols.gmt
Here is my code:
# tabsep evaluates to a single tab character, so this is equivalent to sep = "\t"
tabsep <- gsub("\\\\t", "\t", "\\t")
MSigDB.collection = read.table(fileName, header = FALSE, fill = TRUE,
                               as.is = TRUE, sep = tabsep)
Partial output: first columns
   V1  V2  V3  V4  V5  V6
1  TRNA_PROCESSING  http://www.broadinstitute.org/gsea/msigdb/cards/TRNA_PROCESSING  ADAT1  TRNT1  FARS2
2  REGULATION_OF_BIOLOGICAL_QUALITY  http://www.broadinstitute.org/gsea/msigdb/cards/REGULATION_OF_BIOLOGICAL_QUALITY  DLC1  ALS2  SLC9A7
3  DNA_METABOLIC_PROCESS  http://www.broadinstitute.org/gsea/msigdb/cards/DNA_METABOLIC_PROCESS  XRCC5  XRCC4  RAD51C
4  AMINO_SUGAR_METABOLIC_PROCESS  http://www.broadinstitute.org/gsea/msigdb/cards/AMINO_SUGAR_METABOLIC_PROCESS  UAP1  CHIA  GNPDA1
5  BIOPOLYMER_CATABOLIC_PROCESS  http://www.broadinstitute.org/gsea/msigdb/cards/BIOPOLYMER_CATABOLIC_PROCESS  BTRC  HNRNPD  USE1
6  RNA_METABOLIC_PROCESS  http://www.broadinstitute.org/gsea/msigdb/cards/RNA_METABOLIC_PROCESS  HNRNPF  HNRNPD  SYNCRIP
7  INTS6  LSM5  LSM4  LSM3  LSM1
8  CRK
9  GLUCAN_METABOLIC_PROCESS  http://www.broadinstitute.org/gsea/msigdb/cards/GLUCAN_METABOLIC_PROCESS  GCK  PYGM  GSK3B
10  PROTEIN_POLYUBIQUITINATION  http://www.broadinstitute.org/gsea/msigdb/cards/PROTEIN_POLYUBIQUITINATION  ERCC8  HUWE1  DZIP3
...
Partial output: last columns
   V403  V404  V405  V406  V407  V408  V409  V410  V411  V412  V413  V414  V415  V416  V417  V418  V419  V420  V421
1
2  CALCA  CALCB  FAM107A  CDK11A  RASGRP4  CDK11B  SYN3  GP1BA  TNN  ENO1  PTPRC  MTL5  ISOC2  RHAG  VWF  GPI  HPX  SLC5A7  F2R
3
4
5
6  IRF2  IRF3  SLC2A4RG  LSM6  XRCC6  INTS1  HOXD13  RP9  INTS2  ZNF638  INTS3  ZNF254  CITED1  CITED2  INTS9  INTS8  INTS5  INTS4  INTS7
7  POU1F1  TCF7L2  TNFRSF1A  NPAS2  HAND1  HAND2  NUDT21  APEX1  ENO1  ERF  DTX1  SOX30  CBY1  DIS3  SP1  SP2  SP3  SP4  NFIC
8
9
10
For instance, the column entries for row 6 wrap around to fill row 7 and row
8. I seem to have this problem only for rows with a very large number of
columns. It occurs for other .txt files as well, but the break happens at
different column numbers. I inspected all the entries where the break
happens and there are no unusual characters in them (they are all standard
upper-case gene symbols).
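One thing worth checking is whether the wrapped rows are simply longer than the column count read.table settled on. According to ?read.table, when col.names is not given the number of columns is determined from the first five lines of input, so a later, longer line can spill onto extra rows when fill = TRUE. The sketch below is only an illustration on a made-up toy file (the file contents and the names tmp, rows, n_fields, wrapped and fixed are assumptions, not taken from the MSigDB data). It counts the fields per line with count.fields and then passes enough col.names to hold the longest line.

```r
# Hypothetical toy file: line 6 has more fields than any of the first five lines.
tmp <- tempfile(fileext = ".txt")
rows <- list(
  c("SET_A", "url_a", "G1", "G2"),
  c("SET_B", "url_b", "G1", "G2", "G3"),
  c("SET_C", "url_c", "G1"),
  c("SET_D", "url_d", "G1", "G2"),
  c("SET_E", "url_e", "G1"),
  c("SET_F", "url_f", "G1", "G2", "G3", "G4", "G5", "G6")
)
writeLines(vapply(rows, paste, character(1), collapse = "\t"), tmp)

# Number of tab-separated fields on each line of the file.
n_fields <- count.fields(tmp, sep = "\t", quote = "", comment.char = "")

# Without col.names, the column count comes from the first five lines only,
# so the longer sixth line is wrapped onto an extra row.
wrapped <- read.table(tmp, header = FALSE, fill = TRUE, as.is = TRUE,
                      sep = "\t", quote = "", comment.char = "")

# Supplying one column name per field of the longest line avoids the wrap.
fixed <- read.table(tmp, header = FALSE, fill = TRUE, as.is = TRUE,
                    sep = "\t", quote = "", comment.char = "",
                    col.names = paste0("V", seq_len(max(n_fields))))

dim(wrapped)
dim(fixed)
```

If max(n_fields) on the real file is larger than the number of columns in the wrapped data frame, that would be consistent with this explanation; it would also fit the break happening at different column numbers in different files.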
I have tried both read.table and read.delim, with the same result. If I
convert the .txt file to .csv first and use the same code, I do not have
this problem (see below for the equivalent output). But I don't want to
convert each file to .csv first, and really I just want to understand what
is going on.
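As an alternative to converting to .csv, a ragged tab-separated file can be read line by line and kept as a list instead of a rectangular data frame. This is a minimal sketch under assumptions: the toy file, the example.org URLs and the names tmp, fields and gene_sets are hypothetical, and the layout (set name, URL, then gene symbols) is inferred from the output shown above.

```r
# Hypothetical toy file: set name, URL, then a variable number of gene symbols.
tmp <- tempfile(fileext = ".gmt")
writeLines(c(
  "SET_A\thttp://example.org/SET_A\tG1\tG2",
  "SET_B\thttp://example.org/SET_B\tG1\tG2\tG3\tG4\tG5\tG6\tG7",
  "SET_C\thttp://example.org/SET_C\tG1"
), tmp)

# Read raw lines and split each one on literal tab characters.
fields <- strsplit(readLines(tmp), "\t", fixed = TRUE)

# Drop the first two fields (name and URL); keep the genes, named by set.
gene_sets <- lapply(fields, function(f) f[-(1:2)])
names(gene_sets) <- vapply(fields, function(f) f[1], character(1))

# Number of genes in each set.
lengths(gene_sets)
```

Because each line becomes its own character vector, no column count has to be guessed up front, so rows of any length stay intact.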
Correct output if I convert to .csv file:
MSigDB.collection = read.table(fileName, header = FALSE, fill = TRUE,
as.is = TRUE, sep = ",")
   V1  V2  V3  V4  V5  V6
1  TRNA_PROCESSING  http://www.broadinstitute.org/gsea/msigdb/cards/TRNA_PROCESSING  ADAT1  TRNT1  FARS2  METTL1
2  REGULATION_OF_BIOLOGICAL_QUALITY  http://www.broadinstitute.org/gsea/msigdb/cards/REGULATION_OF_BIOLOGICAL_QUALITY  DLC1  ALS2  SLC9A7  PTGS2
3  DNA_METABOLIC_PROCESS  http://www.broadinstitute.org/gsea/msigdb/cards/DNA_METABOLIC_PROCESS  XRCC5  XRCC4  RAD51C  XRCC3
4  AMINO_SUGAR_METABOLIC_PROCESS  http://www.broadinstitute.org/gsea/msigdb/cards/AMINO_SUGAR_METABOLIC_PROCESS  UAP1  CHIA  GNPDA1  GNE
5  BIOPOLYMER_CATABOLIC_PROCESS  http://www.broadinstitute.org/gsea/msigdb/cards/BIOPOLYMER_CATABOLIC_PROCESS  BTRC  HNRNPD  USE1  RNASEH1
6  RNA_METABOLIC_PROCESS  http://www.broadinstitute.org/gsea/msigdb/cards/RNA_METABOLIC_PROCESS  HNRNPF  HNRNPD  SYNCRIP  MED24
7  GLUCAN_METABOLIC_PROCESS  http://www.broadinstitute.org/gsea/msigdb/cards/GLUCAN_METABOLIC_PROCESS  GCK  PYGM  GSK3B  EPM2A
8  PROTEIN_POLYUBIQUITINATION  http://www.broadinstitute.org/gsea/msigdb/cards/PROTEIN_POLYUBIQUITINATION  ERCC8  HUWE1  DZIP3  DDB2
9  PROTEIN_OLIGOMERIZATION  http://www.broadinstitute.org/gsea/msigdb/cards/PROTEIN_OLIGOMERIZATION  SYT1  AASS  TP63  HPRT1
