text file challenge
Professional Poster
Join Date: Sep 2000
Location: San Francisco
I have a text file that is in a horrible format and I need to get the info into something usable, namely a matrix. Here's a snippet of the text file:
Time Point: 0
ID Mean
0 198.186
1 200.604
2 176.318
Time Point: 1
ID Mean
0 196.536
1 199.177
2 177.227
Time Point: 2
ID Mean
0 197.371
1 201.187
2 176.864
I would like it in a format like this:
Timepoint 0 1 2
0 198.186 200.604 176.318
1 196.536 199.177 177.227
2 197.371 201.187 176.864
I'm sure that a combination of Unix command-line tools or a short Perl script would do it, but I'm a relative newbie to these sorts of things, so any help would be appreciated.
thanks,
kman
Moderator 
Join Date: May 2001
Location: Hilbert space
What app do you want to feed it to?
How much data of this kind do you have?
E.g., gnuplot could do that by itself, and so could any C program.
Professional Poster
Join Date: Sep 2000
Location: San Francisco
Originally posted by OreoCookie:
What app do you want to feed it to?
How much data of this kind do you have?
E.g., gnuplot could do that by itself, and so could any C program.
I want to get the data into MATLAB.
I have about 25 files like this, each with about 200 timepoints.
kman
Mac Elite
Join Date: Sep 2000
Location: Edmond, OK USA
OK, here you go. Save this class to a file named Matrix.java and use:
>javac Matrix.java
To compile, and:
>java Matrix <input-files>
to execute. You can specify as many source files as you want. The result will be in the same directory as each source file and will be named <source>.mat.
Also, I didn't do anything fancy to determine whether each time point actually has exactly 3 measurements, so YMMV. I basically did just what was required to reproduce your output (although this only affects the header at the top; the values will still be recorded).
Code:
import java.io.*;
public class Matrix
{
    private static final String TIME_POINT = "Time Point: ";
    private static final int TIME_POINT_LENGTH = TIME_POINT.length();
    public static void main(String [] argv) throws Exception
    {
        for (int i = 0; i < argv.length; i++)
        {
            new Matrix(new File(argv[i])).process();
        }
    }
    private final File _source;
    private final File _target;
    private Matrix(File source)
    {
        _source = source;
        _target = new File(_source.getParentFile(), _source.getName() + ".mat");
    }
    public void process() throws Exception
    {
        BufferedReader reader = new BufferedReader(new FileReader(_source));
        FileWriter writer = new FileWriter(_target);
        writer.write("Timepoint 0 1 2");
        String line = null;
        while ((line = reader.readLine()) != null)
        {
            line = line.trim();
            // check for blanks
            if (line.equals("") || line.startsWith("ID"))
            {
                continue;
            }
            // check for a new time point
            if (line.startsWith(TIME_POINT))
            {
                writer.write('\n');
                // grab the current point ID and continue
                String currentID = line.substring(TIME_POINT_LENGTH);
                writer.write(currentID);
                continue;
            }
            else
            {
                // this is another record for a time point
                // strip the ID first then the value
                writer.write(' ');
                int space = line.indexOf(' ');
                String ID = line.substring(0, space);
                String value = line.substring(space + 1);
                // write it out
                writer.write(value);
            }
        }
        writer.close();
        reader.close();
    }
}
Addicted to MacNN
Join Date: May 2001
Location: Cupertino, CA
Here's a somewhat more concise Perl script that does what you want, with some caveats:
- assumes filenames have one period, e.g. data.txt
- doesn't assume that time points will have three measurements, but assumes that all the time points in the same file will have the same number of measurements
- assumes all your measurements will have a decimal point
- I don't remember where Perl on a Mac is, so you may have to modify the first line to the correct path
You can execute it with 'perl matrix.pl filenames'. For each file such as name.txt, it will create a new file name.mat in the same directory as the source file, in the format you want. And it gets the header right.
Code:
#!/usr/local/bin/perl
foreach $file (@ARGV) {
    $count = 0;
    my %data;
    open INPUT, $file or die "Couldn't open input file: $file -- $!";
    while (<INPUT>) {
        chomp;
        if (/^(\d+) (\d+\.\d+)/) {
            $data{$1} .= " $2";
        } elsif (/^Time Point/) {
            ++$count;
        }
    }
    close INPUT;
    $file =~ s/\.(.*)$/.mat/;
    open OUTPUT, "> $file" or die "Couldn't open output file: $file -- $!";
    print OUTPUT "Timepoint ";
    print OUTPUT "$_ " foreach (0..($count - 1));
    print OUTPUT "\n";
    foreach $key (sort {$a <=> $b} (keys %data)) {
        print OUTPUT "$key$data{$key}\n";
    }
    close OUTPUT;
}
Mac Elite
Join Date: Sep 2000
Location: Edmond, OK USA
Originally posted by itai195:
Here's a somewhat more concise Perl script that does what you want, with some caveats:
Somewhat more concise? HAH! My version is 58 lines with comments and pretty-printing, and yours is 24 (with no comments and densely formatted). Here is the Java version with comments removed and formatting condensed (25 lines):
Code:
public class Matrix {
    public static void main(String [] argv) throws Exception {
        for (int i = 0; i < argv.length; i++)
            new Matrix(new java.io.File(argv[i]));
    }
    private Matrix(java.io.File source) throws Exception {
        java.io.BufferedReader reader = new java.io.BufferedReader(new java.io.FileReader(source));
        java.io.FileWriter writer = new java.io.FileWriter(new java.io.File(source.getParentFile(), source.getName() + ".mat"));
        writer.write("Timepoint 0 1 2");
        String line = null;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            if (line.equals("") || line.startsWith("ID"))
                continue;
            if (line.startsWith("Time Point: ")) {
                writer.write('\n' + line.substring("Time Point: ".length()));
                continue;
            } else {
                writer.write(' ' + line.substring(line.indexOf(' ') + 1));
            }
        }
        writer.close();
        reader.close();
    }
}
That said, I really don't care which version he uses. I prefer the Java version because it is really simple to read. The Perl stuff is difficult to follow for someone who doesn't use Perl or regular expressions a lot. It goes with the old maxim that you build with the tools you know (especially in the case of throw-away code).
Anyway, kman42 now has two fine solutions and no indication if he still cares. 
Mac Elite
Join Date: Sep 2000
Location: Edmond, OK USA
Originally posted by itai195:
And it gets the header right 
I specifically did not read the data into memory because data files of this nature can be quite large and I didn't want to limit size by reading it all into memory.
OK, I should have been more nit-picky before. If you run the perl script on the supplied input, you do NOT get the supplied output. You get this instead:
Code:
Timepoint 0 1 2
0 198.186 196.536 197.371
1 200.604 199.177 201.187
2 176.318 177.227 176.864
Notice that you actually mirrored the matrix about the diagonal from 0, 0 to 2, 2. The header does not list time points across the top. It has this form:
Timepoint n-2 n-1 n
"Timepoint" is a column header showing that the numbers in the first column are time point IDs, and n-2 through n are the IDs of the actual samples. E.g., this time point:
Time Point: 0
ID Mean
0 198.186
1 200.604
2 176.318
3 300.090
Should generate:
Timepoint 0 1 2 3
0 198.186 200.604 176.318 300.090
And this makes sense: you would want a data file to grow lengthwise as time points are added, not widthwise. However, your script generates this:
Timepoint 0
0 198.186
1 200.604
2 176.318
3 300.090
Which is why I wasn't so concerned about the header: the values would be generated anyway, and if he knew that a file had 3 IDs per time point he could just put them there.
This wouldn't have been so confusing if kman42 hadn't chosen such a symmetrical sample.
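A minimal Java sketch of the header point made above, offered as an illustration rather than as either poster's code. It buffers only the first time point so the header can list that block's sample IDs, then streams every later row, so memory use stays at one time point. The names HeaderSketch, convert, convertString and writeRow are hypothetical, and the sketch assumes every record is an "ID value" pair separated by a space, as in kman42's sample.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;

final class HeaderSketch {
    private static final String TIME_POINT = "Time Point: ";

    /** Streams time-point blocks from in to out, one output row per time point. */
    static void convert(Reader in, Writer out) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        List<String> ids = new ArrayList<>();    // sample IDs seen in the first time point
        List<String> values = new ArrayList<>(); // values of the time point being read
        String timePoint = null;                 // label of the time point being read
        boolean headerWritten = false;
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("ID")) {
                continue; // blank line or the "ID Mean" column header
            }
            if (line.startsWith(TIME_POINT)) {
                if (timePoint != null) {
                    // The previous block is complete: emit it (and the header, once).
                    if (!headerWritten) {
                        writeRow(out, "Timepoint", ids);
                        headerWritten = true;
                    }
                    writeRow(out, timePoint, values);
                }
                timePoint = line.substring(TIME_POINT.length());
                values.clear();
                continue;
            }
            int space = line.indexOf(' ');
            if (timePoint == null || space < 0) {
                continue; // assumption: ignore anything that is not an "ID value" record
            }
            if (!headerWritten) {
                ids.add(line.substring(0, space)); // still inside the first time point
            }
            values.add(line.substring(space + 1).trim());
        }
        if (timePoint != null) { // flush the last block
            if (!headerWritten) {
                writeRow(out, "Timepoint", ids);
            }
            writeRow(out, timePoint, values);
        }
        out.flush();
    }

    /** Convenience wrapper for in-memory input; mainly useful for trying the sketch out. */
    static String convertString(String input) {
        StringWriter out = new StringWriter();
        try {
            convert(new StringReader(input), out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    private static void writeRow(Writer out, String label, List<String> cells) throws IOException {
        out.write(label);
        for (String cell : cells) {
            out.write(' ');
            out.write(cell);
        }
        out.write('\n');
    }

    public static void main(String[] args) {
        // The four-sample time point from the post above.
        System.out.print(convertString(
            "Time Point: 0\nID Mean\n0 198.186\n1 200.604\n2 176.318\n3 300.090\n"));
    }
}
```

Run on the four-sample time point from the post, this prints "Timepoint 0 1 2 3" followed by "0 198.186 200.604 176.318 300.090". Like both scripts in the thread, it takes the header from a single time point and does not check that later time points have the same IDs.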
Addicted to MacNN
Join Date: May 2001
Location: Cupertino, CA
Good find, absmiths. You're right, I wrote that code in a little too much of a hurry.
Given that the headers aren't important, the perl script shrinks to 17 lines:
Code:
#!/usr/bin/perl
foreach $file (@ARGV) {
    $count = 0;
    open INPUT, $file or die "Couldn't open input file: $file -- $!";
    $file =~ s/\.(.*)$/.mat/;
    open OUTPUT, "> $file" or die "Couldn't open output file: $file -- $!";
    print OUTPUT "Timepoint 0 1 2";
    while (<INPUT>) {
        if (/(\d+\.\d+)/) {
            print OUTPUT " $1";
        } elsif (/Time Point: (\d+)/) {
            print OUTPUT "\n$1";
        }
    }
    close INPUT;
    close OUTPUT;
}
Someone more expert than myself at Perl could probably get this smaller. While I also am a big Java fan, I think Perl is a lot more convenient for this kind of task. Java IO is powerful, but it seems heavy-handed for this application. Perl is harder to read, no doubt, and keep in mind that this code is hardly considered "condensed" by Perl standards (scary!). Most of what's hard to read in this script is regular expressions though, which probably should be replaced with string processing functions (index, substr) for better performance here.
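To make the regex versus string-function trade-off concrete, here is a small sketch in Java, the thread's other language (the suggestion above is about Perl's index and substr, which this does not show). The class ParseSketch and both method names are hypothetical, and whether the string-function version is measurably faster is not something the thread tests.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ParseSketch {
    // Same idea as the Perl pattern: digits, a literal dot, more digits.
    private static final Pattern VALUE = Pattern.compile("(\\d+\\.\\d+)");

    /** Regex version: returns the first decimal number on the line, or null if there is none. */
    static String valueByRegex(String line) {
        Matcher m = VALUE.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    /** String-function version: returns everything after the first space, or null if there is none. */
    static String valueByIndex(String line) {
        int space = line.indexOf(' ');
        return space < 0 ? null : line.substring(space + 1).trim();
    }

    public static void main(String[] args) {
        String record = "1 200.604";
        System.out.println(valueByRegex(record)); // 200.604
        System.out.println(valueByIndex(record)); // 200.604

        // The two differ on non-record lines: the regex rejects them,
        // while the index version has to be guarded by a separate check.
        System.out.println(valueByRegex("ID Mean")); // null
        System.out.println(valueByIndex("ID Mean")); // Mean
    }
}
```

The string-function version is less strict, which is why the Java program earlier in the thread filters "ID" and "Time Point: " lines with startsWith before it reaches the substring call.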
Mac Elite
Join Date: Sep 2000
Location: Edmond, OK USA
Originally posted by itai195:
Someone more expert than myself at Perl could probably get this smaller. While I also am a big Java fan, I think Perl is a lot more convenient for this kind of task. Java IO is powerful, but it seems heavy-handed for this application. Perl is harder to read, no doubt, and keep in mind that this code is hardly considered "condensed" by Perl standards (scary!). Most of what's hard to read in this script is regular expressions though, which probably should be replaced with string processing functions (index, substr) for better performance here.
I agree, although I am too lazy to learn a new language for occasional maintenance functions. The Perl script is certainly faster - it finishes before you even realize it has begun.
Mac Elite
Join Date: Sep 2000
Location: Edmond, OK USA
I can't believe kman42 never replied to this. If he didn't care anymore he should at least have posted something saying so.