Saturday, January 31, 2015

[Python] File I/O

1. Binary file: use pylab to read and write files easily

fromfile (pylab) is an intrinsic numpy function for loading binary files from the local disk, e.g. *.bin files.
What it does is read raw machine data into arrays.

(1) read file
Before you read a binary file, you should figure out what format it is in.
I am not an expert on data types (check it out: Numpy Data Types). There are many ways to test the format.
In my case, I directly open the file (in matlab, python, or any fast program) and check the machine values. The first thing to look at is whether the data are strings, integers, or floats (double, single). After that, I can estimate whether they are 8, 16, 32, or 64 bit from the range of the data, or from the file size.
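
If you know how many values to expect (say a 12 x 180 x 360 monthly global grid), comparing the file size against each candidate dtype's itemsize quickly narrows down the format. A minimal sketch, where the file name and grid shape are just placeholders:

import os
import numpy as np

fname = "filename.bin"            # placeholder file name
nval = 12 * 180 * 360             # expected number of values in the file

size = os.path.getsize(fname)     # file size in bytes
for dt in ('uint8', 'int16', 'uint16', 'float32', 'float64'):
    if size == nval * np.dtype(dt).itemsize:
        print("file size is consistent with dtype %s" % dt)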

(A) It becomes rather easy once you know the format. Say it is float32:
> from pylab import *
> file = fromfile("filename.bin", dtype = 'float32')
# the extension doesn't matter; it can be bin, dat, whatever, or even no extension.

(B) You can also use a list comprehension to make the processing simpler and cleaner.
In: file = [fromfile("filename_%s_%s.bin" % ("tas", str(year)), dtype = 'float32').reshape(12,180,360)[3:6,:,:] for year in xrange(1900,2001)]
Out: [array([...], dtype=float32), ..., 101 of them]

The example reads the global temperature data from 1900 to 2000 for the months April to June. The result is a list of arrays, which may not be what you want.
> df = np.hstack(file)
which combines all the arrays in the list into one multi-dimensional array (for 3-D arrays, hstack joins them along axis 1).
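
If you would rather stack the yearly blocks along the month/time axis, or keep a separate year dimension, the following may be closer to what you want (a small sketch, assuming the list built above):

> df = np.concatenate(file, axis=0)   # shape (303, 180, 360): the 3 months of each year stacked in time
> df = np.array(file)                 # shape (101, 3, 180, 360): years kept as a separate dimension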

(C) We can still use a traditional loop to read the data.
data = np.empty((0, 180, 360), dtype='float32')    # start with an empty array to concatenate onto
for var in forcing:
    for year in xrange(styr, edyr+1):
        tmp = fromfile('%s_%s.bin' % (var, str(year)), 'float32').reshape(-1, 180, 360)
        tmp = tmp[::-1]                             # flip along the first (time) axis if needed
        # use concatenate, or append
        data = np.concatenate((data, tmp))
        # append (pass axis=0, otherwise np.append flattens the result)
        # data = np.append(data, tmp, axis=0)


(2) write file
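
A minimal sketch of the reverse direction, using numpy's ndarray.tofile; the array and file name below are just placeholders. tofile writes raw bytes with no header, so you have to remember the dtype and shape yourself:

> from pylab import *
> data = zeros((12, 180, 360), dtype = 'float32')                         # placeholder array
> data.tofile("output.bin")                                               # raw bytes, no header
> back = fromfile("output.bin", dtype = 'float32').reshape(12, 180, 360)  # round trip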


2. NetCDF file: use netcdf4 package
(1) Install
Use anaconda to install the netcdf4 package:
$ conda install netcdf4
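A quick check that the install worked (just printing the version string):
> import netCDF4
> print(netCDF4.__version__)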
(2) read file
> from netCDF4 import Dataset
> file = Dataset("path/filename.nc").variables["variable_name"][:]
This returns a masked array, so the missing values are masked automatically.
So far, I use the masked array's ".data" attribute to get the original data, and then set the original missing value to np.nan.
> file = Dataset("path/filename.nc").variables["variable_name"][:].data
> file[file==-9999] = np.nan
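
An alternative that avoids hard-coding the fill value is to let numpy's masked-array tools do the replacement (the path and variable name are placeholders):

> import numpy as np
> from netCDF4 import Dataset
> var = Dataset("path/filename.nc").variables["variable_name"][:]   # masked array
> file = np.ma.filled(var.astype('float64'), np.nan)                # masked entries become np.nan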


3. String TXT file: 
3.1 readlines
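A minimal sketch of the plain-Python route, reading all lines and splitting them by hand; the file name and delimiter are placeholders:

> f = open("filename.txt")
> lines = f.readlines()                                    # list of strings, one per line
> f.close()
> values = [line.strip().split(',') for line in lines]     # split each line into fields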

3.2 pandas read_csv
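A sketch of the pandas route, which handles the header, delimiter, and missing values in one call; the column name, file name, and -9999 fill value are placeholders:

> import pandas as pd
> df = pd.read_csv("filename.txt", sep = ',', header = 0, na_values = [-9999])
> df["variable_name"].values                               # pull one column out as a numpy array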


Thursday, January 29, 2015

Big Questions in Grain Production

Starting today, I am launching this series on food and grain security.
A few big questions that interest me:
(1) Where the waste happens in today's grain production
(2) How much energy today's grain production actually consumes, and whether the claims are true
(3) How much benefit and harm genetically modified food really brings to developing countries