Stata - reshape 數據格式變換
目標:
某些資料庫下載的數據是如圖所示的那種數據格式,而往往我們需要的是一個標準的面板類型的數據集,兩種數據類型如下所示:
//原始數據
|------------|----------|---------|---------|
| variable | id |value1990|value1991|
|------------|----------|---------|---------|
| var1 | 1 | 1 | 2 |
| var1 | 2 | 3 | 4 |
| var2 | 1 | 5 | 6 |
| var2 | 2 | 7 | 8 |
|------------|----------|---------|---------|
//面板數據
|------------|----------|---------|---------|
| id | year | var1 | var2 |
|------------|----------|---------|---------|
| 1 | 1990 | 1 | 5 |
| 1 | 1991 | 2 | 6 |
| 2 | 1990 | 3 | 7 |
| 2 | 1991 | 4 | 8 |
|------------|----------|---------|---------|
思路:
第一份數據實際上既是一個long(variable)數據集也是一個wide(value)數據集,所以先將這個數據集通過reshape long value
處理成一個標準的long數據集,然後通過reshape wide variable
整理成面板。
實例:
step 1.將原始數據轉換為長類型的數據
. reshape long value,i(variable id) j(year)
. list
/*
+------------------------------+
| variable id year value |
|------------------------------|
1. | var1 1 1990 1 |
2. | var1 1 1991 2 |
3. | var1 2 1990 3 |
4. | var1 2 1991 4 |
5. | var2 1 1990 5 |
|------------------------------|
6. | var2 1 1991 6 |
7. | var2 2 1990 7 |
8. | var2 2 1991 8 |
+------------------------------+
*/
// 將這個數據集還原可用 reshape wide value ,i(variable id) j(year)
step 2.轉換為面板
step 2.1 利用基本命令reshape
. reshape wide value,i(id year) j(variable) string
/*
+---------------------------------+
| id year valuev~1 valuev~2 |
|---------------------------------|
1. | 1 1990 1 5 |
2. | 1 1991 2 6 |
3. | 2 1990 3 7 |
4. | 2 1991 4 8 |
+---------------------------------+
*/
. rename value* *
/*
+-------------------------+
| id year var1 var2 |
|-------------------------|
1. | 1 1990 1 5 |
2. | 1 1991 2 6 |
3. | 2 1990 3 7 |
4. | 2 1991 4 8 |
+-------------------------+
*/
step 2.2 利用外部命令tidy -spread
. ssc install tidy
. spread variable value
. list
/*
+-------------------------+
| id year var1 var2 |
|-------------------------|
1. | 1 1990 1 5 |
2. | 1 1991 2 6 |
3. | 2 1990 3 7 |
4. | 2 1991 4 8 |
+-------------------------+
*/
注意事項
1常見問題:
- 在stata中變數名不能以數字開頭且變數名中不能含有
"" "-" "/"
等特殊符號,stata15以前的版本不支持中文變數名。 - 那如果我們的原始數據集的變數就是含有這些字元那麼會有什麼後果呢?
2 直接操作
- 原始數據集
//原始數據
|------------|----------|---------|---------|
| variable | id |value1990|value1991|
|------------|----------|---------|---------|
| 1st | 1 | 1 | 2 |
| 1st | 2 | 3 | 4 |
| 1st/gdp | 1 | 5 | 6 |
| 1st/gdp | 2 | 7 | 8 |
|------------|----------|---------|---------|
. reshape long value,i(variable id) j(year)
. list
/*
+------------------------------+
| variable id year value |
|------------------------------|
1. | 1st 1 1990 1 |
2. | 1st 1 1991 2 |
3. | 1st 2 1990 3 |
4. | 1st 2 1991 4 |
5. | 1st/gdp 1 1990 5 |
|------------------------------|
6. | 1st/gdp 1 1991 6 |
7. | 1st/gdp 2 1990 7 |
8. | 1st/gdp 2 1991 8 |
+------------------------------+
*/
- 直接操作
. spread variable value
/*
Some observations for variable dont have valid variable names: `"1st/gdp"
*/
. reshape wide value ,i(id year) j(variable) string
/*
(note: j = 1st 1st/gdp)
value1st/gdp invalid variable name
*/
結果不管用哪個都會出錯,從錯誤提示也可以看得出來主要是因為我們的變數的文本含有特殊字元或開頭為數字,那麼該怎麼辦呢?
3 思路:
我們退而求其次,我們要的是一個面板並知道變數的信息,那麼給變數加個變數標籤的話,即使變數名是value1、 value2
也不影響我們。
4 實例:
4.1 tidy - spread
- spread :我們將變數
encode
並生成一個新變數new_var
,那麼這個變數new_var
就是1,2,3這種數值型變數,然後我們再轉換成面板,但是這時候我們給我們的變數加了個變數標籤。
. list in 1/5
/*
+------------------------------+
| variable id year value |
|------------------------------|
1. | 1st 1 1990 1 |
2. | 1st 1 1991 2 |
3. | 1st 2 1990 3 |
4. | 1st 2 1991 4 |
5. | 1st/gdp 1 1990 5 |
|------------------------------|
*/
. encode variable,gen(new_variable)
. list in 1/5,nolabel
/*
+-----------------------------------------+
| variable id year value new_va~e |
|-----------------------------------------|
1. | 1st 1 1990 1 1 |
2. | 1st 1 1991 2 1 |
3. | 1st 2 1990 3 1 |
4. | 1st 2 1991 4 1 |
5. | 1st/gdp 1 1990 5 2 |
+-----------------------------------------+
*/
. spread new_variable value,label(variable)
. des
/*
Contains data
obs: 4
vars: 4
size: 56
-------------------------------------------------------------
storage display value
variable name type format label variable label
-------------------------------------------------------------
id float %9.0g
year int %9.0g
value1 float %9.0g 1st
value2 float %9.0g 1st/gdp
--------------------------------------------------------------
Sorted by: id year
*/
. list
/*
+-----------------------------+
| id year value1 value2 |
|-----------------------------|
1. | 1 1990 1 5 |
2. | 1 1991 2 6 |
3. | 2 1990 3 7 |
4. | 2 1991 4 8 |
+-----------------------------+
*/
4.2 reshape
當用stata的自帶命令reshape來處理這種的時候,可能的確有點麻煩,我這裡從原始數據開始處理可能還相對方便些……
reshape
: 我們生成一個新變數value1000
,它和variable
是一樣的,但是reshape long
的時候,我們的變數名就被放進value
變數裡面了,並且時間year
會有1000年,然後通過reshape wide
之後,我們的變數名就在第一行並且與其變數相對應,通過labone ,nrow()
我們就可以將我們的變數添加標籤,之後刪除year為1000的樣本- 【不用外部命令還是有點麻煩2333】
//原始數據
|------------|----------|---------|---------|
| variable | id |value1990|value1991|
|------------|----------|---------|---------|
| 1st | 1 | 1 | 2 |
| 1st | 2 | 3 | 4 |
| 1st/gdp | 1 | 5 | 6 |
| 1st/gdp | 2 | 7 | 8 |
|------------|----------|---------|---------|
. gen value1000=variable
. tostring value*,replace
/*
value1990 was byte now str1
value1991 was byte now str1
value1000 already string; no replace
*/
. list
/*
+------------------------------------------------+
| variable id val~1990 val~1991 val~1000 |
|------------------------------------------------|
1. | 1st 1 1 2 1st |
2. | 1st 2 3 4 1st |
3. | 1st/gdp 1 5 6 1st/gdp |
4. | 1st/gdp 2 7 8 1st/gdp |
+------------------------------------------------+
*/
. reshape long value,i(id variable ) j(year)
. list
/*
+--------------------------------+
| id variable year value |
|--------------------------------|
1. | 1 1st 1000 1st |
2. | 1 1st 1990 1 |
3. | 1 1st 1991 2 |
4. | 1 1st/gdp 1000 1st/gdp |
5. | 1 1st/gdp 1990 5 |
|--------------------------------|
6. | 1 1st/gdp 1991 6 |
7. | 2 1st 1000 1st |
8. | 2 1st 1990 3 |
9. | 2 1st 1991 4 |
10. | 2 1st/gdp 1000 1st/gdp |
|--------------------------------|
11. | 2 1st/gdp 1990 7 |
12. | 2 1st/gdp 1991 8 |
+--------------------------------+
*/
. encode variable ,gen(e_v)
. drop variable
. reshape wide value,i(id year) j(e_v)
. list
/*
+------------------------------+
| id year value1 value2 |
|------------------------------|
1. | 1 1000 1st 1st/gdp |
2. | 1 1990 1 5 |
3. | 1 1991 2 6 |
4. | 2 1000 1st 1st/gdp |
5. | 2 1990 3 7 |
|------------------------------|
6. | 2 1991 4 8 |
+------------------------------+
*/
. labone value1 value2,nrow(1)
. drop if year ==1000
. destring value*,replace
/*
value1: all characters numeric; replaced as byte
value2: all characters numeric; replaced as byte
*/
. describe
/*
Contains data
obs: 4
vars: 4
size: 32
-------------------------------------------------------------
storage display value
variable name type format label variable label
-------------------------------------------------------------
id float %9.0g
year int %9.0g
value1 byte %10.0g 1st
value2 byte %10.0g 1st/gdp
-------------------------------------------------------------
Sorted by: id year
Note: Dataset has changed since last saved.
*/
. list
/*
+-----------------------------+
| id year value1 value2 |
|-----------------------------|
1. | 1 1990 1 5 |
2. | 1 1991 2 6 |
3. | 2 1990 3 7 |
4. | 2 1991 4 8 |
+-----------------------------+
*/
推薦閱讀: