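A while ago I was really enjoying *Mastering Regular Expressions*, getting up early every morning to read for an hour or so.
What I lacked was a real use case: I only used regexes in Vim each day, which felt rather lonely.
A few days ago my roommate needed to process a batch of lyrics-like files (English read-aloud transcripts?) and clean out part of their content. Doing it by hand was tedious, so I was asked to try handling it with code. Ah, isn't this the perfect place to use regular expressions?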
#Text to process
The text looks like this:

    :::text
    TPO_2102_L01
    0 2.16 Listen to part of lecture in the history of science class.
    2.16 11.2 OK, we've been talking about how throughout the history. 好的,我们已经讨论过历史的发展.
    11.2 14.8 It was often difficult for people to give up ideas which have long been taken for granted as scientific truths, even if those ideas were false. 人们放弃一种已经被认定是科学的事实的念头非常困难,即使那个念头是不对的.
    14.8 24.32 In astronomy, for example, the distinction between the solar system and the universe wasn't clear until modern times. 拿天文学来举个例子,人们直到近代才将太阳系和宇宙的概念弄清楚. (第一题what's the purpose of the lecture?)
    24.32 32.04 The ancient Greeks believed that what we called the solar system was in fact the entire universe, and that the universe was geocentric. 古希腊认为宇宙就是我们现在所说的太阳系,而整个宇宙中,地球就是宇宙的中心.
    32.04 43.0 Geocentric means Earth-centric, so the geocentric view holds that the Sun, the planets, and the stars, all revolve around the Earth which is stationary.
    43.0 54.72 Of course, we now know that the planets including Earth revolve around the Sun, and that solar system is only a tiny part of the universe. 当然了,我们现在是知道所有的星球都围绕着太阳转,地球也是星球,而且太阳系只是宇宙的一小部分.
    ......

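#Requirements
A directory contains a number of txt files. Remove the Chinese from each file, together with any digits, English, and punctuation mixed into the Chinese.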
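#From the regex point of view
Seen through a regex, the solution is roughly this: for each file, delete everything from the first Chinese character on a line through the end of that line (the line is the basic unit).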
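
As a quick check of that idea, here is a minimal Python sketch applied to a single line. The helper name `strip_from_first_zh` is made up for illustration, and the sketch assumes the range `[\u4e00-\u9fa5]` is a good enough stand-in for "a Chinese character".

```python
import re

# Assumption: U+4E00..U+9FA5 (common CJK ideographs) stands in for "Chinese character".
ZH_TO_EOL = re.compile(r"[\u4e00-\u9fa5].*")


def strip_from_first_zh(text):
    """Drop everything from the first Chinese character to the end of each line."""
    # "." does not match a newline, so every match stops at its own line end.
    return ZH_TO_EOL.sub("", text)


if __name__ == "__main__":
    sample = "2.16 11.2 OK, we've been talking. 好的,我们已经讨论过."
    print(repr(strip_from_first_zh(sample)))
```

The space that separated the English from the Chinese is left behind at the end of the line; adding `.rstrip()` per line would be one way to remove it if that matters.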
#Solution (Python version)
Straight to the code, ta-da:

    #!/usr/bin/env python3
    # encoding: utf-8
    import os
    import re
    import sys


    def filter_filetype(path, filetype):
        """Return the names of regular files in path that end with filetype."""
        filetype = filetype.lower()
        filenames = [filename for filename in os.listdir(path)
                     if os.path.isfile(os.path.join(path, filename))]  # regular files only
        return [filename for filename in filenames
                if filename.lower().endswith(filetype)]


    def clean_zh(filename):
        # Match from the first Chinese character to the end of the line.
        # "." does not match a newline, so no extra flag is needed.
        zh_to_endline = re.compile(r"[\u4e00-\u9fa5].*")
        # splitext is safe even when the directory part contains dots
        root, ext = os.path.splitext(filename)
        output_filename = root + "_output" + ext
        # Assumes the files are UTF-8 encoded
        with open(filename, encoding="utf-8") as input_file:
            # sub() leaves text without a match unchanged
            clean_content = zh_to_endline.sub("", input_file.read())
        with open(output_filename, "w", encoding="utf-8") as output_file:
            output_file.write(clean_content)
        # To overwrite the input instead: open it with "r+", then
        # seek(0), write(clean_content) and truncate()


    if __name__ == "__main__":
        path = sys.argv[1]      # command-line argument: path
        filetype = sys.argv[2]  # command-line argument: filetype
        for filename in filter_filetype(path, filetype):
            clean_zh(os.path.join(path, filename))

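Save the code above as clean_zh.py.
Usage: python3 clean_zh.py PATH txt
Run it with python3; python2's Unicode support is poor.
The functions are far from pure; I will refactor them later.
For searching and replacing in Python strings, see here.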
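
One possible direction for that refactoring, offered only as a sketch: keep the text transformation and the output-name computation as pure functions and confine file access to a thin wrapper. The names `clean_text`, `output_name`, and `clean_file` are hypothetical, and UTF-8 input is assumed.

```python
import os
import re

ZH_TO_EOL = re.compile(r"[\u4e00-\u9fa5].*")


def clean_text(text):
    """Pure: the same input always gives the same output, no file access."""
    return ZH_TO_EOL.sub("", text)


def output_name(filename):
    """Pure: insert "_output" between the file name and its extension."""
    root, ext = os.path.splitext(filename)
    return root + "_output" + ext


def clean_file(filename):
    """Impure wrapper: all reading and writing happens here."""
    with open(filename, encoding="utf-8") as src:  # assumes UTF-8 files
        cleaned = clean_text(src.read())
    with open(output_name(filename), "w", encoding="utf-8") as dst:
        dst.write(cleaned)
```

With this split, `clean_text` and `output_name` can be checked without touching the disk, and only `clean_file` needs real files.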
#todo
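I realized this makes a great Code Kata (improving agile development skills through hands-on programming exercises),
so I decided to write versions in other languages as practice.
Topics involved:
- Regular expressions
- Strings (Unicode)
- Files and directories
- Functions (parameters)
- Command-line arguments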
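
On the Unicode point, it may help to see what the range `[\u4e00-\u9fa5]` does and does not cover. The sketch below uses a made-up helper, `in_zh_range`. It illustrates that full-width punctuation such as `,` (U+FF0C) and `。` (U+3002) lies outside the range, so a Chinese passage that opens with such punctuation would keep it until the first ideograph.

```python
import re

# The same character class used by the cleaning regex, without the ".*" tail.
ZH = re.compile(r"[\u4e00-\u9fa5]")


def in_zh_range(ch):
    """True if the single character ch falls inside U+4E00..U+9FA5."""
    return ZH.fullmatch(ch) is not None


if __name__ == "__main__":
    for ch in ["好", ",", "。", "(", "A", "1"]:
        # Print the character, its code point, and whether the class matches it.
        print(repr(ch), hex(ord(ch)), in_zh_range(ch))
```

In the sample files the Chinese part of each line starts with an ideograph, so the punctuation that follows is removed along with the rest of the line.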
###Ruby

    #!/usr/bin/env ruby
    # encoding: UTF-8

    def filter_filetype(path, filetype)
      filetype = filetype.downcase
      # Dir.entries also lists ".", ".." and subdirectories; keep regular files only
      Dir.entries(path).select do |filename|
        File.file?(File.join(path, filename)) && filename.downcase.end_with?(filetype)
      end
    end

    def clean_zh(filename)
      # Match from the first Chinese character to the end of the line
      zh_to_endline = /[\u4e00-\u9fa5].*/
      # extname is safe even when the directory part contains dots
      ext = File.extname(filename)
      output_filename = filename.chomp(ext) + "_output" + ext
      # Assumes the files are UTF-8 encoded
      contents = File.read(filename, encoding: "UTF-8")
      # gsub replaces every match; sub would replace only the first one
      clean_content = contents.gsub(zh_to_endline, "")
      File.write(output_filename, clean_content)
    end

    if __FILE__ == $0
      path = ARGV[0]
      filetype = ARGV[1]
      filter_filetype(path, filetype).each do |filename|
        clean_zh(File.join(path, filename))
      end
    end

Usage: ruby clean_zh.rb PATH txt
Ruby feels even more pleasant to write than Python~
###Golang

    package main

    import (
        "fmt"
        "os"
        "path/filepath"
        "regexp"
        "strings"
    )

    // Match from the first Chinese character to the end of the line.
    // "." does not match a newline by default, so each match stays on one line.
    var zhToEndline = regexp.MustCompile("[\u4e00-\u9fa5].*")

    // cleanZh writes a cleaned copy of filename next to it.
    // os.ReadFile and os.WriteFile need Go 1.16 or later.
    func cleanZh(filename string) error {
        contents, err := os.ReadFile(filename)
        if err != nil {
            return err
        }
        cleanContent := zhToEndline.ReplaceAllString(string(contents), "")
        // filepath.Ext is safe even when the directory part contains dots
        ext := filepath.Ext(filename)
        outputFilename := strings.TrimSuffix(filename, ext) + "_output" + ext
        fmt.Println(outputFilename)
        return os.WriteFile(outputFilename, []byte(cleanContent), 0644)
    }

    func main() {
        if len(os.Args) < 3 {
            fmt.Fprintln(os.Stderr, "usage: clean_zh PATH FILETYPE")
            os.Exit(1)
        }
        dir, filetype := os.Args[1], os.Args[2]
        // Glob returns every entry directly under dir
        files, err := filepath.Glob(filepath.Join(dir, "*"))
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
        // "_" discards the index; filename is the value
        for _, filename := range files {
            if filepath.Ext(filename) == "."+filetype {
                fmt.Println(filename)
                if err := cleanZh(filename); err != nil {
                    fmt.Fprintln(os.Stderr, err)
                }
            }
        }
    }

Usage: go run clean_zh.go PATH txt
Go has so little syntactic sugar! I wanted to use slices to collect the filtered filenames.
###Nodejs

    // use =G to format js with javascript.vim
    var fs = require('fs');
    var path = require('path');

    var dir = process.argv[2];
    var filetype = process.argv[3];

    // A closure over filetype replaces the global the filter used to read
    function isFiletype(filetype) {
        return function (filename) {
            return path.extname(filename) === '.' + filetype;
        };
    }

    function cleanZh(filename) {
        fs.readFile(filename, 'utf-8', function (err, data) {
            if (err) {
                console.error(err);
                return;
            }
            // extname is safe even when the directory part contains dots
            var ext = path.extname(filename);
            var outputFilename = filename.slice(0, filename.length - ext.length) + '_output' + ext;
            var reg = /[\u4e00-\u9fa5].*/g;
            var res = data.replace(reg, '');
            fs.writeFile(outputFilename, res, function (err) {
                if (err) console.error(err);
            });
        });
    }

    // readdirSync blocks until the directory listing is available
    var filenames = fs.readdirSync(dir);
    filenames.filter(isFiletype(filetype)).forEach(function (filename) {
        // Join with dir so the script works from any working directory
        cleanZh(path.join(dir, filename));
    });

Usage: node clean_zh.js PATH txt
With JS, watch out for callbacks. To control ordering outside of callbacks, use the synchronous functions.
###Coffeescript
waiting
###Bash

    #!/bin/bash
    path=$1        # no spaces around "=" in an assignment
    filetype=$2
    # cd makes the given directory the current working directory
    cd "$path" || exit 1
    for filename in *."$filetype"
    do
        # Skip when the glob matched nothing or the entry is not a regular file
        [[ -f "$filename" ]] || continue    # mind the spaces inside [[ ]]
        # Text processing is handed to Python; the file name is passed as an
        # argument, so it needs no quoting inside the Python code
        python3 -c '
    import re, sys
    zh_to_endline = re.compile(r"[\u4e00-\u9fa5].*")
    with open(sys.argv[1], encoding="utf-8") as f:
        sys.stdout.write(zh_to_endline.sub("", f.read()))
    ' "$filename" > "output_$filename"
    done

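Usage: bash clean_zh PATH txt
I originally wanted to use sed, but it doesn't seem to support Unicode at all?? So I cheated and handed the text processing to Python.
Bash really is quick and dirty.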
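
Since the Bash version delegates the text processing to Python anyway, one option is to move that part into a small standalone filter and let the shell handle only the file loop. This is a sketch rather than the script used above: the file name `zh_filter.py` and the function `filter_stream` are hypothetical, and it assumes UTF-8 input and a terminal or redirect that accepts the output.

```python
import re
import sys

ZH_TO_EOL = re.compile(r"[\u4e00-\u9fa5].*")


def filter_stream(src, dst):
    """Copy src to dst line by line, cutting each line at its first Chinese character."""
    for line in src:
        # "." does not match "\n", so the line break itself is preserved.
        dst.write(ZH_TO_EOL.sub("", line))


if __name__ == "__main__":
    # Each command-line argument is treated as a file to clean; output goes to stdout.
    for name in sys.argv[1:]:
        with open(name, encoding="utf-8") as src:
            filter_stream(src, sys.stdout)
```

From the shell loop this could be called as `python3 zh_filter.py "$filename" > "output_$filename"`.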
###Scheme
###Haskell
###Prolog
###Lua
###Clojure
###Scala
###Perl
###Racket
###C#
###C
###Java
###PHP