Python Input and Output

Basic Input and Output

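Reading input from the keyboard and printing output:

# input() pauses the program and waits for keyboard input until Enter is pressed; its argument is the prompt text.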
name = input('your name:')
gender = input('you are a boy?(y/n)')

# The value returned by input() is always a string (str).
###### Input ######
your name:Jack
you are a boy?(y/n)y

welcome_str = 'Welcome to the matrix {prefix} {name}.'
welcome_dic = {
    'prefix': 'Mr.' if gender == 'y' else 'Mrs.',
    'name': name
}

print('authorizing...')
print(welcome_str.format(**welcome_dic))

########## Output ##########
authorizing...
Welcome to the matrix Mr. Jack.

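input() always returns a string. print(), on the other hand, accepts strings, numbers, dictionaries, lists, and even instances of user-defined classes.

To convert the input to another type, use functions such as int() or float():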


a = input()
1
b = input()
2

print('a + b = {}'.format(a + b))
########## Output ##########
a + b = 12  # plain string concatenation
print('type of a is {}, type of b is {}'.format(type(a), type(b)))
########## Output ##########
type of a is <class 'str'>, type of b is <class 'str'>
print('a + b = {}'.format(int(a) + int(b)))
########## Output ##########
a + b = 3  # the correct result after conversion

When converting input like this in production code, remember to wrap the conversion in try/except.
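A minimal sketch of that advice. The helper name to_int and its default parameter are hypothetical, introduced only for illustration; they are not part of the original example.

```python
def to_int(text, default=None):
    """Convert text to int; return default if the conversion fails.

    to_int is a hypothetical helper used only for illustration.
    """
    try:
        return int(text)
    except (TypeError, ValueError):
        # int() raises ValueError for strings such as 'abc' or '1.5',
        # and TypeError for unsupported types such as None.
        return default


print(to_int('42'))
print(to_int('abc', default=0))
```

Returning a fallback value is only one option; in an interactive program it may be friendlier to ask the user for the value again.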

Python places no upper limit on the int type (by contrast, a 32-bit int in C++ tops out at 2147483647, and exceeding it overflows), but the float type still has limited precision.
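A short sketch of both halves of that statement. The variable names are chosen here for illustration.

```python
import sys

# Integers grow as needed: 2 ** 100 is far beyond a 32-bit int,
# yet Python keeps every digit.
big = 2 ** 100
print(big)

# Floats are binary approximations, so some decimal sums are inexact.
sum_is_exact = (0.1 + 0.2 == 0.3)
print(sum_is_exact)

# Above 2 ** 53 a float can no longer represent every integer,
# so two different integers may map to the same float.
collide = float(2 ** 53) == float(2 ** 53 + 1)
print(collide)

# The largest finite float on this platform.
print(sys.float_info.max)
```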

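File Input and Output

In day-to-day development, most I/O comes from files, the network, messages from other processes, and so on.

Suppose there is a text file in.txt with the following content: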


I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character. I have a dream today.

I have a dream that one day down in Alabama, with its vicious racists, . . . one day right there in Alabama little black boys and black girls will be able to join hands with little white boys and white girls as sisters and brothers. I have a dream today.

I have a dream that one day every valley shall be exalted, every hill and mountain shall be made low, the rough places will be made plain, and the crooked places will be made straight, and the glory of the Lord shall be revealed, and all flesh shall see it together.

This is our hope. . . With this faith we will be able to hew out of the mountain of despair a stone of hope. With this faith we will be able to transform the jangling discords of our nation into a beautiful symphony of brotherhood. With this faith we will be able to work together, to pray together, to struggle together, to go to jail together, to stand up for freedom together, knowing that we will be free one day. . . .

And when this happens, and when we allow freedom ring, when we let it ring from every village and every hamlet, from every state and every city, we will be able to speed up that day when all of God's children, black men and white men, Jews and Gentiles, Protestants and Catholics, will be able to join hands and sing in the words of the old Negro spiritual: "Free at last! Free at last! Thank God Almighty, we are free at last!"

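A typical NLP task on this file takes four steps:

Read the file;

Remove all punctuation and newlines, and convert uppercase letters to lowercase;

Merge identical words, count how often each word occurs, and sort by frequency in descending order;

Write the result to the file out.txt, one entry per line.

The code is as follows: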


import re

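# You don't need to worry much about this function
def parse(text):
    # Use a regular expression to replace punctuation and newlines with spaces
    text = re.sub(r'[^\w ]', ' ', text)

    # Convert to lowercase
    text = text.lower()

    # Build the list of all words
    word_list = text.split(' ')

    # Drop empty strings
    word_list = filter(None, word_list)

    # Build a dictionary mapping each word to its count
    word_cnt = {}
    for word in word_list:
        if word not in word_cnt:
            word_cnt[word] = 0
        word_cnt[word] += 1

    # Sort by frequency, highest first
    # sorted_word_cnt is a list of (word, count) tuples.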
    sorted_word_cnt = sorted(word_cnt.items(), key=lambda kv: kv[1], reverse=True)

    return sorted_word_cnt

with open('in.txt', 'r') as fin:
    text = fin.read()

word_and_freq = parse(text)

with open('out.txt', 'w') as fout:
    for word, freq in word_and_freq:
        fout.write('{} {}\n'.format(word, freq))

########## Output (long middle section omitted) ##########

and 15
be 13
will 11
to 11
the 10
of 10
a 8
we 8
day 6

...

old 1
negro 1
spiritual 1
thank 1
god 1
almighty 1
are 1

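When the code above accesses the files:

open() first returns a file object (a handle to the file). Its first argument is the file path (relative or absolute). Its second argument is the mode: 'r' for reading, 'w' for writing (truncating any existing content), 'r+' for both reading and writing, and 'a' for append mode, in which anything written goes to the end of the file.
With the file object in hand, read() reads the entire file. text = fin.read() loads the whole content into memory and assigns it to the variable text:
- The advantage is convenience: the text can be passed straight to parse for analysis;
- The drawback is that a very large file read in one go may exhaust memory.
In that case, pass a size argument to read() to cap how much is read at a time, or use readline() to read one line per call. Reading line by line is common in data cleaning for data mining and keeps small programs light. When lines are independent of each other, it also reduces memory pressure.
write() writes the string passed as its argument to the file.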
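The two alternatives to a full read() can be sketched as follows. The function names read_in_chunks and first_lines, and the default chunk size, are assumptions made for this illustration.

```python
def read_in_chunks(path, size=1024):
    """Yield successive pieces of at most `size` characters from a text file."""
    with open(path, 'r', encoding='utf-8') as fin:
        while True:
            chunk = fin.read(size)
            if not chunk:
                # read() returns an empty string at end of file
                break
            yield chunk


def first_lines(path, n):
    """Return up to the first n lines, reading one line per readline() call."""
    lines = []
    with open(path, 'r', encoding='utf-8') as fin:
        for _ in range(n):
            line = fin.readline()
            if not line:
                # readline() also returns an empty string at end of file
                break
            lines.append(line.rstrip('\n'))
    return lines
```

Either way, only one chunk or one line is held in memory at a time, regardless of how large the file is.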

The with statement ensures that the close() matching open() is called automatically once the block finishes.
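To make the comparison concrete, here is a small sketch that writes a file with an explicit close() and then reads it back using with. The file name demo.txt and the temporary directory are assumptions for this example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

# Manual version: close() must be called explicitly, and try/finally
# is needed so it still runs if write() raises.
fout = open(path, 'w', encoding='utf-8')
try:
    fout.write('hello\n')
finally:
    fout.close()

# with version: the file is closed when the block exits,
# whether it exits normally or through an exception.
with open(path, 'r', encoding='utf-8') as fin:
    content = fin.read()

print(fin.closed)
print(content)
```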

All I/O should be accompanied by error handling.
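One way this might look for a file read, as a sketch. The helper name read_text and the choice to return None on failure are assumptions made here, not something the original code prescribes.

```python
def read_text(path):
    """Return the file's text, or None if it cannot be read.

    read_text is a hypothetical helper used only for illustration.
    """
    try:
        with open(path, 'r', encoding='utf-8') as fin:
            return fin.read()
    except FileNotFoundError:
        print('file not found: {}'.format(path))
    except PermissionError:
        print('no permission to read: {}'.format(path))
    except UnicodeDecodeError:
        print('file is not valid UTF-8: {}'.format(path))
    except OSError as err:
        # Any other operating-system level I/O failure
        print('I/O error on {}: {}'.format(path, err))
    return None
```

The specific exceptions are listed before the general OSError because FileNotFoundError and PermissionError are subclasses of it.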

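JSON Serialization

JSON (JavaScript Object Notation) is a lightweight data-interchange format. Its design represents everything as specially formatted strings, which makes information easy to send over the internet and also easy for people to read (compared with binary protocols).

Imagine you want to buy a certain amount of stock from an exchange. You need to submit a series of parameters: the stock symbol, the side (buy / sell), the order type (market / limit), the price (for a limit order), the quantity, and so on. This data mixes strings, integers, floats, and even booleans, and lumping them all together makes it awkward for the exchange to unpack.

JSON handles exactly this scenario. You can think of it simply as two black boxes:

The first takes assorted information, such as a Python dictionary, and outputs a string;
The second takes that string and outputs a Python dictionary containing the original information.

For example: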


import json

params = {
    'symbol': '123456',
    'type': 'limit',
    'price': 123.4,
    'amount': 23
}

# json.dumps() takes Python basic data types and serializes them into a string;
params_str = json.dumps(params)

print('after json serialization')
print('type of params_str = {}, params_str = {}'.format(type(params_str), params_str))

# json.loads() takes a valid JSON string and deserializes it back into Python basic data types.
original_params = json.loads(params_str)

print('after json deserialization')
print('type of original_params = {}, original_params = {}'.format(type(original_params), original_params))

########## Output ##########

after json serialization
type of params_str = <class 'str'>, params_str = {"symbol": "123456", "type": "limit", "price": 123.4, "amount": 23}
after json deserialization
type of original_params = <class 'dict'>, original_params = {'symbol': '123456', 'type': 'limit', 'price': 123.4, 'amount': 23}

Here:

json.dumps() takes Python basic data types and serializes them into a string;
json.loads() takes a valid JSON string and deserializes it into Python basic data types.
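"Basic data types" is worth taking literally. The sketch below, with made-up sample data, shows that a round trip through JSON does not always return exactly what went in, and that some types are rejected outright.

```python
import json

# Sample data invented for this illustration
data = {'ids': (1, 2, 3), 1: 'int key', 'ok': True, 'nothing': None}

restored = json.loads(json.dumps(data))
# Tuples come back as lists, and non-string keys come back as strings.
print(restored)

try:
    json.dumps({'tags': {'a', 'b'}})
except TypeError as err:
    # Sets have no JSON equivalent, so json.dumps() raises TypeError.
    print('cannot serialize:', err)
```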

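To write a JSON string to a file or read one back, you can use open() with read()/write(), moving the string through memory first and then encoding / decoding it, but that is somewhat tedious. json.dump() and json.load() work with the file object directly: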


import json

params = {
    'symbol': '123456',
    'type': 'limit',
    'price': 123.4,
    'amount': 23
}

with open('params.json', 'w') as fout:
    json.dump(params, fout)  # json.dump() writes to fout and returns None

with open('params.json', 'r') as fin:
    original_params = json.load(fin)

print('after json deserialization')
print('type of original_params = {}, original_params = {}'.format(type(original_params), original_params))

########## Output ##########

after json deserialization
type of original_params = <class 'dict'>, original_params = {'symbol': '123456', 'type': 'limit', 'price': 123.4, 'amount': 23}

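When developing a third-party application, you can use JSON to write a user's personal settings to a file so they are loaded automatically the next time the program starts. This is a mature and widely used practice.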
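A minimal sketch of such a settings file. The setting names in DEFAULT_CONFIG and the helper names load_config and save_config are hypothetical, chosen only for this example.

```python
import json

# Hypothetical default settings for an imaginary application
DEFAULT_CONFIG = {'theme': 'light', 'font_size': 12}


def load_config(path):
    """Load settings from path, falling back to defaults on first run or bad data."""
    try:
        with open(path, 'r', encoding='utf-8') as fin:
            loaded = json.load(fin)
    except (FileNotFoundError, json.JSONDecodeError):
        return dict(DEFAULT_CONFIG)
    if not isinstance(loaded, dict):
        # Valid JSON, but not the object we expect
        return dict(DEFAULT_CONFIG)
    # Merge so that keys missing from an older file still get a default.
    return {**DEFAULT_CONFIG, **loaded}


def save_config(path, config):
    """Write settings to path as indented, human-readable JSON."""
    with open(path, 'w', encoding='utf-8') as fout:
        json.dump(config, fout, indent=2)
```

A program would call load_config() at startup and save_config() whenever the user changes a setting.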

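Summary

Handle I/O with care: always do thorough error handling and code attentively to avoid bugs;
When coding, keep a solid estimate of memory and disk usage, so the cause is easier to find when something goes wrong;
JSON serialization is a very handy tool; practice it on real tasks;
Keep code as concise and clear as possible; even as a beginner, aim high.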

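Exercises

Question 1: Can you implement the word count from the NLP example yourself? This time, though, in.txt may be extremely large (so you cannot read it into memory all at once), while out.txt will not be large (meaning many words repeat).
Hint: you may need to read a fixed-length string each time, process it, and then read the next one. But if you split purely by length, you may cut a word in two, so handle this boundary case carefully.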

  """
  Solution 1:
  """
  from collections import defaultdict
  import re

  d = defaultdict(int)

  # Iterating over the file object reads one line at a time
  with open("in.txt", mode="r", encoding="utf-8") as f:
      for line in f:
          for word in filter(None, re.split(r"\s", line)):
              d[word] += 1


  print(d)

  """
  Solution 2:
  """
  import re
  def parse(text, word_cnt):
      # Convert to lowercase
      text = text.lower()
      # Build the list of all words
      word_list = re.findall(r'\w+', text)
      # Update the word-count dictionary
      for word in word_list:
          word_cnt[word] = word_cnt.get(word,0) + 1
      return word_cnt

  # Initialize the dictionary
  word_cnt = dict()
  with open('in.txt', 'r') as fin:
      # Iterate over the file object line by line; readlines() would load the whole file
      for text in fin:
          word_cnt = parse(text, word_cnt)
          print(len(word_cnt))

  # Sort by frequency
  sorted_word_cnt = sorted(word_cnt.items(), key=lambda kv: kv[1], reverse=True)

  # Export
  with open('out.txt', 'w') as fout:
      for word, freq in sorted_word_cnt:
          fout.write('{} {}\n'.format(word, freq))
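Both solutions above read line by line, which works as long as no single line is huge. The hint's fixed-length approach can be sketched as follows: a word fragment at the end of one chunk is carried over and joined with the start of the next. The function name count_words and the default chunk size are assumptions made for this illustration.

```python
import re
from collections import defaultdict


def count_words(path, chunk_size=4096):
    """Count words in a file of any size, reading chunk_size characters at a time."""
    word_cnt = defaultdict(int)
    tail = ''  # possibly incomplete word carried over from the previous chunk
    with open(path, 'r', encoding='utf-8') as fin:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            text = tail + chunk.lower()
            # A run of word characters at the very end of the text may
            # continue in the next chunk, so hold it back for now.
            # \Z (not $) is used so a trailing newline ends the word.
            match = re.search(r'\w+\Z', text)
            if match:
                tail = match.group()
                text = text[:match.start()]
            else:
                tail = ''
            for word in re.findall(r'\w+', text):
                word_cnt[word] += 1
    if tail:
        # The last held-back fragment is a complete word at end of file.
        word_cnt[tail] += 1
    return sorted(word_cnt.items(), key=lambda kv: kv[1], reverse=True)
```

Memory use depends on the chunk size and the number of distinct words, not on the size of the file.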

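Question 2: You have probably used cloud drives such as Baidu Netdisk or Dropbox, but their space may be limited (say, 5GB). Suppose one day you plan to move 100GB of data from home to the office, but you have no USB drive with you. So you come up with an idea: each time, write no more than 5GB of data from home to the Dropbox drive; as soon as the office computer detects new data, it copies the data locally and then deletes it from the drive. Once the home computer detects that the whole batch has reached the office computer, it performs the next write, until all the data has been transferred. Based on this idea, you plan to write a server.py at home and a client.py at the office to implement it.

Hint: assume no single file exceeds 5GB. You can synchronize state by writing a control file (config.json). Design the states carefully, though, because a race condition is possible here. You can also synchronize state by directly detecting whether a file has appeared or been deleted, which is the simplest approach.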
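A rough sketch of the simplest approach from the hint, where the appearance and disappearance of a file is the only signal. It assumes the synced folder behaves like an ordinary local directory, which a real cloud drive may not guarantee, and it sends one file at a time rather than packing several into a 5GB batch. The function names send_files and receive_available and the '.part' suffix are hypothetical.

```python
import os
import shutil
import time


def send_files(paths, shared_dir, poll_seconds=1.0):
    """Home side: upload one file at a time and wait until the receiver removes it."""
    for path in paths:
        target = os.path.join(shared_dir, os.path.basename(path))
        partial = target + '.part'
        # Copy under a temporary name, then rename, so the receiver never
        # sees a half-written file under its final name.
        shutil.copyfile(path, partial)
        os.replace(partial, target)
        # The file disappearing is the signal that it has been received.
        while os.path.exists(target):
            time.sleep(poll_seconds)


def receive_available(shared_dir, dest_dir):
    """Office side: move every completed file out of the shared directory."""
    received = []
    for name in sorted(os.listdir(shared_dir)):
        if name.endswith('.part'):
            continue  # still being written by the sender
        shutil.move(os.path.join(shared_dir, name), os.path.join(dest_dir, name))
        received.append(name)
    return received
```

In this sketch server.py would call send_files() once, while client.py would call receive_available() in a polling loop. Writing under a temporary name first is what avoids the race in which the receiver picks up a partially written file.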

Nemo

Article author
