function sentences = preprocess(raw, min_length)
%PREPROCESS removes non-dialogue text
%   The raw text contains more than just dialogue, so we need to clean it
%   up. Fortunately, Shakespeare's plays follow a fairly standardized
%   format:
%
%   ACT
%   Scene
%   Name.
%   [stage direction]

if nargin == 1
    min_length = 3;                            % default minimum sentence length (tokens)
end

%% Initial processing
% We will split the text using the standard format.

% split text into larger sections - let's call them |paragraphs|
paragraphs = regexp(raw, '\r\n\r\n', 'split'); % split by double line breaks

% split |paragraphs| into sentences
sentences = regexp(paragraphs', ...            % split on whitespace after punctuation
    '(?<=[!.?;:])\s', 'split');

% remove non-dialogue text
for i = 1:length(sentences)                    % loop over paragraphs
    if length(sentences{i}) == 1               % only 1 sentence in the paragraph
        if regexp(sentences{i}{1}, ...         % if starts with 'ACT' or 'Act'
                '^(\r\n)*(ACT|Act).+\.$')
            sentences{i} = [];                 % remove it
        elseif regexp(sentences{i}{1}, ...     % if enclosed in '[]'
                '^(\r\n)*\[.+\]\.?$')
            sentences{i} = [];                 % remove stage directions
        end
    else                                       % 2 or more sentences in the paragraph
        if regexp(sentences{i}{1}, ...         % if starts with 'Scene...'
                '^(\r\n)*Scene.+\.$')
            sentences{i} = [];                 % remove the whole paragraph
        elseif regexp(sentences{i}{1}, ...     % if a speaker name ending with '.'
                '^(\r\n)*\d?\s?\w+\s*\w+\.$')
            sentences{i}(1) = [];              % remove the name only
        elseif ~isempty(regexp(sentences{i}{1}, ...
                '^(\r\n)*\[.+', 'once')) && ...    % if starts with '['
                ~isempty(regexp(sentences{i}{end}, ...
                '.+\]\.?$', 'once'))               % and ends with ']'
            sentences{i} = [];                 % remove it
        end
    end
end

sentences = [sentences{:}]';                   % flatten the cell array
sentences(cellfun(@isempty, sentences)) = [];  % remove empty cells

%% Dealing with exceptions
% We have some remaining issues.
sentences = regexprep(sentences, '\[.+\]', '');   % remove stage directions
sentences = regexp(sentences, '--', 'split');     % split by double hyphens
sentences = [sentences{:}]';                      % flatten the cell array
sentences(cellfun(@isempty, sentences)) = [];     % remove empty cells
sentences = regexprep(sentences, '^\n\r', '');    % remove LFCR
sentences = regexprep(sentences, '^\r\n', '');    % remove CRLF
sentences = regexprep(sentences, '^\n', '');      % remove LF
sentences = regexprep(sentences, '^\r', '');      % remove CR
sentences = regexprep(sentences, '^:', '');       % remove colon
sentences = regexprep(sentences, '^\.', '');      % remove period
sentences = regexprep(sentences, '^\s', '');      % remove space
sentences(cellfun(@isempty, sentences)) = [];     % remove empty cells

%% Remove short sentences
% If a sentence is too short, then it doesn't help.

tokens = cellfun(@strsplit, sentences, ...        % tokenize sentences on whitespace
    'UniformOutput', false);
isShort = cellfun(@length, tokens) < min_length;  % shorter than minimum?
sentences(isShort) = [];                          % remove short sentences

%% Add Sentence Markers
% Now we have mostly clean data. For further processing, we need to add
% markers that flag the start and the end of each sentence.
% The marker strings used here, '<s>' and '</s>', are an assumption (the
% conventional sentence-boundary tokens).

for i = 1:length(sentences)
    sentences{i} = ['<s> ' strtrim(sentences{i}) ' </s>'];
end

end
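
% Example usage (a minimal sketch, not part of the original function).
% The sample text is a made-up fragment for illustration, and it assumes
% CRLF line endings, which is what the paragraph-splitting pattern
% '\r\n\r\n' above expects. With text read from a file, fileread could
% supply |raw| instead.
%
%   raw = sprintf(['ACT I.\r\n\r\n' ...
%       'Scene I. A street.\r\n\r\n' ...
%       'First Citizen.\r\n' ...
%       'Before we proceed any further, hear me speak.']);
%   sentences = preprocess(raw);      % min_length defaults to 3
%   disp(sentences)
%
% The act heading, the scene heading and the speaker name should be
% dropped, leaving only the spoken line wrapped in the sentence markers.
% Text with bare LF line endings would not be split into paragraphs, so
% convert line endings first if needed.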