【Day 5】Examine words by part of speech with MeCab

Yesterday's analysis result was not very interesting because nouns inevitably dominate. Today I would like to examine which words appear for each part of speech, using the text I have collected so far.

Change MeCab Tagger options

So far I have specified MeCab.Tagger('-Owakati') without thinking much about it, but it seems you can obtain part-of-speech information by changing this option.

There seem to be the following Tagger options (compared in the sketch after this list).

  • 'mecabrc': MeCab's standard output
  • '-Ochasen': ChaSen format
  • '-Owakati': word-segmented output (words separated by spaces)
  • '-Oyomi': output the reading of the text
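
A quick way to compare these formats is to parse the same sentence with each option. A minimal sketch (the sample sentence is my own, not from the original post):

import MeCab

sample = '今年もかき氷の季節がやってきた'
for option in ['mecabrc', '-Ochasen', '-Owakati', '-Oyomi']:
    tagger = MeCab.Tagger(option)
    print('===', option, '===')
    print(tagger.parse(sample))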

In the case of mecabrc

tagger = MeCab.Tagger('mecabrc')

Output result

. 名詞,サ変接続,*,*,*,*,*
今年 名詞,副詞可能,*,*,*,*,今年,コトシ,コトシ
も 助詞,係助詞,*,*,*,*,も,モ,モ
かき氷 名詞,一般,*,*,*,*,かき氷,カキゴオリ,カキゴオリ
の 助詞,連体化,*,*,*,*,の,ノ,ノ
季節 名詞,一般,*,*,*,*,季節,キセツ,キセツ
が 助詞,格助詞,一般,*,*,*,が,ガ,ガ
やってき 動詞,自立,*,*,カ変・クル,連用形,やってくる,ヤッテキ,ヤッテキ
た 助動詞,*,*,*,特殊・タ,基本形,た,タ,タ
〜 記号,一般,*,*,*,*,〜,〜,〜
😊🍧 記号,一般,*,*,*,*,*
(rest of output omitted)

In the case of -Ochasen

tagger = MeCab.Tagger('-Ochasen')

Output result
. . . 名詞-サ変接続
今年 コトシ 今年 名詞-副詞可能
も モ も 助詞-係助詞
かき氷 カキゴオリ かき氷 名詞-一般
の ノ の 助詞-連体化
季節 キセツ 季節 名詞-一般
が ガ が 助詞-格助詞-一般
やってき ヤッテキ やってくる 動詞-自立 カ変・クル 連用形
た タ た 助動詞 特殊・タ 基本形
〜 〜 〜 記号-一般
😊🍧 😊🍧 😊🍧 記号-一般

In the case of -Oyomi

tagger = MeCab.Tagger('-Oyomi')

Output result

. コトシ モ カキゴオリ ノ キセツ ガ ヤッテキ タ 〜 😊🍧 . ミヤザキ ケン [ リン ' Z ] トイウ キッサテン ☕ ️ プリンセス ピーチ トイウ ネーミング モ カワイ スギル ♡ 🍑 . # カキゴオリ -------- コンヤ テレ ヒガシ 18 : 55 〜 ! 📺 アリエ ヘン ∞ セカイ 2 ジカン SP 📺 # テレ ヒガシ オンガク サイ ガッタイ コラボ キカク 🎤 VTR デ シュツエン シ ! ミ テ ネ 〜 ☺ ️💕 -------- 21 ジハン カラ showroom ヤリ ! オ ヒマ ダッ タラ ゼヒ 🙂🙂💕 -------- ファン ノ ミナサン カラ 、 オ ハナ ヲ イタダキ タ … 💐💗 ホントウニ アリガトウ ゴザイ ( ; _ ; ) -------- # ムラヤマ チーム 4 # チーム 4 コウエン アリガトウ ゴザイ タ 😊💛

Obtaining only a certain part of speech

If you parse with tagger = MeCab.Tagger('mecabrc') without filtering by part of speech, each line of the output follows this format:

Surface form \t Part of speech, POS subdivision 1, POS subdivision 2, POS subdivision 3, Conjugation type, Conjugated form, Base form, Reading, Pronunciation

Looking at an actual output line confirms this:

今年 名詞,副詞可能,*,*,*,*,今年,コトシ,コトシ

So if we split each line at the tab and then split the feature field on commas, element [0] tells us the part of speech, such as 名詞 (noun) or 動詞 (verb), and we can keep only the parts of speech we want.
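
To make the indexing concrete, here is a tiny sketch using one of the output lines above (my own illustration, not code from the original post):

line = '今年\t名詞,副詞可能,*,*,*,*,今年,コトシ,コトシ'
surface, features = line.split('\t')
print(features.split(',')[0])  # => 名詞 (noun)
print(features.split(',')[6])  # => 今年 (base form)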

For now, let's just loop over the parse result with a for statement.

for hinshis in mecab:
    print(hinshis)

Execution result

.

名
詞
,
サ
変
接
続
,
*
,
*
,
*
,
*
,
*

今
年

名
詞
,
副
詞
可
能
,
*
,
*
,
*
,
*
,

In other words, it has been broken apart far more than I expected: looping over a string yields one character at a time.
It seems better to split the output into lines before looping over it with a for statement.

import sys
import MeCab
tagger = MeCab.Tagger('')
# MeCab.Tagger("-Owakati")
text = open('kaiseki.txt', 'r')
# Eliminate line breaks
kaisekiyou = text.read().split('\n')
string = ''.join(kaisekiyou)
mecab = tagger.parse(string)
kaigyou = mecab.splitlines()
print(kaigyou)

Execution result.

['.\t名詞,サ変接続,*,*,*,*,*', '今年\t名詞,副詞可能,*,*,*,*,今年,コトシ,コトシ', 'も\t助詞,係助詞,*,*,*,*,も,モ
(rest omitted)

Good: each element is one line, with the surface form and the features separated by \t.
Let's loop over it with a for statement. The full code so far is below.

import sys
import MeCab

tagger = MeCab.Tagger('')
# MeCab.Tagger("-Owakati")
text = open('kaiseki.txt', 'r')
# Eliminate line breaks
kaisekiyou = text.read().split('\n')
string = ''.join(kaisekiyou)
mecab = tagger.parse(string)
kaigyou = mecab.splitlines()
for hinshis in kaigyou:
    tab = hinshis.split('\t')
    syurui = tab[1]
    print(syurui)

Execution result.

  File "hinsi.py", line 17, in
    syurui = tab [1]
IndexError: list index out of range

An error occurs: the index is outside the length of the list. Let's check the lengths with the len function.

for hinshis in kaigyou:
    tab = hinshis.split('\t')
    print(len(hinshis))

(partially omitted)
19
22
30
26
19
29
29
26
3

This extremely small length of 3 is the source of the error: it is the final "EOS" line that MeCab appends to its output, which contains no tab, so tab[1] does not exist.
Let's add an if statement to skip lines that short.

for hinshis in kaigyou:
    tab = hinshis.split('\t')
    # skip short lines such as "EOS"; tab[1] would raise IndexError
    if len(hinshis) >= 4:
        syurui = tab[1]
        print(syurui)

Execution result

名詞,サ変接続,*,*,*,*,*
名詞,副詞可能,*,*,*,*,今年,コトシ,コトシ
助詞,係助詞,*,*,*,*,も,モ,モ
名詞,一般,*,*,*,*,かき氷,カキゴオリ,カキゴオリ
助詞,連体化,*,*,*,*,の,ノ,ノ
名詞,一般,*,*,*,*,季節,キセツ,キセツ
助詞,格助詞,一般,*,*,*,が,ガ,ガ
動詞,自立,*,*,カ変・クル,連用形,やってくる,ヤッテキ,ヤッテキ
助動詞,*,*,*,特殊・タ,基本形,た,タ,

Looking good. Now we can split this field on commas, check the part of speech that comes at index [0], and take the base form at index [6].

for hinshis in kaigyou:
    tab = hinshis.split('\t')
    if len(hinshis) >= 4:
        syurui = tab[1].split(',')
        if syurui[0] == '動詞':  # 動詞 = verb
            print(syurui[6])    # base form

やってくる
すぎる
ありえる
する
みる
やる
頂く
たつ
書く
下さる
ある
頑張る

I got only the verbs.
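
As a next step, the base forms could be tallied per part of speech with collections.Counter. A sketch of my own, not code from the original post:

from collections import Counter

counts = {}
for hinshis in kaigyou:
    tab = hinshis.split('\t')
    if len(hinshis) >= 4:
        syurui = tab[1].split(',')
        if len(syurui) > 6 and syurui[6] != '*':  # skip entries with no base form
            pos, base = syurui[0], syurui[6]
            counts.setdefault(pos, Counter())[base] += 1

# five most common base forms for each part of speech
for pos, counter in counts.items():
    print(pos, counter.most_common(5))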

It is already 3 a.m., so I will continue analyzing the AKB members' tweets tomorrow.

Today’s result

There were 57 tweets by AKB members today.
When I fed the text into User Local, the frequent words were as follows.
– noun, Tele-Higashi Music Festival, 23
– noun, cast, 10
– noun, chocolate mint, 8
– noun, thank you, 7
– noun, AKB48, 7
– noun, today, 7
– noun, strategy, 6


【Day 4】Make Word-cloud images that support Japanese

Let's continue where I left off yesterday. To make Word-cloud handle Japanese, I need to specify the path to a Japanese font. Before that, I will report the results of today's AKB tweets.

Today's AKB tweets

Today, the AKB members posted 43 tweets.
Let's make images from this data.

Making Word-cloud support Japanese

According to several websites, macOS ships with Japanese fonts, so I should just need to specify one in the program. I added the following line to yesterday's code.

fpath = "/ Library / Fonts / Hiragino horn Pro W3.otf"

Now let's try running this code. An error occurred.

self.font = core.getfont(font, size, index, encoding)
OSError: cannot open resource

It says "OSError: cannot open resource", so the specified path is probably wrong. The find command can search for files, so let's use it right away to find the correct path.

$ find /Library/Fonts/
(partially omitted)
/Library/Fonts/Trebuchet MS.ttf
/Library/Fonts/Verdana Bold Italic.ttf
/Library/Fonts/Verdana Bold.ttf
/Library/Fonts/Verdana Italic.ttf
/Library/Fonts/Verdana.ttf
/Library/Fonts/Waseem.ttc
/Library/Fonts/Webdings.ttf
/Library/Fonts/Wingdings 2.ttf
/Library/Fonts/Wingdings 3.ttf
/Library/Fonts/Wingdings.ttf
/Library/Fonts/Zapfino.ttf
/Library/Fonts/ヒラギノ丸ゴ ProN W4.ttc

I found one Japanese font: /Library/Fonts/ヒラギノ丸ゴ ProN W4.ttc (Hiragino Maru Gothic ProN W4). I set this path.

fpath = "/ Library / Fonts / Hiragino Marugo ProN W4.ttc"

Now let's run it again.

The image was generated successfully.
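
For reference, here is a minimal sketch of how the font path plugs into the WordCloud call (my own reconstruction of this step, not the exact code used in the post; the file names are assumptions):

from wordcloud import WordCloud

fpath = '/Library/Fonts/ヒラギノ丸ゴ ProN W4.ttc'  # font found above
text = open('kaiseki.txt').read()  # space-separated Japanese text

# font_path is what makes Japanese glyphs render instead of empty boxes
wc = WordCloud(font_path=fpath, width=900, height=500,
               background_color='white').generate(text)
wc.to_file('akb_wordcloud.png')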

However, as it stands, "https", "co", and "Jun" dominate, and there are too many words to tell which ones are important.
I will use regular expressions to erase the t.co URL fragments (everything from "/" to the end of the line) and other noise. Using Atom's search and replace, I replaced the following patterns with a single-byte space.

/.*$
Tue Jun 26. *
1011. *

They disappeared successfully. (The polite verb endings ます and まし were also distracting, so I erased them with a regular expression as well.)
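
The same cleanup could be done in Python with the re module instead of an editor. A minimal sketch under that assumption (the file names are mine):

import re

text = open('kaiseki.txt').read()
# the same patterns as in the editor search above, replaced by a space
for pattern in [r'/.*$', r'Tue Jun 26.*', r'1011.*', r'ます', r'まし']:
    text = re.sub(pattern, ' ', text, flags=re.MULTILINE)

with open('kaiseki_clean.txt', 'w') as f:
    f.write(text)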

To reduce the number of words in the word cloud, it seems good to lower the value of the following parameter:

max_words = 2000,

Let’s try it around 500.
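
In the sketch above, that would mean passing the smaller value to the constructor (again my own illustration):

wc = WordCloud(font_path=fpath, max_words=500,
               background_color='white').generate(text)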

It worked.
It seems that the following words were tweeted often:
– 公演 (performance)
– 撮影 (photo shoot)
– 村山 (Murayama, a surname)
– チーム (team)

This completes one figure, so I’d like to make a different figure tomorrow.

【Day 3】Morphological analysis of Japanese with MeCab

I learned how to use Word-cloud yesterday. Today I would like to produce text separated by spaces, which is a precondition for using Word-cloud.

Japanese, however, cannot simply be separated by spaces. To analyze Japanese I need morphological analysis, which splits sentences into words. MeCab is a morphological analysis engine for Japanese.

MeCab
https://taku910.github.io/mecab/

Today I will learn how to use this MeCab.

First, install it:
brew install mecab-ipadic

Let's try mecab right after installing.

$ mecab -v
mecab of 0.996

Type mecab, then enter a sentence you want to analyze.

$mecab
青パジャマ赤パジャマ黄パジャマ、バスガス大爆発、東京特許許可局
青 接頭詞,名詞接続,*,*,*,*,青,アオ,アオ
パジャマ 名詞,一般,*,*,*,*,パジャマ,パジャマ,パジャマ
赤 名詞,一般,*,*,*,*,赤,アカ,アカ
パジャマ 名詞,一般,*,*,*,*,パジャマ,パジャマ,パジャマ
黄 名詞,一般,*,*,*,*,黄,キ,キ
パジャマ 名詞,一般,*,*,*,*,パジャマ,パジャマ,パジャマ
、 記号,読点,*,*,*,*,、,、,、
バス 名詞,一般,*,*,*,*,バス,バス,バス
ガス 名詞,一般,*,*,*,*,ガス,ガス,ガス
大 接頭詞,名詞接続,*,*,*,*,大,ダイ,ダイ
爆発 名詞,サ変接続,*,*,*,*,爆発,バクハツ,バクハツ
、 記号,読点,*,*,*,*,、,、,、
東京 名詞,固有名詞,地域,一般,*,*,東京,トウキョウ,トーキョー
特許 名詞,サ変接続,*,*,*,*,特許,トッキョ,トッキョ
許可 名詞,サ変接続,*,*,*,*,許可,キョカ,キョカ
局 名詞,接尾,一般,*,*,*,局,キョク,キョク

It analyzes properly.
Let's analyze today's tweets from the AKB members. Incidentally, there were 42 posts today. It seems it was the birthday of Murayama Aki from Team 4, so her name came up a lot.

I install the MeCab bindings for Python using pip.

pip install mecab-python3


import sys
import MeCab
tagger = MeCab.Tagger("-Owakati")
text = open('./text/today/20180625.txt', 'r')
# Eliminate line breaks
kaisekiyou = text.read().split('\n')
mecab = tagger.parse(kaisekiyou)
with open("kaiseki.txt", "w", encoding='utf-8') as f:
    f.write(mecab)

However, in this case an error occurs.

NotImplementedError: Wrong number or type of arguments for overloaded function 'Tagger_parse'.
  Possible C/C++ prototypes are:
    MeCab::Tagger::parse(MeCab::Model const &, MeCab::Lattice *)
    MeCab::Tagger::parse(MeCab::Lattice *) const
    MeCab::Tagger::parse(char const *)

Apparently parse() needs a string, and I was passing it a list. Let's convert the list to a string with the .join method and run it again.


import sys
import MeCab
tagger = MeCab.Tagger("-Owakati")
text = open('./text/today/20180625.txt', 'r')
# Eliminate line breaks
kaisekiyou = text.read().split('\n')
string = ''.join(kaisekiyou)
mecab = tagger.parse(string)
with open("kaiseki.txt", "w", encoding='utf-8') as f:
    f.write(mecab)

It went well.

# Murayama team 4 # Those who watched the performance while holding hands, thank you very much 🙇‍♀️🍎🍎! Today was Yuri's birthday festival! I thought once again how great Team 4 is, and I will keep following Yuri!! Congratulations... t.co/oDjL3cDL

(Hereinafter abbreviated)

I fed this into yesterday's code.

It is different from the image I had in mind. Apparently the default font does not handle Japanese.
I hope to complete the figure this time tomorrow.

【Day 2】Make an image with Word-cloud

I often see this kind of picture (a word cloud).

My site feels rather unimaginative, so I will make an image like this using a Python library called word_cloud. Let's start by making one.

The installation method is written on the GitHub page:

wget https://github.com/amueller/word_cloud/archive/master.zip

That is what the page says, but macOS does not come with wget, so I simply downloaded the zip in the browser.

unzip word_cloud-master.zip
rm word_cloud-master.zip
cd word_cloud-master
python setup.py install

I checked this library's specifications. Apparently it only handles text separated by spaces (and Japanese is not written with spaces). So for today I will visualize the English Wikipedia article on AKB48, using examples/simple.py from the GitHub repository.

The code is essentially the sample as-is (akb.txt is a copy of the Wikipedia text).
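
Since the code is not reproduced here, this is a sketch along the lines of examples/simple.py with akb.txt swapped in (written from memory, not a verbatim copy of the repository file):

from wordcloud import WordCloud
import matplotlib.pyplot as plt

text = open('akb.txt').read()           # copy of the AKB48 Wikipedia article
wordcloud = WordCloud().generate(text)  # default settings, English text

plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()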

The resulting figure looks like this.

Tomorrow I would like to separate Japanese words by space.

【Day 1】 Collect data from Twitter with Python

I was impressed by an article about someone who kept building web services for 180 days, and I wanted to do something like her.

180 Websites in 180 Days: How I Learned to Code

By the way, I am not an engineer, but I am strongly interested in programming, so I will study Python through text mining for 180 days. These days we can collect a huge amount of text from Twitter and the web, and after 180 days of analysis I hope to acquire skills that can be applied to machine learning. I think that will be useful in my future, so I started this project.

I first tried to analyze the tweets of Nogizaka46, the Japanese idol group that seems to attract the most page views, but its members do not use Twitter.

Then I found about 45 accounts from AKB48, another famous Japanese idol group, so I decided to analyze the tweets of the AKB members for the time being.

Target
I found the following Twitter accounts of AKB members.

* Nanami Asai @48_asainanami
* Aimi Ichikawa @IckwMnm0826
* Anna Iriyama @iriyamaanna1203
* Saho Iwatachi @yahho_sahho
* Rio Okawa @rio_rin48
* Miyu Omori @omorimyu_pon
* Shizuka Oya @ooyachaaan1228
* Nana Okada @okadanana_1107
* Renna Kato @katorena_710
* Saya Kawamoto @sayaya_0388
* Saki Kitagawa @Sakii_Kitazawa
* Reiko Kubo @AKB48K5
* Masako Kojima @mak0_k0jima
* Moe Goto @moe_goto0520
* Hayaka Iriyama @912_komiharu
* Yukari Sasaki @yukari__0828
* Kiara Sato @ki_cyaco48
* Ayana Shinozaki @ayana18_48
* Hinana Simoguti @177_shimo719
* Kurumi Suzuki @akb48kururun
* Akari Takahashi @juri_t_official
* Kayoko Tanakita @kayoyon213
* Aika Taguchi @48manaka_16
* Takeuchi Miao @take_miyu112
* Makiho Tatuya @makiho_1019
* Megu Taniguchi @o_megu1112
* Eri Chiba @erii_20031027
* Tomona Nakanishi @chiyori_n512
* Rei Nishikawa @rei_1025_48
* Rena Nozawa @RENAN0ZAWA
* Yui Hiwatashi @yui_hiwata430
* Seina Fukuoka @seina_fuku48
* Nana Fujita @fujitanana_1228
* Ma Chia-Ling @macyacyarin
* Maeda Ayaka @akb4816ayaka
* Minami Minegishi @chan__31
* Miho Mihozaki @730myao
* Mukaichi Mion @mionnn_48
* Orin Muto @muto_orin
* Tomu Muto @tommuto1125
* Aki Murayama @yuirii_murayama
* Mogi shinobi @mogi0_0216
* Rui Yamauchi @MizukiYamauchi
* Ayu Yanabe @ayuchan0203
* Ami Yumoto @ami_15chans
* Yui Yokoyama @Yui_yoko208

I tried to fetch tweet text with Python, but I didn't know how to specify multiple accounts in the request params. So I wrote the call once per member and drove it for all the AKB members with a shell script. This was my first time writing shell script, and it is very useful.

Here is the Python code for getting text from Twitter.
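
The original script is not reproduced here, so below is a minimal sketch of what fetching members' tweets might look like, assuming Twitter API 1.1 credentials and the requests_oauthlib package (all keys and the member list are placeholders of mine):

import requests
from requests_oauthlib import OAuth1

# placeholder credentials; substitute your own Twitter API keys
auth = OAuth1('CONSUMER_KEY', 'CONSUMER_SECRET',
              'ACCESS_TOKEN', 'ACCESS_TOKEN_SECRET')

url = 'https://api.twitter.com/1.1/statuses/user_timeline.json'
# user_timeline takes one screen_name at a time, hence one call per member
members = ['48_asainanami', 'iriyamaanna1203', 'yuirii_murayama']

for name in members:
    params = {'screen_name': name, 'count': 200}
    res = requests.get(url, auth=auth, params=params)
    for tweet in res.json():
        print(tweet['text'])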