HTML+CSS+JS隨機生成驗證碼

天在小破站看到一個前端生成隨機驗證碼的視頻，感覺很有意思，就跟著操作了一下，成功了。

后來自己又想給它加一個提交按鈕，輸入并判斷驗證碼的正確性，也可以正常運行，但是我的代碼好像還是存在一些bug：

1.canvas標簽是繪圖容器，自帶屬性使它是一個默認300*150的容器，縮小canvs容器時，里面的繪圖也會變小，我沒有找到合適的方法去調整它的大小。

2.對于輸入框input，我使用了float浮動，為了使input輸入框和canvas里面的驗證碼并排，但是被vscode警告了。

這次的實戰練習也經歷了很多坎坷，但是也收獲很大。學習到了canvas標簽的用法、JS全局變量和局部變量、以及有關context的一些屬性和方法。

最后，希望路過的大佬，幫我看看bug，幫幫菜鳥。

附上源碼

深度學習自動生成HTML代碼

選自Floydhub

作者：Emil Wallner

機器之心編譯

如何用前端頁面原型生成對應的代碼一直是我們關注的問題，本文作者根據 pix2code 等論文構建了一個強大的前端代碼生成模型，并詳細解釋了如何利用 LSTM 與 CNN 將設計原型編寫為 HTML 和 CSS 網站。

項目鏈接：github.com/emilwallner…

在未來三年內，深度學習將改變前端開發。它將會加快原型設計速度，拉低開發軟件的門檻。

Tony Beltramelli 在去年發布了論文《pix2code: Generating Code from a Graphical User Interface Screenshot》，Airbnb 也發布Sketch2code（airbnb.design/sketching-i…）。

目前，自動化前端開發的最大阻礙是計算能力。但我們已經可以使用目前的深度學習算法，以及合成訓練數據來探索人工智能自動構建前端的方法。在本文中，作者將教神經網絡學習基于一張圖片和一個設計模板來編寫一個 HTML 和 CSS 網站。以下是該過程的簡要概述：

1）向訓練過的神經網絡輸入一個設計圖

2）神經網絡將圖片轉化為 HTML 標記語言

3）渲染輸出

我們將分三步從易到難構建三個不同的模型，首先，我們構建最簡單地版本來掌握移動部件。第二個版本 HTML 專注于自動化所有步驟，并簡要解釋神經網絡層。在最后一個版本 Bootstrap 中，我們將創建一個模型來思考和探索 LSTM 層。

代碼地址：

github.com/emilwallner…
www.floydhub.com/emilwallner…

所有 FloydHub notebook 都在 floydhub 目錄中，本地 notebook 在 local 目錄中。

本文中的模型構建基于 Beltramelli 的論文《pix2code: Generating Code from a Graphical User Interface Screenshot》和 Jason Brownlee 的圖像描述生成教程，并使用 Python 和 Keras 完成。

核心邏輯

我們的目標是構建一個神經網絡，能夠生成與截圖對應的 HTML/CSS 標記語言。

訓練神經網絡時，你先提供幾個截圖和對應的 HTML 代碼。網絡通過逐個預測所有匹配的 HTML 標記語言來學習。預測下一個標記語言的標簽時，網絡接收到截圖和之前所有正確的標記。

這里是一個簡單的訓練數據示例：docs.google.com/spreadsheet…。

創建逐詞預測的模型是現在最常用的方法，也是本教程使用的方法。

注意：每次預測時，神經網絡接收的是同樣的截圖。也就是說如果網絡需要預測 20 個單詞，它就會得到 20 次同樣的設計截圖。現在，不用管神經網絡的工作原理，只需要專注于神經網絡的輸入和輸出。

我們先來看前面的標記（markup）。假如我們訓練神經網絡的目的是預測句子「I can code」。當網絡接收「I」時，預測「can」。下一次時，網絡接收「I can」，預測「code」。它接收所有之前單詞，但只預測下一個單詞。

神經網絡根據數據創建特征。神經網絡構建特征以連接輸入數據和輸出數據。它必須創建表征來理解每個截圖的內容和它所需要預測的 HTML 語法，這些都是為預測下一個標記構建知識。把訓練好的模型應用到真實世界中和模型訓練過程差不多。

我們無需輸入正確的 HTML 標記，網絡會接收它目前生成的標記，然后預測下一個標記。預測從「起始標簽」（start tag）開始，到「結束標簽」（end tag）終止，或者達到最大限制時終止。

Hello World 版

現在讓我們構建 Hello World 版實現。我們將饋送一張帶有「Hello World！」字樣的截屏到神經網絡中，并訓練它生成對應的標記語言。

首先，神經網絡將原型設計轉換為一組像素值。且每一個像素點有 RGB 三個通道，每個通道的值都在 0-255 之間。

為了以神經網絡能理解的方式表征這些標記，我使用了 one-hot 編碼。因此句子「I can code」可以映射為以下形式。

在上圖中，我們的編碼包含了開始和結束的標簽。這些標簽能為神經網絡提供開始預測和結束預測的位置信息。以下是這些標簽的各種組合以及對應 one-hot 編碼的情況。

我們會使每個單詞在每一輪訓練中改變位置，因此這允許模型學習序列而不是記憶詞的位置。在下圖中有四個預測，每一行是一個預測。且左邊代表 RGB 三色通道和之前的詞，右邊代表預測結果和紅色的結束標簽。

#Length of longest sentence
 max_caption_len = 3
#Size of vocabulary 
 vocab_size = 3
# Load one screenshot for each word and turn them into digits 
 images = []
for i in range(2):
 images.append(img_to_array(load_img('screenshot.jpg', target_size=(224, 224))))
 images = np.array(images, dtype=float)
# Preprocess input for the VGG16 model
 images = preprocess_input(images)
#Turn start tokens into one-hot encoding
 html_input = np.array(
 [[[0., 0., 0.], #start
 [0., 0., 0.],
 [1., 0., 0.]],
 [[0., 0., 0.], #start <HTML>Hello World!</HTML>
 [1., 0., 0.],
 [0., 1., 0.]]])
#Turn next word into one-hot encoding
 next_words = np.array(
 [[0., 1., 0.], # <HTML>Hello World!</HTML>
 [0., 0., 1.]]) # end
# Load the VGG16 model trained on imagenet and output the classification feature
 VGG = VGG16(weights='imagenet', include_top=True)
# Extract the features from the image
 features = VGG.predict(images)
#Load the feature to the network, apply a dense layer, and repeat the vector
 vgg_feature = Input(shape=(1000,))
 vgg_feature_dense = Dense(5)(vgg_feature)
 vgg_feature_repeat = RepeatVector(max_caption_len)(vgg_feature_dense)
# Extract information from the input seqence 
 language_input = Input(shape=(vocab_size, vocab_size))
 language_model = LSTM(5, return_sequences=True)(language_input)
# Concatenate the information from the image and the input
 decoder = concatenate([vgg_feature_repeat, language_model])
# Extract information from the concatenated output
 decoder = LSTM(5, return_sequences=False)(decoder)
# Predict which word comes next
 decoder_output = Dense(vocab_size, activation='softmax')(decoder)
# Compile and run the neural network
 model = Model(inputs=[vgg_feature, language_input], outputs=decoder_output)
 model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
# Train the neural network
 model.fit([features, html_input], next_words, batch_size=2, shuffle=False, epochs=1000)
復制代碼

在 Hello World 版本中，我們使用三個符號「start」、「Hello World」和「end」。字符級的模型要求更小的詞匯表和受限的神經網絡，而單詞級的符號在這里可能有更好的性能。

以下是執行預測的代碼：

# Create an empty sentence and insert the start token
 sentence = np.zeros((1, 3, 3)) # [[0,0,0], [0,0,0], [0,0,0]]
 start_token = [1., 0., 0.] # start
 sentence[0][2] = start_token # place start in empty sentence
# Making the first prediction with the start token
 second_word = model.predict([np.array([features[1]]), sentence])
# Put the second word in the sentence and make the final prediction
 sentence[0][1] = start_token
 sentence[0][2] = np.round(second_word)
 third_word = model.predict([np.array([features[1]]), sentence])
# Place the start token and our two predictions in the sentence 
 sentence[0][0] = start_token
 sentence[0][1] = np.round(second_word)
 sentence[0][2] = np.round(third_word)
# Transform our one-hot predictions into the final tokens
 vocabulary = ["start", "<HTML><center><H1>Hello World!</H1></center></HTML>", "end"]
for i in sentence[0]:
print(vocabulary[np.argmax(i)], end=' ')
復制代碼

輸出

10 epochs: start start start
100 epochs: start <HTML><center><H1>Hello World!</H1></center></HTML> <HTML><center><H1>Hello World!</H1></center></HTML>
300 epochs: start <HTML><center><H1>Hello World!</H1></center></HTML> end

我走過的坑：

在收集數據之前構建第一個版本。在本項目的早期階段，我設法獲得 Geocities 托管網站的舊版存檔，它有 3800 萬的網站。但我忽略了減少 100K 大小詞匯所需要的巨大工作量。
訓練一個 TB 級的數據需要優秀的硬件或極其有耐心。在我的 Mac 遇到幾個問題后，最終用上了強大的遠程服務器。我預計租用 8 個現代 CPU 和 1 GPS 內部鏈接以運行我的工作流。
在理解輸入與輸出數據之前，其它部分都似懂非懂。輸入 X 是屏幕的截圖和以前標記的標簽，輸出 Y 是下一個標記的標簽。當我理解這一點時，其它問題都更加容易弄清了。此外，嘗試其它不同的架構也將更加容易。
圖片到代碼的網絡其實就是自動描述圖像的模型。即使我意識到了這一點，但仍然錯過了很多自動圖像摘要方面的論文，因為它們看起來不夠炫酷。一旦我意識到了這一點，我對問題空間的理解就變得更加深刻了。

在 FloydHub 上運行代碼

FloydHub 是一個深度學習訓練平臺，我自從開始學習深度學習時就對它有所了解，我也常用它訓練和管理深度學習試驗。我們能安裝它并在 10 分鐘內運行第一個模型，它是在云 GPU 上訓練模型最好的選擇。若果讀者沒用過 FloydHub，可以花 10 分鐘左右安裝并了解。

FloydHub 地址：www.floydhub.com/

復制 Repo：

https://github.com/emilwallner/Screenshot-to-code-in-Keras.git
復制代碼

登錄并初始化 FloydHub 命令行工具：

cd Screenshot-to-code-in-Keras
floyd login
floyd init s2c
復制代碼

在 FloydHub 云 GPU 機器上運行 Jupyter notebook：

floyd run --gpu --env tensorflow-1.4 --data emilwallner/datasets/imagetocode/2:data --mode jupyter
復制代碼

所有的 notebook 都放在 floydbub 目錄下。一旦我們開始運行模型，那么在 floydhub/Helloworld/helloworld.ipynb 下可以找到第一個 Notebook。更多詳情請查看本項目早期的 flags。

HTML 版本

在這個版本中，我們將關注與創建一個可擴展的神經網絡模型。該版本并不能直接從隨機網頁預測 HTML，但它是探索動態問題不可缺少的步驟。

概覽

如果我們將前面的架構擴展為以下右圖展示的結構，那么它就能更高效地處理識別與轉換過程。

該架構主要有兩個部，即編碼器與解碼器。編碼器是我們創建圖像特征和前面標記特征（markup features）的部分。特征是網絡創建原型設計和標記語言之間聯系的構建塊。在編碼器的末尾，我們將圖像特征傳遞給前面標記的每一個單詞。隨后解碼器將結合原型設計特征和標記特征以創建下一個標簽的特征，這一個特征可以通過全連接層預測下一個標簽。

設計原型的特征

因為我們需要為每個單詞插入一個截屏，這將會成為訓練神經網絡的瓶頸。因此我們抽取生成標記語言所需要的信息來替代直接使用圖像。這些抽取的信息將通過預訓練的 CNN 編碼到圖像特征中，且我們將使用分類層之前的層級輸出以抽取特征。

我們最終得到 1536 個 8*8 的特征圖，雖然我們很難直觀地理解它，但神經網絡能夠從這些特征中抽取元素的對象和位置。

標記特征

在 Hello World 版本中，我們使用 one-hot 編碼以表征標記。而在該版本中，我們將使用詞嵌入表征輸入并使用 one-hot 編碼表示輸出。我們構建每個句子的方式保持不變，但我們映射每個符號的方式將會變化。one-hot 編碼將每一個詞視為獨立的單元，而詞嵌入會將輸入數據表征為一個實數列表，這些實數表示標記標簽之間的關系。

上面詞嵌入的維度為 8，但一般詞嵌入的維度會根據詞匯表的大小在 50 到 500 間變動。以上每個單詞的八個數值就類似于神經網絡中的權重，它們傾向于刻畫單詞之間的聯系（Mikolov alt el., 2013）。這就是我們開始部署標記特征（markup features）的方式，而這些神經網絡訓練的特征會將輸入數據和輸出數據聯系起來。

編碼器

我們現在將詞嵌入饋送到 LSTM 中，并期望能返回一系列的標記特征。這些標記特征隨后會饋送到一個 Time Distributed 密集層，該層級可以視為有多個輸入和輸出的全連接層。

和嵌入與 LSTM 層相平行的還有另外一個處理過程，其中圖像特征首先會展開成一個向量，然后再饋送到一個全連接層而抽取出高級特征。這些圖像特征隨后會與標記特征相級聯而作為編碼器的輸出。

標記特征

如下圖所示，現在我們將詞嵌入投入到 LSTM 層中，所有的語句都會用零填充以獲得相同的向量長度。

為了混合信號并尋找高級模式，我們運用了一個 TimeDistributed 密集層以抽取標記特征。TimeDistributed 密集層和一般的全連接層非常相似，且它有多個輸入與輸出。

圖像特征

對于另一個平行的過程，我們需要將圖像的所有像素值展開成一個向量，因此信息不會被改變，它們只會用來識別。

如上，我們會通過全連接層混合信號并抽取更高級的概念。因為我們并不只是處理一個輸入值，因此使用一般的全連接層就行了。

級聯圖像特征和標記特征

所有的語句都被填充以創建三個標記特征。因為我們已經預處理了圖像特征，所以我們能為每一個標記特征添加圖像特征。

如上，在復制圖像特征到對應的標記特征后，我們得到了新的圖像-標記特征（image-markup features），這就是我們饋送到解碼器的輸入值。

解碼器

現在，我們使用圖像-標記特征來預測下一個標簽。

在下面的案例中，我們使用三個圖像-標簽特征對來輸出下一個標簽特征。注意 LSTM 層不應該返回一個長度等于輸入序列的向量，而只需要預測預測一個特征。在我們的案例中，這個特征將預測下一個標簽，它包含了最后預測的信息。

最后的預測

密集層會像傳統前饋網絡那樣工作，它將下一個標簽特征中的 512 個值與最后的四個預測連接起來，即我們在詞匯表所擁有的四個單詞：start、hello、world 和 end。密集層最后采用的 softmax 函數會為四個類別產生一個概率分布，例如 [0.1, 0.1, 0.1, 0.7] 將預測第四個詞為下一個標簽。

# Load the images and preprocess them for inception-resnet
 images = []
 all_filenames = listdir('images/')
 all_filenames.sort()
for filename in all_filenames:
 images.append(img_to_array(load_img('images/'+filename, target_size=(299, 299))))
 images = np.array(images, dtype=float)
 images = preprocess_input(images)
# Run the images through inception-resnet and extract the features without the classification layer
 IR2 = InceptionResNetV2(weights='imagenet', include_top=False)
 features = IR2.predict(images)
# We will cap each input sequence to 100 tokens
 max_caption_len = 100
# Initialize the function that will create our vocabulary 
 tokenizer = Tokenizer(filters='', split=" ", lower=False)
# Read a document and return a string
def load_doc(filename):
 file = open(filename, 'r')
 text = file.read()
 file.close()
return text
# Load all the HTML files
 X = []
 all_filenames = listdir('html/')
 all_filenames.sort()
for filename in all_filenames:
 X.append(load_doc('html/'+filename))
# Create the vocabulary from the html files
 tokenizer.fit_on_texts(X)
# Add +1 to leave space for empty words
 vocab_size = len(tokenizer.word_index) + 1
# Translate each word in text file to the matching vocabulary index
 sequences = tokenizer.texts_to_sequences(X)
# The longest HTML file
 max_length = max(len(s) for s in sequences)
# Intialize our final input to the model
 X, y, image_data = list(), list(), list()
for img_no, seq in enumerate(sequences):
for i in range(1, len(seq)):
# Add the entire sequence to the input and only keep the next word for the output
 in_seq, out_seq = seq[:i], seq[i]
# If the sentence is shorter than max_length, fill it up with empty words
 in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
# Map the output to one-hot encoding
 out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
# Add and image corresponding to the HTML file
 image_data.append(features[img_no])
# Cut the input sentence to 100 tokens, and add it to the input data
 X.append(in_seq[-100:])
 y.append(out_seq)
 X, y, image_data = np.array(X), np.array(y), np.array(image_data)
# Create the encoder
 image_features = Input(shape=(8, 8, 1536,))
 image_flat = Flatten()(image_features)
 image_flat = Dense(128, activation='relu')(image_flat)
 ir2_out = RepeatVector(max_caption_len)(image_flat)
 language_input = Input(shape=(max_caption_len,))
 language_model = Embedding(vocab_size, 200, input_length=max_caption_len)(language_input)
 language_model = LSTM(256, return_sequences=True)(language_model)
 language_model = LSTM(256, return_sequences=True)(language_model)
 language_model = TimeDistributed(Dense(128, activation='relu'))(language_model)
# Create the decoder
 decoder = concatenate([ir2_out, language_model])
 decoder = LSTM(512, return_sequences=False)(decoder)
 decoder_output = Dense(vocab_size, activation='softmax')(decoder)
# Compile the model
 model = Model(inputs=[image_features, language_input], outputs=decoder_output)
 model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
# Train the neural network
 model.fit([image_data, X], y, batch_size=64, shuffle=False, epochs=2)
# map an integer to a word
def word_for_id(integer, tokenizer):
for word, index in tokenizer.word_index.items():
if index == integer:
return word
return None
# generate a description for an image
def generate_desc(model, tokenizer, photo, max_length):
# seed the generation process
 in_text = 'START'
# iterate over the whole length of the sequence
for i in range(900):
# integer encode input sequence
 sequence = tokenizer.texts_to_sequences([in_text])[0][-100:]
# pad input
 sequence = pad_sequences([sequence], maxlen=max_length)
# predict next word
 yhat = model.predict([photo,sequence], verbose=0)
# convert probability to integer
 yhat = np.argmax(yhat)
# map integer to word
 word = word_for_id(yhat, tokenizer)
# stop if we cannot map the word
if word is None:
break
# append as input for generating the next word
 in_text += ' ' + word
# Print the prediction
print(' ' + word, end='')
# stop if we predict the end of the sequence
if word == 'END':
break
return
# Load and image, preprocess it for IR2, extract features and generate the HTML
 test_image = img_to_array(load_img('images/87.jpg', target_size=(299, 299)))
 test_image = np.array(test_image, dtype=float)
 test_image = preprocess_input(test_image)
 test_features = IR2.predict(np.array([test_image]))
 generate_desc(model, tokenizer, np.array(test_features), 100)
復制代碼

輸出

訓練不同輪數所生成網站的地址：

250 epochs：emilwallner.github.io/html/250_ep…
350 epochs：emilwallner.github.io/html/350_ep…
450 epochs：emilwallner.github.io/html/450_ep…
550 epochs：emilwallner.github.io/html/550_ep…

我走過的坑：

我認為理解 LSTM 比 CNN 要難一些。當我展開 LSTM 后，它們會變得容易理解一些。此外，我們在嘗試理解 LSTM 前，可以先關注輸入與輸出特征。
從頭構建一個詞匯表要比壓縮一個巨大的詞匯表容易得多。這樣的構建包括字體、div 標簽大小、變量名的 hex 顏色和一般單詞。
大多數庫是為解析文本文檔而構建。在庫的使用文檔中，它們會告訴我們如何通過空格進行分割，而不是代碼，我們需要自定義解析的方式。
我們可以從 ImageNet 上預訓練的模型抽取特征。然而，相對于從頭訓練的 pix2code 模型，損失要高 30% 左右。此外，我對于使用基于網頁截屏預訓練的 inception-resnet 網絡很有興趣。

Bootstrap 版本

在最終版本中，我們使用 pix2code 論文中生成 bootstrap 網站的數據集。使用 Twitter 的 Bootstrap 庫（getbootstrap.com/），我們可以結合 HTML 和 CSS，降低詞匯表規模。

我們將使用這一版本為之前未見過的截圖生成標記。我們還深入研究它如何構建截圖和標記的先驗知識。

我們不在 bootstrap 標記上訓練，而是使用 17 個簡化 token，將其編譯成 HTML 和 CSS。數據集（github.com/tonybeltram…）包括 1500 個測試截圖和 250 個驗證截圖。平均每個截圖有 65 個 token，一共有 96925 個訓練樣本。

我們稍微修改一下 pix2code 論文中的模型，使之預測網絡組件的準確率達到 97%。

端到端方法

從預訓練模型中提取特征在圖像描述生成模型中效果很好。但是幾次實驗后，我發現 pix2code 的端到端方法效果更好。在我們的模型中，我們用輕量級卷積神經網絡替換預訓練圖像特征。我們不使用最大池化來增加信息密度，而是增加步幅。這可以保持前端元素的位置和顏色。

存在兩個核心模型：卷積神經網絡（CNN）和循環神經網絡（RNN）。最常用的循環神經網絡是長短期記憶（LSTM）網絡。我之前的文章中介紹過 CNN 教程，本文主要介紹 LSTM。

理解 LSTM 中的時間步

關于 LSTM 比較難理解的是時間步。我們的原始神經網絡有兩個時間步，如果你給它「Hello」，它就會預測「World」。但是它會試圖預測更多時間步。下例中，輸入有四個時間步，每個單詞對應一個時間步。

LSTM 適合時序數據的輸入，它是一種適合順序信息的神經網絡。模型展開圖示如下，對于每個循環步，你需要保持同樣的權重。

加權后的輸入與輸出特征在級聯后輸入到激活函數，并作為當前時間步的輸出。因為我們重復利用了相同的權重，它們將從一些輸入獲取信息并構建序列的知識。下面是 LSTM 在每一個時間步上的簡化版處理過程：

理解 LSTM 層級中的單元

每一層 LSTM 單元的總數決定了它記憶的能力，同樣也對應于每一個輸出特征的維度大小。LSTM 層級中的每一個單元將學習如何追蹤句法的不同方面。以下是一個 LSTM 單元追蹤標簽行信息的可視化，它是我們用來訓練 bootstrap 模型的簡單標記語言。

每一個 LSTM 單元會維持一個單元狀態，我們可以將單元狀態視為記憶。權重和激活值可使用不同的方式修正狀態值，這令 LSTM 層可以通過保留或遺忘輸入信息而得到精調。除了處理當前輸入信息與輸出信息，LSTM 單元還需要修正記憶狀態以傳遞到下一個時間步。

dir_name = 'resources/eval_light/'
# Read a file and return a string
def load_doc(filename):
 file = open(filename, 'r')
 text = file.read()
 file.close()
return text
def load_data(data_dir):
 text = []
 images = []
# Load all the files and order them
 all_filenames = listdir(data_dir)
 all_filenames.sort()
for filename in (all_filenames):
if filename[-3:] == "npz":
# Load the images already prepared in arrays
 image = np.load(data_dir+filename)
 images.append(image['features'])
else:
# Load the boostrap tokens and rap them in a start and end tag
 syntax = '<START> ' + load_doc(data_dir+filename) + ' <END>'
# Seperate all the words with a single space
 syntax = ' '.join(syntax.split())
# Add a space after each comma
 syntax = syntax.replace(',', ' ,')
 text.append(syntax)
 images = np.array(images, dtype=float)
return images, text
 train_features, texts = load_data(dir_name)
# Initialize the function to create the vocabulary 
 tokenizer = Tokenizer(filters='', split=" ", lower=False)
# Create the vocabulary 
 tokenizer.fit_on_texts([load_doc('bootstrap.vocab')])
# Add one spot for the empty word in the vocabulary 
 vocab_size = len(tokenizer.word_index) + 1
# Map the input sentences into the vocabulary indexes
 train_sequences = tokenizer.texts_to_sequences(texts)
# The longest set of boostrap tokens
 max_sequence = max(len(s) for s in train_sequences)
# Specify how many tokens to have in each input sentence
 max_length = 48
def preprocess_data(sequences, features):
 X, y, image_data = list(), list(), list()
for img_no, seq in enumerate(sequences):
for i in range(1, len(seq)):
# Add the sentence until the current count(i) and add the current count to the output
 in_seq, out_seq = seq[:i], seq[i]
# Pad all the input token sentences to max_sequence
 in_seq = pad_sequences([in_seq], maxlen=max_sequence)[0]
# Turn the output into one-hot encoding
 out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
# Add the corresponding image to the boostrap token file
 image_data.append(features[img_no])
# Cap the input sentence to 48 tokens and add it
 X.append(in_seq[-48:])
 y.append(out_seq)
return np.array(X), np.array(y), np.array(image_data)
 X, y, image_data = preprocess_data(train_sequences, train_features)
#Create the encoder
 image_model = Sequential()
 image_model.add(Conv2D(16, (3, 3), padding='valid', activation='relu', input_shape=(256, 256, 3,)))
 image_model.add(Conv2D(16, (3,3), activation='relu', padding='same', strides=2))
 image_model.add(Conv2D(32, (3,3), activation='relu', padding='same'))
 image_model.add(Conv2D(32, (3,3), activation='relu', padding='same', strides=2))
 image_model.add(Conv2D(64, (3,3), activation='relu', padding='same'))
 image_model.add(Conv2D(64, (3,3), activation='relu', padding='same', strides=2))
 image_model.add(Conv2D(128, (3,3), activation='relu', padding='same'))
 image_model.add(Flatten())
 image_model.add(Dense(1024, activation='relu'))
 image_model.add(Dropout(0.3))
 image_model.add(Dense(1024, activation='relu'))
 image_model.add(Dropout(0.3))
 image_model.add(RepeatVector(max_length))
 visual_input = Input(shape=(256, 256, 3,))
 encoded_image = image_model(visual_input)
 language_input = Input(shape=(max_length,))
 language_model = Embedding(vocab_size, 50, input_length=max_length, mask_zero=True)(language_input)
 language_model = LSTM(128, return_sequences=True)(language_model)
 language_model = LSTM(128, return_sequences=True)(language_model)
#Create the decoder
 decoder = concatenate([encoded_image, language_model])
 decoder = LSTM(512, return_sequences=True)(decoder)
 decoder = LSTM(512, return_sequences=False)(decoder)
 decoder = Dense(vocab_size, activation='softmax')(decoder)
# Compile the model
 model = Model(inputs=[visual_input, language_input], outputs=decoder)
 optimizer = RMSprop(lr=0.0001, clipvalue=1.0)
 model.compile(loss='categorical_crossentropy', optimizer=optimizer)
#Save the model for every 2nd epoch
 filepath="org-weights-epoch-{epoch:04d}--val_loss-{val_loss:.4f}--loss-{loss:.4f}.hdf5"
 checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=1, save_weights_only=True, period=2)
 callbacks_list = [checkpoint]
# Train the model
 model.fit([image_data, X], y, batch_size=64, shuffle=False, validation_split=0.1, callbacks=callbacks_list, verbose=1, epochs=50)
復制代碼

測試準確率

找到一種測量準確率的優秀方法非常棘手。比如一個詞一個詞地對比，如果你的預測中有一個詞不對照，準確率可能就是 0。如果你把百分百對照的單詞移除一個，最終的準確率可能是 99/100。

我使用的是 BLEU 分值，它在機器翻譯和圖像描述模型實踐上都是最好的。它把句子分解成 4 個 n-gram，從 1-4 個單詞的序列。在下面的預測中，「cat」應該是「code」。

為了得到最終的分值，每個的分值需要乘以 25%，(4/5) × 0.25 + (2/4) × 0.25 + (1/3) × 0.25 + (0/2) ×0.25 = 0.2 + 0.125 + 0.083 + 0 = 0.408。然后用總和乘以句子長度的懲罰函數。因為在我們的示例中，長度是正確的，所以它就直接是我們的最終得分。

你可以增加 n-gram 的數量，4 個 n-gram 的模型是最為對應人類翻譯的。我建議你閱讀下面的代碼：

#Create a function to read a file and return its content
def load_doc(filename):
 file = open(filename, 'r')
 text = file.read()
 file.close()
return text
def load_data(data_dir):
 text = []
 images = []
 files_in_folder = os.listdir(data_dir)
 files_in_folder.sort()
for filename in tqdm(files_in_folder):
#Add an image
if filename[-3:] == "npz":
 image = np.load(data_dir+filename)
 images.append(image['features'])
else:
# Add text and wrap it in a start and end tag
 syntax = '<START> ' + load_doc(data_dir+filename) + ' <END>'
#Seperate each word with a space
 syntax = ' '.join(syntax.split())
#Add a space between each comma
 syntax = syntax.replace(',', ' ,')
 text.append(syntax)
 images = np.array(images, dtype=float)
return images, text
#Intialize the function to create the vocabulary
 tokenizer = Tokenizer(filters='', split=" ", lower=False)
#Create the vocabulary in a specific order
 tokenizer.fit_on_texts([load_doc('bootstrap.vocab')])
 dir_name = '../../../../eval/'
 train_features, texts = load_data(dir_name)
#load model and weights 
 json_file = open('../../../../model.json', 'r')
 loaded_model_json = json_file.read()
 json_file.close()
 loaded_model = model_from_json(loaded_model_json)
# load weights into new model
 loaded_model.load_weights("../../../../weights.hdf5")
print("Loaded model from disk")
# map an integer to a word
def word_for_id(integer, tokenizer):
for word, index in tokenizer.word_index.items():
if index == integer:
return word
return None
print(word_for_id(17, tokenizer))
# generate a description for an image
def generate_desc(model, tokenizer, photo, max_length):
 photo = np.array([photo])
# seed the generation process
 in_text = '<START> '
# iterate over the whole length of the sequence
print('\nPrediction---->\n\n<START> ', end='')
for i in range(150):
# integer encode input sequence
 sequence = tokenizer.texts_to_sequences([in_text])[0]
# pad input
 sequence = pad_sequences([sequence], maxlen=max_length)
# predict next word
 yhat = loaded_model.predict([photo, sequence], verbose=0)
# convert probability to integer
 yhat = argmax(yhat)
# map integer to word
 word = word_for_id(yhat, tokenizer)
# stop if we cannot map the word
if word is None:
break
# append as input for generating the next word
 in_text += word + ' '
# stop if we predict the end of the sequence
print(word + ' ', end='')
if word == '<END>':
break
return in_text
 max_length = 48 
# evaluate the skill of the model
def evaluate_model(model, descriptions, photos, tokenizer, max_length):
 actual, predicted = list(), list()
# step over the whole set
for i in range(len(texts)):
 yhat = generate_desc(model, tokenizer, photos[i], max_length)
# store actual and predicted
print('\n\nReal---->\n\n' + texts[i])
 actual.append([texts[i].split()])
 predicted.append(yhat.split())
# calculate BLEU score
 bleu = corpus_bleu(actual, predicted)
return bleu, actual, predicted
 bleu, actual, predicted = evaluate_model(loaded_model, texts, train_features, tokenizer, max_length)
#Compile the tokens into HTML and css
 dsl_path = "compiler/assets/web-dsl-mapping.json"
 compiler = Compiler(dsl_path)
 compiled_website = compiler.compile(predicted[0], 'index.html')
print(compiled_website )
print(bleu)
復制代碼

輸出

樣本輸出的鏈接：

Generated website 1 - Original 1 (emilwallner.github.io/bootstrap/r…)
Generated website 2 - Original 2 (emilwallner.github.io/bootstrap/r…)
Generated website 3 - Original 3 (emilwallner.github.io/bootstrap/r…)
Generated website 4 - Original 4 (emilwallner.github.io/bootstrap/r…)
Generated website 5 - Original 5 (emilwallner.github.io/bootstrap/r…)

我走過的坑：

理解模型的弱點而不是測試隨機模型。首先我使用隨機的東西，比如批歸一化、雙向網絡，并嘗試實現注意力機制。在查看測試數據，并知道其無法高精度地預測顏色和位置之后，我意識到 CNN 存在一個弱點。這致使我使用增加的步幅來取代最大池化。驗證損失從 0.12 降至 0.02，BLEU 分值從 85% 增加至 97%。
如果它們相關，則只使用預訓練模型。在小數據的情況下，我認為一個預訓練圖像模型將會提升性能。從我的實驗來看，端到端模型訓練更慢，需要更多內存，但是精確度會提升 30%。
當你在遠程服務器上運行模型，我們需要為一些不同做好準備。在我的 mac 上，它按照字母表順序讀取文檔。但是在服務器上，它被隨機定位。這在代碼和截圖之間造成了不匹配。

下一步

前端開發是深度學習應用的理想空間。數據容易生成，并且當前深度學習算法可以映射絕大部分邏輯。一個最讓人激動的領域是注意力機制在 LSTM 上的應用。這不僅會提升精確度，還可以使我們可視化 CNN 在生成標記時所聚焦的地方。注意力同樣是標記、可定義模板、腳本和最終端之間通信的關鍵。注意力層要追蹤變量，使網絡可以在編程語言之間保持通信。

但是在不久的將來，最大的影響將會來自合成數據的可擴展方法。接著你可以一步步添加字體、顏色和動畫。目前為止，大多數進步發生在草圖（sketches）方面并將其轉化為模版應用。在不到兩年的時間里，我們將創建一個草圖，它會在一秒之內找到相應的前端。Airbnb 設計團隊與 Uizard 已經創建了兩個正在使用的原型。下面是一些可能的試驗過程：

實驗

開始

運行所有模型
嘗試不同的超參數
測試一個不同的 CNN 架構
添加雙向 LSTM 模型
用不同數據集實現模型

進一步實驗

使用相應的語法創建一個穩定的隨機應用/網頁生成器
從草圖到應用模型的數據。自動將應用/網頁截圖轉化為草圖，并使用 GAN 創建多樣性。
應用注意力層可視化每一預測的圖像聚焦，類似于這個模型
為模塊化方法創建一個框架。比如，有字體的編碼器模型，一個用于顏色，另一個用于排版，并使用一個解碼器整合它們。穩定的圖像特征是一個好的開始。
饋送簡單的 HTML 組件到神經網絡中，并使用 CSS 教其生成動畫。使用注意力方法并可視化兩個輸入源的聚焦將會很迷人。

原文鏈接：blog.floydhub.com/turning-des…

本文為機器之心編譯，轉載請聯系本公眾號獲得授權。

、JS生成隨機字符串

第一種：內置函數方法

Math.random(); //該方法產生一個0到1之間的浮點數。
延伸：
 Math.floor(Math.random()*10+1); //生成1-10的隨機數
 Math.floor(Math.random()*10);//生成0-9的隨機數
 函數方法：
 function rand ( n )
 {
 return ( Math.floor ( Math.random ( ) * n + 1 ) );
 }

// 調用方法
rand(100);生成1-100的隨機數。

第二種：奇妙的寫法（最近湊的解決方案）

優雅的獲取數字+字母組成的字符串
Math.random().toString(36).slice(-8)
或者：
(+new Date).toString(36).slice(-5);

Math.random(); // 結果為0-1間的一個隨機數(包括0,不包括1) 
.toString(36) // 將數字轉換為基數為36的字符串，即使用字符0-9和az。
.slice(-8);				 // 截取最后八位 : "yo82mvyr"

36進制包含的字符為 0-9，a-z
10進制包含的字符為為 0-9
16進制包含的字符為 0-9，a-f

缺點：
 只能生成有 0-9、a-z字符組成的字符串
 由于 Math.random()生成的18位小數，可能無法填充36位，最后幾個字符串，只能在指定的幾個字符中選擇。導致隨機性降低。
 某些情況下會返回空值。例如，當隨機數為 0, 0.5, 0.25, 0.125...時，返回為空值。
 在我的電腦上執行這種方式生成字符串。1000萬次中出現重復的概率為 0。

改進：
 (Math.random().toString(36)+'00000000000000000').slice(2, N+2)

其次，上面的原始和解決方案都將字符串大小N限制為16個字符。 以下將返回任意N的大小為N的字符串（但請注意，使用N> 16不會增加隨機性或降低碰撞概率）：
Array(N+1).join((Math.random().toString(36)+'00000000000000000').slice(2, 18)).slice(0, N)
說明：
 1、在[0,1）范圍內選取一個隨機數，即在0（包含）和1（不包括）之間。
 將數字轉換為基數為36的字符串，即使用字符0-9和az。
 2、用零填充（解決第一個問題）。
 3、切掉領先的'0'。 前綴和額外填充零。
 4、重復該字符串足夠多的時間以使其中至少有N個字符（通過使用較短的隨機字符串作為分隔符加入空字符串）。
 5、從字符串中切分N個字符。
更多的想法：
 1、這個解決方案不使用大寫字母，但幾乎在所有情況下（沒有雙關語意思），這并不重要。
 2、在Chrome中測量原始答案中N = 16時的最大字符串長度。 在Firefox中，它的N = 11。但正如所解釋的，第二種解決方案是支持任何請求的字符串長度，而不是添加隨機性，所以它沒有太大區別。
 3、所有返回的字符串返回的概率相等，至少就Math.random（）返回的結果均勻分布（在任何情況下，這不是密碼強度隨機性）。
 4、并非所有可能的大小為N的字符串都可能被返回。 在第二種解決方案中，這是顯而易見的（因為較小的字符串僅被復制），但在原始答案中也是如此，因為在轉換為基數-36時，最后幾位可能不是原始隨機位的一部分。 具體來說，如果您查看Math.random（）。toString（36）的結果，您會注意到最后一個字符不是均勻分布的。 同樣，在幾乎所有情況下，它都沒有關系，但是我們從最開始的字符串開始，而不是隨機字符串的結尾，以便短字符串（例如N = 1）不受影響。

第三種：自定義函數（生成指定位數的數字字母混合的字符串）

/**
 * randomWord 產生任意長度隨機字母數字組合
 * @param randomFlag 是否任意長度 min-任意長度最小位[固定位數] max-任意長度最大位
 * @param min
 * @param max
 * @returns {string}
 */
function randomWord(randomFlag, min, max){
 var str = "",
 range = min,
 // 字母，數字
 arr = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'];
 // 隨機產生
 if(randomFlag){
 range = Math.round(Math.random() * (max-min)) + min;
 }
 for(var i=0; i<range; i++){
 pos = Math.round(Math.random() * (arr.length-1));
 str += arr[pos];
 }
 return str;
}

調用方法
// 生成 3 - 32 位隨即字符串
randomWord(true,3,32); // 例如：olyOXUF5oDsuMmXl3Mi48
// 生成 32 位隨機字符串
randomWord(false,32); // 例如：fjpnWj29Bb8boiXbLeDF0nxkR4aYcLRl

第四種：不需要輸入字符集合

function randomstring(L){
 var s= '';
 var randomchar=function(){
 	var n= Math.floor(Math.random()*62);
 	if(n<10) return n; //1-10
 	if(n<36) return String.fromCharCode(n+55); //A-Z
 	return String.fromCharCode(n+61); //a-z
 }
 while(s.length< L) s+= randomchar();
 return s;
}

// 調用方法
randomstring(5);

第五種：支持自定義字符長度和特征字符集合

function randomString(len, charSet) {
 charSet = charSet || 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789';
 var randomString = '';
 for (var i = 0; i < len; i++) {
 	var randomPoz = Math.floor(Math.random() * charSet.length);
 	randomString += charSet.substring(randomPoz,randomPoz+1);
 }
 return randomString;
}

// 調用方法
var randomValue = randomString(5);
var randomValue = randomString(5, 'PICKCHARSFROMTHISSET');

第六種：最輕量級的方式（代碼輕，而不是CPU）

function rand(length, current) {
 current = current ? current : '';
 return length ? rand(--length, "0123456789ABCDEFGHIJKLMNOPQRSTUVWXTZabcdefghiklmnopqrstuvwxyz".charAt(Math.floor(Math.random() * 60)) + current) : current;
}
console.log(rand(5));

第七種：在循環中使用它，它可能不是5個字符，但保證每個ID是唯一的

在線咨詢

上一篇：第2天｜16天搞定前端，html頭部標簽，只是標簽？
下一篇：PhotoshopWeb：Web圖像的輸出設置，簡單且實用，進來學習吧

您的項目需求

*請認真填寫需求信息，我們會在24小時內與您取得聯系。

整合營銷服務商

HTML+CSS+JS隨機生成驗證碼

深度學習自動生成HTML代碼

、JS生成隨機字符串

您的項目需求