Bootstrap — dropdown (drop-down list)

HTML5 data attributes

Fortunately, HTML5 introduced the ability to use custom attributes. You can use any lowercase name prefixed with data-, for example:

<div id="msglist" data-user="bob" data-list-size="5" data-maxage="180"></div>

Custom data attributes:

  • are strings: you can store any information that can be represented or encoded as a string, such as JSON. Type conversion has to be done in JavaScript
  • should be used when no suitable HTML5 element or attribute exists
  • apply only to the page. Unlike microformats, they should be ignored by external systems such as search engines and crawlers

JavaScript processing example #1: getAttribute and setAttribute

All browsers let you read and change data attributes using the getAttribute and setAttribute methods:

var msglist = document.getElementById("msglist");

var show = msglist.getAttribute("data-list-size");
msglist.setAttribute("data-list-size", +show+3);

This works, but it should only be used to maintain compatibility with older browsers.

JavaScript processing example #2: jQuery's data() method

Since jQuery 1.4.3 the data() method handles HTML5 data attributes. You do not need to specify the data- prefix explicitly, so code like this will work:

var msglist = $("#msglist");

var show = msglist.data("list-size");
msglist.data("list-size", show+3);

However, bear in mind that jQuery tries to convert such attribute values to suitable types (booleans, numbers, objects, arrays or null) and avoids touching the DOM. Unlike setAttribute, the data() method will not physically change the data-list-size attribute: if you check its value outside jQuery, it will still be 5.

JavaScript processing example #3: the dataset API

Finally, there is the HTML5 dataset API, which returns a DOMStringMap object. Keep in mind that data attributes are mapped into this object without their data- prefixes, hyphens are removed from the names, and the names themselves are converted to camelCase, for example:

Attribute name      Dataset API name
data-user           user
data-maxage         maxage
data-list-size      listSize

Our new code:

var msglist = document.getElementById("msglist");

var show = msglist.dataset.listSize;
msglist.dataset.listSize = +show+3;

This API is supported by all modern browsers, but not IE10 and below. For those browsers a shim exists, but it is probably more practical to use jQuery if you are writing for older browsers.

Sometimes

If you do not want to execute the same set of augmenters all the time, the Sometimes pipeline will pick only a subset of the augmenters on each call.
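
As a minimal sketch (not from the original article), a Sometimes pipeline in the current nlpaug API might combine two character augmenters like this; the choice of augmenters here is purely illustrative:

import nlpaug.augmenter.char as nac
import nlpaug.flow as naf

text = "The quick brown fox jumps over the lazy dog"

# Each call applies only a random subset of the listed augmenters.
aug = naf.Sometimes([
    nac.OcrAug(),                        # simulate OCR errors
    nac.RandomCharAug(action="insert"),  # inject random characters
])
print(aug.augment(text))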

Recommendation

The approaches above are designed to solve the problems their authors were facing in their own work. If you understand your data, you should tailor the augmentation approach to it. Remember that the golden rule of data science is garbage in, garbage out.

In general, you can try the thesaurus approach without fully understanding your data, but it may not give much of a boost because of the thesaurus limitations mentioned above.

About Me

I am a Data Scientist in the Bay Area, focusing on the state of the art in data science and artificial intelligence, especially NLP and platform-related topics. Feel free to connect with me on LinkedIn, or follow me on Medium or Github.

Extension Reading

  • Image augmentation library (imgaug)
  • Text augmentation library (nlpaug)
  • Data Augmentation in NLP
  • Data Augmentation for Audio
  • Data Augmentation for Spectrogram
  • Does your NLP model able to prevent an adversarial attacks?
  • Data Augmentation in NLP: Best Practices From a Kaggle Master

Reference

  • X. Zhang, J. Zhao and Y. LeCun. Character-level Convolutional Networks for Text Classification. 2015
  • W. Y. Wang and D. Yang. That’s So Annoying!!!: A Lexical and Frame-Semantic Embedding Based Data Augmentation Approach to Automatic Categorization of Annoying Behaviors using #petpeeve Tweets. 2015
  • S. Kobayashi. Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relation. 2018
  • C. Coulombe. Text Data Augmentation Made Simple By Leveraging NLP Cloud APIs. 2018

Attributes

Attribute   Value                      Description
value       machine-readable format   Sets the machine-readable version of the contents of the <data> tag.

The <data> tag also supports the Global Attributes and the Event Attributes.

How to style <data> tag?

Common properties to alter the visual weight/emphasis/size of text in <data> tag:

  • CSS font-style property sets the style of the font. normal | italic | oblique | initial | inherit.
  • CSS font-family property specifies a prioritized list of one or more font family names and/or generic family names for the selected element.
  • CSS font-size property sets the size of the font.
  • CSS font-weight property sets how thick or thin (bold or light) the characters in the text should be displayed.
  • CSS text-transform property controls text case and capitalization.
  • CSS text-decoration property specifies the decoration added to text, and is a shorthand property for text-decoration-line, text-decoration-color, text-decoration-style.

Coloring text in <data> tag:

  • CSS color property describes the color of the text content and text decorations.
  • CSS background-color property sets the background color of an element.

Text layout styles for <data> tag:

  • CSS text-indent property specifies the indentation of the first line in a text block.
  • CSS text-overflow property specifies how overflowed content that is not displayed should be signalled to the user.
  • CSS white-space property specifies how white-space inside an element is handled.
  • CSS word-break property specifies where the lines should be broken.

Other properties worth looking at for <data> tag:

  • CSS text-shadow property adds shadow to text.
  • CSS text-align-last property sets the alignment of the last line of the text.
  • CSS line-height property specifies the height of a line.
  • CSS letter-spacing property defines the spaces between letters/characters in a text.
  • CSS word-spacing property sets the spacing between words.

Manipulating margins

TRDG allows you to control margins around the text using two parameters: one sets the margins themselves, in pretty much the same way the CSS margin property does, and the other fits the output tightly around the rendered text.

This is the result with no fit and the default (5, 5, 5, 5) margins:

Now we can add the fit option to apply a tight crop around the rendered text. This changes the size by removing the space that was added for accents:

Margins are applied around the generated text, so even with zero margins, if you don't use the fit option you will get the appearance of margins:

Now if you add the fit option, you get absolutely no margins:

Margin values are comma separated, so you can, for example, keep only the vertical margins while cropping tightly in the horizontal direction:

And finally, with all margins:

Understanding CSS

To make pages look nice, programmers use CSS, or Cascading Style Sheets. We have already written about them in the article about the to-do list; now we will take a closer look at how they work and what you can do with them.

The main thing to remember about CSS is that it is a set of rules the browser uses to "paint" the page: what color the background is, what color the text is, what the headings look like, and so on. The rules live separately from the content: in one place in the document we say "headings should look like this", and in another we say "here is a heading, and this is what it says".

In large projects the CSS rules are often moved into a separate file so they don't clutter the main code. A site can have one file with all its styling rules, and if something needs to be recolored on every page of the site, it is enough to change the rule in that one place.

Since our project is small, we will define all the styles inside the page. That is easier to understand and spares us from working with two files.

All the style code on a page lives between the <style> and </style> tags. They tell the browser: these are the styling rules. First you write the name of an element, then the rules in curly braces. For example, the code below controls the appearance of the whole page, because it starts with the word body. It effectively says: "center the whole body of the page, use 10-pixel margins, and set the font to Verdana or Arial at 16 pixels":

body {
  text-align: center;
  margin: 10px;
  font-family: Verdana, Arial, sans-serif;
  font-size: 16px;
}

And this code applies only to text paragraphs, which are marked up on the page with the <p> tag. It says: "draw everything on the page that is a paragraph in a 14-pixel font".

p {
  font-size: 14px;
}

Parameters often require you to specify the size of something. CSS supports many units of measurement: pixels, percentages, sizes relative to the base font, or relative to the current width of the screen. Here are some examples:

margin-top: 15px; /* 15 pixels */
margin-top: 15em; /* 15 times the current font size */
margin-top: 15vw; /* 15% of the page width */

Sometimes styles are written not separately from the main page code but right inside the markup of a specific element. For that, the style attribute is used inside the tag. For example, like this:

<div style="height: 50%; width: 100%;">

This means that this particular <div> element gets half the height and the full width. Other elements on the page are not affected by this style.

Let's say right away that writing CSS inside individual elements is considered bad practice, because such code is hard to maintain later. So do your best to put your CSS either in a <style> block or in a separate file.

1) Normalization

One of the key steps in processing language data is to remove noise so that the machine can more easily detect the patterns in the data. Text data contains a lot of noise in the form of special characters such as hashtags, punctuation and numbers. All of these are difficult for computers to understand if they are present in the data. We therefore need to process the data to remove such elements.

Additionally, it is also important to apply some attention to the casing of words. If we include both upper case and lower case versions of the same words then the computer will see these as different entities, even though they may be the same.

The code below performs these steps. To keep track of the changes we are making to the text, I have put the cleaned text into a new column. The output is shown below the code.

import re

def clean_text(df, text_field, new_text_field_name):
    df[new_text_field_name] = df[text_field].str.lower()
    # remove mentions, special characters, URLs and retweet markers
    df[new_text_field_name] = df[new_text_field_name].apply(
        lambda elem: re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?", "", elem))
    # remove numbers
    df[new_text_field_name] = df[new_text_field_name].apply(
        lambda elem: re.sub(r"\d+", "", elem))
    return df

data_clean = clean_text(train_data, 'text', 'text_clean')
data_clean.head()


Question Answering

class torchtext.datasets.BABI20(path, text_field, only_supporting=False, **kwargs)
__init__(path, text_field, only_supporting=False, **kwargs)

Create a dataset from a list of Examples and Fields.

Parameters:
  • examples – List of Examples.
  • fields (list(tuple(str, Field))) – The Fields to use in this tuple. The
    string is a field name, and the Field is the associated field.
  • filter_pred (callable or None) – Use only examples for which
    filter_pred(example) is True, or use all examples if None.
    Default is None.
classmethod splits(text_field, path=None, root='.data', task=1, joint=False, tenK=False, only_supporting=False, train=None, validation=None, test=None, **kwargs)

Create Dataset objects for multiple splits of a dataset.

Parameters:
  • path (str) – Common prefix of the splits' file paths, or None to use
    the result of cls.download(root).
  • root (str) – Root dataset storage directory. Default is '.data'.
  • train (str) – Suffix to add to path for the train set, or None for no
    train set. Default is None.
  • validation (str) – Suffix to add to path for the validation set, or None
    for no validation set. Default is None.
  • test (str) – Suffix to add to path for the test set, or None for no test
    set. Default is None.
  • keyword arguments (Remaining) – Passed to the constructor of the
    Dataset (sub)class being used.
Returns:
    Datasets for train, validation, and test splits in that order, if provided.

Return type:
    Tuple[Dataset]
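
For illustration, a minimal usage sketch; it assumes the class documented above is torchtext's bAbI dataset (BABI20), which the task/joint/tenK parameters suggest but the extracted text does not confirm:

from torchtext import data, datasets

TEXT = data.Field(sequential=True, lower=True)
# Download (if necessary) and build the train/validation/test splits for bAbI task 1.
train, val, test = datasets.BABI20.splits(TEXT, task=1, tenK=False)
print(len(train), len(val), len(test))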

Language Modeling

Language modeling datasets are subclasses of the LanguageModelingDataset class.

class torchtext.datasets.LanguageModelingDataset(path, text_field, newline_eos=True, encoding='utf-8', **kwargs)

Defines a dataset for language modeling.

__init__(path, text_field, newline_eos=True, encoding='utf-8', **kwargs)

Create a LanguageModelingDataset given a path and a field.

Parameters:
  • path – Path to the data file.
  • text_field – The field that will be used for text data.
  • newline_eos – Whether to add an <eos> token for every newline in the
    data file. Default: True.
  • keyword arguments (Remaining) – Passed to the constructor of
    data.Dataset.
class torchtext.datasets.WikiText2(path, text_field, newline_eos=True, encoding='utf-8', **kwargs)
classmethod iters(batch_size=32, bptt_len=35, device=0, root='.data', vectors=None, **kwargs)

Create iterator objects for splits of the WikiText-2 dataset.

This is the simplest way to use the dataset, and assumes common
defaults for field, vocabulary, and iterator parameters.

Parameters:
  • batch_size – Batch size.
  • bptt_len – Length of sequences for backpropagation through time.
  • device – Device to create batches on. Use -1 for CPU and None for
    the currently active GPU device.
  • root – The root directory that the dataset’s zip archive will be
    expanded into; therefore the directory in whose wikitext-2
    subdirectory the data files will be stored.
  • wv_dir, wv_type, wv_dim – Passed to the Vocab constructor for the
    text field. The word vectors are accessible as
    train.dataset.fields['text'].vocab.vectors.
  • keyword arguments (Remaining) – Passed to the splits method.
classmethod splits(text_field, root='.data', train='wiki.train.tokens', validation='wiki.valid.tokens', test='wiki.test.tokens', **kwargs)

Create dataset objects for splits of the WikiText-2 dataset.

This is the most flexible way to use the dataset.

Parameters:
  • text_field – The field that will be used for text data.
  • root – The root directory that the dataset’s zip archive will be
    expanded into; therefore the directory in whose wikitext-2
    subdirectory the data files will be stored.
  • train – The filename of the train data. Default: ‘wiki.train.tokens’.
  • validation – The filename of the validation data, or None to not
    load the validation set. Default: ‘wiki.valid.tokens’.
  • test – The filename of the test data, or None to not load the test
    set. Default: ‘wiki.test.tokens’.
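
A minimal usage sketch, assuming the class documented here is torchtext's WikiText2 (as the default filenames suggest); iters() downloads the data, builds the vocabulary and returns ready-made BPTT iterators:

from torchtext import datasets

# Simplest path: common defaults for field, vocabulary and iterator parameters.
# device=-1 keeps the batches on the CPU, per the parameter description above.
train_iter, valid_iter, test_iter = datasets.WikiText2.iters(
    batch_size=32, bptt_len=35, device=-1)

for batch in train_iter:
    print(batch.text.shape, batch.target.shape)  # roughly (bptt_len, batch_size)
    break
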
class torchtext.datasets.WikiText103(path, text_field, newline_eos=True, encoding='utf-8', **kwargs)
classmethod iters(batch_size=32, bptt_len=35, device=0, root='.data', vectors=None, **kwargs)

Create iterator objects for splits of the WikiText-103 dataset.

This is the simplest way to use the dataset, and assumes common
defaults for field, vocabulary, and iterator parameters.

Parameters:
  • batch_size – Batch size.
  • bptt_len – Length of sequences for backpropagation through time.
  • device – Device to create batches on. Use -1 for CPU and None for
    the currently active GPU device.
  • root – The root directory that the dataset’s zip archive will be
expanded into; therefore the directory in whose wikitext-103
    subdirectory the data files will be stored.
  • wv_dir, wv_type, wv_dim – Passed to the Vocab constructor for the
    text field. The word vectors are accessible as
    train.dataset.fields['text'].vocab.vectors.
  • keyword arguments (Remaining) – Passed to the splits method.
classmethod splits(text_field, root='.data', train='wiki.train.tokens', validation='wiki.valid.tokens', test='wiki.test.tokens', **kwargs)

Create dataset objects for splits of the WikiText-103 dataset.

This is the most flexible way to use the dataset.

Parameters:
  • text_field – The field that will be used for text data.
  • root – The root directory that the dataset’s zip archive will be
    expanded into; therefore the directory in whose wikitext-103
    subdirectory the data files will be stored.
  • train – The filename of the train data. Default: ‘wiki.train.tokens’.
  • validation – The filename of the validation data, or None to not
    load the validation set. Default: ‘wiki.valid.tokens’.
  • test – The filename of the test data, or None to not load the test
    set. Default: ‘wiki.test.tokens’.

Contextualized Word Embeddings

Classic word embeddings use a static vector to represent a word regardless of context, which may not fit some scenarios: "Fox" can refer to the animal or to the broadcasting company. To overcome this problem, contextualized word embeddings were introduced; they take the surrounding words into account to generate a vector for each context.

nlpaug's contextual word embeddings augmenter is designed to provide this feature and can perform both insertion and substitution. Unlike the static word embeddings approach, insertion is predicted by a BERT language model rather than picking one word randomly, and substitution uses the surrounding words as features to predict the target word.

Example of insert augmentation

Original: The quick brown fox jumps over the lazy dog
Augmented Text: the lazy quick brown fox always jumps over the lazy dog

Example of substitute augmentation

Original: The quick brown fox jumps over the lazy dog
Augmented Text: the quick thinking fox jumps over the lazy dog
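
A minimal sketch of both actions with the current nlpaug API; the class name ContextualWordEmbsAug and the model choice are assumptions relative to the article, which predates some renames in the library:

import nlpaug.augmenter.word as naw

text = "The quick brown fox jumps over the lazy dog"

# Insert new words predicted by a BERT language model.
insert_aug = naw.ContextualWordEmbsAug(model_path='bert-base-uncased', action='insert')
print(insert_aug.augment(text))

# Substitute words, using the surrounding context to predict the replacement.
substitute_aug = naw.ContextualWordEmbsAug(model_path='bert-base-uncased', action='substitute')
print(substitute_aug.augment(text))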

Synonym

Besides the neural network approach, a thesaurus can achieve a similar objective. The limitation of the synonym approach is that some words may not have synonyms. WordNet, from the awesome NLTK library, helps to find the synonyms.

nlpaug's synonym augmenter provides a substitution feature to replace the target word. Rather than picking synonyms blindly, some preliminary checks make sure that the target word can be replaced. The rules are:

  • Do not pick a determiner (e.g. a, an, the)
  • Do not pick a word that does not have a synonym.

Example of augmentation

Original: The quick brown fox jumps over the lazy dog
Augmented Text: The quick brown fox parachute over the lazy blackguard
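
A minimal sketch with the current nlpaug API; SynonymAug with the WordNet source is assumed to correspond to the augmenter described above:

import nlpaug.augmenter.word as naw

text = "The quick brown fox jumps over the lazy dog"

# Replace words with WordNet synonyms, skipping determiners and
# words that have no synonym.
aug = naw.SynonymAug(aug_src='wordnet')
print(aug.augment(text))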

Sequence Tagging

Sequence tagging datasets are subclasses of the SequenceTaggingDataset class.

class torchtext.datasets.SequenceTaggingDataset(path, fields, separator='\t', **kwargs)

Defines a dataset for sequence tagging. Examples in this dataset
contain paired lists, i.e. a list of words paired with a list of tags.

For example, in the case of part-of-speech tagging, an example is of the
form [I, love, PyTorch, .] paired with [PRON, VERB, PROPN, PUNCT].

See torchtext/test/sequence_tagging.py on how to use this class.

__init__(path, fields, separator='\t', **kwargs)

Create a dataset from a list of Examples and Fields.

Parameters:
  • examples – List of Examples.
  • fields (list(tuple(str, Field))) – The Fields to use in this tuple. The
    string is a field name, and the Field is the associated field.
  • filter_pred (callable or None) – Use only examples for which
    filter_pred(example) is True, or use all examples if None.
    Default is None.
class torchtext.datasets.UDPOS(path, fields, separator='\t', **kwargs)
classmethod splits(fields, root='.data', train='en-ud-tag.v2.train.txt', validation='en-ud-tag.v2.dev.txt', test='en-ud-tag.v2.test.txt', **kwargs)

Downloads and loads the Universal Dependencies Version 2 POS Tagged
data.
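
A minimal usage sketch, assuming the class above is torchtext's UDPOS dataset (the default filenames point to Universal Dependencies v2); the field names 'word', 'udtag' and 'ptbtag' are chosen for illustration:

from torchtext import data, datasets

WORD = data.Field(init_token='<bos>', eos_token='<eos>')
UD_TAG = data.Field(init_token='<bos>', eos_token='<eos>')
PTB_TAG = data.Field(init_token='<bos>', eos_token='<eos>')

# The data files contain three tab-separated columns per token.
train, val, test = datasets.UDPOS.splits(
    fields=(('word', WORD), ('udtag', UD_TAG), ('ptbtag', PTB_TAG)))
WORD.build_vocab(train)
UD_TAG.build_vocab(train)
PTB_TAG.build_vocab(train)
print(vars(train[0]))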

5) Part of Speech (POS) tagging and chunking

Part of speech (POS) tagging is a method to categorise words which gives some information relating to the way in which that word is used in speech.

There are eight primary parts of speech (noun, pronoun, verb, adjective, adverb, preposition, conjunction and interjection), and each has a corresponding tag.

The NLTK library has a method to perform POS tagging. The code below performs POS tagging on the tweets in our data set and returns a new column.

import nltk
nltk.download('averaged_perceptron_tagger')

def word_pos_tagger(text):
    pos_tagged_text = nltk.pos_tag(text)
    return pos_tagged_text

# 'text_tokens' is the tokenised column created in the tokenisation step;
# the output column name 'text_pos' is illustrative.
data_clean['text_pos'] = data_clean['text_tokens'].apply(lambda x: word_pos_tagger(x))
data_clean.head()

Chunking builds on POS tagging in that it uses the information from the POS tags to extract meaningful phrases from text. In many types of texts, if we reduce everything down to individual words we may lose a lot of meaning. In our tweets, for example, we have a lot of location names and other phrases which are important to keep together.

If we take this sentence “forest fire near la ronge sask canada” the location name “la ronge” and the words “forest fire” will convey an important meaning that we might not want to lose.

The spaCy Python library has a method for this. If we apply it to the above sentence, we can see that it separates out the appropriate phrases.

import spacy

nlp = spacy.load('en')  # older model shortcut; newer spaCy versions use 'en_core_web_sm'
text = nlp("forest fire near la ronge sask canada")
for chunk in text.noun_chunks:
    print(chunk.text, chunk.label_, chunk.root.text)

3) Stemming

Stemming is the process of reducing words to their root form. For example, the words “rain”, “raining” and “rained” have very similar, and in many cases, the same meaning. The process of stemming will reduce these to the root form of “rain”. This is again a way to reduce noise and the dimensionality of the data.

The NLTK library also has methods to perform the task of stemming. The code below uses the PorterStemmer to stem the words in my example above. As you can see from the output all the words now become “rain”.

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# the example words from above
word_list = ['rain', 'raining', 'rained']
ps = PorterStemmer()
for w in word_list:
    print(ps.stem(w))

Before we can perform stemming on our data we need to tokenise the tweets. This is the process of splitting the text into its constituent parts, usually words. The code below uses NLTK to do this. I have put the output into a new column called “text_tokens”.

import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize, word_tokenize

data_clean['text_tokens'] = data_clean['text_clean'].apply(lambda x: word_tokenize(x))
data_clean.head()

The code below uses the PorterStemmer method from NLTK to apply stemming to the text_tokens and outputs the processed text to a new column.

def word_stemmer(text):
    stem_text = " ".join([PorterStemmer().stem(i) for i in text])
    return stem_text

# the output column name 'text_stemmed' is illustrative
data_clean['text_stemmed'] = data_clean['text_tokens'].apply(lambda x: word_stemmer(x))
data_clean.head()

Sentiment Analysis

class torchtext.datasets.SST(path, text_field, label_field, subtrees=False, fine_grained=False, **kwargs)
classmethod iters(batch_size=32, device=0, root='.data', vectors=None, **kwargs)

Create iterator objects for splits of the SST dataset.

Parameters:
  • batch_size – Batch size.
  • device – Device to create batches on. Use -1 for CPU and None for
    the currently active GPU device.
  • root – The root directory that the dataset’s zip archive will be
    expanded into; therefore the directory in whose trees
    subdirectory the data files will be stored.
  • vectors – one of the available pretrained vectors or a list with each
    element one of the available pretrained vectors (see Vocab.load_vectors)
  • keyword arguments (Remaining) – Passed to the splits method.
classmethod splits(text_field, label_field, root='.data', train='train.txt', validation='dev.txt', test='test.txt', train_subtrees=False, **kwargs)

Create dataset objects for splits of the SST dataset.

Parameters:
  • text_field – The field that will be used for the sentence.
  • label_field – The field that will be used for label data.
  • root – The root directory that the dataset’s zip archive will be
    expanded into; therefore the directory in whose trees
    subdirectory the data files will be stored.
  • train – The filename of the train data. Default: ‘train.txt’.
  • validation – The filename of the validation data, or None to not
    load the validation set. Default: ‘dev.txt’.
  • test – The filename of the test data, or None to not load the test
    set. Default: ‘test.txt’.
  • train_subtrees – Whether to use all subtrees in the training set.
    Default: False.
  • keyword arguments (Remaining) – Passed to the splits method of
    Dataset.
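
A minimal sketch of the two entry points above, assuming this is torchtext's SST dataset:

from torchtext import data, datasets

TEXT = data.Field(lower=True)
LABEL = data.Field(sequential=False)

# Flexible path: build the splits yourself and inspect an example.
train, val, test = datasets.SST.splits(TEXT, LABEL)
print(vars(train[0]))

# Simplest path: iters() builds fields, vocabulary and iterators with common defaults.
train_iter, val_iter, test_iter = datasets.SST.iters(batch_size=32)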

Word Embeddings (word2vec, GloVe, fasttext)

Classic embeddings use a static vector to represent a word. Ideally, the meanings of two words are similar if their vectors are near each other. In practice, this depends on the training data: for example, “rabbit” is similar to “fox” in word2vec, while “nbc” is similar to “fox” in GloVe.

Most similar words of “fox” among classical word embeddings models

Sometimes you want to replace a word with a similar word so that the NLP model does not rely on a single word. nlpaug's word2vec, GloVe and fasttext augmenters are designed to provide a “similar” word based on pre-trained vectors.

Besides substitution, insertion helps to inject noise into your data. It picks words from the vocabulary randomly.

Example of insert augmentation

Original: The quick brown fox jumps over the lazy dog
Augmented Text: The quick Bergen-Belsen brown fox jumps over Tiko the lazy dog

Example of substitute augmentation

Original: The quick brown fox jumps over the lazy dog
Augmented Text: The quick gray fox jumps over to lazy dog
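
A minimal sketch with the current nlpaug API; WordEmbsAug wraps word2vec, GloVe or fasttext vectors, and the model path below is a placeholder for a locally downloaded embedding file:

import nlpaug.augmenter.word as naw

text = "The quick brown fox jumps over the lazy dog"

# model_type can be 'word2vec', 'glove' or 'fasttext';
# model_path points to your own pre-trained vectors (placeholder path).
aug = naw.WordEmbsAug(model_type='word2vec',
                      model_path='./GoogleNews-vectors-negative300.bin',
                      action='substitute')
print(aug.augment(text))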

OCR

When working on an NLP problem, OCR output may be one of your inputs. For example, “0” may be recognized as “o” or “O”. If you are using bag-of-words or classic word embeddings as features, you will run into trouble, since out-of-vocabulary (OOV) tokens will be around you today and always. If you use state-of-the-art models such as BERT and GPT, the OOV issue seems resolved, as words are split into subwords; however, some information is still lost.

nlpaug's OCR augmenter is designed to simulate OCR errors. It replaces target characters according to a predefined mapping table.

Example of augmentation

Original: The quick brown fox jumps over the lazy dog
Augmented Text: The quick brown fox jumps over the lazy d0g
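
A minimal sketch with the current nlpaug API, assuming OcrAug is the augmenter described above:

import nlpaug.augmenter.char as nac

text = "The quick brown fox jumps over the lazy dog"

# Replace characters according to a predefined OCR confusion table, e.g. 'o' -> '0'.
aug = nac.OcrAug()
print(aug.augment(text))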

Iterators

Iterator

class torchtext.data.Iterator(dataset, batch_size, sort_key=None, device=None, batch_size_fn=None, train=True, repeat=False, shuffle=None, sort=None, sort_within_batch=None)

Defines an iterator that loads batches of data from a Dataset.

Variables:
  • dataset – The Dataset object to load Examples from.
  • batch_size – Batch size.
  • batch_size_fn – Function of three arguments (new example to add, current
    count of examples in the batch, and current effective batch size)
    that returns the new effective batch size resulting from adding
    that example to a batch. This is useful for dynamic batching, where
    this function would add to the current effective batch size the
    number of tokens in the new example.
  • sort_key – A key to use for sorting examples in order to batch together
    examples with similar lengths and minimize padding. The sort_key
    provided to the Iterator constructor overrides the sort_key
    attribute of the Dataset, or defers to it if None.
  • train – Whether the iterator represents a train set.
  • repeat – Whether to repeat the iterator for multiple epochs. Default: False.
  • shuffle – Whether to shuffle examples between epochs.
  • sort – Whether to sort examples according to self.sort_key.
    Note that shuffle and sort default to train and (not train).
  • sort_within_batch – Whether to sort (in descending order according to
    self.sort_key) within each batch. If None, defaults to self.sort.
    If self.sort is True and this is False, the batch is left in the
    original (ascending) sorted order.
  • device (str or torch.device) – A string or instance of torch.device
    specifying which device the Variables are going to be created on.
    If left as default, the tensors will be created on cpu. Default: None.
__init__(dataset, batch_size, sort_key=None, device=None, batch_size_fn=None, train=True, repeat=False, shuffle=None, sort=None, sort_within_batch=None)

Initialize self. See help(type(self)) for accurate signature.

data()

Return the examples in the dataset in order, sorted, or shuffled.

init_epoch()

Set up the batch generator for a new epoch.

classmethod splits(datasets, batch_sizes=None, **kwargs)

Create Iterator objects for multiple splits of a dataset.

Parameters:
  • datasets – Tuple of Dataset objects corresponding to the splits. The
    first such object should be the train set.
  • batch_sizes – Tuple of batch sizes to use for the different splits,
    or None to use the same batch_size for all splits.
  • keyword arguments (Remaining) – Passed to the constructor of the
    iterator class being used.

BucketIterator

class torchtext.data.BucketIterator(dataset, batch_size, sort_key=None, device=None, batch_size_fn=None, train=True, repeat=False, shuffle=None, sort=None, sort_within_batch=None)

Defines an iterator that batches examples of similar lengths together.

Minimizes amount of padding needed while producing freshly shuffled
batches for each new epoch. See pool for the bucketing procedure used.
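
As an illustration (not taken from the docs), a minimal sketch of building bucketing iterators over the SST splits documented earlier; the field setup is only there to make the example self-contained:

from torchtext import data, datasets

TEXT = data.Field(lower=True)
LABEL = data.Field(sequential=False)
train, val, test = datasets.SST.splits(TEXT, LABEL)
TEXT.build_vocab(train)
LABEL.build_vocab(train)

# Bucket examples of similar length together to minimize padding.
train_iter, val_iter, test_iter = data.BucketIterator.splits(
    (train, val, test),
    batch_sizes=(32, 256, 256),
    sort_key=lambda ex: len(ex.text),
    sort_within_batch=True)

batch = next(iter(train_iter))
print(batch.text.shape, batch.label.shape)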

Functions

pool

pool(data, batch_size, key, batch_size_fn=<function <lambda>>, random_shuffler=None, shuffle=False, sort_within_batch=False)

Sort within buckets, then batch, then shuffle batches.

Partitions data into chunks of size 100*batch_size, sorts examples within
each chunk using sort_key, then batches these examples and shuffles the
batches.

interleave_keys

interleave_keys(a, b)

Interleave bits from two sort keys to form a joint sort key.

Examples that are similar in both of the provided keys will have similar
values for the key defined by this function. Useful for tasks with two
text fields like machine translation or natural language inference.
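
For example, a hypothetical sort_key for a dataset whose examples carry src and trg attributes (attribute names chosen for illustration) could be built as follows:

from torchtext.data import interleave_keys

def sort_key(ex):
    # Sort jointly by source and target length so both are similar within a bucket.
    return interleave_keys(len(ex.src), len(ex.trg))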

Random Character

Different research suggests that noise injection may sometimes help your NLP model generalize. We can add some noise to words, for example by adding or deleting one character of a word.

nlpaug's random character augmenter is designed to inject noise into your data. Unlike the OCR and keyboard augmenters, it supports insertion, substitution and deletion.

Example of insert augmentation

Original: The quick brown fox jumps over the lazy dog
Augmented Text: T(he quicdk browTn Ffox jumpvs 7over kthe clazy 9dog
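
A minimal sketch with the current nlpaug API, assuming RandomCharAug is the augmenter described above:

import nlpaug.augmenter.char as nac

text = "The quick brown fox jumps over the lazy dog"

for action in ("insert", "substitute", "delete"):
    aug = nac.RandomCharAug(action=action)
    print(action, "->", aug.augment(text))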

Word

Besides character augmentation, the word level is important as well. We make use of word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), fasttext (Joulin et al., 2016), BERT (Devlin et al., 2018) and WordNet to insert and substitute similar words. The word2vec, GloVe and fasttext augmenters use word embeddings to find the most similar group of words to replace the original word. The BERT augmenter, on the other hand, uses a language model to predict possible target words, while the WordNet augmenter looks up a similar group of words in a thesaurus.

A more advanced use case

Text in the real world is not always black, and most importantly, text in the real
world is almost never straight. What if we want to emulate that?

In plain terms: generate 10 examples with a skewing angle between -15 and 15 and an added Gaussian blur between 0 and 0.1. Finally, the text color should be picked randomly between black and gray (including all the colors in between).

Sure enough, the output is much more colourful!

The default resolution might be too small for your taste (and I agree). By default the output is 32 pixels high, because that is the height used by most text recognition papers. You can change that with the output height option.
