Elang Documentation & Quick Start
===================================

.. toctree::
   :maxdepth: 2
   :caption: Contents:

The 5-min Guide to Word Embeddings
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If you have a ``Word2Vec`` model and would like to generate a 2-dimensional word embedding visualization, this can be done through the ``plot2d`` function:

.. code-block:: python
   :emphasize-lines: 4

   from elang.plot.utils import plot2d
   from gensim.models import Word2Vec  
   model = Word2Vec.load("path.to.model")
   plot2d(model)

The default method for dimensionality reduction (to obtain exactly two dimensions) is t-SNE. The method and the other settings can be changed through optional parameters.

For example, you may not wish to plot every word in your Word2Vec model, but only a chosen list of words. This can be done using the ``words`` parameter.
You may also wish to draw attention to a small subset of words within the plot, which can be done using the ``targets`` parameter.

We will also set ``method`` to "PCA" instead of the default t-SNE, resulting in the following function call:

.. code-block:: python

   from elang.plot.utils import plot2d
   list_of_words_to_appear = ["bca", "mandiri", "uob", "algoritma", "airbnb", ..., "emiten"]
   plot2d(model, 
      # method for dimensionality reduction
      method="PCA",
      # only show following words in the final plot  
      words=list_of_words_to_appear,
      # target words are given special emphasis in the final plot
      targets=['uob', 'mandiri','bca']
   )

.. image:: assets/pca.png
   :width: 300
   :alt: Word Embeddings using Elang
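
The projection step can be illustrated without a trained model. The sketch below is not Elang's implementation: it assumes NumPy is installed, and both the helper ``project_to_2d`` and the 4-dimensional toy vectors are hypothetical. It shows how PCA can map each word vector to a pair of plot coordinates.

.. code-block:: python

   import numpy as np

   def project_to_2d(vectors):
       """Project row vectors onto their first two principal components."""
       X = np.asarray(vectors, dtype=float)
       # Centre the data so the components describe variance, not the mean.
       X_centered = X - X.mean(axis=0)
       # Rows of Vt are the principal directions, ordered by decreasing variance.
       _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
       return X_centered @ Vt[:2].T

   # Toy vectors for illustration only; a real model would supply these.
   toy_vectors = {
       "bca": [0.9, 0.1, 0.0, 0.2],
       "mandiri": [0.8, 0.2, 0.1, 0.1],
       "uob": [0.85, 0.15, 0.05, 0.15],
       "hutan": [0.1, 0.9, 0.7, 0.0],
       "pisang": [0.0, 0.8, 0.9, 0.1],
   }
   coords = project_to_2d(list(toy_vectors.values()))
   for word, (x, y) in zip(toy_vectors, coords):
       print(f"{word}: ({x:.3f}, {y:.3f})")

Each row of ``coords`` is the 2-dimensional position of one word. Unlike PCA, t-SNE is non-linear and stochastic, so its layout can differ between runs.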

Elang also includes visualization methods that plot a user-defined number ``k`` of nearest neighbors for each word.

When ``draggable`` is set to ``True`` (default ``False``), you will obtain a legend that you can move around in the resulting plot.

.. code-block:: python
   
   from elang.plot.utils import plotNeighbours
   from gensim.models import Word2Vec
   
   model = Word2Vec.load("path.to.model")
   words = ['bca', 'hitam', 'hutan', 'pisang', 'mobil', "cinta", "pejabat", "android", "kompas"]
   plotNeighbours(model, 
      words, 
      method="TSNE", 
      k=15,
      draggable=True)

The code plots the 15 nearest neighbors of each word in the supplied ``words`` argument, then renders the plot with a draggable legend.
As with ``plot2d``, t-SNE is the default method for dimensionality reduction. This can be overridden via the ``method`` parameter.

.. image:: assets/neighbors.png
   :width: 300
   :alt: Visualizing Word Neighbours using Elang
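
To make "nearest neighbors" concrete, the sketch below ranks words by cosine similarity, the measure gensim's ``most_similar`` uses. Whether ``plotNeighbours`` calls ``most_similar`` internally is an assumption, and the helper names and toy vectors here are hypothetical.

.. code-block:: python

   import math

   def cosine_similarity(a, b):
       """Cosine of the angle between two equal-length vectors."""
       dot = sum(x * y for x, y in zip(a, b))
       norm_a = math.sqrt(sum(x * x for x in a))
       norm_b = math.sqrt(sum(y * y for y in b))
       return dot / (norm_a * norm_b)

   def nearest_neighbours(word, vectors, k):
       """Return the k words most similar to `word`, best first."""
       query = vectors[word]
       scored = [
           (other, cosine_similarity(query, vec))
           for other, vec in vectors.items()
           if other != word  # a word is not its own neighbour
       ]
       scored.sort(key=lambda pair: pair[1], reverse=True)
       return [other for other, _ in scored[:k]]

   # Toy vectors for illustration only.
   toy_vectors = {
       "bca": [0.9, 0.1, 0.0],
       "mandiri": [0.8, 0.2, 0.1],
       "uob": [0.85, 0.15, 0.05],
       "hutan": [0.1, 0.9, 0.7],
       "pisang": [0.0, 0.8, 0.9],
   }
   print(nearest_neighbours("bca", toy_vectors, k=2))  # ['uob', 'mandiri']

With a larger ``k``, less similar words such as "hutan" begin to appear in the neighbor list.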


The 5-min Guide to NLP Preprocessing
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Elang comes with a number of pre-processing functions to make cleaning data in Bahasa Indonesia a little easier. 

Each function in the ``remove_*`` group parses a string and eliminates any occurrence of words found in a predetermined list (a negative list).

.. code-block:: python

   from elang.word2vec.utils import *
   x = "Oh ya, saya sudah pernah ke Hutan Ingatan Pasar Seni, Bandung, Senin 25 Maret kemarin. Tempat ini bagus anjir."
   x = remove_stopwords_id(x)
   # x: "Saya pernah ke Hutan Ingatan Pasar Seni, Bandung, Senin 25 Maret kemarin. Tempat bagus anjir."

   x = remove_region_id(x)
   # x: "Saya pernah ke Hutan Ingatan Pasar Seni, Senin 25 Maret kemarin. Tempat bagus anjir."

   x = remove_calendar_id(x)
   # x: "Saya pernah ke Hutan Ingatan Pasar Seni kemarin. Tempat bagus anjir."

   x = remove_vulgarity_id(x)
   # x: "Saya pernah ke Hutan Ingatan Pasar Seni kemarin. Tempat bagus."
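
The negative-list idea can be sketched in a few lines of standard-library Python. This is an illustration of the technique rather than Elang's actual code: ``remove_listed_words`` and ``EXAMPLE_STOPWORDS`` are hypothetical names, and the word list is a tiny stand-in for the lists Elang uses.

.. code-block:: python

   import re

   # Hypothetical negative list; Elang's real lists are not reproduced here.
   EXAMPLE_STOPWORDS = {"oh", "ya", "sudah", "ini"}

   def remove_listed_words(text, negative_list):
       """Drop every whole word found in `negative_list` (case-insensitive)."""
       pattern = re.compile(
           r"\b(?:" + "|".join(map(re.escape, sorted(negative_list))) + r")\b",
           flags=re.IGNORECASE,
       )
       without_words = pattern.sub("", text)
       # Collapse the gaps left behind and tidy spaces before punctuation.
       collapsed = re.sub(r"\s+", " ", without_words)
       return re.sub(r"\s+([,.])", r"\1", collapsed).strip()

   cleaned = remove_listed_words("Tempat ini sudah bagus.", EXAMPLE_STOPWORDS)
   print(cleaned)  # Tempat bagus.

Matching on word boundaries (``\b``) avoids deleting a listed word when it only appears inside a longer word.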

|

FAQs
^^^^^

1. Can I use the library to visualize word embeddings trained on an English corpus (instead of Indonesian)?
------------------------------------------------------------------------------------------------------------------------
**Answer**: 

Yes. The functions make no inherent assumptions about the language of the model. ``plot2d`` and ``plotNeighbours`` take a Word2Vec model and a supplied list of words and generate your plot.

In practice, your model may be trained on a mixed set of languages. This does not matter as long as the underlying representation of each word vector remains consistent.
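
With a mixed-language model it can help to check which words the model actually knows before plotting them. The helper below is a hypothetical sketch, not part of Elang, and how the plotting functions treat out-of-vocabulary words is not covered here. It works with any container that supports ``in``; with gensim, ``model.wv`` is assumed to support the same membership test.

.. code-block:: python

   def split_by_vocabulary(words, vocabulary):
       """Separate `words` into those present in `vocabulary` and those missing."""
       known = [w for w in words if w in vocabulary]
       missing = [w for w in words if w not in vocabulary]
       return known, missing

   # Stand-in for a trained model's vocabulary (mixed English and Indonesian).
   toy_vocabulary = {"bank", "forest", "hutan", "pisang"}
   known, missing = split_by_vocabulary(["bank", "hutan", "banana"], toy_vocabulary)
   print(known)    # ['bank', 'hutan']
   print(missing)  # ['banana']

Only the ``known`` list would then be passed on as the ``words`` argument.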


Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`