My 1st UNKNOWN Data Engineer (Mini)project

Swanand Sabhapatikar
3 min read · Nov 5, 2020

The title may read a little wacky, but that’s what came to mind. I am a Python scripting enthusiast and have written a couple of scripts for fun or out of interest, one of which I would like to share. I did not realize that this was a data engineering project until I searched and found out what a data engineer’s job is.

According to a post on Towards Data Science, data engineers specialize in three main data actions: designing, building and arranging “data pipelines”, where data pipelines are sequences of processing and analysis steps applied to data for a specific purpose.

One of my friends was playing a word puzzle game named Word Link in the office (in her free time, of course) and I was observing it. You are given some letters and have to create word(s) of a specific length using them. She was at level 100 or so and it was quite tough. After keenly observing the inputs and outputs of the game, I said: don't worry, I will make a script that will help you solve even the hardest level.

Let's go in sequence.

WordPuzzle interface

Brief of the game: the game gives us a few letters and we need to make word(s) of a specific letter count. E.g., the letters given are a, t, b and we are asked to make 3 words: one with a letter count of 2 and two with a letter count of 3.

The two-letter word is AT.

The three-letter words are BAT and TAB.

Creating Pipeline and ETL.

The E of ETL:

First I searched online for a dictionary that would have almost all words to date. The challenge was to find a dictionary that has only words and not meanings, as I needed only the words in a txt file. After three long days of searching, I found such a dataset on GitHub.

Database for the word puzzle game

The T of ETL:

Now the task was to make this data usable for my script, which would search for words based on the given inputs. Instead of searching the entire heavy txt file each time, I decided to distribute the data set across numerous smaller files to limit the search area and get quick results. As the main thing here was not alphabetical order but the letter count of each word, I decided to divide the huge txt file into separate txt files by letter count: all words with a letter count of 1 in one txt file, words with a letter count of 2 in a separate file, and so on. I drafted a Python script to do this and segregated the huge txt file. The script reads the entire file, takes one word at a time and puts it in the respective txt file as per the criterion. If the file does not exist, it is created.
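The splitting step described above could look roughly like the sketch below. This is not my original script, just a minimal illustration of the idea. The function name `split_by_length`, the one-word-per-line input format, the lowercasing, and the `<letter count>.txt` naming inside an output folder are all assumptions made for the example.

```python
from pathlib import Path


def split_by_length(source_path, out_dir):
    """Copy each word from source_path into out_dir/<letter count>.txt.

    Assumes the source file holds one word per line (hypothetical format).
    Returns the sorted list of letter counts that were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    handles = {}  # letter count -> open file handle
    try:
        with open(source_path, encoding="utf-8") as source:
            for line in source:
                word = line.strip().lower()
                if not word:
                    continue  # skip blank lines
                length = len(word)
                if length not in handles:
                    # Mode "w" creates the file if missing (and overwrites
                    # a file left over from an earlier run).
                    handles[length] = open(
                        out_dir / f"{length}.txt", "w", encoding="utf-8"
                    )
                handles[length].write(word + "\n")
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(handles)
```

Keeping one open handle per letter count means the big file is read only once, and each small file is created the first time a word of that length shows up.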

The L of ETL:

Finally, once the database was created, I wrote another script that would read a particular txt file and provide us with one or more matching results. The script takes two inputs:

1- the letters to form the word(s), and

2- the letter count of the word(s).

The script will only search in the txt file whose file name equals the letter count of the word. As in the example above, we have to find 2 words with a letter count of 3, so the script will only open 3.txt, thus reducing the search area from MBs to just KBs.
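A lookup along these lines can be sketched as follows. Again, this is a hedged illustration and not the exact script I wrote. The function name `find_words` and the `data_dir` folder are hypothetical, and the sketch assumes each given letter may be used at most once per word, which is how I understood the game.

```python
from collections import Counter
from pathlib import Path


def find_words(letters, length, data_dir):
    """Return words of the given length that can be built from letters.

    Only data_dir/<length>.txt is opened, so the search area stays small.
    Assumption: each given letter can be used at most once per word.
    """
    available = Counter(letters.lower())
    path = Path(data_dir) / f"{length}.txt"
    if not path.exists():
        return []  # no words of this length in the database
    matches = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            word = line.strip()
            # Counter subtraction keeps only the letters the word needs
            # beyond what is available; an empty result means it fits.
            if word and not (Counter(word) - available):
                matches.append(word)
    return matches
```

For the example from the game, a call such as `find_words("atb", 3, "by_length")` would open only `by_length/3.txt` and return whichever of its words can be spelled from a, t and b.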

Swanand Sabhapatikar

SQL developer | Python automation enthusiast | Problem solving certified by HackerRank | keen interest in Data analysis and data engineering