Dealing with files is an important part of almost any project's workflow, so in this blog I will walk through the core concepts of files and how to work with them.
Working with files
For reading, writing, or appending to a file we first need a file object.
Consider the file below:
So to read this file we first need a pointer to the start of the file. When we open a file, all of its contents seem to appear on the screen at once, but behind the scenes the process is quite systematic.
I will show this using file handling in Python.
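A minimal sketch of reading a file (the file name sample.txt is illustrative):

```python
# Create a small sample file so the example is self-contained
with open("sample.txt", "w", encoding="utf-8") as out:
    out.write("first line\nsecond line\n")

# Open the file for reading; "f" is the file object, our pointer into the file
f = open("sample.txt", "r", encoding="utf-8")
data = f.read()  # the whole contents come back as one string
print(data)
f.close()
```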
Whenever we read a file we can specify the encoding. Even if we don't specify one, a platform-dependent default is used.
So what is the need for encoding?
Like I said, the data in a file is nothing but bytes in memory. To read it as characters we need an encoding: a technique that converts bytes to characters for the user, and characters back to bytes for storage.
A character encoding provides a key to unlock the code. It is a set of mappings between the bytes in the computer and the characters in the character set. Without the key, the data looks like garbage.
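We can see this key at work directly with Python's encode and decode methods:

```python
# Text to bytes: encode with a chosen character encoding
data = "héllo".encode("utf-8")
print(data)  # b'h\xc3\xa9llo' : bytes, not characters

# Bytes to text: decode with the *same* encoding to get the characters back
print(data.decode("utf-8"))  # héllo

# The wrong key turns the data into garbage
print(data.decode("latin-1"))  # hÃ©llo
```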
So in the above code “f” is the pointer to the buffer of the file.
What is this buffer?
When we read a file, the OS usually doesn't load the whole file at once. Files are accessed in a byte-addressable manner, and loading everything at once would create unnecessary overhead for the OS. So the OS creates a buffer, loads it in memory, and hands the file pointer a reference to this buffer.
We usually don't need to care about the buffer, but knowing the term comes in handy when reading the Python documentation, where it is used very often.
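We can actually see the buffer from Python: the text file object returned by open() wraps a buffered reader (file name illustrative):

```python
import io

# Create a file to open (name is illustrative)
with open("sample.txt", "w", encoding="utf-8") as out:
    out.write("hello\n")

f = open("sample.txt", encoding="utf-8")
print(type(f))         # <class '_io.TextIOWrapper'>
print(type(f.buffer))  # <class '_io.BufferedReader'>, the buffer the text layer reads from
f.close()
```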
The built-in function open() I used earlier is defined in the io module. This module is helpful for stream handling, or, in simpler words, for file handling.
The example I mentioned is just the tip of the iceberg. I will now explain the operations in the io module in more detail.
There are three main types of I/O: text I/O, binary I/O and raw I/O.
Different types of streams have different capabilities: read-only, read-write, random access, or sequential access. These are all properties of streams, or file-like objects.
Whenever the data is text, i.e. Unicode characters, we use text I/O. It expects a string as input and outputs a string. The example I began earlier used text I/O.
Not all data is text; some is binary, such as images. Binary I/O works with bytes and performs no encoding, decoding, or newline translation.
Raw I/O is generally used as a low-level building-block for binary and text streams. It is rarely useful to directly manipulate a raw stream from user code.
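A small sketch of all three types, using a throwaway file (name illustrative):

```python
# Write some bytes to work with
with open("demo.bin", "wb") as out:
    out.write(b"hello\n")

# Text I/O: reads and returns str, applying an encoding
with open("demo.bin", "r", encoding="utf-8") as f:
    text = f.read()
print(type(text))  # <class 'str'>

# Binary I/O: reads and returns bytes, no decoding or newline translation
with open("demo.bin", "rb") as f:
    blob = f.read()
print(type(blob))  # <class 'bytes'>

# Raw I/O: buffering disabled, a low-level FileIO object
with open("demo.bin", "rb", buffering=0) as f:
    print(type(f))  # <class '_io.FileIO'>
```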
A stream is simply a flow of bytes. A stream is created when its program runs, and it is attached or connected to a source or destination in memory.
When we run a program there are three streams created:
- stdin (for input)
- stdout (for output)
- stderr (for errors)
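In Python these three streams are exposed through the sys module:

```python
import sys

# The three standard streams are already open when a program starts
n = sys.stdout.write("goes to standard output\n")
sys.stderr.write("goes to standard error\n")
print(sys.stdin)  # e.g. <_io.TextIOWrapper name='<stdin>' ...>
```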
Dealing with files consumes resources. I hinted at this earlier when talking about buffers: to manage resources, the OS creates a buffer rather than opening the complete file at once. Context managers have a different task altogether, but the aim of managing resources remains the same.
To know what they do we first must know the meaning of file descriptors.
When we open a file, the operating system creates an entry to represent that file and to store information about it. Each entry is identified by a unique number, known as a file descriptor.
A file descriptor should not be mistaken for a process ID. A program has one process ID, and that one process can create many file descriptors.
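Both numbers can be inspected from Python (file name illustrative):

```python
import os

f = open("fd_demo.txt", "w")
pid = os.getpid()   # one process ID for the whole program
fd = f.fileno()     # a small integer: the descriptor for this open file
print(pid, fd)
f.close()
```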
Managing these is very important because resources are limited, and leaving multiple gateways to a file open can be risky too.
So how to solve this?
To solve this we can use the with keyword when opening files.
This pattern is shown in many places, but the reason for it often remains unclear.
One great thing about the with keyword is that the file object "f" is closed and its resources released as soon as we leave the block of code, even if an exception occurs. This means we don't have to worry about closing the file pointer ourselves.
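A sketch of the pattern (file name illustrative):

```python
with open("with_demo.txt", "w", encoding="utf-8") as f:
    f.write("managed by the with block\n")

# Once the block ends, the file object is closed automatically
print(f.closed)  # True
```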
So is “with” a context manager?
A context manager is a protocol, and the with keyword is just one way to use it.
Basically, all we need to do is add __enter__ and __exit__ methods to an object to make it work as a context manager. Python calls these two methods at the appropriate times in the resource management cycle.
A context manager whose __enter__ method returns the object itself is a self-return context manager, and it can be used with the with keyword just like open().
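A minimal self-return context manager might look like this (the class name ManagedFile is hypothetical, not a standard API):

```python
class ManagedFile:
    """Hypothetical context manager for writing a file."""

    def __init__(self, name):
        self.name = name

    def __enter__(self):
        # Acquire the resource; returning self binds the manager to the "as" target
        self.file = open(self.name, "w", encoding="utf-8")
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Release the resource, even if the block raised an exception
        self.file.close()

with ManagedFile("cm_demo.txt") as m:
    m.file.write("hello from a custom context manager\n")

print(m.file.closed)  # True
```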
We can traverse a file only once using its file pointer; after that the pointer (think of it as a cursor) sits at the end of the file. To read again, we can either reinitialize the file pointer or make a new one.
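Reinitializing the pointer is done with seek(0), which rewinds the cursor to the start; a sketch (file name illustrative):

```python
with open("seek_demo.txt", "w", encoding="utf-8") as out:
    out.write("abc")

f = open("seek_demo.txt", encoding="utf-8")
first = f.read()   # "abc"; the cursor is now at the end of the file
second = f.read()  # "" ; nothing left to read
f.seek(0)          # rewind the cursor to the start instead of reopening
third = f.read()   # "abc" again
f.close()
print(first, repr(second), third)
```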
Dealing with databases
A very important aspect of managing data is databases. In most scenarios we have databases to deal with and tinker with.
These databases can be of two types:
- SQL databases
- NoSQL databases
SQL stands for Structured Query Language. With SQL we can write queries to retrieve or update data. SQL databases are relational database management systems.
There are various SQL database products such as MySQL, PostgreSQL, and Oracle.
NoSQL databases
NoSQL databases store data differently than SQL databases. Each type of NoSQL database is designed with a specific use case in mind, and there are technical reasons for how each kind is organized. The simplest type to describe is the document database, in which it is natural to combine both the basic information and the customer details in one JSON document. In this case, each SQL column attribute becomes a field, and the details of a customer's record are the data values associated with each field. MongoDB is an example of a NoSQL document database.
There are four main types of NoSQL databases, depending on how they store data. Document databases were described above; the other three are:
- Key-value databases are a simpler type of database where each item contains keys and values. A value can typically only be retrieved by referencing its key, so learning how to query for a specific key-value pair is typically simple.
- Wide-column stores store data in tables, rows, and dynamic columns. Wide-column stores provide a lot of flexibility over relational databases because each row is not required to have the same columns.
- Graph databases store data in nodes and edges. Nodes typically store information about people, places, and things while edges store information about the relationships between the nodes. Graph databases excel in use cases where you need to traverse relationships to look for patterns such as social networks, fraud detection, and recommendation engines. Neo4j and JanusGraph are examples of graph databases.
Python has APIs and libraries for connecting to different types of databases. For PostgreSQL, which is a SQL database, we use psycopg2.
pip install psycopg2
This installs the adapter for PostgreSQL.
Once it is installed, we first make a connection with the database:
conn = psycopg2.connect("dbname=suppliers user=postgres password=postgres")
We specify the database name, the user who will access it, and the password. All of these things can also be done manually, but here we use the API to get the result.
After making the connection we can create tables and manage them using ordinary SQL.
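psycopg2 follows Python's DB-API 2.0 pattern: connection, cursor, execute, fetch. The sketch below uses the standard library's sqlite3, which follows the same pattern, so it runs without a PostgreSQL server; with psycopg2 the placeholders would be %s instead of ?. Table and column names are made up:

```python
import sqlite3

# sqlite3 stands in for psycopg2 here; the DB-API 2.0 flow is the same
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create a table and insert a row (names are illustrative)
cur.execute("CREATE TABLE suppliers (id INTEGER, name TEXT)")
cur.execute("INSERT INTO suppliers VALUES (?, ?)", (1, "Acme"))
conn.commit()

# Query it back
cur.execute("SELECT name FROM suppliers WHERE id = ?", (1,))
row = cur.fetchone()
print(row)  # ('Acme',)
conn.close()
```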